Identification and epidemiological characterization of Type-2 diabetes sub-population using an unsupervised machine learning approach

Bej, Saptarshi; Sarkar, Jit; Biswas, Saikat; Mitra, Pabitra; Chakrabarti, Partha; Wolkenhauer, Olaf

doi:10.1038/s41387-022-00206-2

Article
Open access
Published: 27 May 2022

Nutrition & Diabetes volume 12, Article number: 27 (2022)

Abstract

Background

Studies on Type-2 Diabetes Mellitus (T2DM) have revealed heterogeneous sub-populations in terms of underlying pathologies. However, the identification of sub-populations in epidemiological datasets remains unexplored. We here focus on the detection of T2DM clusters in epidemiological data, specifically analysing the National Family Health Survey-4 (NFHS-4) dataset from India containing a wide spectrum of features, including medical history, dietary and addiction habits, socio-economic and lifestyle patterns of 10,125 T2DM patients.

Methods

Epidemiological data pose challenges for analysis because of the diverse feature types they contain. Applying the state-of-the-art dimension reduction tool UMAP conventionally proved ineffective for the NFHS-4 dataset. We implemented a distributed clustering workflow combining different similarity measure settings of UMAP, for clustering continuous, ordinal and nominal features separately. We integrated the reduced dimensions from each feature-type-distributed clustering to obtain interpretable and unbiased clustering of the data.

Results

Our analysis reveals four significant clusters, two of which consist mainly of non-obese T2DM patients. These non-obese clusters have a lower mean age and consist mostly of rural residents. Surprisingly, in one of the obese clusters 90% of the T2DM patients practised a vegetarian diet, yet they did not show an increased intake of plant-based protein-rich foods.

Conclusions

From a methodological perspective, we show that for diverse data types, frequent in epidemiological datasets, feature-type-distributed clustering using UMAP is effective as opposed to the conventional use of the UMAP algorithm. The application of UMAP-based clustering workflow for this type of dataset is novel in itself. Our findings demonstrate the presence of heterogeneity among Indian T2DM patients with regard to socio-demography and dietary patterns. From our analysis, we conclude that the existence of significant non-obese T2DM sub-populations characterized by younger age groups and economic disadvantage raises the need for different screening criteria for T2DM among rural Indian residents.

Introduction

Type-2 diabetes mellitus (T2DM) is a multifactorial disease globally estimated to rise to 629 million cases by 2045 (see IDF Diabetes Atlas) [1, 2]. Though long regarded as a homogeneous disease, several recent studies have found T2DM to be a mix of heterogeneous disease subtypes [3,4,5]. These studies have reported a varied pathophysiology underlying T2DM and thereby suggest the possibility of a personalised treatment for T2DM.

Besides obesity, other factors such as age, sex, socio-economic status, place of residence (rural/urban), smoking habit, alcohol intake and food frequency are significantly associated with T2DM [6,7,8,9,10,11,12,13]. Several of these factors are modifiable and hence important in the management of T2DM [1]. However, the extent to which lifestyle-related factors are modified varies, leading to differing degrees of glycaemic control among T2DM patients [14]. Glycaemic control and response to anti-diabetics have also been shown to differ among T2DM sub-groups [15]. To explore whether any particular pattern of patient sub-populations exists within the entire T2DM population based on socio-demographic and lifestyle factors, we used an unsupervised clustering approach on the largest and most comprehensive epidemiological dataset in India, the National Family Health Survey-4 (NFHS-4) dataset. Clusters were subsequently characterised to identify unique socio-demographic and lifestyle patterns associated with these sub-populations.

Epidemiological datasets provide a comprehensive set of information regarding socio-demography, lifestyle, addiction and co-morbidities. Variables containing such information are called features in the language of Machine Learning. In the T2DM-NFHS-4 dataset, there are 36 such features, containing information on each diabetes patient. Moreover, in our dataset, the features can be categorised into three types:

1. Continuous features: These are features that can assume any numeric value from a continuous range. For example, the BMI of a patient is a continuous feature.
2. Ordinal features: These are features that assume values from a discrete range in which the values have a natural order. For example, suppose a feature ‘meat consumption by a patient’ assumes the values ‘daily’, ‘weekly’ or ‘monthly’. The range of this feature is discrete, since it can assume only one of the three values. The values are also ordered: daily meat consumption is the highest and monthly meat consumption is the lowest if we want to quantify meat consumption.
3. Nominal features: These are features that assume values from a discrete range in which the values have no natural order. For example, suppose a feature ‘religion of a patient’ assumes the values ‘Hindus’, ‘Muslims’ or ‘Christians’. The range of this feature is discrete, since it can assume only one of the three values, but there is no order among the possible values. Yet this feature is important because lifestyle patterns and diets vary largely among these religious groups.
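To make the three feature types concrete, the following minimal sketch shows one way they could be encoded numerically before clustering. The category orderings, the helper names `encode_ordinal` and `encode_nominal`, and the sample values are illustrative assumptions; the survey's actual codings may differ.

```python
# Hypothetical ordering for the 'meat consumption' example, lowest to highest.
MEAT_ORDER = ["monthly", "weekly", "daily"]

def encode_ordinal(values, order):
    """Map each ordered category to its rank (0 = lowest)."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

def encode_nominal(values):
    """Map each unordered category to an arbitrary integer label."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

bmi = [23.9, 26.7, 23.6]  # continuous: used as-is
meat = encode_ordinal(["daily", "monthly", "weekly"], MEAT_ORDER)
religion = encode_nominal(["Hindu", "Muslim", "Christian"])
```

The integer labels of a nominal feature carry no magnitude, so only equality between them is meaningful; the ranks of an ordinal feature can additionally be compared.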

Such diverse types of features in epidemiological data create challenges for the analysis. Conventional application of the state-of-the-art dimension reduction tool Uniform Manifold Approximation and Projection (UMAP) was found to be ineffective for the T2DM-NFHS-4 dataset. Continuous features, although fewer in number, had an overpowering effect on the distribution of clusters. To address this problem, we implemented a distributed clustering workflow, combining different similarity measure settings of UMAP, for clustering continuous, ordinal and nominal features separately. We integrated the reduced dimensions from each feature-type-distributed clustering to obtain interpretable and unbiased clustering of the data.

The workflow realised for the present study (Fig. 1) involves the investigation of underlying socio-demographic patterns within patient sub-populations using unsupervised learning. Dimension reduction approaches are often used to reduce higher-dimensional data to lower dimensions such that, in the lower-dimensional embedding, one can visualize underlying clusters that are not apparent in the higher dimensions [16]. Several such techniques have been developed over the last few decades. Until recently, the dimension reduction technique t-distributed Stochastic Neighbour Embedding (t-SNE) was the state-of-the-art algorithm in this field, with numerous applications in various fields [17,18,19]. t-SNE projects high-dimensional data to a lower dimension while maintaining the underlying local manifold structure, in the sense that points that are close in the latent high-dimensional manifold are placed close together in the lower-dimensional embedding [17].

**Fig. 1: Workflow describing the analysis of the T2DM-NFHS-4 Dataset.**

With a rigorous mathematical foundation, considerably high speed and an easy-to-use scikit-learn-compatible API, UMAP has become one of the most popular choices among data scientists [20,21,22]. As opposed to t-SNE, UMAP uses a graph-based manifold approximation mechanism, which contributes to the preservation of the global as well as local properties of the latent data manifold in a lower-dimensional representation of the data. UMAP builds a graph considering customized neighbourhoods for every data point. This graph is a representation of the higher-dimensional data manifold. Given some low-dimensional representation of the data, a similar process can be used to construct an equivalent topological representation. The end result is a patchwork of low-dimensional representations of neighbourhoods that groups similar data points on a local scale while better preserving long-range topological connections to more distantly related data points [20, 22]. We chose UMAP for our unsupervised clustering approach because it preserves long-range as well as short-range topological connections and because of its high computational efficiency. Moreover, UMAP allows a user to specify several similarity measures through the metric parameter. This has been critical in our workflow, since our data contain continuous and categorical features, and choosing suitable similarity measures for continuous and categorical features is crucial for a meaningful and informative clustering [23].

Methodology

Source and description of the T2DM-NFHS-4 dataset

Data preparation and pre-processing are the key aspects of approaching a problem from a Machine Learning perspective. In this section, we provide the details on the pre-processing approach adopted to generate the T2DM-NFHS-4 dataset.

The NFHS-4 dataset was downloaded from The Demographic & Health Surveys (DHS) Program website. NFHS-4 is the fourth round of the national health survey conducted under the supervision of the Ministry of Health and Family Welfare, Government of India, with the International Institute for Population Sciences (IIPS), Mumbai serving as the main nodal agency for all the surveys. The sampling procedure followed in NFHS-4 was stratified two-stage sampling covering all the 640 districts of India. The survey was successfully conducted with 601,509 households. In those households, 112,122 men and 699,686 women were successfully interviewed. Four survey questionnaires (Household Questionnaire, Woman’s Questionnaire, Man’s Questionnaire and Biomarker Questionnaire) were implemented in 17 local languages to collect basic demographic information, socio-economic parameters, family planning issues, nutritional status, health indicators, contact with community health workers, etc. The uniqueness of the NFHS-4 study was that it collected data on diabetes status and performed a random blood glucose test for individuals (15–54 years) using a finger-stick blood specimen. As a result, biomarker measurements and tests such as anaemia testing, blood pressure measurement, blood glucose testing and HIV testing were included in the survey alongside anthropometric measurements.

Dataset preparation

For dataset preparation and cleaning, three questionnaires were merged: Woman’s Questionnaire, Man’s Questionnaire and Biomarker Questionnaire. The first two contained information about background characteristics (location, age, sex, religion, social group, literacy, wealth status, etc.), nutritional practices, addictions and co-morbidities, while the Biomarker Questionnaire contained information on height, weight, blood pressure and random blood glucose. A unique code was generated for all individuals in all three questionnaires by appending the country code and phase, cluster number, household number and line number. The three datasets were joined by the unique code to prepare a single dataset of 810,971 individuals consisting of all men and women between 15–54 years of age. Pregnant women were then excluded to rule out Gestational Diabetes Mellitus. Individuals with missing diabetic and blood pressure status were also excluded. Variables known to be risk factors for DM (body mass index (BMI), age, place of residence, wealth index, smoking frequency, alcohol intake frequency, hypertension), socio-economic factors (sex, religion, social group, educational status), dietary frequencies and haemoglobin level were selected for final analysis. BMI, age and haemoglobin level were taken as continuous variables and the rest as categorical variables. Outliers were removed separately for all three continuous variables to obtain the final dataset with 610,498 individuals (526,678 females and 83,820 males).
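The key-building and joining step described above can be sketched as follows. This is only an illustration: the field values, the placeholder country/phase code `"XX0"`, the `"-"` separator, the helper names and the use of an inner join are all assumptions, not details taken from the NFHS-4 files.

```python
def make_person_id(country_phase, cluster, household, line):
    """Build a unique key by appending the identifying fields."""
    return "-".join(str(part) for part in (country_phase, cluster, household, line))

def merge_on_id(interviews, biomarkers):
    """Inner-join two lists of record dicts on their 'id' key."""
    by_id = {row["id"]: row for row in biomarkers}
    return [{**row, **by_id[row["id"]]} for row in interviews if row["id"] in by_id]

# Hypothetical records from an interview questionnaire and the biomarker questionnaire.
interviews = [
    {"id": make_person_id("XX0", 101, 12, 2), "sex": "female", "age": 41},
    {"id": make_person_id("XX0", 101, 12, 3), "sex": "male", "age": 44},
]
biomarkers = [{"id": make_person_id("XX0", 101, 12, 2), "bmi": 26.7}]

merged = merge_on_id(interviews, biomarkers)  # only individuals present in both
```

With an inner join, individuals lacking a biomarker record drop out of the merged dataset; a different join type would keep them with missing values instead.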

Dataset pre-processing

We were interested in detecting significant T2DM sub-populations in the data and further sought to characterize these sub-populations based on socio-demographic and co-morbid conditions. For this purpose, we extracted patients with a known history of diabetes from the dataset: a total of 10,125 patients. We considered a diverse collection of socio-demographic and co-morbid conditions as ‘features’ in our dataset. Qualitatively, our features can be divided into several categories:

1. Co-morbid conditions: This class of features considers the co-morbid diseases among T2DM patients. We considered whether a T2DM patient had medical conditions such as asthma, thyroid disorder, heart disease, cancer, tuberculosis and hypertension. Thus, there were six features in this category. These features are binary in nature denoting whether a T2DM patient suffered from a given comorbidity or not.
2. Food habits: This class of features considered the food habits of T2DM patients. The features considered here were how frequently the patient took the food items: milk or curd, pulses or beans, dark leafy vegetables, fruits, eggs, fish, chicken, fried food and aerated drinks. Thus, there were nine features in this category. Features were categorical and ordinal in nature having four possible values: ‘daily’, ‘occasionally’, ‘weekly’ and ‘never’.
3. Addiction history: This class of features considered the addiction pattern of T2DM patients. There were two features in this class, both binary in nature encoding whether a patient is a smoker or whether a patient takes alcohol.
4. Socio-demographic features: These included features such as sex, age, wealth index, education level, religion and caste along with BMI and haemoglobin level of the patient. There were eight features in this category.
5. Living conditions: This class of features quantifies the living conditions of the patients. The features in this class considered whether a patient lives in a household possessing refrigerator, bicycle, motorbike, four-wheeler vehicle and livestock. Moreover, there were features denoting the type of residence, household structure, frequency of household members smoking inside the house, type of cooking fuel used, source of drinking water and time to reach the nearest drinking water source. Thus, there were eleven features belonging to this category.

For our study, 36 features or factors are considered to investigate significant patient populations among the diabetes patients under consideration. Note that there are both continuous and categorical features among these 36 features. Among the categorical features, there are both ordinal features and nominal features. Ordinal features have a sense of order among their values, such as the features from the ‘food habits’ category described before. The nominal features are categorical features with no sense of order, such as the sex of a patient. For our dataset, the continuous features are age, BMI, haemoglobin level and time to get to the drinking water source, whereas the nominal features are sex, religion, caste, household structure, type of place of residence, type of cooking fuel and source of drinking water. The rest of the features are ordinal. The categorization of features into continuous, nominal and ordinal is of utmost importance in our clustering paradigm, which we discuss in the section “Clustering paradigm using UMAP”.

Identification of T2DM sub-populations using UMAP and DBSCAN

As described above, our dataset has a variety of features, including continuous and categorical features, and the categorical features are in turn both ordinal and nominal. A simple UMAP on the entire dataset is depicted in Fig. 2a, revealing two broad clusters. For this embedding, the UMAP parameter n_neighbours was set to 30 and the metric parameter to Euclidean. However, we have a number of important nominal and ordinal categorical features whose effect would not be apparent from such a clustering. Moreover, the Euclidean distance does not always make sense on categorical features, especially if they are nominal in nature. For example, observe Fig. 2d, where we have used UMAP considering only the nominal features with the metric parameter hamming (based on the Hamming distance). This reveals a completely different picture of the dataset, showing several small clusters. Our clustering paradigm is designed to address this effect and find a balance in which no particular type of feature has an overpowering effect on the clustering process.

**Fig. 2: The low dimensional UMAP visualisations of data for several data types.**

Clustering paradigm using UMAP

Our clustering paradigm applies UMAP separately to the continuous, nominal and ordinal features. For each of these feature categories, we create a lower-dimensional embedding of the dataset. Finally, we integrate the lower-dimensional embeddings and extract clusters from them using the DBSCAN algorithm, a clustering algorithm that extracts clusters from data based on data density. One advantage of this algorithm is that one does not need to specify the number of clusters beforehand. DBSCAN treats densely located points as clusters [24]. For UMAP, we use the same parameter values, n_neighbours = 30 and min_distance = 0.1, for all the feature types.

For the continuous features, we use the Euclidean metric. The Euclidean distance between two vectors is given by:
$$d\left( x,y \right) = \sqrt{\sum\nolimits_{i = 1}^{n} \left( x_i - y_i \right)^2}$$
(1)
For the nominal features, we use the Hamming metric. The Hamming distance is defined as:
$$d\left( x,y \right) = \sum\nolimits_{i = 1}^{n} \delta \left( x_i,y_i \right)$$
(2)
where δ(x_i, y_i) = 1 if x_i ≠ y_i and δ(x_i, y_i) = 0 otherwise, so that the distance counts the positions at which the two vectors differ. Recall that nominal features are categorical features that do not have a sense of order associated with them. For such features, the Hamming distance is widely used as a similarity measure between data points [23].
For the ordinal features, we use the Canberra metric. It is a weighted version of the Manhattan measure. The Canberra distance is given by:
$$d\left( x,y \right) = \sum\nolimits_{i = 1}^{n} \frac{\left| x_i - y_i \right|}{\left| x_i \right| + \left| y_i \right|}$$
(3)

Ordinal features are also a type of categorical feature. However, the Hamming metric cannot capture the inherent ordered relationships and statistical information in categorical values [23]. We therefore tried UMAP with several metrics and noticed that the Canberra distance retains a high variance in the lower dimensions. Thus we chose the Canberra distance as the similarity metric for ordinal features.
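As a worked illustration of the three measures, the sketch below implements them in plain Python using their standard textbook definitions. The function names and the convention of skipping Canberra terms with a zero denominator are assumptions made for this example; library implementations may differ in detail (for instance, some divide the Hamming count by the vector length).

```python
import math

def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Number of positions at which the two vectors disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)

def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|); terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom:
            total += abs(a - b) / denom
    return total

print(euclidean([0, 0], [3, 4]))     # continuous example
print(hamming([1, 2, 3], [1, 0, 3])) # nominal codes: one mismatch
print(canberra([1, 0], [3, 0]))      # ordinal ranks
```

Because each Canberra term is normalised by the magnitudes of the values being compared, a one-step difference between low ranks weighs more than a one-step difference between high ranks, which is what makes it a weighted variant of the Manhattan distance.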

For the continuous and ordinal features, we thus produce a two-dimensional representation of each data point by taking the first two UMAP coordinates. For the nominal features, we produce a one-dimensional representation, since the data points are too scattered in this case, as shown in Fig. 2d, and would otherwise lead to too many clusters. Thus, we reduce every data point to a five-dimensional representation: two dimensions each for the continuous and ordinal features and one for the nominal features. Finally, we look for clusters in the five-dimensional representation using DBSCAN (eps = 1, min_points = 200). After selecting the final clusters, we characterized them by summarizing all 36 variables separately for each cluster. The continuous variables were summarized as their mean and the standard error of the mean. The categorical variables were summarized as their frequency distribution and the proportion of each value within each cluster.
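The workflow described in this section could be sketched roughly as follows, assuming the umap-learn and scikit-learn packages. This is not the authors' code: the function name `distributed_clustering`, the `blocks` input layout and the `seed` argument are hypothetical, and the text's n_neighbours, min_distance and min_points are mapped onto what these libraries call `n_neighbors`, `min_dist` and `min_samples`.

```python
# Per-feature-type UMAP settings taken from the text: (metric, embedding dimensions).
FEATURE_TYPE_SETTINGS = {
    "continuous": ("euclidean", 2),
    "ordinal": ("canberra", 2),
    "nominal": ("hamming", 1),
}

def distributed_clustering(blocks, seed=0):
    """Embed each feature block with its own metric, stack, then run DBSCAN.

    `blocks` maps a feature type to a 2-D numeric array (rows = patients).
    Imports are local so the settings above can be inspected without the
    optional dependencies installed.
    """
    import numpy as np
    import umap  # provided by the umap-learn package
    from sklearn.cluster import DBSCAN

    embeddings = []
    for feature_type, (metric, n_components) in FEATURE_TYPE_SETTINGS.items():
        reducer = umap.UMAP(
            n_neighbors=30,
            min_dist=0.1,
            n_components=n_components,
            metric=metric,
            random_state=seed,
        )
        embeddings.append(reducer.fit_transform(blocks[feature_type]))

    reduced = np.hstack(embeddings)  # five columns in total
    labels = DBSCAN(eps=1, min_samples=200).fit_predict(reduced)  # -1 marks noise
    return reduced, labels
```

Keeping the metric and dimensionality in one table makes the design choice explicit: each feature type contributes a fixed, small number of coordinates to the integrated representation, regardless of how many raw features it has.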

Extraction of T2DM sub-populations using DBSCAN

Using the clustering paradigm described above, we detect seven sub-populations among the patients, with 261 patients considered outliers. We show the distribution of clusters in Fig. 3a. We further perform a UMAP on the five-dimensional reduced representation of our data to visualize the clusters detected by DBSCAN. For this, we label the data points using the DBSCAN clustering labels and colour code them in the UMAP representation of the five-dimensional reduced data, as shown in Fig. 3b. This supports the view that the clustering produced by DBSCAN is sensible. Note that, among these clusters, we detect four significant patient sub-populations containing 2898, 2301, 2226 and 1315 data points.

**Fig. 3: The information on clusters detected in the data.**

Results

Characterization of clusters

Age and BMI both were found to be lower in Cluster 2 and Cluster 4

Age and obesity are the most important risk factors for T2DM. However, we found heterogeneity in both these variables across the clusters. Interestingly, the mean age and BMI were both lower in Cluster 2 (Age: 38.3 ± 0.19 years, BMI: 23.9 ± 0.1) and Cluster 4 (Age: 37.9 ± 0.26 years, BMI: 23.6 ± 0.13) compared to Cluster 1 (Age: 41.3 ± 0.14 years, BMI: 26.7 ± 0.09) and Cluster 3 (Age: 39.9 ± 0.18 years, BMI: 26 ± 0.11). However, the distribution of males and females was found to be similar across all the clusters.
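According to the methodology, continuous variables are reported per cluster as the mean and the standard error of the mean, and categorical variables as proportions. A minimal sketch of both summaries is given below; the helper names and the toy values are illustrative assumptions, and the sample standard deviation (n − 1 denominator) is assumed for the standard error.

```python
import math
from collections import Counter
from statistics import mean, stdev

def mean_and_sem(values):
    """Return (mean, standard error of the mean) using the sample standard deviation."""
    return mean(values), stdev(values) / math.sqrt(len(values))

def proportions(values):
    """Return the share of each category among the given values."""
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}

# Toy values for one hypothetical cluster.
bmi_in_cluster = [22.0, 24.0, 26.0, 28.0]
residence_in_cluster = ["rural", "rural", "urban", "rural"]

bmi_mean, bmi_sem = mean_and_sem(bmi_in_cluster)
residence_share = proportions(residence_in_cluster)
```

Because the standard error shrinks with the square root of the cluster size, the small ± values reported for clusters of more than a thousand patients describe the precision of the mean, not the spread of individual patients.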

Higher proportion of rural residents and lower proportion of richest wealth quintile in Clusters 2 and 4

The proportion of rural residents was found to be high in Cluster 2 (69.4% were Rural residents) and Cluster 4 (72.02% were Rural residents) compared to the other clusters (31.3% in Cluster 1 and 49.19% in Cluster 3). Surprisingly, only 4.3% of the people in Cluster 2 and 8.37% in Cluster 4 belonged to the richest quintile of the Wealth Index category whereas 64.04% in Cluster 1 and 54.9% in Cluster 3 belonged to the same.

Frequency of co-morbid conditions was similar across all the clusters

Co-morbid conditions included a history of asthma, thyroid disease, heart disease, cancer, history of tuberculosis, haemoglobin level and hypertension. Though the distribution of disease conditions shows minor variation across the clusters (Table 1), the overall trend is similar in all the clusters.

Table 1 Detailed cluster-specific analysis for all numerical and categorical variables.

Lifestyle patterns show evidence of a lower quality of life for patient sub-populations in Clusters 2 and 4

Our analysis reveals several other factors that support the fact that T2DM sub-populations from Cluster 2 and Cluster 4 have a considerably lower quality of life.

1. We observe that only 0.22% and 24.79% of patients belonging to Cluster 2 and Cluster 4, respectively, possess a refrigerator compared to 95.48% and 65.77% of patients belonging to Cluster 1 and Cluster 3, respectively.
2. Only 30.9% and 32.78% of patients belonging to Cluster 2 and Cluster 4, respectively, possess a motorbike compared to 71.53% and 67.03% of patients belonging to Cluster 1 and Cluster 3, respectively.
3. Only 3.26% and 3.19% of patients belonging to Cluster 2 and Cluster 4, respectively, possess a car/truck compared to 23.5% and 17.34% of patients belonging to Cluster 1 and Cluster 3, respectively.
4. 44.24% and 54.98% of patients belonging to Cluster 2 and Cluster 4, respectively, use plant-based cooking fuel, which is relatively cheap, compared to 12.22% and 19.63% of patients belonging to Cluster 1 and Cluster 3, respectively. Moreover, only 41.94% and 36.2% of patients belonging to Cluster 2 and Cluster 4, respectively, use gas/oil-based cooking fuel, which is relatively expensive, compared to 84.89% and 70.17% of patients belonging to Cluster 1 and Cluster 3, respectively.
5. 6.35% and 15.51% of patients belonging to Cluster 2 and Cluster 4, respectively, drink water from unprotected sources, compared to 2.62% and 1.98% of patients belonging to Cluster 1 and Cluster 3, respectively.

Intake of non-vegetarian foods is invariably low in Cluster 3

Around 90% or more of the population in Cluster 3 had no intake of egg (89.08%), fish (97.12%) and chicken or meat (97.71%), whereas fewer than 10% of the population in each of the other three clusters had no intake of these non-vegetarian foods (Table 1). Though the Cluster 3 population had the highest daily intake of milk/curd (61.81%) and pulses/beans (50.31%), the other clusters had a similar proportion of people taking milk/curd and pulses/beans daily. Intake of other foods like dark leafy vegetables, fruits, fried foods and aerated drinks showed a similar distribution across all the clusters.

Discussion

Rationale of the workflow in clustering epidemiological data

The clustering workflow used arises from some important observations that we will discuss here. To begin with, we have a population of 10,125 T2DM patients with a diverse ensemble of features accounting for information on medical history, dietary and addiction habits, socio-economic and lifestyle patterns. Moreover, the features in the considered dataset are also diverse in terms of data types. We have a total of 36 features, out of which 4 are continuous features, 7 nominal features and 25 ordinal features, all of equal importance by assumption.

The aim is to find significant sub-populations in our data such that the identified sub-populations are interpretable in terms of the considered features. Note here that, by significant sub-populations, we mean a sub-population consisting of at least 10 percent of the total population. If there exist such sub-populations and we can explain the sub-populations in terms of the considered features, we can argue that these patterns exist in a significant number of patients.
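The 10 percent criterion can be applied to a vector of cluster labels as in the sketch below. The function name, the use of −1 as the noise label (the convention used by scikit-learn's DBSCAN) and the toy labels are assumptions for illustration.

```python
from collections import Counter

def significant_clusters(labels, min_fraction=0.10, noise_label=-1):
    """Return {cluster: size} for clusters holding at least `min_fraction` of all points."""
    counts = Counter(labels)
    total = len(labels)
    return {
        cluster: size
        for cluster, size in counts.items()
        if cluster != noise_label and size / total >= min_fraction
    }

# Toy labels: 20 points with two large clusters, one tiny cluster and one noise point.
labels = [0] * 12 + [1] * 6 + [2] * 1 + [-1] * 1
print(significant_clusters(labels))
```

Applied to the cluster sizes reported in this study, the smallest of the four significant clusters (1315 of 10,125 patients) makes up about 13% of the population and therefore passes the threshold.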

We have already argued in favour of using UMAP for our unsupervised approach to find clusters in the data. However, we observed that applying the UMAP algorithm conventionally, using the Euclidean similarity metric on our entire dataset with 36 features, is ineffective. The reason is that, in this case, the continuous features have an overpowering effect over the other feature types in determining the distribution of clusters. This can be observed in Fig. 2a, b: Fig. 2a shows UMAP clustering with all 36 features and Fig. 2b shows UMAP clustering with only the four continuous features. There is a similarity in the clustering distribution of these figures, each containing one major cluster and seven small minor clusters. We observed that this is because UMAP, when applied to all 36 features of the dataset using the Euclidean similarity measure, is largely biased towards finding similarity among data points only in terms of the continuous features. Given that we have only four continuous features out of 36, this poses a problem, as the diverse information present in the dataset in the form of the ordinal and nominal features is largely ignored.
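One plausible mechanism for this dominance, shown here only as a hedged illustration, is scale: if continuous features are left in their raw units while categorical features are small integer codes, the continuous differences account for most of the squared Euclidean distance. Whether and how the features were scaled in the study is not stated, so the two hypothetical patients and the helper below are assumptions.

```python
def continuous_share(x, y, continuous_idx):
    """Fraction of the squared Euclidean distance contributed by the continuous columns."""
    squared = [(a - b) ** 2 for a, b in zip(x, y)]
    total = sum(squared)
    return sum(squared[i] for i in continuous_idx) / total if total else 0.0

# Hypothetical patients: [age, BMI, then eight categorical codes in 0..3].
p = [35.0, 22.0, 0, 1, 2, 3, 0, 1, 2, 3]
q = [55.0, 30.0, 3, 0, 1, 2, 1, 0, 3, 2]

share = continuous_share(p, q, continuous_idx=[0, 1])
print(round(share, 3))
```

In this toy case every categorical feature differs between the two patients, yet the two continuous features still supply almost all of the distance, which is the kind of imbalance that treating each feature type separately avoids.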

To solve this problem, the continuous, ordinal and nominal features were clustered separately using different similarity metrics, giving rise to our clustering paradigm. We justified our choice of similarity measures in the section “Clustering paradigm using UMAP”. This generates, for each feature type, a lower-dimensional data representation, shown in Fig. 2b–d. We finally integrated these lower-dimensional representations by taking two-dimensional representations for the continuous and ordinal features and a one-dimensional representation (the one carrying the most variance) for the nominal features. The reason for using a one-dimensional representation for the nominal features is that the Hamming metric retains a lot of variance in such data, resulting in multiple clusters, as we observe in Fig. 2d. Using a two-dimensional representation for these data during integration would carry this variance forward and result in multiple small clusters in the final clustering distribution, which contradicts our aim of finding significantly large sub-populations (of at least 10 percent of the total population).

Finally, the integration is done by applying UMAP on the five-dimensional reduced representation of the dataset using the Euclidean similarity measure (shown in Fig. 3b). Note that, in our final clusters, we can observe patterns in all of the continuous, ordinal and nominal data types. For example, in Cluster 4 the continuous feature ‘Time to Water source (min)’ shows very high values compared to other clusters. In Clusters 1 and 3, the nominal feature ‘Cooking fuel used’ shows a higher percentage of gas/oil users, while in Clusters 2 and 4 the same feature shows a higher percentage of plant-based fuel users. In Cluster 3, the ordinal feature ‘Fish intake frequency’ shows that 97% of people never consume fish. Thus, we infer that our clustering paradigm enables us to find significant sub-populations while keeping the clustering distribution unbiased, that is, no feature type (continuous, ordinal or nominal) has an overpowering effect on the others.

Significance of T2DM clusters

T2DM was long regarded as a homogeneous disease, with Insulin Resistance followed by β-cell dysfunction as the underlying pathology. However, recent studies have found T2DM to be a heterogeneous entity in which the relative contributions of Insulin Resistance and β-cell dysfunction differ across T2DM clusters [3]. These studies were performed on clinical and biochemical data with variables having uniform data types. Our clustering approach, on the other hand, takes into account the diverse data types obtained from an epidemiological dataset and discovers clusters among the T2DM population. Interestingly, two of the four clusters obtained in our study belonged to the non-obese T2DM phenotype, where the mean BMI was below 25. These two non-obese clusters also had a lower mean age compared to the other clusters. Both had a larger proportion of rural residents and a lower proportion of people belonging to the highest wealth quintile, suggesting that a large majority of T2DM patients from rural India have a lower BMI and are younger. The T2DM patient sub-population belonging to these clusters has a relatively lower quality of life, judging by the lifestyle pattern-based features. The non-obese phenotype of T2DM has been increasingly reported over the last two decades, raising concern about the uniqueness of its underlying pathophysiology, with a greater contribution of β-cell dysfunction compared to Insulin Resistance [25,26,27,28]. This non-obese T2DM phenotype has been found among Asians, and studies investigating its similarities to and differences from other phenotypes are under way. Studies have concluded that T2DM occurs among Asians at a lower BMI cut-off and also at a younger age [29, 30]. Our finding of two non-obese clusters with lower mean age is consistent with this.

Among the studies aimed at identifying T2DM subtypes, the two subtypes severe insulin deficient diabetes (SIDD) and mild-age-related diabetes (MARD) were found to be common [3, 4]. Both Cluster 2 and Cluster 4 in our study seem to have similarities to the SIDD subtype, though the reasons behind obtaining two different epidemiological clusters within the SIDD subtype need further investigation. As our dataset had patients below 49 years of age, we could not obtain any cluster that may be compared to the MARD subgroup. The remaining two clusters in our study, Cluster 1 and Cluster 3, are both obese groups and therefore may be the epidemiological counterparts of either mild obesity-related diabetes (MOD) or severe insulin-resistant diabetes (SIRD). Hence, the T2DM clusters obtained from epidemiological data lend further support to the clinical T2DM subtypes and raise the need to further investigate the epidemiological risk factors of T2DM subtypes.

Though non-obese T2DM is being considered a unique phenotype, epidemiological studies identifying high-risk population groups are still lacking. This is especially important for many Asian countries, where over half of the T2DM population is of the non-obese phenotype [25]. Our analysis, which reports an increased presence of rural residents in both non-obese T2DM clusters, calls for a modification of the BMI and age cut-offs for T2DM screening among rural residents. However, risk factors for T2DM specific to the rural population still need to be identified. Representation of people from the highest wealth quintile was much lower in both non-obese T2DM clusters. T2DM is a multifactorial disease requiring strict compliance with lifestyle modification, proper diet and antidiabetic therapy. The reduced representation of the highest wealth quintile in the non-obese T2DM clusters suggests the possibility of unequal access to care for non-obese T2DM patients, thereby generating the need for a more equitable healthcare policy in terms of prevention and therapy.

On the other hand, both obese T2DM clusters had a higher mean age and more urban residents. The proportion of people from the highest wealth quintile was also higher in both obese clusters. Interestingly, one of the obese clusters (Cluster 3) had an invariably low intake of non-vegetarian foods (egg, fish, chicken and meat), indicating that this T2DM cluster comprised mainly vegetarian people. Dietary requirements in diagnosed T2DM patients involve a reduced amount of carbohydrates and fats with an increased amount of protein-rich foods [31]. Animal products, being rich sources of dietary protein, need to be included in the diet. The existence of an obese T2DM cluster with a strict vegetarian dietary pattern suggests the need to design proper dietary guidelines for this group.

As already mentioned, T2DM is a multifactorial disease, and socio-economic inequality has been suggested to play an important role in its pathology and management [32]. Studies have found socio-economic inequalities and allostatic load to be associated with T2DM [33]. The negative effect of social stress, uncertainty and poor nutrition [34] seems to be more relevant for clusters 2 and 4 in this study, where individuals mostly belong to a weaker socio-economic class. Though this study does not have the data to investigate this association, obtaining T2DM subtypes based on allostatic load seems to be a promising area of diabetes research. Designing a customized healthy diet and lifestyle plan for certain T2DM subtypes, with a view to reducing the allostatic load, may emerge as an important strategy in T2DM management.

Conclusions

From a data science perspective, this analysis addresses the issue of diverse data types. We have shown that, for such data, a conventional application of dimension reduction approaches might not be fruitful. We developed a workflow that finds meaningful and interpretable clusters such that the distribution of clusters is not biased by the data types.

The existence of a significant non-obese T2DM patient sub-population that is younger, has a larger proportion of rural residents and shows a lower quality of life indicates the need for different screening criteria for T2DM among rural Indian residents. The obese T2DM cluster in which around 90% of people follow a vegetarian diet calls for dietary guidelines for T2DM patients with a vegetarian dietary pattern.

Data availability

We support the idea of transparency and reproducibility of research. Therefore, all data relevant to this work are made publicly available in a GitHub repository.

Code availability

The Python code (in the form of a Jupyter notebook) for the implementation of our workflow is also publicly available on GitHub.

References

Zheng Y, Ley S, Hu F. Global aetiology and epidemiology of type 2 diabetes mellitus and its complications. Nat Rev Endocrinol. 2017;14:12. https://doi.org/10.1038/nrendo.2017.151.
Chen L, Magliano D. The worldwide epidemiology of type 2 diabetes mellitus-present and future perspectives. Nat Rev Endocrinol. 2011;8:228–36. https://doi.org/10.1038/nrendo.2011.183.
Anjana R, Baskar V, Thakarakkattil A, Nair N, Jebarani S, Siddiqui MK, et al. Novel subgroups of type 2 diabetes and their association with microvascular outcomes in an asian indian population: a data-driven cluster analysis: the inspired study. BMJ Open Diabetes Res Care. 2020;8:1506. https://doi.org/10.1136/bmjdrc-2020-001506.
Ahlqvist E, Storm P, Käräjämäki A, Martinell M, Dorkhan M, Carlsson A, et al. Novel subgroups of adult-onset diabetes and their association with outcomes: A data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol. 2018;6:03. https://doi.org/10.1016/S2213-8587(18)30051-2.
Cho SB, Kim S, Chung M. Identification of novel population clusters with different susceptibilities to type 2 diabetes and their impact on the prediction of diabetes. Sci Rep. 2019;9:12. https://doi.org/10.1038/s41598-019-40058-y.
Carlsson S, Hammar N, Grill V, Kaprio J. Alcohol consumption and the incidence of type 2 diabetes. Diabetes Care. 2003;26:2785–90. https://doi.org/10.2337/diacare.26.10.2785.
Wheeler ML, Dunbar SA, Jaacks LM, Karmally W, Mayer-Davis EJ, Wylie-Rosett J, et al. Macronutrients, food groups, and eating patterns in the management of diabetes. Diabetes Care. 2012;35:434–45. https://doi.org/10.2337/dc11-2216.
Agardh E, Ahlbom A, Andersson T, Efendic S, Grill V, Hallqvist J, et al. Socio-economic position at three points in life in association with type 2 diabetes and impaired glucose tolerance in middle-aged Swedish men and women. Int J Epidemiol. 2007;36:84–92. https://doi.org/10.1093/ije/dyl269.
Agardh E, Allebeck P, Hallqvist J, Moradi T, Sidorchuk A. Type 2 diabetes incidence and socio-economic position: a systematic review and meta-analysis. Int J Epidemiol. 2011;40:804–18. https://doi.org/10.1093/ije/dyr029.
Nagaya T, Yoshida H, Takahashi H, Kawai M. Heavy smoking raises risk for type 2 diabetes mellitus in obese men; but, light smoking reduces the risk in lean men: A follow-up study in Japan. Ann Epidemiol. 2008;18:113–8. https://doi.org/10.1016/j.annepidem.2007.07.107.
Schwingshackl L, Hoffmann G, Lampousi A-M, Knüppel S, Iqbal K, Schwedhelm C, et al. Food groups and risk of type 2 diabetes mellitus: a systematic review and meta-analysis of prospective studies. Eur J Epidemiol. 2017;32:04. https://doi.org/10.1007/s10654-017-0246-y.
Liu G, Zong G, Wu K, Hu Y, Li Y, Willett WC, et al. Meat cooking methods and risk of type 2 diabetes: Results from three prospective cohort studies. Diabetes Care. 2018;41:1049–60. https://doi.org/10.2337/dc17-1992.
Connolly V, Unwin N, Sherriff P, Bilous R, Kelly W. Diabetes prevalence and socioeconomic status: A population based study showing increased prevalence of type 2 diabetes mellitus in deprived areas. J Epidemiol Community Health. 2000;54:173–7. https://doi.org/10.1136/jech.54.3.173.
Borgharkar S, Das S. Real-world evidence of glycemic control among patients with type 2 diabetes mellitus in India: The tight study. BMJ Open Diabetes Res Care. 2019;7:e000654 https://doi.org/10.1136/bmjdrc-2019-000654.
Dennis J, Shields B, Henley W, Jones A, Hattersley A. Disease progression and treatment response in data-driven subgroups of type 2 diabetes compared with models based on simple clinical features: an analysis using clinical trial data. Lancet Diabetes Endocrinol. 2019;7:04 https://doi.org/10.1016/S2213-8587(19)30087-7.
Sun Z, Xing W, Guo W, Kim S, Li H, Li W, et al. A survey on dimension reduction algorithms in big data visualization. Springer 2020. p. 375–95. https://doi.org/10.1007/978-3-030-48513-9.31.
van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–605. http://www.jmlr.org/papers/v9/vandermaaten08a.html.
Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nat Commun. 2019;10:5416 https://doi.org/10.1038/s41467-019-13056-x.
Li W, Cerise JE, Yang Y, Han H. Application of t-sne to human genetic data. J Bioinforma Computational Biol. 2017;15:1750017. https://doi.org/10.1142/S0219720017500172.
McInnes L, Healy J, Saul N, Großberger L. Umap: Uniform manifold approximation and projection. J Open Source Softw. 2018;3:861.
Galow A-M, Wolfien M, Müller P, Bartsch M, Brunner RM, Hoeflich A, et al. Integrative cluster analysis of whole hearts reveals proliferative cardiomyocytes in adult mice. Cells. 2020;9:1–16.
Diaz-Papkovich A, Anderson-Trocmé L, Ben-Eghan C, Gravel S. UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. PLOS Genet. 2019;15:1–24. https://doi.org/10.1371/journal.pgen.1008432.
Luo S, Miao D, Zhang Z, Zhang Y, Hu S. A neighborhood rough set model with nominal metric embedding. Inf Sci. 2020;520:02 https://doi.org/10.1016/j.ins.2020.02.015.
Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96). AAAI Press; 1996. p. 226–31.
Gujral U, Weber M, Staimez L, Narayan KMV. Diabetes among non-overweight individuals: an emerging public health challenge. Curr Diabetes Rep. 2018;18:60. https://doi.org/10.1007/s11892-018-1017-1.
Staimez L, Weber M, Ranjani H, Ali M, Echouffo-Tcheugui J, Phillips L, et al. Evidence of reduced beta cell function in asian indians with mild dysglycemia. Diabetes Care. 2013;36:04. https://doi.org/10.2337/dc12-2290.
Sarkar J, Maity SK, Sen A, Nargis T, Ray D, Chakrabarti P. Impaired compensatory hyperinsulinemia among nonobese type 2 diabetes patients: a cross-sectional study. Ther Adv Endocrinol Metab. 2019;10. https://doi.org/10.1177/2042018819889024.
Narayan KMV. Type 2 diabetes: Why we are winning the battle but losing the war? 2015 kelly west award lecture. Diabetes Care. 2016;39:653–63. https://doi.org/10.2337/dc16-0205.05.
Ma R, Chan J. Type 2 diabetes in east asians: similarities and differences with populations in europe and the united states. Ann N Y Acad Sci. 2013;1281:64–91.
Lee JWR, Brancati FL, Yeh H-C. Trends in the prevalence of type 2 diabetes in asians versus whites. Diabetes Care. 2011;34:353–7. https://doi.org/10.2337/dc10-0746.
Position Statements. Nutrition principles and recommendations in diabetes. Diabetes Care. 2004;27:s36. https://doi.org/10.2337/diacare.27.2007.S36.
Hackett RA, Steptoe A. Psychosocial factors in diabetes and cardiovascular risk. Curr Cardiol Rep. 2016;18:95. https://doi.org/10.1007/s11886-016-0771-4.
Hackett RA, Steptoe A. Type 2 diabetes mellitus and psychological stress — a modifiable risk factor. Nat Rev Endocrinol. 2017;13:547–60.
Macit MS, Acar-Tek N. Evaluation of Nutritional Status and Allostatic Load in Adult Patients With Type 2 Diabetes. Can J Diabetes. 2020;44:156–161. https://doi.org/10.1016/j.jcjd.2019.05.011.

Acknowledgements

This work was supported in part by funds from Bioinformatics Infrastructure (de.NBI) and the Establishment of Systems Medicine Consortium in Germany e:Med, as well as the German Federal Ministry for Education and Research (BMBF) programmes (FKZ 01ZX1709C). JS received a research fellowship from the Indian Council of Medical Research (ICMR) (No.3/1/3/JRF-2017/HRD-LS/56429/54).

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

These authors contributed equally: Saptarshi Bej, Jit Sarkar.

Authors and Affiliations

Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
Saptarshi Bej & Olaf Wolkenhauer
Leibniz-Institute for Food Systems Biology at the Technical University Munich, Munich, Germany
Saptarshi Bej & Olaf Wolkenhauer
Division of Cell Biology and Physiology, CSIR-Indian Institute of Chemical Biology, Kolkata, India
Jit Sarkar & Partha Chakrabarti
Academy of Innovative and Scientific Research, Ghaziabad, India
Jit Sarkar & Partha Chakrabarti
Advanced Technology Development Centre, Indian Institute of Technology, Kharagpur, India
Saikat Biswas
Department of Computer Science & Engineering, Indian Institute of Technology, Kharagpur, India
Pabitra Mitra
Stellenbosch Institute for Advanced Study (STIAS), Wallenberg Research Centre at Stellenbosch University, Stellenbosch, South Africa
Olaf Wolkenhauer

Contributions

SB and JS are the first authors and contributed equally to this work. SB, JS, PM, PC and OW contributed to the study concept and design. SB, JS and SB did the data analysis. SB, JS and OW wrote the manuscript and are the guarantors of this work having full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. All authors approved the final version of the article, including the authorship list.

Corresponding authors

Correspondence to Saptarshi Bej, Jit Sarkar or Olaf Wolkenhauer.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Bej, S., Sarkar, J., Biswas, S. et al. Identification and epidemiological characterization of Type-2 diabetes sub-population using an unsupervised machine learning approach. Nutr. Diabetes 12, 27 (2022). https://doi.org/10.1038/s41387-022-00206-2

Received: 13 December 2020
Revised: 11 March 2022
Accepted: 18 May 2022
Published: 27 May 2022
DOI: https://doi.org/10.1038/s41387-022-00206-2
