Type-2 diabetes mellitus (T2DM) is a multifactorial disease globally estimated to rise to 629 million cases by 2045 (see IDF Diabetes Atlas) [1, 2]. Though conceived as a homogeneous disease for long, several recent studies have found T2DM to be a mix of heterogeneous disease subtypes [3,4,5]. These studies have reported a varied pathophysiology underlying T2DM and thereby suggest the possibility of a personalised treatment for T2DM.

Besides obesity, other factors like age, sex, socio-economic status, place of residence (rural/urban), smoking habit, alcohol intake, food frequency, etc. are significantly associated with T2DM [6,7,8,9,10,11,12,13]. Several of these factors are modifiable in nature and hence are important in the management of T2DM [1]. However, modification of lifestyle-related factors varies and thereby leads to a differential degree of glycemic control among T2DM patients [14]. Glycaemic control and response to anti-diabetics have also been shown to be different among T2DM sub-groups [15]. To explore whether any particular pattern of patient sub-populations exists within the entire T2DM population based on socio-demographic and lifestyle factors, we used an unsupervised clustering approach on the largest and most comprehensive epidemiological dataset in India, the National Family Health Survey-4 (NFHS-4) dataset. Clusters were subsequently characterised to identify unique socio-demographic and lifestyle patterns associated with these sub-populations.

Epidemiological datasets provide a comprehensive set of information regarding socio-demography, lifestyle, addiction and co-morbidities. Variables containing such information are called features in the language of Machine Learning. In the T2DM-NFHS-4 dataset, there are 36 such features, containing information on each diabetes patient. Moreover, in our dataset, the features can be categorised into three types:

  1. 1.

    Continuous features: These are the features that can assume any numeric value from a continuous range. For example, the BMI of a patient is a continuous feature.

  2. 2.

    Ordinal features: These are the features that assume values from a discrete range, such that, there is a sense of order in the values assumed by the feature. For example, let us assume a feature ‘meat consumption by a patient’, assumes values ‘daily’, ‘weekly’ or ‘monthly’. Clearly, the range of the feature ‘meat consumption by a patient’ is discrete, since it can assume any one of the three values. Also, there is a sense of order in the values, indicating that daily meat consumption is the highest and monthly meat consumption is the lowest if we want to quantify meat consumption.

  3. 3.

    Nominal features: These are the features that assume values from a discrete range, such that, there is no sense of order in the values assumed by the feature. For example, let us assume a feature ‘religion of a patient’, assumes values ‘Hindus’, ‘Muslims’ or ‘Christians’. Clearly, the range of the feature ‘religion of a patient’ is discrete, since it can assume any one of the three values. But there is no sense of order in the possible values assumed by the features. Yet, this feature draws its importance from the fact that lifestyle patterns or diets vary largely among these religious groups.

Such diverse types of features in epidemiological data create challenges for the analysis. Conventional application of the state-of-the-art dimension reduction tool Uniform Manifold Approximation (UMAP) was found to be ineffective for the T2DM-NFHS-4 dataset. Continuous features, although smaller in numbers, had an overpowering effect on the distribution of clusters. To address this problem, we implemented a distributed clustering workflow, combining different similarity measure settings of UMAP, for clustering continuous, ordinal and nominal features separately. We integrated the reduced dimensions from each feature-type-distributed clustering to obtain interpretable and unbiased clustering of the data.

The workflow realised for the present study (Fig. 1) involves the investigation of underlying socio-demographic patterns within patient sub-populations using unsupervised learning. Dimension reduction approaches are often used to reduce higher dimensional data to lower dimensions such that in the lower dimensional embedding of the data one can visualize underlying clusters within the data, that are not apparent in the higher dimensions [16]. Several such techniques have been developed over the last few decades. Until recently the dimension reduction technique t-Stochastic Neighbourhood Embedding (t-SNE) was a state-of-the-art algorithm in this field providing numerous applications in various fields [17,18,19]. t-SNE projects high dimensional data to a lower dimension while maintaining the underlying local manifold structure in a sense that, in a lower dimension t-SNE can cluster points, that are close enough in the latent high dimensional manifold [17].

Fig. 1: Workflow describing the analysis of the T2DM-NFHS-4 Dataset.
figure 1

A visualisation for the novel feature type-wise clustering paradigm used in the study, to account for the bias of UMAP towards over representation of continuous variables during dimension reduction.

With a rigorous mathematical foundation, considerably high speed and easy to use using scikit-learn API, UMAP has turned out to be one of the most popular choices among data scientists [20,21,22]. As opposed to t-SNE, UMAP uses a graph-based manifold approximation mechanism which contributes to the preservation of the global as well as social properties of the latent data manifold in a lower-dimensional representation of the data. Given some low-dimensional representation of the data, a similar process can be used to construct an equivalent topological representation. UMAP builds a graph considering customized neighbourhoods for every data point. This graph is a representation of the higher dimensional data manifold. The end result is a patchwork of low-dimensional representations of neighbourhoods that groups similar data points on a local scale while better preserving long-range topological connections to more distantly related data points [20, 22]. For the ability of UMAP to preserve the long-range topological connections along with the short-range topological connections and because of its high computational efficiency we choose UMAP for our unsupervised clustering approach. Moreover, UMAP allows a user to specify several similarity measures through the tuning of the metric parameter. This has been critical in our workflow since our data contains continuous and categorical features and choosing suitable similarity measures for continuous and categorical features is crucial for a meaningful and informative clustering [23].


Source and description of the T2DM-NFHS-4 dataset

Data preparation and pre-processing are the key aspects of approaching a problem from a Machine Learning perspective. In this section, we provide the details on the pre-processing approach adopted to generate the T2DM-NFHS-4 dataset.

The NFHS-4 dataset was downloaded from The Demographic & Health Surveys (DHS) Program website. NFHS-4 is the fourth version of the national health survey conducted under the supervision of the Ministry of Health and Family Welfare, Government of India with the International Institute for Population Sciences (IIPS), Mumbai serving as the main nodal agency for all the surveys. The sampling procedure followed in NFHS-4 was stratified two-stage sampling covering all the 640 districts of India. The survey was successfully conducted with 601,509 households. In those interviewed households 112,122 men and 699,686 women could be successfully interviewed. Four survey questionnaires (Household Questionnaire, Woman’s Questionnaire, Man’s Questionnaire and Biomarker Questionnaire) were implemented in 17 local languages to collect information on basic demographic information, socio-economic parameters, family planning issues, nutritional status, health indicators, contact with community health workers, etc. The uniqueness of the NFHS-4 study was that it collected data on Diabetes status and performed a Random Blood Glucose for individuals (15–54 years) using a finger-stick blood specimen. As a result, the biomarker measurements and tests besides anthropometric measurements like anaemia testing, blood pressure measurement, blood glucose testing and HIV testing were included in the survey.

Dataset preparation

For dataset preparation and cleaning, the three questionnaires were merged: Woman’s Questionnaire, Man’s Questionnaire and Biomarker Questionnaire. The first two contained information about background characteristics (location, age, sex, religion, social group, literacy, wealth status, etc.), nutritional practices, addictions and co-morbidities while the biomarker questionnaire contained information on height, weight, blood pressure and random blood glucose. A unique code was generated for all individuals in all the three questionnaires by appending the country code and phase, cluster number, household number and line number. The three datasets were joined by the unique code to prepare a single dataset of 810,971 individuals consisting of all men and women between 15–54 years of age. Pregnant women were next excluded to discard the possibility of Gestational Diabetes Mellitus. Individuals with missing diabetic and blood pressure status were also excluded. Variables known to be risk factors for DM (body mass index (BMI), age, place of residence, wealth index, smoking frequency, alcohol intake frequency, hypertension), socio-economic factors (sex, religion, social group, educational status), Dietary frequencies and haemoglobin level were selected for final analysis. BMI, age and haemoglobin level were taken as continuous variables and the rest as categorical variables. Outliers were removed separately for all the three continuous variables to obtain the final dataset with 610, 498 individuals (526, 678 females and 83, 820 males).

Dataset pre-processing

We were interested in detecting significant T2DM sub-populations in the data and further sought to characterize these sub-populations based on the socio-demographic and co-morbid conditions. For this purpose, we extracted patients with a known history of diabetes from the dataset: a total of 10 125 patients. We considered a diverse collection of socio-demographic and co-morbid conditions as ‘features’ in our dataset. Qualitatively our features can be divided into several categories:

  1. 1.

    Co-morbid conditions: This class of features considers the co-morbid diseases among T2DM patients. We considered whether a T2DM patient had medical conditions such as asthma, thyroid disorder, heart disease, cancer, tuberculosis and hypertension. Thus, there were six features in this category. These features are binary in nature denoting whether a T2DM patient suffered from a given comorbidity or not.

  2. 2.

    Food habits: This class of features considered the food habits of T2DM patients. The features considered here were how frequently the patient took the food items: milk or curd, pulses or beans, dark leafy vegetables, fruits, eggs, fish, chicken, fried food and aerated drinks. Thus, there were nine features in this category. Features were categorical and ordinal in nature having four possible values: ‘daily’, ‘occasionally’, ‘weekly’ and ‘never’.

  3. 3.

    Addiction history: This class of features considered the addiction pattern of T2DM patients. There were two features in this class, both binary in nature encoding whether a patient is a smoker or whether a patient takes alcohol.

  4. 4.

    Socio-demographic features: These included features such as sex, age, wealth index, education level, religion and caste along with BMI and haemoglobin level of the patient. There were eight features in this category.

  5. 5.

    Living conditions: This class of features quantifies the living conditions of the patients. The features in this class considered whether a patient lives in a household possessing refrigerator, bicycle, motorbike, four-wheeler vehicle and livestock. Moreover, there were features denoting the type of residence, household structure, frequency of household members smoking inside the house, type of cooking fuel used, source of drinking water and time to reach the nearest drinking water source. Thus, there were eleven features belonging to this category.

For our study, 36 features or factors are considered to investigate significant patient populations among the diabetes patients into consideration. Note that there are both continuous and categorical features among these thirty-six features. Among the categorical features, there are both ordinal features and nominal features. Ordinal features have a sense of order among them, such as the features from the ‘food habits’ category as described before. The nominal features are categorical features with no sense of order such as the sex of a patient. Note that for our dataset the continuous features are: age, BMI, haemoglobin level and time to get to drinking water source, whereas the nominal features are: sex, religion, caste, household structure, type of place of residence, type of cooking fuel and source of drinking water. The rest of the features are ordinal features. The categorization of features into continuous, nominal and ordinal is of utmost importance in our clustering paradigm which we discuss in the section “Clustering paradigm using UMAP”.

Identification of T2DM sub-populations using UMAP and DBSCAN

From our detailed description of our dataset, we pointed out that our dataset has a variety of features including continuous and categorical features. Further, there are both ordinal and nominal features among the categorical features in our dataset. A simple UMAP on the entire dataset is depicted in Fig. 2a, revealing two broad clusters. For this clustering of UMAP parameters, n_neighbours have been chosen to be 30, whereas the metric parameter has been chosen to be Euclidean. However, we have a number of important nominal and ordinal categorical features whose effect would not be apparent from such a clustering. Moreover, the Euclidean distance does not always make sense on categorical features, especially if they are nominal in nature. For example, observe Fig. 2d, where we have used UMAP considering only the nominal features with metric parameter hamming (based on hamming distance). This reveals a completely different picture of the dataset, showing several small clusters. Our clustering paradigm is designed to optimise this effect and find a balance in the clustering where a particular type of feature does not have an overpowering effect on the clustering process.

Fig. 2: The low dimensional UMAP visualisations of data for several data types.
figure 2

a UMAP clusters for all the features with the Euclidean metric. b UMAP clusters for continuous features with Euclidean metric. c UMAP clusters for ordinal features with Canberra metric. d UMAP clusters for nominal features with Hamming metric.

Clustering paradigm using UMAP

Our clustering paradigm applies UMAP separately on continuous, nominal and ordinal features separately. For each of these feature categories, we create a lower-dimensional embedding of the dataset. Finally, we integrate the lower-dimensional embeddings to extract clusters from them using the DBSCAN algorithm, a clustering algorithm used for extracting clusters from data based on data density. One advantage of this algorithm is that one does not need to specify the number of clusters beforehand. DBSCAN considers closely or densely located points, as clusters [24]. For UMAP, we use the same values for the parameters n_neighbours = 30 and min_distance = 0.1 for all the feature types.

  • For the continuous features, we use the metric measure to be Euclidean. The Euclidean distance between two vectors is given by:

    $$d\left( {x,y} \right) = \sqrt {\mathop {\sum}\nolimits_{i = 1}^n {\left( {x_i - {\it{y}}_i} \right)^2} }$$
  • For the nominal features, we use the metric measure to be Hamming. Hamming distance is defined as:

    $$d\left( {x,y} \right) = \mathop {\sum}\nolimits_{i = 1}^n {\delta \left( {x_i,y_i} \right)}$$

    where δ(xi, yi) = 1, if xi = yi and δ(xi, yi) = 0 otherwise. Recall that nominal features are also a type of categorical features that do not have a sense of order associated with them. For such features, Hamming distance is widely used as a similarity measure between data points [23].

  • For the ordinal features, we use the metric measure to be Canberra. It is a weighted version of the Manhattan measure. The Canberra distance is given by:

$$d\left( {x,y} \right) = \sqrt {\mathop {\sum}\nolimits_{i = 1}^n {\frac{{\left| {x_i - y_i} \right|}}{{\left| {x_i} \right| + \left| {y_i} \right|}}} }$$

Ordinal features are also a type of categorical feature. However, the Hamming metric cannot capture the inherent ordered relationships and statistic information from categorical values [23]. We thus tried using UMAP for several metric measures and noticed that the Canberra distance measure retains a high variance in the lower dimensions. Thus we chose the Canberra distance measure as a similarity metric for ordinal features.

For the categorical and ordinal features, we thus produce a two-dimensional representation of each data point by taking into consideration the first two UMAP coordinates. For the nominal features, we consider we produce a one-dimensional representation since the data points are too scattered in this case as shown in Fig. 2d and thus can lead to too many clusters. Thus, we reduce every data point into a five-dimension representation, two for each of the continuous and ordinal features and one for the nominal features. Finally, we look for clusters in the five-dimensional representation using DBSCAN (eps = 1, min_points = 200). After selecting the final clusters, we characterized them by summarizing all the 36 variables separately for each cluster. The continuous variables were summarized as their mean and the standard error of the mean. The categorical variables were summarized as their frequency distribution and the proportion of each value within each cluster.

Extraction of T2DM sub-populations using DBSCAN

Using our clustering paradigm described before, we can detect seven sub-populations among the patients where 261 patients are considered as outliers. We show the distribution of clusters in Fig. 3a. We further perform a UMAP on the five-dimensional reduced representation of our data to visualize the clusters detected by DBSCAN. For this, we label the data points using the DBSCAN clustering labels and colour code them in the UMAP representation of the five-dimensional reduced data as shown in Fig. 3b. This provides validation to the fact the clustering done by DBSCAN makes sense. Note that, from our clusters, we can detect four significant patient sub-populations containing 2898, 2301, 2226 and 1315 data points.

Fig. 3: The information on clusters detected in the data.
figure 3

a Distribution of clusters detected by DBSCAN on the five-dimensional reduced representation of the data. b UMAP clusters for five-dimensional reduced representation of the data annotated by the DBSCAN generated clusters.


Characterization of clusters

Age and BMI both were found to be lower in Cluster 2 and Cluster 4

Age and obesity are the most important risk factors for T2DM. However, we found a heterogeneity in both these variables across all the clusters. Interestingly, the mean age and BMI both were lower in Cluster 2 (Age: 38.3 ± 0.19 years, BMI: 23.9 ± 0.1) and Cluster 4 (Age: 37.9 ± 0.26 years, BMI: 23.6 ± 0.13) compared to Cluster 1 (Age: 41.3 ± 0.14 years, BMI: 26.7 ± 0.09) and Cluster 3 (Age: 39.9 ± 0.18 years, BMI: 26 ± 0.11). However, the distribution of males and females has been found to be similar across all the clusters.

Higher proportion of rural residents and lower proportion of richest wealth quintile in Clusters 2 and 4

The proportion of rural residents was found to be high in Cluster 2 (69.4% were Rural residents) and Cluster 4 (72.02% were Rural residents) compared to the other clusters (31.3% in Cluster 1 and 49.19% in Cluster 3). Surprisingly, only 4.3% of the people in Cluster 2 and 8.37% in Cluster 4 belonged to the richest quintile of the Wealth Index category whereas 64.04% in Cluster 1 and 54.9% in Cluster 3 belonged to the same.

Frequency of co-morbid conditions was similar across all the clusters

Co-morbid conditions included a history of asthma, thyroid disease, heart disease, cancer, history of tuberculosis, haemoglobin level and hypertension. Though the distribution of disease conditions shows minor variation across the clusters (Table 1), the trend is almost similar in all the clusters.

Table 1 Detailed cluster-specific analysis for all numerical and categorical variables.

Lifestyle patterns show evidences of a lower quality of life for patient sub-populations in Clusters 2 and 4

Our analysis reveals several other factors that support the fact that T2DM sub-populations from Cluster 2 and Cluster 4 have a considerably lower quality of life.

  1. 1.

    We observe that only 0.22% and 24.79% of patients belonging to Cluster 2 and Cluster 4, respectively, possess a refrigerator compared to 95.48% and 65.77% of patients belonging to Cluster 1 and Cluster 3, respectively.

  2. 2.

    Only 30.9% and 32.78% of patients belonging to Cluster 2 and Cluster 4 respectively possess a motorbike compared to 71.53% and 67.03% of patients belonging to Cluster 1 and Cluster 3, respectively.

  3. 3.

    Only 3.26% and 3.19% of patients belonging to Cluster 2 and Cluster 4 respectively possess a car/truck compared to 23.5% and 17.34% of patients belonging to Cluster 1 and Cluster 3, respectively.

  4. 4.

    44.24% and 54.98% of patients belonging to Cluster 2 and Cluster 4 respectively, use plant-based cooking fuel, which is relatively cheap, compared to 12.22% and 19.63% of patients belonging to Cluster 1 and Cluster 3 respectively. Moreover, only 41.94% and 36.2% of patients belonging to Cluster 2 and Cluster 4, respectively use gas/oil-based cooking fuel, which is relatively expensive, compared to 84.89 and 70.17% of patients belonging to Cluster 1 and Cluster 3, respectively.

  5. 5.

    6.35% and 15.51% of patients belonging to Cluster 2 and Cluster 4 respectively, drink water from unprotected sources, compared to 2.62% and 1.98% of patients belonging to Cluster 1 and Cluster 3, respectively.

Intake of non-vegetarian foods is invariably low in Cluster 3

Around 90% of the population in Cluster 3 had no intake of Egg (89.08%), fish (97.12%), chicken or meat (97.71%) whereas only less than 10% of the population in all the other 3 clusters had no intake of these non-vegetarian foods (Table 1). Though the Cluster 3 population had the highest daily intake of milk/curd (61.81%) and pulses/beans (50.31%) compared to the other clusters, other clusters also had an almost similar proportion of people taking milk/curd and pulses/beans daily. Intake of other foods like dark leafy vegetables, fruits, fried foods and aerated drinks showed similar distribution across all the clusters.


Rationale of the workflow in clustering epidemiological data

The clustering workflow used arises from some important observations that we will discuss here. To begin with, we have a population of 10,125 T2DM patients with a diverse ensemble of features accounting for information on medical history, dietary and addiction habits, socio-economic and lifestyle patterns. Moreover, the features in the considered dataset are also diverse in terms of data types. We have a total of 36 features, out of which 4 are continuous features, 7 nominal features and 25 ordinal features, all of equal importance by assumption.

The aim is to find significant sub-populations in our data such that the identified sub-populations are interpretable in terms of the considered features. Note here that, by significant sub-populations, we mean a sub-population consisting of at least 10 percent of the total population. If there exist such sub-populations and we can explain the sub-populations in terms of the considered features, we can argue that these patterns exist in a significant number of patients.

We have already argued in favour of using UMAP for our unsupervised approach to find clusters in the data. However, we observed that applying UMAP algorithm conventionally using the Euclidean similarity metric on our entire dataset with 36 features turns out to be ineffective. The reason is, in this case, the continuous features have an overpowering effect over the other feature types in determining the distribution of clusters. This can be observed in Fig. 2a, b. Note that Fig. 2a shows UMAP clustering with all 36 features and 2(b) shows UMAP clustering with only four continuous features. Note that, there is a similarity in the clustering distribution of these figures, each containing one major cluster and seven small minor clusters. We observed that this is because of the fact that UMAP, when applied to all 36 features of the dataset using the Euclidean similarity measure is largely biased towards finding similarity among data points only in terms of the continuous features. Given that we have only four continuous features out of 36, this poses a problem as the diverse information present in the dataset in the form of the ordinal and nominal features is largely ignored.

To solve this problem, the clustering of continuous, ordinal and nominal features was treated separately by using different similarity matrices for them, giving rise to our clustering paradigm. We argued on our choice of similarity measures in Section “Clustering paradigm using UMAP”. This generates for each feature type a data representation of lower dimension shown in Fig. 2b–d. We finally integrated these lower dimension data representations by taking two-dimensional representations for continuous and ordinal features and a one-dimensional representation (the one consisting of the most variance) for nominal features. The reason behind considering one-dimensional representation for nominal features is that using Hamming metrics for such data results in retaining a lot of variance in the data resulting in multiple clusters as we observe in Fig. 2d. Considering a two-dimensional representation for this data while integrating these lower dimension data representations carry forward this variance and result in multiple small clusters in the final clustering distribution, which contradicts our aim of finding significantly large sub-populations (of at least 10 percent of the total population).

Finally, the integration is done by applying UMAP on the five-dimensional reduced representation of the dataset using the Euclidean similarity measure (shown in Fig. 3b). Note here that, in our final clusters, we can observe patterns in all of the continuous, ordinal and nominal data types. For example, in Cluster 4 the continuous feature ‘Time to Water source (min)’ shows very high values compared to other clusters. In Clusters 1 and 3, the nominal feature ‘Cooking fuel used’ shows a higher percentage for Gas/Oil users while in Clusters 2 and 4 the same feature shows a higher percentage for plant-based fuel users. In Cluster 3, the ordinal feature ‘Fish intake frequency’ shows 97% of people to be never consuming fish. Thus, we infer that our clustering paradigm enables us to find significant sub-populations while keeping the clustering distribution unbiased, that is no feature type continuous, ordinal and nominal has an overpowering effect on the other.

Significance of T2DM clusters

T2DM was identified as a homogeneous disease with Insulin Resistance followed by β-cell dysfunction being the underlying pathology. However, recent studies have explored and found T2DM to be a heterogeneous entity with the relative contribution of Insulin Resistance and β-cell dysfunction to differ across T2DM clusters [3]. These studies were performed on clinical and biochemical data with variables having uniform data types. On the other hand, our clustering approach takes into account the diverse data types obtained from an epidemiological dataset and discovers clusters among the T2DM population. Interestingly, two of the four clusters obtained in our study belonged to the non-obese T2DM phenotype where the mean BMI was below 25. These two non-obese clusters also had lower mean age compared to the other clusters. Both these non-obese clusters had a larger proportion of rural residents and a lower proportion of people belonging to the highest wealth quintile concluding to the fact that a large majority of T2DM people from rural India have lower BMI and are younger in age. The T2DM patient sub-population belonging to these clusters has a relatively lower quality of life judging by analysis of the lifestyle pattern-based features. The non-obese phenotype of T2DM has been increasingly reported over the last two decades raising concern about the uniqueness of its underlying pathophysiology with a greater contribution of β-cell dysfunction compared to Insulin Resistance [25,26,27,28]. This non-obese T2DM phenotype has been found among Asians and studies depicting and investigating its similarities and differences have been in place. Studies have concluded that T2DM occurs among Asians at a lower BMI cut-off and also at a younger age [29, 30]. This finding of two non-obese clusters with lower mean age provides confirmation to this. Among the studies aimed to identify T2DM subtypes, the two subtypes severe insulin deficient diabetes (SIDD) and mild-age-related diabetes (MARD) were found to be common [3, 4]. Both cluster 2 and cluster 4 in our study seem to have similarities to the SIDD subtype though the reasons behind obtaining two different epidemiological clusters within the SIDD subtype need further investigation. As our dataset had patients below 49 years of age, we couldn’t obtain any cluster that may be compared to the mild-age-related diabetes (MARD) subgroup. The remaining two clusters in our study, cluster 1 and cluster 3 are both obese groups and therefore may be the epidemiological counterparts of either mild obesity-related diabetes (MOD) or severe insulin-resistant diabetes (SIRD). Hence, the T2DM clusters obtained from epidemiological data provide further strength to the clinical T2DM subtypes and raise the need to further investigate the epidemiological risk factors of T2DM subtypes.

Though non-obese T2DM is being considered a unique phenotype, epidemiological studies for identifying high-risk population groups still remain undone. This is especially important for many Asian countries where over half of the T2DM population is of non-obese phenotype [25]. This analysis, reporting an increased presence of Rural residents in both the non-obese T2DM clusters, calls for a modification in BMI and Age cut-off for T2DM screening among rural residents. However, identification of risk factors for T2DM specific to the rural population needs to be done. Representation of people from the highest wealth quintile was much lower in both the non-obese T2DM clusters. T2DM is a multifactorial disease requiring strict compliance to lifestyle modification, proper diet and antidiabetic therapy. Non-obese T2DM clusters with reduced representation from the highest wealth quintile suggest the possibility of unequal access to care for non-obese T2DM people thereby generating the need for a more equitable healthcare policy in terms of prevention and therapy.

On the other hand, both the obese T2DM clusters had higher ages and more urban residents. The proportion of people from the highest wealth quintile was higher in both the obese clusters. Interestingly one of the obese clusters (Cluster 3) had invariably a low intake of non-vegetarian foods (egg, fish, chicken and meat) pointing out the fact this T2DM cluster comprised vegetarian people mainly. Dietary requirements in diagnosed T2DM patients involve a reduced amount of carbohydrates and fats with an increased amount of protein-rich foods [31]. Animal products, being rich sources of dietary protein, need to be included in the diet. One of the obese T2DM clusters with a strict vegetarian dietary pattern suggests the need to design proper dietary guidelines for this group.

As already mentioned, T2DM is a multifactorial disease with socio-economic inequality suggested to play an important role in the pathology and management of the disease [32]. Studies have identified socio-economic inequalities and allostatic load to associate with T2DM [33]. The negative effect of social stress, uncertainty and poor nutrition [34] seems to be more relevant for clusters 2 and 4 in this study where individuals majorly belong to a weaker socio-economic class. Though this study doesn’t have the data to investigate this association, the possibility of obtaining T2DM subtypes based on the allostatic load seems to be a promising area of diabetes research. Designing a customized healthy diet and lifestyle plan for certain T2DM subtypes with a view to reducing the allostatic load may emerge as an important strategy in T2DM management.


From a data science perspective, this analysis addresses the issue of diverse data types. We have shown that for such data conventional application of dimension reduction approaches might not be fruitful. We develop a workflow that contributes to finding meaningful and interpretable clusters such that the distribution of clusters is not biased by the data types.

The existence of a significant non-obese T2DM patient sub-population belonging to a younger age group and having larger proportions of rural residents raise a lower quality of life, indicating the need for different screening criteria for T2DM among rural Indian residents. The obese T2DM cluster with around 90% of people sticking to the non-vegetarian diet calls for the need for dietary guidelines for T2DM patients having a non-vegetarian dietary pattern.