Introduction

Diabetes mellitus, cancer, heart disease and mental disease are major parts of disease burden in the world. The global diabetes prevalence is predicted to rise from 425 to 693 million during 2017–20451. Cancer, the second cause of global mortality in 2018, was estimated to be responsible for 9.6 million deaths globally2. Cardiovascular disease constituted the greatest part (32%) of global mortality in 2013, that is, 17 million of 54 million deaths3. Moreover, mental disease including depression is a major public health problem. For example, depression is a leading cause of disability in the world, affecting more than 350 million globally4. This global trend agrees with its Korean counterpart. Cancer, heart disease, suicide and diabetes were the first, second, fifth and sixth causes of death in Korea for the year 2018, i.e., 154.3, 62.4, 26.6 and 17.1 per 100,000, respectively5. Cancer (1681), unipolar depression (1508), ischemic heart disease (562) and diabetes (275) were the first, second, third and ninth causes of disability-adjusted life years per 100,000 in the nation for the year 20106.

Then, do these diseases have a strong comorbidity (or association), i.e., diabetes, cancer, heart disease and mental disease? If so, what determines the comorbidity? A couple of previous studies examined these issues, even though they used different variables from this study. For example, one previous study reports that family support (children alive, marriage), socioeconomic status (education, income) and social activity (friendship activity) are major determinants of comorbidity among cerebrovascular disease, hearing loss and cognitive impairment in a middle-aged or old population in Korea and that comorbidity among the three diseases is very strong in the middle-aged or old7,8. Likewise, another previous study reports that family support (brothers/sisters cohabiting, parents alive), socioeconomic status (income) and social activity (voluntary activity, family activity, leisure activity, friendship meeting) are major determinants of comorbidity among diabetes mellitus, visual impairment and hearing loss in a middle-aged or old population in the nation8,9. These previous studies used Korean Longitudinal Study of Ageing (2014–2016) data and artificial intelligence models (the artificial/recurrent neural network) for testing the hypotheses on the comorbidity of the diseases and its major determinants.

This study is an extension of the framework above to comorbidity among diabetes, cancer, heart disease and mental disease in a middle-aged or old population. Moreover, we introduced the Shapley additive explanations (SHAP) values to analyze the direction of association between a major determinant and the comorbidity in the prediction model. To our best knowledge, this is one of the earliest endeavors to adopt a cutting-edge method of explainable artificial intelligence. This study is expected to have global implications, given that cancer, ischemic heart disease, depressive disorders and diabetes were top-4 causes of death or disability in the world for 2017–20181,2,3,4,10. In this context, this study tests the following hypotheses from the literature above, considering three pairs of diabetes and its comorbid condition (diabetes-cancer, diabetes-heart disease and diabetes-mental disease):

Hypothesis 1

The comorbidity of diabetes and its comorbid condition is very strong in the middle-aged or old.

Hypothesis 2

Major determinants of the comorbidity are similar for different pairs of diabetes and its comorbid condition.

Results

Descriptive statistics for participants’ categorical and continuous variables are shown in Tables S1 and S2 (Supplementary Tables), respectively. Among the 5527 participants in 2018, 1532 (27.7%) were diagnosed as diabetes and/or cancer, 1621 (29.3%) as diabetes and/or heart disease, and 1461 (26.4%) as diabetes and/or mental disease. Among the same participants in 2016, 1138 (20.6%), 344 (6.2%), 553 (10.0%) and 273 (4.9%) were characterized by the diagnosis of diabetes, cancer, heart disease and mental disease, respectively. On average, the age, monthly income and body mass index of the participant were 71, $1205 and 24, respectively. According to Table 1, the random forest, the recurrent neural network and logistic regression were the best models in terms of accuracy and the area under the receiver-operating-characteristic curve. Their accuracy measures were 0.9725, 0.9688 and 0.9720 for diabetes-cancer, 0.9776, 0.9703 and 0.9776 for diabetes-heart disease and 0.9797, 0.9711 and 0.9797 for diabetes-mental health, respectively. Likewise, their areas under the receiver-operating-characteristic curves were 0.9625, 0.9630 and 0.9625 for diabetes-cancer, 0.9725, 0.9747 and 0.9725 for diabetes-heart disease, and 0.9775, 0.9779 and 0.9800 for diabetes-mental health, respectively.

Table 1 Model performance.

Based on variable importance from the random forest (Table 2, Fig. 1), diabetes and its comorbid condition in 2016 were top-2 determinants of the comorbidity in 2018 (This supports the hypothesis 1). Top-10 determinants of the comorbidity in 2018 were the same for different pairs of diabetes and its comorbid condition: body mass index, income, age, life satisfaction—health, life satisfaction—economic, life satisfaction—overall, subjective health and children alive in 2016 (This supports the hypothesis 2). The logistic regression results (Table S3, a supplementary table) provide useful information about the effect of a determinant on the comorbidity. For example, let’s consider the odds of having diabetes only compared to the odds of having no disease in 2018 (“Diabetes-Cancer YN” Column). This odds is 100 times as high for those with diabetes in 2016 as for those without the disease in 2016. This odds will increase by 3% if body mass index in 2016 increases by one unit.

Table 2 Top-10 variables for comorbidity among different chronic diseases.
Figure 1
figure 1figure 1

Variable Importance from the Random Forest. Diabetes and its comorbid condition in 2016 were top-2 determinants of the comorbidity in 2018 (This supports the hypothesis 1). Top-10 determinants of the comorbidity in 2018 were the same for different pairs of diabetes and its comorbid condition: body mass index, income, age, life satisfaction—health, life satisfaction—economic, life satisfaction—overall, subjective health and children alive in 2016 (This supports the hypothesis 2).

SHAP values are shown in Table 3 and Figs. 2 and 3. Table 3 and Fig. 2 denote SHAP summary tables and plots, respectively. Figure 3 represents SHAP dependence plots. The SHAP value of a particular determinant for a particular observation measures a difference between what the model (the random forest) predicts for the probability of the comorbidity for the observation with and without the determinant. Indeed, the SHAP dependence plot reveals an interaction between two determinants regarding their effects on the probability prediction of the comorbidity. In Table 3 and Fig. 2, the SHAP values of body mass index (× 033) have the range of (− 0.09, 0.27), (− 0.10, 0.26) and (− 0.04, 0.27) for diabetes-cancer, diabetes-heart disease and diabetes-mental disease, respectively. There exists a strong positive association between body mass index and the comorbidity. Here, the probability of the comorbidity is expected to increase by 0.26–0.27 in case body mass index is included in the model. Based on Table 3 and Fig. 2, an association looks positive between life satisfaction overall (× 041) and the comorbidity as well. But Fig. 3a–c reveal the opposite pattern. In Fig. 3a–c, the SHAP values of life satisfaction overall (× 041) are (1) lower for those with high life satisfaction overall and (2) higher for those with diabetes (× 043) in red dots. Here, the probability of the comorbidity is expected to decrease by 0.02–0.03 in case life satisfaction overall is included in the model.

Table 3 Shapley additive explanations (SHAP) from the random forest.
Figure 2
figure 2

Shapley Additive Explanations (SHAP) Summary Plot from the Random Forest.: The SHAP value of a particular determinant for a particular observation measures a difference between what the model (the random forest) predicts for the probability of the comorbidity for the observation with and without the determinant. The SHAP values of body mass index (× 033) have the range of (− 0.09, 0.27), (− 0.10, 0.26) and (− 0.04, 0.27) for diabetes-cancer, diabetes-heart disease and diabetes-mental disease, respectively. There exists a strong positive association between body mass index and the comorbidity. Here, the probability of the comorbidity is expected to increase by 0.26–0.27 in case body mass index is included in the model.

Figure 3
figure 3

Shapley Additive Explanations (SHAP) Dependence Plot from the Random Forest: Life Satisfaction Overall with Diabetes. The SHAP dependence plot reveals an interaction between two determinants regarding their effects on the probability prediction of the comorbidity. The SHAP values of life satisfaction overall (× 041) are (1) lower for those with high life satisfaction overall and (2) higher for those with diabetes (× 043) in red dots. Here, the probability of the comorbidity is expected to decrease by 0.02–0.03 in case life satisfaction overall is included in the model.

Discussion

In summary, (1) the comorbidity of diabetes and its comorbid condition is very strong in the middle-aged or old and (2) major determinants of the comorbidity are similar for different pairs of diabetes and its comorbid condition (body mass index, income, age, life satisfaction—health, life satisfaction—economic, life satisfaction—overall, subjective health and children alive). Three pairs are considered, diabetes-cancer, diabetes-heart disease and diabetes-mental disease. A few previous studies investigated these issues albeit with different dependent and independent variables from this study. One previous study reported that family support (children alive, marriage), socioeconomic status (education, income) and social activity (friendship activity) are major determinants of association among cerebrovascular disease, hearing loss and cognitive impairment in a middle-aged or old population in Korea and that association among the three diseases is very strong in the middle-aged or old7. Likewise, another previous study suggested that family support (brothers/sisters cohabiting, parents alive), socioeconomic status (income) and social activity (voluntary activity, family activity, leisure activity, friendship meeting) are major determinants of association among diabetes mellitus, visual impairment and hearing loss in a middle-aged or old population in the nation9.

To our best knowledge, this study is the first artificial-intelligence study to compare major determinants across three pairs of diabetes and its comorbid conditions. The largest cohort data in this line of research was obtained from the KLoSA (2016–2018) for 5527 subjects aged 56 or more. The random forest and the recurrent neural network registered remarkable performance in terms of the area under the receiver operating characteristic curve within the range of 96.3–97.8. Furthermore, we calculated the SHAP values to identify the direction of association between a major determinant and the comorbidity in the prediction model (random forest). To our best knowledge, this is one of the earliest achievements to introduce a cutting-edge approach of explainable artificial intelligence.

Specifically, three comments are available in the context of existing literature. Firstly, the results of this study are consistent with previous studies on social determinants of chronic diseases, requesting due attention to socioeconomic status (income) and family support (children alive)7,9,11,12. For example, a review study reports that family support is likely to reduce morbidity and mortality by improving cardiovascular, neuroendocrine and immune functions besides promoting health behavior and mental status12. Secondly, it was also found in this study that the comorbidity of diabetes and its comorbid condition is very strong in a middle-aged or old population. These findings suggest that preventive measures for diabetes and its comorbid condition should become central policy. Thirdly, this study sheds new light on the importance of body mass index and life satisfaction in managing diabetes and its comorbid conditions across board. Several analyses of national surveys in the United States reported a positive association between body mass index and diabetes: the Study to Help Improve Early evaluation and management of risk factors Leading to Diabetes 199413; the National Health and Nutrition Examination Surveys 1999–200213; the National Health Interview Survey 1997–200414; the Medicare Current Beneficiary Survey 1991–201015; and a case–control study using electronic health records in the Middle Atlantic region of the United States during 2004–201116. However, no artificial-intelligence study was available to analyze a negative association of body mass index with three pairs of diabetes and its comorbid conditions. This study is the first investigation in this direction.

In a similar vein, a negative linkage was reported between life satisfaction and diabetes or mental disease in a few studies17,18,19,20. Moreover, several previous studies found that life satisfaction may played as a protective factor for cancer and heart disease21,22,23. To our best knowledge, however, no machine-learning examination has been available on the significance of life satisfaction in managing various pairs of diabetes and its comorbid conditions. Based on the findings of this study, life satisfaction is a major protective factor against diabetes and its comorbid conditions. Existing literature suggests two plausible mechanisms24,25,26. Firstly, life satisfaction has direct effects on behavioral and physiological systems24. It brings autonomic nervous system activation, which reduces the levels of the heart rate, blood pressure and stress-related hormones such as epinephrine and norepinephrine24. Moreover, life satisfaction causes hypothalamic–pituitary–adrenal (HPA) axis activation, which decreases cortisol and increases oxytocin and growth hormone: They play important roles in many physiological outcomes such as immune and inflammatory diseases24. Secondly, life satisfaction can aid in coping with stressors, thereby preventing unhealthy behavioral and physiological responses24,25. These indirect effects of life satisfaction are more apparent among those with greater social capital, i.e., greater social activity, bonding and network26. The discussion above requests due attention to the issue of life satisfaction, which should be a central part of clinical consultation for those with diabetes and its comorbid conditions. More comprehensive effort is need in this direction and this study would be a good starting point for further research.

However, this study had some limitations. Firstly, expanding the scope of this study to other chronic diseases and other determinants of comorbidity such as medication would add a great contribution to this line of research. Secondly, this study did not consider possible relationships or mediating effects among independent variables. Thirdly, sub-group analysis across age and gender (for example, men below 71, women below 71, men above 70, women above 70) would offer more insight on the major determinants of the comorbidity among the three diseases. Fourthly, the variables of life satisfaction and mental disease were measured based on single questions. This would be a good research topic to evaluate and improve their validity and reliability. Finally, different artificial intelligence methods would highlight different social determinants of chronic diseases but little study is available and more investigation is needed on this issue.

In conclusion, the comorbidity of diabetes and its comorbid condition is very strong in the middle-aged or old, and major determinants of the comorbidity are similar for different pairs of diabetes and its comorbid condition. Preventive measures for body mass index, socioeconomic status, life satisfaction and family support would be needed for the effective management of diabetes and its comorbid condition.

Methods

Participants and variables

The data source of this study was the Korean Longitudinal Study of Ageing (KLoSA) (2016–2018)8. This study did not require the approval of the ethics committee given that data were publicly available (https://survey.keis.or.kr/eng/klosa/klosa01.jsp) and de-identified. Among the 6618 participants, 1091 with missing values on any of three dependent variables and thirty one independent variables were deleted (piecewise deletion). The final sample of this study consisted of 5527 subjects aged 56 or more. The purpose of the KLoSA is to build a data source for the preparation of population aging in Korea. It is a good data source for artificial intelligence, given that its size is big and its quality is high enough for the great performance of artificial intelligence. Another desirable data source for artificial intelligence would be Korea National Health Insurance Service (KNHIS) data, which is designed to provide the socioeconomic qualification and medical utilization of all citizens in Korea (https://nhiss.nhis.or.kr/bd/ab/bdaba022eng.do). The KLoSA presents rich information on socioeconomic status and qualification of the old population, whereas the KNHIS data offers a variety of data on medical status and utilization of the entire population.

The KLoSA question on diabetes (or cancer/heart disease/mental disease) in 2016 and 2018 was “Since the last survey, have you ever been diagnosed by a doctor diabetes (or cancer/heart disease/mental disease)? 1. Yes. 5. No.” [C011 (or C016/C033/C043)]. The dependent variable, the comorbidity of diabetes and its comorbid condition in 2018, was divided into four categories: “0” for having no disease; “1” and “2” for having diabetes only and its comorbid condition only, respectively; and “3” for having both diseases. This study focuses on association among diseases as their comorbidity instead of complication. The independent variables were the following determinants in 20167,9: (1) diabetes (no, yes) and its comorbid condition (no, yes); (2) demographic information, i.e., age, gender; (3) family support including children alive, brothers/sisters cohabiting, parents alive (father & mother, father, mother, none), marital status (married, separated, divorced, widowed, unmarried); (4) socioeconomic conditions such as educational level (elementary school or below, junior high school, senior high school, college or above), income (monthly, normalized between 0 and 1), health insurance (Medicare, Medicaid), economic activity (employed, unemployed); (5) social activity (monthly frequency), that is, religious, friendship, leisure, family, voluntary, political; (6) health-related information, i.e., subjective health (very good, good, middle [neither good nor poor], poor, very poor), body mass index, smoker (non, former, current), drinker (non, former, current); and (7) other determinants including region (big urban, small urban, rural), religion (non, Protestant, Catholic, Buddhist, Won-Buddhist, other), residential type (apartment, other), subjective class (high-A, high-B, middle-A, middle-B, low-A, low-B), life satisfaction—health (0–100), life satisfaction—economic (0–100) and life satisfaction—overall (0–100). The English version of the KLoSA questionnaire (2016–2018) is given as a supplementary file in this article.

Analysis

Seven popular artificial intelligence approaches were compared for the prediction of the comorbidity: logistic regression, decision tree, naïve Bayes, random forest, support vector machine, artificial neural network, and recurrent neural network7. Data on 5527 participants were divided into training and validation sets with a 75:25 ratio (4145 vs. 1382 observations). Accuracy, a ratio of correct predictions among 1382 observations, was introduced as a criterion for validating the models trained. Variable importance from the random forest, an accuracy (or mean-impurity) gap between a complete model and a model excluding a certain variable, was used for testing the two hypotheses of this study. The evaluation of the hypothesis 1 was based on whether diabetes and its comorbid condition in 2016 were top-5 determinants of the comorbidity in 2018. The evaluation of the hypothesis 2 was based on whether top-10 determinants of the comorbidity in 2018 were similar for different pairs of diabetes and its comorbid condition. Finally, the SHAP values were calculated to analyze the direction of association between a major determinant and the comorbidity in the model (random forest). The SHAP value of a particular determinant for a particular observation measures a difference between what the model (the random forest) predicts for the probability of the comorbidity for the observation with and without the determinant.

In practice, experts in artificial intelligence use random forest variable importance to derive the rankings and values of all predictors for the prediction of the dependent variable. Then, they employ the SHAP plots to evaluate the directions of associations between the predictors and the dependent variable. Linear or logistic regression used to play this role before the SHAP approach took it over. This is because the SHAP approach has a notable strength compared to linear or logistic regression: the former considers all realistic scenarios, unlike the latter. Let us assume that there are three predictors of the comorbidity, i.e., diabetes, life satisfaction overall and age as in Fig. 3. As defined above, the SHAP value of diabetes for the comorbidity for a particular participant is the difference between what machine learning predicts for the probability of the comorbidity with and without diabetes for the participant. Here, the SHAP value for the participant is the average of the following four scenarios for the participant: (1) life satisfaction overall excluded, age excluded; (2) life satisfaction overall included, age excluded; (3) life satisfaction overall excluded, age included; and (4) life satisfaction overall included, age included. In other words, the SHAP value combines the results of all possible sub-group analyses, which are ignored in linear or logistic regression with an unrealistic assumption of ceteris paribus, i.e., “all the other variables staying constant”. Python 3.52 (Centrum voor Wiskunde en Informatica, Amsterdam, Netherlands) was employed for the analysis on November 2022.

Ethics approval and consent to participate

This study did not require either the approval of the ethics committee or the informed consent of human subjects given that (1) data were publicly available (https://survey.keis.or.kr/eng/klosa/klosa01.jsp) and (2) data were de-identified (patient anonymity was preserved).