Nationwide prediction of type 2 diabetes comorbidities

Identifying individuals at risk of developing disease comorbidities is an important task in tackling the growing personal and societal burdens associated with chronic diseases. We employed machine learning techniques to investigate to what extent data from longitudinal, nationwide Danish health registers can be used to predict individuals at high risk of developing type 2 diabetes (T2D) comorbidities. Using logistic regression, random forest and gradient boosting models and register data spanning hospitalizations, drug prescriptions and contacts with primary care contractors from >200,000 individuals newly diagnosed with T2D, we predicted five-year risk of heart failure (HF), myocardial infarction (MI), stroke (ST), cardiovascular disease (CVD) and chronic kidney disease (CKD). For HF, MI, CVD, and CKD, register-based models outperformed a reference model based on canonical individual characteristics, improving the area under the receiver operating characteristic curve by 0.06, 0.03, 0.04, and 0.07, respectively. The top 1,000 patients predicted to be at highest risk exhibited observed incidence ratios exceeding 4.99, 3.52, 1.97 and 4.71, respectively. In summary, prediction of T2D comorbidities using Danish registers led to consistent, albeit modest, performance improvements over reference models, suggesting that register data could be leveraged to systematically identify individuals at risk of developing disease comorbidities.

Histogram of lengths of hospital admissions during which individuals received their first hospital diagnosis of T2D. Individuals with a prior T2D prescription-based diagnosis were excluded. The hospital admissions were limited to those lasting less than 100 days. T2D, type 2 diabetes.

Overview of study population characteristics among all newly diagnosed type 2 diabetics (T2D population) and two chronic kidney disease comorbidity populations with buffer period set to 30 and 60 days respectively. RFV, register feature vector; #, number of; BPL, buffer period length.

Table 2. Comparison of AUROC measures for each prediction model's best parameterization between chronic kidney disease comorbidity populations with the buffer period set to 30 and 60 days, respectively. We applied a reference model and three register-based models to fifteen years of health register data comprising hospital diagnoses, hospital procedures, drug prescriptions and interactions with primary care contractors to predict five-year risk of chronic kidney disease comorbidity. For each comorbidity, prediction was performed on a T2D population free of that comorbidity at the date of prediction (the date of an individual's first T2D diagnosis). The reference model was a logistic ridge regression based on canonical features: age, sex, country or region of birth and date of first T2D diagnosis, as well as their interactions. The register-based models were logistic ridge regression, random forest and gradient boosting based on the canonical features as well as hospital diagnoses, hospital procedures, drug prescriptions and interactions with primary care extracted from Danish health registers. Incidences are proportions of cases within each comorbidity's sub-population at the end of the prediction horizon. Value ranges in brackets represent 95% confidence intervals based on bootstrap sampling. AUROC, area under the receiver operating characteristic curve.

Combinations of the listed hyperparameters were tested to identify those that led to the best average AUROC using 3-fold cross-validation on the training set. Parameter names listed correspond to the naming in the software implementation (Python modules scikit-learn for LR and RF and xgboost for GB). Due to class imbalance (the difference between the number of cases and non-cases in each population), the training error for the class representing cases was scaled up by the inverse proportion between cases and non-cases.
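The AUROC values with bootstrap-based 95% confidence intervals mentioned in the caption above can be illustrated with a minimal, standard-library-only sketch. The function names `auroc` and `bootstrap_auroc_ci` are hypothetical, and the percentile bootstrap with 1,000 resamples is an assumption; the caption does not state which bootstrap variant or how many resamples were used.

```python
import random


def auroc(y_true, y_score):
    """AUROC via the rank-sum (Mann-Whitney U) formulation, using midranks for ties."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    n_pos = sum(y for _, y in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one case and one non-case")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_pos += midrank * sum(y for _, y in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed variant, see lead-in)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [y_true[k] for k in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc(ys, [y_score[k] for k in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy data for illustration only; not taken from the study.
labels = [0, 0, 1, 1] * 25
scores = [0.1, 0.4, 0.35, 0.8] * 25
point_estimate = auroc(labels, scores)
interval = bootstrap_auroc_ci(labels, scores)
```

Resampling individuals rather than predictions keeps each label paired with its score, which is what lets the interval reflect sampling variability in the evaluated population.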

Supplementary Figure 2. Validation set calibration curves of uncalibrated best models for each comorbidity and all-cause mortality with date of prediction set to an individual's first type 2 diabetes diagnosis.

Supplementary Figure 3. Validation set calibration error (average difference between observed and predicted outcome probabilities for each predicted probability percentile) of uncalibrated best models for each comorbidity and all-cause mortality with date of prediction set to an individual's first type 2 diabetes diagnosis.
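The calibration error described in this caption can be sketched as follows, under stated assumptions: individuals are sorted by predicted probability, split into equally sized percentile bins, and the error in each bin is the observed case proportion minus the mean predicted probability. The function name `calibration_error_by_quantile` and the sign convention (observed minus predicted) are hypothetical; the caption does not specify them.

```python
def calibration_error_by_quantile(y_true, y_prob, n_bins=100):
    """Per-bin difference between observed incidence and mean predicted probability.

    Bins are formed from the ranking of predicted probabilities, so with
    n_bins=100 each bin corresponds to one predicted-probability percentile.
    """
    order = sorted(range(len(y_prob)), key=lambda k: y_prob[k])
    n = len(order)
    n_bins = min(n_bins, n)  # guarantees every bin holds at least one individual
    errors = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        observed = sum(y_true[k] for k in members) / len(members)
        predicted = sum(y_prob[k] for k in members) / len(members)
        errors.append(observed - predicted)  # assumed sign convention
    return errors


# Toy example: predictions of 0.25 for a group with no cases over-predict by 0.25.
toy_errors = calibration_error_by_quantile([0, 0, 0, 0], [0.25] * 4, n_bins=2)
```

A perfectly calibrated model would give errors near zero in every bin; systematic positive or negative values indicate under- or over-prediction in that risk range.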

Supplementary Figure 4. Validation set calibration curves of calibrated best models for each comorbidity and all-cause mortality with date of prediction set to an individual's first type 2 diabetes diagnosis. Each best model was calibrated using the Platt scaling method on the test set.
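Platt scaling, as referenced in this caption, fits a sigmoid that maps raw model scores to probabilities. The following is a minimal sketch rather than the authors' implementation: it fits the two sigmoid parameters by Newton's method on the logistic log-likelihood and omits the target smoothing from Platt's original formulation. The names `fit_platt` and `apply_platt` are hypothetical. In the setting described above, the parameters would be fitted on test set scores and then applied to validation set scores.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, n_iter=50, ridge=1e-9):
    """Fit p = sigmoid(a * score + b) by Newton's method (no target smoothing)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, y in zip(scores, labels):
            p = _sigmoid(a * s + b)
            w = p * (1.0 - p)
            g_a += (p - y) * s   # gradient of the negative log-likelihood
            g_b += (p - y)
            h_aa += w * s * s    # entries of the 2x2 Hessian
            h_ab += w * s
            h_bb += w
        h_aa += ridge            # tiny ridge keeps the Hessian invertible
        h_bb += ridge
        det = h_aa * h_bb - h_ab * h_ab
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a -= step_a
        b -= step_b
        if abs(step_a) + abs(step_b) < 1e-10:
            break
    return a, b


def apply_platt(scores, a, b):
    """Map raw scores to calibrated probabilities with fitted parameters."""
    return [_sigmoid(a * s + b) for s in scores]


# Toy, non-separable scores for illustration only; not taken from the study.
toy_scores = [-2, -1, -1, 0, 0, 1, 1, 2]
toy_labels = [0, 0, 1, 0, 1, 0, 1, 1]
a_hat, b_hat = fit_platt(toy_scores, toy_labels)
calibrated = apply_platt(toy_scores, a_hat, b_hat)
```

Because the mapping is monotone in the score, this kind of calibration changes the predicted probabilities without changing the ranking of individuals, so AUROC is unaffected.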

Supplementary Figure 5. Validation set calibration error (average difference between observed and predicted outcome probabilities for each predicted probability percentile) of calibrated best models for each comorbidity and all-cause mortality with date of prediction set to an individual's first type 2 diabetes diagnosis. Each best model was calibrated using the Platt scaling method on the test set.

Supplementary Figure 6. (a) Five-year incidence of hospital diagnosis of stroke for population percentiles ranked by risk as predicted by the best gradient boosting (blue) and the best baseline (orange) models. (b) Individuals were ranked according to their predicted risk of stroke by the best gradient boosting (blue) and the best baseline (orange) models. For a number of thresholds, risk ratios are shown, calculated as the stroke incidence among individuals ranking above that threshold divided by the stroke incidence in the entire study population. 95% confidence intervals (shaded areas) were obtained through bootstrap sampling. (c) The 50 most predictive features for stroke according to the best gradient boosting model's feature importances.

Figure 7. Top seven most predictive gradient boosting features. For each comorbidity and ACM, the top seven features according to gradient boosting feature importance are shown. Feature importance is an estimate of a feature's relative contribution to outcome prediction. Box plots show the distribution of a given continuous feature (e.g. age, an interaction between age and sex) among cases and non-cases within the validation set. Box plot whiskers represent the lowest and highest observations still within 1.5 times the interquartile range. To comply with Danish data protection rules, these values, as well as the values representing the 25th, 50th and 75th percentiles, were obtained by averaging the five closest observations. Bar plots, describing count-based features (e.g. count of a given drug prescription, count of a given diagnosis), show the proportion of validation set cases and non-cases with at least a single observation of that feature.
HF, heart failure; MI, myocardial infarction; ST, stroke; CVD, cardiovascular disease; CKD, chronic kidney disease; ACM, all-cause mortality; D, diagnosis of; P, prescription of; MO, modulators of; STE, ST elevation; RA, renin-angiotensin; ODGRPIS, other disorders of glucose regulation and pancreatic internal secretion; RIPLMD, HMG-CoA reductase inhibitors and plain lipid modifying drugs.
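The risk ratios described for Supplementary Figure 6 (b), and the incidence ratios reported for the top 1,000 patients, compare the incidence among the highest-ranked individuals with the incidence in the whole study population. A minimal sketch of that calculation follows; the function name `top_k_risk_ratio` is hypothetical, and ties in predicted risk are broken by input order, which is an assumption.

```python
def top_k_risk_ratio(y_true, y_risk, k):
    """Incidence among the k highest-risk individuals divided by overall incidence."""
    n = len(y_true)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the population size")
    overall_incidence = sum(y_true) / n
    if overall_incidence == 0:
        raise ValueError("risk ratio is undefined when the population has no cases")
    # Rank individuals from highest to lowest predicted risk.
    order = sorted(range(n), key=lambda i: y_risk[i], reverse=True)
    top_incidence = sum(y_true[i] for i in order[:k]) / k
    return top_incidence / overall_incidence


# Toy data for illustration only; not taken from the study.
toy_outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
toy_risks = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ratio_top2 = top_k_risk_ratio(toy_outcomes, toy_risks, k=2)
ratio_all = top_k_risk_ratio(toy_outcomes, toy_risks, k=len(toy_outcomes))
```

By construction the ratio equals 1 when the threshold includes the entire population, so values above 1 at small k indicate that the model concentrates cases among the individuals it ranks highest.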