Abstract
Type 2 diabetes mellitus (T2D) presents a major health and economic burden that could be alleviated with improved early prediction and intervention. While standard risk factors have shown good predictive performance, we show that the use of blood-based DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Previous studies have been largely constrained by linear assumptions, the use of cytosine–guanine pairs one-at-a-time and binary outcomes. We present a flexible approach (via an R package, MethylPipeR) based on a range of linear and tree-ensemble models that incorporate time-to-event data for prediction. Using the Generation Scotland cohort (training set ncases = 374, ncontrols = 9,461; test set ncases = 252, ncontrols = 4,526) our best-performing model (area under the receiver operating characteristic curve (AUC) = 0.872, area under the precision-recall curve (PRAUC) = 0.302) showed notable improvement in 10-year onset prediction beyond standard risk factors (AUC = 0.839, precision–recall AUC = 0.227). Replication was observed in the German-based KORA study (n = 1,451, ncases = 142, P = 1.6 × 10−5).
Similar content being viewed by others
Main
Diabetes mellitus is one of the most prevalent diseases in the world and a leading cause of mortality. Around half a billion people live with diabetes worldwide, with type 2 diabetes (T2D) making up about 90% of these cases1. Individuals with diabetes can suffer from debilitating complications including nerve damage, kidney disease and blindness2. The disease also increases the future risk of dementia and cardiovascular disease3, with recent studies highlighting obesity and T2D as risk factors for coronavirus disease 2019 (COVID-19) disease severity and intensive care unit admission4. Furthermore, the risk of complications increases over time and is exacerbated if blood glucose levels are poorly managed. Despite developments in the way T2D can be managed for patients, these treatments are reactive, focusing on patients that have already been diagnosed. Early intervention with metformin or lifestyle changes have been shown to delay the onset of T2D, although they do not reduce the risk of all-cause mortality5.
Beyond public health costs, T2D also presents a substantial financial burden to the National Health Service (NHS), with estimated annual spending of £10 billion on diabetes in the UK. Around 80% of these costs are for the treatment of complications, many of which are preventable with early intervention6.
While the mechanisms of insulin resistance in T2D are well known, the interactions between genetic and environmental factors that increase T2D susceptibility are less understood. Previous T2D risk prediction models have used a range of health risk factors7. However, these have not used epigenetic information. Epigenetics is the study of heritable changes to DNA that do not modify its nucleotide sequence. A commonly studied form of this is DNA methylation (DNAm), whereby methyl groups are attached to the DNA molecule, most commonly to the five-carbon on a cytosine in a cytosine–guanine pair (CpG). Due to its involvement with gene expression and gene environment interactions, DNAm can provide dynamic predictive information for disease risk for an individual. For example, epigenetic scores built via penalized regression models have been used to show that weighted linear CpG predictors can explain a substantial proportion of phenotypic variance (R2) of modifiable health factors including body mass index (BMI) (12.5%), high-density lipoprotein cholesterol (15.6%) and smoking status (60.9%)8. Blood-based DNAm is of particular interest in predictive modeling and biomarker development because of its comparatively noninvasive sampling procedure. Epigenetic scores have also shown the ability to explain that up to 58% of variance in plasma protein levels are associated with several incident diseases including T2D and several comorbidities9. Epigenome-wide association studies (EWAS) have identified a number of CpG sites significantly associated with T2D10,11,12,13,14 and related risk factors such as cardiovascular disease15 and obesity16,17. While these provide some predictive performance for T2D prevalence, incident T2D has been less well studied. One such EWAS with 563 cases and 701 controls identified 18 CpGs associated with incident T2D but did not consider any prediction models10. Given that preventive lifestyle changes have been shown to effectively reduce T2D onset18, prediction of T2D incidence years ahead of time would be greatly beneficial in stratifying populations so those at high risk can be monitored and treated with early interventions.
Currently, most studies generating DNAm predictors consider marginal CpG effects or assume only linear additive effects between CpGs. The use of predictive models that can incorporate both interaction and nonlinear effects could capture more complex relationships between variables, resulting in greater prediction accuracy. Therefore, our study aimed to evaluate both the additional predictive benefit that DNAm can provide for 10-year T2D risk and the applicability of linear and tree-ensemble survival models.
In this study, we use one of the world’s largest studies with paired genome-wide DNAm and data linkage to electronic health records, Generation Scotland (n = 14,613, n = 626 incident T2D cases over 15 years of follow-up), to develop and validate epigenetic scores for T2D. We show the added contribution of these epigenetic scores to prediction over and above standard risk factors, for example, age, sex and BMI and externally validate these results in the KORA S4 cohort.
Results
After preprocessing, the mean time-to-onset of T2D was 5.7 and 5.9 years for the training (n = 374 cases) and test (n = 252 cases) sets, respectively. Mean age at onset was also similar between the training and test set at 61.2 and 60.4 years and the mean BMI for cases (at baseline) was 31.7 and 32.2 kg m−2. The full set of cohort summary details for cases and controls in both sets are shown in Table 1. The preprocessing steps and sample sizes at each step are shown in Extended Data Fig. 1. The machine learning prediction pipeline of the MethylPipeR package is shown in Fig. 1.
Null model for the incremental modeling approach
A Cox proportional-hazards model in the test set with age, sex, BMI, self-reported hypertension and family history of diabetes as predictors yielded good classification metrics: area under the receiver operating characteristic curve (AUC) = 0.839, area under the precision-recall curve (PRAUC) = 0.227.
Incremental model using the direct epigenetic score
In the risk factors plus direct epigenetic score test set model, Cox proportional-hazards with least shrinkage and selection operator (LASSO) performed the best, showing an AUC and PRAUC of 0.870 and 0.299, respectively (P = 3.6 × 10−27 for the epigenetic score coefficient). This corresponds to an increase of 3.1% and 7.2% over the standard risk factor model.
Overall, the tree-ensemble models used for the direct epigenetic score resulted in lower performance compared to the Cox proportional-hazards LASSO. AUC values for random survival forests (RSF) and survival Bayesian additive regression trees (sBART) were 0.852 and 0.840 and the PRAUC values were 0.247 and 0.230, respectively (Supplementary Table 1). P values for the epigenetic score coefficients and receiver operating characteristic (ROC) curves for all models are given in Supplementary Table 2.
Incremental model using the composite protein epigenetic score
The composite protein epigenetic score (with 109 possible input protein epigenetic score features) derived using a Cox proportional-hazards LASSO model showed good performance with AUC and PRAUC of 0.864 and 0.270, respectively (epigenetic score coefficient P = 1.61 × 10−18). The increase in PRAUC was smaller for the composite protein epigenetic score compared to the direct epigenetic score but still a notable improvement over using risk factors only.
Incremental model using composite protein plus direct epigenetic scores
The full model (risk factors plus composite protein epigenetic score plus direct epigenetic score) with a Cox proportional-hazards LASSO direct epigenetic score gave an AUC and PRAUC of 0.872 and 0.302, respectively. The full models with RSF and sBART-derived direct epigenetic scores showed AUCs of 0.866 and 0.864, respectively. The corresponding PRAUC values were 0.273 and 0.270. Therefore, the best overall model used direct and composite protein epigenetic scores from Cox proportional-hazards LASSO models. The ROC and precision–recall curves for the full models and risk factor-only model are shown in Fig. 2. We also examined if our findings were robust to potential lag effects in T2D diagnosis19. Increases to both the AUC and PRAUC were observed when adding the epigenetic scores to a risk factor-only model after excluding cases diagnosed within the first 2 years of follow-up (Supplementary Table 3).
Binary classification metrics and model calibration
Supplementary Table 4 shows how confusion matrix metrics vary for the null (risk factor-only) model and the Cox proportional-hazards LASSO model across a range of probability classification thresholds. As expected, as the classification probability threshold is increased, sensitivity and negative predictive value decrease while specificity increases. The effects of these differences on the number of true positives and false negatives are illustrated in Fig. 3. The two models also show differences in their calibration plots (Extended Data Fig. 2). In addition, the difference in the number of correctly classified individuals between the two models are given. These were calculated assuming, arbitrarily, a 10-year incidence rate of 33%, for example, in a scenario where high-risk individuals have been selected for screening.
Selected CpGs
The Cox proportional-hazards LASSO model assigned nonzero coefficients to 145 CpGs (Supplementary Table 5). After filtering the EWAS Catalog by P value (P < 3.6 × 10−8)20 and sample size (n > 1,000), 119 (82%) of the model-selected CpGs were present. These CpGs corresponded to 742 entries and showed epigenome-wide associations with traits including serum high-density lipoprotein cholesterol, serum triglycerides, smoking, C-reactive protein, BMI and age (Supplementary Table 6).
Selected protein epigenetic scores
The composite protein epigenetic score Cox proportional-hazards LASSO model assigned nonzero coefficients to 46 protein epigenetic scores. Details on the corresponding proteins and genes are given in Supplementary Table 7. Out of the selected protein epigenetic scores, 21 previously showed associations with incident T2D9.
Validation in the KORA S4 cohort
Prediction of incident diabetes in the KORA S4 cohort using the Cox proportional-hazards LASSO model showed good replication of direct epigenetic score performance (P = 1.6 × 10−5) with increases of 1.6% in absolute terms above the null model values for both AUC and PRAUC. Further details are provided in Supplementary Table 8.
Epigenetic score prediction of COVID-19 outcomes
In all models, incident T2D was predictive of hospitalization with COVID-19 infection. However, neither the composite protein nor the direct epigenetic score were predictive of the same outcome (Supplementary Table 9). Additionally, neither the (direct or protein-based) epigenetic scores nor incident T2D were predictive of ongoing symptomatic COVID-19 after COVID-19 infection.
Discussion
Using a large cohort with genome-wide epigenetic data and health records linkage to longitudinal primary and secondary care data, we showed that DNAm-based predictors augment standard risk factors in the prediction of incident T2D. The best model with traditional risk factors yielded an AUC of 0.839 compared to 0.872 when DNAm was also considered and the PRAUC increased from 0.227 to 0.305. Using several linear and nonlinear survival models, we showed that overall, the Cox proportional-hazards LASSO model produced the most predictive direct epigenetic score. A composite protein epigenetic score also notably increased predictive performance. The direct epigenetic score also showed good external validation performance in the KORA S4 cohort. Beyond the T2D analysis presented in this study, we developed the MethylPipeR R package, along with a user interface version MethylPipeR-UI (shown in Extended Data Fig. 3), to facilitate reproducible machine learning time-to-event (TTE) and binary prediction using DNAm or other types of high-dimensional omics data.
Determining a ‘best’ model is complicated and depends on the trade-off that a user wishes to make. In this study, we optimized AUC and PRAUC but binary classification metrics vary by method or classification threshold. When using classifiers in clinical settings, decisions need to be made about the number of patients that can be recommended for intervention and the acceptable proportion of false positives and false negatives. We showed an increase in the correct identification of positives and negatives at varying probability thresholds when adding direct and composite epigenetic scores above standard risk factors. For instance, given an (arbitrary) incidence rate of 33% (commonly used as a cutoff for high risk of T2D)21 over 10 years in a sample of 10,000 individuals, our best model would correctly classify an additional 449 individuals compared to the risk factor-only model at a threshold of 0.2 (Supplementary Table 4). Given the costs of treating T2D-related complications, our study gives evidence for possible benefits of epigenetic scores on public health that could also alleviate the financial burden to the NHS. In addition, an assessment of calibration is also critical22,23. Investigation of these related criteria could assist in deciding an optimal threshold given clinical constraints and provide a more comprehensive assessment of model predictions than AUCs or metrics at the commonly used threshold of 0.5.
Several CpGs from the direct epigenetic score were previously identified as epigenome-wide significant correlates of traits commonly linked to T2D14,17,24,25,26,27,28. Future work could investigate the overlap between these and TTE EWAS studies. Further studies could also include DNAm predictors for traditional risk factors that are included in the null model, such as BMI8.
Limitations include the relatively small number of disease cases in the dataset, the limited hyperparameter optimization performed for sBART and the relatively simple variable preselection method for tree-ensemble methods. Given the lower performance of these methods compared to the best models in this study, there is potential for additional improvement in predictive performance with further investigation of more advanced feature (pre)selection. This is particularly important when we consider that the preselection step used linear models before the nonlinear model fitting. Model fitting and preselection were also performed using the same training set, which may have introduced issues associated to post-selection inference29,30. In addition, factors such as overfitting, related individuals in the test set and batch effects between the three rounds of DNAm data processing may all have had an effect on the test set AUC. Future studies may also take into account multimorbidities because the presence of competing risks can lead to bias in onset predictions31. Finally, a small proportion of the linkage codes used to define diabetes included broad terms that were nonspecific to T2D; however, the late age of onset in these individuals meant that there was a high likelihood that they had developed T2D. Epigenetic scores for T2D-associated proteins have also been shown to replicate incident T2D–protein associations in this sample9, suggesting that the case definitions we used captured biological signals relevant to T2D.
There are many strengths to our study. First, the models used captured relationships between CpGs and TTE information, which is not possible using traditional EWAS methods. Second, data linkage to health care measures provided comprehensive T2D incidence data in a very large cohort study, that is, Generation Scotland. Validation performance in the KORA S4 cohort also strengthened evidence for the applicability of the models to other populations. Finally, the R package MethylPipeR encourages reproducibility and allows others to develop similar predictors on new data with minimal setup.
In conclusion, we have demonstrated the potential for DNAm data to provide notable improvement in predictive performance for incident T2D, compared to traditional risk factors (age, sex, BMI, hypertension and family history). We evaluated different models with a systematic approach and presented a framework with the ability to generalize to other traits and datasets for training and testing predictors in future studies.
Methods
Statistics and reproducibility
To enhance reproducibility, the analysis pipeline and results presented in this study have been reported using the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) checklist32 (Supplementary File 1). Sample sizes for training, testing and validation of the statistical models were determined by the cohort sizes. Exclusions were made based on DNAm quality control and missingness in health data. Randomization and blinding were not applicable during data collection because the study used observational data from population-based cohorts. Statistical tests for model coefficients assumed a normal distribution but this was not formally tested.
Generation Scotland
Blood-based DNAm and linked health data were obtained from Generation Scotland33, a family-structured, population-based cohort. The cohort consists of 23,960 volunteers across Scotland aged 18–99 years at recruitment (between 2006 and 2011), of whom over 18,000 currently have genome-wide DNAm data available (Illumina EPIC array). In DNAm quality control, CpG sites were filtered by removing those with a low bead count in 5% or more of the samples or a high detection P value (>0.05) in more than 5% of samples. Probes on the X and Y chromosomes were also removed. Samples were filtered by removing those with a mismatch between predicted and recorded sex or 1% or more of CpGs with a detection P > 0.05. Missing CpG values were mean-imputed. To enable the predictors to be applied to existing cohort studies with older Illumina array data, CpGs were filtered to the intersection of the 450k and EPIC array sites (n = 453,093 CpGs).
This study considered DNAm data from three large subsets of the Generation Scotland cohort, with 5,087 (set 1), 4,450 (set 2) and 8,877 (set 3) individuals. Processing took place in 2017, 2019 and 2021, respectively. Set 1 and set 3 included related individuals within and between each set while all individuals in set 2 were unrelated to each other and to individuals in set 1 (genetic relationship matrix threshold < 0.05). In our experiments, the training set consisted of sets 2 and 3 combined; set 1 was used as the test set. To avoid the presence of families with individuals across both training and test sets, any individuals in the training set from the same family as an individual in the test set were excluded from the analysis (nexcluded = 3,138).
Participant health measures including age, BMI, sex, self-reported hypertension and family (parent or sibling) history of T2D were taken at baseline (DNAm sampling) via questionnaire. BMI was calculated as the individual’s weight in kilograms divided by the square of their height in meters. Missing values in the set 1 health measures were treated as missing completely at random and the corresponding individuals were excluded (nset 1 = 99). This was not performed in sets 2 and 3 because the health measures were used for incremental modeling (set 1 only).
Disease cases were ascertained through data linkage to NHS Scotland health records consisting of hospital (International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) codes) and GP records (Read2 codes). Prevalent cases were identified from a baseline questionnaire (self-reported) or from ICD-10/Read2 codes dated before baseline and removed from the dataset. Type 1 and juvenile cases were treated as control observations. A total of 757 incident cases were observed over the follow-up period (from the recruitment date to January 2022); after preprocessing, 626 cases remained. Mean time to T2D onset was 5.9, 5.4 and 6.0 years for sets 1, 2 and 3, respectively, with ranges of 0.2–14.8 (set 1), 0.2–14.8 (set 2) and 0.1–14.8 (set 3) years. In GP record-derived cases, 81% of cases had a C10F ‘type 2 diabetes mellitus’ code; 12% had a C10 ‘diabetes mellitus’ code and 4% had a C109 ‘non-insulin dependent diabetes mellitus’ code. The full list of included and excluded terms are given in Supplementary Table 10.
Composite protein epigenetic score
A composite protein epigenetic score model for incident T2D was trained using a set of 109 protein epigenetic scores previously shown to associate with a range of incident diseases9. For each protein, the epigenetic score was calculated for each individual in the training and test sets. A Cox proportional-hazards LASSO model was fitted to the training set with the 109 protein epigenetic scores (scaled within set to a mean of 0 and variance of 1) as features. The linear predictor from the Cox proportional-hazards LASSO model was then used as the composite protein epigenetic score in the incremental test set model.
Direct epigenetic score
The direct epigenetic score Cox proportional-hazards LASSO model for incident T2D was fitted to the DNAm data in the training set. Due to memory limitations in the model fitting R package (glmnet34), the CpGs were filtered to the 200,000 sites with the highest variance. The linear predictor from the Cox proportional-hazards LASSO model was then used as the direct epigenetic score in the incremental test set modeling. For the tree-ensemble models, the Cox proportional-hazards LASSO-selected CpGs were used as input and the 10-year onset risk was subsequently used as the direct epigenetic score.
Outcome definition for the 10-year onset prediction
The link to NHS Scotland health records provided dates for disease diagnoses from which age at onset was calculated. Along with age at baseline (DNAm sampling), these were used to calculate the TTE, measured in years, for each individual. For incident T2D cases and controls, TTE was defined as the time from baseline to disease onset and censoring, respectively. Controls were censored at the latest date of available GP records (September 2020). In addition, controls who died during the follow-up period were censored at time of death.
While all models were trained as survival models, our primary prediction outcome was incident T2D diagnosis within 10 years. Therefore, predictions on the test set were calculated using the 10-year onset probability (one minus survival probability). When calculating the binary outcome metrics, cases with TTE > 10 were treated as controls. These metrics included confusion matrices, AUC and PRAUC. After preprocessing and case thresholding (TTE > 10), there were 218 cases and 4,560 controls in the test set.
The numbers of individuals, cases and controls after each preprocessing step are shown in Extended Data Fig. 1.
Incremental modeling
Composite protein epigenetic scores and direct epigenetic scores were generated in the training dataset using different machine learning methods with the MethylPipeR package (Fig. 1), before being applied to the test set using an incremental modeling approach (further detail in the Supplementary Note). In the test set, a (null) risk factor-only model was defined via a Cox proportional-hazards model for T2D with age, sex, BMI, self-reported hypertension and self-reported family (sibling or parental) history of diabetes as predictors. A multitude of risk factors have been used in previous T2D risk prediction tools, most of which include the set that we have used in this study. While additional risk factors, such as waist:hip ratio, may also be relevant7, we selected the null model covariates based on those used in the Diabetes UK: Type 2 Diabetes Know Your Risk tool (https://riskscore.diabetes.org.uk/start) to compare our results to an existing widely used tool. This was with the exception of ethnicity because of the relative homogeneity of the Generation Scotland cohort. These also closely match the top risk factors identified in a systematic review of previous T2D risk predictors (Fig. 2 in Collins et al.7).
Penalized regression predictors
Because the number of CpGs (nCpG = 200,000) was much greater than the number of rows in the training set (n = 9,835 after preprocessing), a regularization method was required to reduce overfitting of the Cox proportional-hazards regression models.
Penalized regression models reduce overfitting by applying a regularization penalty in the model fitting process. This forces regression parameters to remain small or possibly to shrink them to zero. The latter allows the method to be used for variable selection by keeping only the variables with resulting nonzero coefficients.
LASSO penalization was fitted to the training set DNAm and protein epigenetic scores using glmnet34,35 via MethylPipeR with the best shrinkage parameter (λ) chosen by nine-fold cross-validation.
Tree-ensemble models
Tree ensembles are nonparametric models capable of estimating complex functions using a set of decision trees. Two tree-ensemble approaches were used: RSF36 and sBART37. RSF38 is an ensemble machine learning model that estimates a function by averaging the output from a set of independently trained decision trees. During model fitting, each tree is built using a different subset of the variables from the training set to prevent individual trees from overfitting to the whole dataset. sBART is a nonparametric method that estimates a function as a sum over a set of regression trees. sBART incorporates the ability to model both additive and interaction effects and has shown high predictive performance compared with similar methods37,39.
RSF and sBART were fitted to the training set using the R packages randomForestSRC (v.2.11.0)40 and BART (v.2.9)41, respectively via MethylPipeR. Details on hyperparameter selection are given in the Supplementary Note.
Because of computational limitations and probable overfitting in using the tree-ensemble models on all CpGs in the dataset, variable preselection was based on the coefficients in the penalized Cox proportional-hazards models. Each tree-ensemble model was evaluated with the features corresponding to nonzero coefficients from the Cox proportional-hazards LASSO model.
Evaluating predictive performance
Survival models can be used to predict the risk of incident T2D for an arbitrary prediction period. In this study, we focused on classification performance for the binary outcome defined by a 10-year T2D incidence. Incidence probabilities were calculated as one minus 10-year survival probabilities and the binary outcomes were calculated by truncating the observed TTE at 10 years (see ‘Outcome definition for the 10-year onset prediction’ section of this article).
The AUC and PRAUC were calculated as measures of predictive performance because they do not require the choice of a fixed discrimination threshold. PRAUC is more informative in situations where there is a class imbalance in the test set42. Additionally, binary classification metrics consisting of sensitivity (recall), specificity, positive predictive value (precision) and negative predictive value were calculated. These metrics require selection of a discrimination threshold to assign positive and negative class predictions. We evaluated their behavior across a range of discrimination thresholds, between 0 and 1 in increments of 0.1.
Differences in correctly classified individuals between the risk factor-only and Cox proportional-hazards LASSO models were calculated assuming, arbitrarily, a 10-year incidence rate of 33%, for example, in a scenario where high-risk individuals have been selected for screening in a population of 10,000. The numbers of true positives and true negatives were calculated as follows:
\(\mathrm{TP} = \mathrm{sensitivity} \times N_{\mathrm{actual}\,\mathrm{positives}}\) and \(\mathrm{TN} = \mathrm{specificity} \times N_{\mathrm{actual}\,\mathrm{negatives}}\) respectively, where Nactual positives = 3,300 and Nactual negatives = 7,700. The difference between the two was then taken at each classification threshold.
Model calibration was examined by comparing predicted probabilities with actual case or control proportions43.
Selected CpG comparison with the EWAS Catalog
The CpG sites selected by the Cox proportional-hazards LASSO model were queried in the EWAS Catalog44 to identify traits that have previously been linked to these sites. The EWAS Catalog is a database that allows users to search EWAS results from the existing literature. We performed a tissue-agnostic query using the selected CpGs and filtered results to those with an epigenome-wide significance threshold of P < 3.6 × 10−8 (ref. 20) in studies with a sample size greater than 1,000. Almost all (739 out of 742; 99.6%) of the results after filtering were from blood-based studies. The remaining results were from saliva and prefrontal cortex-based studies.
Validation in the KORA S4 cohort
The Cox proportional-hazards LASSO model using the direct epigenetic score was applied to the KORA S4 cohort45. This cohort consisted of 1,451 individuals in southern Germany, aged 25–74 years. Cohort summary details are shown in Supplementary Table 11. Individuals with missing values in the health measures were removed from the dataset. Missing CpG values in the DNAm data were mean-imputed. Like the approach in the Generation Scotland test set, an epigenetic score was computed for each individual in the KORA S4 dataset. Evaluation was then performed using an incremental modeling approach. Additional cohort and methods details such as the outcome definition, follow-up period and preprocessing numbers are provided in the Supplementary Note.
Epigenetic score prediction of COVID-19 outcomes
The subset of the Generation Scotland cohort with reported COVID-19 infection (clinically diagnosed or positive test from linked test data), who had also participated in the CovidLife study46 were used to predict ongoing symptomatic COVID-19 and hospitalization from COVID-19 (n = 703). Ongoing symptomatic COVID-19 cases were defined as individuals with self-reported symptoms lasting 4 weeks or longer47. Hospitalization cases were defined as hospital admissions with accompanying ICD-10 codes U07.1 (confirmed COVID-19 test) and U07.2 (clinically diagnosed), derived from the Scottish Morbidity Records (SMR01). Details of the method and summary statistics are shown in Supplementary Note and Supplementary Table 12.
Ethics approval and consent to participate
All components of Generation Scotland received ethical approval from the NHS Tayside Committee on Medical Research Ethics (research ethics committee (REC) ref. no. 05/S1401/89). Generation Scotland has also been granted Research Tissue Bank status by the East of Scotland Research Ethics Service (REC ref. no. 20-ES-0021), providing generic ethical approval for a wide range of uses within medical research. Written, informed consent was provided by Generation Scotland participants.
The KORA S4 studies were approved by the ethics committee of the Bavarian Medical Association (no. 99186) and were conducted according to the principles expressed in the Declaration of Helsinki (World Medical Association Declaration of Helsinki 2008). All study participants gave their written informed consent.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
According to the terms of consent for Generation Scotland participants, access to data must be reviewed by the Generation Scotland Access Committee. Applications should be made to access@generationscotland.org. The informed consent given by the KORA S4 study participants does not cover data posting in public databases. However, data are available upon request from the KORA Project Application Self-Service Tool (https://epi.helmholtz-muenchen.de/). Data requests can be submitted online and are subject to approval by the KORA board.
Code availability
Analysis scripts for this study are available at https://github.com/marioni-group/episcores-diabetes-prediction and https://doi.org/10.5281/zenodo.7628959. MethylPipeR v.0.1.0 is available at https://github.com/marioni-group/MethylPipeR and https://doi.org/10.5281/zenodo.7628816. MethylPipeR-UI is available at https://github.com/marioni-group/MethylPipeR-UI and https://doi.org/10.5281/zenodo.7635952.
References
Saeedi, P. et al. Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: results from the International Diabetes Federation Diabetes Atlas, 9th edition. Diabetes Res. Clin. Pract. 157, 107843 (2019).
Gregg, E. W., Sattar, N. & Ali, M. K. The changing face of diabetes complications. Lancet Diabetes Endocrinol. 4, 537–547 (2016).
Biessels, G. J. & Despa, F. Cognitive decline and dementia in diabetes mellitus: mechanisms and clinical implications. Nat. Rev. Endocrinol. 14, 591–604 (2018).
McGurnaghan, S. J. et al. Risks of and risk factors for COVID-19 disease in people with diabetes: a cohort study of the total population of Scotland. Lancet Diabetes Endocrinol. 9, 82–93 (2021).
Lee, C. G. et al. Effect of metformin and lifestyle interventions on mortality in the Diabetes Prevention Program and Diabetes Prevention Program Outcomes Study. Diabetes Care 44, 2775–2782 (2021).
Keng, M. J. et al. Impact of achieving primary care targets in type 2 diabetes on health outcomes and healthcare costs. Diabetes Obes. Metab. 21, 2405–2412 (2019).
Collins, G. S., Mallett, S., Omar, O. & Yu, L.-M. Developing risk prediction models for type 2 diabetes: a systematic review of methodology and reporting. BMC Med. 9, 103 (2011).
McCartney, D. L. et al. Epigenetic prediction of complex traits and death. Genome Biol. 19, 136 (2018).
Gadd, D. A. et al. Epigenetic scores for the circulating proteome as tools for disease prediction. eLife 11, e71802 (2022).
Cardona, A. et al. Epigenome-wide association study of incident type 2 diabetes in a British population: EPIC-Norfolk study. Diabetes 68, 2315–2326 (2019).
Meeks, K. A. C. et al. Epigenome-wide association study in whole blood on type 2 diabetes among sub-Saharan African individuals: findings from the RODAM study. Int. J. Epidemiol. 48, 58–70 (2019).
Walaszczyk, E. et al. DNA methylation markers associated with type 2 diabetes, fasting glucose and HbA1c levels: a systematic review and replication in a case-control sample of the Lifelines study. Diabetologia 61, 354–368 (2018).
Al Muftah, W. A. et al. Epigenetic associations of type 2 diabetes and BMI in an Arab population. Clin. Epigenetics 8, 13 (2016).
Chambers, J. C. et al. Epigenome-wide association of DNA methylation markers in peripheral blood from Indian Asians and Europeans with incident type 2 diabetes: a nested case-control study. Lancet Diabetes Endocrinol. 3, 526–534 (2015).
Nakatochi, M. et al. Epigenome-wide association of myocardial infarction with DNA methylation sites at loci related to cardiovascular disease. Clin. Epigenetics 9, 54 (2017).
Wang, X. et al. An epigenome-wide study of obesity in African American youth and young adults: novel findings, replication in neutrophils, and relationship with gene expression. Clin. Epigenetics 10, 3 (2018).
Wahl, S. et al. Epigenome-wide association study of body mass index, and the adverse outcomes of adiposity. Nature 541, 81–86 (2017).
Haw, J. S. et al. Long-term sustainability of diabetes prevention approaches: a systematic review and meta-analysis of randomized clinical trials. JAMA Intern. Med. 177, 1808–1817 (2017).
Samuels, T. A., Cohen, D., Brancati, F. L., Coresh, J. & Kao, W. H. Delayed diagnosis of incident type 2 diabetes mellitus in the ARIC study. Am. J. Manag. Care 12, 717–724 (2006).
Saffari, A. et al. Estimation of a significance threshold for epigenome-wide association studies. Genet. Epidemiol. 42, 20–33 (2018).
Ekoe, J.-M., Goldenberg, R. & Katz, P. Screening for diabetes in adults. Can. J. Diabetes 42, S16–S19 (2018).
Van Calster, B. et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 17, 230 (2019).
Van Calster, B. et al. A calibration hierarchy for risk models was defined: from utopia to empirical data. J. Clin. Epidemiol. 74, 167–176 (2016).
Demerath, E. W. et al. Epigenome-wide association study (EWAS) of BMI, BMI change and waist circumference in African American adults identifies multiple replicated loci. Hum. Mol. Genet. 24, 4464–4479 (2015).
Mendelson, M. M. et al. Association of body mass index with DNA methylation and gene expression in blood cells and relations to cardiometabolic disease: a Mendelian randomization approach. PLoS Med. 14, e1002215 (2017).
Sayols-Baixeras, S. et al. Identification and validation of seven new loci showing differential DNA methylation related to serum lipid profile: an epigenome-wide approach. The REGICOR study. Hum. Mol. Genet. 25, 4556–4565 (2016).
Braun, K. V. E. et al. Epigenome-wide association study (EWAS) on lipids: the Rotterdam Study. Clin. Epigenetics 9, 15 (2017).
Kriebel, J. et al. Association between DNA methylation in whole blood and measures of glucose metabolism: KORA F4 study. PLoS ONE 11, e0152314 (2016).
Lee, J. D., Sun, D. L., Sun, Y. & Taylor, J. E. Exact post-selection inference, with application to the lasso. Ann. Stat. 44, 907–927 (2016).
Taylor, J. & Tibshirani, R. Post‐selection inference for ℓ1‐penalized likelihood models. Can. J. Stat. 46, 41–61 (2018).
Austin, P. C., Lee, D. S. & Fine, J. P. Introduction to the analysis of survival data in the presence of competing risks. Circulation 133, 601–609 (2016).
Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. M. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Br. J. Surg. 102, 148–158 (2015).
Smith, B. H. et al. Cohort Profile: Generation Scotland: Scottish Family Health Study (GS: SFHS). The study, its participants and their potential for genetic research on health and illness. Int. J. Epidemiol. 42, 689–700 (2013).
Simon, N., Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J. Stat. Softw. 39, 1–13 (2011).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Ishwaran, H., Kogalur, U. B., Blackstone, E. H. & Lauer, M. S. Random survival forests. Ann. Appl. Stat. 2, 841–860 (2008).
Sparapani, R. A., Logan, B. R., McCulloch, R. E. & Laud, P. W. Nonparametric survival analysis using Bayesian additive regression trees (BART). Stat. Med. 35, 2741–2753 (2016).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Chipman, H. A., George, E. I. & McCulloch, R. E. BART: Bayesian additive regression trees. Ann. Appl. Stat. 4, 266–298 (2010).
Ishwaran, H. & Kogalur, U. Fast unified random forests for survival, regression, and classification (RF-SRC). R package version 2.11.0 (2021).
Sparapani, R., Spanbauer, C. & McCulloch, R. Nonparametric machine learning and efficient computation with Bayesian additive regression trees: the BART R package. J. Stat. Softw. 97, 1–66. (2021).
Saito, T. & Rehmsmeier, M. The precision–recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10, e0118432 (2015).
De Cock, B., Nieboer, D., Van Calster, B., Steyerberg, E. W. & Vergouwe, Y. The CalibrationCurves package: validating predicted probabilities against binary events. R package version 0.1.5. https://github.com/BavoDC/CalibrationCurves (2023).
Battram, T. et al. The EWAS Catalog: a database of epigenome-wide association studies. Wellcome Open Res. 7, 41 (2022).
Wichmann, H.-E., Gieger, C. & Illig, T. KORA-gen—resource for population genetics, controls and a broad spectrum of disease phenotypes. Gesundheitswesen 67, S26–S30 (2005).
Fawns-Ritchie, C. et al. CovidLife: a resource to understand mental health, well-being and behaviour during the COVID-19 pandemic in the UK. Wellcome Open Res. 6, 176 (2021).
Shah, W., Hillman, T., Playford, E. D. & Hishmeh, L. Managing the long term effects of covid-19: summary of NICE, SIGN, and RCGP rapid guideline. BMJ 372, n136 (2021).
Acknowledgements
This research was funded in whole, or in part, by the Wellcome Trust (nos. 104036/Z/14/Z, 108890/Z/15/Z and 216767/Z/19/Z). For the purpose of open access, we have applied a CC BY public copyright licence to any author accepted manuscript version arising from this submission. Generation Scotland received core support from the Chief Scientist Office of the Scottish Government Health Directorates (no. CZD/16/6) and the Scottish Funding Council (no. HR03006) and is currently supported by the Wellcome Trust (no. 216767/Z/19/Z). DNAm profiling of the Generation Scotland samples was carried out by the Genetics Core Laboratory at the Edinburgh Clinical Research Facility and was funded by the Medical Research Council UK and the Wellcome Trust (Wellcome Trust Strategic Award ‘STratifying Resilience and Depression Longitudinally’ (ref. no. 104036/Z/14/Z)). The DNAm data assayed for Generation Scotland was partially funded by a 2018 NARSAD Young Investigator Grant from the Brain & Behavior Research Foundation (ref. no. 27404; awardee: D. M. Howard) and by a JMAS SIM fellowship from the Royal College of Physicians of Edinburgh (awardee: H. C. Whalley). Y.C. is supported by the University of Edinburgh and University of Helsinki joint PhD program in Human Genomics. D.A.G. is supported by funding from the Wellcome Trust 4-year PhD in Translational Neuroscience—training the next generation of basic neuroscientists to embrace clinical research (no. 108890/Z/15/Z). C.A.V. is a Chancellor’s Fellow funded by the University of Edinburgh. D.L.M. and R.E.M. are supported by an Alzheimer’s Research UK major project grant no. ARUK-PG2017B-10. R.E.M. is supported by an Alzheimer’s Society major project grant no. AS-PG-19b-010. M.J.S., N.S. and E.L. are supported by the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. Recruitment to the CovidLife study was facilitated by the Scottish Health Research Register (SHARE) and Biobank. SHARE is supported by NHS Research Scotland, the universities of Scotland and the Chief Scientist Office of the Scottish Government. The KORA S4 study was initiated and financed by the Helmholtz Zentrum München—German Research Center for Environmental Health, which is funded by the German Federal Ministry of Education and Research and by the State of Bavaria. Furthermore, the KORA research has been supported by the Munich Center of Health Sciences, Ludwig-Maximilians-Universität München as part of LMUinnovativ and is supported by the German Centre for Cardiovascular Research. The KORA S4 study is funded by the Bavarian State Ministry of Health and Care through the research project DigiMed Bayern (www.digimed-bayern.de).
Author information
Authors and Affiliations
Contributions
Y.C., D.A.G. and C.G. performed the data analysis. Y.Z., I.B., M.J.S., N.S. and E.L. conducted the preliminary analyses. Generation Scotland cohort: N.W. and L.M. were responsible for the data collection. A.C., C.N., R.M.W., K.L.E. and D.L.M. prepared the data. D.J.P. and A.M.M. were responsible for data collection and cohort management. KORA S4 cohort: A.P., W.R. and M.W. were responsible for data collection and preparation. CovidLife: C.F.-R. performed the data collection and preparation. K.M.-G. and T.I.C. provided input on statistical modeling. A.G., C.A.V. and R.E.M. were responsible for study design and methodology. All authors contributed to drafting the paper and the figures.
Corresponding authors
Ethics declarations
Competing interests
R.E.M. has received a speaker fee from Illumina and is an advisor to the Epigenetic Clock Development Foundation. A.M.M. has previously received speaker fees from Janssen and Illumina and research funding from The Sackler Trust. L.M. has received payment from Illumina for presentations and consultancy. The other authors declare no competing interests.
Peer review
Peer review information
Nature Aging thanks Srikanth Bellary and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Preprocessing steps for Generation Scotland and KORA S4.
The number of individuals/cases and controls in are given after each step.
Extended Data Fig. 2 Calibration plots for incremental models in Generation Scotland.
Plots are shown for the full model (risk factors + composite protein epigenetic score + Cox PH lasso direct epigenetic score) (top-left) and the risk factors only model (bottom-left). The black line shows the loess calibration regression curve. The grey area shows 95% confidence intervals calculated from 2000 bootstrap samples. The ideal calibration line (observed = predicted) is shown in red. The histogram shows the distribution of predicted probabilities. The wider confidence intervals at higher predicted probabilities are due to the small number of predictions in those ranges. Most predictions are low in the probability range, emphasised in the zoomed-in plots (top-right and bottom-right).
Extended Data Fig. 3 An example from the MethylPipeR-UI Shiny app.
The left hand panel provides functionality for uploading data and specifying pipeline parameters. The right hand tabs show output such as model diagnostics, performance metrics and console output.
Supplementary information
Supplementary Information
Supplementary Methods.
Supplementary Tables
Supplementary Tables 1–14.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cheng, Y., Gadd, D.A., Gieger, C. et al. Development and validation of DNA methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes. Nat Aging 3, 450–458 (2023). https://doi.org/10.1038/s43587-023-00391-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s43587-023-00391-4
This article is cited by
-
DNA Methylation-derived biological age and long-term mortality risk in subjects with type 2 diabetes
Cardiovascular Diabetology (2024)
-
Epigenetic scores derived in saliva are associated with gestational age at birth
Clinical Epigenetics (2024)
-
Blood protein assessment of leading incident diseases and mortality in the UK Biobank
Nature Aging (2024)
-
Epigenomic insights into common human disease pathology
Cellular and Molecular Life Sciences (2024)
-
Integrated epigenome, whole genome sequence and metabolome analyses identify novel multi-omics pathways in type 2 diabetes: a Middle Eastern study
BMC Medicine (2023)