Development and validation of DNA methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes

Type 2 diabetes mellitus (T2D) presents a major health and economic burden that could be alleviated with improved early prediction and intervention. While standard risk factors have shown good predictive performance, we show that the use of blood-based DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Previous studies have been largely constrained by linear assumptions, the use of cytosine–guanine pairs one-at-a-time and binary outcomes. We present a flexible approach (via an R package, MethylPipeR) based on a range of linear and tree-ensemble models that incorporate time-to-event data for prediction. Using the Generation Scotland cohort (training set ncases = 374, ncontrols = 9,461; test set ncases = 252, ncontrols = 4,526) our best-performing model (area under the receiver operating characteristic curve (AUC) = 0.872, area under the precision-recall curve (PRAUC) = 0.302) showed notable improvement in 10-year onset prediction beyond standard risk factors (AUC = 0.839, precision–recall AUC = 0.227). Replication was observed in the German-based KORA study (n = 1,451, ncases = 142, P = 1.6 × 10−5). Early type 2 diabetes (T2D) risk assessment could help slow or prevent disease onset. Here the authors used blood-based DNA methylation data to develop 10-year risk prediction models for incident T2D. The results show an improvement in performance beyond standard risk factors typically used to predict the risk of T2D onset.

Type 2 diabetes mellitus (T2D) presents a major health and economic burden that could be alleviated with improved early prediction and intervention. While standard risk factors have shown good predictive performance, we show that the use of blood-based DNA methylation information leads to a significant improvement in the prediction of 10-year T2D incidence risk. Previous studies have been largely constrained by linear assumptions, the use of cytosine-guanine pairs one-at-a-time and binary outcomes. We present a flexible approach (via an R package, MethylPipeR) based on a range of linear and tree-ensemble models that incorporate time-to-event data for prediction. Using the Generation Scotland cohort (training set n cases = 374, n controls = 9,461; test set n cases = 252, n controls = 4,526) our best-performing model (area under the receiver operating characteristic curve (AUC) = 0.872, area under the precision-recall c ur ve ( PR AUC) = 0.302) showed notable improvement in 10-year onset prediction beyond standard risk factors (AUC = 0.839, precision-recall AUC = 0.227). Replication was observed in the German-based KORA study (n = 1,451, n cases = 142, P = 1.6 × 10 −5 ).
In this study, we use one of the world's largest studies with paired genome-wide DNAm and data linkage to electronic health records, Generation Scotland (n = 14,613, n = 626 incident T2D cases over 15 years of follow-up), to develop and validate epigenetic scores for T2D. We show the added contribution of these epigenetic scores to prediction over and above standard risk factors, for example, age, sex and BMI and externally validate these results in the KORA S4 cohort.

Results
After preprocessing, the mean time-to-onset of T2D was 5.7 and 5.9 years for the training (n = 374 cases) and test (n = 252 cases) sets, respectively. Mean age at onset was also similar between the training and test set at 61.2 and 60.4 years and the mean BMI for cases (at baseline) was 31.7 and 32.2 kg m −2 . The full set of cohort summary details for cases and controls in both sets are shown in Table 1. The preprocessing steps and sample sizes at each step are shown in Extended Data Fig. 1. The machine learning prediction pipeline of the MethylPipeR package is shown in Fig. 1.

Null model for the incremental modeling approach
A Cox proportional-hazards model in the test set with age, sex, BMI, self-reported hypertension and family history of diabetes as predictors yielded good classification metrics: area under the receiver operating managed. Despite developments in the way T2D can be managed for patients, these treatments are reactive, focusing on patients that have already been diagnosed. Early intervention with metformin or lifestyle changes have been shown to delay the onset of T2D, although they do not reduce the risk of all-cause mortality 5 .
Beyond public health costs, T2D also presents a substantial financial burden to the National Health Service (NHS), with estimated annual spending of £10 billion on diabetes in the UK. Around 80% of these costs are for the treatment of complications, many of which are preventable with early intervention 6 .
While the mechanisms of insulin resistance in T2D are well known, the interactions between genetic and environmental factors that increase T2D susceptibility are less understood. Previous T2D risk prediction models have used a range of health risk factors 7 . However, these have not used epigenetic information. Epigenetics is the study of heritable changes to DNA that do not modify its nucleotide sequence. A commonly studied form of this is DNA methylation (DNAm), whereby methyl groups are attached to the DNA molecule, most commonly to the five-carbon on a cytosine in a cytosine-guanine pair (CpG). Due to its involvement with gene expression and gene environment interactions, DNAm can provide dynamic predictive information for disease risk for an individual. For example, epigenetic scores built via penalized regression models have been used to show that weighted linear CpG predictors can explain a substantial proportion of phenotypic variance (R 2 ) of modifiable health factors including body mass index (BMI) (12.5%), high-density lipoprotein cholesterol (15.6%) and smoking status (60.9%) 8 . Blood-based DNAm is of particular interest in predictive modeling and biomarker development because of its comparatively noninvasive sampling procedure. Epigenetic scores have also shown the ability to explain that up to 58% of variance in plasma protein levels are associated with several incident diseases including T2D and several comorbidities 9 . Epigenome-wide association studies (EWAS) have identified a number of CpG sites significantly associated with T2D 10-14 and related risk factors such as cardiovascular disease 15 and obesity 16,17 . While these provide some predictive performance for T2D prevalence, incident T2D has been less well studied. One such EWAS with 563 cases and 701 controls identified 18 CpGs associated with incident T2D but did not consider any prediction models 10 . Given that preventive lifestyle changes have been shown to effectively reduce T2D onset 18 , prediction of T2D incidence years ahead of time would be greatly beneficial in stratifying populations so those at high risk can be monitored and treated with early interventions.
Currently, most studies generating DNAm predictors consider marginal CpG effects or assume only linear additive effects between CpGs. The use of predictive models that can incorporate both interaction and nonlinear effects could capture more complex relationships between variables, resulting in greater prediction accuracy. Therefore, our study aimed to evaluate both the additional predictive benefit that

Incremental model using the direct epigenetic score
In the risk factors plus direct epigenetic score test set model, Cox proportional-hazards with least shrinkage and selection operator (LASSO) performed the best, showing an AUC and PRAUC of 0.870 and 0.299, respectively (P = 3.6 × 10 −27 for the epigenetic score coefficient). This corresponds to an increase of 3.1% and 7.2% over the standard risk factor model.
Overall, the tree-ensemble models used for the direct epigenetic score resulted in lower performance compared to the Cox proportional-hazards LASSO. AUC values for random survival forests (RSF) and survival Bayesian additive regression trees (sBART) were 0.852 and 0.840 and the PRAUC values were 0.247 and 0.230, respectively (Supplementary Table 1). P values for the epigenetic score coefficients and receiver operating characteristic (ROC) curves for all models are given in Supplementary Table 2.

Incremental model using the composite protein epigenetic score
The composite protein epigenetic score (with 109 possible input protein epigenetic score features) derived using a Cox proportional-hazards LASSO model showed good performance with AUC and PRAUC of 0.864 and 0.270, respectively (epigenetic score coefficient P = 1.61 × 10 −18 ). The increase in PRAUC was smaller for the composite protein epigenetic score compared to the direct epigenetic score but still a notable improvement over using risk factors only.

Incremental model using composite protein plus direct epigenetic scores
The full model (risk factors plus composite protein epigenetic score plus direct epigenetic score) with a Cox proportional-hazards LASSO direct epigenetic score gave an AUC and PRAUC of 0.872 and 0.302, respectively. The full models with RSF and sBART-derived direct epigenetic scores showed AUCs of 0.866 and 0.864, respectively. The corresponding PRAUC values were 0.273 and 0.270. Therefore, the best overall model used direct and composite protein epigenetic scores from Cox proportional-hazards LASSO models. The ROC and precisionrecall curves for the full models and risk factor-only model are shown in Fig. 2. We also examined if our findings were robust to potential lag effects in T2D diagnosis 19 . Increases to both the AUC and PRAUC were observed when adding the epigenetic scores to a risk factor-only model after excluding cases diagnosed within the first 2 years of follow-up (Supplementary Table 3).

Binary classification metrics and model calibration
Supplementary Table 4 shows how confusion matrix metrics vary for the null (risk factor-only) model and the Cox proportional-hazards LASSO model across a range of probability classification thresholds. As expected, as the classification probability threshold is increased, sensitivity and negative predictive value decrease while specificity increases. The effects of these differences on the number of true positives and false negatives are illustrated in Fig. 3. The two models also show differences in their calibration plots (Extended Data Fig. 2). In addition, the difference in the number of correctly classified individuals between the two models are given. These were calculated assuming, arbitrarily, a 10-year incidence rate of 33%, for example, in a scenario where high-risk individuals have been selected for screening.

Selected protein epigenetic scores
The composite protein epigenetic score Cox proportional-hazards LASSO model assigned nonzero coefficients to 46 protein epigenetic scores. Details on the corresponding proteins and genes are given in Supplementary Table 7. Out of the selected protein epigenetic scores, 21 previously showed associations with incident T2D 9 .

Validation in the KORA S4 cohort
Prediction of incident diabetes in the KORA S4 cohort using the Cox proportional-hazards LASSO model showed good replication of direct epigenetic score performance (P = 1.6 × 10 −5 ) with increases of 1.6% in absolute terms above the null model values for both AUC and PRAUC. Further details are provided in Supplementary Table 8.

Epigenetic score prediction of COVID-19 outcomes
In all models, incident T2D was predictive of hospitalization with COVID-19 infection. However, neither the composite protein nor the direct epigenetic score were predictive of the same outcome (Supplementary Table 9). Additionally, neither the (direct or protein-based) epigenetic scores nor incident T2D were predictive of ongoing symptomatic COVID-19 after COVID-19 infection.

Discussion
Using a large cohort with genome-wide epigenetic data and health records linkage to longitudinal primary and secondary care data, we showed that DNAm-based predictors augment standard risk factors in the prediction of incident T2D. The best model with traditional risk factors yielded an AUC of 0.839 compared to 0.872 when DNAm was also considered and the PRAUC increased from 0.227 to 0.305. Using several linear and nonlinear survival models, we showed that overall, the Cox proportional-hazards LASSO model produced the most predictive direct epigenetic score. A composite protein epigenetic score also notably increased predictive performance. The direct epigenetic score also showed good external validation performance in the KORA S4 cohort. Beyond the T2D analysis presented in this study, we developed the MethylPipeR R package, along with a user interface version MethylPipeR-UI (shown in Extended Data Fig. 3), to facilitate reproducible machine learning time-to-event (TTE) and binary prediction using DNAm or other types of high-dimensional omics data. Determining a 'best' model is complicated and depends on the trade-off that a user wishes to make. In this study, we optimized AUC and PRAUC but binary classification metrics vary by method or classification threshold. When using classifiers in clinical settings, decisions need to be made about the number of patients that can be recommended for intervention and the acceptable proportion of false positives and false negatives. We showed an increase in the correct identification of positives and negatives at varying probability thresholds when adding direct and composite epigenetic scores above standard risk factors. For instance, given an (arbitrary) incidence rate of 33% (commonly used as a cutoff for high risk of T2D) 21 over 10 years in a sample of 10,000 individuals, our best model would correctly classify an additional 449 individuals compared to the risk factor-only model at a threshold of 0.2 (Supplementary Table 4). Given the costs of treating T2D-related complications, our study gives evidence for possible benefits of epigenetic scores on public health that could also alleviate the financial burden to the NHS. In addition, an assessment of calibration is also critical 22,23 . Investigation of these related criteria could assist in deciding an optimal threshold given clinical constraints and provide a more comprehensive assessment of model predictions than AUCs or metrics at the commonly used threshold of 0.5.
Several CpGs from the direct epigenetic score were previously identified as epigenome-wide significant correlates of traits commonly linked to T2D 14,17,[24][25][26][27][28] . Future work could investigate the overlap between these and TTE EWAS studies. Further studies could also include DNAm predictors for traditional risk factors that are included in the null model, such as BMI 8 .
Limitations include the relatively small number of disease cases in the dataset, the limited hyperparameter optimization performed for sBART and the relatively simple variable preselection method for tree-ensemble methods. Given the lower performance of these methods compared to the best models in this study, there is potential for additional improvement in predictive performance with further investigation of more advanced feature (pre)selection. This is particularly important when we consider that the preselection step used linear models before the nonlinear model fitting. Model fitting and preselection were also performed using the same training set, which may have introduced issues associated to post-selection inference 29,30 . In addition, factors such as overfitting, related individuals in the test set and batch effects between the three rounds of DNAm data processing may all have had an effect on the test set AUC. Future studies may also take into account multimorbidities because the presence of competing risks can lead to bias in onset predictions 31 . Finally, a small proportion of the linkage codes used to define diabetes included broad terms that were nonspecific to T2D; however, the late age of onset in these individuals meant that there was a high likelihood that they had developed T2D. Epigenetic scores for T2D-associated proteins have also been shown to replicate incident T2D-protein associations in this sample 9 , suggesting that the case definitions we used captured biological signals relevant to T2D. There are many strengths to our study. First, the models used captured relationships between CpGs and TTE information, which is not possible using traditional EWAS methods. Second, data linkage to health care measures provided comprehensive T2D incidence data in a very large cohort study, that is, Generation Scotland. Validation performance in the KORA S4 cohort also strengthened evidence for the applicability of the models to other populations. Finally, the R package MethylPipeR encourages reproducibility and allows others to develop similar predictors on new data with minimal setup.
In conclusion, we have demonstrated the potential for DNAm data to provide notable improvement in predictive performance for incident T2D, compared to traditional risk factors (age, sex, BMI, hypertension and family history). We evaluated different models with a systematic approach and presented a framework with the ability to generalize to other traits and datasets for training and testing predictors in future studies.

Statistics and reproducibility
To enhance reproducibility, the analysis pipeline and results presented in this study have been reported using the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) checklist 32 (Supplementary File 1). Sample sizes for training, testing and validation of the statistical models were determined by the cohort sizes. Exclusions were made based on DNAm quality control and missingness in health data. Randomization and blinding were not applicable during data collection because the study used observational data from population-based cohorts. Statistical tests for model coefficients assumed a normal distribution but this was not formally tested.

Generation Scotland
Blood-based DNAm and linked health data were obtained from Generation Scotland 33 , a family-structured, population-based cohort. The cohort consists of 23,960 volunteers across Scotland aged 18-99 years at recruitment (between 2006 and 2011), of whom over 18,000 currently have genome-wide DNAm data available (Illumina EPIC array). In DNAm quality control, CpG sites were filtered by removing those with a low bead count in 5% or more of the samples or a high detection P value (>0.05) in more than 5% of samples. Probes on the X and Y chromosomes were also removed. Samples were filtered by removing those with a mismatch between predicted and recorded sex or 1% or more of CpGs with a detection P > 0.05. Missing CpG values were mean-imputed. To enable the predictors to be applied to existing cohort studies with older Illumina array data, CpGs were filtered to the intersection of the 450k and EPIC array sites (n = 453,093 CpGs).
This study considered DNAm data from three large subsets of the Generation Scotland cohort, with 5,087 (set 1), 4,450 (set 2) and 8,877 (set 3) individuals. Processing took place in 2017, 2019 and 2021, respectively. Set 1 and set 3 included related individuals within and between each set while all individuals in set 2 were unrelated to each other and to individuals in set 1 (genetic relationship matrix threshold < 0.05). In our experiments, the training set consisted of sets 2 and 3 combined; set 1 was used as the test set. To avoid the presence of families with individuals across both training and test sets, any individuals in the training set from the same family as an individual in the test set were excluded from the analysis (n excluded = 3,138).
Participant health measures including age, BMI, sex, self-reported hypertension and family (parent or sibling) history of T2D were taken at baseline (DNAm sampling) via questionnaire. BMI was calculated as the individual's weight in kilograms divided by the square of their height in meters. Missing values in the set 1 health measures were treated as missing completely at random and the corresponding individuals were excluded (n set 1 = 99). This was not performed in sets 2 and 3 because the health measures were used for incremental modeling (set 1 only).
Disease cases were ascertained through data linkage to NHS Scotland health records consisting of hospital (International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) codes) and GP records (Read2 codes). Prevalent cases were identified from a baseline questionnaire (self-reported) or from ICD-10/Read2 codes dated before baseline and removed from the dataset. Type 1 and juvenile cases were treated as control observations. A total of 757 incident cases were observed over the follow-up period (from the recruitment date to January 2022); after preprocessing, 626 cases remained. Mean time to T2D onset was 5.9, 5.4 and 6.0 years for sets 1, 2 and 3, respectively, with ranges of 0.2-14.8 (set 1), 0.2-14.8 (set 2) and 0.1-14.8 (set 3) years. In GP record-derived cases, 81% of cases had a C10F 'type 2 diabetes mellitus' code; 12% had a C10 'diabetes mellitus' code and 4% had a C109 'non-insulin dependent diabetes mellitus' code. The full list of included and excluded terms are given in Supplementary Table 10.

Composite protein epigenetic score
A composite protein epigenetic score model for incident T2D was trained using a set of 109 protein epigenetic scores previously shown to associate with a range of incident diseases 9 . For each protein, the epigenetic score was calculated for each individual in the training and test sets. A Cox proportional-hazards LASSO model was fitted to the training set with the 109 protein epigenetic scores (scaled within set to a mean of 0 and variance of 1) as features. The linear predictor from the Cox proportional-hazards LASSO model was then used as the composite protein epigenetic score in the incremental test set model.

Direct epigenetic score
The direct epigenetic score Cox proportional-hazards LASSO model for incident T2D was fitted to the DNAm data in the training set. Due to memory limitations in the model fitting R package (glmnet 34 ), the CpGs were filtered to the 200,000 sites with the highest variance. The linear predictor from the Cox proportional-hazards LASSO model was then used as the direct epigenetic score in the incremental test set modeling. For the tree-ensemble models, the Cox proportional-hazards LASSO-selected CpGs were used as input and the 10-year onset risk was subsequently used as the direct epigenetic score.

Outcome definition for the 10-year onset prediction
The link to NHS Scotland health records provided dates for disease diagnoses from which age at onset was calculated. Along with age at baseline (DNAm sampling), these were used to calculate the TTE, measured in years, for each individual. For incident T2D cases and controls, TTE was defined as the time from baseline to disease onset and censoring, respectively. Controls were censored at the latest date of available GP records (September 2020). In addition, controls who died during the follow-up period were censored at time of death.
While all models were trained as survival models, our primary prediction outcome was incident T2D diagnosis within 10 years. Therefore, predictions on the test set were calculated using the 10-year onset probability (one minus survival probability). When calculating the binary outcome metrics, cases with TTE > 10 were treated as controls. These metrics included confusion matrices, AUC and PRAUC. After preprocessing and case thresholding (TTE > 10), there were 218 cases and 4,560 controls in the test set.
The numbers of individuals, cases and controls after each preprocessing step are shown in Extended Data Fig. 1.

Incremental modeling
Composite protein epigenetic scores and direct epigenetic scores were generated in the training dataset using different machine learning Analysis https://doi.org/10.1038/s43587-023-00391-4 methods with the MethylPipeR package (Fig. 1), before being applied to the test set using an incremental modeling approach (further detail in the Supplementary Note). In the test set, a (null) risk factor-only model was defined via a Cox proportional-hazards model for T2D with age, sex, BMI, self-reported hypertension and self-reported family (sibling or parental) history of diabetes as predictors. A multitude of risk factors have been used in previous T2D risk prediction tools, most of which include the set that we have used in this study. While additional risk factors, such as waist:hip ratio, may also be relevant 7 , we selected the null model covariates based on those used in the Diabetes UK: Type 2 Diabetes Know Your Risk tool (https://riskscore.diabetes.org.uk/start) to compare our results to an existing widely used tool. This was with the exception of ethnicity because of the relative homogeneity of the Generation Scotland cohort. These also closely match the top risk factors identified in a systematic review of previous T2D risk predictors (Fig. 2 in Collins et al. 7 ).

Penalized regression predictors
Because the number of CpGs (n CpG = 200,000) was much greater than the number of rows in the training set (n = 9,835 after preprocessing), a regularization method was required to reduce overfitting of the Cox proportional-hazards regression models.
Penalized regression models reduce overfitting by applying a regularization penalty in the model fitting process. This forces regression parameters to remain small or possibly to shrink them to zero. The latter allows the method to be used for variable selection by keeping only the variables with resulting nonzero coefficients.
LASSO penalization was fitted to the training set DNAm and protein epigenetic scores using glmnet 34,35 via MethylPipeR with the best shrinkage parameter (λ) chosen by nine-fold cross-validation.

Tree-ensemble models
Tree ensembles are nonparametric models capable of estimating complex functions using a set of decision trees. Two tree-ensemble approaches were used: RSF 36 and sBART 37 . RSF 38 is an ensemble machine learning model that estimates a function by averaging the output from a set of independently trained decision trees. During model fitting, each tree is built using a different subset of the variables from the training set to prevent individual trees from overfitting to the whole dataset. sBART is a nonparametric method that estimates a function as a sum over a set of regression trees. sBART incorporates the ability to model both additive and interaction effects and has shown high predictive performance compared with similar methods 37,39 .
RSF and sBART were fitted to the training set using the R packages randomForestSRC (v.2.11.0) 40 and BART (v.2.9) 41 , respectively via MethylPipeR. Details on hyperparameter selection are given in the Supplementary Note.
Because of computational limitations and probable overfitting in using the tree-ensemble models on all CpGs in the dataset, variable preselection was based on the coefficients in the penalized Cox proportional-hazards models. Each tree-ensemble model was evaluated with the features corresponding to nonzero coefficients from the Cox proportional-hazards LASSO model.

Evaluating predictive performance
Survival models can be used to predict the risk of incident T2D for an arbitrary prediction period. In this study, we focused on classification performance for the binary outcome defined by a 10-year T2D incidence. Incidence probabilities were calculated as one minus 10-year survival probabilities and the binary outcomes were calculated by truncating the observed TTE at 10 years (see 'Outcome definition for the 10-year onset prediction' section of this article).
The AUC and PRAUC were calculated as measures of predictive performance because they do not require the choice of a fixed discrimination threshold. PRAUC is more informative in situations where there is a class imbalance in the test set 42 . Additionally, binary classification metrics consisting of sensitivity (recall), specificity, positive predictive value (precision) and negative predictive value were calculated. These metrics require selection of a discrimination threshold to assign positive and negative class predictions. We evaluated their behavior across a range of discrimination thresholds, between 0 and 1 in increments of 0.1.
Differences in correctly classified individuals between the risk factor-only and Cox proportional-hazards LASSO models were calculated assuming, arbitrarily, a 10-year incidence rate of 33%, for example, in a scenario where high-risk individuals have been selected for screening in a population of 10,000. The numbers of true positives and true negatives were calculated as follows: TP = sensitivity × N actual positives and TN = specificity × N actual negatives respectively, where N actual positives = 3,300 and N actual negatives = 7,700. The difference between the two was then taken at each classification threshold.
Model calibration was examined by comparing predicted probabilities with actual case or control proportions 43 .

Selected CpG comparison with the EWAS Catalog
The CpG sites selected by the Cox proportional-hazards LASSO model were queried in the EWAS Catalog 44 to identify traits that have previously been linked to these sites. The EWAS Catalog is a database that allows users to search EWAS results from the existing literature. We performed a tissue-agnostic query using the selected CpGs and filtered results to those with an epigenome-wide significance threshold of P < 3.6 × 10 −8 (ref. 20) in studies with a sample size greater than 1,000. Almost all (739 out of 742; 99.6%) of the results after filtering were from blood-based studies. The remaining results were from saliva and prefrontal cortex-based studies.

Validation in the KORA S4 cohort
The Cox proportional-hazards LASSO model using the direct epigenetic score was applied to the KORA S4 cohort 45 . This cohort consisted of 1,451 individuals in southern Germany, aged 25-74 years. Cohort summary details are shown in Supplementary Table 11. Individuals with missing values in the health measures were removed from the dataset. Missing CpG values in the DNAm data were mean-imputed. Like the approach in the Generation Scotland test set, an epigenetic score was computed for each individual in the KORA S4 dataset. Evaluation was then performed using an incremental modeling approach. Additional cohort and methods details such as the outcome definition, follow-up period and preprocessing numbers are provided in the Supplementary Note.

Epigenetic score prediction of COVID-19 outcomes
The subset of the Generation Scotland cohort with reported COVID-19 infection (clinically diagnosed or positive test from linked test data), who had also participated in the CovidLife study 46 were used to predict ongoing symptomatic COVID-19 and hospitalization from COVID-19 (n = 703). Ongoing symptomatic COVID-19 cases were defined as individuals with self-reported symptoms lasting 4 weeks or longer 47 . Hospitalization cases were defined as hospital admissions with accompanying ICD-10 codes U07.1 (confirmed COVID-19 test) and U07.2 (clinically diagnosed), derived from the Scottish Morbidity Records (SMR01). Details of the method and summary statistics are shown in Supplementary Note and Supplementary Table 12.  Written, informed consent was provided by Generation Scotland participants.

Ethics approval and consent to participate
The KORA S4 studies were approved by the ethics committee of the Bavarian Medical Association (no. 99186) and were conducted according to the principles expressed in the Declaration of Helsinki (World Medical Association Declaration of Helsinki 2008). All study participants gave their written informed consent.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
According to the terms of consent for Generation Scotland participants, access to data must be reviewed by the Generation Scotland Access Committee. Applications should be made to access@generationscotland.org. The informed consent given by the KORA S4 study participants does not cover data posting in public databases. However, data are available upon request from the KORA Project Application Self-Service Tool (https://epi.helmholtz-muenchen.de/). Data requests can be submitted online and are subject to approval by the KORA board.