Associated factors of white matter hyperintensity volume: a machine-learning approach

To identify the most important parameters associated with cerebral white matter hyperintensities (WMH), in consideration of potential collinearity, we used a data-driven machine-learning approach. We analysed two independent cohorts (KORA and SHIP). WMH volumes were derived from cMRI images (FLAIR). In total, 90 (KORA) and 34 (SHIP) potential determinants of WMH, including measures of diabetes, blood pressure, medication intake, sociodemographics, life-style factors, somatic and depressive symptoms and sleep, were collected. Elastic net regression was used to identify relevant predictor covariates associated with WMH volume. The ten most frequently selected variables in KORA were subsequently examined for robustness in SHIP. The final KORA sample consisted of 370 participants (58% male; age 55.7 ± 9.1 years), the SHIP sample comprised 854 participants (38% male; age 53.9 ± 9.3 years). The most often selected and highly replicable parameters associated with WMH volume were, in descending order, age, hypertension, components of the social environment (i.e. widowed, living alone) and prediabetes. A systematic machine-learning based analysis of two independent, population-based cohorts showed that, besides age and hypertension, prediabetes and components of the social environment might play important roles in the development of WMH. Our results enable personal risk assessment for the development of WMH and inform prevention strategies tailored to the individual patient.

www.nature.com/scientificreports/
anthropometrics, life-style factors, somatic and depressive symptoms, and sleep were collected in a standardised manner as part of the KORA study design and have been described previously 24 . Briefly, the applied definition of hypertension was systolic blood pressure ≥ 140 mmHg, diastolic blood pressure ≥ 90 mmHg and/or use of antihypertensive medication, given that the individuals were aware of being hypertensive. "Hypertension, unknown" was defined as unawareness of hypertension in a participant with hypertension, "controlled hypertension" as self-reported diagnosis of hypertension by a physician and intake of antihypertensive medication. Medications were classified as antihypertensive if the compounds were regarded as antihypertensively effective by recent guidelines. Diabetes was determined either as established type 2 diabetes validated by a physician or by fasting glucose level and oral glucose tolerance test (OGTT). For the definition of prediabetes and diabetes the World Health Organization/International Diabetes Federation criteria were applied 27 . HbA1c values were assessed. Details on the collected parameters are provided in Supplementary Table 2.
MRI. In the KORA sample, image acquisition was performed on a single 3 T MRI system (Magnetom Skyra; Siemens AG, Healthcare Sector, Erlangen, Germany). WMH volume was assessed on T2w 3D-FLAIR sequences (SPACE, slice thickness (ST): 0.9 mm, 0.5 mm × 0.5 mm in-plane spatial resolution, repetition time (TR): 5000 ms, echo time (TE): 389 ms, inversion time (TI): 1800 ms, flip angle: 120°), in accordance with STRIVE recommendations 28 .
In the SHIP sample, imaging was performed on a single 1.5 T MRI system (Magnetom Avanto; Siemens AG, Healthcare Sector, Erlangen, Germany). WMH volume was assessed on T1w sequences (ST: 1.0 mm, 1 × 1 mm in-plane spatial resolution, TR: 1900 ms, TE: 3.4 ms, flip angle: 15°) and T2w 3D-FLAIR sequences (ST: 3.0 mm, 0.9 × 0.9 mm in-plane spatial resolution, TR: 5000 ms, TE: 325 ms, flip angle: 15°), in accordance with STRIVE recommendations 28 .
WMH volume. In the KORA sample, ITK-SNAP Version 3.6.0 was used for segmentation 29 . Cerebral WMH were manually segmented by a radiology resident (2 years of experience in neuroimaging), and edited and modified where necessary by a board-certified radiologist (7 years of experience in neuroimaging) on sagittally acquired FLAIR images reconstructed in the axial plane with a ST of 0.5 mm (see Fig. 1). For homogeneous image intensity the tool "auto-adjust contrast" in ITK-SNAP Version 3.6.0 was used 29 . WMH were defined as signal abnormalities of variable size in the white matter of the brain that show a hyperintense signal on FLAIR images 28 . WMH in the brainstem and the cerebellum were not included.
In the SHIP sample an automated multimodal segmentation algorithm for WMH quantification was used. The algorithm produced a probabilistic map that was further thresholded to generate a binary image 3 . Furthermore, to calculate WMH volume within specific regions of interest a multiatlas segmentation method was applied 30 . This included nonlinear registration of multiple atlases with ground-truth labels for every individual scan. Finally, WMH volume was determined for every region of the brain by masking WMH from all other regions 31 . For the present analysis, measurements from the frontal, parietal, temporal and occipital lobes were summarised. Image analyses were performed blinded to all clinical data as well as other measurements.
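The thresholding and regional masking steps described above can be sketched as follows. This is a minimal illustration and not the SHIP pipeline itself: the function name `wmh_volume_mm3`, the threshold of 0.5, the flattened toy inputs and the region label strings are assumptions made for the example. Only the voxel dimensions are taken from the SHIP FLAIR protocol stated earlier (0.9 × 0.9 mm in-plane resolution, 3.0 mm slice thickness).

```python
from typing import Dict, Sequence

# Voxel volume from the SHIP FLAIR protocol: 0.9 mm x 0.9 mm in-plane, 3.0 mm slices.
VOXEL_VOLUME_MM3 = 0.9 * 0.9 * 3.0

# Lobes whose WMH volumes are combined in the analysis.
LOBES = ("frontal", "parietal", "temporal", "occipital")


def wmh_volume_mm3(
    prob_map: Sequence[float],
    region_labels: Sequence[str],
    regions: Sequence[str] = LOBES,
    threshold: float = 0.5,  # assumed cut-off; the source does not state the value
) -> Dict[str, float]:
    """Threshold a probabilistic WMH map and sum the volume per region.

    prob_map and region_labels are flattened, voxel-aligned sequences:
    prob_map[i] is the WMH probability of voxel i and region_labels[i]
    is the atlas label of the same voxel.
    """
    if len(prob_map) != len(region_labels):
        raise ValueError("prob_map and region_labels must be voxel-aligned")
    volumes = {region: 0.0 for region in regions}
    for probability, label in zip(prob_map, region_labels):
        # Binarise the map, then keep only voxels inside the requested regions.
        if probability >= threshold and label in volumes:
            volumes[label] += VOXEL_VOLUME_MM3
    return volumes


# Toy example with six voxels; the brainstem voxel is masked out.
toy_probabilities = [0.9, 0.2, 0.7, 0.6, 0.95, 0.1]
toy_labels = ["frontal", "frontal", "parietal", "brainstem", "occipital", "temporal"]
toy_volumes = wmh_volume_mm3(toy_probabilities, toy_labels)
total_wmh_volume = sum(toy_volumes.values())
```

In the toy example three supra-threshold voxels fall inside the four lobes, so the total is three voxel volumes; the supra-threshold brainstem voxel does not contribute.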
Descriptive statistics. Continuous predictor covariates are described as arithmetic means with standard deviation (SD) or medians with 1st and 3rd quartile. Categorical predictor variables are presented as counts and percentages. P-values < 0.05 are considered to denote statistical significance.
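As a small illustration of the summaries named above, the following sketch computes mean with SD, median with 1st and 3rd quartile, and counts with percentages using only the Python standard library. The function names and toy values are hypothetical, and the quartile definition (the default "exclusive" method of `statistics.quantiles`) is an assumption; other software may use a slightly different quartile rule.

```python
import statistics
from collections import Counter


def describe_continuous(values):
    """Return mean, sample SD, median and 1st/3rd quartile of a numeric sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


def describe_categorical(values):
    """Return {level: (count, percentage)} for a categorical variable."""
    total = len(values)
    return {level: (n, 100.0 * n / total) for level, n in Counter(values).items()}


# Toy data, not taken from either cohort.
toy_summary = describe_continuous([1.0, 2.0, 3.0, 4.0, 5.0])
toy_counts = describe_categorical(["male", "female", "male", "male"])
```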
Analysis model. In both the KORA and SHIP sample, the outcome of interest was WMH volume on a continuous scale. We identified relevant predictor covariates associated with WMH volume in the KORA cohort by penalised zero-inflated negative binomial (ZINB) regression models based on elastic net regularization 32 . The ZINB model accounts simultaneously for the skewed distribution of WMH volumes with overdispersion ("count part") and the large point mass at zero stemming from those participants without WMH ("structural zeros"). Elastic net combines the properties of both Ridge and least absolute shrinkage and selection operator (LASSO) regression and is therefore appropriate for variable selection on potentially correlated covariates 22 . The amount of blending between Ridge and LASSO regression is regulated by the hyperparameter α (Ridge: α = 0, LASSO: α = 1). All analyses were computed on a grid of α values from 0.01 to 1 with stepwise increments of 0.1. The model was derived and evaluated on 1000 data splits. A data split was defined as a random division of the full data set into 90% training data and 10% testing data. By evaluating the model on 1000 data splits, each with mutually exclusive training and testing data, we ensured a very comprehensive internal model validation. Continuous covariates were standardised to mean = 0 and SD = 1. A ZINB model was computed on the training data, with WMH volume data being modelled by a negative binomial model using a logarithmic link. The shrinkage parameter λ was determined by internal tenfold cross validation on the training data with upper thresholds being fractions of 0.5 and 0.1 of λ max (the smallest value of λ for which all coefficients are shrunk to zero). Selection frequency across the 1000 splits served as a measure of variable importance.
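For orientation, the model and penalty described above can be written out as follows. This is a sketch in a common glmnet-style parameterisation, chosen because it matches the stated convention (Ridge: α = 0, LASSO: α = 1). The symbols μ_i, π_i, θ, β and γ, the logit link for the zero part, and the use of separate penalty parameters (λ, α) and (λ_0, α_0) for the count and zero parts are notational assumptions and are not taken from the original text.

```latex
\begin{align}
P(Y_i = 0) &= \pi_i + (1-\pi_i)\left(\frac{\theta}{\theta+\mu_i}\right)^{\theta}, \\
P(Y_i = y) &= (1-\pi_i)\,\frac{\Gamma(y+\theta)}{\Gamma(\theta)\,y!}
  \left(\frac{\theta}{\theta+\mu_i}\right)^{\theta}
  \left(\frac{\mu_i}{\theta+\mu_i}\right)^{y}, \qquad y = 1, 2, \dots \\
\log \mu_i &= \beta_0 + x_i^{\top}\beta, \qquad
\operatorname{logit}\pi_i = \gamma_0 + z_i^{\top}\gamma, \\
(\hat\beta, \hat\gamma, \hat\theta) &= \arg\min_{\beta,\gamma,\theta}\;
  -\ell(\beta,\gamma,\theta)
  + \lambda \sum_{j \ge 1}\left[\alpha\,|\beta_j| + \frac{1-\alpha}{2}\,\beta_j^{2}\right]
  + \lambda_0 \sum_{k \ge 1}\left[\alpha_0\,|\gamma_k| + \frac{1-\alpha_0}{2}\,\gamma_k^{2}\right]
\end{align}
```

Here π_i is the probability of a structural zero, μ_i the mean of the count part, θ the dispersion parameter and ℓ the ZINB log-likelihood. With α = 1 only the absolute-value term remains (LASSO), with α = 0 only the squared term (Ridge); the intercepts β_0 and γ_0 are left unpenalised in this sketch.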
To disentangle and assess the roles of different variables in their association with WMH volume, both the model's explanatory performance and its predictive performance have to be evaluated. Root Mean Squared Error (RMSE) served as a measure of predictive performance on the testing data and Akaike's Information Criterion (AIC) served as a measure of model fit, i.e. explanatory performance, on the training data. Coefficient estimates are reported as raw beta values which have to be exponentiated to obtain incidence rate ratios. For comparison, we calculated the Null ZINB model that includes no covariates and predicts the mean WMH volume for each participant. A likelihood-ratio test was used to formally assess the model fit of the final model compared to the Null model.
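The three quantities named above have simple closed forms, sketched here with the standard library only. The function names are hypothetical, and the AIC is written in its usual form 2k − 2·log-likelihood, which is assumed to be the definition used in the analysis.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted WMH volumes."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def aic(log_likelihood, n_parameters):
    """Akaike's Information Criterion: lower values indicate better fit."""
    return 2 * n_parameters - 2 * log_likelihood


def incidence_rate_ratio(beta):
    """Exponentiate a raw coefficient from the log-link count part."""
    return math.exp(beta)
```

For example, a raw beta of log(2) for a standardised covariate corresponds to an incidence rate ratio of 2, i.e. a doubling of the expected WMH volume per one-SD increase, and a beta of 0 corresponds to a ratio of 1 (no association).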
Predictors of WMH volume identified in the KORA sample were then ranked according to selection frequency across 1000 data splits. A cut-off value based on selection frequency was not applicable due to the varying numbers of parameters in the KORA sample (N = 90) and SHIP sample (N = 34). Consequently, the ten most frequently selected variables were subsequently examined in the SHIP sample. A negative binomial regression model was evaluated on the whole sample and compared to the Null model predicting constant WMH volume in terms of RMSE and AIC. Furthermore, analogous to the procedure on the KORA cohort, variables were evaluated on 1000 data splits with 90% training and 10% testing data and ranked according to selection frequency.
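The ranking by selection frequency described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' R code: `select` is a hypothetical callable that stands in for fitting the penalised model on the training part of a split and returning the names of the variables with non-zero coefficients. The function names, the seed and the toy selector are likewise assumptions.

```python
import random
from collections import Counter


def selection_frequencies(n_samples, select, n_splits=1000, train_fraction=0.9, seed=0):
    """Fraction of random train/test splits in which each variable is selected."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    n_train = round(train_fraction * n_samples)
    counts = Counter()
    for _ in range(n_splits):
        rng.shuffle(indices)
        train_idx = indices[:n_train]  # remaining indices form the testing part
        # Count each variable at most once per split.
        counts.update(set(select(train_idx)))
    return {name: count / n_splits for name, count in counts.items()}


def top_k(frequencies, k=10):
    """Variable names ranked by descending selection frequency (ties by name)."""
    return sorted(frequencies, key=lambda name: (-frequencies[name], name))[:k]


def toy_select(train_idx):
    # Hypothetical stand-in for a penalised model fit: "age" is always
    # selected, "sleep" only when participant 0 is in the training part.
    return ["age", "sleep"] if 0 in train_idx else ["age"]


toy_frequencies = selection_frequencies(20, toy_select, n_splits=200, seed=1)
toy_ranking = top_k(toy_frequencies, k=10)
```

In the analysis itself, the held-out 10% of each split would additionally be used to compute the RMSE; that step is omitted here to keep the sketch focused on the ranking.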
As a sensitivity analysis, the complete machine-learning pipeline was re-run on a subsample of N = 333 participants of the KORA study with available data on intracranial volume (ICV), including ICV as an additional predictor to all the other variables. R version 3.4.4 was used for all statistical analyses, including descriptive statistics. Package zipath 0.3-5 was used for calculation of ZINB models.

Results
Study population. In the KORA sample of 400 participants, 12 had to be excluded due to insufficient MRI image quality, 2 due to visible lesions with other aetiology (1 participant with lesions suspicious for multiple sclerosis; 1 participant with a non-WMH-like FLAIR-hyperintense lesion in the left parietal lobe) and 16 participants due to missing covariate data. The final KORA sample consisted of 370 participants (58% male; age: 55.7 ± 9.1 years). In the SHIP sample of 2188 participants, 229 had to be excluded due to missing WMH data, 322 due to insufficient MRI image quality and 415 due to missing covariate data. In addition, for consistency between the KORA and SHIP study, 86 SHIP participants with prior myocardial infarction, stroke or revascularization and 368 SHIP participants younger than 39 years or older than 73 years were excluded. The final SHIP sample consisted of 854 participants (38% male; age 53.9 ± 9.3 years), as presented in Fig. 2. Further details are presented in Table 1. In the KORA sample, mean WMH volume was 2798 ± 7392 mm³ (median: 997 mm³) compared to 532 ± 1750 mm³ (median: 135 mm³) in the SHIP sample. The distribution of WMH volume is presented in Fig. 3, an example of different WMH volumes in Fig. 4.
Variable selection was relatively stable across all values of α, as presented in Supplementary Figure 1. In the sensitivity analysis including only KORA participants with available ICV data, ICV was not selected among the top ten predictors (Supplementary Table 4).

Robustness testing of predictors of WMH volume-SHIP sample.
In the SHIP sample, a negative binomial model incorporating the top ten predictors from the KORA sample yielded an RMSE of 1667 mm³ and an AIC of 11,600 on the whole cohort, whereas the Null model (predicting constant WMH volume) yielded an RMSE of 1749 mm³ and an AIC of 12,000 on the whole cohort. When evaluating the elastic net regression model on 1000 data splits, the best model in terms of RMSE and AIC was obtained for α = 1 (RMSE = 1499 mm³, AIC = 10,443). For both KORA and SHIP, prediction seemed to be worse for high WMH volumes, i.e. on average the model underestimated true WMH volumes (Supplementary Figure 5). Ranking of the variables according to selection frequency is presented in Fig. 5 and Supplementary Figure 4. "Age" (selection frequency 100%), "controlled hypertension" (100%), "unknown hypertension" (97%) and "prediabetes" (66%) were replicated as important predictors. Furthermore, while "widowed" family status was not replicated in the SHIP sample, "separated or divorced" (87%) and "living alone" (80%) were selected. Other variables were either not replicated (HbA1c, antiplatelet medication) or showed different effect directions compared to the KORA sample (alcohol consumption, physical activity, NSAID medication).

Discussion
In a population-based sample, we performed a data-driven analysis without a priori hypotheses including 90 different parameters in order to disentangle and better understand the respective roles of these parameters in WMH volume. Relevant parameters were re-examined in an independent population-based sample.
Considering that WMH are associated with cognitive decline, increased stroke risk and worse outcome post stroke, decreased mobility due to gait disturbance as well as increased risk of depression, having a clear picture of the associated risk factors is important, especially regarding treatment and prevention 2,33-37 . Although a lot of information is nowadays available on the epidemiology and risk factors of WMH, some of these data are conflicting 1 . Given that many partially intercorrelated factors with potentially small effect sizes impact WMH volume, drawing a clear picture of the most powerful determinants of WMH volume is challenging. In order to overcome limitations of traditional regression models we used a machine-learning approach that allows for the selection and ranking of the most important factors out of a large number of variables, even when intercorrelation is present. Our machine-learning based model identified age and hypertension as well-established determinants of WMH volume. Additionally, the model identified the less established, more controversial parameters "prediabetes", "HbA1c", "alcohol consumption", "NSAID medication" and components of the social environment such as "widowed marital status" as potential factors that contribute to WMH burden. Interestingly, ICV was not selected among the ten most important predictors. We hypothesise that in our analysis the association of ICV with WMH was captured by other variables in the model.
Diabetes-related atherosclerosis appears to be an important component in the development of WMH 38,39 . However, the relation between diabetes and particularly prediabetes and WMH is under debate 1,12,14 . Studies assessing the correlation between WMH and HbA1c show conflicting results 40,41 . Interestingly, our analysis yielded prediabetes, but not diabetes, as a relevant predictor of WMH volume. This result was replicated in the SHIP sample. Furthermore, HbA1c was only identified in the KORA sample, but failed to replicate in the SHIP sample. These results might be due to the co-occurrence of diabetes and hypertension and the different risk factor distribution in the two studies: Individuals with diabetes in SHIP are more likely to have hypertension (prevalence of "controlled hypertension" is 41%) compared to KORA (prevalence of "controlled hypertension" is 31%). Therefore, unfavourable effects of diabetes might be superimposed by stronger effects of unfavourable blood pressure profiles.
Figure note (4): CVD exclusion criteria were prior myocardial infarction, stroke or revascularization. KORA participants fulfilling these criteria were not eligible for MRI by study design 26 . For consistency between the two studies SHIP participants fulfilling these criteria were excluded.
Our results thus suggest that the prediabetic phenotype as a dynamic state between normoglycemia and type 2 diabetes represents an independent risk factor for the development of WMH. Hence, individuals with prediabetes would need more comprehensive assessments for signs of early pathophysiological changes, preventive measures and adequate treatment not only to stop the underlying development of diabetes, but also to avoid the development of WMH-associated morbidity.
The relation between WMH and alcohol consumption is still unclear. Heavy alcohol consumption might lead to a higher WMH burden through cerebrovascular effects of associated hypertension 42 . However, after correcting for hypertension, heavy alcohol consumption did not show a significant association with WMH in previous analyses 43,44 . Prior studies also demonstrated a protective effect of moderate alcohol consumption on the development of WMH through multiple potential pathways, including anti-atherosclerotic, anti-thrombotic and anti-inflammatory mechanisms reducing the risk of cardiocerebrovascular morbidity 43,45,46 . However, while moderate alcohol consumption was associated with decreased WMH volume in the SHIP sample, it was associated with increased WMH volume in the KORA sample; heavy alcohol consumption was associated with increased WMH volume in the SHIP sample.
In both the KORA sample and the SHIP sample, "NSAID medication" was identified as a relevant predictor of WMH volume, albeit with different effect directions. NSAID medications are generally used for pain and inflammation treatment, which comprises multiple conditions, thus rendering the groups under treatment quite heterogeneous. In KORA, all participants under NSAID medication reported regular intake of their medication, as opposed to intake as needed. No distinction between regular intake and intake as needed of NSAID medication was made in SHIP. This does not only explain the difference in prevalence (2.2% in KORA vs. 8.5% in SHIP), but might indicate that participants in KORA are affected by more severe and chronic pain. Severity of pain could be a relevant determinant of WMH volume. In this regard, our findings might support findings of previous studies that report associations of pain and increased WMH burden 47 .
The elastic net model showed a strong correlation between age and WMH burden. This result is in keeping with literature suggesting that age is the most important risk factor for WMH 1,10 . WMH are a common finding in elderly people, where WMH burden increases with age 1 . As such, WMH may to some extent be part of the normal aging process of the brain, yet precise data on the burden of WMH that can be regarded as "normal" at a certain age do not exist. As populations are aging, WMH-related morbidities, such as cognitive decline and increased stroke risk will have an increasing impact on individuals and health care systems 34,35 .
Hypertension is strongly associated with WMH and is probably their most important modifiable risk factor. Several studies clearly indicate that increased systolic as well as diastolic blood pressure favours the development of WMH 1,3,5,7-9 . Our results are consistent with these findings. The variables "hypertension, controlled" and "hypertension, unknown (hypertension unawareness of a participant with hypertension)" were both among the 10 most frequently selected variables of the elastic net model associated with WMH. The frequent selection of the variable "hypertension, controlled" in the elastic net model could be indicative of irreversible brain damage caused by micro- and macroangiopathy in the pre-treatment episode of hypertension 38 .
A meta-analysis showed that higher physical activity was cross-sectionally associated with lower WMH volume, although effect sizes were small and many studies reported null findings 48 . A recent longitudinal study and a recent intervention trial found no effect of physical activity on WMH volumes 49,50 . In our study, high physical activity was associated with decreased WMH volume in the KORA sample, but not in the SHIP sample, indicating that the role of physical activity is unstable.
"Widowed family status", "separated or divorced" and "living alone" are components of the social environment that were revealed to be relevant predictors of WMH volume in our study. It can be hypothesised that these predictors might comprise a cluster of mental-health related factors, such as loneliness, anxiety or post-traumatic stress disorder, which were not assessed in this study 38 . The importance of social networks and stressful life events for mental well-being and health in general 51,52 , as well as the association of socioeconomic factors with WMH have been established 53 . It was also shown that widowhood accelerates cognitive decline in cognitively normal older adults 54 . Our results support this finding, bearing in mind that WMH are associated with cognitive decline 3 . However, further studies are needed to clarify the association of components of the social environment with WMH volume.
The results of this study need to be interpreted in light of its limitations. The regularised regression employed here is not a causal model in the formalised sense and thus cannot identify whether the reported variables are etiologically linked to WMH volume. For observational data, different statistical tools can be used to evaluate causality, e.g. graphical models such as directed acyclic graphs, methods based on counterfactuals from the potential outcomes framework, methods based on instrumental variables such as Mendelian Randomization which emulate the design of randomised controlled trials, or structural causal models 55 . These methods, however, require prior knowledge and assumptions about the potential etiologic layout in the analysed variables. With our statistical model, we used a hypothesis-free approach without assumptions about the underlying etiologic factors. It has been emphasised, especially within the epidemiologic field, that evidence from different study designs and models should be taken into account to investigate causality 56 . We therefore believe that the results of our prediction-based analysis can provide useful starting points to inform further, more formalised, causal reasoning.
For (zero-inflated) negative binomial models as employed here, easily interpretable metrics of the proportion of outcome variance explained are not readily available. Therefore, the relative contribution of the respective predictor variables has to be assessed by the inclusion frequencies only.
In the same vein, the regularization by elastic net and the underlying ZINB regression represent an intricate multi-layered model with complex interpretation. However, elastic net regularization is an appropriate and established method for variable selection, and the ZINB model captures the data distribution best. By presenting a ranking according to inclusion frequencies of the identified variables, we can still provide an adequate interpretation of the findings.
Furthermore, in the KORA sample mean WMH volume was significantly higher than in the SHIP sample. Possible explanations for this discrepancy in WMH volume are different measurement methods and different study collectives. However, identical methodologies across large population-based studies are not to be expected, and the fact that some parameters were consistently associated with WMH volume in both KORA and SHIP does show a certain robustness of the association. Further well-characterised MRI studies are needed to corroborate these findings.

Conclusion
In conclusion, a systematic machine-learning based analysis of 90 parameters showed in two independent samples that, besides age and hypertension, prediabetes and components of the social environment (i.e. widowed, living alone) might play important roles in the development of WMH. Our results therefore enable personal risk assessment for high WMH burden and inform prevention strategies tailored to the individual patient.