Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study

We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decision trees (GBDT) and important predictors were identified using a Shapley values-based feature attribution method, SHAP values. Cox models controlled for false discovery rate were used for confounder adjustment, interpretability, and further validation. The pipeline was tested using information from 502,506 UK Biobank participants, aged 37–73 years at recruitment and followed over seven years for mortality registrations. Of the 11,639 predictors included in GBDT, 193 potential risk factors had SHAP values ≥ 0.05, passed the correlation test, and were selected for further modelling. Of the summed variable importance, 60% was directly health related, and baseline characteristics, sociodemographics, and lifestyle factors each contributed about 10%. Cox models adjusted for baseline characteristics showed evidence for an association with mortality for 166 of the 193 predictors. These included mostly well-known risk factors (e.g., age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, and many disease outcomes). For 19 predictors we saw evidence for an association in the unadjusted but not the adjusted analyses, suggesting bias by confounding. Our GBDT-SHAP pipeline was able to identify relevant predictors 'hidden' within thousands of variables, providing an efficient and pragmatic solution for the first stage of hypothesis-free risk factor identification.

www.nature.com/scientificreports/

Machine learning (ML) approaches such as neural networks 11 have been found to outperform traditional risk scoring systems 4,5,12,13. Supplementary Note online describes these ML approaches. Among the strongest approaches is GBDT, which, according to a review comparing 13 different state-of-the-art ML methods, was ranked as the best of all methods in tasks related to predictive analytics (appreciating that no single algorithm performs best across all datasets) 14.
In this study, our intention is not to build competing predictive models or to argue that ML methods (specifically, GBDT) are 'better' or that epidemiological analyses are 'worse', but to present an approach which can accommodate their distinct strengths and limitations. We test the proof of principle of such an approach for its ability to discover potential risk factors amongst thousands of predictors by combining GBDT modelling with standard epidemiological practices. We use data from over 11,000 predictors and mortality for over 500,000 participants in the UK Biobank. Our novel analysis pipeline uses GBDT (CatBoost implementation 15) for its inherent capability to handle missing values and large volumes of data without having to convert variables to any specific format. We screen for potentially interesting predictors using SHAP (SHapley Additive exPlanation) values 16,17, a Shapley value-based additive feature attribution method reflecting variable 'importance', to guide the selection of mortality predictors for further epidemiological modelling. Furthermore, we use penalized (LASSO) logistic regression as an alternative baseline approach and also include comparisons with another feature selection method (XGBoost 18 using five different built-in ways of calculating feature importance), as done by other studies [19][20][21][22]. As we describe in this paper, our data suggest that the proposed GBDT-SHAP pipeline has certain advantages over them.

Methods
Participants. The UK Biobank is a cohort of over 500,000 participants recruited between March 13, 2006 and October 1, 2010 through 22 assessment centers across England, Wales, and Scotland 23. Data collection during the baseline assessment covered touchscreen questionnaire surveys, face-to-face interviews, and physical measurements, with blood sampling and urine collection for genetic assays and biomarker assessments. Further information on disease outcomes was obtained through record linkages, including mortality statistics from the UK Office for National Statistics, cancer registrations, and hospital episode statistics.
The outcome variable indicating the mortality status of participants as of March 1, 2016 was created using the UK Biobank date of death field (field 40000). In this study, we considered information collected at the baseline assessment, including data obtained using the touchscreen questionnaires and results from clinical examinations. In addition, we included disease codings derived from linkage to cancer registrations and hospital episode statistics. We removed baseline variables which were recorded for less than 95% of the participants. Information obtained from online follow-up surveys or sub-samples of the cohort was excluded from our analyses due to its low coverage. Supplementary Table S1 online lists all the variables included.
The UK Biobank project was approved by the National Information Governance Board for Health and Social Care and North West Multi-center Research Ethics Committee (11/NW/0382). Participants provided electronic informed consent to use their anonymized data and samples for health-related research, to be recontacted for further sub-studies and for the UK Biobank to access their health-related records 24 . This study was conducted under application number 20175 to the UK Biobank and all methods were performed in accordance with the relevant guidelines and regulations.
Model development pipeline and statistical analyses. The GBDT-SHAP pipeline is shown in Fig. 1.
As the data were not sufficiently structured for our analyses, we used PHESANT (PHEnome Scan ANalysis Tool) 25, a software package specifically designed for the UK Biobank and available in R, and ran an automated pre-processing step before developing ML models (Supplementary Methods online). Below we describe the analyses using the GBDT-SHAP pipeline; details of the comparisons against LASSO and against XGBoost with built-in feature importance methods for feature selection are given in Supplementary Methods online.
Identifying potential risk factors. Potential risk factors were identified by following six steps, namely: (a) learning curve analysis to determine a sufficient amount of data for training, (b) developing GBDT models with all available predictors and assessing model performance, (c) calculating variable importance using SHAP values and eliminating predictors based on a threshold, (d) further elimination of highly monotonically correlated predictors, (e) learning curve analysis to determine a sufficient amount of data for training with the reduced set of predictors, and (f) ensuring that the reduced set of predictors is appropriate from a model performance perspective as well as from an epidemiological perspective.
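Steps (c) and (d) can be illustrated with a minimal stdlib sketch; in practice the mean absolute SHAP values come from a SHAP library applied to the trained GBDT model, and the helper names below are ours, not the authors':

```python
from itertools import combinations

def normalize_importance(mean_abs_shap):
    # Step (c): express each predictor's mean |SHAP| value as a
    # percentage of the total, so importances sum to 100%.
    total = sum(mean_abs_shap.values())
    return {name: 100 * v / total for name, v in mean_abs_shap.items()}

def _ranks(xs):
    # 1-based average ranks, with ties sharing their mean rank.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman's rho = Pearson correlation computed on the ranks.
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def drop_correlated(data, n_recorded, threshold=0.9):
    # Step (d): within each highly correlated pair (|rho| > threshold),
    # keep only the predictor recorded for the most participants.
    dropped = set()
    for a, b in combinations(sorted(data), 2):
        if a in dropped or b in dropped:
            continue
        if abs(spearman(data[a], data[b])) > threshold:
            dropped.add(a if n_recorded[a] < n_recorded[b] else b)
    return [name for name in sorted(data) if name not in dropped]
```

With near-duplicate predictors such as 'left leg fat percentage' and 'right leg fat percentage', the one recorded for more participants survives the filter, matching the rule described in the text.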
Our GBDT models used in the above steps are binary classifiers: their input is the set of predictors for each individual and their output is the model's confidence in the mortality status of that individual. The classes were imbalanced (the death rate was around 2.9%), and to address this class imbalance all our ML models were developed with the hyperparameter for scaling the positive class weight set to the ratio of negative to positive training samples 26,27. We initially split the entire data into random training, development, and test sets at the ratio of 60:20:20. The training and development sets were used as the derivation cohort and the test set as the validation cohort. To avoid overfitting to the training data (which is more common in high-dimensional datasets 28), we used the development set in all our GBDT models for early stopping of training, thus effectively tuning the hyperparameter for the number of estimators. We assessed model performance using the area under the receiver operating characteristic curve (AUROC), a widely used threshold-independent metric for assessing binary classifiers. Confidence intervals for AUROC were calculated using 1000 bootstrap 29 datasets based on the test set. We also report sensitivity and specificity at the Youden index optimal cut-off point 30. We used CatBoost version 0.21 implemented in Python (version 3.5.2, Python Software Foundation) for GBDT model development.
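As an illustration, AUROC and its percentile-bootstrap confidence interval can be computed from test-set labels and scores with stdlib Python alone (a sketch with function names of our own choosing; in practice a library routine would be used):

```python
import random

def auroc(labels, scores):
    # AUROC equals the probability that a randomly chosen positive case
    # is scored above a randomly chosen negative one (ties count half).
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample the test set with replacement and
    # take the empirical alpha/2 and 1 - alpha/2 quantiles of AUROC.
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:                    # need both classes present
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

The class-weight hyperparameter mentioned above is simply the ratio `n_negative / n_positive` computed on the training split.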
In step (a), we trained GBDT models with increasing numbers of training samples from the training split, starting from 20,000 participants and incrementing by 20,000 each time until all training samples were used, and verified the adequacy of training samples. Subsequently, we developed GBDT models with all available predictors and assessed their performance in step (b). In step (c), we calculated the importance of each predictor as the mean absolute SHAP value in the training set and normalized the variable importances so that they summed to 100%. We eliminated 'irrelevant' predictors using an arbitrarily chosen SHAP threshold. We explored different thresholds to identify 'important' predictors and assessed the effects on model performance when using reduced sets of predictors (step (f)). We used Spearman's ρ (above 0.9) to identify sets of highly correlated predictors and removed all but one (the one recorded for the greatest number of samples) from each set to produce the final set of predictors for further epidemiological analyses. This step avoided carrying forward predictors such as both 'left leg fat percentage' and 'right leg fat percentage' for further analyses. We repeated the learning curve analysis (step (e)) and the model performance check (step (f)) with GBDT models using the reduced set of predictors before proceeding to further analyses.

Epidemiological analyses. As the next stage in our proof of principle analyses, we fitted Cox regression models to allow for confounder adjustment and direct interpretation of the associations between the predictors and mortality risk 31. We fitted univariate Cox models, and models adjusted for baseline predictors identified as 'important' by GBDT, including age, sex, Townsend deprivation index, assessment center, and month of birth, using the test dataset. We used FDR to account for multiple testing. We present data as hazard ratios (HR) and their 95% confidence intervals.
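The exact FDR procedure is not named here; the Benjamini-Hochberg step-up procedure is the usual choice, so the following stdlib sketch is our assumption rather than the authors' stated method:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    # Benjamini-Hochberg step-up procedure: sort the m p-values, find
    # the largest rank k with p_(k) <= (k / m) * alpha, and reject the
    # hypotheses corresponding to the k smallest p-values.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected
```

Associations whose entry comes back `True` would be reported as surviving the FDR correction; note the step-up behaviour, where a small p-value late in the ranking can rescue larger ones before it.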
For selected known mortality predictors which were picked up by GBDT but not supported by simple Cox modelling, we fitted non-linear models and accounted for selected interactions. We constructed a loop iterating over two-way interactions between predictors which had shown evidence for an association in the GBDT-SHAP pipeline but were not confirmed in the Cox models and the other predictors identified by the GBDT-SHAP pipeline, to further examine mortality associations. All interaction analyses in the loop were adjusted for baseline predictors as in the other analyses. As our intention was to test for proof of principle in risk factor discovery rather than predictive modelling, for simplicity we interpret coefficients from Cox models as 'average associations', avoiding the requirement to test the proportional hazards assumption. All epidemiological models were fitted using Stata (version 15, StataCorp, College Station, TX, USA).
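The interaction loop has a simple structure, sketched below; `fit_interaction` is a hypothetical stand-in for fitting a baseline-adjusted Cox model with the pairwise interaction term and returning its P-value (the actual models were fitted in Stata), and the predictor names in the usage are illustrative:

```python
from itertools import product

def interaction_scan(unconfirmed, confirmed, fit_interaction):
    # Pair every GBDT-identified predictor lacking support in the simple
    # Cox models with every other GBDT-SHAP predictor, and collect the
    # interaction-term P-value for each pair.
    pvals = {}
    for a, b in product(unconfirmed, confirmed):
        if a != b:
            pvals[(a, b)] = fit_interaction(a, b)
    return pvals

def flag_interactions(pvals, alpha=0.05):
    # Report pairs whose interaction term reaches the chosen threshold.
    return sorted(pair for pair, p in pvals.items() if p < alpha)
```

For example, with `unconfirmed = ["height_age10"]` and the confirmed GBDT-SHAP predictors as the second list, the loop would surface pairs such as comparative height at age 10 with secondary brain neoplasm, as reported in the Results.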

Results
Participant characteristics. Of the 502,506 participants included in our study, 14,421 (2.9%) died over a median of 7 years (IQR 6.3–7.7 years) of follow-up. Table 1 shows the distribution of the study population and mortality according to selected baseline characteristics. The training and development sets had 8552 and 2941 deaths respectively, while the test set contained 2928 deaths. Participants who died during follow-up were older and more commonly male compared with those who stayed alive. Those who died during the follow-up period were also less educated, had poorer self-rated health, were more often current or previous smokers, and came from more deprived backgrounds.
Pre-processing. PHESANT pre-processing, after applying our missing value criterion for baseline predictors, derived 11,639 predictors falling under ten broad categories: baseline characteristics, demographics, lifestyle and environment, physical measurements, cognitive function, psychosocial factors, self-reported diseases, medications and operations, health and medical history, and hospital diagnoses (Supplementary Table S2 online). Hospital diagnoses (through record linkage) accounted for 98% of the predictors.
Identifying potential risk factors. Our learning curve analysis using all the predictors showed improvements in AUROC as more training samples were used (Supplementary Fig. S1 online), with gains levelling off in the range of 40% to 60% of the samples used for training. Figure 2 shows the category-wise predictor importance distribution and Supplementary Table S3 online lists all important predictors. Hospital diagnoses, health and medical history, and self-reported health jointly covered about 60% of the summed variable importance, with baseline characteristics (e.g., age, sex), sociodemographics (e.g., employment, education, housing, ethnicity), and lifestyle factors (e.g., smoking, physical activity, diet) each contributing about 10%. Since mean absolute SHAP values do not directly indicate the direction of association, we show the SHAP summary plot for all important predictors in Supplementary Fig. S3 online.
Epidemiological analyses. In Cox models adjusted for age, sex, Townsend deprivation index, assessment center, and month of birth, 166 of 193 predictors had an association with mortality at P < 0.05 after correcting for FDR (Supplementary Table S4 online). Supplementary Fig. S4 online shows hazard ratios for all the important predictors. Confirmed predictors included expected mortality associations for various disease outcomes, sociodemographic characteristics, and some lifestyle indicators. After FDR correction, 19 predictors showed evidence of association in the unadjusted models but not in the adjusted models, such as length of time at current address, sensitivity/hurt feelings, worrier/anxious feelings, guilty feelings, risk taking, hearing difficulties, whole body fat-free mass, experiencing headache and knee pain in the last month, and diagnoses of inguinal hernia, polyp of colon, and gonarthrosis. Eight predictors did not meet the P-value threshold in either the unadjusted or adjusted models, including month of birth, comparative height size at the age of ten, cheese intake, handedness, irritability, using a gas fire in winter time, gastro-esophageal reflux disease without esophagitis, and other and unspecified malignant neoplasm of skin of other and unspecified parts of face. To understand why these eight predictors may have been picked up by GBDT-SHAP modelling, we looked in more detail at their associations with mortality. For example, GBDT picked up month of birth as an important predictor, and when we recoded it to seasons, both univariate and multivariate models showed modest evidence for an association (P ≤ 0.02). Interaction loop analyses suggested that the association between comparative height at age 10 and mortality might have arisen from an interaction with secondary malignant neoplasm of brain and cerebral meninges (P interaction = 1.48E−05).
Similarly, gastro-esophageal reflux disease without esophagitis showed some evidence for interaction with hypertension (P interaction = 0.02), and malignant neoplasm of skin of other and unspecified parts of face had an interaction with fed-up feelings (P interaction = 0.009). Other factors such as cheese intake, handedness, irritability, and gas/solid fuel cooking all had low SHAP values (all < 0.08%). Although the univariate Cox model showed an association between tea intake and mortality, the adjusted Cox model showed no evidence for a linear association (P = 0.31). However, there was significant non-linearity (P curvature = 2.27E−10), with lower mortality for participants drinking 1 to 7 cups per day compared to non-drinkers and the very high intake group (Supplementary Fig. S5 online).

Comparison of feature selection: GBDT-SHAP versus LASSO and XGBoost. LASSO performed well in identifying disease-associated features (Supplementary Table S5). However, LASSO did not select BMI (or other adiposity indicators), possibly due to a non-linear association with mortality. LASSO also did not select other important predictors such as the Townsend deprivation index or age at cancer diagnosis, which were picked up by the GBDT-SHAP pipeline. Age at cancer diagnosis is a variable from the cancer registration linkage, where information is only included for participants who had had cancer (84% of the values were missing). While LASSO did not detect the relevance of this information for mortality prediction in the presence of missing information, the GBDT-SHAP pipeline ranked this feature as the second most important. XGBoost was expected to return features similar to GBDT-SHAP. However, in analyses of adjusted stability scores, the GBDT-SHAP pipeline had higher adjusted feature stability scores and the highest stability scores from the top 200 features onwards (Supplementary Fig. S7). More importantly, we observed that the default feature importance method of XGBoost, 'weight' (based on the number of times a feature was used for splitting when creating decision trees), had poorer adjusted stability scores as compared to the GBDT-SHAP pipeline.

Discussion
We examined the value of the GBDT-SHAP pipeline in risk factor discovery using mortality prediction in the UK Biobank as the test case. Our test case picked up the expected predictors (e.g., age, sex, palliative care, disease diagnoses) and many other well-known associations (e.g., smoking, BMI, social differentials). This demonstrates the effectiveness and viability of the GBDT-SHAP pipeline for large-scale hypothesis-free screening in this type of multivariable context where standard epidemiological approaches are not feasible. It also provides a better alternative to approaches such as LASSO, in terms of capturing non-linear predictors, and to embedded feature selection methods such as XGBoost with built-in feature importance, in terms of better feature stability. Feature stability is an important aspect of feature selection for domain experts as it provides assurance that the selected features are robust to perturbation of the input data 32. Other studies have also shown the inconsistent feature attribution behavior of feature selection methods such as gain and split count used in XGBoost 16. Our approach is of particular interest in the context of relatively rare diseases about which we know little, where large-scale data now provide the first opportunities to identify candidates for prevention. ML methods are increasingly used in disease prognosis, and there is one previous study using ML to predict all-cause mortality in the UK Biobank. This earlier study compared artificial neural network and random forest methods against Cox regression using a set of 60 variables selected based on their biological plausibility 7. As an additional validation, our hypothesis-free approach, including over 11,000 predictors in the GBDT-SHAP pipeline, picked up all of these risk factors, or their equivalents, where included in our dataset.
These included key characteristics such as age, sex, ethnicity, education, Townsend deprivation index, prior cancer diagnoses, smoking, physical activity, blood pressure, diabetes, and adiposity, confirming that our approach is able to identify relevant indicators 'hidden' amongst thousands of predictors. While most of the mortality predictors identified in our models were logical and expected, in contexts where less is known about potential predictors, our comprehensive hypothesis-free approach shows great promise for the identification of novel risk factor candidates. Although GBDT and other ML models tend to be complex and less interpretable than traditional approaches 33,34, a strength of this approach is the ability to identify relevant risk factors in the context of interactions and non-linear associations. Here, epidemiological analyses using Cox or other generalized linear models require careful model construction, which is often impractical when dealing with a very large number of predictors and complex unknown interactions. Data pre-processing requirements are lower for GBDT than for standard epidemiological approaches, and our analyses provided examples where non-linear associations which would have remained hidden in standard epidemiological analyses were picked up by the GBDT-SHAP pipeline. Another strength is the ability to incorporate information from thousands of predictors and to better cope with missing information (without the need to impute using linear approaches such as MICE 35 or non-linear approaches such as MissForest 36) in this type of multivariable context. However, in this type of real-life setting, the GBDT-SHAP approach is unsuitable for the simultaneous inclusion of data from the UK Biobank sub-samples collected after baseline, as participation in the follow-up surveys is correlated with mortality.
The GBDT-SHAP pipeline also picked up indicators which were associated with the outcome purely due to confounding, as shown by the associations of several of the identified 'important' predictors being explained by a standard adjustment for baseline factors. Confounding and multicollinearity can also notably affect the SHAP-based importance ranking. For example, while the number of cancer diagnoses was among the most important factors, SHAP ranked some cancer diagnoses as less important than age-related predictors such as experiencing knee pain last month, sitting height, and gonarthrosis. For risk factor discovery this may matter relatively little as long as at least one relevant indicator is picked up; however, it highlights the importance of replication and more detailed modelling, with caution required when interpreting apparent associations without clear explanations.
One of the challenges with our approach arises from the need to account for multiple testing and the lack of a pre-specified cut-off value for considering indicators 'important' in the context of feature selection. Some authors suggest that 3% of the total number of features could be considered relevant if the number of predictors included exceeds 100 37. Our choice was based on a SHAP value threshold (0.05), which selected a slightly smaller proportion of features (1.87%). Our chosen threshold resulted from a pragmatic assessment in which we aimed to be inclusive enough to allow possibly relevant and interesting features to be taken forward for further analyses, while at the same time limiting their number to a manageable amount and not losing too much of the predictive ability of the model. After pre-screening with the SHAP threshold, we further applied an FDR correction to account for multiple testing, with the aim of reducing Type I error. Another pragmatic approach in this context would be to skip the pre-screening by SHAP threshold and to use a Bonferroni correction based on the total number of features when determining the P-value threshold for the epidemiological analyses. However, this would be overly conservative and would increase the risk of Type II error, as suggested by the failure to identify well-known mortality risk factors such as BMI and other adiposity indices. Having said that, even this approach would not have led to findings very dissimilar to those reported, as a Bonferroni correction based on the total number of features would have led to 133 features for follow-on analyses (compared to 193 with our approach).
Our study demonstrates some of the opportunities in ML-based risk factor discovery utilizing a recent implementation of GBDT (CatBoost) and a Shapley values-based feature importance method (TreeSHAP). There are also limitations, some of which are specific to the dataset. Indeed, reliable analyses from any model require an understanding of the data from which the results are derived. Here, the UK Biobank is a cohort of volunteers with higher education and socio-economic status, and lower mortality rates, compared to the general population 38. This type of healthy volunteer bias may affect the external validity of our findings. However, it was reassuring that our data-driven approach identified the traditional risk factors, suggesting the ability to obtain valuable insights in other, less explored settings of risk prediction. Furthermore, as all the analyses in our study were done using a single dataset, we cannot exclude problems with overidentification. Also, as our model used only those predictors available for at least 95% of the participants, we may have left out important determinants which had not been captured in the full cohort. In this observational method exploration, we also cannot establish causal effects, and as we only adjusted for very basic covariates in our proof of concept test case, confounding is likely to explain some of the associations.
In conclusion, our data-driven, hypothesis-free approach utilizing specific ML methods was a viable, fast, and pragmatic approach to risk factor discovery in highly phenotyped, high-dimensional tabular data. Our approach was able to pick up traditional risk factors from among thousands of possible predictors and showed potential for discovering relevant mortality predictors in the context of interactions, non-linear associations, and missing values. However, to ensure interpretability of the identified predictor-outcome associations, more detailed modelling utilizing domain expertise and traditional methods is still required.