Predictive Big Data Analytics using the UK Biobank Data

The UK Biobank is a rich national health resource that provides enormous opportunities for international researchers to examine, model, and analyze census-like multisource healthcare data. The archive presents several challenges related to aggregation and harmonization of complex data elements, feature heterogeneity and salience, and health analytics. Using 7,614 imaging, clinical, and phenotypic features of 9,914 subjects we performed deep computed phenotyping using unsupervised clustering and derived two distinct sub-cohorts. Using parametric and nonparametric tests, we determined the top 20 most salient features contributing to the cluster separation. Our approach generated decision rules to predict the presence and progression of depression or other mental illnesses by jointly representing and modeling the significant clinical and demographic variables along with the derived salient neuroimaging features. We reported consistency and reliability measures of the derived computed phenotypes and the top salient imaging biomarkers that contributed to the unsupervised clustering. This clinical decision support system identified and utilized holistically the most critical biomarkers for predicting mental health, e.g., depression. External validation of this technique on different populations may lead to reducing healthcare expenses and improving the processes of diagnosis, forecasting, and tracking of normal and pathological aging.

health, such as unable to work because of sickness or disability, was the strongest predictor of all-cause mortality in men and a previous cancer diagnosis was the strongest predictor of all-cause mortality in women. When excluding individuals with major disease or mental disorders, measures of smoking habits were the strongest predictors of all-cause mortality. Notably, smoking habits and the other strongest predictors may be obtained simply from quick questionnaires, without extensive physical examination. Thus, for high-risk individuals, some specific univariate clinical outcomes may be easily identified, leading to the establishment of effective public health policies. However, many disorders are complex, heterogeneous, and polymorphic, and their detection, modeling, tracking, and analytics require deeper computable phenotyping. A genome-wide association study of cognitive functions and educational attainment in UK Biobank participants was carried out in 2016 4 . This study investigated the genetic contributions to variation in tests of three cognitive functions and in educational attainment. It demonstrated that the genome-wide significant single-nucleotide polymorphism (SNP)-based associations were located on several specific genes: ATXN2, CYP2DG, APBA1 and CADM2. In addition, this study reported significant SNP-based heritability of 31% for verbal-numerical reasoning, 5% for memory, 11% for reaction time, and 21% for educational attainment. Although many interesting findings have emerged from studies of the UK Biobank clinical and demographic features, few studies have used structural and functional brain neuroimaging biomarkers to examine mental health 5 . Currently, most neuroimaging studies continue to rely on modest sample sizes and limited amounts of data collected for each subject, which may reduce the reproducibility and replicability of the research findings 6 .
In 2014, to facilitate advanced computational neuroscientific explorations, the UK Biobank began the process of inviting back 100,000 of the original volunteers for imaging scans including brain, heart, and torso 7 . With the large number of participants, the growing UK Biobank archive presents both problems and opportunities. For instance, the emergence of big neuroimaging data analytics exposes the big confounder and the small effect size issues 5 . Making meaningful interpretations and deriving valid inference from big data archives is often difficult. Some preliminary studies are underway to take advantage of this deluge of big imaging data. For instance, to help convert the UK Biobank neuroimaging data into useful summary information, Alfaro-Almagro and others have developed an automated processing and QC (Quality Control) pipeline that is available for use by other researchers 7 .
This manuscript aims to address three specific UK Biobank analytic challenges. Challenge 1 (Feature Selection): Reduce the high dimensionality of the derived neuroimaging biomarkers. Presently, there are thousands of morphometric measures that are computed by parcellating the brain, modeling the boundary of each region of interest, and computing intrinsic or extrinsic morphological characteristics for each region [8][9][10] . Simplifying the resulting derived imaging signature vector would expose the salient features that may be highly associated with specific observed computable or clinical phenotypes.
Challenge 2 (Data harmonization): Integrate the derived salient neuroimaging biomarker features with the corresponding clinical and demographic data to obtain harmonized computable data objects. The latter can be interrogated using model-based statistical methods, model-free machine learning techniques, or exploratory data analytics to examine predefined associations as well as formulate new research hypotheses 11,12 .
Challenge 3 (Data Analytics): Develop a decision support system capable of supervised classification, unsupervised clustering, model-free forecasting, and prediction of clinical traits. This protocol may be used to prognosticate normal development from birth to maturation and aging, as well as to predict the trajectories of various health conditions.

Methods
The complete data preprocessing protocol is described in Supplementary Materials. Briefly, we employed the UK Biobank archive (n = 502,627 cases and k = 4,316 features) with demographic, clinical, biological specimen, imaging, genomic, and questionnaire data elements. We used the structural magnetic resonance imaging data (sMRI) to obtain 3,297 derived neuroimaging morphometry measures of the 3D neuroanatomical integrity of the participants' brains 8,9,13 . The complete dataset was randomly divided into a training set (n = 7,931, 80%), used for clustering and model building, and an independent testing set (n = 1,983, 20%), used for external validation.
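The random 80/20 partition described above can be illustrated with a minimal Python sketch. It is not the authors' R code; the function name `split_indices` and the seed value are assumptions introduced only for illustration, and the subject count (9,914) is taken from the text.

```python
import random

def split_indices(n_subjects, train_fraction=0.8, seed=0):
    """Randomly partition subject indices into training and testing sets."""
    indices = list(range(n_subjects))
    rng = random.Random(seed)  # the seed is an arbitrary choice, for reproducibility only
    rng.shuffle(indices)
    n_train = round(n_subjects * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(9914)
print(len(train_idx), len(test_idx))  # 7931 1983
```

Splitting on subject indices, rather than on rows of a particular table, keeps the same partition usable for both the imaging and the clinical data elements.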
To obtain derived computed phenotypes without a priori knowledge or specific clinically relevant traits, we relied on unsupervised machine learning methods. We split the entire population using two unsupervised clustering methods: k-means clustering and Ward's hierarchical clustering [14][15][16][17] were applied independently to the neuroimaging biomarkers to stratify the data into separate cohorts. Linear (multidimensional scaling, MDS, and principal component analysis, PCA) 18-20 and non-linear (t-distributed stochastic neighbor embedding, t-SNE) dimensionality reduction methods [21][22][23] were employed to project the high-dimensional data into 2D or 3D spaces. These low-dimensional Euclidean and curved manifold projection spaces illustrate the separation between the derived cohort labels, as well as the consistency of the computed phenotype clusters (see Supplementary Materials for more details). To pinpoint data features that may be highly predictive of specific computed phenotypes, we examined the distributional differences of the derived neuroimaging biomarkers as well as the quantitative, categorical, and clinical measures between the clusters. The most salient neuroimaging biomarkers discriminating between different clusters were identified using parametric (Student's t) and non-parametric (Kolmogorov-Smirnov and Mann-Whitney-Wilcoxon) statistical tests. The important categorical variables related to mental disorders that were distributed differently between the cohorts were identified using Chi-square and Fisher's exact tests. We harmonized and aggregated the important clinical and neuroimaging features and then jointly interrogated the entire merged data. We then constructed decision trees 11,24 and random forests 25 to predict specific clinical traits, such as the presence of common mental disorders, using the identified salient imaging biomarkers and categorical variables related to mental disorders.
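As a rough illustration of the k-means step, the following Python sketch implements plain Lloyd iterations with random initial centroids. It is a simplified stand-in, not the authors' R implementation; the toy two-feature points and the names `kmeans` and `sq_dist` are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k=2, max_iter=100, seed=0):
    """Plain Lloyd's algorithm with randomly chosen initial centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every subject to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical scaled values of two biomarkers for six subjects.
points = [(-1.2, -0.9), (-0.8, -1.1), (-1.0, -0.7),
          (0.9, 1.2), (1.1, 0.8), (1.3, 1.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Because the result depends on the random initialization, a single run can settle in a poor local optimum, which is one reason to repeat the clustering many times and compare the resulting partitions.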
The pipeline workflow for obtaining the neuroimaging biomarkers and R analytic scripts are provided in Fig. S1 in the Supplementary Materials.

Results
Unsupervised clustering of UK Biobank data into two separate computed phenotypes. The first step of unsupervised clustering is to determine the optimal number of clusters. Figure 1 shows a plot of the average silhouette value (an indicator of cluster robustness and reliability) for different numbers of clusters. The cluster number optimization results based on the total within-cluster sum of squares are shown in Fig. S2 in the Supplementary Materials. The plots of the silhouette values for both k-means clustering and hierarchical clustering suggest that the optimal number of clusters is two.
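For reference, the average silhouette value can be computed as in the Python sketch below. For each subject, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the silhouette is (b - a) / max(a, b). The function name and the toy one-dimensional data are assumptions for illustration, and the sketch assumes at least two clusters.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette width over all subjects (requires >= 2 clusters)."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    clusters = sorted(set(labels))
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Mean distance from subject i to each cluster, excluding itself.
        mean_dist = {}
        for c in clusters:
            ds = [dist(p, q)
                  for j, (q, other) in enumerate(zip(points, labels))
                  if other == c and j != i]
            if ds:
                mean_dist[c] = sum(ds) / len(ds)
        if lab not in mean_dist:  # singleton cluster: silhouette set to 0
            scores.append(0.0)
            continue
        a = mean_dist[lab]
        b = min(d for c, d in mean_dist.items() if c != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = mean_silhouette(toy_points, [0, 0, 1, 1])
bad = mean_silhouette(toy_points, [0, 1, 0, 1])
print(good > bad)  # True
```

Values near 1 indicate compact, well-separated clusters. Repeating the calculation for each candidate number of clusters and picking the maximum gives a plot of the kind summarized in Figure 1.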
We assessed the reliability and reproducibility of the derived computed clustering phenotypes using several alternative strategies: (1) evaluate the robustness of clustering with repeated experiments, (2) validate the clustering results with illustrations based on dimensionality reduction methods, (3) compare clustering consistency between independent clustering methods, and (4) validate the computed clustering phenotypes by supervised classification methods. To evaluate the clustering robustness, we repeated the k-means clustering 1,000 times. The consistency among these 1,000 randomly initialized k-means clustering results is summarized in Supplementary Materials Table S1. Consistency was defined as the probability of an arbitrary pair of subjects being clustered in the same derived computed phenotype across different clustering experiments. Based solely on the neuroimaging biomarkers, these results suggest that k-means clustering represents a highly robust and consistent mapping of computed phenotypes.
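One possible reading of this pairwise consistency measure is sketched below in Python: for two clustering runs, it computes the fraction of subject pairs on which the runs agree about whether the two subjects share a cluster. This Rand-index-style formulation is an assumption, since the exact estimator is given only in the Supplementary Materials, and `pair_agreement` is a hypothetical name.

```python
from itertools import combinations

def pair_agreement(labels_a, labels_b):
    """Fraction of subject pairs treated the same way by two clusterings.

    A pair counts as consistent when both runs place the two subjects
    together, or both runs place them apart. The measure does not depend
    on how the clusters happen to be numbered in each run.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

print(pair_agreement([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(pair_agreement([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.5
```

The number of pairs grows quadratically with the number of subjects, so for thousands of subjects a sampled subset of pairs may be more practical than full enumeration.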
K-means clustering of the neuroimaging biomarkers suggests the existence of structural patterns in the data. Figure 2a shows the multidimensional scaling (MDS) plot of the biomarkers color coded by the derived computed phenotypes generated by k-means clustering. The misclassification rate (MCR) is calculated from the 1,000 k-means clustering trials as the probability of a single subject being clustered into a different computed phenotype relative to its majority clustering label. The subjects with higher MCR are located along the boundary of the two clusters, which is to be expected as the subjects near the boundary are more likely to be misclassified. Figure 2a shows that although the two clusters may not be widely separated, k-means clustering generates a clear boundary between the two computed phenotypes. The first MDS coordinate appears to play the most important role in this 2D separation. To further validate the clustering results using dimensionality reduction methods, 2D and 3D PCA and t-SNE plots were generated. These Euclidean and curved-manifold projections of the high-dimensional data into low-dimensional spaces provide independent evidence of the separation of the derived computed phenotypes (Fig. 2b,c and Fig. 3). Figure 2b shows that PCA yields a separation of clusters similar to that of MDS. Although there is no distinct separation of the two clusters, there is a clear boundary between them. Again, the first dimension plays the most important role in cluster separation. The t-SNE plot tells a similar story; however, it generates a more detailed (perhaps non-linear) structure of cluster separation (Fig. 2c). The 3D PCA and t-SNE plots show a similar separation boundary. These findings suggest that k-means clustering picks up important information from the thousands of neuroimaging biomarkers as it splits the data into two computed phenotypes.
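The per-subject misclassification rate described above can be sketched in Python as follows. The sketch assumes that cluster labels have already been aligned across trials, so that label 0 denotes the same phenotype in every run; the function name and the four toy trials are hypothetical.

```python
from collections import Counter

def misclassification_rates(runs):
    """Per-subject MCR across repeated clustering trials.

    runs: list of label lists, one per trial, with labels aligned across
    trials. For each subject, the MCR is the share of trials in which the
    subject's label differs from its most frequent (majority) label.
    """
    n_runs = len(runs)
    rates = []
    for subject_labels in zip(*runs):
        majority_count = Counter(subject_labels).most_common(1)[0][1]
        rates.append(1 - majority_count / n_runs)
    return rates

trials = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
print(misclassification_rates(trials))  # [0.0, 0.25, 0.0, 0.25]
```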
www.nature.com/scientificreports
The robustness of k-means clustering is evaluated by comparing the k-means derived computed phenotypes to those obtained by an independent hierarchical clustering. The consistency of the two data partitioning schemes is shown in Supplementary Materials Table S2. K-means and hierarchical clustering generate very consistent sub-cohort divisions. About 81.5% of the subjects are clustered into the same sub-cohort groups by both k-means and hierarchical clustering. Because of this strong clustering agreement, we focused all subsequent analyses on the results based on k-means clustering.
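Because cluster numbering is arbitrary, comparing two two-cluster partitions requires matching the labels first. A minimal Python sketch, assuming both methods use the labels 0 and 1 (the function name is hypothetical):

```python
def two_cluster_agreement(labels_a, labels_b):
    """Share of subjects assigned to matching clusters by two methods.

    With two clusters labeled 0/1, swapping the labels of one partition
    turns every mismatch into a match, so the better of the two label
    matchings is simply max(matches, n - matches) / n.
    """
    n = len(labels_a)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return max(matches, n - matches) / n

kmeans_labels = [0, 0, 1, 1, 1]
hclust_labels = [1, 1, 0, 0, 1]  # same grouping, flipped numbering, one disagreement
print(two_cluster_agreement(kmeans_labels, hclust_labels))  # 0.8
```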
In addition to hierarchical clustering, the computed phenotypes derived from k-means clustering were also validated by supervised classification methods, including k-nearest neighbor (kNN) and artificial neural network (ANN). Five-fold cross-validation was applied to evaluate the consistency of the classification results. Our results showed that kNN gave a 93.7% consistency and ANN gave a 97.3% consistency in labeling the computed phenotypes derived by k-means clustering, which indicates very strong agreement between these independent clustering and classification methods.

Challenge 1 (feature selection): reduce the high dimensionality of neuroimaging biomarkers.
To address the first challenge of feature selection, we identified the top twenty salient biomarkers that are significantly different between the two derived computed phenotypes based on parametric and nonparametric tests comparing their distributions. The density plots of the selected twenty biomarkers are illustrated in Fig. 4. All the selected biomarkers appear fairly normally distributed, with cluster 1 having negative means and cluster 2 having positive means. This indicates that these salient biomarkers present strong signals separating the two computed phenotypes. Table 1 summarizes the descriptive statistics of the raw (unscaled) values of these biomarkers. Next, we focus the analysis on these twenty salient biomarkers.
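The ranking idea can be sketched in Python using only the parametric test: compute a pooled-variance Student's t statistic for every biomarker and keep the features with the largest absolute values. The feature names `biomarker_A` and `biomarker_B`, the toy values, and the function names are hypothetical; the nonparametric tests mentioned above are omitted for brevity.

```python
import math
from statistics import mean, variance

def student_t(x, y):
    """Two-sample Student's t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))

def top_features(data, labels, n_top=20):
    """Rank features by |t| between cluster 0 and cluster 1.

    data: dict mapping feature name -> list of values (one per subject).
    labels: list of 0/1 cluster labels, in the same subject order.
    """
    scores = {}
    for name, values in data.items():
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        scores[name] = abs(student_t(g0, g1))
    return sorted(scores, key=scores.get, reverse=True)[:n_top]

toy_data = {
    "biomarker_A": [-1.0, -1.2, -0.8, 1.0, 1.2, 0.8],  # separates the clusters
    "biomarker_B": [0.1, -0.1, 0.0, 0.0, 0.1, -0.1],   # does not
}
toy_labels = [0, 0, 0, 1, 1, 1]
print(top_features(toy_data, toy_labels, n_top=1))  # ['biomarker_A']
```

With thousands of candidate features, the resulting p-values would normally be adjusted for multiple testing before any feature is declared significant.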

Challenge 2 (data harmonization): integrate salient imaging biomarkers and clinical data.
To address the second challenge of data harmonization, we integrated the derived salient neuroimaging biomarkers with some clinical and demographic data and obtained a harmonized computable data object. The categorical variables that are significantly different between the two clusters, based on Chi-square tests and Fisher's exact tests, are summarized in Supplementary Materials Table S4. The mosaic plots in Fig. 5 show distribution differences of some of the most significantly different categorical variables between the two clusters. The distributions of females and males are vastly different between the two clusters, with cluster 1 including the majority of the females and cluster 2 containing predominantly males. The other five mosaic plots show that the subjects in cluster 1 tend to (1) have more sensitive/hurt feelings and more worried/anxious feelings; (2) have less willingness to take risks; (3) be more likely to feel depressed for a whole week; and (4) have more difficulties in sleeping. All these results may be highly associated with the significant gender disparity between the clusters. Although these findings illustrate how unsupervised clustering may be used for exploratory as well as confirmatory analytics in mental disorders, the same approach may be used to tackle different types of health conditions or to monitor normal development and aging.
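For a two-level categorical variable, the Chi-square test of independence against cluster membership reduces to a 2x2 contingency table. The Python sketch below computes the Pearson statistic and its p-value for one degree of freedom. The counts are hypothetical, and no continuity correction is applied, so the result can differ from software that applies Yates' correction by default.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    table: [[a, b], [c, d]] with rows = clusters and columns = categories.
    Returns (statistic, p_value). For one degree of freedom the upper tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, r in enumerate(row_totals):
        for j, col in enumerate(col_totals):
            expected = r * col / n
            stat += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows are clusters 1 and 2, columns are "yes"/"no".
stat, p = chi_square_2x2([[30, 10], [10, 30]])
print(round(stat, 2), p < 0.001)  # 20.0 True
```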
Challenge 3 (data analytics): develop a decision support system using random forests.
Once we identified the salient neuroimaging biomarkers and categorical variables, we proceeded to examine the data in its entirety, e.g., to look for potential associations between the categorical variables related to mental illness and the neuroimaging biomarkers. To address the third challenge (i.e., developing a decision support system capable of predicting the disease state), we started by fitting a single decision tree. Supplementary Materials Figure S3 illustrates a direct example of an explicit decision support system that can be used for prediction of the mental state, e.g., sensitivity/hurt feelings. This specific decision tree shows that three of the salient biomarkers are important in the decision classification of the participants, namely rh_aparc.DKTatlas_area__rh_superiorfrontal_area, lh_aparc.DKTatlas_area__lh_superiortemporal_area, and aseg__MaskVol. The other two features are both categorical variables and include "Worry too long after embarrassment," which plays the most important role in the classification. Subjects responding "yes" to the question "Worry too long after embarrassment" are more likely to have sensitivity/hurt feelings than those responding "no." Using the remaining variable together with the three neuroimaging biomarkers provides a deeper classification phenotyping. For instance, subjects with worrier/anxious feelings and lh_aparc.DKTatlas_area__lh_superiortemporal_area larger than −1.158 (scaled values) are 16.2% less likely to have sensitivity/hurt feelings. Similarly, subjects without worrier/anxious feelings and lh_aparc.DKTatlas_area__lh_superiortemporal_area larger than −1.393 are 26.7% less likely to have sensitivity/hurt feelings.
Supplementary Materials Figure S3 illustrates how the decision tree can be used as a clinical decision support system guiding physicians in using the specific imaging biomarkers and categorical variables for prognostication or treatment planning. In addition, this information may be useful to guide prospective studies, i.e., what prospective data should be collected (for future clinical trials) or mined (for retrospective data analytics). Of course, this simple example is just an illustration. Because single decision trees tend to suffer from high variance or high bias, in practice we use the much more reliable ensemble classification and regression methods, such as random forests.
Next, we focus on developing a classifier that can predict the presence of some specific mental disorders. Random forest prediction relies on aggregating (bagging) hundreds of decision tree classifiers, each trained on a bootstrap sample, using the combination of the salient neuroimaging biomarkers and the salient categorical features. Figure 6 illustrates four examples of the top twenty variables identified to be important in the prediction of "sensitivity/hurt feelings," "ever depressed for a whole week," "worrier/anxious feelings," and "miserableness" based on the mean decrease of the Gini values. "Worry too long after embarrassment," "worrier/anxious feelings," and 18 neuroimaging biomarkers are listed as the top twenty features for predicting "sensitivity/hurt feelings." In developing the decision rules for "ever depressed for a whole week," we first included all selected imaging biomarkers and all categorical features. A deeper examination of the categorical variables revealed that two variables, "seen doctor (GP) for nerves, anxiety, tension or depression" and "frequency of depressed mood in last 2 weeks," were highly associated with the response variable we predicted. Therefore, we retrained the random forest classifier excluding these two specific predictors to avoid confounding problems. The result showed that "Ever unenthusiastic/disinterested for a whole week" was the most important predictor in forecasting depression. The other features had approximately similar contributions. Indeed, depression and other mental health disorders represent complex heterogeneous conditions, and one would not expect a small number of imaging biomarkers to yield extremely accurate, consistent, or reliable predictions. In the prediction of "worrier/anxious feelings" and "miserableness," the salient neuroimaging biomarkers also played an important role.
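The quantity behind the "mean decrease of the Gini values" can be sketched for a single split in Python. In a random forest, the importance of a variable accumulates these impurity decreases over all splits made on that variable and averages them across trees; only the single-split computation is shown here, with hypothetical function names and toy labels.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(labels, goes_left):
    """Impurity decrease achieved by one binary split.

    labels: class label per subject.
    goes_left: True/False per subject, the outcome of the split rule.
    """
    left = [y for y, g in zip(labels, goes_left) if g]
    right = [y for y, g in zip(labels, goes_left) if not g]
    n = len(labels)
    return (gini(labels)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

outcome = [1, 1, 0, 0]
print(gini_decrease(outcome, [True, True, False, False]))  # 0.5
print(gini_decrease(outcome, [True, False, True, False]))  # 0.0
```

A split that separates the classes perfectly removes all of the impurity, while an uninformative split removes none, which is why predictors that frequently produce useful splits rank highest in Figure 6.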
Table 2 reports the cross-validated accuracy (with 95% confidence intervals, CI), sensitivity, and specificity for predicting four specific mental health outcomes: "sensitivity/hurt feelings," "ever depressed for a whole week," "worrier/anxious feelings," and "miserableness." The consistent 70-80% accuracy across these four mental conditions suggests that these machine-learning strategies may be useful to support physicians in their diagnosis, prognosis, and disease progression tracking. The prediction performance of the established models on the testing dataset is summarized in Table 3, which indicates a high prediction consistency for an independent dataset.
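The reported performance measures can be computed from predicted and observed binary outcomes as in the Python sketch below. The confidence interval uses the normal (Wald) approximation, which is an assumption made here for simplicity; the paper does not state which interval was used, and an exact binomial interval would differ slightly. The toy labels are hypothetical.

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy (with approximate 95% CI), sensitivity, and specificity.

    Labels are coded 1 = condition present, 0 = condition absent.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    n = len(pairs)
    accuracy = (tp + tn) / n
    # Wald interval: accuracy +/- 1.96 standard errors, clipped to [0, 1].
    half_width = 1.96 * math.sqrt(accuracy * (1 - accuracy) / n)
    return {
        "accuracy": accuracy,
        "ci95": (max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(observed, predicted)
print(metrics["accuracy"], metrics["sensitivity"])  # 0.7 0.75
```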

Discussion
The UK Biobank is a complex data archive with an ongoing data collection process. The large number of observations and the complex composition of the data elements make it very difficult for researchers to statistically mine the data and computationally generate precise, reliable, and consistent inference. In this manuscript, we employ a three-challenge approach to extract useful information from the UK Biobank, such as deriving prediction models for detecting the presence and tracking the progression of depression and other mental illnesses. The first challenge was to reduce the dimension of the dataset and identify salient features among thousands of variables. By performing unsupervised clustering, we successfully computed derived phenotypes (clusters) that were consistent (across methods) and reliable (across experiments). We also identified the top twenty salient imaging biomarkers that contribute most to the separation of the two clusters. Examining the distributions of these twenty neuroimaging biomarkers, we found that they appear to be approximately normally distributed, with cluster 1 having predominantly negative means and cluster 2 having positive means. We tested many more of the neuroimaging biomarkers to determine whether they were distributed significantly differently between the two clusters. Figure S4, in the Supplementary Materials section, shows the distributions of twenty neuroimaging biomarkers that ranked in the middle, and another twenty biomarkers that ranked at the bottom, of feature significance according to parametric and nonparametric tests. These groups of features were not significantly different across the computed phenotypes and their density plots illustrate no obvious differences. Therefore, our forecasting and prediction of mental health outcomes only used the top twenty imaging biomarkers.
The number of selected imaging biomarkers was determined by the consistency of their significances in separating the computed phenotypes based on k-means clustering and hierarchical clustering. All the top twenty selected neuroimaging biomarkers were common according to the rankings of the significance tests with clustering results generated by k-means and the hierarchical clustering.
Table 2. Cross-validated random forest prediction results for "sensitivity/hurt feelings," "ever depressed for a whole week," "worrier/anxious feelings," and "miserableness."
Table 3. Random forest prediction results for "sensitivity/hurt feelings," "ever depressed for a whole week," "worrier/anxious feelings," and "miserableness" in the testing dataset.
Challenge two involved harmonizing and aggregating imaging, clinical and demographic data elements and the joint interrogation of the holistic dataset. We demonstrated how unsupervised clustering may be used for either exploratory or confirmatory analytics in many health studies.
The final challenge addressed in this study was to develop an effective decision support system that is capable of detecting the presence of and predicting the progression of common illnesses. Our approach relies on unsupervised learning of derived neuroimaging biomarkers and categorical phenotypic features. Following the unsupervised clustering, we performed Chi-square and Fisher's exact tests to determine the categorical variables that discriminate between the two clusters. One interesting discovery from the tests of the categorical variables is the significant gender disparity between the clusters. This supports previous reports of an association between gender and the prevalence of mental disorders [26][27][28] . The subjects in the majority female cluster were more likely to experience mental disorders. This finding suggests gender differences in mental health. It has been demonstrated that across many nations, cultures, and ethnicities, females are about twice as likely as males to develop depression 29,30 . Women have a lifetime prevalence for major depressive disorder of 21.3%, compared with 12.7% in men 31 . In addition to depression, females are more likely to express anxiety and worry, and also report a more negative problem orientation and engage in more thought suppression than males 32,33 . Our finding is consistent with the previous discoveries, indicating that females are more vulnerable to emotional fluctuation and mental disorders.
Aggregating the salient neuroimaging biomarkers and the selected categorical variables allowed us to generate a computable data object that can be interrogated to examine predefined associations (confirmatory analytics) as well as formulate new research hypotheses (exploratory analytics). The final decision guidelines for predicting some mental disorders, e.g., depression, were developed using a random forest classifier. Although random forest and individual decision tree classification are similar, the decision-making criteria determined by random forest prediction cannot be directly explicated the way a single decision tree classification can be explained. As our decision tree classifier demonstrated, categorical variables seem to play a dominant role in the classification. However, involvement of the neuroimaging biomarkers can provide additional stratification complementing the classification procedure. Random forest classification ranked some imaging biomarkers higher than some of the categorical variables, which illustrates differences with the single decision tree classifier. In the Supplementary Materials section, we also demonstrate an approach to derive deeper computable phenotypes by stratifying clusters within the low-dimensional t-SNE manifold (see Supplementary Materials Figures S5 and S6, as well as Tables S4 and S5). Figure S6b and Table S6 show how biomarkers representing the size of specific cortical surface areas in the right hemisphere might indicate reduced functional activities, as many prior studies have shown. For instance, Kuperberg and colleagues demonstrated selective thinning of the cerebral prefrontal cortices (including precentral and postcentral gyri) in patients with schizophrenia 34 . Others have shown similar reductions of pre- and post-central areas in bipolar disorder and Williams syndrome [35][36][37] .
The observed consistency of the derived computed phenotypes and the reliability of the chosen top salient biomarkers contributing to the unsupervised clustering suggest that the information structure in the UKBB dataset can be exploited using various analytical techniques. The example of a rudimentary clinical decision support system we illustrated here specifically identified the critical biomarkers used in forecasting of depression, anxiety, and mood disorders.
As firm supporters of open science, the authors encourage independent validation, reproducibility, and expansion of the reported results. Innovative collaborations using similar techniques may reduce healthcare costs and improve patient diagnosis and disease tracking of normal and pathological conditions. The entire computational protocol, software code, pipeline workflows, and R-scripts are available on the SOCR GitHub site (https://github.com/SOCR/UKBB_Analytics). All UK Biobank data is available online at http://www.ukbiobank.ac.uk.