ANTsX neuroimaging-derived structural phenotypes of UK Biobank

UK Biobank is a large-scale epidemiological resource for investigating prospective correlations between various lifestyle, environmental, and genetic factors with health and disease progression. In addition to individual subject information obtained through surveys and physical examinations, a comprehensive neuroimaging battery consisting of multiple modalities provides imaging-derived phenotypes (IDPs) that can serve as biomarkers in neuroscience research. In this study, we augment the existing set of UK Biobank neuroimaging structural IDPs, obtained from well-established software libraries such as FSL and FreeSurfer, with related measurements acquired through the Advanced Normalization Tools Ecosystem. This includes previously established cortical and subcortical measurements defined, in part, based on the Desikan-Killiany-Tourville atlas. Also included are morphological measurements from two recent developments: medial temporal lobe parcellation of hippocampal and extra-hippocampal regions in addition to cerebellum parcellation and thickness based on the Schmahmann anatomical labeling. Through predictive modeling, we assess the clinical utility of these IDP measurements, individually and in combination, using commonly studied phenotypic correlates including age, fluid intelligence, numeric memory, and several other sociodemographic variables. The predictive accuracy of these IDP-based models, in terms of root-mean-squared-error or area-under-the-curve for continuous and categorical variables, respectively, provides comparative insights between software libraries as well as potential clinical interpretability. Results demonstrate varied performance between package-based IDP sets and their combination, emphasizing the need for careful consideration in their selection and utilization.


UK Biobank data description
The study was conducted under UKBB Resource Application ID 63965. The total number of subjects at the time of download was 502,413 with 49,351 T1 and FLAIR images from the baseline assessment. Although follow-up visits were available for many participants, only the T1 and FLAIR images from the baseline visit were used for this study. Prior to this study, and as part of the UKBB data repository, the FSL and FreeSurfer packages were used to generate sets of IDPs calculated from these baseline images, which are made available as tabulated data as part of the resource application. The UKBB's strict quality control protocols 5 and the intersection between the complete FSL and FreeSurfer sets of IDPs resulted in a UKBB-derived cohort of 40,898 sets of measurements. Intersection with the final, complete set of processed ANTs IDPs resulted in a total study cohort size of 40,796.
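The two-stage cohort construction described above amounts to intersecting subject identifiers across the per-package IDP tables. The following is a minimal sketch of that idea, not the code used in the study; the function name `intersect_cohorts` and the toy subject IDs are hypothetical.

```python
def intersect_cohorts(*id_collections):
    """Return the sorted subject IDs that appear in every collection."""
    common = set(id_collections[0])
    for ids in id_collections[1:]:
        common &= set(ids)
    return sorted(common)


# Hypothetical subject IDs with a complete IDP set for each package.
fsl_ids = [1001, 1002, 1003, 1005]
freesurfer_ids = [1001, 1003, 1004, 1005]
ants_ids = [1001, 1003, 1006]

# Stage 1: subjects with complete FSL and FreeSurfer IDPs.
ukbb_cohort = intersect_cohorts(fsl_ids, freesurfer_ids)
# Stage 2: further restrict to subjects with complete ANTsX IDPs.
study_cohort = intersect_cohorts(ukbb_cohort, ants_ids)

print(len(ukbb_cohort), len(study_cohort))
```

In the study itself, the corresponding cohort sizes are 40,898 after the first stage and 40,796 after the second.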

FreeSurfer structural phenotypes
Several categories of IDPs are available for FreeSurfer comprising a total of 1242 measurements. 37 However, to make the study dataset more computationally tractable and reduce set size differences between packages, we selected the following popular IDP subsets:

ANTsX structural phenotypes
Both sociodemographic and bulk image data were downloaded to the high performance cluster at the University of Virginia for processing. Grad-warped distortion corrected 38 T1-weighted and FLAIR image data were used to produce the following ANTsX IDPs:
• Deep Atropos brain tissue volumes (i.e., CSF, gray matter, white matter, deep gray matter, brain stem, and cerebellum);
• DKT DiReCT cortical thickness and volumes;
• DKT-based regional volumes;
• DeepFLASH regional volumes;
• Cerebellum regional thickness and volumes;
• Regional WMH loads
totaling 7 DeepAtropos + 88 DKTreg + 128 DKTDiReCT + 20 DeepFLASH + 48 Cerebellum + 13 WMH = 302 IDPs which are illustrated in Fig. 1. We have reported previously on the first three categories of ANTsX IDPs 16 but provide a brief description below. We also provide further details concerning both DeepFLASH and the cerebellum morphology algorithms.

Brain tissue volumes
The ANTsXNet deep learning libraries for Python and R (ANTsPyNet and ANTsRNet, respectively) were evaluated in terms of multi-site cortical thickness estimation. 16 This extends previous work 24,25 in replacing key pipeline components with deep learning variants. For example, a trained network, denoted Deep Atropos, replaced the original Atropos algorithm 23 for six-tissue segmentation (CSF, gray matter, white matter, deep gray matter, cerebellum, and brain stem) similar to functionality for whole brain deep learning-based brain extraction.

DKT cortical thickness, regional volumes, and lobar parcellation
As part of the deep learning refactoring of the cortical thickness pipeline mentioned in the previous section, a framework was developed to generate DKT cortical and subcortical regional labels from T1-weighted MRI. 16 This facilitates regional averaging of cortical thickness values over that atlas parcellation and is also the source of other potentially useful geometry-based IDPs. In terms of network training and development, using multi-site data, 24 two separate U-net 39 networks were trained for the "inner" (e.g., subcortical, cerebellar) labels and the "outer" cortical labels, respectively. Similar to Deep Atropos, preprocessing includes brain extraction and affine transformation to the space of the MNI152 template, 40 which includes corresponding prior probability maps. These maps are used as separate input channels for both training and prediction, a type of surrogate for network attention gating. 41 Using FreeSurfer's DKT atlas label-to-lobe mapping, 42 we use a fast marching approach 43 to produce left/right parcellations of the frontal, temporal, parietal, and occipital lobes, as well as left/right divisions of the brain stem and cerebellum. Using the segmentation output from Deep Atropos, the DiReCT algorithm 29 generates the subject-specific cortical thickness map which, as previously mentioned, is summarized in terms of IDPs by DKT regional definitions. Given the diffeomorphic and thickness constraints dictated by the DiReCT algorithm, we generate additional DKT regional labels (cortex only) from the non-zero cortical thickness regions to also be used as IDPs.
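The regional summarization of a thickness map can be illustrated with a small sketch. It assumes the label image and thickness map have been flattened to equal-length sequences, that label 0 denotes background, and that only voxels with non-zero thickness contribute to each regional mean; the function name and the toy values are hypothetical, and this is not the ANTsX implementation.

```python
from collections import defaultdict


def regional_mean_thickness(labels, thickness):
    """Average thickness per label over voxels with non-zero thickness.

    `labels` and `thickness` are equal-length sequences (flattened images).
    Label 0 is treated as background and skipped (an assumption).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, thickness):
        if label == 0 or value <= 0:
            continue  # skip background and zero-thickness voxels
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Toy data: two hypothetical cortical labels and thickness values in mm.
labels = [0, 1002, 1002, 1003, 1003, 1003]
thickness = [0.0, 2.0, 3.0, 0.0, 2.5, 3.5]

print(regional_mean_thickness(labels, thickness))
```

Each entry of the returned dictionary corresponds to one thickness IDP for the given region.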

Fused labeling for automated segmentation of the hippocampus and extra-hippocampal regions (DeepFLASH)
A set of IDPs was generated using a deep learning-based framework for hippocampal and extra-hippocampal subfield parcellation which is also publicly available within ANTsXNet (referred to as "DeepFLASH"). 6][47][48][49][50] DeepFLASH comprises both T1/T2 multi-modality and T1-only imaging networks for parcellating the following MTL regions:
• Hippocampal subfields
  • Dentate gyrus/cornu ammonis 2-4 (DG/CA2/CA3/CA4)
  • Cornu ammonis 1 (CA1)
  • Subiculum
• Extra-hippocampal regions
DeepFLASH employs a traditional 3-D U-net model 39 consisting of five layers with 32, 64, 96, 128, and 256 filters, respectively. In addition to the multi-region output, three additional binary outputs (the entire medial temporal lobe complex, the whole hippocampus, and the extra-hippocampal cortex) are incorporated as a hierarchical structural output set. Data augmentation employed both randomized shape variations (i.e., linear and deformable geometric perturbations) and intensity variations (i.e., simulated bias fields, added noise, and intensity histogram warping). Further information regarding training and prediction can be found at our ANTsXNet GitHub repositories. 51,52

Cerebellum morphology
ANTsX cerebellum IDPs comprise both regional volumes and cortical thickness averages based on the Schmahmann atlas 28 for cerebellar cortical parcellation. Cortical regions include the following left and right hemispherical lobules: I/II, III, IV, V, VI, Crus I, Crus II, VIIB, VIIIA, VIIIB, IX, and X. Quantifying cerebellar cortical thickness utilizes the DiReCT algorithm. 29 Both tissue segmentation (CSF, gray matter, and white matter) and regional parcellation are based on a deep learning network similar to that described previously for DeepFLASH. Training data 53 were coupled with the previously described data augmentation. In contrast to DeepFLASH, which utilized a single network with multiple outputs, cerebellum output is derived by first extracting the whole cerebellum and then using it as input to both the tissue segmentation network and the Schmahmann regional atlas network.

Predictive modeling for IDP characterization
Insight into the relationships between neurostructural and phenotypic measures is often possible through predictive modeling of sociodemographic targets and neuroimaging biomarkers. Many strategies for data exploration leverage standardized quantities derived from existing pipelines, which constitutes a form of dimensionality reduction or feature extraction based on clinically established relevance. Such tabulated data has several advantages over direct image use including being relatively easier to access, store, and manage. Analysis with off-the-shelf statistical packages is also greatly simplified. Additionally, using standardized features in predictive modeling, where feature importance is a component of the analysis, significantly facilitates the clinical interpretability of the modeling process. Herein, baseline models are made using standard linear regression where linear dependencies between covariates were resolved using findLinearCombos of the caret R package. 56 Although other modeling approaches were explored (e.g., XGBoost 57 and TabNet 58), the linear models were the top performing models in terms of predictive accuracy so, in the interest of simplicity, we only discuss those here and refer the interested reader to the GitHub repository associated with this work for these additional explorations. We selected several target variables for our comparative evaluation (cf. Table 1) and generated models of the form:

$$\text{Target} \sim \text{Age} + \text{Genetic Sex} + \sum_{i=1}^{N} \text{IDP}_i \qquad (1)$$

where i indexes over the set of N IDPs for a particular grouping. In the cases where Age or Genetic Sex is the target variable, it is omitted from the right side of the modeling equation.
Assessment of the models based on the three individual sets of IDPs and their combination employs standard quality measures: area under the curve (AUC) for classification targets and root-mean-square error (RMSE) for regression targets. We also explored individual IDP importance through the use of model-specific parameter assessment metrics (i.e., the absolute value of the t-statistic).
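For concreteness, the two quality measures named above can be written out in a few lines. This is an illustrative sketch in plain Python rather than the evaluation code used in the study; the function names and toy values are hypothetical. The AUC is computed only for the binary case, as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The multi-class variant is not covered here.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def binary_auc(labels, scores):
    """Binary AUC via pairwise comparison of positive and negative scores."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy regression target (e.g., age in years) and toy binary target.
print(rmse([60, 65, 70], [62, 65, 67]))
print(binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUC is quadratic in the number of subjects, which is fine for illustration; rank-based formulations are preferable at cohort scale.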

Package-wise group IDP comparison
To compare the groups of IDPs, we used the three IDP sets (FSL, FreeSurfer, ANTsX) and their combination ("All") to train predictive models using the preselected target sociodemographic variables from Table 1. We first revisit a previous evaluative framework of ANTsX cortical thickness values by comparing their ability to predict Age and Genetic Sex with corresponding FreeSurfer cortical thickness values. 16 Following this initial comparative analysis, ten-fold cross validation, using random training/evaluation sampling sets (90% training/10% evaluation), per IDP set per target variable was used to train and evaluate the models described by Eq. (1).
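The cross-validation scheme can be sketched as follows. The sketch assumes a standard ten-fold partition in which subjects are shuffled once and each fold serves as the 10% evaluation set exactly once; the text does not specify whether the random 90%/10% splits were drawn this way or independently per repetition, so this is one possible reading. The function name and seed are hypothetical.

```python
import random


def ten_fold_splits(n_subjects, n_folds=10, seed=0):
    """Yield (training, evaluation) index lists for a shuffled k-fold split."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        evaluation = folds[k]
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, evaluation


splits = list(ten_fold_splits(100))
for training, evaluation in splits:
    # With 100 subjects, each split holds 90 training and 10 evaluation cases.
    print(len(training), len(evaluation))
```

A model of the form in Eq. (1) would be fit on each training set and scored on the matching evaluation set, giving ten accuracy values per IDP set per target variable.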

Revisiting ANTs and FreeSurfer cortical thickness comparison
In previous publications, 16,24 IDPs under consideration were limited to ANTsX-based and FreeSurfer cortical thickness measurements averaged over the 62 regions of the DKT parcellation. These IDP sets were specifically compared in terms of their predictive capability vis-à-vis Age and Genetic Sex. With respect to UKBB-derived cortical thickness IDPs, similar analysis demonstrates consistency with prior results (see Fig. 2).

Package IDP comparison via continuous target variables
Predictive models for cohort Age, Fluid Intelligence Score, Neuroticism Score, Numeric Memory, Body Mass Index, and Townsend Deprivation Index were generated and evaluated as described previously. Summary statistics for these variables are provided in Table 2. The resulting accuracies, in terms of RMSE, are provided in Fig. 3. These linear models provide consistently accurate results across the set of continuous target variables with the combined set of IDPs performing well for the majority of cases. All models demonstrate significant correlations across IDP sets (cf. Fig. 4).

Package IDP comparison via categorical target variables
Predictive models for cohort categories associated with Genetic Sex, Hearing Difficulty, Risk Taking, Same Sex Intercourse, Smoking Frequency, and Alcohol Frequency were generated and evaluated as described previously. The resulting accuracies, in terms of binary or multi-class AUC, are provided in Fig. 5. Similar to the continuous variables, the linear models perform well for most of the target variables. Superior performance is seen for predicting Genetic Sex.

Individual IDP comparison
To compare individual IDPs, for each target variable, we selected the set of results corresponding to the machine learning technique which demonstrated superior performance, in terms of median predictive accuracy, for the combined (All) IDP grouping. The top ten features for the principal continuous variables of Age, Fluid Intelligence Score, and Neuroticism Score are listed in Table 3 and ranked according to variable importance score (specifically, absolute t-statistic value for linear models). The ranked lists are also color-coded by IDP package. For additional insight into individual IDPs, full feature lists with feature importance rankings are available for all target variables in the supplementary material hosted at the corresponding GitHub repository 59.
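For reference, the absolute t-statistic used for ranking follows from the standard ordinary least squares formulas. The notation below is introduced here for illustration and is an assumption rather than the paper's own: X is the n × p design matrix (the intercept, the covariates, and the N IDPs of Eq. (1)), y is the target vector, and the importance of the i-th regressor is |t_i|.

$$\hat{\beta} = (X^\top X)^{-1} X^\top y, \qquad \hat{\sigma}^2 = \frac{\lVert y - X\hat{\beta} \rVert^2}{n - p}, \qquad t_i = \frac{\hat{\beta}_i}{\hat{\sigma}\sqrt{\left[(X^\top X)^{-1}\right]_{ii}}}$$

Because each coefficient is divided by its own standard error, |t_i| is scale-free, so IDPs measured in different units (e.g., volumes and thicknesses) can be ranked together.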
Regression regions defined by the linear models represented in Fig. 3 showing the relationship between the predicted and actual target values. We also plot the median line for each model-based grouping as defined by the slope and list the average R² values for each IDP set.
In addition to the availability of these ANTsX UKBB IDPs, we explored their utility with respect to other package-specific groupings and their combinations. For exploration of these IDP group permutations, we used linear modeling to predict commonly studied sociodemographic variables of current research interest (Table 1). In addition to research presentation in traditional venues, at least two of these target variables, specifically Age and Fluid Intelligence, have been the focus of two recent competitions.
Regarding the former, research concerning brain age estimation from neuroimaging is extensive and growing (cf. recent reviews). 34,60,61 It was also the subject of the recent Predictive Analytics Competition held in 2019 (PAC2019). This competition featured 79 teams leveraging T1-weighted MRI with a variety of quantitative approaches from convolutional neural networks (CNNs) to common machine learning frameworks based on morphological descriptors (i.e., structural IDPs) derived from FreeSurfer. 62 The winning team, 63 using an ensemble of CNNs and pretrained on a UKBB cohort of N = 14,503 subjects, had a mean absolute error (MAE) of 2.90 years. Related CNN-based deep learning approaches achieved comparable performance levels and simultaneously outperformed more traditional machine learning approaches.
Given that RMSE provides a general upper bound on MAE (i.e., MAE ≤ RMSE), the accuracy levels yielded by our FSL, FreeSurfer, and ANTsX models can be seen from Fig. 3 to perform comparatively well. The FreeSurfer and ANTsX linear models performed similarly with RMSE prediction values of approximately 4.4 years whereas FSL was a little higher at 4.96 years. However, combining all IDPs resulted in an average RMSE value of 3.8 years. When looking at the top 10 overall linear model features (Table 3) ranked in terms of absolute t-statistic value, all three packages are represented and appear to reflect both global structures (white matter and CSF volumes) and general subcortical structural volumes (ANTsX "deep GM" and both FreeSurfer and ANTsX bi-hemispherical ventral diencephalon volumes).][66]
Similarly, the association between brain structure and fluid intelligence has been well-studied 67 despite potentially problematic philosophical and ethical issues. 68 With intentions of furthering this research, the ABCD Neurocognitive Prediction Challenge (ABCD-NP-Challenge) was held in 2019, which concerned predicting fluid intelligence scores (using the NIH Toolbox Cognition Battery) 69 in a population of 9-10 year old pediatric subjects using T1-weighted MRI. Fluid intelligence scores were residualized from brain volume, acquisition site, age, ethnicity, genetic sex, and parental attributes of income, education, and marriage (additional data processing details are provided in the Data Supplement). 70 Of the 29 submitting teams, the first place team of the final leaderboard employed kernel ridge regression with voxelwise features based on the T1-weighted-based probabilistic tissue segmentations (specifically, CSF, gray matter, and white matter, in both modulated and unmodulated versions) for a total of six features per subject. In contrast to the winning set of sparse, global predictive features, the second place team used 332 total cortical, subcortical, white matter, cerebellar, and CSF volumetric features. Although exploring several machine learning modeling techniques, the authors ultimately used an ensemble of models for prediction which showed improvement over gradient boosted decision trees. From Table 3, most predictive features from our study, regardless of package, are localized measures of gray matter.
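The bound MAE ≤ RMSE invoked above when comparing our RMSE values against the MAE reported for PAC2019 follows from the Cauchy-Schwarz inequality. Writing e_j for the prediction error on subject j of n (notation introduced here for illustration):

$$\mathrm{MAE} = \frac{1}{n}\sum_{j=1}^{n} |e_j| \cdot 1 \;\le\; \frac{1}{n}\sqrt{\sum_{j=1}^{n} e_j^2}\,\sqrt{\sum_{j=1}^{n} 1^2} \;=\; \sqrt{\frac{1}{n}\sum_{j=1}^{n} e_j^2} \;=\; \mathrm{RMSE}$$

Equality holds only when all absolute errors are equal, so a reported RMSE is a conservative stand-in for MAE: a model with an RMSE of 3.8 years has an MAE of at most 3.8 years.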
Although the stated, primary objective of these competitions is related to superior performance in terms of algorithmic prediction of quantitative sociodemographics, similar to the evaluation strategy used in this work, outside of the clinical research into brain age estimation, none of these performance metrics reaches the level needed for individual-level prediction. Consequently, these may be more informative as an interpretation of the systems-level relationship between brain structure and behavior. An obvious secondary benefit is the insight gained into the quality and relevance of measurements and modeling techniques used. In this way, these considerations touch on fundamental implications of the No Free Lunch Theorems for search and optimization 71 where prior distributions (i.e., correspondence of measurements and clinical domain for algorithmic modeling) differentiate general performance. Relatedly, although all packages are represented amongst the top-performing IDPs, their relative utility is dependent, expectedly so, on the specific target variable, and, to a lesser extent, on the chosen machine learning technique. Such considerations should be made along with other relevant factors (e.g., computational requirements, open-source availability) for tailored usage.
Table 3. Top 10 features for Age, Fluid Intelligence Score, and Neuroticism Score target variables specified for the combined (i.e., All) IDP set.

Conclusion
The UK Biobank is an invaluable resource for large-scale epidemiological research which includes a thorough neuroimaging battery for a significant subset of the study volunteers. For quantitative exploration and inference of population trends from imaging data, well-vetted measurement tools are essential. The primary contribution that we have described is the generation and public availability of the set of UK Biobank neuroimaging structural IDPs generated using the ANTsX ecosystem. These ANTsX IDPs, which include DeepFLASH for hippocampal and extra-hippocampal parcellation, complement the existing sets of FSL and FreeSurfer IDPs. A predictive modeling strategy using a variety of sociodemographic target variables was used to explore IDP viability, importance, and utility via linear modeling.

Figure 1. Illustration of the IDPs generated with ANTsX ecosystem tools. Using the gradient-distortion corrected versions of the T1 and FLAIR images, several categories of IDPs were tabulated. These include global brain and tissue volumes, cortical thicknesses averaged over the 62 DKT regions, WMH intensity load per lobe based on the SYSU algorithm, cortical and subcortical volumes from the DKT labeling, MTL regional volumes using DeepFLASH, and morphological cerebellum quantities.

Figure 2. Results for predicting Age (left) and Genetic Sex (right) using both ANTsX and FreeSurfer cortical thickness data averaged over the 62 cortical regions of the DKT parcellation. RMSE and AUC were used to quantify the predictive accuracy of Age and Genetic Sex, respectively.
in 2017. Image data from five sites were used for both training and testing of segmentation algorithms from 20 different teams. Both the architecture and ensemble weights were made publicly available by the SYSU team which permitted a direct porting into ANTsXNet.

Table 2. Summary statistics for the selected continuous UKBB sociodemographic target variables.

Comparison of machine learning frameworks for training and prediction of selected continuous UKBB sociodemographic variables (cf. Table 1) with the different IDP sets and their combination (FSL, FreeSurfer, ANTsX, and All).
in a population of 9-10 year pediatric subjects using T1-weighted MRI. Fluid intelligence scores were residualized from brain volume, acquisition site, age,

Figure 5. Comparison of prediction accuracy of selected binary and multilabel categorical UKBB sociodemographic variables (cf. Table 1) with the different IDP sets and their combination (FSL, FreeSurfer, ANTsX, and All). Smoking and Alcohol target variables have more than two labels.