Deep learning based brain age prediction uncovers associated sequence variants

Machine learning algorithms trained to recognize age-related structural changes in magnetic resonance images (MRIs) of healthy individuals can be used to predict biological brain age in independent samples. The difference between predicted and chronological age, predicted age difference (PAD), is a phenotype holding promise for the study of normal brain ageing and brain diseases, and genetic discovery via genome-wide association studies (GWASs). Here, we present a new deep learning approach to predict brain age from a T1-weighted MRI. The method was trained on a dataset of healthy Icelanders (N = 1264) and tested on two datasets, the IXI (N = 544) and UK Biobank (N = 12395) datasets, utilizing transfer learning to improve accuracy on new sites. A GWAS of PAD in the UK Biobank data (discovery set: N=12395, replication set: N=4453) yielded two sequence variants, rs1452628-T (β=-0.08, P = 1.15 · 10−9) and rs2435204-G (β=0.102, P = 9.73 · 10−12). The former is near KCNK2 and correlates with reduced sulcal width, whereas the latter correlates with reduced white matter surface area and tags a well-known inversion at 17q21.31 (H2). The genetic association analysis was also confined to variants known to associate with brain structure, yielding three additional sequence variants associating with PAD.


Introduction
Ageing has a significant structural impact on the brain that correlates with decreased mental and physical fitness [12] and increased risk of neurodegenerative diseases such as Alzheimer's disease [1] and Parkinson's disease [59]. Recent publications, have demonstrated that MRIs can be used to predict chronological age with reasonably good accuracy [20,49,12]. Such predictions provide an estimate of biological brain age in independent samples. The traditional way to perform brain age prediction is to extract features from brain MRIs followed by classification or regression analysis. This includes extracting principal components [20], cortical thickness and surface curvature [77], volume of gray matter (GM), white matter (WM), and cerebrospinal fluid [38], and constructing 1 a similarity matrix [10]. The drawback of using feature extraction methods is loss of information since the features are likely not designed explicitly for extracting information relevant to brain age. Recently, deep learning (DL) methods have garnered much interest [46]. These methods learn features that are important without apriori bias or hypothesis. Convolutional neural networks (CNNs) [45] are deep learning techniques that are especially powerful for image processing and computer vision. Previously, they have been applied to brain age prediction [32,11]. Notably, Cole et al. [11] implemented a 3D CNN trained on T1-weighted MRIs to predict brain age and achieved promising results. PAD (the difference between predicted brain age and chronological age) estimates the deviation from healthy ageing. Studies have shown that positive PAD correlates with measures of reduced mental and physical fitness; including weaker grip strength, poorer lung function, slower walking speed, lower fluid intelligence, higher allostatic load and increased mortality risk [12]. In addition, positive PAD has been shown to associate with cognitive impairments [21,24,10,49], diabetes [22], traumatic brain injuries [10], schizophrenia [41,64], and chronic pain [43]. On the other hand, a negative PAD associates with higher educational attainment [72], increased physical activity [72] and meditation [50]. Moreover, PAD has been demonstrated to be heritable [11,35] and to have a polygenic overlap with brain disorders such as schizophrenia, bipolar disorder, multiple sclerosis, and Alzheimer's disease [35]. Furthermore, the high degree of genetic correlation found among psychiatric and some neurological disorders suggests that current diagnostic boundaries do not necessarily reflect underlying biology [14]. Hence, defining a novel phenotype capturing global age-related changes in brain structure could, via variants in the sequence of the genome that associate with these changes, provide novel biological insights.
Here we present a new brain age prediction method ( Figure 1A) that uses a 3D CNN trained on MRIs to predict brain age. The input data are a T1-weighted image registered to Montréal Neurological Institute (MNI) space and data derived from the T1-weighted image, i.e., a Jacobian map, and gray and white matter segmented images ( Figure 1B). The input data also include information about the subject's sex and the type of MRI scanner. The output of the network is the predicted brain age.
As mentioned above, Cole et al. [11] trained a 3D CNN to perform brain age prediction. Our network is different in four key ways. 1) We use a significantly different architecture. While their architecture resembles a standard VGGNet architecture [66] our architecture uses the recent ResNet design [26]. One of the drawbacks of the VGG architecture is that the vanishing gradient problem limits the potential depth of the network. In contrast, the ResNet architecture has no such depth limits. ResNets also have smoother loss surfaces [48], which in turn helps speeding up convergence. 2) We add inputs to the final CNN layer to factor in information about sex and scanner. 3) Our technique is the first to use deformation information encoded in Jacobian maps to predict brain age. 4) As we have mentioned, our method combines predictions from multiple CNNs by either averaging predictions or by training a data blender.
In experiments, we compare our proposed method to a few brain age prediction methods based on feature extraction and machine learning. We also demonstrate that transfer learning is useful for adapting a CNN trained to predict brain age on one site to a new site while retaining predictive accuracy. And we look at how the PAD calculated with our method is affected by random weight initialization and retraining.We then check for associations between PAD and performance on neuropsychological tests. Finally, we 2 perform genetic analysis on PAD using UK Biobank data, resulting in identification of associations with five sequence variants for which we provide detailed phenotypic characterizations.
(A) (B) Figure 1: (A) A flowchart showing a high-level overview of the proposed brain age prediction system. (B) Examples of image types generated by the preprocessing step. From left to right: a registered T1-weighted slice, a Jacobian map slice, a GM segmented slice and a WM segmented slice.

Combining CNN outputs improves prediction accuracy
Our brain age prediction method was developed using images from structural brain MRIs for 1264 healthy Icelanders. To overcome problems caused by training a DL method on such a small dataset we use multiple images of the same individuals and utilize a data augmentation strategy. We start off by training the method independently on the four previously mentioned image types (Table 1A). The CNN that predicts the test set with the least error is the CNN trained on T1-weighted images followed by the CNN trained on WM segmented images. 1 Having four predictions from four different data sources opens up the possibility of combining the predictions. The most straightforward way of fusing the forecasts is by using a majority voting scheme, e.g. by averaging the predictions made by the four CNNs. Another way to combine forecasts is to implement a data blender, for example, by implementing a linear regression model trained to predict brain age from the four CNN brain age predictions. This technique attempts to find the best linear combination of the four brain age predictions so in theory it should be guaranteed to be at least as good as the best predicting CNN method. To demonstrate this, we tried combining CNN brain age predictions using majority voting and linear regression data blending (Table  1B). Comparing the test set results of Table 1B to the results in Table 1A, we see that combining predictions results in lower test error than achieved by the CNN trained on T1-weighted images.
It is not straightforward to compare the accuracy of our method to previous brain age prediction methods, because they are evaluated on other datasets. However, to establish a baseline for the CNN based techniques, we investigated methods based on feature extraction such as surface-based morphometry (SBM) [19], voxel-based morphometry (VBM) [3], and similarity matrices. Machine learning regression methods were trained on these three types of features separately (Table 1C). 2 If we compare the results in Table 1B and 1C we see that by combining CNN outputs we get predictions that are more accurate than those based on the feature extraction methods.

Testing the CNN on other datasets
Next, we examine how the method performs if we predict brain age of images from other datasets. To do so, we evaluate it on the IXI 3 and UK Biobank [74] datasets and combine predictions using majority voting. We use this combination method rather than data blending, because it has similar accuracy to linear regression blender with the added benefit that it is unnecessary to train an extra linear model on the predictions. We observe that the initial prediction error of the method is high ( Table 2). The problem is that there can be subtle differences between data from different scanning sites which will cause a model trained on one site to fail when predicting on the other site. There are multiple reasons for this. The MRI scanner type and parameters between sites can be different, which can cause differences between resolution, contrast and noise levels. Also, the distribution of age can be different between sites, for example, it is problematic if the new site has a wider age range than the training set.
We hypothesize that a CNN that is already proficient at predicting brain age at one site only needs a small adjustment to adapt to data from a new site. A transfer learning strategy achieves this: First, we freeze the model weights of the convolutional layers so that only the fully connected layers are trainable. Second, the CNN is re-trained on a portion of the data from the new site. An advantage of this strategy is that there are now fewer parameters to train, which means we can use less data and training will be faster. We carry out the transfer learning strategy by re-training the majority voting CNN on 440 images from the IXI dataset.
The re-trained CNN is validated on 104 images from the IXI dataset left out during training (validation set) and tested on 12395 images from the UK Biobank dataset (test set). Table 2 shows that the prediction accuracy is increased significantly by doing so. 4 Surprisingly the accuracy of predictions for the UK Biobank site improve even though the CNN was not explicitly trained on it. This is intriguing and is perhaps explained by the fact that the IXI set includes a wider age range than the Icelandic set and includes 3T MRI images unlike the Icelandic set.

Effect of random CNN weight initialization on PAD
We know that because CNNs start out in random initial states, and because they have highly non-convex loss functions [48], it is possible that two randomly initialized instances of our brain age prediction method will converge to two distinct local minima. These states could in theory both predict age equally well but have uncorrelated PAD values. Here we face a potential problem, because in the absence of a ground truth for the PAD there is no way to tell if either one of these PAD predictions is accurate. This sort of unreliable CNN behaviour would be problematic for any downstream analysis that utilizes the brain age prediction, because any conclusions made about the PAD would depend on the initialization of the CNN. In light of this, it would be reassuring if we could demonstrate that our method generally converges to similar PAD predictions after training.
To test this, four additional randomly initialized instances of our brain age prediction method are trained and the agreement between their PADs is examined. This procedure entails repeating these three main steps four times: 1) Train four CNNs on the Icelandic dataset on the four previously mentioned image types. 2) Freeze convolutions layers and train the CNNs on the IXI dataset (transfer learning step). 3) Predict brain age in the UK Biobank dataset using CNNs, combine the predictions with majority voting and calculate PAD values.
After repeating these steps, we get four instances of the brain age prediction method that predict brain age of the 12395 subjects in the UK Biobank with mean absolute error (MAE) equal to 4.6, 5.5, 5.4, and 4.9 respectively. The reason why the error is higher here compared to the original results is that we did not spend as much time optimizing these CNNs. Nevertheless, if we look at the agreement of the original and the four new PADs we find that the intraclass correlation (ICC) is estimated to be equal to 0.86 (95% confidence interval [CI] = [0.855, 0.863]). This indicates that the UK Biobank PAD calculated using our method stays rather consistent between the five different training runs and is relatively robust to random weight initialization.

Associations between PAD and performance on neuropsychological tests
As mentioned above, previous studies have linked high PAD to cognitive impairment [21,24,10,49]. In light of this, we are interested in looking at if PAD associates with performance on neuropsychological tests. Specifically, performance on tests administered by the UK Biobank that are designed to measure: fluid intelligence, numeric memory, visual memory, prospective memory, simple processing speed, complex processing speed, visual attention, and verbal fluency. To estimate PAD in the UK Biobank, we train four CNNs on the Icelandic set, then the IXI set using transfer learning, and combine their predictions using majority voting. We see from Table 3 that PAD is associated with worse performance on the digit substitution test (DSST), trail making tests (TMTs), and the reaction time test. 5 As expected, these results indicate that PAD is in fact associated with cognitive impairment.

Genome-wide association study
PAD has previously been shown to be heritable [11,35], however, to our knowledge no sequence variants conferring risk of or protecting against PAD have been identified. In order to look for such variants, we ran a genome wide association scan (GWAS) in the UK Biobank sample on PAD (same PAD as Section 2.4). This scan yields two sequence variants, rs2435204-G and rs1452628-T ( Figure 2 and Table 4A). Additionally, given that sequence variants known to associate with brain structure are likely to be enriched for variants that associate with PAD. We decided to test a smaller set of 331 brain structure variants for association with PAD 6 . This yielded associations with three additional variants (Table 4B). The high number of tests conducted in GWAS combined with the general small effect size of common markers greatly increases the risk of a false postives [67]. To protect against potential confound effects we adjusted for variables, such as age, gender, total intracranial volume, and removed individuals of non-white British ancestry 7 . And then to thoroughly vet each hit we took three steps. 1) We performed a replication test on held out data. 2) Looked at whether the association results are affect by randomly 1) The five reported sequence variants also associated with PAD in a replication set of 4453 subjects (Table 12 [Appendix G]). Four other variants which came up in the discovery stage were omitted because they did not replicate. 2) The five sequence variants still associate with PAD even after we randomize the CNN weights and retrain the method (Table 13 [Appendix H]). This fits with the PAD ICC results from Section 2.3, which indicate that retraining the method should not have a large effect on any downstream analysis. The reason why the original PAD generally has a stronger association than the other four PAD estimates could be due to the Winner's curse [80], or the fact that the new PAD estimates are less optimized. 3) The identified sequence variants also associate with brain structure likely to be affected by brain ageing. 8 Table 7 (Appendix F) shows that both PAD and rs1452628-T associate with lower cerebrospinal fluid (CSF) throughout the cerebral cortex which is consistent with reduced cortical sulcal openings. On the other hand, rs2435204-G associates with lower total white matter surface area, and reduced area in a number of cortical brain regions (Table 8). We also see that the other three sequence variants and PAD are associated with numerous structural brain phenotypes (Tables 9-11). In addition, we scanned for the phenotype effects of all five single-nucleotide polymorphisms (SNPs) in the UK Biobank data analyzed by the Roslin Institute [7] (Supplementary Tables 1-5).

Discussion
Here, we have presented a novel deep learning approach, using residual convolutional neural networks to predict brain age from a T1-weighted MRI, a Jacobian map, and gray and white matter segmented images, to study the discrepancy between age-related structural brain changes and chronological age. The MRI based deep learning system was shown to predict brain age from T1-weighted MRI data with a MAE = 3.39 and R 2 = 0.87 on test data. Comparing our approach to other machine learning methods trained on surface-based morphometry, voxel-based morphometry, and similarity matrix features, we showed that our approach predicts brain age more accurately. We showed that transfer learning can be used to efficiently increase prediction accuracy for new sites. The PAD calculated using this method was shown to be relatively robust to random weight initialization and retraining, a result that indicates that the PAD estimated using our method can be used as a reliable phenotype in the study of brain ageing, as well as in the study of specific disorders of the brain. We also proposed that PAD could be an informative phenotype for genetic association studies, and indeed, our association analysis of PAD in a discovery set of 12395 subjects and replication set of 4453 subjects yielded five sequence variants.
The sequence variant with the strongest association, rs2435204-G, tags the H2 (inverted) form of the 17q21.31 inversion polymorphism [71]. This inversion spans approximately 1 Mb and includes 10 genes, including MAPT, a gene that encodes the tau protein which has been implicated in various dementias [58]. In addition, micro-deletions within the inversion are known to cause intellectual disability [40]. The H1 inversion haplotype has been associated with increased risk of Parkinson's disease, male pattern baldness, and several other phenotypes, whereas H2 has been associated with a number of phenotypes including neuroticism [54], fibromyalgia [43], lower educational attainment, increased fecundity [39], and smaller intracranial volume 9 (ICV) [33]. Due to the extensive linkage disequilibrium (LD) the 17q21.31 inversion region, reported markers for various associations in the region often differ between studies. For example, the most recent GWAS meta-analysis of Parkinson's disease reports an association with rs17649553-T, that is fixated on and highly correlated with the H2-tagging rs2435204-G (r 2 = 0.82, D = 1), with OR = 0.78 (95% CI = [0.76, 0.80]), P = 1.26 · 10 −68 [15].
rs2435204-G also associates with brain structure phenotypes. Table 8 (Appendix F), shows that both PAD and rs2435204-G associate with increased thickness and decreased area in cortical brain regions. Interestingly, this pattern of increased thickness and decreased area has previously been associated with neuroticism [61]. Thus, lifestyle or phenotypes associated with a high neuroticism score, including anxiety, worry, fear, anger, frustration, depressed mood and loneliness may associate with PAD.
The other genome-wide significant sequence variant, rs1452628-T, is located close to KCNK2 (also known as TREK1 ), which belongs to the two-pore domain potassium channel family and is mainly expressed in the brain [28]. In mice, KCNK2 has been implicated in neuroinflammation [4], cerebral ischemia [6], and blood-brain barrier dysfunction [78]. rs1452628-T correlates with SNPs that have previously been associated with cortical sulcal opening and GM thickness, rs6667184 (r 2 = 0.68), and rs864736 (r 2 = 0.49) [44].
In addition, we identified three sequence variants associated with PAD by restricting the analysis to SNPs known a priori to associate with structural phenotypes. 1) rs2790099-C is located in an intron of RUNX2, a gene that encodes the RUNX2 protein which is essential for osteoblastic differentiation and skeletal morphogenesis and has 9 PAD is adjusted for ICV, thus the observed effect on PAD is not caused by ICV.
been shown to play several roles in cell cycle regulation [73]. Supplementary Figure 12 shows that rs2790099-C is a possible cis-eQTL of RUNX2 and it is most expressed in the basal ganglia (caudate and putamen). This lines up with the a priori brain structure GWAS that shows that rs2790099-C has genome-wide significant associations with white matter volume of regions in the basal ganglia (putamen and pallidum) ( Table 9). 2) rs6437412-C is an intron variant of LINC01968 that associates with increased cortical CSF (Table 10) and hematological traits (Supplementary Table 4). 3) rs2184968-C is located in an intron of CENPW, a gene that has previously been associated with traits, such as, height [55], cognitive performance [47], and male-pattern baldness [57]. Our analysis shows that rs2184968-C is associated with increased CSF in subcortical regions and increased size of the fourth ventricle (Table 11).
Confound effects are a problem for big imaging studies due to the huge number of imaging artifacts that can potentially influence both imaging and non-imaging variables of interest [67]. Some of the confound effects we have tried to control for are effects due to age, sex, head size, population structure, and scanner type. Head motion is another potentially problematic confound effect, because it causes reduction of estimated gray matter volume and thickness in MRI images similar to what we expect to see due to ageing [60]. While head motion is not important in the evaluation of our method (see Cole et al. [11]), it is potentially a problematic confound for GWAS analysis because certain clinical groups associate more with scanner motion. Elliott et al. [17] suggest to use fMRI-derived head motion estimates to correct for confound effects due to head motion when running GWAS analysis on brain structure phenotypes. We did try to correct PAD for head motion as they suggest, however, this correction only had a small effect on our results. Other potential confounds that we looked at were sample relatedness (the first 40 principal from components genetic ancestry analysis), genotyping array, and the assessment center were neuropsychological testing was performed. As with head motion, adjusting for these variables did not affect our results.
From our analysis we see that PAD associated with worse performance on neuropsychological tests, specifically poor performance on DSST, TMT, and the reaction time tests (Table 3). Interestingly, both the DSST and the reaction time test are designed to measure cognitive processing speed. The TMT is designed to asses visual attention. However, psychomotor speed is a factor in successful TMT performance [63]. Furthermore, a decline in processing speed along with impairment of reasoning, memory, and executive function are well documented to occur in age-associated cognitive decline [16]. As such, these results are in line with other studies that link high PAD to cognitive impairment [21,24,10,49]. We note, that the association between PAD and TMT is consistent with the previous finding of Cole et al. [10]. However, the large dataset used here gives more conclusive results. Supporting this, we additionally find that schizophrenia, a brain disorder characterized by complex patterns of cognitive impairment, correlates with positive PAD (greater brain ageing than chronological age) and (Table 6 [Appendix E]).
In conclusion, we have presented a new method for predicting brain age using cuttingedge machine learning techniques. Our deep learning method produces a single measure (PAD) from raw MRI data that captures complex underlying correlated changes in MRI and can be used to study various traits and diseases, and in particular for genetic discovery. Using such a method represents one potential way for overcoming challenges with high dimensional data and multiple testing that plagues MRI research. By applying our method to large genomic datasets such as the UK Biobank has enabled us to identify 10 novel genetic components that influence brain ageing. Further research into these components has potential to shed more light on the biological underpinnings of the ageing brain and its connection to various diseases and disorders.

Datasets
The proposed method was evaluated on T1-weighted MR images from three independent datasets: an Icelandic dataset, the UK Biobank dataset, and the IXI dataset. DeCODE genetics provided the Icelandic MR data, consisting of scans from 1264 healthy subjects aged between 18 and 75 years. This dataset includes 1815 scans in total, since some subjects have several scans. The Icelandic data were acquired using two different scanners, a 1.5T Phillips Achieva scanner, and a 1.5T Siemens Magnetom Aera scanner. Scans were imaged using a T1-weighted gradient echo sequence (Philips Achieva: repetition time (TR) = 8. . Any serious neurological disorders were prescreened and removed. Additionally, we removed from the training and holdout sets subjects diagnosed with neurodevelopmental and mental disorders such as autism, bipolar disorder, intellectual disability, or schizophrenia, and subjects with any copy number variations previously associated with neurodevelopmental or psychiatric disorders.
The UK Biobank dataset 10 consists of T1-weighted MR images of 15040 healthy subjects aged between 46 and 79 years old. The data were all collected using a 3T Siemens Skyra scanner. It is well-known that the presence of undetected population structure can lead to both false positive results and failure to detect genuine associations in genetic association studies [51], in an effort to combat this our analysis was constrained to 12395 individuals of white British ancestry. An additional release of MRI images by UK Biobank was added to a replication set. This set contains 6888 subjects (thereof 4458 subjects of white British ancestry) aged between 47 and 80 years old. The images in this set were collected using the same protocol as the previous UK Biobank set.
The IXI dataset consists of T1-weighted MR images of 544 healthy subjects and is freely available online. The subjects age at imaging was between 20 and 86 years old. The IXI data was collected from three different sites. The Hammersmith Hospital using a Philips 3T system, Guy's Hospital using a Philips 1.5T system and the Institute of Psychiatry using a GE 1.5T system. 11

Preprocessing
Preprocessing was carried out using the computational anatomy toolbox (CAT12) [23]. First, the input data were inhomogeneity corrected. Then the skull and other nonbrain elements were removed. Finally, the images were registered into the standard MNI space using the deformable registration algorithm DARTEL [2]. For further information, refer to the CAT12 manual [8].
There are three types of images that the preprocessing step generates. The first is an MNI-registered image. Second, a Jacobian map which is a by-product of the deformable registration. Lastly, a gray matter and white matter soft segmented images. All of the image types mentioned above have voxel size 1.5 mm 3 and size 121x145x121.

CNN architecture
The CNN uses a residual architecture [26] as depicted in Figure 3. It consists of five residual blocks, each followed by a max pooling layer of stride 2x2x2 and kernel size 3x3x3, and one fully connected block. The convolutional part of the CNN reduces the input image from size 121x145x121 to 128 feature maps of size 4x5x4. The fully connected part reduces these feature maps down to an age prediction. The residual block, displayed in Figure 4, consists of a combination of layers which are repeated twice inside the residual blocks. This combination is composed of a 3D convolutional layer with stride 1x1x1 and kernel size 3x3x3, a batch re-normalization layer [34], and an ELU activation function [9]. The defining element of the residual block is the skip connection which adds the signal feeding into the residual block to the output of a layer close to the end of the block. The number of feature maps in block number n was chosen by the rule 2 n+2 . The fully connected block, depicted in Figure 5, is a multilayer perceptron (MLP) [81] with one hidden layer. The input layer has 128 × 4 × 5 × 4 = 10240 neurons, the hidden layer (FC 1) has 256 neurons that use an ELU activation function, and the output layer has a single neuron. Following the hidden layer, a dropout [69] layer with keep rate equal to 0.8 is employed. The output layer (FC 2) has no activation function which means that it performs a linear regression on the hidden layer features. To account for factors such as scanner type and sex that can affect the estimated brain age of an individual we include them as inputs in the linear regression by concatenating them with the hidden features of the MLP. The mean absolute error was used as the loss function and the CNN was optimized using Adam [37] with parameters: learning rate = 0.001, decay= 10 −6 , β 1 = 0.9, β 2 = 0.999, and batch size = 4. The He initialization strategy [27] was used to initialize the weights, and each trainable node in the CNN was regularized with l 2 weight decay [42], with λ = 5 × 10 −5 . Early stopping [53] was used, i.e., if the validation error did not improve in 100 epochs the training was stopped. Furthermore, to reduce the risk of overfitting, data augmentation [25] was used to generate new training instances by applying a coordinate transformation to a random subset of the training data, consisting of a combined 3D rotation and a 3D translation. The rotation angles were between -40 and 40 degrees with equal probability, and the translation distance, for each direction, was selected between -10 and 10 voxels with equal probability.

SBM, VBM, and similarity matrix brain age prediction
The SBM features were generated using FreeSurfer's recon-all algorithm 12 [18] and the VBM features were generated using the CAT12 toolbox 13 . The similarity matrix was constructed by taking the inner product between the combined gray and white matter segmented images of each subject. The SBM and VBM features were adjusted for intracranial volume, sex and scanner type. The features were then zero centered and normalized to unit variance. The regression methods that were tested were, linear regression [65], lasso [75], ridge regression [30], elastic net [82], random forest regression [29], and SVR [68]. A grid search was used to find the tuning parameters corresponding to the lowest cross-validation error for the methods mentioned.

Statistical methods
To assessing the accuracy of the machine learning methods we performed simple training and validation splits, and selected a suitable model by evaluating the validation MAE. The subjects from the Icelandic sample were split between these three sets, and if a subject had multiple images, the images were all put in the same set. The data were divided into 64% training set (N s = 809, N i = 1171), 16% validation set (N s = 202, N i = 298), and 20% test set (N s = 253, N i = 346), were N s is the number of subjects and N i is the number of images. When evaluation the machine learning models the MAE and R 2 score for the images in the validation and test set is calculated.
To assess the transfer learning performance, the IXI dataset was split into 80% training set (N = 440), 20% validation set (N = 104) and the whole UK Biobank dataset was used as a test set (N = 12395). As before, we evaluate accuracy by calculating the MAE and R 2 score on the validation and test set.
In order to test the reliability of PAD, the intraclass correlation was calculated with ICCbare from the ICC R package 14 . The 95% confidence interval was estimated using bootstrapping with 2000 sampling iterations.
The Pearson correlation coefficient was calculated in order to test for association between PAD and performance on neuropsychological tests. Before performing the analysis we first adjusted the PAD for age, age 2 , total intracranial volume, sex, the interaction between sex and age. The correction was performed using linear regression. Ten correlation tests were performed, so a Bonferroni adjusted significance level α B2 = 0.05/10 = 0.005.
We performed a GWAS on PAD to find associated sequence variants. For the genetic analysis we used version 3 of the imputed genetic dataset released by UK Biobank in July 2017 [5]. The UK Biobank genetic data was assayed using two very similar genotyping arrays (95% of marker content is shared). Roughly 10% of the subjects were genotyped using applied Biosystems UK BiLEVE Axiom Array by Affymetrix and the rest using the closely related Applied Biosystems UK Biobank Axiom Array [5]. Variants with imputation quality score below 0.3, and minor allele frequency below 0.1% were filtered out, which left~20 million variants to be considered for GWAS. Before performing GWAS, the PAD was adjusted for age, age 2 , total intracranial volume, sex, the interaction between sex and age using linear regression. The adjusted PAD was then normalized with an inverse normal transformation. Sequence variants associated with PAD are only reported if they reach genome-wide significance. If two genome-wide significant variants are in LD (r 2 > 0.1) we report the variant with the lower p-value.
In addition, we tested for association between PAD and sequence variants known to associate with structural brain phenotypes. These variants were found by performing GWAS separately on 305 SBM phenotypes generated with recon-all by using the Freesurfer 6.0 software [18] and 540 VBM phenotypes generated by using CAT12 [23]. All genome-wide significant markers were then aggregated into a single list. In cases where variants were in LD (r 2 > 0.5), only the variant with the lower p-value was selected. The final list included 331 variants, to account for testing test variants for the second time a Bonferroni adjusted significance level α B3 = 0.05 2·331 ≈ 7.5 · 10 −5 was used for the PAD association test.
To reduce the risk of false positive sequence variant associations we additionally checked for association in a replication set of 4453 subjects. To pass this test the association between the variants under consideration and PAD need to show evidence of statistical significance (α R < 0.05).

eQTL analysis
To investigate if any of the variants are expression quantitative trait loci (eQTLs) we used the GTEx database 15 [13]. Our eQTL analysis was carried out by logging onto https://gtexportal.org, typing in the corresponding rs number of identified variants, and checking if they have any associated eQTLs. However, identifying whether a variant is truly causal in both GWAS and eQTL is challenging because of the uncertainty caused by LD [31]. Therefore, we only report variants as eQTLs of genes if they are close to being the most significant eQTL of that specific gene.

Code availability
Any custom code or software used to implement the brain age prediction method detailed in this paper will be made available upon request.

Data availability
The genetic and phenotype datasets generated by UK Biobank used in this study are available via the UK Biobank data access process (see http://www.ukbiobank. ac.uk/register-apply/). Detailed information about the genetic data and MRI data available in UK Biobank is listed here: http://www.ukbiobank.ac.uk/scientists-3/ genetic-data/, https://www.fmrib.ox.ac.uk/ukbiobank/.

Additional results
Appendix A: Chronological age distribution of the datasets Figures 6, 7, 8 show histograms of the distribution of age in the Icelandic dataset, UK Biobank dataset, and IXI dataset.

Appendix B: SBM, VBM, and similarity matrix results
All of the SBM, VBM, and similarity matrix results are shown in Table 5. The method that performed best on the SBM features was a radial basis function (RBF) kernel SVR method with C = 10 and γ = 10 −3 . Similarly, the best performing method on VBM features was an RBF kernel SVR method with parameters C = 100 and γ = 10 −3 . The best performing method on similarity matrix features was a linear kernel SVR method with penalty parameter C = 100.     Figure 11: Four scatter plots that show the effect of using transfer learning, when predicting brain age of the IXI (validation set) and UK Biobank (test set) datasets, using a CNN trained on images from the Icelandic dataset. These plots show the chronological age against brain age predicted by the CNNs trained on T1-weighted images. The top left plot shows the UK Biobank brain age predictions without transfer learning. The top right plot shows the UK Biobank brain age predictions with transfer learning. The bottom left plot shows the IXI brain age predictions without transfer learning. The bottom right plot shows the IXI brain age predictions with transfer learning.

Appendix E: Predicting brain age of neurodevelopmental and mental disorder cases
Subjects from the Icelandic sample diagnosed with neurodevelopmental or mental disorders were left out while training the brain age prediction method. This includes subjects with autism, bipolar disorder, schizophrenia, and intellectual disability. We now look for deviation in PAD in these subjects compared to the PAD of the healthy Icelanders.
A two sample t-test was used to test for the difference between the mean PAD of the neurodevelopmental and mental disorder cases and healthy controls. Before performing the analysis we first compute the PAD. In cases where subjects had multiple images the average PAD was used instead. In this particular instance the PAD was adjusted for age, gender, and total intracranial volume using a generalized additive model. Five correlation tests were performed, so we used a Bonferroni adjusted significance level α B1 = 0.05/5 ≈ 0.01. The controls are taken from the Icelandic test set (N=291) and their average PAD is 0.1. Most of the subjects in the schizophrenia group are young males. Therefore, in an effort to compare similar groups we performed an additional test where only the PAD of males under 35 years is compared. After applying this constraint there are 22 control subjects left and their average PAD is −0.6.  Table 6 shows that PAD is higher in individuals with schizophrenia than in controls. This is consistent with findings from other studies [56,64,41,35] that have looked at brain ageing of schizophrenia patients. Brain structure irregularities in schizophrenia, such as cortical thinning [76] and cerebral ventricular enlargement [79] also seen in healthy ageing [62,36] may be driving the prediction. Why there are structural differences in brains from schizophrenia patients is not fully understood. Psychotic episodes and longtime use of antipsychotic drugs [52] probably contribute. Sequence variants conferring high-risk of the disease have also been shown to affect brain structure in controls [70].
Appendix F: SBM and VBM phenotypes associated with variants from brain age GWAS Table 7: SBM and VBM phenotypes associated rs1452628. Only structural brain phenotypes that reach genome-wide significance in the discovery set (UK Biobank) are listed. In addition, the correlations between PAD and the SBM/VBM phenotypes are shown. Abbreviations: cerebrospinal fluid (CSF), gray matter (GM), gyrus (Gy), inferior (Inf), left hemisphere (LH), lobe (Lo), right hemisphere (RH), superior (Sup), thickness (Thk), volume (Vol).  Appendix F: SBM and VBM phenotype associations for structural brain prior variants