Building a genetic risk model for bipolar disorder from genome-wide association data with random forest algorithm

A genetic risk score could be beneficial in assisting clinical diagnosis for complex diseases with high heritability. With large-scale genome-wide association (GWA) data, the current study constructed a genetic risk model with a machine learning approach for bipolar disorder (BPD). The GWA dataset of BPD from the Genetic Association Information Network was used as the training data for model construction, and the Systematic Treatment Enhancement Program (STEP) GWA data were used as the validation dataset. A random forest algorithm was applied for pre-filtered markers, and variable importance indices were assessed. 289 candidate markers were selected by random forest procedures with good discriminability; the area under the receiver operating characteristic curve was 0.944 (0.935–0.953) in the training set and 0.702 (0.681–0.723) in the STEP dataset. Using a score with the cutoff of 184, the sensitivity and specificity for BPD was 0.777 and 0.854, respectively. Pathway analyses revealed important biological pathways for identified genes. In conclusion, the present study identified informative genetic markers to differentiate BPD from healthy controls with acceptable discriminability in the validation dataset. In the future, diagnosis classification can be further improved by assessing more comprehensive clinical risk factors and jointly analysing them with genetic data in large samples.


Result
The accuracy of the RF procedure was evaluated in the GAIN training set created during the forest growing process. The accuracy of the RF classification in the GAIN data was 0.939 in controls and 0.852 in BPD patients. We ranked markers based on values of the two indices, the Gini Index (GI) and the conditional variable importance (VI) from the RF procedure. Because the two indices did not completely agree with each other, we used the union markers of the top ranked 200 in each index. As Table 1 shows, we included 348 single nucleotide polymorphisms (SNPs) to calculate the performance of discrimination ability. After excluding SNPs with complete linkage disequilibrium (LD), 289 SNPs were retained as the candidate risk markers in the regression model. As shown in Fig. 1A, the 289 candidate markers had a good discrimination ability with an area under the receiver operating characteristic (AUROC) of 0.944 (95% confidence interval (CI), 0.935-0.953), and calibration ability measured by the Hosmer-Lemeshow test (p-value = 0.933). The multivariable logistic regression with stepwise selection retained 121 SNPs as the final optimal marker set. A good discrimination ability was observed with an AUROC of 0.924 (95% CI, 0.913-0.935) based on the 121 markers (Fig. 1B).
The genetic risk score based on this final prediction model was calculated for each individual from accumulating numbers of the risk alleles and weighted by the beta regression coefficient. The risk scores among all participants were ranged from 143.8 to 228.4 in the GAIN dataset, with a mean of 175.4 in the controls and 191.3 in the BPD patients. To find an optimal cutoff point, we used the Youden Index to obtain the risk score cutoff as 184, and the corresponding sensitivity and specificity for BPD were 0.777 and 0.854, respectively ( Table 2). The likelihood ratios of the risk score with the optimal cutoff point were 5.322 for a positive result and 0.261 for a negative result, indicating moderate evidence for the differentiation between BPD patients and healthy controls.
We then used leave-one-out cross-validation to conduct internal validation for classification accuracy ( Table 3). The predicted error rates were around 0.2 using either the 289 candidate risk markers, the 121 optimal markers and the risk score. The STEP dataset was used as an external validation dataset to evaluate the  Table 1. The performance of discrimination ability for the union marker numbers of the top ranked in the two indices, the Gini Index and the conditional variable importance from the random forest procedure. Note: AUROC: the area under the receiver characteristic curve; 95% C.I.: 95% confidence interval. *Markers in complete linkage disequilibrium (LD, D' = 1) were removed before regression analysis and only one SNP was kept for each case of complete LD situation.
performance of discrimination ability. In general, the discrimination performance was acceptable using 289 candidate risk predictors in the STEP data, which had an AUROC of 0.702 (95% CI, 0.681-0.723) and a good calibration ability (Hosmer-Lemeshow test, p-value = 0.681). With only the 121 optimal predictors, the AUROC dropped to 0.639 (95% CI, 0.617-0.662), with a low calibration ability (Hosmer-Lemeshow test, p-value = 0.0002). The risk score method showed the poorest discrimination ability (AUROC = 0.506; 95% CI, 0.482-0.529).   We performed the same set of analyses using STEP data as the training dataset, and the GAIN data as the validation. The results are displayed in Supplementary Table1. In total, 354 candidate risk markers were identified for the STEP dataset. After excluding SNPs with complete LD, 312 SNPs were retained in the regression model and had the discrimination ability with AUROC of 0.934 (95% CI, 0.925-0.944). In the external validation GAIN dataset, decreased but acceptable discrimination performance was again observed, with an AUROC of 0.732 (95% CI, 0.711-0.754) (Supplementary Table1). It is worth noting that there were no overlapping candidate risk markers between the two datasets. If we mapped all candidate risk markers from the two datasets to genes, there were in total 233 gene regions, including 98 genes in the GAIN dataset and 144 genes in the STEP dataset. Only 9 genes (3.8%) overlapped between the two datasets, including genes ALK, TACR1, LRP1B, GALNT17, NAV2, ODZ4, RAD51L1, KTN1, and CACNG2.
Significantly enriched gene-sets were identified for these mapped genes in the two GWA datasets of BPD. Table 4 showed that 43 pathways were identified in the GAIN dataset with a q-value of less than 0.01 after correction for multiple comparisons. In the STEP dataset, 28 significant pathways were identified. Important biological pathways were reported, including cation ion channel activity (such as voltage-gated calcium channel activity and complex, regulation of action potential and cation transport), membrane structure (such as plasma membrane, transmembrane receptor activity and establishment of location), neuron function (such as brain development, axon guidance and GABA receptor activity) and cytoskeleton (such as cytoskeletal protein binding and actin filament).

Discussion
It is a common interest to explore the usage of genetic findings for heritable complex traits. There is an absence in the literature of a risk score model based on genetic information for the diagnosis of BPD. Non-replication across datasets is often observed, especially when focusing on specifically significant markers. Hence, using extremely significant markers in one sample to construct a genetic risk model to apply to other samples might not produce good prediction accuracy. On the other hand, informative genetic markers, which are selected by methods of machine learning, have been used for the classification of outcomes or for predicting the risk of developing diseases, such as early detection of prostate cancer 26 , treatment response in attention deficit hyperactivity disorder 27 , and identification of idiopathic autism spectrum disorder (ASD) patients 28 . Among the many machine learning methods, such as support vector machine, linear discriminant analysis, and k-nearest neighbour classification, RF is often applied in biomedical research with different data sources, such as gene expression 29 . Similar applications are reported using GWA datasets for complex traits with low prediction errors, such as severe asthma 30,31 . To our best knowledge, the present study reports the first prediction results for BPD using an RF approach to select informative markers which jointly consider the main and interaction effects among genetic variants. Our results revealed that these informative markers possess fair to good discriminability for BPD patients in the training and validation datasets.
Diagnoses of psychiatric disorders often largely depend on clinical interview rather than biomarkers. With the RF procedures, the 289 candidate risk predictors in the GAIN dataset perform well with an AUROC of 0.944 and a 0.702 AUROC in the validation dataset. Moreover, high sensitivity and specificity are also observed using the more parsimonious 121 optimal markers and the risk score, with acceptable predicted error rates less than 0.2 in leave-one-out cross-validation. Without the RF procedures, if we selected the same 289 candidate predictors based on p-values significance in the GAIN data to estimate the discriminability in the STEP dataset, the AUROC slightly dropped to 0.686 (data not shown). In the literature, some risk score models were built using genetic information in aid of improving disease classification for complex traits, without satisfactory prediction power. For instance, Golan and colleagues (2014) used the random effect approach and reported the discrimination ability with an AUROC of 0.62 for BPD patients from the Wellcome Trust Case Control Consortium dataset 32 . Using SNPs within 13 significant pathways in a study of ASD, one recent study included 237 SNPs to generate a genetic diagnostic classifier and reported an 85.6% prediction accuracy in the Central European cohort, but the accuracy dropped to 50.6% in the Han Chinese cohort 33 . Moreover, based on differentially expressed 762 unique genes, a previous study reported an 82.5% prediction accuracy for ASD 34 . In our study, the RF approach demonstrated a fine performance in selecting the informative genetic markers from massive GWA data. The classification accuracy for BPD in the current study is at the higher end with low error rates. In particular, we still obtained fair results with an AUROC of 0.702 in the STEP validation dataset.
It is commonly observed that the accuracy of the genetic prediction model is reduced in external validation samples 33,34 . Schulze and colleagues (2014) constructed a polygenic model for BPD, however, the performance of this model is poor in two external validation datasets, with AUROC ranged between 0.55 to 0.57 35 . The heterogeneity inherited in different studies and samples is often noted, which might reflect differences in sample ascertainment, population stratification, or experimental variations. An example of this is demonstrated in a large-scale study of Psychiatric Genomics Consortium (PGC) for population stratification. Using the multivariate linear mixed model approach, Maier and colleagues (2015) created genomic risk scores for severe psychiatric disorders, including schizophrenia, BPD, and major depressive disorder, using GWA data in PGC as the training set. In the validation data, the correlation coefficients between the observed status of the psychiatric disorders and their predicted genomic risk scores were low, ranged from 0.076-0.224 7 . To evaluate the population stratification, they calculated ancestry principle components of PGC data and then divided GWA data into four groups of the first ancestry principal component that reflect the population difference between individuals. Their results indicated significant heterogeneity for BPD in PGC GWA datasets (p-value = 0.0017), which is likely attributed to the ancestral population differences 7 . Therefore, heterogeneity derived from many sources might result in lowered prediction accuracy in an external dataset and hinders clinical usage and further application to assist diagnosis.
We ran both GWA datasets as the training and the other as the validation dataset for model construction. In either scenario, a very similar classification performance is observed, suggesting the stability of current Continued procedures. However, we also noticed that there were no overlapping markers selected by RF in the two datasets as candidate risk markers. The agreement increased in gene levels across both datasets, where the same 9 genes are mapped in the two datasets. It may be intuitive, as genetic markers identified are often not causal variants, but rather the proxy for real causal variants. Therefore, the agreement may be the least in marker level, and the increase in gene and pathway levels, especially when heterogeneity exists among datasets. The 9 genes contained both sets of the candidate risk markers, and many studies have indicated that some of these genes are associated with BPD or brain function, such as ODZ4 36 , TACR1 37 , KTN1 38 and CACNG2 39 . A previous GWA study using 11,974 BPD cases and 51,792 controls, identified a new intronic variant in ODZ4 36 . A recent study indicated that the genetic variants showed specific volumetric effects on the putamen and altered the expression of the KTN1 gene in both brain and blood tissue 38 . Similarly, we found a number of significant pathways for the identified genes. These pathway results are quite consistent with pathways findings in previous GWA studies for bipolar disorder 40,41 . There were some limitations in the present study. Although risk score models have been used to capture genetic effects from large-scale GWA studies, the power of discrimination was, however, inadequate in previous studies. Dudbridge and colleagues (2013) indicated that the power of polygenic score might be sufficient to use about 2,000 cases and controls, respectively 42 . Even with a good discrimination ability, a smaller sample size of the present study might cause the low power of classification model for BPD patients. In addition, we only used the genetic information to create the risk score model for BPD. The complex psychiatric disorders were caused by the interactions of genetic and environmental risk factors, such as substance dependence and childhood maltreatment. Wong and colleagues (2012) used a risk-classification tree analysis to create a reliable framework based on interactions of genetic variants and environmental factors 43 . Lacking the information from environmental factors might hinder the application of a prediction model for clinical diagnosis.
In conclusion, we successfully used a machine learning approach to extract informative genetic markers for the construction of a risk score model. Our results indicated a fair discrimination ability for BPD patients with AUROCs of around 0.70 in the external validation datasets. Integration of more comprehensive risk factors from family and environmental data in larger samples is necessary to construct a more precise and applicable risk score model for BPD, to assist with clinical diagnosis in the future.

Materials and Methods
The study design and analysis flow chart are displayed in Fig. 2. Details of the datasets and analytic procedures for the selection of candidate markers in model construction are described below. Imputation and quality control in the GWA datasets. We used two individual GWA datasets of BPD in the Caucasian populations, the GAIN (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?login= &page= login) 24 and the STEP data (https://www.nimhgenetics.org/available_data/bipolar_disorder/) 25 . The details of participant enrolment and genotyping of the two GWA studies were provided in their original articles 24,25 . In brief, in the GAIN dataset, individuals were Americans with European ancestry, including 1,001 BPD cases and 1,034 controls. In the STEP dataset, there were 955 BPD cases and 1,498 healthy subjects from the National Institute of  Mental Health Genetics Initiative. After quality control for genotypic data, a total of 699,550 and 370,995 autosomal SNPs were retained in the GAIN and STEP datasets, respectively. Because the two GWA studies used different genotyping platforms, we imputed all the autosomal SNPs for the two GWA datasets based on those genotyped SNPs that passed quality control, to obtain the maximal number of common SNPs for the construction of prediction models for the two datasets. We applied Markov Chain Haplotyping (MACH) 1.0 44 to perform genotype imputation, and the HapMap II CEU (release 22) samples were used as the reference panel (www.hapmap.org). Well-imputed SNPs had squared correlations ≥ 0.30 between imputed and true genotypes, which were suggested by MACH. After removing markers with minor allele frequency (MAF) less than 1%, the number of well-imputed SNPs was 2,238,297 and 2,109,280 for the GAIN and the STEP datasets, respectively. In total, there were 1,992,730 well-imputed SNPs overlapped in the two datasets.
Criteria for candidate markers in the construction of prediction models. To construct the genetic prediction model for BPD, we first performed association analyses with the additive model for 1,992,730 well-imputed SNPs using PLINK versions 1.07 45 . As the study design flow chart shows in Fig. 1, SNPs with p-value ≤0.01 were selected as candidate SNPs in RF analysis, for which 19,701 SNPs were in the GAIN dataset and 19,524 SNPs were in the STEP dataset. Each candidate SNP was then mapped to a gene region (using NCBI build 36) if the SNP was located within the 50 kb of upstream or downstream of a gene. These SNPs were mapped to 2,832 genes in the GAIN dataset, and 3,156 genes in the STEP dataset. In total, there were 802 genes overlapped in both GWA datasets.

Random forest procedures.
To select informative risk predictors for model construction, we used the RF method for classification and building regression trees. All candidate SNPs were used to build a training model (i.e. 19,701 candidate SNPs in the GAIN dataset). The results from the growing of ensemble trees as a forest could provide a list of important variables for disease outcome. The RF procedures were performed using the Random Jungle package 46 , which facilitates the rapid analysis of large-scale GWA data 47 . Detailed procedures are described in the following steps: (1) Two-thirds of the subjects in the GWA dataset were taken as the training set using the bootstrap procedure, and the remaining subjects were treated as the test set. (2) Second, is the splitting step. A random subset of markers was chosen from all candidate markers without replacement. The size of each marker subset was equal to the square root of numbers of all candidate markers. The decrease in impurity for all markers was then calculated. The definition of the decrease in impurity was detailed elsewhere 46 . The marker with the best classification by the decrease in impurity was used as a node to split subjects of a training set into two distinct subsets, that is, one node split into two nodes. (3) The 2 nd splitting step was repeated until the tree is grown with its largest extent in the tree growing step. (4) Steps 1 to 3 were repeated to grow 5,000 classification trees to build a forest. (5) The prediction error of particular markers was estimated by permutation procedure from the test set. (6) Two indices of the RF procedures were used for the selection of relevant risk predictors, the GI and the VI. The GI represents the total decrease in impurity of the whole dataset by summing the probability of each risk predictor being chosen multiplied by the probability of a mistake in categorizing a subset. The VI means the decrease in accuracy for every predictor. To avoid the bias induced by including highly correlated candidate markers such as SNPs in LD, we used the conditional permutation scheme in the tree building procedure 48 . The conditional importance permutation groups were created, which involve all variables of a Pearson's correlation coefficient of r ≥0.2, that is, the dependency structure between SNPs in linkage disequilibrium was preserved in the calculation of VI. To obtain an appropriate amount of predictors, the corresponding top ranked 200 SNPs of the GI and the conditional VI were considered as the candidate risk predictors for model construction in the next step.
Model construction and performance evaluation. Among the candidate risk predictors identified in the GAIN dataset using the RF analysis, we applied multivariable logistic regression with variable-selection methods (i.e. stepwise selection) to select the optimal predictors and to obtain p-values, odds ratios (OR), and 95% CI. The genetic risk score for each individual was calculated by summing across all predictors in the model using the numbers of risk allele multiplied by the beta regression coefficient of each marker. The highest Youden Index 49 was used to define the optimal cutoff point, which equals to (sensitivity + specificity)-1.
We examined the performance of the classification models by several indices. First, the discrimination capability of the established prediction model was assessed with the receiver operating characteristic (ROC) curve, and an AUROC was calculated. The ROC curve was plotted by false-positive rate versus sensitivity measure. The goodness of fit for each prediction model was assessed by the Hosmer-Lemeshow test, which calculates the difference between the predicted and the observed risk. Leave-one-out cross-validation was performed for internal validation to obtain a bias-corrected estimation of error rate in prediction. The STEP GWA dataset was used as external validation for the prediction models. Statistical analyses in this stage were performed with SAS version 9.2 (SAS Institute, Cary, NC).
Identified significantly enriched pathways. We used the Molecular Signatures Database (MSigDB, http://www.broadinstitute.org/gsea/msigdb/annotate.jsp) to examine the common processes or the underlying biological gene sets of the selected candidate genes 50 . In the present study, we used databases including GO terms (domains in biological process, cellular component and molecular function), chromosome positional gene sets, and the curated gene sets (e.g. canonical pathways, KEGG, Biocarta and Reactome). In total, 3,100 collections of gene sets were available, which includes 45,956 unique genes. Enriched gene sets were identified using the hypergeometric method, with the false discovery rate less than 0.01 50 .