An accurate risk score model has substantial benefits in assisting the early screening of diseases, patient management, and clinical diagnosis. In addition, a risk score model has been applied to the prognosis prediction for complex traits, such as cardiovascular diseases1,2 and cancers3. With the advances in generating and the accumulation of genetic information, especially the increasing popularity in genome-wide association (GWA) studies to provide more comprehensive information about genetic variation for the trait of interest, some studies have incorporated such data to construct risk score models for disease diagnosis, such as schizophrenia and post-traumatic stress disorder4,5,6,7.

Although many GWA studies have been conducted for heritable traits, only a few susceptible loci are reproducibly reported, with a small effect size between 1.2 and 2.08,9. In a recent massive GWA study of schizophrenia, 108 loci were identified. However, each of these loci contributed a tiny fraction of the risk in the population; even the polygenic risk score explained less than 3.4% of the variance10. The general consensus is that multiple genes may act synergistically and interactively to influence the risk for complex diseases11. Jointly considering the main and interaction effects among multi-loci in whole-genome analyses is advantageous in constructing the genetic risk score model. Traditional statistical approaches in GWA studies usually focus on detecting the main effects without considering the non-linear patterns among markers and their interactions. Machine learning methods have fewer assumptions about analysis models, and have been frequently used in the analysis of high-dimensional data12. The random forest (RF) method is one commonly used algorithm in machine learning techniques. Because the hierarchical decision tree structure in RF can model non-linear associations, this method is often used to capture interactions among loci, and is likely to perform well in uncovering the genetic causes underlying the etiology of complex disorders.

The RF is an ensemble-based machine learning method that uses multiple classification and regression trees as classifiers. After taking the majority vote over all classifiers, the RF method combines information across all trees to reveal variable importance. It can assess associations and interactions simultaneously13. The RF has been applied to many biological studies, such as gene expression14, metabolomics15, proteomics16, and GWA17. These studies showed that the RF method provides good accuracy, less internal examination of error, and high variable importance from mass biological data. Thus, important variables (e.g. biomarkers or SNPs in genetic data) that are selected by the RF procedure form a useful basis to construct risk score models for the risk of developing complex disorders.

In the current study, we focused on a complex psychiatric trait, bipolar disorder (BPD) with a high heritability of around 0.6 to 0.818,19, as the target for the risk score model construction. BPD is a severe psychiatric disorder, which affects approximately one in a hundred people worldwide. Many studies have suggested that the prevalence of BPD has increased worldwide in the last decade20,21,22. Without proper treatment, 15% of patients committed suicide23. Patients with BPD experience intermittent manic and depressive episodes, and usually exhibit chronic courses. Several large-scale GWA studies have been conducted to provide a list of susceptible genetic loci for BPD, without considering the join or interaction effects among loci8,24,25. So far, reliable genetic markers or objective biological indices are unavailable for clinical use in assisting diagnosis or prognosis. The diagnosis of BPD largely depends on a subjective report of patient’s syndromes and clinical characteristics.

In the present study, we used the RF-based method to construct a genetic risk prediction model for BPD, using pre-screened potentially associated markers in large-scale GWA datasets. The procedures are as follows. First, the RF method was used to select important variables as candidate risk markers in a BPD GWA dataset from the Genetic Association Information Network (GAIN). Second, the multivariable logistic regression with variable-selection methods was used to further select a smaller optimal subset of the variables from all candidate risk markers. We could then build risk score models using the identified optimal markers for the GAIN-BPD dataset. Third, to estimate the performance of the risk score models, leave-one-out cross-validation was performed for internal validation. We also used the other BPD-GWA dataset, Systematic Treatment Enhancement Program (STEP), as an external validation data. Finally, we performed a gene sets analysis to uncover the underlying biological mechanisms for those loci that were identified as the candidate risk markers.


The accuracy of the RF procedure was evaluated in the GAIN training set created during the forest growing process. The accuracy of the RF classification in the GAIN data was 0.939 in controls and 0.852 in BPD patients. We ranked markers based on values of the two indices, the Gini Index (GI) and the conditional variable importance (VI) from the RF procedure. Because the two indices did not completely agree with each other, we used the union markers of the top ranked 200 in each index. As Table 1 shows, we included 348 single nucleotide polymorphisms (SNPs) to calculate the performance of discrimination ability. After excluding SNPs with complete linkage disequilibrium (LD), 289 SNPs were retained as the candidate risk markers in the regression model. As shown in Fig. 1A, the 289 candidate markers had a good discrimination ability with an area under the receiver operating characteristic (AUROC) of 0.944 (95% confidence interval (CI), 0.935–0.953), and calibration ability measured by the Hosmer-Lemeshow test (p-value = 0.933). The multivariable logistic regression with stepwise selection retained 121 SNPs as the final optimal marker set. A good discrimination ability was observed with an AUROC of 0.924 (95% CI, 0.913–0.935) based on the 121 markers (Fig. 1B).

Table 1 The performance of discrimination ability for the union marker numbers of the top ranked in the two indices, the Gini Index and the conditional variable importance from the random forest procedure.
Figure 1
figure 1

The receiver characteristic curves for (A) the 289 candidate markers and (B) the 121 optimal markers in the GAIN bipolar disorder dataset.

The genetic risk score based on this final prediction model was calculated for each individual from accumulating numbers of the risk alleles and weighted by the beta regression coefficient. The risk scores among all participants were ranged from 143.8 to 228.4 in the GAIN dataset, with a mean of 175.4 in the controls and 191.3 in the BPD patients. To find an optimal cutoff point, we used the Youden Index to obtain the risk score cutoff as 184, and the corresponding sensitivity and specificity for BPD were 0.777 and 0.854, respectively (Table 2). The likelihood ratios of the risk score with the optimal cutoff point were 5.322 for a positive result and 0.261 for a negative result, indicating moderate evidence for the differentiation between BPD patients and healthy controls.

Table 2 Sensitivity, specificity, Youden Index, and the likelihood ratios of the risk score model in the GAIN bipolar disorder dataset as a training set.

We then used leave-one-out cross-validation to conduct internal validation for classification accuracy (Table 3). The predicted error rates were around 0.2 using either the 289 candidate risk markers, the 121 optimal markers and the risk score. The STEP dataset was used as an external validation dataset to evaluate the performance of discrimination ability. In general, the discrimination performance was acceptable using 289 candidate risk predictors in the STEP data, which had an AUROC of 0.702 (95% CI, 0.681–0.723) and a good calibration ability (Hosmer-Lemeshow test, p-value = 0.681). With only the 121 optimal predictors, the AUROC dropped to 0.639 (95% CI, 0.617–0.662), with a low calibration ability (Hosmer-Lemeshow test, p-value = 0.0002). The risk score method showed the poorest discrimination ability (AUROC = 0.506; 95% CI, 0.482–0.529).

Table 3 The performance of discrimination ability for the genetic risk score model in the training set and the validation dataset.

We performed the same set of analyses using STEP data as the training dataset, and the GAIN data as the validation. The results are displayed in Supplementary Table1. In total, 354 candidate risk markers were identified for the STEP dataset. After excluding SNPs with complete LD, 312 SNPs were retained in the regression model and had the discrimination ability with AUROC of 0.934 (95% CI, 0.925–0.944). In the external validation GAIN dataset, decreased but acceptable discrimination performance was again observed, with an AUROC of 0.732 (95% CI, 0.711–0.754) (Supplementary Table1). It is worth noting that there were no overlapping candidate risk markers between the two datasets. If we mapped all candidate risk markers from the two datasets to genes, there were in total 233 gene regions, including 98 genes in the GAIN dataset and 144 genes in the STEP dataset. Only 9 genes (3.8%) overlapped between the two datasets, including genes ALK, TACR1, LRP1B, GALNT17, NAV2, ODZ4, RAD51L1, KTN1, and CACNG2.

Significantly enriched gene-sets were identified for these mapped genes in the two GWA datasets of BPD. Table 4 showed that 43 pathways were identified in the GAIN dataset with a q-value of less than 0.01 after correction for multiple comparisons. In the STEP dataset, 28 significant pathways were identified. Important biological pathways were reported, including cation ion channel activity (such as voltage-gated calcium channel activity and complex, regulation of action potential and cation transport), membrane structure (such as plasma membrane, transmembrane receptor activity and establishment of location), neuron function (such as brain development, axon guidance and GABA receptor activity) and cytoskeleton (such as cytoskeletal protein binding and actin filament).

Table 4 The significant gene sets for bipolar disorder based on the candidate markers from the GAIN and the STEP datasets.


It is a common interest to explore the usage of genetic findings for heritable complex traits. There is an absence in the literature of a risk score model based on genetic information for the diagnosis of BPD. Non-replication across datasets is often observed, especially when focusing on specifically significant markers. Hence, using extremely significant markers in one sample to construct a genetic risk model to apply to other samples might not produce good prediction accuracy. On the other hand, informative genetic markers, which are selected by methods of machine learning, have been used for the classification of outcomes or for predicting the risk of developing diseases, such as early detection of prostate cancer26, treatment response in attention deficit hyperactivity disorder27, and identification of idiopathic autism spectrum disorder (ASD) patients28. Among the many machine learning methods, such as support vector machine, linear discriminant analysis, and k-nearest neighbour classification, RF is often applied in biomedical research with different data sources, such as gene expression29. Similar applications are reported using GWA datasets for complex traits with low prediction errors, such as severe asthma30,31. To our best knowledge, the present study reports the first prediction results for BPD using an RF approach to select informative markers which jointly consider the main and interaction effects among genetic variants. Our results revealed that these informative markers possess fair to good discriminability for BPD patients in the training and validation datasets.

Diagnoses of psychiatric disorders often largely depend on clinical interview rather than biomarkers. With the RF procedures, the 289 candidate risk predictors in the GAIN dataset perform well with an AUROC of 0.944 and a 0.702 AUROC in the validation dataset. Moreover, high sensitivity and specificity are also observed using the more parsimonious 121 optimal markers and the risk score, with acceptable predicted error rates less than 0.2 in leave-one-out cross-validation. Without the RF procedures, if we selected the same 289 candidate predictors based on p-values significance in the GAIN data to estimate the discriminability in the STEP dataset, the AUROC slightly dropped to 0.686 (data not shown). In the literature, some risk score models were built using genetic information in aid of improving disease classification for complex traits, without satisfactory prediction power. For instance, Golan and colleagues (2014) used the random effect approach and reported the discrimination ability with an AUROC of 0.62 for BPD patients from the Wellcome Trust Case Control Consortium dataset32. Using SNPs within 13 significant pathways in a study of ASD, one recent study included 237 SNPs to generate a genetic diagnostic classifier and reported an 85.6% prediction accuracy in the Central European cohort, but the accuracy dropped to 50.6% in the Han Chinese cohort33. Moreover, based on differentially expressed 762 unique genes, a previous study reported an 82.5% prediction accuracy for ASD34. In our study, the RF approach demonstrated a fine performance in selecting the informative genetic markers from massive GWA data. The classification accuracy for BPD in the current study is at the higher end with low error rates. In particular, we still obtained fair results with an AUROC of 0.702 in the STEP validation dataset.

It is commonly observed that the accuracy of the genetic prediction model is reduced in external validation samples33,34. Schulze and colleagues (2014) constructed a polygenic model for BPD, however, the performance of this model is poor in two external validation datasets, with AUROC ranged between 0.55 to 0.5735. The heterogeneity inherited in different studies and samples is often noted, which might reflect differences in sample ascertainment, population stratification, or experimental variations. An example of this is demonstrated in a large-scale study of Psychiatric Genomics Consortium (PGC) for population stratification. Using the multivariate linear mixed model approach, Maier and colleagues (2015) created genomic risk scores for severe psychiatric disorders, including schizophrenia, BPD, and major depressive disorder, using GWA data in PGC as the training set. In the validation data, the correlation coefficients between the observed status of the psychiatric disorders and their predicted genomic risk scores were low, ranged from 0.076–0.2247. To evaluate the population stratification, they calculated ancestry principle components of PGC data and then divided GWA data into four groups of the first ancestry principal component that reflect the population difference between individuals. Their results indicated significant heterogeneity for BPD in PGC GWA datasets (p-value = 0.0017), which is likely attributed to the ancestral population differences7. Therefore, heterogeneity derived from many sources might result in lowered prediction accuracy in an external dataset and hinders clinical usage and further application to assist diagnosis.

We ran both GWA datasets as the training and the other as the validation dataset for model construction. In either scenario, a very similar classification performance is observed, suggesting the stability of current procedures. However, we also noticed that there were no overlapping markers selected by RF in the two datasets as candidate risk markers. The agreement increased in gene levels across both datasets, where the same 9 genes are mapped in the two datasets. It may be intuitive, as genetic markers identified are often not causal variants, but rather the proxy for real causal variants. Therefore, the agreement may be the least in marker level, and the increase in gene and pathway levels, especially when heterogeneity exists among datasets. The 9 genes contained both sets of the candidate risk markers, and many studies have indicated that some of these genes are associated with BPD or brain function, such as ODZ436, TACR137, KTN138 and CACNG239. A previous GWA study using 11,974 BPD cases and 51,792 controls, identified a new intronic variant in ODZ436. A recent study indicated that the genetic variants showed specific volumetric effects on the putamen and altered the expression of the KTN1 gene in both brain and blood tissue38. Similarly, we found a number of significant pathways for the identified genes. These pathway results are quite consistent with pathways findings in previous GWA studies for bipolar disorder40,41.

There were some limitations in the present study. Although risk score models have been used to capture genetic effects from large-scale GWA studies, the power of discrimination was, however, inadequate in previous studies. Dudbridge and colleagues (2013) indicated that the power of polygenic score might be sufficient to use about 2,000 cases and controls, respectively42. Even with a good discrimination ability, a smaller sample size of the present study might cause the low power of classification model for BPD patients. In addition, we only used the genetic information to create the risk score model for BPD. The complex psychiatric disorders were caused by the interactions of genetic and environmental risk factors, such as substance dependence and childhood maltreatment. Wong and colleagues (2012) used a risk-classification tree analysis to create a reliable framework based on interactions of genetic variants and environmental factors43. Lacking the information from environmental factors might hinder the application of a prediction model for clinical diagnosis.

In conclusion, we successfully used a machine learning approach to extract informative genetic markers for the construction of a risk score model. Our results indicated a fair discrimination ability for BPD patients with AUROCs of around 0.70 in the external validation datasets. Integration of more comprehensive risk factors from family and environmental data in larger samples is necessary to construct a more precise and applicable risk score model for BPD, to assist with clinical diagnosis in the future.

Materials and Methods

The study design and analysis flow chart are displayed in Fig. 2. Details of the datasets and analytic procedures for the selection of candidate markers in model construction are described below.

Figure 2: The summary description of the selection of candidate markers for the construction of the genetic risk score model.
figure 2

GAIN: the Genetic Association Information Network; STEP: the Systematic Treatment Enhancement Program; MACH: The program Markov Chain Haplotyping.

Imputation and quality control in the GWA datasets

We used two individual GWA datasets of BPD in the Caucasian populations, the GAIN ( and the STEP data ( The details of participant enrolment and genotyping of the two GWA studies were provided in their original articles24,25. In brief, in the GAIN dataset, individuals were Americans with European ancestry, including 1,001 BPD cases and 1,034 controls. In the STEP dataset, there were 955 BPD cases and 1,498 healthy subjects from the National Institute of Mental Health Genetics Initiative. After quality control for genotypic data, a total of 699,550 and 370,995 autosomal SNPs were retained in the GAIN and STEP datasets, respectively.

Because the two GWA studies used different genotyping platforms, we imputed all the autosomal SNPs for the two GWA datasets based on those genotyped SNPs that passed quality control, to obtain the maximal number of common SNPs for the construction of prediction models for the two datasets. We applied Markov Chain Haplotyping (MACH) 1.044 to perform genotype imputation, and the HapMap II CEU (release 22) samples were used as the reference panel ( Well-imputed SNPs had squared correlations ≥0.30 between imputed and true genotypes, which were suggested by MACH. After removing markers with minor allele frequency (MAF) less than 1%, the number of well-imputed SNPs was 2,238,297 and 2,109,280 for the GAIN and the STEP datasets, respectively. In total, there were 1,992,730 well-imputed SNPs overlapped in the two datasets.

Criteria for candidate markers in the construction of prediction models

To construct the genetic prediction model for BPD, we first performed association analyses with the additive model for 1,992,730 well-imputed SNPs using PLINK versions 1.0745. As the study design flow chart shows in Fig. 1, SNPs with p-value 0.01 were selected as candidate SNPs in RF analysis, for which 19,701 SNPs were in the GAIN dataset and 19,524 SNPs were in the STEP dataset. Each candidate SNP was then mapped to a gene region (using NCBI build 36) if the SNP was located within the 50 kb of upstream or downstream of a gene. These SNPs were mapped to 2,832 genes in the GAIN dataset, and 3,156 genes in the STEP dataset. In total, there were 802 genes overlapped in both GWA datasets.

Random forest procedures

To select informative risk predictors for model construction, we used the RF method for classification and building regression trees. All candidate SNPs were used to build a training model (i.e. 19,701 candidate SNPs in the GAIN dataset). The results from the growing of ensemble trees as a forest could provide a list of important variables for disease outcome. The RF procedures were performed using the Random Jungle package46, which facilitates the rapid analysis of large-scale GWA data47. Detailed procedures are described in the following steps:

  1. 1

    Two-thirds of the subjects in the GWA dataset were taken as the training set using the bootstrap procedure, and the remaining subjects were treated as the test set.

  2. 2

    Second, is the splitting step. A random subset of markers was chosen from all candidate markers without replacement. The size of each marker subset was equal to the square root of numbers of all candidate markers. The decrease in impurity for all markers was then calculated. The definition of the decrease in impurity was detailed elsewhere46. The marker with the best classification by the decrease in impurity was used as a node to split subjects of a training set into two distinct subsets, that is, one node split into two nodes.

  3. 3

    The 2nd splitting step was repeated until the tree is grown with its largest extent in the tree growing step.

  4. 4

    Steps 1 to 3 were repeated to grow 5,000 classification trees to build a forest.

  5. 5

    The prediction error of particular markers was estimated by permutation procedure from the test set.

  6. 6

    Two indices of the RF procedures were used for the selection of relevant risk predictors, the GI and the VI. The GI represents the total decrease in impurity of the whole dataset by summing the probability of each risk predictor being chosen multiplied by the probability of a mistake in categorizing a subset. The VI means the decrease in accuracy for every predictor. To avoid the bias induced by including highly correlated candidate markers such as SNPs in LD, we used the conditional permutation scheme in the tree building procedure48. The conditional importance permutation groups were created, which involve all variables of a Pearson’s correlation coefficient of r 0.2, that is, the dependency structure between SNPs in linkage disequilibrium was preserved in the calculation of VI. To obtain an appropriate amount of predictors, the corresponding top ranked 200 SNPs of the GI and the conditional VI were considered as the candidate risk predictors for model construction in the next step.

Model construction and performance evaluation

Among the candidate risk predictors identified in the GAIN dataset using the RF analysis, we applied multivariable logistic regression with variable-selection methods (i.e. stepwise selection) to select the optimal predictors and to obtain p-values, odds ratios (OR), and 95% CI. The genetic risk score for each individual was calculated by summing across all predictors in the model using the numbers of risk allele multiplied by the beta regression coefficient of each marker. The highest Youden Index49 was used to define the optimal cutoff point, which equals to (sensitivity + specificity)-1.

We examined the performance of the classification models by several indices. First, the discrimination capability of the established prediction model was assessed with the receiver operating characteristic (ROC) curve, and an AUROC was calculated. The ROC curve was plotted by false-positive rate versus sensitivity measure. The goodness of fit for each prediction model was assessed by the Hosmer-Lemeshow test, which calculates the difference between the predicted and the observed risk. Leave-one-out cross-validation was performed for internal validation to obtain a bias-corrected estimation of error rate in prediction. The STEP GWA dataset was used as external validation for the prediction models. Statistical analyses in this stage were performed with SAS version 9.2 (SAS Institute, Cary, NC).

Identified significantly enriched pathways

We used the Molecular Signatures Database (MSigDB, to examine the common processes or the underlying biological gene sets of the selected candidate genes50. In the present study, we used databases including GO terms (domains in biological process, cellular component and molecular function), chromosome positional gene sets, and the curated gene sets (e.g. canonical pathways, KEGG, Biocarta and Reactome). In total, 3,100 collections of gene sets were available, which includes 45,956 unique genes. Enriched gene sets were identified using the hypergeometric method, with the false discovery rate less than 0.0150.

Additional Information

How to cite this article: Chuang, L.-C. and Kuo, P.-H. Building a genetic risk model for bipolar disorder from genome-wide association data with random forest algorithm. Sci. Rep. 7, 39943; doi: 10.1038/srep39943 (2017).

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.