Genetic validation of bipolar disorder identified by automated phenotyping using electronic health records

Chen, Chia-Yen; Lee, Phil H.; Castro, Victor M.; Minnier, Jessica; Charney, Alexander W.; Stahl, Eli A.; Ruderfer, Douglas M.; Murphy, Shawn N.; Gainer, Vivian; Cai, Tianxi; Jones, Ian; Pato, Carlos N.; Pato, Michele T.; Landén, Mikael; Sklar, Pamela; Perlis, Roy H.; Smoller, Jordan W.

doi:10.1038/s41398-018-0133-7

Download PDF

Article
Open access
Published: 18 April 2018

Genetic validation of bipolar disorder identified by automated phenotyping using electronic health records

Chia-Yen Chen^1,2,3,4,5,
Phil H. Lee^1,3,4,5,
Victor M. Castro ORCID: orcid.org/0000-0001-7390-6354^1,6,7,
Jessica Minnier⁸,
Alexander W. Charney^9,10,11,
Eli A. Stahl^9,10,
Douglas M. Ruderfer¹²,
Shawn N. Murphy^7,13,14,
Vivian Gainer⁷,
Tianxi Cai¹⁵,
Ian Jones¹⁶,
Carlos N. Pato¹⁷,
Michele T. Pato¹⁷,
Mikael Landén^18,19,
Pamela Sklar^9,10,11,
Roy H. Perlis^1,3,6 &
…
Jordan W. Smoller^1,3,5

Translational Psychiatry volume 8, Article number: 86 (2018) Cite this article

2805 Accesses
18 Citations
19 Altmetric
Metrics details

Subjects

Abstract

Bipolar disorder (BD) is a heritable mood disorder characterized by episodes of mania and depression. Although genomewide association studies (GWAS) have successfully identified genetic loci contributing to BD risk, sample size has become a rate-limiting obstacle to genetic discovery. Electronic health records (EHRs) represent a vast but relatively untapped resource for high-throughput phenotyping. As part of the International Cohort Collection for Bipolar Disorder (ICCBD), we previously validated automated EHR-based phenotyping algorithms for BD against in-person diagnostic interviews (Castro et al. Am J Psychiatry 172:363–372, 2015). Here, we establish the genetic validity of these phenotypes by determining their genetic correlation with traditionally ascertained samples. Case and control algorithms were derived from structured and narrative text in the Partners Healthcare system comprising more than 4.6 million patients over 20 years. Genomewide genotype data for 3330 BD cases and 3952 controls of European ancestry were used to estimate SNP-based heritability (h²_g) and genetic correlation (r_g) between EHR-based phenotype definitions and traditionally ascertained BD cases in GWAS by the ICCBD and Psychiatric Genomics Consortium (PGC) using LD score regression. We evaluated BD cases identified using 4 EHR-based algorithms: an NLP-based algorithm (95-NLP) and three rule-based algorithms using codified EHR with decreasing levels of stringency—“coded-strict”, “coded-broad”, and “coded-broad based on a single clinical encounter” (coded-broad-SV). The analytic sample comprised 862 95-NLP, 1968 coded-strict, 2581 coded-broad, 408 coded-broad-SV BD cases, and 3 952 controls. The estimated h²_g were 0.24 (p = 0.015), 0.09 (p = 0.064), 0.13 (p = 0.003), 0.00 (p = 0.591) for 95-NLP, coded-strict, coded-broad and coded-broad-SV BD, respectively. The h²_g for all EHR-based cases combined except coded-broad-SV (excluded due to 0 h²_g) was 0.12 (p = 0.004). These h²_g were lower or similar to the h²_g observed by the ICCBD + PGCBD (0.23, p = 3.17E−80, total N = 33,181). However, the r_g between ICCBD + PGCBD and the EHR-based cases were high for 95-NLP (0.66, p = 3.69 × 10^–5), coded-strict (1.00, p = 2.40 × 10⁻⁴), and coded-broad (0.74, p = 8.11 × 10^–7). The r_g between EHR-based BD definitions ranged from 0.90 to 0.98. These results provide the first genetic validation of automated EHR-based phenotyping for BD and suggest that this approach identifies cases that are highly genetically correlated with those ascertained through conventional methods. High throughput phenotyping using the large data resources available in EHRs represents a viable method for accelerating psychiatric genetic research.

Clinical and genetic contributions to medical comorbidity in bipolar disorder: a study using electronic health records-linked biobank data

Article 28 March 2024

Jorge A. Sanchez-Ruiz, Brandon J. Coombes, … Joanna M. Biernacka

Dissecting clinical heterogeneity of bipolar disorder using multiple polygenic risk scores

Article Open access 18 September 2020

Brandon J. Coombes, Matej Markota, … Joanna M. Biernacka

Clinical characteristics indexing genetic differences in bipolar disorder – a systematic review

Article 15 November 2023

Hanna M. van Loo, Ymkje Anna de Vries, … Kenneth S. Kendler

Introduction

Although twin studies first documented the high heritability of bipolar disorder (BD) decades ago, only recently have robustly associated genetic risk loci been identified through genomewide association studies (GWAS)^{1,2,3,4,5,6,7,8}. At present, the major rate-limiting step for GWAS of BD is the need for ever-larger sample sizes to detect both common modest-effect variants and rarer large effect variants. In recent years, the widespread adoption of longitudinal electronic health records (EHRs) has provided a vast and growing repository of phenotypic data that can be leveraged for psychiatric research⁹. In particular, when linked to sample collections through biobanks and other efforts, EHR data provide a relatively untapped opportunity to enhance the power of genetic research. Nevertheless, establishing the validity of EHR-derived phenotypes remains an important pre-requisite for leveraging these resources.

In an effort to rapidly increase available samples for genomewide studies of BD, we established the International Cohort Collection for Bipolar Disorder (ICCBD) through which we applied high-throughput phenotyping methods at sites in the United States (US), United Kingdom (UK), and Sweden⁷. At the US site (Partners Healthcare), we developed and applied EHR phenotyping algorithms to identify approximately 4500 cases and 5000 controls for whom DNA was obtained from discarded blood samples. The use of EHR data to define valid phenotypes is particularly challenging for psychiatric disorders. Because there are no pathognomonic laboratory or pathologic findings, psychiatric diagnosis has traditionally relied on self-reported symptoms, behavioral observations, and clinical judgment. Thus, genomic studies have typically utilized structured or semi-structured diagnostic interviews as the gold-standard method to establish case and control status. EHR data, on the other hand, are limited to information (e.g., billing codes, medication lists, and narrative notes) collected in the course of clinical care rather than for research purposes. Recognizing this, we have undertaken systematic efforts to evaluate the validity of our EHR-based phenotyping algorithms.

In an earlier report¹⁰, we described the development of our automated phenotyping algorithms for BD cases and controls. Briefly, we developed four case definitions, one of which included natural language processing of narrative EHR notes and three based on structured coded data using rule-based classifiers that differed in their stringency. Another rule-based algorithm was developed to identify controls. To establish the clinical validity of these algorithms, we conducted an in-person diagnostic validation study (N = 190) in which algorithm diagnoses were compared to diagnoses made by blinded expert clinicians using a gold-standard in-person diagnostic interview (SCID-IV). Three of the four case definitions achieved high positive predictive value (PPV) compared with diagnostic interviews (up to 0.86) and the PPV for the control algorithm was 1.0. Thus, we demonstrated that automated EHR-based phenotyping can be used to identify clinically valid case and control definitions for BD. However, an important remaining question is whether these case and control sets are genetically comparable to traditionally ascertained samples that have been used in most genomic studies of BD. This is an important issue in evaluating whether EHR-based samples can be combined (e.g., through meta-analyses) with data from other ongoing genomic studies (e.g., by consortia such as the Psychiatric Genomics Consortium) to enhance gene discovery.

Here, we report genetic validation of our EHR phenotyping algorithms by using genomewide data to estimate their SNP-based heritability (h²_g) and genetic correlation (r_g) with other large-scale traditionally ascertained BD GWAS samples. We further examined genetic correlations with other phenotypes of interest and performed genome-wide heterogeneity testing to validate the consistency of genome-wide association results. Our results demonstrate that automated EHR phenotyping can be used to assemble case/control cohorts that are both clinically and genetically comparable to traditionally ascertained samples and thus represent a valuable tool for accelerating psychiatric genetic research.

Materials and methods

Study subjects

Cases and controls were collected as part of the International Cohort Collection for Bipolar Disorder (ICCBD), a US, UK, and Swedish consortium established to accelerate genomic studies of BD by applying high throughput phenotyping methods^7,10. The Massachusetts General Hospital site of the ICCBD aimed to collect DNA from 4500 cases and 4500 controls by linking discarded blood samples to de-identified EHR data. As described in detail elsewhere¹⁰, cases and controls were identified by deriving EHR-based phenotyping algorithms applied to the Partners Healthcare Research Patient Data Registry (RPDR), which spans more than 20 years of data from 4.6 million patients. As the first step in building our BD phenotyping algorithms, we created a “datamart” of 52,235 individuals by filtering medical records to identify patients seen at Massachusetts General Hospital, Brigham and Women’s Hospital, or McLean Hospital who had at least one diagnosis of bipolar disorder (ICD- 9 and DSM-IV-TR codes 296.4*–296.8*) or manic disorder (ICD 296.0*–296.1*). Next, four phenotyping algorithms were developed to identify cases and one algorithm to identify controls.

The development and clinical validation of case and control algorithms described here is adapted from Castro et al. ¹⁰. The five phenotyping algorithms developed comprised the following:

1.
95-NLP: This BD case algorithm incorporated natural language processing (NLP) of narrative notes using the i2b2 suite of software¹¹. Expert clinicians manually reviewed 612 notes from 209 randomly selected patients to identify gold-standard cases and to extract relevant features from narrative notes to be processed by NLP. We trained a model based on 414 features to predict the probability of BD using a logistic regression classifier with the adaptive least absolute shrinkage and selection operator (LASSO) procedure. The final model, comprising 13 features, achieved an area under the receiver operating curve (AUC) of 0.93, with a sensitivity of 0.53 when the specificity was set to 0.95.
2.
Coded-strict: This algorithm was a rule-based classifier that required at least three ICD codes for BD, a predominance of BD diagnoses in the longitudinal record, and either (a) treatment with lithium or valproate within a year of BD diagnosis or (b) treatment at a bipolar specialty clinic.
3.
Coded-broad: This algorithm required at least two ICD codes for BD, a predominance of BD diagnoses, and treatment with at least two bipolar medications (lithium, valproate, carbamazepine, or an atypical antipsychotic).
4.
Coded-broad-SV: This algorithm was the same as “Coded-broad” except that two or more BD diagnoses were allowed to occur during the same inpatient or outpatient episode of illness.
5.
Controls: This algorithm defined controls as those age 30 years or older with no ICD-9 codes or history of medications related to a psychiatric or neurological condition.

As reported earlier, we conducted a direct-interview study to examine the predictive validity of these algorithms. Patients in the Partners Healthcare system who were identified by each algorithm as BD cases or controls were invited by mail to participate. Participants in the validation study were independent of the patient samples on which the algorithms were trained. After informed consent was obtained, participants underwent semistructured diagnostic interviews (SCID-IV) conducted by experienced doctoral-level clinicians blinded to classifier diagnosis. To further preserve clinician blinding, we recruited individuals from MGH clinics who reported a previous diagnosis of schizophrenia or major depressive disorder, disorders commonly considered in the differential diagnosis of BD. The protocol was approved by the Partners HealthCare system institutional review board. A total of 190 participants were interviewed and PPVs for each algorithm were calculated as the proportion of algorithm defined BD cases (or controls) who received a concordant diagnosis by SCID interview. The PPVs for each algorithm using a non-hierarchical approach (where each case was assigned to any algorithm for which they satisfied inclusion criteria) are shown in Table 1 and reported in Castro et al.¹⁰.

Table 1 SNP-based heritability (h²g) for EHR-based bipolar disorder from the Partners Healthcare Research Patient Data Registry

Full size table

DNA sample collection and genotyping

The phenotyping algorithms were applied to the Partners Healthcare system to ascertain case and control DNA samples by linking phenotypic data to discarded blood samples as previously described^10,11,12. In brief, case and control medical record numbers are submitted to the Partners HealthCare Crimson system, which acts as an “honest broker” to match deidentified phenotypic data to discarded blood samples under an approved Institutional Review Board (IRB) protocol. Genotyping was performed in five batches that included case and control samples at the Broad Institute of MIT and Harvard using the Illumina PsychChip, a genomewide array that includes ~250,000 common variants as a backbone for GWAS imputation, ~250,000 rare variants, and ~50,000 additional markers focusing on psychiatrically relevant loci.

Genotype quality control (QC) and imputation

A total of 3772 BD cases and 4141 controls with genomewide data were available for this analysis. The number of genotyped BD cases by algorithm and controls are listed in Table 1 and described in Supplementary Figure 1. We performed QC on each genotyping batch separately with the following steps: (1) we removed single nucleotide polymorphisms (SNPs) with genotype missing rate >0.05 before sample-based QC; (2) we excluded samples with genotype missing rate >0.02, absolute value of heterozygosity >0.2, or failed sex checks; (3) after sample-based QC, we then removed SNPs with missing rate >0.02, with differential missing rate between cases and controls >0.02, or failed Hardy–Weinberg equilibrium test (p-value <1.0 × 10⁻⁶ in controls and p-value <1.0 × 10⁻¹⁰ in cases). To merge genotyping batches for imputation and analyses, we performed batch QC by removing SNPs with differential missing rate >0.005 between batches or significant batch association (p-value <5.0 × 10⁻⁸ between controls form different batches). All QC were conducted using PLINK v1.9¹³.

The BD cases and controls included individuals from diverse populations. To control for population stratification and ensure the comparability between the current sample and previous European ancestry BD GWAS, we extracted samples with European ancestry for imputation and analyses. We used the International Haplotype Map project (HapMap3) samples as a population reference panel and performed principal component analysis (PCA) with the study samples and HapMap3 samples combined¹⁴. We calculated the distance between each study sample and the average European population samples in HapMap3 using PC1 and PC2. We selected the study samples with distance to average European HapMap3 samples <0.01 (Supplementary Figure 2–4)¹⁴. We also removed one sample from each pair of related or duplicate samples (\(\hat \pi\)> 0.2).

The final analytic dataset comprised 3330 BD cases (862 95-NLP, 1968 coded-strict, 2581 coded-broad, and 408 coded-broad-SV) and 3952 controls. The sum of the individual cases groups exceeds 3330 due to the non-hierarchical design in which cases were assigned to each phenotype for which they met inclusion criteria. We performed two-step genotype imputation with Eagle2 software for pre-phasing and IMPUTE2 on the European population study samples^15,16.

Statistical analysis

To assess whether our EHR-based phenotypes capture heritable components of BD, we used LD score regression (LDSC)^7,17,18 to estimate SNP-based heritability (h²_g) for each EHR-based BD cohort. We then examined the degree to which heritable influences on our BD phenotypes overlap with those traditionally ascertained BD cases in other large-scale GWAS samples. To do this, we used LDSC to compute the genetic correlation (r_g) between EHR-based BD samples and previously published BD GWAS by other ICCBD cohorts and the PGC (ICCBD + PGCBD)^7,17,18. The LDSC requires association summary statistics for genome-wide SNPs to estimate h²_g and r_g. To obtain these summary statistics, we first performed GWAS for each of the four EHR-based BD definitions separately and for our combined BD case–control sample. We used a BD prevalence of 1% to obtain liability-scale h²_g from LDSC^19,20,21,22. Prior studies have documented substantial genetic correlation between BD and other psychiatric disorder phenotypes, most notably schizophrenia (SCZ) and major depressive disorder (MDD)²⁰. To examine the genetic relationship between EHR-based BD samples and related phenotypes, we used LD Hub²³ to estimate r_g with schizophrenia (SCZ), major depressive disorder (MDD), subjective well-being, and, as a negative control, mean platelet volume (MPV). Finally, we performed genome-wide Cochran’s Q-test to look for heterogeneity between association summary statistics from the EHR-based BD samples and the ICCBD + PGCBD samples at single variant level, using SNPs with association p-value <0.001 in the ICCBD + PGCBD GWAS.

Results

We first estimated SNP-based heritability (h²_g) for the four EHR-based BD samples (Table 1). The liability-scale h²_g estimates were largest for the 95-NLP BD algorithm (0.24, p = 0.015) and smallest for the coded-broad-SV algorithm (0.0, p = 0.59), with intermediate but statistically significant values for the coded-strict and coded-broad algorithms. The h²_g of BD in the ICCBD + PGCBD sample was 0.23, which matches the h²_g for the 95-NLP algorithm but is greater than that of the rule-based algorithms. Of note, the coded-broad-SV case set had the least power with only 408 cases. As shown in Table 1, this distribution of heritability estimates mirrors the relative PPVs obtained in our clinical validation study. To maximize the BD case–control sample size, we combined the BD case–control samples across algorithms into a single case–control dataset. Since the coded-broad-SV had no evidence of heritability, we created two combined BD datasets; one included all BD cases and one included all but the coded-broad-SV cases). The h²_g was 0.11 (p-value = 0.006) for all algorithms combined BD and 0.12 (p-value = 0.004) for all algorithms excluding coded-broad-SV.

We next estimated the SNP-based genetic correlation (r_g) between the EHR-based BD samples and the ICCBD + PGCBD samples (Table 2). The r_g estimates were 95-NLP (0.66), coded-strict (1.0), and coded-broad (0.74) were all statistically significant. (Note that r_g could not be estimated for coded-broad-SV given its h²_g of 0). The r_g for all algorithms excluding coded-broad-SV was 0.83 (p = 7.19 × 10⁻⁷). Adding coded-broad-SV BD cases to the combined case set did not substantially change the r_g estimate although the standard error (SE) increased and p-value rose to 2.88 × 10^-6. We also estimated the pairwise r_g between the EHR-based BD case–control samples and the final combined BD samples. As expected, the r_g estimates are high and ranged from 0.90 to 0.98 between algorithms, and were 1.00 between each algorithm and the combined sample (excluding coded-broad-SV). Finally, the r_g between ICCBD and PGCBD was 1.00 (SE = 0.065, p-value = 1.45 × 10^-74).

Table 2 SNP-based genetic correlation (r_g) between EHR-based bipolar disorder and bipolar disorder ascertained by traditional methods from ICCBD + PGCBD

Full size table

Given prior evidence that traditionally ascertained BD GWAS show significant positive genetic correlations with SCZ and MDD^18,20 and significant negative genetic correlation with subjective well-being²⁴, we examined these correlations using our EHR-based algorithms as another index of their genetic validity. As a negative control, we also examined their genetic correlation with mean platelet volume (MPV), a phenotype for which we would not expect significant genetic correlation. (Fig. 1; Supplementary Table 1). For comparison, we also examined these patterns of correlation with BD in traditionally ascertained samples (ICCBD + PGCBD). As expected based on prior data^18,20,24, EHR-based BD was positively genetically correlated with SCZ and MDD, negatively correlated with subjective well-being, and uncorrelated with MPV (Fig. 1). Of note, the BD-MDD genetic correlation was higher than the BD-SCZ genetic correlation using EHR-based BD (left side, Fig. 1) while the opposite was true using the ICCBD + PGCBD cohort (right side, Fig. 1). We hypothesized that this difference in r_g patterns might be related to differences in the proportions of BD subtypes among the EHR-based BD cases and those included in the traditionally ascertained BD samples. To investigate this, we calculated the percentage of BD case subtypes, including bipolar I disorder (BD1), bipolar II disorder (BD2), schizoaffective disorder bipolar type (SAB), and bipolar disorder not otherwise specified (NOS) for the EHR-based BD cases and the ICCBD cases (the subtypes of PGCBD cases were not available). We found that the EHR-based BD cases comprised a lower proportion of SAB subtype cases (0.6–1.6%) compared with the ICCBD samples (9.1%) (Supplementary Figure 5). This difference would be consistent with a relatively smaller BD-SCZ genetic correlation seen in the EHR-based samples compared with the ICCBD samples.

Finally, we performed Cochran’s Q-test to identify potential heterogeneity of the association summary statistics between EHR-based BD samples and the ICCBD + PGCBD samples. This analysis was restricted to SNPs with association p < 0.001 in the ICCBD + PGCBD GWAS in order to exclude SNPs with weak association results whose directionality might be less robust. We identified a single locus with significant heterogeneity across the genome after Bonferroni correction (SNP N = 28,320) for both coded-broad and for the combined EHR-BD sample (excluding coded-broad-SV) (Fig. 2). This locus on chromosome 22 (peak Q-test p-value at rs196065 = 3.34 × 10⁻⁷), showed modest association with BD (p-value = 5.78 × 10⁻⁵ in ICCBD + PGCBD) and did not overlap with any previously reported BD-associated loci. Thus, we found negligible evidence of heterogeneity of genomewide association results between EHR-based BD and traditionally ascertained BD.

**Fig. 2: Genome-wide Cochran’s Q-test for heterogeneity of SNP effects between ICCBD + PGCBD and EHR-based bipolar disorder.**

Discussion

As an ever-growing longitudinal repository of the clinical phenome, EHRs represent a new and powerful resource for psychiatric research⁹. Nevertheless, their utility depends on the validity of the clinical and phenotypic data that can be extracted. We have previously demonstrated the feasibility of deriving diagnoses with high predictive value compared with a gold standard of clinician-administered diagnostic interviews¹⁰. However, in the context of psychiatric genetic research, establishing the genetic validity of these phenotypes is crucial. In the present study, using genomewide genotype data for more than 7000 cases and controls, we demonstrate that EHR-based algorithms can be used to ascertain BD phenotypes that are heritable and genetically comparable to traditionally ascertained samples. Automated algorithm-based phenotyping linked to biospecimens provides substantial efficiencies in terms of the time and costs involved in assembling large-scale samples for genetic research. Prior simulations have documented up to a ten-fold reduction in the cost associated with phenotyping and sample collection¹¹. Using our case/control BD definitions linked to discarded blood samples, we were able to collect approximately 5000 controls over 10 weeks and more than 4000 cases over 3 years. As described below, three sets of findings from our analyses are particularly noteworthy.

First, our results document that EHR-based diagnostic algorithms can be used to ascertain BD phenotypes that yield SNP-based heritability comparable to that observed in GWAS that have relied on more time-intensive, cost-intensive, and labor-intensive recruitment and diagnostic evaluation. The highest heritability (0.24) was seen with our 95-NLP algorithm which combined NLP of narrative test features and coded EHR data. This estimated heritability was nearly identical to that derived from GWAS of the larger traditionally ascertained cohorts of the international ICCBD and PGC (h²_g = 0.23 for 13,902 cases and 19,279 controls). The 95-NLP algorithm also achieved the highest positive predictive value in our previous clinical validation study. For two of the remaining three algorithms which involved rule-based algorithms of structured EHR data, we also observed significant, though relatively lower, heritability estimates (h²_g = 0.09–0.12). The least restrictive algorithm (coded-broad-SV) did not exhibit significant heritability, though the small sample size of this subgroup limited the power of our analyses. Of note, this last algorithm also performed poorly in our prior clinical validation study (PPV = 0.5). Nevertheless, the overall heritability of our EHR-based BD was 0.12 (p = 0.004), dropping slightly to 0.11 (p = 0.006) when the coded-broad-SV was included. In addition, the EHR-based BD definitions were nearly perfectly genetically correlated. Pairwise genetic correlations between the phenotypes ranged from 0.98 to 1.0 except for 95-NLP and coded-broad-SV (r_g = 0.90).

Second, we found that our cohorts ascertained by automated EHR phenotyping exhibited substantial genetic correlations (r_g) with the large ICCBD + PGCBD samples. Overall, the r_g between our EHR-based BD case/control samples and the ICCBD + PGCBD samples was 0.83 (p = 2.88 × 10⁻⁶), demonstrating that our approach captures genetic influences that strongly overlap with those acting on BD in traditionally ascertained samples. In addition to providing further genetic validation of EHR-derived phenotypes, these results indicate that such samples can be combined with other existing samples to enhance the power of genetic discovery.

Finally, we demonstrate that our phenotyping approach replicates patterns of cross-disorder genetic overlap that have previously been reported in genetic studies of BD^7,25. In particular, EHR-based BD exhibited positive genetic correlations with SCZ and MDD and negative correlations with subjective well-being. Once again, this supports the genetic validity of our algorithm-defined BD phenotype. Unexpectedly, the genetic correlation with SCZ was less than that seen with MDD, a finding that may be attributable to the relatively low frequency of SAB cases in our sample.

We acknowledge that our results have certain limitations. First, our sample size, while substantial, is smaller than that of some other existing samples (e.g., ICCBD and PGCBD), which may have limited the power and precision of our heritability and genetic correlation analyses. Second, the portability of our specific phenotyping algorithms to other healthcare settings remains to be determined. Notably, however, our results demonstrate that a range of algorithms—with and without NLP and using diagnostic rules of varying stringency—yield phenotypes that are clinically and genetically comparable to those obtained by in-person standardized diagnostic assessments. With the exception of the 95-NLP algorithm, each of our case and control phenotypes are based on structured diagnostic codes and medication data that are widely available and standardized in EHR systems, facilitating their application in other systems. In the case of the 95-NLP algorithm, the narrative features comprise nine features (enumerated in Castro et al.¹⁰) that refer to either a diagnosis, medication, or visit type—none of which we would expect to be unique to our healthcare system. There is now a growing literature demonstrating the portability of EHR-based diagnostic algorithms²⁶. In particular, the Electronic Medical Records and Genomics (eMERGE) Network has created a catalog of EHR-based diagnostic phenotypes (PheKB) that includes NLP-based algorithms developed in one healthcare system and applied in others²⁷. Finally, diagnostic misclassification could bias or reduce the power of genetic analyses^28,29. However, this caveat applies to any phenotyping method, and we note that our clinical validation studies¹⁰ demonstrate relatively low levels of diagnostic misclassification in comparison to a gold-standard clinician-administered diagnostic interview.

In summary, the current study provides the first genetic validation of EHR-based phenotyping for BD and suggests that automated phenotyping algorithms can identify samples that are highly genetically correlated with those ascertained through conventional methods. Taken together, the present results and those of our prior clinical validation study, suggest that the use of any or all three of the heritable EHR-based algorithms we derived (i.e., 95-NLP, coded-strict, and coded-broad) can facilitate genetic studies of bipolar disorder. High throughput phenotyping using the large data resources available in the EHR database represents a viable method for accelerating psychiatric genetic research.

References

Schulze, T. G. et al. Two variants in Ankyrin 3 (ANK3) are independent genetic risk factors for bipolar disorder. Mol. Psychiatry 14, 487–491 (2009).
Article CAS PubMed Google Scholar
M hleisen, T. W. et al. Association between schizophrenia and common variation in neurocan (NCAN), a genetic risk factor for bipolar disorder. Schizophr. Res. 138, 69–73 (2012).
Article Google Scholar
Chen, D. T. et al. Genome-wide association study meta-analysis of European and Asian-ancestry samples identifies three novel loci associated with bipolar disorder. Mol. Psychiatry 18, 264–266 (2013).
Article Google Scholar
Psychiatric, G. W. A. S., Consortium Bipolar Disorder Working Group. Large-scale genome-wide association analysis of bipolar disorder identifies a new susceptibility locus near ODZ4. Nat. Genet. 43, 977–983 (2011).
Article Google Scholar
Mühleisen, T. W. et al. Genome-wide association study reveals two new risk loci for bipolar disorder. Nat. Commun. 5, 3339 (2014).
Article PubMed Google Scholar
Cichon, S. et al. Genome-wide association study identifies genetic variation in neurocan as a susceptibility factor for bipolar disorder. Am. J. Hum. Genet. 88, 372–381 (2011).
Article CAS PubMed PubMed Central Google Scholar
Charney, A. W. et al. Evidence for genetic heterogeneity between clinical subtypes of bipolar disorder. Transl. Psychiatry 7, e993 (2017).
Article CAS PubMed PubMed Central Google Scholar
Ikeda, M. et al. A genome-wide association study identifies two novel susceptibility loci and trans population polygenicity associated with bipolar disorder. Mol. Psychiatry 511, 421 (2017).
Google Scholar
Smoller, J. W. The use of electronic health records for psychiatric phenotyping and genomics. Am. J. Med. Genet. Part B 67, 1124 (2017).
Google Scholar
Castro, V. M. et al. Validation of electronic health record phenotyping of bipolar disorder cases and controls. Am. J. Psychiatry 172, 363–372 (2015).
Article PubMed Google Scholar
Murphy, S. et al. Instrumenting the health care enterprise for discovery research in the genomic era. Genome Res. 19, 1675–1681 (2009).
Article CAS PubMed PubMed Central Google Scholar
Kurreeman, F. et al. Genetic basis of autoantibody positive and negative rheumatoid arthritis risk in a multi-ethnic cohort derived from electronic health records. Am. J. Hum. Genet. 88, 57–69 (2011).
Article CAS PubMed PubMed Central Google Scholar
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Article PubMed PubMed Central Google Scholar
Altshuler, D. M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
Article CAS PubMed Google Scholar
Loh, P.-R., Danecek, P., Palamara, P. F., Fuchsberger, C. A., Reshef, Y. K. & Finucane, H. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Article CAS PubMed PubMed Central Google Scholar
Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009).
Article PubMed PubMed Central Google Scholar
Bulik-Sullivan, B. K., Loh, P.-R., Finucane, H. K., Ripke, S. & Yang, J., Schizophrenia Working Group of the Psychiatric Genomics Consortium. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Article CAS PubMed PubMed Central Google Scholar
Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015).
Article CAS PubMed PubMed Central Google Scholar
Lee, S. H., Goddard, M. E., Wray, N. R. & Visscher, P. M. A better coefficient of determination for genetic profile analysis. Genet. Epidemiol. 36, 214–224 (2012).
Article PubMed Google Scholar
Cross-Disorder Group of the Psychiatric Genomics Consortium, Lee, S. H., Ripke, S., Neale, B. M., Faraone, S. V. & Purcell, S. M. et al. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat. Genet. 45, 984–994 (2013).
Article PubMed Central Google Scholar
Merikangas, K. R. et al. Lifetime and 12-month prevalence of bipolar spectrum disorder in the National Comorbidity Survey replication. Arch. Gen. Psychiatry 64, 543–552 (2007).
Article PubMed PubMed Central Google Scholar
Merikangas, K. R. et al. Prevalence and correlates of bipolar spectrum disorder in the world mental health survey initiative. Arch. Gen. Psychiatry 68, 241–251 (2011).
Article PubMed PubMed Central Google Scholar
Zheng, J. et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics 33, 272–279 (2017).
Article PubMed Google Scholar
Okbay, A. et al. Genetic variants associated with subjective well-being, depressive symptoms, and neuroticism identified through genome-wide analyses. Nat. Genet. 48, 624–633 (2016).
Article CAS PubMed PubMed Central Google Scholar
Cross-Disorder Group of the Psychiatric Genomics Consortium. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet 381, 1371–1379 (2013).
Article PubMed Central Google Scholar
Roden, D. M. & Denny, J. C. Integrating electronic health record genotype and phenotype datasets to transform patient care. Clin. Pharmacol. Ther. 99, 298–305 (2016).
Article CAS PubMed PubMed Central Google Scholar
Kirby, J. C. et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J. Am. Med. Inform. Assoc. 23, 1046–1052 (2016).
Article PubMed PubMed Central Google Scholar
Wray, N. R., Lee, S. H. & Kendler, K. S. Impact of diagnostic misclassification on estimation of genetic correlations using genome-wide genotypes. Eur. J. Hum. Genet. 20, 668–674 (2012).
Article PubMed PubMed Central Google Scholar
Duan, R. et al. An empirical study for impacts of measurement errors on EHR based association studies. AMIA Annu. Symp. Proc. 2016, 1764–1773 (2016).
PubMed Google Scholar

Download references

Acknowledgements

This work was supported in part by NIMH grants R01MH085542 (JWS and PS), R01MH085545 (JWS), U01HG008685 (JWS), and K24MH094614 (JWS) and by support from the Demarest Lloyd, Jr. Foundation. Dr. Smoller is a Tepper Family MGH Research Scholar.

Author information

Authors and Affiliations

Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, 185 Cambridge St., Boston, MA, 02114, USA
Chia-Yen Chen, Phil H. Lee, Victor M. Castro, Roy H. Perlis & Jordan W. Smoller
Analytic and Translational Genetics Unit, Center for Human Genetic Research, Massachusetts General Hospital, 185 Cambridge Street, Boston, MA, 02114, USA
Chia-Yen Chen
Center for Genomic Medicine, Massachusetts General Hospital, 185 Cambridge Street, Boston, MA, 02114, USA
Chia-Yen Chen, Phil H. Lee, Roy H. Perlis & Jordan W. Smoller
Department of Psychiatry, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
Chia-Yen Chen & Phil H. Lee
Broad Institute of MIT and Harvard, 75 Ames Street, Cambridge, MA, 02142, USA
Chia-Yen Chen, Phil H. Lee & Jordan W. Smoller
Center for Experimental Drugs and Diagnostics, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
Victor M. Castro & Roy H. Perlis
Partners Research Information Systems and Computing, Partners HealthCare System, One Constitution Center, Charlestown, MA, 02129, USA
Victor M. Castro, Shawn N. Murphy & Vivian Gainer
Oregon Health & Sciences University, 3181 SW Sam Jackson Park Rd, Portland, OR, 97239, USA
Jessica Minnier
Department of Psychiatry, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY, 10029, USA
Alexander W. Charney, Eli A. Stahl & Pamela Sklar
Department of Genetics and Genomic Sciences, Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY, 10029, USA
Alexander W. Charney, Eli A. Stahl & Pamela Sklar
Friedman Brain Institute, Department of Neuroscience, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY, 10029, USA
Alexander W. Charney & Pamela Sklar
Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, 37212, USA
Douglas M. Ruderfer
Department of Neurology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
Shawn N. Murphy
Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, MA, 02115, USA
Shawn N. Murphy
Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA, 02115, USA
Tianxi Cai
National Centre for Mental Health, MRC Centre for Neuropsychiatric Genetics and Genomics, Cardiff University, Cardiff, CF24 4HQ, UK
Ian Jones
SUNY Downstate Medical Center, Brooklyn, NY, 11203, USA
Carlos N. Pato & Michele T. Pato
Department of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
Mikael Landén
Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
Mikael Landén

Authors

Chia-Yen Chen
View author publications
You can also search for this author in PubMed Google Scholar
Phil H. Lee
View author publications
You can also search for this author in PubMed Google Scholar
Victor M. Castro
View author publications
You can also search for this author in PubMed Google Scholar
Jessica Minnier
View author publications
You can also search for this author in PubMed Google Scholar
Alexander W. Charney
View author publications
You can also search for this author in PubMed Google Scholar
Eli A. Stahl
View author publications
You can also search for this author in PubMed Google Scholar
Douglas M. Ruderfer
View author publications
You can also search for this author in PubMed Google Scholar
Shawn N. Murphy
View author publications
You can also search for this author in PubMed Google Scholar
Vivian Gainer
View author publications
You can also search for this author in PubMed Google Scholar
Tianxi Cai
View author publications
You can also search for this author in PubMed Google Scholar
Ian Jones
View author publications
You can also search for this author in PubMed Google Scholar
Carlos N. Pato
View author publications
You can also search for this author in PubMed Google Scholar
Michele T. Pato
View author publications
You can also search for this author in PubMed Google Scholar
Mikael Landén
View author publications
You can also search for this author in PubMed Google Scholar
Pamela Sklar
View author publications
You can also search for this author in PubMed Google Scholar
Roy H. Perlis
View author publications
You can also search for this author in PubMed Google Scholar
Jordan W. Smoller
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jordan W. Smoller.

Ethics declarations

Disclosure

Dr. Smoller is an unpaid member of the Scientific Advisory Board of PsyBrain Inc. and the Bipolar/Depression Research Community Advisory Panel of 23andme.

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Chen, CY., Lee, P.H., Castro, V.M. et al. Genetic validation of bipolar disorder identified by automated phenotyping using electronic health records. Transl Psychiatry 8, 86 (2018). https://doi.org/10.1038/s41398-018-0133-7

Download citation

Received: 30 December 2017
Accepted: 08 January 2018
Published: 18 April 2018
DOI: https://doi.org/10.1038/s41398-018-0133-7

This article is cited by

Development and multi-site external validation of a generalizable risk prediction model for bipolar disorder
- Colin G. Walsh
- Michael A. Ripperger
- Jordan W. Smoller
Translational Psychiatry (2024)
Exome sequencing in bipolar disorder identifies AKAP11 as a risk gene shared with schizophrenia
- Duncan S. Palmer
- Daniel P. Howrigan
- Benjamin M. Neale
Nature Genetics (2022)
Using whole genome scores to compare three clinical phenotyping methods in complex diseases
- Wenyu Song
- Hailiang Huang
- Adam Wright
Scientific Reports (2018)