Introduction

Insomnia is a highly prevalent sleep disorder characterized by the inability to fall asleep or maintain sleep [1]; it affects 10–20% of the adult population [2,3]. It shows heterogeneous phenotypes and equifinality, which might reflect different underlying causal mechanisms [4], including lifestyle, stress and molecular mechanisms (for a review, see [5]). It is commonly comorbid with other physical and psychiatric disorders [6,7].

Genetic contributions to insomnia have been demonstrated in both family and twin studies, with heritability estimated at 25–45% [8]. Candidate gene studies have highlighted genetic variants in numerous systems, including the circadian gene CLOCK [9], the GABAergic system [10], the adenosinergic system [11], and the serotonergic system [12].

A number of genome-wide association studies (GWAS) have examined the insomnia phenotype. Two recent studies assembled large-scale cohorts from UK Biobank and from the combination of UK Biobank and 23andMe, yielding 57 and 202 significant loci, respectively [13,14]. Another study, using survey data from soldiers in the Army Study To Assess Risk and Resilience in Servicemembers (STARRS), identified one significant locus [15]. These studies also identified genetic correlations between insomnia and various clinical conditions, such as schizophrenia, type 2 diabetes, and depression [13,15]. Other studies have identified several insomnia-related genes, such as CACNA1C [16], RBFOX3 [17], PAX8 [18] and MEIS1 [19].

In most previous studies, insomnia phenotypes were assessed through self-report, which could miss useful information and reflect only part of the disorder status. Since insomnia can be a chronic process with different trajectories and multiple complications in clinical settings, it is important to conduct studies specifically targeting clinical patient populations [20]. Because of the complex underlying mechanisms of insomnia and its various clinical manifestations, obtaining a clinically well-defined subject cohort is critical for genetic association analysis. Electronic health records (EHRs) from large medical institutes are a uniquely valuable data source for identifying genetic associations with very specific clinical conditions [21].

In this study, we utilized a large-scale clinical database to explore the genetic underpinnings of insomnia and calculated the genetic correlation between insomnia and various clinical conditions. Further, we conducted a meta-analysis of our results combined with recent insomnia GWAS to discover novel genomic loci.

Methods

Clinical database

All clinical and genetic data in this study were obtained from the Partners Biobank [22]. The Partners Biobank is a large integrated database that contains clinical data from Partners HealthCare for approximately 90,000 consented patients, and genomic data for approximately 25,000 of them. The clinical data include patient family history, demographic information, diagnoses, medication records, lab test results and clinical notes. The clinical data are derived from the electronic health records, which have been collecting patient data since 1990. Informed consent was obtained from all study participants and/or their legal guardians. The study’s protocol was reviewed and approved by the Partners Human Research Committee. All methods were performed in accordance with the relevant guidelines and regulations.

Electronic health record-derived phenotypes

We generated an ICD-9 and ICD-10 code list for insomnia, three major substance use disorders and a series of relevant clinical conditions, including multiple psychiatric disorders and type 2 diabetes, and then used these codes to identify our case cohorts (Supplementary Table 1).

The ICD codes for insomnia include the following definitions:

307.4*: specific disorders of sleep of nonorganic origin; 327.0*: organic disorders of initiating and maintaining sleep; 780.51: insomnia with sleep apnea, unspecified; 780.52: insomnia, unspecified; G47.0*: insomnia; F51.0*: insomnia not due to a substance or known physiological condition.

We reviewed 15,750,104 diagnosis records, collected between 1991 and 2018, to identify patients meeting our insomnia phenotype definition. The control cohort consisted of patients who did not meet the insomnia phenotype definition; it also excluded patients with any other kind of sleep disorder, including snoring, periodic limb movement, sleep-related leg cramps, sleep-related bruxism and hypersomnia.
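The case and control definitions above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the study (the phenotyping analyses were run in R): the pattern list copies the insomnia codes given in the text, a trailing `*` is assumed to mean "any code with this prefix", and the exclusion codes passed in the usage note are hypothetical placeholders for the other sleep disorders listed in Supplementary Table 1.

```python
# Illustrative sketch only. The patterns come from the text; the matching
# logic and the per-patient record layout are assumptions.
INSOMNIA_PATTERNS = ["307.4*", "327.0*", "780.51", "780.52", "G47.0*", "F51.0*"]


def matches(code, pattern):
    """Return True if an ICD code matches a pattern.

    A trailing '*' is treated as a prefix wildcard (assumed convention).
    """
    code = code.strip().upper()
    if pattern.endswith("*"):
        return code.startswith(pattern[:-1])
    return code == pattern


def is_insomnia_code(code):
    """Return True if the code falls under any insomnia pattern."""
    return any(matches(code, p) for p in INSOMNIA_PATTERNS)


def assign_status(patient_codes, other_sleep_codes):
    """Classify one patient as 'case', 'control' or 'excluded'.

    patient_codes: iterable of ICD code strings recorded for the patient.
    other_sleep_codes: set of codes for other sleep disorders
        (hypothetical here; the real list is in the supplementary material).
    """
    codes = [c.strip().upper() for c in patient_codes]
    if any(is_insomnia_code(c) for c in codes):
        return "case"
    if any(c in other_sleep_codes for c in codes):
        return "excluded"  # not insomnia, but not a clean control either
    return "control"
```

For example, `assign_status(["780.52"], {"R06.83", "G47.10"})` classifies the patient as a case, while a patient whose only sleep-related code is in the exclusion set is dropped from the control cohort.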

For the three substance use disorders, the case cohort included patients with at least one corresponding ICD code of substance dependence, substance abuse or long-term substance use disorder. The control group consisted of 12,205 patients without any record of substance use disorder (nicotine, alcohol, opioid, cannabis, cocaine and amphetamine).

Genotyping, imputation and quality control

Genotyping was performed by Partners Biobank using the Illumina Multi-Ethnic Global (MEG) array (Illumina, Inc., San Diego, CA), which includes 1,779,763 SNPs. Prior to imputation, the following QC steps were conducted: (a) sample-level filtration, in which any sample with a discrepancy between reported and predicted sex was removed; and (b) SNP-level filtration, in which sites with invalid alleles, duplicates, monomorphic sites, indels, allele mismatches or a low call rate (less than 90%) were removed. SNPs that were not in the reference panel were also removed. Imputation was performed using the Michigan Imputation Server with Minimac3 [23]. The HRC (Version r1.1 2016) reference panel, consisting of 64,940 haplotypes of predominantly European ancestry, was used [24].

Post-imputation quality control was conducted to select high-quality SNPs and control for population stratification. In all analyses, only autosomal biallelic SNPs with a minor allele frequency (MAF) of at least 1%, an info score above 0.8 and a call rate above 98% were retained, which left 5,508,534 SNPs. The present analysis included only individuals of self-reported European ancestry, to minimize the risk of confounding due to ancestry differences. A principal components analysis (PCA) was applied to characterize population structure.
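The post-imputation SNP filter described above amounts to a conjunction of simple per-variant conditions. The following sketch shows one way to express it; the `Variant` record and its field names are hypothetical and do not correspond to any particular file format, and only the thresholds (MAF of at least 1%, info score above 0.8, call rate above 98%, autosomal biallelic SNPs) are taken from the text.

```python
from dataclasses import dataclass

# Chromosome labels treated as autosomes (assumes labels "1".."22").
AUTOSOMES = {str(i) for i in range(1, 23)}


@dataclass
class Variant:
    """Hypothetical per-variant summary used only for this illustration."""
    rsid: str
    chrom: str        # e.g. "8" or "X"
    n_alleles: int    # 2 for a biallelic site
    is_snp: bool      # False for indels and other variant types
    maf: float        # minor allele frequency
    info: float       # imputation info score
    call_rate: float  # fraction of samples with a genotype call


def passes_qc(v, min_maf=0.01, min_info=0.8, min_call_rate=0.98):
    """Apply the post-imputation thresholds stated in the text."""
    return (
        v.chrom in AUTOSOMES
        and v.is_snp
        and v.n_alleles == 2
        and v.maf >= min_maf            # "at least 1%"
        and v.info > min_info           # "above 0.8"
        and v.call_rate > min_call_rate  # "above 98%"
    )


def filter_variants(variants):
    """Return only the variants that pass every filter."""
    return [v for v in variants if passes_qc(v)]
```

Keeping the thresholds as keyword arguments makes it easy to see how many variants each criterion removes when one of them is relaxed.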

Statistical analysis

PLINK 1.90 was used to conduct the genome-wide association analysis, adjusted for age, sex and the top 10 principal components [25]. The genome-based restricted maximum likelihood (GREML) method implemented in GCTA was used to estimate the percentage of variance explained by common SNPs and to calculate the genetic correlations [26,27]. LD Score Regression (LDSC) was used to calculate the genetic correlations between our results and publicly available GWAS [28]. FUMA and MAGMA were used to conduct the gene-based test and pathway enrichment analysis [29]. METAL was used for the meta-analysis of our results and published insomnia GWAS [30].

A standard genome-wide significance threshold of p < 5 × 10⁻⁸ was chosen for SNP identification, and r² = 0.6 was set as the cutoff to define LD blocks. All phenotyping analyses were conducted using R (version 3.3.3).

Results

We used diagnostic data from Partners Biobank to identify insomnia cases and controls. The study cohort comprised 21,310 patients of European ancestry, with 11,420 females (53.6%) and 9,890 males (46.4%). The mean age was 59.7 years (SD = 16.70). We generated an ICD-9/ICD-10 code list for the insomnia phenotype and applied it to a total of 15,750,104 patient visit records. The diagnosis definition for the cases included primary insomnia, insomnia due to medical conditions and insomnia due to psychiatric disorders. We removed patients with documented comorbid sleep disorder symptoms, including snoring, periodic limb movement, sleep-related leg cramps, sleep-related bruxism and hypersomnia. Using this code list, we obtained 3,135 case subjects. The control group consisted of 14,920 patients without any record of insomnia or other sleep disorder symptoms.

Using high-quality imputed SNPs, we conducted a genome-wide association analysis for the insomnia phenotype. At a p-value threshold of 5 × 10⁻⁸, one novel genomic risk locus was identified on chromosome 8p21.2 (Fig. 1a, Supplementary Fig. 1a; genomic inflation factor λ = 1.007). The leading SNP was rs17052966 (p = 4.53 × 10⁻⁹) (Table 1), located inside the gene region of the long non-coding RNA (lncRNA) CTD-2168K21.1. Using the FUMA (Functional Mapping and Annotation) and MAGMA (Multi-marker Analysis of GenoMic Annotation) pipeline, eight protein-coding genes were identified within a 10 kb distance window: LOXL2, ENTPD4, ADAMDEC1, ADAM7, NEFM, EBF2, BNIP3L and ADRA1A. Previous research has linked these genes to sleep-related disorders, psychiatric disorders and neurodegenerative disorders (Table 2) [31–38]. Twenty-seven other SNPs that reached the suggestive threshold (p < 5 × 10⁻⁶) were also identified (Table 1). Among them, multiple SNPs on chromosome 4 were close to the gene SORCS2, which functions as a receptor for the precursor form of neurotrophin [39].
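The genomic inflation factor λ quoted with the GWAS results is conventionally computed as the median observed 1-df chi-square statistic divided by the median expected under the null (about 0.455). The sketch below follows that standard definition using only the Python standard library; it is an illustration under that assumption, not the procedure used to obtain the reported value, and the function names are introduced here for the example.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def p_to_chi2(p):
    """Convert a two-sided p-value in (0, 1] to a 1-df chi-square statistic."""
    # The lower tail is used so that very small p-values stay representable.
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


def genomic_inflation(p_values):
    """Median observed chi-square divided by the null median."""
    return median(p_to_chi2(p) for p in p_values) / CHI2_1DF_MEDIAN
```

A value close to 1, such as the one reported here, suggests little residual inflation of test statistics from population stratification or cryptic relatedness.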

Figure 1

Manhattan plots for insomnia. (a) EHR-based phenotype. (b) Meta-analysis 1. (c) Meta-analysis 2.

Table 1 Summary of variants associated with insomnia.
Table 2 Clinical function annotation of mapped genes on chromosome 8.

We also attempted to replicate sleep disorder-associated variants reported in previous GWAS [13,40]. Among the reported significant SNPs, five (rs8180817, 7q31.1; rs7044885, 9q31.32; rs113851554, 2p14; rs12187443, 5q21.1; and rs701394, 5q14.1) showed p-values between 3.50 × 10⁻⁴ and 9.70 × 10⁻³ in our samples (Table 1). In addition, eight SNPs that showed suggestive significance in our study had marginal p-values in previous studies [13,14] (Supplementary Table 2).

GCTA was used to estimate the proportion of phenotypic variance explained by common SNPs. Common SNPs explained 7% (se = 0.02, p = 0.015) of the phenotypic variability, which is consistent with several previous GWAS of insomnia [15,41]. Using GCTA, we also calculated the genetic correlation between insomnia and three substance use disorder phenotypes, namely alcohol (3,594 cases, 12,205 controls), nicotine (4,896 cases, 12,205 controls) and opioid (1,039 cases, 12,205 controls) use disorders, which were extracted from the same study cohort using ICD codes (Supplementary Table 1). The strongest correlation was found between insomnia and alcohol use disorder (rG = 0.56, se = 0.14, p < 0.001), followed by nicotine use disorder (rG = 0.50, se = 0.12, p < 0.001) and opioid use disorder (rG = 0.43, se = 0.18, p = 0.02) (Table 3). Furthermore, we evaluated the genetic correlations between insomnia and a series of clinical conditions extracted from Partners Biobank using codified data. Among them, moderate correlations were observed between insomnia and anxiety (rG = 0.76, se = 0.38, p = 0.17) and between insomnia and type 2 diabetes (rG = 0.31, se = 0.14, p = 0.25) (Table 3). Limited by the sample size, neither correlation reached statistical significance.

Table 3 Genetic correlation between insomnia and other clinical conditions.

To gain statistical power and further validate our results, we obtained the summary statistics from two recent insomnia GWAS, which used data from UK Biobank and from the STARRS dataset [13,15]. We calculated the pairwise genetic correlations between the results from Partners Biobank and these two studies. We observed a moderate correlation between our results and the Jansen et al. 2019 study (rG = 0.68, se = 0.36, p = 0.18), while no significant correlation was found between Partners Biobank and Stein’s study (rG = 0.57, se = 1.28, p = 0.86). Lastly, a moderate correlation was observed between Jansen’s study and Stein’s study (rG = 0.35, se = 0.16, p = 0.07). We also checked the top two SNPs identified in Partners Biobank (rs17052966 and rs117915572) in both Stein’s and Jansen’s studies, but did not observe significant signals (STARRS: rs17052966: p = 0.95, beta = 0.002; rs117915572: p = 0.34, beta = 0.062; UKBB: rs17052966: p = 0.73, beta = 0.003; rs117915572: p = 0.83, beta = −0.002).

The meta-analysis was then conducted by combining our results from Partners Biobank with these two studies. Since the sample size of UK Biobank is substantially larger than those of our cohort and the STARRS cohort, which can lead to a meta-analysis result dominated by UK Biobank, we divided the meta-analysis into two steps: combining our results with the STARRS data alone (meta-1, N = 35,706) and combining all three studies (meta-2, N = 422,239) (Fig. 1b,c, Supplementary Fig. 1b,c, Supplementary Tables 3, 4). Two significant genomic loci were identified from meta-1, on chromosomes 7 and 9 (Supplementary Tables 5 and 7). The leading SNPs from the identified loci, rs147549871 (p = 9.10 × 10⁻⁹) and rs7855172 (p = 1.32 × 10⁻⁸), were the top SNPs of the original study using the STARRS dataset. In addition, the top SNPs rs17052966 and rs117915572 from the Partners Biobank GWAS showed suggestive significance in the meta-1 analysis (p = 2.87 × 10⁻⁵ and p = 9.61 × 10⁻⁶). In the meta-2 analysis, we identified 13 significant genomic loci with 31 independent significant SNPs, of which 11 loci were novel (Supplementary Tables 4 and 6). The top SNP, rs113851554 (p = 1.37 × 10⁻²¹), is on chromosome 2 and close to the MEIS1 gene. MEIS1 is a homeobox gene and plays an important role in neural crest development [42]. Multiple studies have shown its relationship with sleep disorders, as well as with restless legs syndrome (RLS) [13,14,19,43].
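To illustrate why a much larger cohort can dominate a combined result, the sketch below implements a sample-size-weighted Z-score meta-analysis, in which each study's weight is the square root of its sample size. The text does not state which weighting scheme was used in METAL, so this choice is an assumption made for illustration. The per-cohort sizes in the code are derived from the reported totals by subtraction (Partners Biobank: 3,135 cases + 14,920 controls; STARRS: meta-1 N minus Partners; UK Biobank: meta-2 N minus meta-1 N) and should likewise be read as assumptions.

```python
from math import sqrt
from statistics import NormalDist

# Cohort sizes derived from the totals reported in the text (assumed split).
N_PARTNERS = 3135 + 14920            # 18,055
N_STARRS = 35706 - N_PARTNERS        # 17,651
N_UKBB = 422239 - 35706              # 386,533


def meta_z(z_scores, sample_sizes):
    """Combine per-study Z-scores with weights w_i = sqrt(N_i)."""
    weights = [sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    return numerator / denominator


def two_sided_p(z):
    """Two-sided p-value for a standard normal Z-score."""
    return 2.0 * NormalDist().cdf(-abs(z))


def weight_shares(sample_sizes):
    """Share of the total squared weight per study, i.e. N_i / sum(N)."""
    total = sum(sample_sizes)
    return [n / total for n in sample_sizes]
```

Under these assumptions, UK Biobank carries roughly 92% of the squared weight in meta-2, so a signal present only in the two smaller cohorts is strongly attenuated in the combined statistic, which is the motivation for reporting meta-1 separately.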

In meta-analysis 2, using positional mapping, we also identified 118 related genes within a 10 kb region of significant SNPs. MAGMA tissue expression results suggested that the mapped genes were highly enriched for expression in central nervous system tissues (Supplementary Fig. 2). GWAS catalog analysis showed a series of previously reported sleep disorder genes, such as MEIS1, CUL9 and FOXP2 (Supplementary Table 8) [40,44,45].

Discussion

Insomnia is one of the most prevalent mental disorders worldwide, affecting 10–20% of the population. A strong genetic contribution to insomnia has been repeatedly reported from different data sources. In many of these studies, self-reported insomnia symptoms were used to identify cases from the general population, which could limit our understanding of the complexity of this disease.

The current study used electronic health records and genomic information from a large patient cohort to conduct a GWAS on a clinically defined insomnia phenotype. We discovered one novel genomic risk locus on chromosome 8. The leading SNP lies in a region that transcribes a long non-coding RNA, which has not previously been reported for insomnia. Differential expression of several lncRNAs has been shown to be associated with sleep deprivation [46]. In addition, among the eight genes mapped by our most significant SNP, seven have been shown to be related to neuronal functions and psychiatric disorders, suggesting the possible significance of the genomic region surrounding the discovered risk locus.

We also conducted a large-scale meta-analysis by combining our results with two recent insomnia GWAS using data from UK Biobank and STARRS. The top identified SNP, rs113851554 (p = 1.37 × 10⁻²¹), was among the top SNPs from the 2019 studies of Jansen et al. (p = 1.56 × 10⁻⁵¹) and Lane et al. (p = 9.76 × 10⁻³⁰) [13,14]. Since the UKBB sample size is substantially larger than the cohorts from Partners Biobank and STARRS, the result of the meta-analysis was mainly driven by UKBB samples, and the top SNPs from the Partners GWAS did not reach significance. However, we observed a moderate genetic correlation between the Partners Biobank study and Jansen’s report. Also, multiple significant SNPs we identified showed moderate significance in other GWAS, suggesting common components across them.

Substance use disorders, such as alcohol, nicotine and opioid use disorders, can also affect sleep patterns through various neurotransmitters and have been shown to be significantly genetically associated with insomnia [47]. We found a strong positive genetic correlation between insomnia and these major substance use disorders within the same study population, providing more evidence for the relationship between psychiatric disorders and insomnia. Sleep patterns and multiple other clinical conditions have also been shown to be closely connected. Studies have shown that sleep disorders affect more than 50% of adults with anxiety disorders [48]. Consistently, a moderate genetic correlation between insomnia and anxiety was observed in the current study. However, we did not observe the significant correlations with depression, type 2 diabetes (for which we observed a moderate, non-significant correlation) and schizophrenia that were previously reported [13,15]. Considering that the previous correlation studies mainly used summary statistics from UK Biobank, the different results we obtained could be caused by different definitions of these traits or by the smaller sample size in our study.

Because of the broad definitions of insomnia, the phenotypes targeted by genome-wide association analysis have varied significantly across studies, ranging from primary insomnia to measurements of sleep length, sleep quality and early morning awakening. This could be one of the reasons for the small number of significant SNPs identified for insomnia and the lack of consistent findings across studies. In this regard, electronic health records, which contain rich information about patient status and diagnoses, can serve as an important data source for disease phenotypes.

This study has several limitations. First, insomnia is a common clinical symptom associated with multiple psychiatric disorders, which makes it very challenging to accurately define clinical insomnia. For the same reason, the genetic architecture identified by genome-wide association studies can only reflect certain aspects of the complex insomnia phenotype. In this study, we used a simple ICD-code-based phenotype definition and did not attempt to stratify the sample into multiple insomnia sub-phenotypes for GWAS, owing to the limitations of our sample size and the accuracy of the phenotyping method. We are planning follow-up studies to address these questions with a larger sample size and other sources of phenotype information in the EHR, such as problem lists and clinical notes. Second, the study cohort is derived from a patient population, which could reflect a more severe stage of insomnia. This could be one of the reasons we did not replicate several known insomnia-related SNPs from previous studies. Third, the cohort we extracted from Partners Biobank has a relatively small sample size compared with UK Biobank, which caused a significantly imbalanced signal when conducting the meta-analysis.

In summary, we used clinical diagnosis information to identify insomnia cases among hospitalized patients. Our study cohort consists of patients with clinically defined insomnia and provides a novel reference for genetic studies of insomnia. Owing to the heterogeneous clinical stages and the complexity of EHR data mining methods, we utilized only diagnostic codes in the development of our cohort in the current study. Building on this exploration, the pipeline we developed will facilitate more comprehensive genetic studies based on clinical records.