Skip to main content

Thank you for visiting You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Rare copy number variants in over 100,000 European ancestry subjects reveal multiple disease associations


Copy number variants (CNVs) are suggested to have a widespread impact on the human genome and phenotypes. To understand the role of CNVs across human diseases, we examine the CNV genomic landscape of 100,028 unrelated individuals of European ancestry, using SNP and CGH array datasets. We observe an average CNV burden of ~650 kb, identifying a total of 11,314 deletion, 5625 duplication, and 2746 homozygous deletion CNV regions (CNVRs). In all, 13.7% are unreported, 58.6% overlap with at least one gene, and 32.8% interrupt coding exons. These CNVRs are significantly more likely to overlap OMIM genes (2.94-fold), GWAS loci (1.52-fold), and non-coding RNAs (1.44-fold), compared with random distribution (P < 1 × 10−3). We uncover CNV associations with four major disease categories, including autoimmune, cardio-metabolic, oncologic, and neurological/psychiatric diseases, and identify several drug-repurposing opportunities. Our results demonstrate robust frequency definition for large-scale rare variant association studies, identify CNVs associated with major disease categories, and illustrate the pleiotropic impact of CNVs in human disease.


Copy number variants (CNVs) are losses or gains of genomic segments. Although CNVs are commonly observed in healthy individuals, they are known to have gene dosage-sensitive effects on specific phenotypes. Prior CNV studies suggest that CNVs have a widespread impact on the human genome, as they are often associated with biological functions impacting disease susceptibility. However, most existing studies are based on limited sample sizes and lack adequate depth of rare CNV coverage. Historically, rare large CNVs are known to be associated with certain rare disease phenotypes. More recently, many previously thought to be rare  CNVs were found to be common across populations with allelic properties similar to single-nucleotide polymorphism (SNP) genotypes. The frequency and distribution of CNVs across the human genome have been examined by a number of recent studies using a variety of oligonucleotide arrays, CNV calling algorithms, and analytical methods, in an effort to understand the role of CNVs in human health and disease1,2,3,4,5,6,7,8,9,10,11,12. As CNVs impacting gene products are often deleterious, most such CNVs are rare genomic events that undergo negative selection. Consequently, large-scale sample sizes are required for sufficiently powered studies.

This study examines the CNV burden in over 100k subjects of European ancestry. We highlight the importance of rare, recurrent CNVs on the functional human genome and show that they are associated with on common, complex human diseases, including unreported therapeutic opportunities. Copy number variation regions (CNVRs) are significantly more likely to overlap OMIM genes (2.94-fold), genome-wide association studies (GWAS) loci (1.52-fold), and non-coding RNAs (ncRNAs; 1.44-fold), compared with random distribution (P < 1 × 10−3). We uncover strong CNV associations with four major disease categories, including autoimmune, cardio-metabolic, oncologic, and neurological/psychiatric diseases, several of which impact genes that represent potential drug targets for future validation.


Amassing and curating CNV calls

To examine the distribution of CNVs across the human genome, we genotyped 100,028 individuals from populations of European ancestry using results from either genome-wide SNP arrays (Illumina and Affymetrix) or array comparative genomic hybridization (aCGH) platforms (SignatureChip and Agilent). For the SNP array platforms (N = 52,321), CNVs were called using both the genotype B allele frequencies and intensity log R ratios (LRRs) calculated from 520,017 overlapping SNPs; LRRs were used for CNV calls on the aCGH platforms (N = 48,707), see Supplementary Methods. For quality control and association testing, we used ParseCNV, a robust CNV pipeline that has been validated for CNV calling based on extensive experimental experience13,14,15,16,17. To avoid study bias in the interpretation of the frequencies of rare recurrent CNVs across the 100,028 subjects, individuals with known multi-systemic syndromic disorders with established causality attributable to CNVs were excluded from the analysis (i.e., 22Q deletion and duplication syndromes, del15q11-13, del16p11.2, and trisomy 21. Incidentally, apart from trisomy 21, through our genetic analysis, we identified genetic lesions associated with other syndromic conditions from subjects who were previously undiagnosed).

We identified a total of 11,314 deletion, 5625 duplication, and 2746 homozygous-deletion CNVRs, defined as a contiguous cluster of non-singleton SNPs (occurring with a frequency of >0.03% or ~30 samples) that spans <1 Mb (Fig. 1a, Supplementary Fig 1, and Supplementary Data 13). We observed a mean CNV inheritance rate of 94% in a subset of several thousand family trios examined, with biological replicate concordance rate of 98%, confirming the reliability and specificity of the reported CNV calls (see Supplementary Methods). Furthermore, experimental validation by quantitative PCR (qPCR)18 was observed in >95% of the 2127 randomly selected, genotyped samples. We also examined CNVR distributions by genotyping platforms and CNVR size ranges to exclude the presence of technical bias in genotyping, sample processing, or CNV calling. As a further method of quality control, we showed that CNVRs observed across this study population recapitulated a number of CNVs previously reported by Conrad et al.19 (Supplementary Table 1).

Fig. 1: Genome-wide frequency and distribution of CNVs.

a “Manhattan plot” showing the distribution of identified CNVs across the genome with genomic positions across the x-axis, including deletions (left), duplications (center), and homozygous deletions (bottom) with the y-axis showing observed frequency of each CNV in the total study population. b Histogram of duplication, deletion, and homozygous deletions frequencies for all identified CNVs as a function of total CNV burden per individual (in bps on a log10 scale). c Histogram of duplication, deletion, and homozygous deletions frequencies as a function of total CNV burden (in bps on a log10 scale) per individual. d Density plot showing each individual’s deletion burden (in bps on a log10 scale) plotted against that same individual’s duplication burden (in bps on a log10 scale). Individuals with the same event burden (bps affected) are binned into hexagon and the color correlates with the total count of individuals in each bin, reflecting that most individuals who have more duplications also have more deletions; however, there are some exceptions in patients with relatively large burdens of deletions or duplications without comparable burden of the converse.

A vast majority of CNVRs (11,287 deletions [99.75%] and 5614 [99.96%] duplications) were recurrent, or present in at least 2 individuals. Although individual CNVRs are rare (>98.5% had a frequency < 0.01), consistent with previous reports20, the presence of CNVs in the human genome is collectively common. Indeed, the average genome contained a total CNV burden of ~655 kb, of which 370 kb is accounted for by duplications, and 285 kb by deletions, and 9.5 kb of which are accounted for by homozygous deletions (Fig. 1b and Supplementary Fig. 2). Although CNVRs were heterogeneously distributed across the genome, no large stretch of the genome was exempt from harboring CNVs (Fig. 1a) and nearly all CNVRs (99.5%) harbor both deletions and duplications without a bias in mean deletion vs. duplication frequency (0.60% vs. 0.50%, respectively). Importantly, 13.7% of all CNVRs observed are unreported, defined as overlapping <50% with any CNVR reported in the Database of Genomic Variation (DGV).

CNV impact on health and disease

Among the most clinically important CNVRs are homozygous deletions (hdCNVRs), as they are most likely to be pathologic. As expected, a lower percentage of the hdCNVRs are recurrent (2076 or 75.6%). Among the hdCNVRs identified, 375 are unreported, 44.3% of which were private. Leveraging the power of the large cohort, we also examined the data for evidence of homozygous or hemizygous deletions that incur embryonic lethality. Among the 2021 deletion CNVRs with a population frequency of at least 1.25%, we identified 62 loci at which no hdCNVR was identified, consistent with the possibility of embryonic lethality or early death (P < 0.05; Supplementary Table 2).

To evaluate the biological and functional impact of CNVRs on human health and disease, we used repeated permutations to determine whether the overlap between CNVR regions and functional regions (RefSeq genes, OMIM morbid loci, ultra-conserved elements, CpG islands, conserved transcription factor-binding sites, ncRNAs, exons, conserved transcriptional factor-binding sites and genome-wide significant GWAS hits) was greater than that expected at random (see online methods; Fig. 2 and Supplementary Table 3). We identified significant annotations for both OMIM morbid genes (enrichment ratio, ER = 2.94), genome-wide significant GWAS loci (ER = 1.52), as well as recombination hotspots (ER = 1.32), all P< 1 × 10−4. Together, these findings underscore the important impact of natural selection in driving human genetic diversity and CNV distribution, as well as the importance of CNVs in phenotypic variation and human disease.

Fig. 2: Functional enrichment analysis.

a Relative enrichment ratios for mapping of a CNV to a locus annotated as each of the given categories. ER was calculated based on 10,000 repeated simulations for each category/CNVR combo (see Methods). Regions of the genome bearing CNVs were significantly more likely to map to loci that are gene-bearing (Genic regions), exonic, or conserved (PHAST or PhyloP), and are enriched for functional loci, including miRNA targets (RNAi), CpG islands (CpG), microsatellites, recombination hotspots (Recombination), transcription factor-binding sites (TF BD), and were more likely to map to regions previous CNVs have been reported in the Database for Genomic Variants (DGV). ER enrichment ratio; DUP duplication; DEL deletion; HD homozygous deletion. The enrichment results are reported separately for “ALL” CNVRs and CNVRs optimized for Dup, Del, or HD CNVs. b Contribution of CNVRs with different transcriptional impact as annotated by Ensemble Variant Effect Predictor for the nearest gene, respectively, for all CNVRs, deletion CNVRs, duplication CNVRs, disease-associated CNVRs, homozygous deletion CNVRs, and all CNVs identified.

To identify potential associations between recurrent CNVs and human disease, we clustered individuals with disease phenotypes into four major categories (Fig. 3a) and healthy controls: (1) autoimmune/inflammatory diseases (n = 11,489), (2) cancers (n = 9105), (3) cardiovascular and metabolic diseases (n = 2581), and (4) a combination of psychiatric, neurodevelopmental, and neurological diseases (n = 43,841). At the expense of reduced sensitivity, this approach enabled us to identify loci that have broad disease implications in spite of the low frequency of most CNVRs, which makes a phenome-wide approach impractical21.

Fig. 3: Genomic landscape of disease-associated CNVs.

a Subjects (n) that are enrolled in the study based on disease category. Patients are classified into one of the four major disease categories or healthy controls. b Distribution of disease-associated deletions (top) and duplications (bottom) of CNVRs by length. Disease categories are color-coded. CNVRs associated with disease categories are on average larger than CNVRs not associated with disease categories (black-colored line), but do not significantly differ from one another. c Circos plot illustrating the distribution of CNVRs identified in the context of genomic elements. For all other layers from INNERMOST to OUTERMOST, the tracks show: sno/miRNAs (1), miRNA targets (2), conserved (PHAST) sites (3), frequency of duplication CNVRs (4), frequency of deletion CNVRs (5), recombination rates (r2) (6), expression of major EnCODE cell lines (7) corresponding to the candidate genes impacted by the respective DA-CNVRs (8). The innermost linkages reflect genes encoding the respective protein for all pairwise protein–protein interactions affected DA-CNVRs.

Consistent with expectation, disease-associated CNVRs (DA-CNVRs) are larger than average (Fig. 3b) and 94% (ER = 2.4) overlapped CNV-bearing regions previously reported in the DGV and 18% (ER = 1.2) overlapped GWAS loci (Fig. 2a), both significantly enriched as compared with chance (P < 0.001)22,23. We identified candidate genes in DA-CNVRs in a number of well-established disease-associated loci, including chr2p24.3 (MYCN amplification in cancer)24, chr22q11.21 (COMT and TBX1 deletion in neuropsychiatric disease and congenital heart conditions)25,26,27, and chr17q21.1 (NR1D1, deletion and duplication associated with response to lithium in bipolar disease and major depressive disorder)28,29. In addition, we unveiled multiple, putatively unreported DA-CNVRs that map to relevant candidate genes (Table 1). Among those are several well-established drug targets and others in development, including but not limited to TNFAIP8, HSPA9, SLIT3, HCN2, GRK6, ITGB8, ADK, CD44, NR1D1, and SLC38A1030,31,32,33,34,35,36,37,38,39,40,41,42,43. We performed similar enrichment analyses across each set of the domain-specific DA-CNVRs, showing that a number of the functional and genomic elements were enrichments in multiple disease domains (Supplementary Fig. 5).

Table 1 Select loci enriched with CNVs in four major disease categories.

Finally, we also identified multiple disease-associated hdCNVRs (DA-hdCNVRs), including those that map to unreported disease-association loci, as well as those that map to known genes reported by GWAS or other association studies (Table 2). On average, DA-hdCNVRs are smaller in size than the DA-CNVRs, which may be a consequence of the deleteriousness of large hdCNVR events in the genome (Supplementary Fig. 5).

Table 2 Homozygous deletion CNVRs associated with a major disease category.


Our analysis presents a dense map of CNVs across the human genome (Fig. 1 and Supplementary Data 13) and refines the frequencies of rare recurrent CNVs across the genome. Although individual CNVs are rare, most are recurrent and, collectively, CNVs represent an important and not infrequent source of genetic variation in the human genome. Although the importance of CNVs in rare genetic syndromes, congenital diseases, and in the cancer genome is well-elucidated, the role of germline CNVs in common human diseases has been thus far largely understudied. The salient overlap between CNVRs and GWAS signals observed in our study suggests that rare and uncommon CNVs may significantly contribute to common polygenic diseases (Fig. 2). Although some association studies have attempted to leverage GWAS to capture common CNVs, in the form of TagSNPs, our findings suggest that common polymorphisms do not effectively capture rare CNVs (Supplementary Data 4). This may  contribute to the missing heritability observed when comparing heritability calculated from GWAS findings with the expected heritability obtained from familial/twin studies. More effort is needed to evaluate the degree to which rare CNVs may contribute to the observed missing heritability using larger sample sizes and appropriate genomic platforms.

The role of CNVs and structural rearrangements as a driving force in human evolution and genome variation is also evident in the overlap of CNVRs with recombination hotspots. Recombination rates were higher for smaller and less common CNVRs and, in keeping with prior reports, a third (35.3%) of the CNVRs overlapped a recombination hotspot. In addition, we observed significant overlap between CNVRs and loci bearing segmental duplications and microsatellite sites (P < 1 × 10−4; Fig. 2a and Supplementary Figs. 3 and 4), likely related to increased rates of DNA replication defects at these regions. As polymorphic and repetitive loci are often neglected in disease-association studies due to technical challenges, it is especially important to further evaluate the phenotypic contributions of CNVs mapping to these loci.

We report herein a number of CNVRs, including a number of unreported CNVRs, which are associated with common human diseases. In addition to providing evidence of the impact of CNVRs in common diseases, these genes offer important avenues for therapeutic intervention. One notable example is a chr7p15.3 deletion associated with autoimmune disease (P < 6.87 × 10−19). This 2.9 kb deletion overlaps the gene ITGB8, which encodes the cell-surface glycoprotein β8 integrinITGB8a is a  well-established drug target for ovarian cancer and its expression is critical for dendritic cell-mediated induction of regulatory T-cell repertoires44. Through extensive functional studies, Travis et al.45 and others have shown that the conditional loss of the transforming growth factor-β-activating integrin α-V/β8 on leukocytes causes severe inflammatory bowel disease and age-related autoimmunity in mice. Recent efforts show that the α-V/β8 receptor complex is a viable therapeutic target in fibro-inflammatory airway disease46.

Another candidate locus is chr19p13.3, which encodes HCN2, a hyperpolarization-activated, cyclic nucleotide-gated K+ channel47. CNVs at this locus were associated with neurological diseases (deletion P < 6.53 × 10−47 and duplication P < 1.75 × 10−11). A careful review of literature reveals that increased HCN2 expression and activity are associated with neuropathic hyperalgesia, and neuropathic pain is initiated by HCN2-driven action potential firing in NaV1.8-expressing nociceptors48. In addition, Dibbens et al.49 showed that a 9 bp exonic deletion in HCN2 was associated with pediatric-onset generalized epilepsy with febrile seizures, consistent with the enhanced neuronal excitability observed in vitro in a Xenopus oocyte model of the indel. Several other genes residing at the association loci reported include known drug targets with repurposing opportunities, such as GRK6, CD44, SLC38A10, and ADK, with potential implications across multiple different cancers and auto-inflammatory diseases.

Moreover, in this work, we identified a number of DA-hdCNVRs. A prime example is an hdCNVR at chr2q34 that is associated with autoimmunity and interrupts the coding region of ERBB4. This gene encodes a cell-surface receptor tyrosine kinase that is a key oncogene and targetable by multiple Food and Drug Administration-approved small-molecule inhibitors. Interestingly, germline mutations in ERBB4 have also recently been linked to amyotrophic lateral sclerosis and schizophrenia50,51, and the loss of ErbB4 expression is found in patients with relapsing–remitting multiple sclerosis52. Furthermore, the expression of this family of proteins in microglia promotes re-myelination of neurons in response to soluble isoform of neuregulin-1, also known as glial growth factor 253,54. Given the safety and efficacy of established small-molecule modulators of this gene, it would be an ideal candidate gene for multi-disease, drug-repositioning opportunities.

Collectively, our findings support  a biological model in which recurrent CNVRs play a role across multiple common human diseases due to pleiotropic functions and/or broad expression of the affected gene(s). In support of this hypothesis, we examined the expression patterns of candidate genes in autoimmune/inflammatory-associated CNVRs and showed that their expression is enriched in immune-specific tissues and cell types, as compared with other genes in the human transcriptome (Supplementary Fig. 6b). In addition, we show extensive literature support of candidate genes in cancer-associated CNVRs and the pleiotropic expression patterns of these genes across multiple tissues and at high levels in malignant cells from these sites (Supplementary Fig. 8 and Supplementary Data 7). Finally, for CNVRs associated with multiple disease categories, we show that the impacted candidate genes are significantly more likely to be found in shared biological networks and to impact gene network interactions (Supplementary Fig. 7).

In summary, we report the frequencies, distributions, and disease associations of rare recurrent CNVs using genome-wide SNP genotyping and CGH data from over 100,000 unrelated individuals of European ancestry, representing the largest population-based CNV analysis to-date. The strong correlation between CNVRs and known GWAS loci, as well as recombination hotspots, suggest that rare or uncommon CNVs may contribute to the missing heritability conundrum, and that CNV-bearing loci correlate with regions of the genome under increased selective pressures. We further support this by demonstrating that many candidate genes mapping to DA-CNVRs are broadly expressed across multiple tissues and show examples in which such candidate genes have known pleiotropic roles across related diseases. Finally, although GWAS and genome-wide CNV analyses have significantly advanced our understanding of the distribution and biological impact of CNVs, whole-genome sequencing studies are needed to identify and fine-map CNVs, particularly those that are extremely rare and/or under 1 kb55. Together, our observations underscore that CNVs—both common and rare—have important biological roles in human health and disease. Thus, more effort and larger studies elucidating these associations may be an avenue for understanding and diagnosing genetic diseases and targeted therapeutic interventions.


Sample cohorts

All samples in the primary analysis were derived from one of two cohorts. The first group included 52,321 samples obtained from de-identified samples associated with electronic medical records residing in the genomics biorepository at The Children’s Hospital of Philadelphia (CHOP). All samples were also genotyped at the Center for Applied Genomics (CAG) at CHOP. A second set of 29,085 samples from subjects with neurological disease and 19,584 matching controls were run on a CGH array at The University of Washington in Seattle, WA. Informed consent authorizing the use of de-identified GWAS data was obtained from all subjects. The Institutional Review Board at The Children’s Hospital of Philadelphia approved the study.

Over 95% of the DNA was extracted from fresh blood. Six incremental versions of the Illumina 550k SNP set were used across both sets of samples, with a total of 520,017 SNPs in common to all included chip versions. PennCNV56 was used for CNV calls and validated by QuantiSNP18. To ensure data quality and to minimize technical bias, only samples with a mean call rate >98% and LRR SD < 0.25 were included in the analysis. Furthermore, autosome genotype relatedness and intensity wave variations following GC content wave correction were assessed for sample exclusion. CNV sensitivity was assessed based on the identification of known CNVs in the reference HapMap individuals. CNV calls of different size ranges across the genome were validated by independent testing in 2127 samples using qPCR with an overall specificity exceeding 95% as shown by results from 367 successful validation assays including consistent data quality across 7 disease studies with a range of different genomic loci and CN states (Supplementary Fig. 9).

Quality filtering

To minimize false-positive CNV calls, a set of quality-filtering metrics were performed. Case and control matching was insured by calculating a genomic inflation factor between groups. Wave artifacts roughly correlating with GC content resulting from hybridization bias of low full-length DNA quantity are known to interfere with accurate detection of copy number variations. Only samples with GC-corrected wave factor of LRR < |0.02| were accepted. If the count of CNV calls made by PennCNV exceeded 100, it was suggestive of poor DNA quality and those samples were excluded. Thus, only samples with CNV call count <100 were included. Any duplicate samples (such as monozygotic twins or repeats on the same patient) were identified, and, as a result, one sample was excluded.

Statistical tests

CNV frequency was compared between cases and controls, as well as across genotyping cohorts. Comparisons were made for each SNP using Fisher’s exact test. To determine CNV enrichment, we only considered loci that were nominally significant between the comparative groups (P< 0.05). For case–control comparisons, we looked for recurrent CNVs that were observed across different independent cohorts or were not observed in any of the control subjects, and were validated with an independent method. Three lines of evidence established statistical significance: independent replication P< 0.05, permutation of observations, and absence of loci observed with control enriched significance. We used DAVID (Database for Annotation, Visualization, and Integrated Discovery)57 to assess the significance of functional annotation clustering of independently associated results into InterPro categories.

We have complied with all relevant ethical regulations for the work with human participants. Informed consent was obtained for all participants prior to including them in the biobank at the CAG at CHOP.

Data availability

CNV calls generated are available from the authors and are published with open access on the website ( All data authorized for dbGaP submission have been deposited to dbGaP (accessions: phs000490.v1.p1, phs000607.v3.p2, phs000371.v1.p1, phs000490.v1.p1, phs001194.v2.p2, phs001194.v2.p2.c1, phs001661, and phs000233).


  1. 1.

    Donahoe, P. K. et al. Detection of large-scale variation in the human genome. Nat. Genet. 36, 949 (2004).

    PubMed  Article  CAS  PubMed Central  Google Scholar 

  2. 2.

    Sebat, J. et al. Large-scale copy number polymorphism in the human genome. Science 305, 525–528 (2004).

    ADS  CAS  PubMed  Article  PubMed Central  Google Scholar 

  3. 3.

    Sharp, A. J. et al. Segmental duplications and copy-number variation in the human genome. Am. J. Hum. Genet. 77, 78–88 (2005).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  4. 4.

    Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat. Genet. 37, 727–732 (2005).

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  5. 5.

    MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012).

    ADS  CAS  PubMed  PubMed Central  Article  Google Scholar 

  6. 6.

    Hinds, D. A., Kloek, A. P., Jen, M., Chen, X. & Frazer, K. A. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat. Genet. 38, 82–85 (2006).

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  7. 7.

    McCarroll, S. A. et al. Common deletion polymorphisms in the human genome. Nat. Genet. 38, 86–92 (2006).

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  8. 8.

    Locke, D. P. et al. Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am. J. Hum. Genet. 79, 275–290 (2006).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  9. 9.

    Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–454 (2006).

    ADS  CAS  PubMed  PubMed Central  Article  Google Scholar 

  10. 10.

    Itsara, A. et al. Population analysis of large copy number variants and hotspots of human genetic disease. Am. J. Hum. Genet. 84, 148–161 (2009).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  11. 11.

    Shaikh, T. H. et al. High-resolution mapping and analysis of copy number variations in the human genome: a data resource for clinical and research applications. Genome Res. 19, 1682–1690 (2009).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  12. 12.

    Craddock, N. et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713–720 (2010).

    ADS  CAS  PubMed  Article  Google Scholar 

  13. 13.

    Glessner, J. T., Li, J. & Hakonarson, H. ParseCNV integrative copy number variation association software with quality tracking. Nucleic Acids Res. 41, e64 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  14. 14.

    Glessner, J. T. et al. Autism genome-wide copy number variation reveals ubiquitin and neuronal genes. Nature 459, 569–573 (2009).

    ADS  CAS  PubMed  PubMed Central  Article  Google Scholar 

  15. 15.

    Glessner, J. T. et al. Strong synaptic transmission impact by copy number variations in schizophrenia. Proc. Natl Acad. Sci. USA 107, 10584–10589 (2010).

    ADS  CAS  PubMed  Article  Google Scholar 

  16. 16.

    Glessner, J. T. et al. Duplication of the SLIT3 locus on 5q35.1 predisposes to major depressive disorder. PLoS ONE 5, e15463 (2010).

    ADS  CAS  PubMed  PubMed Central  Article  Google Scholar 

  17. 17.

    Glessner, J. T. et al. A genome-wide study reveals copy number variants exclusive to childhood obesity cases. Am. J. Hum. Genet. 87, 661–666 (2010).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  18. 18.

    Colella, S. et al. QuantiSNP: an objective Bayes Hidden-Markov model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 35, 2013–2025 (2007).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  19. 19.

    Conrad, D. F., Andrews, T. D., Carter, N. P., Hurles, M. E. & Pritchard, J. K. A high-resolution survey of deletion polymorphism in the human genome. Nat. Genet. 38, 75–81 (2006).

    CAS  PubMed  Article  Google Scholar 

  20. 20.

    Pinto, D., Marshall, C., Feuk, L. & Scherer, S. W. Copy-number variation in control population cohorts. Hum. Mol. Genet. 16, R168–R173 (2007).

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  21. 21.

    Denny, J. C. et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 26, 1205–1210 (2010).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  22. 22.

    Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014).

    CAS  PubMed  Article  Google Scholar 

  23. 23.

    NHGRI. Published GWAS through 08/01/2014. NHGRI GWA Catalog (NHGRI, 2014).

  24. 24.

    Maris, J. M. et al. Comprehensive analysis of chromosome 1p deletions in neuroblastoma. Med. Pediatr. Oncol. 36, 32–36 (2001).

    CAS  PubMed  Article  Google Scholar 

  25. 25.

    Shprintzen, R. J. Velo-cardio-facial syndrome: 30 years of study. Dev. Disabil. Res. Rev. 14, 3–10 (2008).

    PubMed  PubMed Central  Article  Google Scholar 

  26. 26.

    Chessa, M. et al. Relation of genotype 22q11 deletion to phenotype of pulmonary vessels in tetralogy of Fallot and pulmonary atresia-ventricular septal defect. Heart 79, 186–190 (1998).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  27. 27.

    McDonald-McGinn, D. M. & Sullivan, K. E. Chromosome 22q11.2 deletion syndrome (DiGeorge syndrome/velocardiofacial syndrome). Medicine (Baltim.) 90, 1–18 (2011).

    Article  Google Scholar 

  28. 28.

    Yin, L., Wang, J., Klein, P. S. & Lazar, M. A. Nuclear receptor Rev-erbalpha is a critical lithium-sensitive component of the circadian clock. Science 311, 1002–1005 (2006).

    ADS  CAS  PubMed  Article  Google Scholar 

  29. 29.

    Nováková, M., Praško, J., Látalová, K., Sládek, M. & Sumová, A. The circadian system of patients with bipolar disorder differs in episodes of mania and depression. Bipolar Disord. 17, 303–314 (2015).

    PubMed  Article  CAS  Google Scholar 

  30. 30.

    Zhang, L., Liu, R., Luan, Y.-Y. & Yao, Y.-M. Tumor nNecrosis factor-α induced protein 8: pathophysiology, clinical significance, and regulatory mechanism. Int. J. Biol. Sci. 14, 398–405 (2018).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  31. 31.

    Wang, W.-W., Wang, Y.-B., Wang, D.-Q., Lin, Z. & Sun, R.-J. Integrin beta-8 (ITGB8) silencing reverses gefitinib resistance of human hepatic cancer HepG2/G cell line. Int. J. Clin. Exp. Med. 8, 3063–3071 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  32. 32.

    Cui, Y. et al. miR-199a-3p enhances cisplatin sensitivity of ovarian cancer cells by targeting ITGB8. Oncol. Rep. 39, 1649–1657 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  33. 33.

    Zhou, Z., Li, Z., Shen, Y. & Chen, T. MicroRNA-138 directly targets TNFAIP8 and acts as a tumor suppressor in osteosarcoma. Exp. Ther. Med. 14, 3665–3673 (2017).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  34. 34.

    Lowe, J. M. et al. The novel p53 target TNFAIP8 variant 2 is increased in cancer and offsets p53-dependent tumor suppression. Cell Death Differ. 24, 181–191 (2017).

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  35. 35.

    Zhang, C. et al. The significance of TNFAIP8 in prostate cancer response to radiation and docetaxel and disease recurrence. Int. J. Cancer 133, 31–42 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  36. 36.

    Raghuram, S. et al. Identification of heme as the ligand for the orphan nuclear receptors REV-ERBα and REV-ERBβ. Nat. Struct. Mol. Biol. 14, 1207–1213 (2007).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  37. 37.

    Woldt, E. et al. Rev-erb-α modulates skeletal muscle oxidative capacity by regulating mitochondrial biogenesis and autophagy. Nat. Med. 19, 1039–1046 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  38. 38.

    Spinelli, V., Sartiani, L., Mugelli, A., Romanelli, M. N. & Cerbai, E. Hyperpolarization-activated cyclic-nucleotide-gated channels: pathophysiological, developmental, and pharmacological insights into their function in cellular excitability. Can. J. Physiol. Pharmacol. 96, 977–984 (2018).

  39. 39.

    Tsantoulas, C. et al. Hyperpolarization-activated cyclic nucleotide–gated 2 (HCN2) ion channels drive pain in mouse models of diabetic neuropathy. Sci. Transl. Med. 9, eaam6072 (2017).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  40. 40.

    DiFrancesco, J. C. & DiFrancesco, D. Dysfunctional HCN ion channels in neurological diseases. Front. Cell. Neurosci. 6, 71 (2015).

    Article  CAS  Google Scholar 

  41. 41.

    Misra, S. et al. Hyaluronan-CD44 interactions as potential targets for cancer therapy. FEBS J. 278, 1429–1443 (2011).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  42. 42.

    Sahin, I. H. & Klostergaard, J. CD44 as a drug delivery target in human cancers: where are we now? Expert. Opin. Ther. Targets 19, 1587–1591 (2015).

    CAS  PubMed  Article  Google Scholar 

  43. 43.

    Huang, W.-Y. et al. Nanoparticle targeting CD44-positive cancer cells for site-specific drug delivery in prostate cancer therapy. ACS Appl. Mater. Interfaces 8, 30722–30734 (2016).

    CAS  PubMed  Article  Google Scholar 

  44. 44.

    Moyle, M., Napier, M. A. & McLean, J. W. Cloning and expression of a divergent integrin subunit beta 8. J. Biol. Chem. 266, 19650–19658 (1991).

    CAS  PubMed  Google Scholar 

  45. 45.

    Travis, M. A. et al. Loss of integrin alpha(v)beta8 on dendritic cells causes autoimmunity and colitis in mice. Nature 449, 361–365 (2007).

    ADS  CAS  PubMed  PubMed Central  Article  Google Scholar 

  46. 46.

    Minagawa, S. et al. Selective targeting of TGF-β activation to treat fibroinflammatory airway disease. Sci. Transl. Med. 6, 241ra79 (2014).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  47. 47.

    Emery, E. C., Young, G. T., Berrocoso, E. M., Chen, L. & McNaughton, P. A. HCN2 ion channels play a central role in inflammatory and neuropathic pain. Science 333, 1462–1466 (2011).

    ADS  CAS  PubMed  Article  PubMed Central  Google Scholar 

  48. 48.

    Santoro, B. et al. Identification of a gene encoding a hyperpolarization-activated pacemaker channel of brain. Cell 93, 717–729 (1998).

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  49. 49.

    Dibbens, L. M. et al. Augmented currents of an HCN2 variant in patients with febrile seizure syndromes. Ann. Neurol. 67, 542–546 (2010).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  50. 50.

    Kalkman, H. O. Altered growth factor signaling pathways as the basis of aberrant stem cell maturation in schizophrenia. Pharmacol. Ther. 121, 115–122 (2009).

    CAS  PubMed  Article  Google Scholar 

  51. 51.

    Birchmeier, C. ErbB receptors and the development of the nervous system. Exp. Cell Res. 315, 611–618 (2009).

    CAS  PubMed  Article  Google Scholar 

  52. 52.

    Tynyakov-Samra, E., Auriel, E., Levy-Amir, Y. & Karni, A. Reduced ErbB4 expression in immune cells of patients with relapsing remitting. Mult. Scler. Mult. Scler. Int. 2011, 561262 (2011).

    PubMed  Google Scholar 

  53. 53.

    Marballi, K. et al. In vivo and in vitro genetic evidence of involvement of neuregulin 1 in immune system dysregulation. J. Mol. Med. (Berl.). 88, 1133–1141 (2010).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  54. 54.

    Flames, N. et al. Short- and long-range attraction of cortical GABAergic interneurons by neuregulin-1. Neuron 44, 251–261 (2004).

    CAS  PubMed  Article  Google Scholar 

  55. 55.

    Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65 (2011).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  56. 56.

    Wang, K. et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 17, 1665–1674 (2007).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  57. 57.

    Huang, D. W. et al. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 8, R183 (2007).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

Download references


We thank the patients and their families for their participation in the respective genotyping studies and in the Biobank Repository at the Center for Applied Genomics. Y.R.L. was supported by the Paul and Daisy Soros Fellowship for New Americans and the NIH F30 Individual NRSA Training Grant. This study was supported by Institutional Development Funds from The Children’s Hospital of Philadelphia (CHOP), by CHOP’s Endowed Chair in Genomic Research (Hakonarson), and by U01-HG006830 (NHGRI-sponsored eMERGE Network).

Author information




All authors contributed to reviewing and providing comments on the manuscript.

Corresponding author

Correspondence to Hakon Hakonarson.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Stephen Chanock and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Li, Y.R., Glessner, J.T., Coe, B.P. et al. Rare copy number variants in over 100,000 European ancestry subjects reveal multiple disease associations. Nat Commun 11, 255 (2020).

Download citation

Further reading


By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.


Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing