Main

Copy number variation, defined as the gain or loss of genomic material >1 kb in size,1 has been the subject of intense research in both normal and disease populations over the last several years. These investigations were made possible by the completion of the Human Genome Project, which provided a detailed physical map and high-quality reference assembly of the human genome2 and enabled the development of whole-genome array technologies capable of accurate determination of copy number at very high resolution.

Copy number variants (CNVs) are common in normal individuals and have been identified in approximately 35% of the human genome.1 When present as hemizygous events in normal individuals, these imbalances are considered “benign” (i.e., no major phenotypic effect on human development); however, their role as susceptibility loci in common and complex genetic diseases and traits is now being actively explored. Data from control populations are being collected in databases of normal variation, including the Database of Genomic Variants1 and the Database of Genomic Structural Variation (dbVar) (http://www.ncbi.nlm.nih.gov/dbvar).3 These large datasets will contribute to a human gene dosage map by exclusion, defining those regions in which single-copy loss or gain is tolerated and does not produce an overtly abnormal phenotype.

CNVs have also been identified as one of the most common causes of human disease. In fact, one of the earliest and most significant clinical benefits of the Human Genome Project has been the application of whole-genome CNV analysis to evaluate individuals with developmental disabilities, including developmental delay (DD), intellectual disability (ID), autism, epilepsy, and/or birth defects, a group of disorders representing up to 14% of the population.4 Commonly referred to as cytogenetic or chromosomal microarrays (CMA), these technologies have quickly replaced the standard G-banded karyotype as the first-tier genetic test for the evaluation of this patient population.5,6 There are many technology platforms available for whole-genome copy number analysis at resolutions of 100–500 kb (compared with 5–10 Mb for karyotype), with even higher resolution at “clinical targets,” such as individual genes in which haploinsufficiency leads to dominant Mendelian disorders. From numerous published studies, the yield of clinically significant or pathogenic CNVs (pCNVs) by CMA is 15–20%, compared with a yield of approximately 3–5% by standard cytogenetic analysis in the same patient population.5

In an important subset of CMA cases, the potential functional significance of a particular CNV may be unknown and is referred to as a variant of uncertain clinical significance (VOUS). Parental and family studies can be helpful in the clinical interpretation of these cases, as a de novo occurrence of the CNV strengthens the evidence that it is pathogenic. However, the significance of many CNVs still remains uncertain even after familial studies due to variable expressivity or incomplete penetrance. Therefore, it would be extremely beneficial to improve our knowledge of the functional significance of CNVs throughout the genome by performing comparative analyses of large datasets from case cohorts and control populations to definitively associate specific genomic regions with human disease.

Herein, we describe genome-wide CNV results from the first dataset from the International Standards for Cytogenomic Arrays (ISCA) consortium5 (https://www.iscaconsortium.org/) that includes analysis of 15,749 cases and 10,118 controls. This study was designed to assess the frequency of CNVs in this population and initiate an evidence-based process to determine the functional significance of structural variation across the genome. Compared with individually rare CNVs, recurrent CNVs lend themselves to large case-control studies due to their relatively higher frequency. Therefore, we have focused our initial analysis on 14 recurrent CNV regions to statistically assess the correlation between rare CNVs and developmental disorders. Furthermore, ongoing analysis of the ISCA CNV dataset compared with normal structural variation will delineate genomic regions and individual genes that are subject to dosage effects resulting in intellectual and other developmental disabilities. Such efforts will result in a human gene dosage map for developmental disorders.

MATERIALS AND METHODS

Cases

This study adhered to guidelines set by the institutional review boards at the participating laboratories. CMA was performed in a subset of clinical ISCA laboratories on cases referred for diagnostic testing with various indications including unexplained DD, ID, dysmorphic features, multiple congenital anomalies, autism spectrum disorders (ASDs), or clinical features suggestive of a chromosomal syndrome. Anonymized data from 15,749 cases were included.

CNV detection

CMA was carried out following standard procedures. We used a consensus microarray design, focusing on unique genomic regions and avoiding repetitive sequences.7 The arrays were either 44K or 105K custom-designed 60-mer oligonucleotide arrays (Agilent Technologies, Santa Clara, CA) with a whole-genome backbone plus targeted, higher density coverage of known disease-causing regions.7 The backbone coverage included probes spaced approximately every 35–75 kb, allowing CNVs of approximately 250 kb and greater to be detected. All clinically relevant CNVs ≥ 500 kb in the backbone are reported in this study. The 500 kb threshold was used in the backbone regions because the ISCA laboratories consistently applied this size limit as their reporting criterion. For the targeted regions, we could identify imbalances of approximately 20–50 kb.

Arrays were scanned using a GenePix Autoloader 4200AL, GenePix 4000B (Molecular Devices, Sunnyvale, CA) or Agilent scanner (Agilent Technologies, Santa Clara, CA). Results were analyzed using Feature Extraction and DNA Analytics software packages (Agilent Technologies, Santa Clara, CA). Data include only those imbalances that contained at least four consecutive probes with abnormal log2 ratios. Data are presented as minimum coordinates (sequence positions of the first and last probes within the CNV) in the NCBI36 genome assembly.
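
The four-consecutive-probe reporting rule can be illustrated with a minimal sketch. This is not the Feature Extraction or DNA Analytics algorithm; the function name `call_segments` and the log2 ratio thresholds of ±0.25 are assumptions chosen only for illustration. Each call is reported with minimum coordinates, that is, the positions of its first and last abnormal probes.

```python
def call_segments(probes, min_probes=4, gain=0.25, loss=-0.25):
    """Return (first_probe_pos, last_probe_pos, state, n_probes) for each run
    of at least `min_probes` consecutive probes with abnormal log2 ratios.

    `probes` is a list of (position, log2_ratio) sorted by position on one
    chromosome. The thresholds are illustrative assumptions, not values
    taken from the study.
    """
    def state(ratio):
        if ratio >= gain:
            return "gain"
        if ratio <= loss:
            return "loss"
        return None

    calls, run, run_state = [], [], None
    for pos, ratio in probes:
        current = state(ratio)
        if current != run_state:
            # The run has ended: keep it only if it is long enough.
            if run_state is not None and len(run) >= min_probes:
                calls.append((run[0], run[-1], run_state, len(run)))
            run, run_state = [], current
        if current is not None:
            run.append(pos)
    # Flush a run that extends to the last probe.
    if run_state is not None and len(run) >= min_probes:
        calls.append((run[0], run[-1], run_state, len(run)))
    return calls


# Hypothetical probe data: a 4-probe loss is kept, a 3-probe gain is dropped.
example = [(100, 0.0), (200, -0.5), (300, -0.6), (400, -0.4), (500, -0.7),
           (600, 0.0), (700, 0.5), (800, 0.6), (900, 0.4), (1000, 0.0)]
print(call_segments(example))
```

Because the coordinates come from the outermost abnormal probes, the true breakpoints may lie anywhere between those probes and their normal neighbors.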

CNVs were categorized by clinical laboratories as pathogenic, VOUS, or benign based on known clinically relevant regions, gene content, and inheritance pattern as described previously.5,8 For both deletions and duplications, the genes located within the CNVs were assessed, as well as neighboring genes. Imbalances that involved large genomic segments from the chromosomal backbone coverage were considered to be likely pathogenic if they contained multiple known genes and did not overlap a confirmed benign CNV region. CNVs were classified as pathogenic if the CNV included an autosomal dominant gene known to cause a disease phenotype. The genomic regions associated with known pathogenic and benign CNVs are listed in Tables, Supplemental Digital Content 1, http://links.lww.com/GIM/A196 and were also deposited into dbVar (nstd45). Because the clinical laboratories that contributed data used different standards for reporting benign CNVs, an accurate assessment of the frequency of these benign CNVs was impossible for this dataset; therefore, benign CNVs identified in cases with otherwise normal array results were not included in this study.

Confirmation of abnormal array findings was carried out by fluorescence in situ hybridization (FISH), quantitative polymerase chain reaction, standard G-banded chromosome analysis, multiplex ligation-dependent probe amplification, or a second array analysis, depending on the size of the observed CNV. As the great majority of pathogenic changes were confirmed by an independent method, the genotypic data quality is extremely high, providing a large dataset with high fidelity. Parental studies by FISH, quantitative polymerase chain reaction, multiplex ligation-dependent probe amplification, or array analysis were conducted to determine the inheritance in a subset of cases where parental samples were referred for follow-up testing. To the best of our knowledge, results from testing of parental and siblings' samples were excluded from the final dataset if they showed the same genomic imbalance as the proband.

We developed an automated program to scan the data for inconsistencies in clinical interpretation for two or more reported genomic imbalances that overlapped in length by more than 50% but that were classified differently (as pathogenic, VOUS, or benign). This program flagged the genomic regions in which there was inconsistent annotation of CNVs, and these CNVs were subsequently reviewed and, where appropriate, assigned a single classification. For cases with complex rearrangements involving several CNVs, the interpretation was based on each individual CNV. The reported CNVs from this study are included in Table, Supplemental Digital Content 2, http://links.lww.com/GIM/A197 and were submitted to dbVar (nstd37). The number of genes was assessed by counting partial and whole genes included in the region based on the UCSC known gene track.
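
A consistency check of this kind might look like the following sketch. It is not the program used in the study. The `CNV` record, the function names, the use of reciprocal overlap (shared length divided by the length of the longer CNV), and the restriction to CNVs of the same type are all assumptions made for illustration.

```python
from itertools import combinations
from typing import NamedTuple


class CNV(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str    # "del" or "dup"
    label: str   # "pathogenic", "VOUS", or "benign"


def reciprocal_overlap(a, b):
    """Shared length divided by the length of the longer CNV (0.0 if none)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return shared / max(a.end - a.start, b.end - b.start)


def flag_inconsistent(cnvs, min_overlap=0.5):
    """Return pairs of same-type CNVs that overlap by more than `min_overlap`
    but carry different clinical classifications."""
    return [
        (a, b)
        for a, b in combinations(cnvs, 2)
        if a.kind == b.kind
        and a.label != b.label
        and reciprocal_overlap(a, b) > min_overlap
    ]


# Hypothetical records; coordinates are illustrative only.
records = [
    CNV("16", 29_500_000, 30_100_000, "del", "pathogenic"),
    CNV("16", 29_600_000, 30_200_000, "del", "VOUS"),
    CNV("16", 29_500_000, 30_100_000, "dup", "VOUS"),
    CNV("1", 1_000_000, 2_000_000, "del", "benign"),
]
for a, b in flag_inconsistent(records):
    print(a, b)
```

The pairwise comparison is quadratic in the number of CNVs, which is acceptable for a few thousand records; a larger dataset would call for sorting by position or an interval index.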

Statistical analysis

Our initial approach focuses on recurrent events because they are more common and lend themselves to case-control analysis; future studies will focus on nonrecurrent CNVs as sufficiently large case numbers become available. Recurrent rearrangements mediated by segmental duplications were identified by comparison with previously described hotspot regions.9 Imbalances were considered recurrent if they included the critical region of the deletion/duplication event and, based on probe coverage, were likely mediated by paired, flanking segmental duplications. We carried out statistical analysis of 14 selected regions (see Table 1 for chromosome coordinates): 1q21 thrombocytopenia-absent radius region,10,11 distal 1q21.1,12,13 3q29,14,15 5q35,16,17 7q11.23,18,19 8p23.1,20,21 15q11.2-q13,22–24 15q13,25,26 16p13.11,27,28 16p11.2,29–31 17p11.2,32,33 17q12,34–36 17q21.31,37–39 and 22q11.2.40,41 For the 1q21 regions, if the imbalance included both the 1q21 thrombocytopenia-absent radius region10 and the distal 1q21.1 region,12 the imbalance was included in the distal 1q21.1 frequency (ref. 12). In the 15q11-q13 region, imbalances that spanned BP2–BP5 (ref. 42) were counted in the BP2–BP3 frequency and not the BP4–BP5 frequency. Both the smaller and larger rearrangements (1.5 and 3.0 Mb) for 16p13.11 (ref. 28) and 22q11 (ref. 43) were included in their respective CNV categories. For this study, we excluded recurrent CNVs involving 17p12 (HNPP/CMT1A), as these CNVs either are not associated with cognitive defects or are late-onset in nature (and, therefore, not expected to be enriched in our mostly pediatric patient population), and 15q11 (BP1–2), which were not consistently reported by the contributing laboratories.

CNV data from 10,118 individuals from control populations were obtained from several recent reports.44–47 Processed CNV data were used directly from three of the previous control studies.44–46 For the data from the article by Shi et al.,47 we performed CNV analysis of the raw data for regions of interest using the Affymetrix Power Tools software (Affymetrix, Santa Clara, CA). Log2 ratio data were extracted and analyzed using the BEAST algorithm (Satten et al., submitted). All P values and odds ratios for case-control analyses were calculated using Fisher's exact test.
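
As an illustration of the case-control calculation, the sketch below computes a sample odds ratio and a two-sided Fisher's exact P value for a 2×2 table using only the Python standard library. The function names are hypothetical, and the two-sided convention used here (summing the probabilities of all tables no more likely than the observed one) is an assumption, because the text does not state which convention was applied. The example counts (67 carriers among 15,749 cases and 5 among 10,118 controls) are those given for the 16p11.2 deletion in the Results.

```python
from math import exp, inf, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the table [[a, b], [c, d]].

    Returns infinity when a cell in the denominator is zero, for example
    when no control carries the CNV.
    """
    return inf if b * c == 0 else (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]].

    a, b: cases with and without the CNV; c, d: controls with and without.
    Sums hypergeometric probabilities of all tables (with the same margins)
    that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_p(x):
        return (log_comb(row1, x) + log_comb(row2, col1 - x)
                - log_comb(total, col1))

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = log_p(a)
    # Small tolerance guards against floating-point ties.
    p = sum(exp(lp) for lp in map(log_p, range(lo, hi + 1))
            if lp <= observed + 1e-9)
    return min(1.0, p)


CASES, CONTROLS = 15_749, 10_118
carriers_cases, carriers_controls = 67, 5   # 16p11.2 deletion counts
table = (carriers_cases, CASES - carriers_cases,
         carriers_controls, CONTROLS - carriers_controls)
print(odds_ratio(*table), fisher_exact_two_sided(*table))
```

For these counts the sample odds ratio agrees with the reported value of 8.64. The P value should be of the same order as the reported one, although exact agreement depends on the two-sided convention and software used.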

Table 1 Frequencies of recurrent deletions

RESULTS

CNV characterization

We analyzed data from 15,749 whole-genome oligonucleotide arrays on individuals who presented for diagnostic array testing with abnormal clinical phenotypes including DD/ID, ASD, and/or multiple congenital anomalies. We detected 4628 imbalances consistent with our reporting criteria (defined in “Materials and Methods”) and classified 2691 (17.1%) as pathogenic (pCNVs), in line with prior reports of the yield from CMA in diagnostic testing.5 As a single individual may have had multiple pCNVs (e.g., unbalanced translocations), the diagnostic yield for this dataset was 14.7% (2321 cases with pCNV/15,749 total cases). Excluding 106 whole-chromosome aneuploidies, there were 2585 pCNVs with a mean size of approximately 6.5 Mb (median of 2.8 Mb) and a mean of approximately 69 genes per CNV (median of 44 genes). Deletions were more commonly interpreted as pathogenic than duplications, accounting for 67.9% of the imbalances.

In 9.3% of cases, an observed genomic imbalance was classified as a VOUS, as there was insufficient evidence to conclude the CNV was either pathogenic or benign. There were ultimately 1468 CNVs classified as VOUS, with a mean size of 765 kb (median of 569 kb) and a mean of approximately 10 genes per CNV (median of five genes). Duplications were more common than deletions, accounting for 68.8% of the imbalances.

The inheritance of a CNV was determined in a subset of cases to aid in the clinical interpretation and where both parental specimens were available. Of the 1412 CNVs with known inheritance, 566 (40%) were found to be de novo. The majority of the de novo events (513 CNVs, 91%) were classified as pathogenic, whereas 51 CNVs (9%) were classified as uncertain. Two de novo CNVs, interpreted to be benign, were incidentally identified in the course of parental studies to determine the inheritance of other CNVs classified as VOUS. The de novo benign CNVs included a duplication of the beta-defensin cluster on chromosome 8p and a duplication of the CHRNA7 (OMIM# 118511) gene on chromosome 15q; both of these CNVs have been observed as common polymorphisms in control populations.

Frequency of recurrent events

A subset of the imbalances identified by CMA includes recurrent imbalances that result from rearrangements between low-copy repeats, also known as segmental duplications. These rearrangements cause genomic disorders that have been recently reviewed.48 Sharp et al.9 described 130 rearrangement hotspots in the human genome by defining these regions as large genomic segments (50 kb–10 Mb) that are flanked by segmental duplications ≥ 10 kb in size and ≥ 95% identical. Of all CNVs detected in this case cohort, approximately 24% result from rearrangements between segmental duplications.
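
The hotspot definition quoted above translates into a simple filter over pairs of segmental duplications. The sketch below only encodes those three thresholds; it is not the pipeline of Sharp et al., and the `SDPair` record, the function name, and the example coordinates are hypothetical. It assumes both copies lie on the same chromosome, with copy 1 upstream of copy 2.

```python
from typing import NamedTuple


class SDPair(NamedTuple):
    chrom: str
    start1: int
    end1: int
    start2: int
    end2: int
    identity: float   # fraction of identical bases between the two copies


def is_hotspot_candidate(pair, min_sd=10_000, min_identity=0.95,
                         min_span=50_000, max_span=10_000_000):
    """True if a 50 kb-10 Mb segment is flanked by duplications that are
    each at least 10 kb long and at least 95% identical."""
    copies_long_enough = (pair.end1 - pair.start1 >= min_sd
                          and pair.end2 - pair.start2 >= min_sd)
    span = pair.start2 - pair.end1   # length of the flanked segment
    return (copies_long_enough
            and pair.identity >= min_identity
            and min_span <= span <= max_span)


# Hypothetical pairs; coordinates are illustrative only.
pairs = [
    SDPair("chrA", 1_000_000, 1_020_000, 2_520_000, 2_540_000, 0.98),
    SDPair("chrA", 1_000_000, 1_020_000, 2_520_000, 2_540_000, 0.93),
    SDPair("chrA", 1_000_000, 1_020_000, 21_020_000, 21_040_000, 0.99),
    SDPair("chrA", 1_000_000, 1_005_000, 2_520_000, 2_540_000, 0.99),
]
print([is_hotspot_candidate(p) for p in pairs])
```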

Tables 1 and 2 list the frequencies in the ISCA dataset for 14 CNV regions associated with recurrent deletions and duplications, respectively. It is important to note that many of the recognizable recurrent syndromes may still be tested for by targeted FISH studies, rather than CMA. As cases ascertained from FISH testing were not included in this study, the frequencies of such syndromes are likely underestimated.

Table 2 Frequencies of recurrent duplications

For the 14 recurrent regions, the numbers of deletions and duplications were often unequal, which can be explained by ascertainment (recurrent duplications may result in milder phenotypes and, therefore, not be ascertained in our cohort of affected individuals) and mechanism (deletions generated by non-allelic homologous recombination occur more frequently than duplications).49 Not surprisingly, the most common deletion in this cohort, with 93 cases (1 in 169 abnormal cases), was the 22q11.2 deletion (OMIM# 188400),40 whereas the reciprocal duplication (OMIM# 608363) with a milder phenotype41 was detected in only 32 cases. The most common recurrent duplication in our dataset was in 16p13.11, seen in 45 cases, whereas the reciprocal deletion associated with neurodevelopmental defects was detected in only 22 cases. For both deletions and duplications, the second most commonly affected region was the recurrent 16p11.2 CNV (OMIM# 611913). Both deletions and duplications of this region have been reported in individuals with an abnormal neurologic phenotype.30 The frequency of the 16p11.2 deletion in this abnormal cohort is approximately 1 in 235. Therefore, this CNV was detected nearly as often as the 22q11.2 deletion, indicating that this CNV is also a frequent cause of intellectual and developmental disabilities.

Frequency of nonrecurrent events

Of all CNVs detected in this case cohort, most (76%) were individually rare and not mediated by segmental duplications. This large group of CNVs provides a resource to examine regions of the genome that contain multiple CNVs with overlapping segments of deleted or duplicated material to define genotype-phenotype correlations. As an example, we highlight three recently described regions (2p15 deletion,50 16q24.3 deletion,51 and 17p13 duplication52) where overlapping de novo CNVs were characterized to define the associated phenotype and identify candidate genes. In the ISCA case cohort, we found four de novo deletions in 2p15 with a smallest region of overlap (SRO) of approximately 2.4 Mb, five de novo deletions in 16q24 with a SRO of approximately 450 kb, and four de novo duplications in 17p13 with a SRO of approximately 312 kb. As the ISCA database grows, cases such as these will prove invaluable for identifying disease-causing genes.
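
For a set of overlapping CNVs, the SRO is the interval shared by all of them: it starts at the largest start coordinate and ends at the smallest end coordinate. A minimal sketch follows; the function name and the example coordinates are hypothetical and do not correspond to the cases described here.

```python
def smallest_region_of_overlap(intervals):
    """Return the (start, end) interval common to every CNV, or None if
    the CNVs do not all overlap. Each interval is a (start, end) pair on
    the same chromosome."""
    if not intervals:
        return None
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start < end else None


# Four hypothetical deletions on one chromosome.
deletions = [(10_000, 50_000), (20_000, 60_000),
             (5_000, 45_000), (15_000, 70_000)]
print(smallest_region_of_overlap(deletions))
```

With array data, each CNV is bounded by its first and last abnormal probes, so an SRO computed this way is a minimum estimate of the shared region.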

Case-control analysis to define functional significance

The CNVs identified in this study of individuals with neurodevelopmental disorders are rare and highly heterogeneous, with no single CNV being identified in more than 1% of the cases. Therefore, methods are needed to begin to statistically assess the relationship between such rare variation and human disease. For this study, we first focused on deletions and duplications of 14 recurrent genomic regions as their relative frequency is higher than CNVs involving nonrecurrent regions. We selected 14 of the most common and clinically relevant recurrent CNVs (listed in “Materials and Methods”) for a formal case-control study to initiate an evidence-based process for defining the clinical significance of structural variation across the genome. Many of these 14 regions have inconclusive or contradictory data in the literature regarding their phenotypic implications, so a targeted analysis of these regions is needed to inform their functional significance.

Tables 3 and 4 list the results of these analyses for recurrent deletions and duplications, respectively. We compared the ISCA case cohort of 15,749 cases to 10,118 combined controls from several recent publications.44–47 These reports used microarrays with levels of resolution equivalent to or higher than the ISCA array design; thus, there should be no significant difference in sensitivity in the calls between the case and control datasets given that the 14 regions analyzed in this study were approximately 600 kb or greater. Although not all the controls used in these studies were formally assessed for neurocognitive abnormalities, these datasets have been used before as control populations in other studies. Itsara and colleagues45 previously performed a meta-analysis of segmental duplication mediated regions on 6860 abnormal individuals and 5674 control individuals.45 For regions in common with our study, the CNV P values from the previous study are included in Tables 3 and 4 for comparison.

Table 3 Case-control analysis of recurrent deletions
Table 4 Case-control analysis of recurrent duplications

All 14 recurrent deletions were significantly overrepresented in cases compared with controls (Table 3), demonstrating each is a pCNV. The 22q11.2 deletion was not seen in controls, confirming the pathogenic nature of this known disease-causing CNV (P = 9.15 × 10^−21). The 16p11.2 deletion was observed in 67 cases in the ISCA cohort, but only five 16p11.2 deletions were found among the control population, providing strong evidence for the pathogenic nature of this CNV (OR = 8.64; P = 6.34 × 10^−10).

Other recurrent deletions detected with a high frequency in the abnormal cohort include those in 1q21.1 (OMIM# 612474; OR = 11.82; P = 5.38 × 10^−9), 15q13 (OMIM# 612001; OR = ∞; P = 1.44 × 10^−10), and 15q11-q13 (breakpoint [BP] 1/2–3 of the Prader-Willi [OMIM# 176270]/Angelman [OMIM# 105830] syndromes region; OR = ∞; P = 2.77 × 10^−9). We also identified 18 deletions involving the 17q12 region (OMIM# 137920); these deletions were initially reported to have no neurocognitive phenotype.34 More recent studies, however, have shown an association between 17q12 deletions and DDs35 and autism/schizophrenia.36 The absence of the 17q12 deletion in 10,118 controls is strong evidence for classifying this deletion as pathogenic (P = 0.00015).

We also analyzed the reciprocal duplications of the 14 recurrent deletion CNVs (Table 4). Determining the functional significance for duplications can be more challenging due to the more subtle and milder phenotypes associated with an increase in gene dosage compared with the more severe phenotypic effects of haploinsufficiency. The initial classifications for these CNVs ranged from VOUS to pathogenic events.

For six duplications initially classified as pathogenic (in 1q21.1 [OMIM# 612475], 7q11.23 [OMIM# 609757], 15q11.2-q13 [OMIM# 608636], 17p11.2 [OMIM# 610883], 17q12, and 22q11.2), the case-control analysis corroborated this classification (Table 4). The 16p11.2 duplication was initially classified as a VOUS; however, our case-control analysis demonstrates that this duplication is most likely pathogenic (OR = 6.28; P = 2.5 × 10^−5).

Several recurrent CNV regions have had equivocal reports in the literature. For example, duplications of 16p13.11 have been previously suggested to be linked with autism,27 whereas another study proposed that the duplications may be a benign CNV.28 Because of the uncertainty in the literature, duplications in three regions (16p13.11, 15q13 BP4–5, and proximal 1q21) were initially classified as VOUS. As these duplications were not significantly enriched in the ISCA case cohort or in controls, the classification of these CNVs remains uncertain at this time using the formal case-control assessment.

Duplications of 3q29,15 8p23.1,21 and 5q3517 have been previously reported in individuals with abnormal phenotypes. In this case-control analysis, these events were identified more often in cases than in controls. However, because of the low frequency of these duplications in the clinically affected population, the differences were not statistically significant. Therefore, as a conservative approach, we would classify these three CNVs as uncertain until larger sample sizes are available. More detailed phenotypic investigations of individuals carrying duplications of 3q29, 8p23.1, and 5q35 in the ISCA cohort and other patient cohorts will help to clarify whether the observed phenotypes are consistent with the previously reported syndromes associated with these duplications.

DISCUSSION

There are now many published reports of the significant role of rare, de novo CNVs with major phenotypic effects in various human disease populations, including intellectual disabilities, ASDs, epilepsy, and schizophrenia, among others. Many of these studies are based on well-phenotyped research cohorts that were originally collected and characterized to optimize the ability to detect small effects in genome-wide association studies. Although positive associations have been identified for a few common diseases through these efforts, a surprising and remarkable finding has been the identification of rare, de novo CNVs with major phenotypic effects, particularly in neurocognitive and behavioral disorders. Because these events are rare, obtaining adequate evidence for their functional role in disease causation requires very large sample sizes and large control populations.

An alternative model for assessing the contribution of CNVs to disease, which has been used particularly in the study of children with unexplained developmental disabilities and congenital anomalies, has been the reporting of case series from clinical laboratory testing. Most of these published studies have represented CNV data from single laboratories and were based on previous generation targeted array analysis using bacterial artificial chromosome genomic clones.5 Compared with analysis of research cohorts of well-phenotyped patients, the amount and quality of phenotypic data associated with clinical laboratory referrals is often quite limited.

For this study, we have combined these two approaches by exploiting a large CNV dataset derived from a consortium of clinical laboratories to explore the frequency and functional significance of rare CNVs. Our analysis of the first 15,749 ISCA cases, one of the largest CNV studies to date, has confirmed the power of this approach. We have defined the frequency (17.1%) of pCNVs in a cohort of individuals with intellectual and developmental disabilities and performed formal case-control studies of selected recurrent genomic regions whose frequency was sufficient for statistical analysis.

The determination of whether a CNV contributes to an abnormal phenotype depends on many factors, including gene content, previous evidence of pCNVs in the region, type of CNV (deletion or duplication), inheritance pattern, and frequency in unaffected populations. As such, larger CNVs may be more likely to be classified as pathogenic as they have a higher chance of including a dosage-sensitive gene and/or they include a larger number of genes that cumulatively result in an abnormal phenotype. Our experience, as well as that of other groups,53 has shown that the classification of a previously unreported CNV not associated with known disease genes can vary. To address such discrepancies, we used case-control statistical evidence for 14 selected recurrent CNV regions to objectively determine their significance.

We analyzed deletions and duplications of each region separately, resulting in 28 total recurrent CNV regions. Using this approach, we demonstrated and confirmed the pathogenic nature of 20 recurrent regions. For the 16p11.2 duplications that had previously been reported as uncertain in the literature, we were able to reclassify this CNV region as pathogenic. Overall, we conclude that 21 of the 28 recurrent CNVs examined should be considered pathogenic and provide a clinical diagnosis for any individual harboring a CNV of these regions.

The statistical approach we used to classify recurrent CNVs and the results we obtained are useful tools for researchers and the clinical community in interpreting whether a CNV has pathologic effects. However, although such statistical analysis is possible for recurrent CNVs, where the frequency is high, this strategy is more difficult for the remaining approximately 75% of CNVs, which are not mediated by segmental duplications and are individually very rare. Therefore, other approaches need to be explored to address this class of CNVs. One possibility for these highly heterogeneous CNVs is to analyze all genomic intervals of a defined size (e.g., 500 kb or 1 Mb) or to use a “sliding-window” analysis to examine overlapping genomic intervals along the length of each chromosome. By comparing structural variation observed in cases to controls, disease-causing regions can be differentiated from those associated with normal variation by using the control data to define regions of the genome where dosage changes can be tolerated without overt phenotypic effects. As nonrecurrent CNVs are very rare events, the collection of data from hundreds of thousands of cases will be needed for this type of analysis to be successful. Continued efforts of the ISCA consortium, as well as other databases such as DECIPHER (https://decipher.sanger.ac.uk/), will be essential to this process to obtain enough overlapping CNVs to provide the power needed for statistical analyses.
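
The sliding-window idea can be sketched as follows. This is an illustration of the proposal rather than an analysis performed in this study. The function name, the 250 kb step, and the input layout (one chromosome, CNVs given as `(sample_id, start, end)` tuples) are assumptions. Each window reports how many distinct cases and controls carry a CNV overlapping it; those counts could then be passed to a case-control test such as Fisher's exact test.

```python
def sliding_window_counts(case_cnvs, control_cnvs, chrom_length,
                          window=500_000, step=250_000):
    """Count distinct case and control samples with a CNV overlapping each
    window along one chromosome.

    CNVs are (sample_id, start, end) tuples. Returns a list of
    (window_start, window_end, n_cases, n_controls).
    """
    def carriers(cnvs, w_start, w_end):
        # A CNV overlaps the window if it starts before the window ends
        # and ends after the window starts.
        return len({sid for sid, s, e in cnvs if s < w_end and e > w_start})

    results = []
    w_start = 0
    while w_start < chrom_length:
        w_end = min(w_start + window, chrom_length)
        results.append((w_start, w_end,
                        carriers(case_cnvs, w_start, w_end),
                        carriers(control_cnvs, w_start, w_end)))
        w_start += step
    return results


# Hypothetical data on a 1 Mb toy chromosome.
cases = [("c1", 100_000, 300_000), ("c2", 200_000, 600_000)]
controls = [("k1", 800_000, 900_000)]
for row in sliding_window_counts(cases, controls, chrom_length=1_000_000):
    print(row)
```

Because overlapping windows are not independent and many windows are tested, any P values derived from such counts would need correction for multiple testing.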

The ISCA consortium is continuing to grow and now includes more than 150 clinical laboratories from across the world. Given the rapid increase in utilization of this testing on a routine clinical basis, and the ability to recruit an expanding number of collaborating labs contributing data to a central database, the size of this cohort will continue to grow rapidly, providing a highly cost-effective way to obtain very large CNV datasets. In addition, as these data will be publicly available through two NCBI resources, the database of Genotypes and Phenotypes (dbGaP) and dbVar, this resource can be readily accessed by researchers and the clinical community. Having large datasets from individuals with abnormal phenotypes will foster more objective formal scientific analyses to predict which CNVs will impact human development. Such efforts will make it possible to develop a whole-genome dosage map in humans to determine which genes and regions are subject to haploinsufficiency or triplosensitivity compared with those that are tolerant of dosage changes.