Introduction

Intellectual disability (ID) affects about 3% of individuals globally, and the cause is unknown in half of the cases. The widespread use of chromosomal microarray analysis (CMA) led to a new frontier in clinical diagnosis, with the ability to detect causative submicroscopic chromosomal imbalances, also called pathogenic copy-number variants (CNVs), in at least 10–15% of affected individuals in whom conventional cytogenetic analysis is normal.1, 2, 3

A few years ago, the number of genes recognized to contribute to ID was reported as 300.4, 5 Currently, the estimated number of ID genes is at least 900–950, based on the evidence that there are 91 pathogenic X-linked genes that account for 10–15% of ID in males.6

Although the power of next-generation sequencing has opened up the potential to screen all exons or the whole genome for sequence mutations, the analysis algorithms are not, as yet, robust for routine identification of losses or gains of sequence involving a single or a few exons. As the probe capacity on microarrays increases, there is the potential to screen a large number of genes at the exonic level using microarrays7, 8, 9, 10 providing the means to identify CNVs that are currently missed with whole-genome clinical arrays and next-generation sequencing.

We designed a microarray with a single-exon resolution and screened 1397 genes known or hypothesized to cause ID. We screened 165 trios composed of a child with idiopathic ID and both normal parents using this array. We detected and independently validated 36 CNVs in 32 families. Seventeen of these involve genes known to cause ID, of which at least 11, including 7 that are intragenic, are clearly pathogenic. Our results confirm the efficacy of our design and offer novel insights into the pathogenesis of ID.

Materials and methods

Subjects

Patients with ID, with or without additional clinical features, were selected for study. The cause of the ID in each child was unknown despite full evaluation by a clinical geneticist, a karyotype at ≥500 band resolution and subtelomeric FISH studies. Autism was diagnosed using the Autism Diagnostic Observation Schedule/Autism Diagnostic Interview (ADOS/ADI). This study was approved by the University of British Columbia Clinical Research Ethics Board and Sainte-Justine Hospital Ethics Board. Informed consent was obtained for each patient. Paternity and maternity were confirmed by using six highly informative unlinked microsatellite markers, as previously described.11

Custom array design

Our goal was to design a custom NimbleGen 12-plex array with 135 000 probes covering suspected or known genes involved in the development of ID at the time of design (April 2008), with a minimum of eight probes per exon in each of our selected genes. The NCBI human genome sequence build 36.1 was used as the reference sequence. Detailed array design is provided in Supplementary Notes S1, and the complete array and data discussed have been deposited into the NCBI Gene Expression Omnibus (GEO) repository and are publicly available (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39533). Briefly, the study comprised six main stages: (a) selection of genes and design of oligonucleotide probes, (b) testing of a pilot phase NimbleGen 385 K array with validation on control samples from patients with known CNVs, (c) selection of best performing probes and design of the final NimbleGen 135 K array, (d) CMA of 165 idiopathic ID trios (affected child and both unaffected parents) using the final NimbleGen 12-plex array (each subarray containing 135 K probes) and bioinformatic analysis to identify de novo CNVs, (e) validation of CMA results by quantitative PCR using SYBR green, and (f) genotype–phenotype correlation to determine pathogenic relationship of validated de novo CNVs to ID. The following genes were selected for inclusion on the array: (1) genes previously shown to cause ID (obtained from OMIM with keywords ‘mental retardation’ or ‘intellectual disability’; October 2007), (2) all genes within reported microdeletion/microduplications (<100 kb), reported in the DECIPHER database (October 2007), in which the phenotype included ID, (3) ID candidate genes within reported microdeletion/microduplications (>100 kb) reported in the DECIPHER database (October 2007), in which the phenotype included ID (if there were no reported candidate genes within these CNVs, a set of eight probes was placed every 10 kb within genes or highly conserved regions), (4) all brain-expressed glutamate receptors (GRC) and the majority of their known interacting proteins,12 and (5) genes involved in epigenetic regulation. A complete list of genes and regions covered by the array can be found in Supplementary Notes (ST1). The selection resulted in a total of 1397 RefSeq genes. The probes were selected according to a previously published protocol,13 and the arrays were synthesized by NimbleGen according to our custom design specification. Figure 1 shows the location of all probes and coverage obtained by our final 135 K array design.
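The design target of at least eight probes per exon can be checked mechanically once probe and exon coordinates are available. The following is a minimal sketch of such a check, not the procedure used to build the array: the function names, the coordinate tuples and the example data are hypothetical, and all coordinates are assumed to lie on a single chromosome.

```python
from bisect import bisect_left, bisect_right

MIN_PROBES_PER_EXON = 8  # design target stated in the text


def probes_in_interval(sorted_positions, start, end):
    """Count probe positions p with start <= p <= end (positions must be sorted)."""
    return bisect_right(sorted_positions, end) - bisect_left(sorted_positions, start)


def undercovered_exons(exons, probe_positions, min_probes=MIN_PROBES_PER_EXON):
    """Return names of exons covered by fewer than min_probes probes.

    exons: iterable of (name, start, end) tuples (hypothetical layout).
    probe_positions: iterable of integer probe coordinates on the same chromosome.
    """
    positions = sorted(probe_positions)
    return [
        name
        for name, start, end in exons
        if probes_in_interval(positions, start, end) < min_probes
    ]


# Hypothetical example: exon1 carries eight probes, exon2 only three.
example_exons = [("exon1", 1000, 1400), ("exon2", 5000, 5200)]
example_probes = list(range(1000, 1400, 50)) + [5000, 5100, 5200]
print(undercovered_exons(example_exons, example_probes))
```

Exons flagged by such a check would be candidates for additional probes or for exclusion from single-exon interpretation.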

Figure 1

Location of probes against a chromosome ideogram. The blue dots clustered around the black vertical line represent individual probes, with the vertical line representing the normal log2 ratio signal of 0. In this case the patient has a duplication of chromosome 9pter as seen by a shift to the left from the normal by the vertical cluster of probes at that region. In addition, the hybridization in this case is of a female child versus the father, as can be seen by a shift toward the left, signifying a gain, from the normal of all chromosome X probes and shift to the right from the normal, signifying a loss, of all chromosome Y probes.

Array hybridization and data analysis

The labeling, hybridization and washing of the array were performed according to the manufacturer’s specifications, and analysis was done with default settings using the NimbleScan software version 2.1.4 (Supplementary Notes S1). CNVs were identified when the log2 ratio was >0.2 (duplications) or <−0.2 (deletions) with a minimum of five probes affected, using the BioDiscovery Nexus Software. The minimally affected region within a CNV was considered to be between the location of the first and last probe with an abnormal log2 ratio (>0.2 or <−0.2). The maximally affected region for a CNV was defined as the region between the location of the normal probe (log2 ratio between ±0.2) immediately preceding the most proximal abnormal probe and the normal probe immediately following the most distal abnormal probe.
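The calling rule and the two region definitions above can be illustrated with a short sketch. It is a simplification under stated assumptions: a call here requires at least five consecutive probes beyond the threshold, whereas the segmentation actually performed by the Nexus software is not reproduced, and the names `CNVCall`, `probe_state` and `call_cnvs` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

THRESHOLD = 0.2   # log2 ratio cutoff from the text
MIN_PROBES = 5    # minimum number of affected probes from the text


@dataclass
class CNVCall:
    kind: str                                       # "gain" or "loss"
    n_probes: int
    min_region: Tuple[int, int]                     # first to last abnormal probe
    max_region: Tuple[Optional[int], Optional[int]]  # flanking probes, None at an array edge


def probe_state(log2_ratio: float) -> str:
    if log2_ratio > THRESHOLD:
        return "gain"
    if log2_ratio < -THRESHOLD:
        return "loss"
    return "normal"


def call_cnvs(probes: List[Tuple[int, float]]) -> List[CNVCall]:
    """probes: (position, log2 ratio) pairs sorted by position on one chromosome."""
    calls = []
    i, n = 0, len(probes)
    while i < n:
        state = probe_state(probes[i][1])
        j = i
        # Extend the run while the next probe is in the same state.
        while j + 1 < n and probe_state(probes[j + 1][1]) == state:
            j += 1
        if state != "normal" and (j - i + 1) >= MIN_PROBES:
            # Simplification: the nearest probe outside the run is taken as the
            # flanking probe; in the typical case it is a normal probe.
            left = probes[i - 1][0] if i > 0 else None
            right = probes[j + 1][0] if j + 1 < n else None
            calls.append(
                CNVCall(state, j - i + 1, (probes[i][0], probes[j][0]), (left, right))
            )
        i = j + 1
    return calls


# Hypothetical data: five consecutive probes below -0.2 flanked by normal probes.
example = [(100, 0.0), (200, -0.5), (300, -0.4), (400, -0.6), (500, -0.3),
           (600, -0.45), (700, 0.05), (800, 0.3), (900, 0.0)]
for call in call_cnvs(example):
    print(call)
```

In this example the single probe above +0.2 at position 800 is not called, because it falls short of the five-probe minimum.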

Defining de novo CNVs

The analyses measured the strength of the hybridization signal obtained in two different array genomic hybridization experiments – one with the child’s DNA in comparison with that of the mother and the second with the child’s DNA in comparison with that of the father. A CNV was considered to be de novo if it was independently identified in both hybridizations. All inherited variants (CNVs that were seen on a hybridization versus one parent but not the other) were analyzed visually, and the CNV was reclassified as de novo if there appeared, by eye, to be a shift of the probe log2 ratio in the same direction in the second parent that had not been called by the software. Each CNV call in which the child’s signal was less than that of the parent (called a ‘loss’ in the tally) could actually represent either a loss of copy-number in the child or a gain of copy-number in the parent. Similarly, each CNV call in which the child’s signal was greater than that of the parent (called a ‘gain’ in the tally) could actually represent either a gain of copy-number in the child or a loss of copy-number in the parent. The direction of the CNV was confirmed with independent qPCR validation on the trio using a commercially available pooled reference set (see Supplementary Notes S1). We have used CNV as a general term without a size limitation because of current recommendations that the original 1 kb minimum CNV size was a reflection of early technologies.14, 15 Exons were numbered according to the NCBI reference sequence database (RefSeq).
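The two-hybridization logic can be sketched as follows. This is an illustration only: the dictionary layout, the overlap criterion and the labels are assumptions, and the visual review and qPCR confirmation of direction described above are not modeled.

```python
def same_event(a, b):
    """True if two calls share chromosome and direction and their intervals overlap.

    Each call is a dict with keys "chrom", "start", "end" and "kind"
    (a hypothetical layout; "kind" is "gain" or "loss" of child relative to parent).
    """
    return (
        a["chrom"] == b["chrom"]
        and a["kind"] == b["kind"]
        and a["start"] <= b["end"]
        and b["start"] <= a["end"]
    )


def classify_trio(vs_mother, vs_father):
    """Label calls from child-versus-mother and child-versus-father hybridizations.

    A call seen in both hybridizations is a putative de novo CNV; a call seen
    against only one parent is flagged for visual review.
    """
    results = []
    matched_father = set()
    for m in vs_mother:
        hits = [k for k, f in enumerate(vs_father) if same_event(m, f)]
        if hits:
            matched_father.update(hits)
            results.append((m, "putative de novo"))
        else:
            results.append((m, "seen versus mother only"))
    for k, f in enumerate(vs_father):
        if k not in matched_father:
            results.append((f, "seen versus father only"))
    return results


# Hypothetical calls for one trio.
vs_mother = [
    {"chrom": "9", "start": 100, "end": 500, "kind": "loss"},
    {"chrom": "X", "start": 10, "end": 20, "kind": "gain"},
]
vs_father = [{"chrom": "9", "start": 150, "end": 480, "kind": "loss"}]
for call, label in classify_trio(vs_mother, vs_father):
    print(call["chrom"], call["kind"], label)
```

A call labeled as seen against one parent only corresponds to the inherited variants that were re-examined by eye in this study.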

Validation of de novo CNVs

CNVs were validated by qPCR (ΔΔCt method) using SYBR Green (Applied Biosystems, Life Technologies, Carlsbad, CA, USA) on an ABI 7500 fast real-time PCR system using both parents and a pooled reference sample from Promega (Madison, WI, USA) (Catalog#: G3041 – male and female and G1521-female only) with hexose-6-phosphate dehydrogenase (H6PD MIM#138090) as the control locus. Primers (listed in ST2) were designed using Primer Express (Applied Biosystems) (detailed in S1). CNVs on the X chromosome were validated against a pooled reference sample composed of female only DNA, whereas those on autosomes were validated against a pooled reference sample composed of both male and female DNA.
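For readers unfamiliar with the ΔΔCt method, the arithmetic can be sketched as below. The sketch assumes 100% amplification efficiency (a doubling per cycle) and a two-copy reference, which is the standard textbook form; the exact calculation used in this study is described in Supplementary Notes S1 and is not reproduced here. The function names and Ct values are hypothetical.

```python
def delta_delta_ct(ct_target_sample, ct_control_sample, ct_target_ref, ct_control_ref):
    """ΔΔCt = (target - control) in the sample minus (target - control) in the reference.

    The control locus in this study was H6PD; the reference was a pooled DNA sample.
    """
    delta_ct_sample = ct_target_sample - ct_control_sample
    delta_ct_ref = ct_target_ref - ct_control_ref
    return delta_ct_sample - delta_ct_ref


def estimated_copy_number(ddct, reference_copies=2):
    """Copy-number estimate, assuming perfect doubling per cycle and a
    reference carrying reference_copies copies of the target (assumption)."""
    return reference_copies * 2 ** (-ddct)


# Hypothetical Ct values: the target amplifies one cycle later in the child
# than in the pooled reference, relative to the control locus.
ddct = delta_delta_ct(26.0, 24.0, 25.0, 24.0)
print(ddct, estimated_copy_number(ddct))
```

Applying the same calculation to the child and to each parent against the pooled reference is what allows a relative loss in the child to be distinguished from a gain in a parent.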

Pathogenicity

A de novo CNV was classified as pathogenic when all of the following criteria were met: (1) it occurred within a known ID gene, (2) it was predicted to disrupt the gene, and (3) the reported phenotype and the phenotype in our patient overlapped. A de novo CNV was classified as likely pathogenic when our patient’s phenotype and the phenotype reported with disruptions in the gene overlapped, but the CNV we identified had not been previously reported.
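These criteria can be written as a small decision rule. The sketch below is an illustration, not a clinical tool: the function name and argument names are hypothetical, the order of the checks (pathogenic before likely pathogenic) is an assumption, and the fallback label for CNVs meeting neither definition is not taken from the text.

```python
def classify_pathogenicity(in_known_id_gene, predicted_to_disrupt,
                           phenotype_overlaps, previously_reported):
    """Apply the two definitions given in the text to one de novo CNV.

    All four arguments are booleans supplied by the curator (hypothetical inputs).
    """
    # Pathogenic: all three criteria must hold.
    if in_known_id_gene and predicted_to_disrupt and phenotype_overlaps:
        return "pathogenic"
    # Likely pathogenic: phenotypes overlap, but this CNV is not previously reported.
    if phenotype_overlaps and not previously_reported:
        return "likely pathogenic"
    # Assumption: anything else is left unclassified here.
    return "uncertain"


print(classify_pathogenicity(True, True, True, True))
print(classify_pathogenicity(True, False, True, False))
print(classify_pathogenicity(False, False, False, False))
```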

Results

We designed a custom array targeting known and hypothesized ID genes with sufficient probe coverage within exons to detect a CNV involving a single exon (Figure 1). We used this custom platform to perform CMA in 165 trios, each consisting of a child with idiopathic ID and both unaffected parents. We identified 176 putative de novo CNVs, of which 36 in 32 trios were confirmed by qPCR (Tables 1 and 2). The 21% validation rate we achieved is in keeping with other CMA research studies using whole-genome arrays.16 Four individuals (patients 931, 331, 600, and 513) exhibited two de novo CNVs each. For each of these four patients, one CNV was clearly pathogenic and the contribution, if any, of the second CNV is unknown.

Table 1 Summary of confirmed de novo CNVs identified in genes or microdeletion/microduplication regions previously associated with ID and summary of patient phenotype and other CMA information
Table 2 Summary of confirmed de novo CNVs identified in genes/loci not previously associated with ID and summary of patient phenotype and other CMA information

Of the 36 confirmed de novo CNVs (27 deletions and 9 duplications), nine involved at least the whole gene (with the possible maximally affected region also including adjacent genes) and three involved large regions known or strongly suspected to cause syndromic ID (15q24 microdeletion, 15q22 microdeletion, and a 5-Mb deletion of 9p24). We confirmed 15 (42%) de novo intragenic single or multi-exon CNVs (examples given in Figure 2), of which seven CNVs (five deletions and two duplications) in seven patients occurred in the JARID2 gene – further studies suggest this is a benign polymorphism occurring at high population frequency (manuscript in preparation). Seven additional CNVs (19%) involve either only exon 1 or exon 1 and the adjacent few exons, and all were analyzed for involvement of the translational start site or promoter region if probe coverage was present upstream of exon 1 using FirstEF, an in silico program.17 Of these seven, three CNVs in two genes (ARID1B, CDH2) removed the translational start site, one CNV involving exon 1 (CHD6) removed the predicted promoter sequence and another CNV (SNTG2) involved only the upstream regulatory sequence of a gene including the promoter as predicted by FirstEF.

Figure 2

Examples of validated de novo CNVs. (a) A 33-kb loss of exon 3 of the IL1RAPL1 gene in patient 881. (b) A terminal loss of chromosome 9 in patient 412. (c) Loss of exon 1 of the CDH2 gene in patient 576. In each image, the top portion shows the chromosome ideogram and relevant genes within the CNV. Losses are indicated by a red bar below the ideogram. The X-axis is base pairs and the Y-axis is the log2 ratio of the hybridization of the affected child relative to a normal parent. The red and green bars on either side of 0 designate the log2 ratio cutoffs for the program to flag a (heterozygous and homozygous) loss and gain, respectively.

We identified 14 CNVs in 14 individuals involving 12 genes known to be causative for ID, as well as three microdeletion CNVs (in three individuals). Table 1 summarizes our array findings and the clinical phenotype of these patients. We consider 11 CNVs to be pathogenic in our patients based on the loss or predicted loss of function of the encoded protein and the phenotypic overlap between our patients and those previously reported (STXBP1,11, 18 SHANK3 (three patients),12 IL1RAPL1,19, 20 UBE2A,21, 22 NRXN1,23, 24 MEF2C,25 CHD7,26, 27 15q24 (ref. 28) and 9p24 microdeletions). We consider two whole gene duplications to be likely pathogenic (patients 224, 8327): DCX duplication has not been reported in ID, but the gene is sensitive to loss,29 and the PI4KA duplication in our patient has already been published.30 We identified two CNVs affecting genes (FREM2, GRIK2) associated with autosomal recessive forms of ID. In patient 419 with a heterozygous deletion of GRIK2, Sanger sequencing of all GRIK2 exons and their intronic boundaries did not reveal any mutations, making it unlikely that the phenotype in this patient was due to loss of GRIK2. The phenotype of patient 354 (Table 1) was not consistent with that reported for autosomal loss of FREM2. Therefore, we considered these two CNVs unlikely to be pathogenic in these patients. For two other de novo CNVs, pathogenicity is unclear (ARID1B, 15q22 microdeletion). We identified a duplication of exon 1 of ARID1B, and the genomic location of this extra exon is unknown. The effect of a tandem duplication of exon 1 is difficult to predict, as in this case it does not disrupt the reading frame, and functional studies would be required to determine pathogenicity. In the case of the 15q22 deletion, the minimally affected region contains three non-ID genes (GCNT3, FOXB1 and BNIP2) and maximally 13.8 Mb of sequence may be involved; therefore, without a whole-genome CMA to better delineate the extent and genes involved in the CNV, pathogenicity is unclear.

Whole-genome research or clinical CMA was run on nine of the patients who had confirmed de novo CNVs within genes known to cause ID (Table 1). Of these nine cases, six were also abnormal by clinical array with confirmed CNVs that overlapped with our affected regions, while three of the de novo CNVs we found were not identified by clinical CMA. The first of these was an 18.8-kb pathogenic deletion of exons 20–23 of SHANK3 (patient 828). Interestingly, the same clinical array identified a 43-kb deletion of exons 9–23 of SHANK3 in two other patients (patients 622 and 248), in whom our array also detected the same CNVs. The second was a 3.5-kb deletion of exons 10 and 11 of STXBP1 (patient 970) and the third was a 5.5-Mb deletion involving the whole NRXN1 gene (patient 513). We were perplexed that the clinical CMA missed this large CNV, even though the platform used probes this gene with 30 markers. It is possible that the CNV includes only NRXN1 and that 30 markers are insufficient to call a CNV; these findings highlight how CMA data analysis methods may affect results differently.

Three patients were found to have CNVs in known pathogenic microdeletion regions. The first was a 1.3- to 4.5-Mb deletion (minimally–maximally affected size) within 15q24, a known microdeletion syndrome region.28 The phenotype of our patient (Table 1) is consistent with the reported syndrome, indicating that this CNV is pathogenic. The second patient (patient 412) had a 178-kb to 6.6-Mb deletion of 9p24 that includes the SMARCA2 gene targeted by our array. A clinical CMA (with a genomic backbone and thus able to better define breakpoints than our design) characterized this de novo CNV as a 5-Mb deletion. The clinical CMA in this patient identified an additional 3.5-Mb duplication CNV of 16q24.1 that we did not detect because our array design lacks probe coverage in that region. The clinical report for this patient called the 9p24 deletion pathogenic based on size, while the smaller 16q24.1 CNV was of unknown clinical significance. In the third patient (patient 452), we identified a 450-kb to 13.8-Mb deletion of 15q22, and for reasons explained above we are currently uncertain of the pathogenicity of this event.

In addition to designing a proof-of-principle exon-resolution targeted ID array, one goal of this project was to investigate genes involved in epigenetic regulation and synaptogenesis for involvement in causative CNVs in patients with ID. We identified and independently validated 19 de novo CNVs that included genes of these categories in 19 patients (Table 2). Many of these CNVs occurred within genes for which single case reports have been published or for which mouse model data are consistent with a role in ID. In these instances, further study is needed to determine the extent of these CNVs with detailed phenotype–genotype correlations (manuscripts in preparation). We have therefore omitted a detailed assessment of the pathogenicity of these novel findings in this paper.

Interestingly, three of the above 19 patients had a clinical CMA performed, and all of them were normal on the clinical platform. Patient 164, who had a deletion involving at least the CADM2 gene, was analyzed on a 105K GeneDx array. We do not know the probe coverage of this gene on the GeneDx platform; however, because CADM2 is not part of a recognized ID syndrome, the array is unlikely to have contained sufficient probes to detect a single-gene imbalance. Two other patients (patients 576 and 7581) both showed a minimal loss of exons 1–2 of CDH2. One of these patients was part of another CMA study,30 in which no corresponding CNV was identified using an Affymetrix 500 K, Agilent 244 K or NimbleGen 385 K array. However, none of these platforms included a sufficient number of probes (>5) to call a CNV in this gene. This suggests that the affected region is probably restricted to the deletion of exons 1–2 of CDH2, which does remove the translational start site located in exon 1. The third clinical array was run on a patient with a deletion of exon 1 of the CHD6 gene; however, this was a clinical BAC array (Signature Genomics) that did not have CHD6 in its target list as of April 2010. These data highlight the potential for gene-centric array design to identify CNVs that could be missed by clinical CMA.

Discussion

The use of whole-genome CMA to identify microdeletion/microduplications has greatly improved the rate of diagnosis of genetic imbalance in ID patients, so much so that CMA is now recommended as the first-line test for individuals with ID.31 There has been an increasing interest in intragenic CNVs, and a number of investigators have reported single-exon-resolution CMA for small sets of carefully chosen genes in affected individuals7, 8, 9, 10 and in normal individuals.32, 33 However, there have been few studies assessing a large set of candidate genes thought to be causative for a complex and highly heterogeneous condition like ID.34 Therefore, the aim of our study was to design a custom array able to identify CNVs in known ID genes/loci as well as in candidate ID genes, at single-exon resolution, which is well below the current level of detection of standard clinical CMA.

Of the 36 validated de novo CNVs, 23 were intragenic, involving one or more exons within a single gene (Tables 1 and 2). Boone et al. reported results from their analysis of 3743 cases referred to the Medical Genetics Laboratory at Baylor College of Medicine for CMA with a custom targeted clinical array probing 1700 candidate genes for a variety of clinical conditions including ID, at an average coverage of four probes per exon.35 The authors found 40 CNVs involving one or more exons, of which 15 were known to cause a recognized phenotype concordant with that of the patients. Three of the genes involved in de novo CNVs in our study – IL1RAPL1, STXBP1, and NRXN1 – were also identified in the Baylor cohort, although no overlap in intragenic deletions was observed. Similar to the study by Boone et al., our study highlights the high proportion (64% in our study) of intragenic CNVs that are pathogenic or potentially pathogenic for disease.

Whole-genome CMA identifies pathogenic CNVs in 15–20% of children with ID who have a normal karyotype.6 We hypothesized that we could identify more pathogenic CNVs with a targeted exonic resolution array than with a whole-genome or clinical CMA. As reported above, 12 of our patients with validated de novo CNVs were also tested on various other clinical or whole-genome research CMA platforms; for six patients, the other platforms reported normal results. The CNVs in three of these patients we consider to be pathogenic, and all are intragenic CNVs involving genes known to cause ID. Of the six known ID loci that were also detected by clinical CMA, the average size of the CNV was 2600 kb (range 43 kb–5000 kb), which was significantly larger than the average size (9.3 kb; range 3.5 kb–18.5 kb) of the three CNVs involving known ID genes that were missed by the clinical CMA. Current whole-genome clinical arrays have sparse coverage of individual exons within a gene and rely on probe coverage within the whole gene to identify a CNV. This type of design would miss single-exon or possibly multi-exon CNVs. In our array design, we chose to have eight probes per exon and required a minimum of five probes showing a shift of the log2 ratio to call a CNV, enabling us to detect intragenic CNVs that are missed by platforms with less dense probe coverage.

Of the 176 putative de novo CNVs identified by our array, we confirmed 36, a true positive rate of 21%, similar to other whole-genome research CMA studies in which sensitivity is emphasized over specificity.16 Because of the research nature of this project, the settings to identify a CNV were liberal in order to detect previously unrecognized ID loci or uncover novel intragenic causative CNVs, and therefore a high false positive rate is expected. In terms of the specificity of our design, because of our Canadian health care structure, not all patients run on our custom array were fortunate to also have a clinical array performed. Therefore, without analyzing all of our patients on clinical arrays, it is unclear if our array design would miss potentially pathogenic CNVs. However, for patients for whom clinical CMA was performed, our design identified all pathogenic CNVs that were detected by clinical CMA. In only one case (patient 412) did our array miss one of the two CNVs identified on clinical CMA, as discussed previously.

We identified 17 CNVs in 11 genes (five are known ID genes: ARID1B, MEF2C, CHD7, UBE2A, and JMJD1C) with epigenetic regulatory function and eight CNVs in six genes (five are known ID genes: GRIK2, SHANK3, STXBP1, IL1RAPL1, and NRXN1) with synaptogenic function. Many of the candidates we identified have not been previously reported to cause ID or have only been reported in an animal model or a few limited case reports and require further study, beyond the scope of this work, to determine pathogenicity.

Although our custom microarray provided an effective platform for identifying known and novel CNVs associated with ID, its design has several limitations. First, the lack of coverage outside our targeted regions usually prevented us from determining the CNV breakpoints. In our experience, as is also generally found,14, 31 breakpoints defined by CMA are often imprecise, as they are based on statistical inference. Nevertheless, the location of breakpoints within exons on our custom chip is more precise because of the density of our probe coverage within targeted regions. The fact that breakpoints within an intronic sequence cannot be localized precisely should not alter our interpretation of an intragenic CNV as long as the canonical splice donor and acceptor sequences are intact. However, it is possible that more complex rearrangements are present (for example, the multi-copy gains may not be positioned in tandem or even on the same chromosome) that we are unable to assess without breakpoint sequencing. We did not perform breakpoint sequencing as it is beyond the scope of this work, but doing so would have allowed us to define the genotype more accurately and also infer genomic mechanisms for CNV causation. Second, the sparse and irregular genomic coverage prevents us from knowing whether pathogenic CNVs of untested regions are present in these patients. Both issues could be resolved by adding a ‘backbone’ of regularly spaced probes throughout the genome. Third, CMA is imprecise at estimating the number of copies present at a particular locus. qPCR was performed on all members of the trio and compared with a pooled reference set, allowing us to resolve de novo CNV calls within polymorphic loci (i.e., a CNV called as a loss in a child can actually be a gain in the parent, or vice versa). Such loci accounted for many of the false positive de novo CNVs in our CMA analysis. However, complex loci that have polymorphic alleles with several different copy-numbers or occur in various different overlapping sizes may have confounded both our CMA and qPCR analysis. Finally, it must be borne in mind that in those cases where a clear genotype–phenotype correlation has not been established, the rare de novo event we have detected may not be necessary and sufficient to produce ID in the affected child. This is almost certainly true of the patients harboring CNVs involving JARID2 exon 6. As information from higher resolution CMA and whole-exome and -genome sequencing studies of well-phenotyped ID patients accumulates, it should be possible to characterize the pathogenicity of many of the CNVs we found that are currently of uncertain clinical significance.

In summary, CMA was performed on 165 idiopathic ID trios using an exon-level-resolution custom microarray and de novo CNVs were confirmed in 32 trios. Sixty-four percent of our validated de novo CNVs were intragenic, including three pathogenic CNVs not identified by clinical CMA. Intragenic CNVs are likely a significant contributor to genetic causes of ID and will be missed with current commercially available clinical arrays using current clinical CMA guidelines.31