Over the past several years, our understanding of copy-number variation (CNV) within the human genome and its relation to disease has rapidly evolved. Molecular cytogenetic techniques such as microarray-based comparative genomic hybridization have identified disease-causing CNVs in a variety of disorders, ranging from pediatric disease (congenital anomalies, intellectual disability, epilepsy, and autism spectrum disorders) to adult-onset conditions such as schizophrenia. Of note, many CNVs were identified in multiple, variable disease cohorts, indicating that identical genetic changes could result in different phenotypes.1,2 Furthermore, some of these CNVs were inherited from phenotypically normal parents.1,2 Although the genetics community was already familiar with variable expressivity in the classic example of 22q11.21 deletions, traditional cytogenetics had taught us to use inheritance of a genetic change as a definitive factor for pathogenicity. Specifically, de novo aberrations are thought to be more deleterious, whereas inherited rearrangements (such as a marker chromosome) are considered more benign. However, for newly described CNVs like the distal 1q21.1 microdeletions/microduplications, despite variable phenotypes and inheritance from normal parents, enrichment of the CNVs among affected individuals in comparison with healthy controls implicated them as pathogenic.3 As increasing numbers of cases and controls are studied for CNVs, we are discovering many additional examples of these “predisposing,” or “susceptibility,” loci.1,2,4,5

Microarray analysis is now recommended as a first-tier test for many pediatric neurodevelopmental disorders.6,7 Postnatal identification of one of these susceptibility CNVs explains at least one part of the genetic etiology of the disorder in the individual, although additional factors, either genetic or environmental, are likely to ultimately influence the phenotypic expression of these loci.1,8 Additional genetic factors, such as other CNVs, may be identified via microarray testing, but in many cases, the other influences on the phenotype remain unknown. This poses challenges to recurrence-risk counseling because subsequent children inheriting the CNV could have more or less severe, or no, phenotypic consequences, and specific testing is not available to inform such predictions. In addition, as the use of microarrays in prenatal settings increases, fetuses without a known family history of these CNVs will be identified as carriers. This can lead to counseling dilemmas and parental anxiety, especially in low-risk pregnancies, because the associated neurodevelopmental phenotypes cannot be ascertained prenatally and it is difficult to quantify the risk to the fetus. To aid in counseling for these CNVs, we calculated empiric estimates for penetrance on the basis of the CNV frequencies in our population of postnatal microarray-based comparative genomic hybridization samples and in control populations.

Materials and Methods

We examined postnatal specimens received by our laboratory, mostly from the United States, for clinical microarray-based comparative genomic hybridization between March 2004 and April 2012. An analysis of the indications for study among samples received in the first quarter of 2008 and in the first quarter of 2011 showed that 51–54% of individuals had developmental delay/intellectual disability and 10–11% had epilepsy. Between those two periods, cases with autism spectrum disorders increased from 10% to 14% and those with congenital anomalies increased from 16% to 23%, whereas those with dysmorphic features decreased from 25% to 16% and those with unspecified indications for study decreased from 7% to 5%. These are likely underestimates of actual phenotypes because not all phenotypic features are recorded on the test requisition form. The array platform used depended on the date of specimen receipt because array designs changed over time. Samples were tested on targeted, bacterial artificial chromosome–based arrays (SignatureChip versions 1–4; Signature Genomic Laboratories, Spokane, WA; n = 15,411), whole-genome, bacterial artificial chromosome–based arrays (SignatureChipWG versions 1–2; Signature Genomic Laboratories; n = 8,113), or whole-genome, oligonucleotide-based arrays (SignatureChipOS version 1; manufactured by Agilent Technologies, Santa Clara, CA; SignatureChipOS versions 2–3; manufactured by Roche NimbleGen, Madison, WI; all custom designed by Signature Genomic Laboratories; n = 25,113) according to previously described methods.9,10,11,12 For the CNVs analyzed here, the targeted, bacterial artificial chromosome–based arrays covered only 22q11.21 and proximal 1q21.1, whereas the whole-genome arrays covered all studied CNVs.
Frequencies for 15q11.2 deletions were calculated only for cases studied on oligonucleotide-based arrays because the CNV was initially interpreted as likely benign and, therefore, was not captured in our database for the cases studied with bacterial artificial chromosome–based arrays. For determination of CNV frequencies, only CNVs of the recurrent size, as determined within the limits of resolution of the array used, were counted; any CNVs that did not include the entire region or that extended into surrounding regions were excluded. However, individuals who harbored CNVs at other loci were included in the CNV frequencies.
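The coverage rules above imply a different case denominator for each class of CNV. The following is a minimal sketch of that bookkeeping, using the per-platform sample counts given in the text; the dictionary keys and the function name are hypothetical labels chosen for illustration, not identifiers from our pipeline.

```python
# Number of postnatal cases tested on each array family (counts from the text).
PLATFORM_COUNTS = {
    "targeted_bac": 15_411,
    "whole_genome_bac": 8_113,
    "whole_genome_oligo": 25_113,
}

# Platforms on which each class of CNV was covered and recorded.
# The class labels are hypothetical names used only in this sketch.
COVERAGE = {
    "22q11.21_or_proximal_1q21.1": (
        "targeted_bac", "whole_genome_bac", "whole_genome_oligo"),
    "15q11.2_deletion": ("whole_genome_oligo",),
    "other_studied_cnv": ("whole_genome_bac", "whole_genome_oligo"),
}


def case_denominator(cnv_class: str) -> int:
    """Total number of cases in which a CNV of this class could be counted."""
    return sum(PLATFORM_COUNTS[platform] for platform in COVERAGE[cnv_class])


for cnv_class in COVERAGE:
    print(f"{cnv_class}: {case_denominator(cnv_class):,} cases")
```

Under these rules the denominators work out to 48,637 cases for 22q11.21 and proximal 1q21.1, 25,113 for 15q11.2 deletions, and 33,226 for the remaining CNVs.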

Control specimens included samples from 8,329 previously described adult controls profiled on Illumina single-nucleotide polymorphism arrays.4 Additional control specimens were collected from the Atherosclerosis Risk in Communities study (dbGaP accession phs000090.v1.p1) and the Wellcome Trust Case Control Consortium (WTCCC2 1958 British birth cohort). Both the Atherosclerosis Risk in Communities and WTCCC2 data were derived from Affymetrix SNP6.0 (Affymetrix, Santa Clara, CA) array profiles and processed using Affymetrix Genotyping Console 4.1 with hg18 chromosome annotations. Samples were filtered using the default contrast quality control parameters, and segmentation was also performed using default settings. Additional filtering was applied to remove samples with excessive CNV counts, and a threshold of >72 CNVs per sample was established using an outlier detection method for skewed data.13 After quality control filtering, the final control set consisted of 11,305 controls from the Atherosclerosis Risk in Communities study, 2,612 controls from the WTCCC2 58C cohort, and 8,329 previously published controls.
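As a rough illustration of the final filtering step, the sketch below applies the reported threshold of 72 CNVs and totals the post-filter control counts. It does not reproduce the outlier detection method used to derive the threshold; the sample names and per-sample CNV counts are hypothetical.

```python
MAX_CNVS_PER_SAMPLE = 72  # threshold reported in the text


def passes_cnv_count_filter(cnv_count: int,
                            threshold: int = MAX_CNVS_PER_SAMPLE) -> bool:
    """Keep a sample unless its CNV count exceeds the threshold (>72 removed)."""
    return cnv_count <= threshold


# Hypothetical per-sample CNV counts, for illustration only.
example_counts = {"sample_a": 31, "sample_b": 72, "sample_c": 118}
kept = [name for name, count in example_counts.items()
        if passes_cnv_count_filter(count)]

# Control counts after quality control, as reported in the text.
CONTROLS = {"ARIC": 11_305, "WTCCC2_58C": 2_612, "previously_published": 8_329}
total_controls = sum(CONTROLS.values())

print(kept)
print(f"{total_controls:,} controls in total")
```

The three control sets together give 22,246 controls.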

CNVs chosen for study were recurrent, identified in controls, and significantly enriched in cases (Table 1). A Bayesian analysis was performed, based on the method used by Vassos et al.14 for the calculation of the penetrance of CNVs associated with schizophrenia, although, unlike their method, we used the observed population CNV frequencies directly in the following calculation. In brief, penetrance was calculated as:

$$P(D \mid G) = \frac{P(G \mid D)\,P(D)}{P(G \mid D)\,P(D) + P(G \mid \bar{D})\,P(\bar{D})}$$

where D = disease, G = genotype (i.e., the presence of the CNV), and $\bar{D}$ = absence of disease. Because we intended to calculate the probability of any abnormal pediatric phenotype when the CNV was identified on prenatal microarray testing, we defined the frequency of disease (P(D)) to be 5.12%, which is derived from the work of Baird et al.,15 who estimated the population frequency of diseases with an important genetic component among individuals younger than 25 to be 53 in 1,000. We subtracted from this 1.8 per 1,000, the frequency of chromosomal disorders, because these will have been ruled out in most cases through karyotyping. The 95% confidence intervals (CIs) for our penetrance estimates were derived from binomial CIs for the case and control counts, calculated by the Clopper–Pearson exact tail area method. Using penetrance samples from both case and control distributions, we first calculated maximal and minimal likely counts in which the probability of generating a more extreme count for either cases or controls is 15.8%; thus the probability of sampling a more extreme combination of case and control counts is 0.158 × 0.158 = 0.025 per tail. In the case of observed proportions of 0 and 1, the lower and upper binomial confidence bounds are fixed at 0 and 1, respectively. The lower penetrance bound is thus defined by substituting the maximal likely control count and minimal likely case count, whereas the upper bound is defined by substituting the minimal likely control count and maximal likely case count. Because this methodology is based on two one-tailed analyses, the actual CI will approach 97.5% as case and control counts approach their respective minima and maxima.

Table 1 Penetrance estimates with case and control frequencies for recurrent CNVs
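The penetrance calculation and the interval construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the study: it works with frequencies (count divided by sample size) instead of counts, finds the Clopper–Pearson bounds by bisection on the binomial distribution using only the standard library, and uses hypothetical case and control counts. All function names are our own for this sketch.

```python
import math

# Background frequency of disease from the Methods: 53 per 1,000 minus 1.8 per 1,000.
P_D = 53 / 1000 - 1.8 / 1000  # 0.0512

# Per-distribution tail probability quoted in the text (0.158 * 0.158 is about 0.025).
TAIL = 0.158


def penetrance(p_g_given_d, p_g_given_not_d, p_d=P_D):
    """Bayes' rule: P(D|G) from the CNV frequency in cases and in controls."""
    numerator = p_g_given_d * p_d
    denominator = numerator + p_g_given_not_d * (1.0 - p_d)
    if denominator == 0.0:
        raise ValueError("CNV absent from both groups; penetrance is undefined")
    return numerator / denominator


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def clopper_pearson(k, n, tail):
    """Exact (Clopper-Pearson) bounds on a proportion, with `tail` in each tail."""
    def solve(f, increasing):
        lo, hi = 0.0, 1.0
        for _ in range(60):  # bisection on a monotone function of p
            mid = (lo + hi) / 2.0
            if (f(mid) < tail) == increasing:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Bounds are fixed at 0 and 1 for observed proportions of 0 and 1.
    lower = 0.0 if k == 0 else solve(lambda p: 1.0 - binom_cdf(k - 1, n, p), True)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), False)
    return lower, upper


def penetrance_interval(case_k, case_n, ctrl_k, ctrl_n, p_d=P_D, tail=TAIL):
    """Penetrance bounds from the least and most favourable frequency pairs."""
    case_lo, case_hi = clopper_pearson(case_k, case_n, tail)
    ctrl_lo, ctrl_hi = clopper_pearson(ctrl_k, ctrl_n, tail)
    lower = penetrance(case_lo, ctrl_hi, p_d)  # fewest cases, most controls
    upper = penetrance(case_hi, ctrl_lo, p_d)  # most cases, fewest controls
    return lower, upper


# Hypothetical counts for illustration only; they are not taken from Table 1.
case_k, case_n = 50, 30_000
ctrl_k, ctrl_n = 5, 20_000
point = penetrance(case_k / case_n, ctrl_k / ctrl_n)
low, high = penetrance_interval(case_k, case_n, ctrl_k, ctrl_n)
print(f"penetrance = {point:.1%} (interval {low:.1%} to {high:.1%})")
```

When a CNV is absent from controls, the lower control bound is 0 and the upper penetrance bound becomes 1, which mirrors the reason such CNVs were excluded from the study.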


Results

Penetrance estimates for these CNVs range from 10.4% (95% CI, 8.45–12.7%) for 15q11.2 deletions, which only represents about a twofold increase in risk over the background population risk, to 62.4% (95% CI, 26.8–94.4%) for distal 16p11.2 deletions (Table 1). The lower penetrance figures are seen with CNVs that show less marked differences in frequencies between cases and controls, including distal 16p11.2 duplications, 16p12.1 deletions, and 16p13.11 deletions. The CNVs with a larger difference between cases and controls, including proximal 16p11.2 deletions and distal 1q21.1 deletions, have higher penetrance rates. In addition, higher penetrance is seen with CNVs that have higher de novo frequencies (Table 1; P = 0.0029, Spearman correlation). For some of these CNVs that are still rare in controls, such as the distal 16p11.2 deletions, screening a larger control group would help to ensure a more precise estimate of penetrance. Other CNVs, such as the BP4-BP5 15q13.2q13.3 microdeletion, were not found in controls and were therefore not part of this study, because their penetrance would be estimated at 100%. These CNVs may be inherited from apparently healthy parents in some cases,16 so penetrance is not complete, yet our data could not be used to estimate a value. Although similar penetrance estimates based on a subset of these data were recently part of a corrigendum to the article by Cooper et al.,4 our increased population sizes and inclusion of only postnatal cases give increased power to the estimates in this current study.
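The "about twofold" comparison is the ratio of the penetrance estimate to the background frequency of disease, P(D) = 5.12%, defined in the Methods. A short sketch of that arithmetic for the two extremes quoted above follows; the function name is ours.

```python
# Background frequency of disease used in the Methods (5.12%).
P_D = 53 / 1000 - 1.8 / 1000


def fold_increase(penetrance: float, background: float = P_D) -> float:
    """Risk with the CNV relative to the background population risk."""
    return penetrance / background


# Penetrance point estimates quoted in the text.
for label, pen in [("15q11.2 deletion", 0.104),
                   ("distal 16p11.2 deletion", 0.624)]:
    print(f"{label}: {fold_increase(pen):.1f}-fold over background")
```

This gives roughly a 2.0-fold increase for 15q11.2 deletions and roughly a 12.2-fold increase for distal 16p11.2 deletions.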


Discussion

By using a patient population with a variety of phenotypes, we are able to provide penetrance estimates for our group of disease susceptibility CNVs for a range of abnormal pediatric phenotypes. This is both a strength and a weakness, because our estimates apply simply to the presence or absence of any abnormal pediatric phenotype without providing information about expressivity. It is well established that these CNVs lead to a spectrum of phenotypes, and predictions about severity (expressivity) are not possible on the basis of the data presented here. Some CNVs may have an association with a specific phenotype, and different calculations could provide separate estimates for a phenotype of concern. For example, Bayesian analysis for proximal 16p11.2 deletions and autism spectrum disorders, with P(G|D) being 0.5%17 and P(D) being 1/110,18 yields a penetrance estimate of 14.5% for an autism spectrum disorder phenotype in the presence of a proximal 16p11.2 deletion. Notably, this is lower than our penetrance for any abnormal phenotype, which supports the use of our estimates to include specific phenotypes among a number of other possible manifestations. In addition, our estimates do not include risks for adult-onset or other conditions, such as obesity, that alone would likely not lead to an individual being referred for clinical microarray-based comparative genomic hybridization testing. Although subclinical phenotypes may not be of concern, adult-onset conditions might be, and penetrance has been estimated for some of these CNVs with respect to, for example, schizophrenia.14 Finally, it should be noted that we could have underestimated penetrance because the controls studied did not have in-depth phenotyping and may include mildly affected individuals. Also, these estimates are based on populations that are assumed to be mostly Caucasian, so it is unclear whether estimates would vary in other ethnic groups.
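The phenotype-specific calculation mentioned above for autism spectrum disorders uses the same Bayesian formula as in the Methods, with the disease-specific P(G|D) and P(D) substituted. The sketch below illustrates it. The control carrier frequency is not stated in this passage, so the value used here (6 carriers among the 22,246 controls) is a placeholder assumption chosen for illustration and should not be read as a reported figure.

```python
def penetrance(p_g_given_d: float, p_g_given_not_d: float, p_d: float) -> float:
    """Bayes' rule: P(D|G) from the CNV frequency in affected and unaffected groups."""
    numerator = p_g_given_d * p_d
    return numerator / (numerator + p_g_given_not_d * (1.0 - p_d))


p_g_given_asd = 0.005   # P(G|D): 0.5%, from the text
p_asd = 1 / 110         # P(D): population frequency, from the text

# ASSUMPTION: placeholder control frequency, not a value given in this passage.
p_g_given_unaffected = 6 / 22_246

asd_penetrance = penetrance(p_g_given_asd, p_g_given_unaffected, p_asd)
print(f"Penetrance for an autism spectrum disorder phenotype: {asd_penetrance:.1%}")
```

With this placeholder, the result is close to the 14.5% quoted above; a different control frequency would shift the estimate accordingly.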

The calculation model and estimates provided here will hopefully be a useful tool in prenatal genetic counseling, providing one more piece of information to inform prospective parents on the risks associated with carrying a specific CNV. Although counseling should still include information about the range of possible phenotypic outcomes, penetrance estimates can help to put the degree of risk into perspective; for example, counseling about a 15q11.2 deletion could be relatively reassuring with a ~90% likelihood of a normal phenotype, as compared with an ~50% chance of a normal outcome with a proximal 16p11.2 deletion. The ultimate phenotype of the child is probably affected by his or her genetic background and by environmental factors, the vast majority of which are unknown and therefore cannot be tested. Even when microarray testing identifies an additional CNV, it is not possible to predict how the CNVs may interact. Although it is still possible that prenatal microarray testing will identify a novel CNV of unclear clinical significance, in which case data do not exist to apply this model, large population studies have estimated that ~1/200 low-risk pregnancies carry a clinically significant CNV, many of which are at these recurrent loci;19,20 and so these penetrance estimates are likely applicable for many abnormal prenatal microarray results.


Disclosure

J.A.R. is an employee of Signature Genomic Laboratories, a subsidiary of PerkinElmer, Inc. E.E.E. is on the scientific advisory boards for Pacific Biosciences, Inc., SynapDx Corp., and DNAnexus, Inc. The other authors declare no conflict of interest.