Indexing Effects of Copy Number Variation on Genes Involved in Developmental Delay

A challenge in clinical genomics is to predict whether copy number variation (CNV) affecting a gene or multiple genes will manifest as disease. Increasing recognition of gene dosage effects in neurodevelopmental disorders prompted us to develop a computational approach based on critical-exon (highly expressed in brain, highly conserved) examination for potential etiologic effects. Using a large CNV dataset, our updated analyses revealed significant (P < 1.64 × 10−15) enrichment of critical-exons within rare CNVs in cases compared to controls. Separately, we used a weighted gene co-expression network analysis (WGCNA) to construct an unbiased protein module from prenatal and adult tissues and found it significantly enriched for critical exons in prenatal (P < 1.15 × 10−50, OR = 2.11) and adult (P < 6.03 × 10−18, OR = 1.55) tissues. WGCNA yielded 1,206 proteins for which we prioritized the corresponding genes as likely to have a role in neurodevelopmental disorders. We compared the gene lists obtained from critical-exon and WGCNA analysis and found 438 candidate genes associated with CNVs annotated as pathogenic, or as variants of uncertain significance (VOUS), from among 10,619 developmental delay cases. We identified genes containing CNVs previously considered to be VOUS to be new candidate genes for neurodevelopmental disorders (GIT1, MVB12B and PPP1R9A) demonstrating the utility of this strategy to index the clinical effects of CNVs.


Clinical Microarray Datasets
The clinical microarray (CMA) data was obtained from two independent sites, The Hospital for Sick Children (SickKids) and Credit Valley Hospital (CVH). A total of 7,106 and 3,513 cases CMA data were obtained, respectively, who went through confirmed diagnosis for DD (Table S1). In both sites, ISCA 180K comparative genomic hybridization array was used to detect large CNVs by applying circular binary segmentation algorithm. For reference, we used a pool of 10 samples to compare individual probe intensities. The clinical annotation for each sample variant was conducted by the clinical geneticist in each site.
DNA extracted from peripheral blood was used to perform comparative genomic hybridization array (aCGH) analysis on a custom designed 4 X 180K oligonucleotide microarray platform (Oxford Gene Technology (OGT), Oxford, UK). Microarray experiments were performed according to the manufacturer's instructions. Briefly, DNA from the proband and pooled same-sex reference DNA (Promega, Madison, WI) were labeled with Cy3-dCTP and Cy5-dCTP, respectively and were hybridized to the array slide according to the manufacturer's protocol (OGT). The arrays were scanned using the Agilent G2505Bmicroarray scanner. Data analysis was performed using the Agilent Feature Extraction software (10.7.11) and CytoSure Interpret Software version 3.4.3 (OGT). Clinical interpretation of copy number variants was consistent with the ACMG guidelines 1 . Parental follow-up studies were performed by FISH analysis on cultured lymphocytes using standard protocols. Metaphase chromosomes were counter-stained with DAPI, and inverted grey scale imaging was used to visualize chromosome banding patterns for chromosome identification, using the ISIS Metasytems imaging software version 5.5.4 (Newton, MA, USA). Parental follow-up of deletions less than 200 kb and duplications less than 700 kb were performed by aCGH.
We have used 9,692 unrelated control samples from multiple major population scale studies that used high-resolution microarray platform (Table S3). These samples do not have any obvious psychiatric history. The studies include 4,347 control samples assayed in Illumina 1M from the Study of Addiction Genetics and Environment (SAGE) 2 and the Health, Aging, and Body Composition (HABC) 3 ; 2,988 control samples assayed in Illumina Omni 2.5M from COGEND 4 and KORA projects 5 ; 2,357 control samples assayed in Affymetrix 6.0 from Ottawa Heart Institute (OHI) 6 and PopGen project 7 . In addition, we have incorporated additional 11,255 control datasets assayed in Illumina platforms from ARIC and WTCC2 project 8 .

Critical Exon Classification
For critical exon classification described in (Uddin et al, 2014) 9 , we used the 1000 genomes project for rare missense loss of function (LOF) mutation burden computation and transcriptome data from the human developmental brain atlas.

a. Burden of rare missense mutations
We used data from the 1000 genomes project 10 initiated by the US National Health Heart, Lung and Blood Institute (NHLBI) to calculate the burden of rare missense mutations in human populations. 1,039 whole genome sequencing samples (495 males, and 544 females) 10 . Within these whole genome sequenced (WGS) samples, exonic regions had mean coverage of at least 20X. We used the RefSeq gene annotation model (which includes all exons from annotated isoforms) for our analysis. Genes with no variant calls were excluded. As described previously 9 , we annotated the variants using Annovar and considered rare missense and loss of function (LOF) variants as strong proxy for recent (mostly within the last 5,000-10,000 years) rare deleterious mutation events in humans.

b. Spatio-temporal Human Brain Expression:
The normalized RNA-seq expression profiles of spatio-temporal developmental human brains were downloaded from the BrainSpan database (http://www.brainspan.org/static/download.html). We have analyzed 388 tissue samples from 32 post mortem donors (prenatal and adult). The expression measures were provided for exons as reads per kilobase (kb) per million (RPKM) from mapped reads. Method details for sequencing, alignment, QC and expression quantification can be found in the BrainSpan Technical White Paper (http://www.brainspan.org/). We have conducted our spatial-temporal (prenatal and adult) analysis on 16 brain regions, including 11 neocortex regions (V1C, primary visual cortex; STC, posterior (caudal) superior temporal cortex; IPC, posterior inferior parietal cortex; A1C, primary auditory cortex; S1C, primary somatosensory cortex; M1C, primary motor cortex; DFC, dorsolateral prefrontal cortex; MFC, medial prefrontal cortex; VFC, ventrolateral prefrontal cortex; OFC, orbital frontal cortex; ITC, inferolateral temporal cortex) and AMY, amygdaloid complex; CBC, cerebellar cortex; HIP, hippocampus; MD, mediodorsal nucleus of thalamus; and STR, striatum. To classify critical exon, we have computed the 75 th percentile 9 value from the entire dataset and used it as a threshold to define exons with high and low expression. Critical exon fraction was computed for a gene or a group of genes (impacted by CNVs) by applying the 75 th percentile index on all exons. The fraction was computed by dividing the number of exons classified as critical exon over total number of exons.

Human Developmental Protein Expression Data
The protein expression levels for the genes were analyzed using high-resolution genome-wide Fourier-transform mass spectrometry data 11 (downloaded from Human Proteome Map). We have used in-depth proteomic profiling of 30 histologically normal human samples, including 17 adult tissues (lung, heart, liver, gall bladder, adrenal gland, kidney, urinary bladder, prostate, testis, ovary, rectum, colon, pancreas, oesophagus, retina, frontal cortex, and spinal cord) and 7 fetal tissues (liver, heart, brain, placenta, gut, ovary, testis) 11 . High-resolution Fourier transform mass spectrometers used for fragmentation (high-high mode) to process the data. The data resulted in the identification of proteins encoded by 17,294 genes accounting for approximately 84% of the total annotated protein-coding genes in humans 11 . Average spectral counts per gene per sample were used as the measure for protein expression.

WGCNA Network details:
We have used weighted coexpression network analysis (WGCNA) analysis using human protein expression data in development. The R WGCNA package was used to conduct the analysis 12,13 . The use of weighted networks represents an improvement over unweighted networks because it preserves continuous nature of the co-expression information and it is biologically robust with respect to parameter ß 14 . We excluded proteins that are not expressed (expression = 0) in at least 90% of the samples because such low-expressed features tend to reflect noise and correlations based on counts that are mostly zero are not really meaningful. The absolute value of the Pearson correlation coefficient is calculated for all pair-wise comparisons of protein expression values across all developmental tissue samples into a similarity matrix. We used blockwise network construction and module detection method where the clustering of a block will consists maximum of 20,000 proteins. A signed adjacency matrix was constructed using a "soft" power adjacency function a ij = |0.5 + 0.5 * cor(x i , x j )| β where the absolute value of the Pearson correlation measures protein the co-expression similarity, and a ij represents the resulting adjacency that measures the connection strengths. We have chosen the soft thresholding beta = 18 based on the scale-free topology 14 criterion ß for our analysis. Next, to compute modules, where the proteins have high "topological overlap", we compared connection strength between proteins in the network. The parameters for module detection used -were: minimum 30 proteins per module and a medium sensitivity deepsplit = 2 was applied to cluster splitting. The clustering of genes for modules used average linkage hierarchical clustering and modules are identified in the resulting dendrogram by the dynamic hybrid tree cut. Found modules are trimmed of genes whose correlation with module eigengene (KME) is less than a threshold defined by the function minKMEtoStay and for merging similar modules, we used 0.35 as a threshold. The connectivity of each node i is the sum of connections to other nodes.
For visualizing the protein co-expression network, Cytoscape network software v.2.8.3 was used. The node are represented by a circle and the edge between the nodes implies the co-expression weighted Pearson distance. The color of the node is representative of their membership to a phenotype.

Significant Test Analysis and Permutation Test
We have used Fisher's exact test (FET) for all count data and gene enrichment test with p-value < 0.05 (after Bonferroni multiple test correction) as the threshold for significance. To reveal the strength of enrichment association with the gene lists, we undertook a permutation test by randomly drawing equal numbers of genes and reanalyzing the data under the null-hypothesis. The random draw was conducted from a background that is appropriate for the test. With sufficient iterations (100,000 times), the resulting sets of p-values are presumed to be a reasonable approximation of the null distribution of the p-values.

Primer Name
Sequence (