De novo genic mutations among a Chinese autism spectrum disorder cohort

Recurrent de novo (DN) and likely gene-disruptive (LGD) mutations contribute significantly to autism spectrum disorders (ASDs) but have been primarily investigated in European cohorts. Here, we sequence 189 risk genes in 1,543 Chinese ASD probands (1,045 from trios). We report an 11-fold increase in the odds of DN LGD mutations compared with expectation under an exome-wide neutral model of mutation. In aggregate, ∼4% of ASD patients carry a DN mutation in one of just 29 autism risk genes. The most prevalent gene for recurrent DN mutations is SCN2A (1.1% of patients) followed by CHD8, DSCAM, MECP2, POGZ, WDFY3 and ASH1L. We identify novel DN LGD recurrences (GIGYF2, MYT1L, CUL3, DOCK8 and ZNF292) and DN mutations in previous ASD candidates (ARHGAP32, NCOR1, PHIP, STXBP1, CDKL5 and SHANK1). Phenotypic follow-up confirms potential subtypes and highlights how large global cohorts might be leveraged to prove the pathogenic significance of individually rare mutations.

A utism spectrum disorders (ASDs) are characterized by impairments in social communication and restricted or repetitive behaviours or interests. ASD has a strong but complex genetic component and is typified by a male bias 1 . Although the prevalence of ASD in the United States has increased to 1 in 68 children 2 , the prevalence worldwide has been estimated as closer to 1% (ref. 3). Large-scale whole-genome copy number variation (CNV) studies [4][5][6][7] , whole-exome sequencing studies [8][9][10][11][12][13] and candidate resequencing studies 14,15 have established the importance of de novo (DN) germline mutations in ASD. Studies of DN mutations have led to the discovery of dozens of new candidate ASD risk genes in primarily European cohorts. However, no study has explored on a large scale the spectrum of DN mutations in these ASD risk genes in the Chinese population. While the frequency of DN mutation should typically not differ among population groups, examples of ethnic predilection (for example, 17q21.31 recurrent mutations) due to genomic architecture, sequence composition or patient recruitment biases may exist [16][17][18] .
In this study, we apply single-molecule molecular inversion probes (smMIPs) 19 to rapidly and cost-effectively sequence the coding regions of 189 autism risk genes across a large cohort of 1,543 autism probands (1,045 from trios) from the Autism Clinical and Genetic Resources in China (ACGC). Several candidate genes with recurrent DN likely gene-disruptive (LGD) mutations are established by this study, implicating several novel autism risk genes in the Chinese population. Wherever possible, we worked with the five ACGC coordinating centres to investigate patterns of inheritance and phenotypic features of patients identified with DN mutations. We compare these data to determine whether previously observed genespecific comorbidities distinguish etiological subtypes of autism in the Chinese population. Our study helps to identify global risk genes providing evidence and motivation for further functional and translational studies of specific ASD risk genes, which may guide future personalized treatments.

Results
Resequencing of 189 genes in a large Chinese ASD cohort. We selected autism risk genes for targeted sequencing mainly based on the frequency and severity of DN mutations from previously published exome sequencing studies 12,13 under the hypothesis that DN mutations in genes contributing to autism pathology will not differ significantly between global populations. These included DN mutation calls from exome sequencing of autism families-primarily the Simons Simplex Collection (SSC) and the Autism Sequencing Consortium (ASC). Candidate genes were ranked according to the presence of the following criteria: (1) recurrent DN LGD events, (2) one LGD event and more than one DN missense mutation, (3) recurrent DN missense mutations and CNV disruption among cases of ASD and developmental delay (DD) 3,5,6,20 , (4) CNV disruption among cases of ASD and DD with potential functional relevance in the brain (yet no DN single-nucleotide variants in exome sequencing data from ASD probands) or (5) multiple DN events among intellectual disability (ID) probands 21,22 . We also chose to include several known ASD risk genes. We excluded genes that showed a high tolerance to severe mutations in American populations based on exome sequencing data from 6,500 unaffected individuals (NHLBI Exome Sequencing Project). The final gene set consisted of 213 candidate genes (Supplementary Data 1).
Molecular inversion probes (MIPs) were designed using an improved algorithm, which incorporates a single-molecule (sm) tag into the degenerate region within the MIP backbone 19 . The final designs for our 213 candidate genes included 11,394 smMIPs distributed into three pools after optimization (see the 'Methods' section and Supplementary Data 2). For the purpose of this study, we assembled one of the largest Chinese ethnic cohorts of ASD. The ACGC aims to collect the clinical and genetic resources of more than 10,000 ASD families (trios/quads) over the next 5 years. All the patients in this study were born in China and were diagnosed according to DSM-IV (American Psychiatric Association, 2000) criteria (Supplementary Table 1). In total, 1,086 ASD proband-parent trios (one affected offspring and two unaffected parents) and 561 autistic children (most with one parent sample available) from the ACGC were selected for this study. The geographical distribution of the families is shown ( Fig. 1 and Supplementary Table 2) based on the birth origin of the patients. Only probands (n ¼ 1,647; male:female ratio ¼ 6:1) were targeted for initial sequencing.
In total, 1,543 probands (1,045 from trios) passed MIP capture and other QC measures (see the 'Methods' section and Supplementary Fig. 1 and Supplementary Data 3) and 189 genes passed QC measures (see the 'Methods' section and Supplementary Fig. 2 and Supplementary Data 2). Further analysis of this data set was based only on those genes and samples that passed these QC measures. We discovered 4,226 rare variants (see the 'Methods' section) predicted to alter the amino acid sequence or gene splicing. We selected LGD (nonsense, frameshift and splice site) mutations (n ¼ 120, of which 92 were found among the 1,045 trio probands) and missense mutations with a combined annotation dependent depletion (CADD) score of greater than 30 (MIS30; n ¼ 216, of which 179 were found among the 1,045 trio probands) for validation. In total, we validated 321 putative severe mutations (115 LGD and 206 MIS30) by Sanger sequencing with an overall validation rate of 95.5%. Where parental DNA were available, we also assessed inheritance by Sanger sequencing of variants (Supplementary Data 4).
Inheritance and mutation analysis. Among the 1,045 trios, we discovered 43 DN mutations in 29 of the 189 candidate genes tested, of which 35 were LGD events (Supplementary Data 4). We calculated the overall probability of detecting 35 or more DN LGD events in our panel of 189 genes using a probabilistic model derived from human-chimpanzee differences and an expected rate of 1.75 DN mutations per exome 14,23 as P ¼ 1.98 Â 10 À 24 (binomial test). This observation corresponds to an odds ratio (OR) of 11.1 (95% confidence interval (CI) ¼ 7.7-15.4) compared with null expectations strongly supporting the enrichment of ASD candidates in our 189 gene set. We repeated this analysis removing the 19 known ASD/ID genes (Supplementary Data 1) and observed a reduced but still highly significant signal (P ¼ 1.17 Â 10 À 5 ) corresponding to an odds ratio of 4.  Table 3; Supplementary Fig. 3 shows the distribution of CADD scores among the 29 genes). From the events that validated in the proband by Sanger sequencing (429/522; 82.2%), we identified 11 (2.8%) new DN missense mutations from samples where parental DNA were available (n ¼ 386). This is in contrast to the MIS30 events where 21.6% (8/37) of mutations were DN. While CADD cutoffs are certainly an imperfect definition of pathogenicity, such thresholds significantly increase the odds that a DN and likely diseasecausing mutation will be found (CADD430, OR ¼ 8.77, P ¼ 7.83 Â 10 À 5 , Fisher's exact test). Among the low CADD DN missense events, 5 of the 11 were associated with our top two genes (SCN2A and CHD8). We identified four additional missense DN mutations (CADD r30) in SCN2A, which increased the total number to 12 DN mutations (seven LGD and five missense; Fig. 2). We also identified one additional DN missense mutation in CHD8 (for a total of three LGD mutations and one missense mutation; Fig. 2). DN mutations in SCN2A, thus, account for approximately 1.1% of probands in our Chinese cohort.

Comparison of DN LGD events between populations.
To determine whether the top DN ASD risk genes differ between European and Chinese populations, we extracted the regions for the 189 genes targeted in this study from exome sequencing data from 2,508 SSC families (1,911 quads and 597 trios) and 1,445 ASC trio families (Supplementary Data 6). We compared the total DN LGD mutation rates for these 189 genes between Chinese and European probands (1,892 SSC probands where both parents were self-reported in the Caucasian ethnic group); no significant difference was observed (P ¼ 0. 15 Notably, the rate of SCN2A DN LGD mutations appears to be increased in our study (7/1,045; 0.7%) compared with published exome sequencing studies of ASD families, specifically the SSC and ASC (4/3,953; 0.1%). This difference was nominally significant (P ¼ 2.6 Â 10 À 3 , two-tailed Fisher's exact test, OR ¼ 6.7, 95% CI ¼ 1.7-31.1) but did not withstand multiple testing correction for all 189 candidate genes. Compared with published MIP-based studies of individuals with ID/DD (3/3,387 or 0.09% of individuals carried an SCN2A LGD mutation) 20 , our data still appear to trend toward increased numbers of individuals carrying DN LGD SCN2A mutations (P ¼ 2.4 Â 10 À 3 ). Because DN and private LGD mutations in candidate risk genes are individually rare (Supplementary Fig. 4 and Supplementary Data 9), our current study is underpowered to robustly detect differences in mutation frequencies between affected populations at the single-gene level. However, this nominally significant finding is of interest for potential validation in future studies of larger sample cohorts (Supplementary Discussion).
DN mutation recurrence among Chinese ASD probands. As part of this study design, we selected candidate genes with as few as one DN LGD mutation in samples of primarily European descent for resequencing in Chinese samples with the aim of identifying additional mutations, thus establishing DN LGD recurrence for these candidate ASD risk genes. We identified DN LGD mutations in CUL3, DOCK8, GIGYF2, MYT1L and ZNF292 (Table 2), establishing recurrence for these loci where only one LGD event had previously been identified in the SSC and ASC.
LGD, de novo likely gene-disruptive mutation; MIS430, de novo missense mutations with CADD score greater than 30; MISr30, de novo missense mutations with CADD score less than or equal to 30.
Nominal P values and q values were from DN mutation simulation, the probabilistic framework developed previously 14,23 . The combined data set represents data from 4,998 autism patients.   The identification of two DN LGD mutations in the combined sample size of the ACGC and SSC is unlikely 9 . Of these five genes, MYT1L shows no evidence of rare LGD mutations in a large population control (ExAC; n ¼ 45,376), while two genes had a small number of rare LGD mutations in this control population (GIGYF2 had nine individuals carrying rare LGD mutations at three sites and CUL3 had four individuals carrying private LGD variants). None of these three genes showed evidence of common LGD variation in the ExAC cohort (Supplementary Data 7 and 11). In addition, we also integrated our DN results from the ACGC with DN CNVs identified from SSC and Autism Genome Project (AGP) families. From these candidate CNVs, we identified a DN LGD mutation in both ARHGAP32 and NCOR1 (Table 2, Fig. 3). Finally, we identified one DN LGD mutation in each of four genes-SHANK1, PHIP, STXBP1 and CDKL5-of which, SHANK1, STXBP1 and CDKL5 are already known to be associated with syndromic and non-syndromic forms of ASD (Table 2). To further investigate the potential impact of rare inherited LGD variants, we compared our rare LGD event rates with those from the ExAC cohort (using the subset of 45,376 neuropsychiatric disease-free individuals). We combined these data with CNV data from 29,085 children with ID/DD and 19,584 controls using a joint probability model previously described in Coe et al. 20 . Briefly, rates of LGD mutations and deletion CNVs between proband and control cohorts are combined using a statistical model that assumes array and MIP patients are independently assayed. Under this model, we examined 57 autosomal genes with at least one LGD mutation and sufficient sequence coverage in both exome and MIP samples. We identified a significant enrichment (qo0.05) for LGD events for 10 targets (Supplementary Data 7), including three genes with only a single LGD event identified in our study. While these 10 genes are good candidates, most CNVs are large and encompass many genes; in addition, the presence of overlapping large CNVs and DN LGD events does not confer the same specificity as recurrent LGD mutations.
Because there are 498 probands with only one or no parental sample(s) available, inheritance for variants in this subset cannot be determined. Several LGD mutations in these samples have a high probability of being DN considering the low-predicted mutability and evidence of DN mutations in these genes in European ASD cohorts. In addition, our integrated CNV and LGD burden analysis highlight genes with an increased likelihood of haploinsufficiency in ID/DD/ASD that do not reach significance in this cohort by analysis of DN single-nucleotide variants and indels alone. These variants include a frameshift mutation in GIGYF2 in proband M15067, a stop-gain mutation in STXBP1 in proband M17663, and a frameshift mutation in SYNGAP1 in proband M19759. We also identified three LGD mutations in RIMS1 with undetermined inheritance. Counts of all LGD and missense mutations by gene are provided (Supplementary Data 8).
Clinical characterization of genotypes and phenotypes. We attempted to recontact all patients with DN mutations identified in this study to collect detailed phenotypic information. In the end, 20 patients (B46.5% of patients with DN mutations) were successfully recontacted and detailed phenotypes were recorded (Table 3). We found that ID (15/16), gastrointestinal (GI) disturbance (9/20), and macrocephaly (5/15) are some of the most common comorbid conditions among these patients.
Although the number of individual patients for any gene is few, several of the phenotypes identified during patient recontact were reminiscent of those reported in the literature. The patients with the CHD8 DN mutation, for example, show ID (2/2), macrocephaly (2/2), tall stature (2/2), high BMI (2/2), mild regression (2/2), sleep problems (1/2), GI disturbance (1/2), attention deficit (2/2) and anxiety (1/2). These findings are consistent with the clinical and genetic subtype previously reported for this gene 24 . Similarly, patients with the SCN2A DN mutation show other comorbid conditions, including ID (3/3) and seizures (2/3) consistent with the observation that SCN2A DN mutations have been identified among ID and epileptic encephalopathy patient cohorts 21,22,25 . Among our patients with ADNP and DYRK1A DN mutations, we noted congenital cardiac defects. In addition, the ADNP proband also has attention problems, and the DYRK1A  proband reported febrile seizures in infancy, hypertonia, neonatal feeding problems and sleep disturbances. All of these phenotypes have been described before in association with DN mutation of these genes 26,27 . Several other novel comorbidities associated with CHD8, ADNP, POGZ, ARHGAP32, GRIN2B, ASH1L, NCKAP1, CHD2 and DSCAM are also recorded (Table 3). Larger cohorts of patients with DN mutations in these genes will be required to assess the significance of these observations.

Discussion
We have performed an investigation of DN mutations among ASD candidate risk genes in a Chinese ASD patient cohort. Among the 1,045 trios tested, we discovered 43 DN mutations in 29 of the 189 candidate genes queried by our MIP-based approach. The most frequently DN mutated gene in our study was SCN2A where we identified seven novel DN LGD mutations and five DN missense mutations. DN mutations in SCN2A accounted for B1.1% of ASD patients in our Chinese cohort. This rate was higher than exome-and MIP-sequenced ASD cohorts of European descent. This observation is not likely explained by differences in DN mutation rates among global populations or paternal age but rather by ascertainment and technology biases. For example, published head-to-head comparisons of exome and MIP sequencing technologies on the SSC cohort have shown that MIPs have greater sensitivity for variant detection 14 , which could also account for these increased rates of DN variation in our MIP study compared with exome sequencing studies.
Many sequencing studies of ASD and ID/DD patients have highlighted the extensive overlap among mutated genes between these comorbid conditions 4,21,22,28 . Indeed, it has been estimated that 440% of patients with ASD may suffer some form of cognitive impairment 29 . Even among some of the most rigorously ascertained autism cohorts, such as the SSC, it estimated that 30% of SSC patients also have ID 30 . Although all of the ACGC patients met strict autism (DSM-IV) criteria, IQ data was not routinely collected for the majority of patients in this study. It is possible that some of the most severely affected individuals may have been initially recruited from the referring centres. Our patient followup for specific mutations generally supports this hypothesis. For example, four of the SCN2A families were available for recontact and three of these families had available IQ data that showed impaired cognition for the proband. Further, one of these families also reported severe language impairment adding credence to our hypothesis that this cohort, although defined as autistic by the DSM-IV, is enriched for patients who are also cognitively impaired 31 . This is consistent with studies of autism families in the SSC where three of the original four patients identified with DN SCN2A mutations were also intellectually impaired (IQo70) 12 .
The ability to recontact patients with DN mutations in this study was important both for confirming previously published genotype-phenotype correlations as well as describing potentially new genetic subtypes of ASD. For SCN2A, in addition to ID in three of these patients, two of the four families that could be recontacted also reported seizures; SCN2A has been previously implicated in epilepsy 32 . We also identified three patients with DN CHD8 LGD mutations. Two of them were recontacted and the phenotypes were similar to those previously reported 24 . Similarly, mutations in ADNP are reportedly linked to a shared clinical phenotype, including ID, facial dysmorphism and congenital heart defects 27 . The ADNP proband in this study had severe ID, a congenital heart defect, attention problems, mild regression and differently sized eyes. The patients with the DN DYRK1A LGD mutation also presented similar phenotypes as described previously 26 , such as ID, febrile seizures in infancy, hypertonia, neonatal feeding problems, cardiac defects at birth and sleep disturbances. Microdeletions of SHANK1 have been previously identified in ASD patients 33 . In addition to an ASD diagnosis, our SHANK1 patient also presented with ID, typical regression (after 3-4 years old) and an abnormal gait.
We also report here clinical phenotypes for anecdotal cases of DN LGD mutations in WDFY3, GIGYF2, MED13L, NCKAP1, 16

Febrile seizures infancy
Febrile seizures infancy À þ À À þ À À À À À C-section À À À þ þ þ À À þ À Premature birth À À À þ À À À À À À Neonatal feeding problems þ À À À À À À À À À Childhood feeding problems À À À À À À À À À NA ARTICLE ARHGAP32, DSCAM and MYT1L, which serves as a useful starting point for further phenotype-genotype correlations. In addition, we also noted specific comorbidities in several patients that have not been previously described. The patient with a MYT1L DN mutation was diagnosed with a motor tic disorder and showed an unusual gait. The patient with a MED13L DN mutation has hyperopia and a clear dysmorphia typified by mandibular protrusion. The patient with a WDFY3 DN LGD mutation has ID and macrocephaly and had severe GI disturbances up to 4 years of age, as well as hyperactivity, sleep problems, attention problems and anxiety. The patient with a GIGYF2 DN LGD mutation also has mild ID and macrocephaly. His mother reported an obsession with food but that he has no sense of his appetite being satiated in addition to problems with attention and anxiety. We note that MED13L haploinsufficiency is well established 34,35 ; others, such as DSCAM and GIGYF2, are emerging high-impact risk genes from exome and CNV studies 23,36,37 . For example, Akshoomoff and colleagues recently highlighted the neuronassociated GTPase activating protein (ARHGAP32) as the best candidate gene for ASD features associated with Jacobsen syndrome based on overlapping deletions that narrowed the critical region to include this locus among three other genes 38 . To our knowledge, we discovered the first DN LGD mutation in ARHGAP32 in a patient with ASD although one DN missense variant has been previously reported 13 . Similarly, the discovery of a patient with a DN LGD mutation in NCOR1 (Table 3), now in addition to published DN missense mutations in ASD 12 and DD 28 cohorts, is exciting in light of the known interaction of the Rett syndrome gene (MECP2) with members of this nuclear receptor co-repressor complex 39,40 .
This study confirms the importance of DN mutations in ASD and further shows that candidate risk genes originally identified primarily from European ASD cohorts are highly relevant in other populations, such as the Chinese cohort described here. The large number of DN events identified in SCN2A (both LGD and missense) confirm the significance of DN missense variation. There will likely be risk genes where both LGD and missense mutations contribute to disease etiology, perhaps in the case of SCN2A. Other risk genes may predominantly be represented by DN LGD events (for example, CHD8 and ADNP) or missense events. Larger numbers of samples will need to be screened to identify the diversity of risk genes that contribute to disease variation. Detailed clinical follow-up will continue to be essential in this genotype-first model, particularly for risk genes where both DN LGD and missense signals significantly associate with disease phenotypes, to determine whether different classes of mutations have different biological effects (for example, haploinsufficient versus dominant-negative effects) 41 . These efforts will be very beneficial for the early disease warning and diagnosis and, most importantly, early intervention while also providing the motivation for further functional and translational studies of disease risk genes, which will aid in future personalized treatments. Unlike rare and common variants that are frequently skewed in their population distribution, our study raises the exciting possibility that extremely large autism cohorts may in the near future be amassed across the continents to prove the pathogenicity of autism disease genes in cohorts of hundreds of thousands of patients.

Methods
Human subjects. All the subjects who participated in this study completed informed consent before the original sample collection. The DNA samples were extracted from the peripheral blood of patients and parents if available. Probands were diagnosed with ASD based on DSM-IV criteria, and families were excluded if the proband did not meet these diagnostic criteria. Detailed DSM-IV diagnostic criteria are described in Supplementary Table 1. Autism-related single-gene disorders, such as fragile X syndrome, tuberous sclerosis complex and phenylketonuria, were excluded where possible. While patients originate from across China, the majority of samples came from three provinces: namely, Shandong, Hunan and Guangdong (see Supplementary Table 2 for sample distribution). The recontacted patients underwent a thorough assessment, including a detailed review of medical history and comprehensive physical and neurocognitive phenotype. Instruments such as the ABC (autism behaviour checklist) and SRS (social responsiveness scale) were applied where possible. This study was approved by the Institutional Review Board (IRB) of the State Key Laboratory of Medical Genetics, School of Life Sciences at Central South University, Changsha, Hunan, China and adhered to the tenets of the Declaration of Helsinki. In addition to approval by the local IRBs, the study has been reviewed and is compliant with the Chinese Ministry of Science and Technology (MOST) for the Review and Approval of Human Genetic Resources (Approval number: HGR-2016-10-55:556).
MIPs and pooling. MIPs were designed using MIPgen with an updated scoring algorithm 42 . MIPs were composed of a common linker, which is 30 base pairs (bp), flanked by extension and ligation arms ranging from 15 to 30 bp (the total MIP length is between 75 and 80 bp). Each MIP with unique arms will target a specific genomic region and set to a total fixed length of 162 bp. Oligonucleotides were ordered from Integrated DNA Technologies (IDT, Coralville, IA, USA; MIPs listed in Supplementary Data 2). Five degenerate bases were added between the common linker and the extension arm, which allow a non-duplicate coverage of 4 5 ¼ 1,024 as the theoretical maximum. In cases of polymorphisms that may interfere with capture, two MIPs were designed to capture either haplotype. MIPs were designed against the GRCh37 human genome reference using dbSNP138. MIPs were pooled together by gene. For initial testing (1X pool), probes were combined at equal molar concentrations and phosphorylated. After initial testing, probes that performed poorly were repooled and phosphorylated with increased molar ratios at 10X or 50X to rescue the capture coverage. The final pools (1X, 10X and 50X) were combined as a working pool. For the purpose of sequencing, we divided the smMIPs into three pools; we tested and rebalanced each pool independently using 16 unaffected (HapMap) samples as controls. We distinguished the three pools by name as SKLMG-PL1 (63 genes with 3,557 MIPs total), SKLMG-PL2 (71 genes with 3,773 MIPs total) and SKLMG-PL3 (55 genes with 3,563 MIPs total) (Supplementary Data 2).
Multiplex capture and amplification. One hundred nanograms of genomic DNA was hybridized with 1X Ampligase buffer (Epicentre, Madison, WI, USA), 0.32 mM dNTPs, 0.5 Â of Hemo KlenTaq (0.32 ml; New England Biolabs, Inc., Ipswich, MA, USA), one unit of Ampligase (Epicentre) and MIPs in one 25 ml reaction. Gap filling and ligation were also performed in this reaction. The amount of MIPs needed was based on the 1X pool concentration on a ratio of 800 MIP copies to each haploid genome copy. The reactions were incubated at 95°C for 10 min and then 60°C for 22 h. 2 ml of exonuclease mix were used to degrade linear DNA for incubation at 37°C for 45 min then 95°C for 2 min. Amplification of the captured DNA was performed as previous reported. Five microlitres of B96 different barcoded libraries were pooled and purified with 0.9X AMPure XP beads (Beckman Coulter, Brea, CA, USA) according to standard protocol. Hundred microlitres of 1X EB (Qiagen, Valencia, CA, USA) were used to resuspend the libraries. Then gel visualization with 2% nondenaturing polyacrylamide gel and quantification with the Qubit dsDNA HS Assay (Life Technologies, Grand Island, NY, USA) was performed according to the manufacturer's protocol. Approximately 192 individuals were combined as the final megapools from multiple libraries for sequencing on one lane of Illumina HiSeq 2000 and 101 bp paired-end reads were generated according to the manufacturer's protocol.
MIP sequencing data analysis. Sequencing reads were analysed as described previously. For primary single-nucleotide variation, indels and target coverage analysis, initial 101 bp paired-end reads were trimmed to 81 bp before mapping. Sequences were mapped to hg19 using BWA-MEM. Sample-tag indices were made using the 5 bp degenerated sequence, which was removed from the beginning of read2 and added to the index barcode. For QC, read-pairs with incorrect pairs and insert sizes were removed after mapping, leaving only the reads with the highest quality scores. We applied FreeBayes for variant calling, and variants were filtered based on read depth 48 and read quality 420 in each megapool. Then all variants were filtered against common variants using dbSNP138. We applied a frequency filter of 'allele count (AC)o3' (o0.13%) to the ACGC data set, and annotated variants on the isoform of each gene (Supplementary Data 10) using SeattleSeq138 and the NCBI 37/hg19 reference. The same frequency filters and annotation were used for the ExAC data set (45,376 individuals included, only neuropsychiatric disease-free individuals used). Individuals with o75% of the target (greater than eightfold coverage) (Supplementary Fig. 1) and genes where less than 30% of the individuals were sufficiently sampled were removed from further analysis ( Supplementary Fig. 2).
Variant validation. Variants were validated with PCR and Sanger sequencing. Primers were designed using BatchPrimer3 by uploading sequence files in FASTA format according to the user manual with the optimal PCR product size set to 300 bp. For picked primers, we performed BLAT and in silico PCR on the UCSC Genome Browser to avoid multiple hits. PCR was performed with a standard programme in 25 ml reaction volume. The PCR products were purified with 0.9X AMPure XP beads and the product was visualized on a gel before sending to Sanger sequencing. To verify the DN status, PCR was also performed on parental DNA, if available.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.