A recurrent SHANK3 frameshift variant in Autism Spectrum Disorder

Autism Spectrum Disorder (ASD) is genetically complex, with ~100 copy number variants and genes involved. To establish more definitive genotype and phenotype correlations in ASD, we searched genome sequence data and the literature for recurrent, predicted-damaging sequence-level variants affecting single genes. We identified 18 individuals from 16 unrelated families carrying a heterozygous guanine duplication (c.3679dup; p.Ala1227Glyfs*69) occurring within a string of 8 guanines (genomic location [hg38] g.50,721,512dup) affecting SHANK3, a prototypical ASD gene (0.08% of ASD-affected individuals carried the predicted p.Ala1227Glyfs*69 frameshift variant). Most probands carried de novo mutations, but five individuals in three families inherited the variant from a parent with somatic mosaicism. We scrutinized the phenotype of p.Ala1227Glyfs*69 carriers, and while everyone (17/17) formally tested for ASD carried a diagnosis, there was variable expression of core ASD features both within and between families. Defining such recurrent mutational mechanisms underlying an ASD outcome is important for genetic counseling and early intervention.


INTRODUCTION
Autism Spectrum Disorder (ASD) is a heterogeneous condition, both in clinical presentation and in terms of the underlying etiology. Individuals with ASD are increasingly being seen in clinical genetics 1,2. More than 100 genetic disorders that can exhibit features of ASD (e.g., Fragile X, Phelan-McDermid, and Rett syndromes) 3, dozens of rare susceptibility genes (e.g., NLGN, NRXN, SHANK family genes), and copy number variation (CNV) loci (e.g., 1q21.1 duplication, 15q11-q13 duplication, 16p11.2 deletion) have been identified, which combined can facilitate a molecular diagnosis in ~5-40% of ASD cases [4][5][6][7]. The likelihood of a genetic finding in ASD depends on the complexity of the phenotype (e.g., idiopathic or syndromic, with or without intellectual disability) 8,9, the genomic technology used (e.g., microarrays, exome sequencing, genome sequencing, or combinations thereof) 10, and the annotation pipeline and "gene lists" used for interpretation 11,12. There are examples of how understanding the genetic subtypes of ASD can assist early identification, enabling earlier behavioral intervention and informing prognosis, medical management, and assessment of familial recurrence risk 13,14. Moreover, genomic data promise to facilitate pharmacologic-intervention trials through stratification based on pathway profiles 15,16. To support these applications, there is growing interest in performing robust genetic analyses, often in families and in unique populations, linked to deep phenotyping [17][18][19].
The largest datasets available for genotype/phenotype correlations in ASD studies are based on CNV assessment, since microarrays became the first-tier clinical diagnostic test 20,21. The most relevant finding from this vast literature is that even for recurrent CNVs (i.e., genomic disorders) involved in ASD, which typically affect the same genes, there is variable expression of phenotypes relevant to the core features of autism, and of other medical features [22][23][24][25]. More recently, genotype and phenotype studies of sequence-level variation (single-nucleotide variants, or SNVs, and insertion/deletion, or indel, events) affecting individual genes are starting to reveal clinical correlations in ASD. For example, loss-of-function variants in the SCN2A sodium channel gene impair glutamatergic neuronal excitability, leading to ASD and/or intellectual disability, while gain-of-function variants potentiate excitability, leading to infantile-onset seizure phenotypes 26. Different germline dominant-acting mutations in the phosphatase and tensin homolog (PTEN) gene found in ASD lead to an increased average head circumference in children 27. Loss-of-function variants in the chromodomain helicase DNA-binding protein 8 (CHD8) gene are also found in overgrowth and intellectual disability forms of ASD 28. Despite some progress in resolving genotype-phenotype correlations, the vast genetic complexity and variable expressivity of genes involved in ASD continue to confound most predictive studies.
Following a genotype-first approach, we initially searched available ASD-specific, controlled-access, genome-wide sequence databases, such as MSSNG (https://research.mss.ng) and the Simons Simplex Collection (SSC) (https://www.sfari.org/resource/sfaribase), as well as our own in-house data (available in the next MSSNG data release), to identify recurrent sequence-level damaging variants (de novo loss-of-function or missense variants predicted to be damaging based on the American College of Medical Genetics guidelines 29) affecting the same site (genomic location) in the same gene in different families. The database searches were then followed by a literature survey to identify additional individuals reported to have the same variant. In our most compelling finding, we identified a mutational 'hotspot' in a string of 8 Gs in exon 21 (p.Ala1227Glyfs*69) of the SHANK3 gene that was present in 17 individuals from 15 unrelated families with ASD, as well as in one individual with several autistic features and Phelan-McDermid Syndrome (but who was not tested for ASD). The individuals identified in both the ASD-specific databases and the published manuscripts had various details available describing the phenotype, which we have summarized. We were able to contact the families that are described for the first time in this paper to gather additional information. Using these available data, we assessed the intra- and inter-familial phenotypic variation (as well as all other genetic information) within these individuals and discuss the findings in the context of genotype-phenotype comparison, including variable expression of ASD core symptoms and related features.

RESULTS
Identification of the recurrent p.Ala1227Glyfs*69 variant
To achieve the most comprehensive genomic representation (difficult-to-sequence exons, splice site boundaries) for variant detection, we initially examined the Autism Speaks MSSNG whole-genome sequencing (WGS) cohort (https://research.mss.ng/), with 11,359 samples, including 5102 affected individuals and 3567 with family data, typically belonging to trios or quads (two parents and two affected children), for recurrent mutations. Secondly, we tested the Simons Simplex Collection (SSC) WGS collection (https://www.sfari.org/resource/simons-simplex-collection/), which comprises 9,205 samples, including 2419 affected individuals and 2393 with family data (typically two parents, one affected child, one unaffected child). Previous studies have extensively reported on MSSNG 6,17,30,31 and SSC 32,33. Probands from both cohorts met the criteria for ASD based on scores from standardized diagnostic tools, typically the Autism Diagnostic Observation Schedule (ADOS) 34 and the Autism Diagnostic Interview-Revised (ADI-R) 35, and/or the diagnosis was supported by clinical criteria. Many individuals were also assessed with standardized measures of intelligence (IQ), including verbal and nonverbal ability, language, social behavior, adaptive functioning, and physical measurements 6,32,33. All of these phenotype data are available from the respective databases.
From the genome sequences analyzed, our most interesting finding identified five probands in MSSNG (four males and one female) from four families and one proband in SSC (male) carrying a heterozygous guanine duplication in SHANK3 (NCBI: NM_033517.1; ENSEMBL: ENST00000262795.5; c.3679 or c.3676 depending on the transcript) ( Table 1; the reference sequence NM_033517.1 was selected as the appropriate transcript for this study as this was the reference sequence used in the original publication of this variant in Durand et al. 36 ). We also found other recurrent sequence-level de novo heterozygous damaging missense variants in the PTEN, CAMK2A, SPTAN1, MECP2, and CSNK1E genes, but in each of these instances no more than two unrelated individuals were found in the combined MSSNG and SSC data (Supplementary Material; Table S1).
The discovery of this recurrent guanine duplication variant in SHANK3 was confirmed using Sanger sequencing (Fig. 1). We then scanned the literature, including using Varicarta 37, and found that this same guanine duplication was reported in 12 probands affected by ASD 4,36,[38][39][40][41][42], and in one proband within the ASD borderline range who had Phelan-McDermid syndrome, significantly delayed language, and speech and visual-motor deficits 38. We carefully examined all genotypes and found that one literature case 40 was the same individual as in the SSC cohort (14470.p1); therefore, we removed this duplicate individual. Considering the new cases reported here and the cases reported in the literature, the p.Ala1227Glyfs*69 variant has been observed in a total of 18 cases from 16 families, identified using different genome-testing approaches (Table 2). Nearly all of these probands (17/18) were ascertained for ASD, although the general phenotype, as discussed below, varies somewhat among individuals (Table 3; Fig. 2). We also detected one female individual with ASD (with mild intellectual disability) carrying a de novo G deletion (leaving 7 Gs) at this same site (c.3679del; p.Ala1227Profs*57).

Genome annotation of the p.Ala1227Glyfs*69 variant
The SHANK3 guanine duplication is located within a segment of 8-G′s on chromosome 22q13 at genomic location [hg38] g.50,721,505dup or g.50,721,512dup, depending on the position that this variant is annotated in the guanines (Table 1; Fig. 2). Some tools annotate the first G as the duplication, and others annotate it as the final G (Supplementary Material; Fig. 3). The sequencing technology might also affect the variant annotation, with Sanger sequencing conventionally adding the G duplication at the 3′ end of the gene as the first point of amino acid change, and Next Generation Sequencing usually left aligning the variant. Independent of the position of the base insertion in the 8-Gs, the frameshift starting in exon 21 results in the new reading frame ending with a stop codon at position 69, causing a truncation lacking the C-terminal region (Fig. 3). We also confirmed that both exome sequencing and WGS reliably captured this 8-G string genomic segment in the short-read sequence (see Methods).
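The equivalence of the two annotations can be illustrated with a minimal sketch. It assumes a toy reference in which the eight guanines occupy positions 50,721,505 to 50,721,512, consistent with the two coordinates given above; the flanking bases and the helper names are hypothetical and chosen only for illustration, and this is not the normalization routine of any particular annotation tool.

```python
# Toy illustration: duplicating any G inside a homopolymer run yields the
# same alternate sequence, so left- and right-aligned annotations describe
# one and the same variant.
RUN_START = 50_721_505   # first G of the run (hg38 coordinate from the text)
RUN_LENGTH = 8           # length of the guanine run
LEFT_FLANK = "ACT"       # hypothetical flanking bases
RIGHT_FLANK = "CTA"      # hypothetical flanking bases

reference = LEFT_FLANK + "G" * RUN_LENGTH + RIGHT_FLANK
offset = RUN_START - len(LEFT_FLANK)  # genomic position of reference[0]


def duplicate_base(seq: str, genomic_pos: int) -> str:
    """Return seq with the base at genomic_pos duplicated."""
    i = genomic_pos - offset
    return seq[: i + 1] + seq[i] + seq[i + 1:]


def alignment_bounds(seq: str, genomic_pos: int) -> tuple:
    """Return the leftmost and rightmost equivalent positions of a
    single-base duplication by extending over identical neighbours."""
    i = genomic_pos - offset
    left = right = i
    while left > 0 and seq[left - 1] == seq[i]:
        left -= 1
    while right < len(seq) - 1 and seq[right + 1] == seq[i]:
        right += 1
    return left + offset, right + offset


run_positions = range(RUN_START, RUN_START + RUN_LENGTH)
alternates = {duplicate_base(reference, p) for p in run_positions}
bounds = alignment_bounds(reference, RUN_START + 3)
print(len(alternates), bounds)
```

Under these assumptions, all eight possible placements of the duplicated G produce a single alternate sequence, and the leftmost and rightmost equivalent positions correspond to the two genomic coordinates quoted in the text.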
Segregation and population frequency of the recurrent p.Ala1227Glyfs*69 variant
All the probands identified in this study carried de novo variants, with the exception of five individuals. One family with two brothers, first reported in the initial SHANK3 ASD-discovery paper 36, inherited the variant from their mother, who was found to be mosaic. Two siblings within the MSSNG cohort (MSSNG00342-003 and MSSNG00342-004) inherited the variant from their father, who was also shown to be mosaic (Table 2). In this latter case, the variant was only present in 8 of 50 reads in the father's WGS data and was verified using a TA cloning kit (Invitrogen, catalog number 45-0046). Proband 1-1047-003 also seems to have inherited the variant from his mother by somatic mosaicism; in her, the variant was present in 1 of 32 reads of the WGS data. Exome sequencing analysis was also performed in this mother, with the variant being observed in 2 of 110 reads. To search for additional potentially relevant somatic mutations 43, we tested the original alignment files in both cohorts using DeNovoGear's dng-call method for the SHANK3 locus 44, using 0.8 as the posterior probability of a de novo mutation (ppDNM) threshold, but we did not find any other candidates. Considering the families studied in MSSNG and SSC (our most trusted datasets), 6/7,521 (0.08%) ASD-affected individuals, from 5/6,681 (0.07%) families, carried the p.Ala1227Glyfs*69 variant. Fisher's exact test of the association between the frequency of heterozygous individuals in ASD cases and in control population databases gave a P value of 0.029.
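The frequencies and read fractions quoted above can be reproduced with a short calculation. The sketch below uses only the counts given in the text; the dictionary labels and the `percent` helper are illustrative names, not part of the study's pipeline. The Fisher's exact test is not reproduced because the control counts are not stated in this section.

```python
# Counts taken from the text above.
carriers, affected = 6, 7521            # carriers / ASD-affected individuals
carrier_families, families = 5, 6681    # families with a carrier / families

# Variant-supporting reads and total reads in the mosaic parents.
mosaic_reads = {
    "father of MSSNG00342 siblings (WGS)": (8, 50),
    "mother of 1-1047-003 (WGS)": (1, 32),
    "mother of 1-1047-003 (exome)": (2, 110),
}


def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator expressed as a percentage."""
    return 100 * numerator / denominator


individual_pct = percent(carriers, affected)
family_pct = percent(carrier_families, families)
allele_fractions = {
    label: alt / total for label, (alt, total) in mosaic_reads.items()
}

print(f"{individual_pct:.2f}% of individuals, {family_pct:.2f}% of families")
for label, fraction in allele_fractions.items():
    print(f"{label}: variant read fraction {fraction:.3f}")
```

A constitutional heterozygous variant is expected in roughly half of the reads, so read fractions of 0.16, 0.03, and 0.02 are consistent with the mosaic status described for these parents.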
Consequences of p.Ala1227Glyfs*69 on the SHANK3 protein
Nonsense mutations and frameshifts in SHANK3 can lead to reduced expression, and SHANK3-deficient neurons were found to have an altered phospho-proteome that may explain their decreased dendritic spine density 45. However, SHANK3 mRNA is still expressed in truncation mutant-containing induced pluripotent stem cells (iPSCs) 46, and truncated SHANK3 proteins may have dominant-negative effects in neurons 47,48. We therefore explored the consequences of p.Ala1227Glyfs*69 on the SHANK3 protein. We annotated the positions of amino acids to which the variant is mapped according to ENSEMBL and the UCSC genome browser. Using the DISOPRED3 predictor 49 and the consensus of eight predictors from MobiDB-lite 50, we identified where the mutation falls with respect to intrinsically disordered regions (IDRs) of the protein, which may influence protein folding and binding 51. With both predictors, the position of interest was found to be embedded within a large IDR, which maps to multiple isoforms (Fig. 3B). Mutations that create frameshifts and stop codons in this region of SHANK3 36,52 truncate two proline-rich binding sites for Homer and Cortactin (Fig. 3A) and affect function, including altering neuronal morphology in cell-based experiments 46,47. The SHANK3 protein serves as a scaffold to connect membrane receptors to the actin cytoskeleton in the postsynaptic density (PSD), a protein-rich sub-compartment considered to be a biomolecular condensate formed by phase separation 53,54 due to multivalent interactions 46. In each of the isoforms, these truncations are expected to impair canonical PSD formation and stability.
The variant isoforms were also analyzed using Feature Analysis of Intrinsically Disordered Regions (FAIDR), a tool that identifies the presence of consensus protein recognition motifs in IDRs 55,56, and using PScore 57, which predicts phase separation propensity via IDR planar pi-contacts (Fig. 3C; Supplementary Material; Fig. S2). A number of specific short linear interaction motifs were found to be altered. Of particular interest is the increase in SH3 domain class I-binding motifs, given that SHANK3 is known to interact with numerous SH3 domains. The variants significantly increase the number of arginine-glycine and arginine-arginine dipeptide instances, which are associated with mRNA binding and phase separation, and increase the cysteine content of the sequence. A reduction in SHANK3 protein due to the frameshift (e.g., through nonsense-mediated decay; discussed below) could also affect the phase separation of the PSD, which is known to be concentration dependent 58.

Reference genome
The annotation considers different reference genomes, the position of the duplication in the guanine string, and the annotation tool. The guanine duplication in each carrier in the main text of the paper is referred to as p.Ala1227Glyfs*69. 1 Alamut Visual version 2.15 (SOPHiA GENETICS, Lausanne, Switzerland). This tool annotates the final G as duplicated and provides the coding and protein change as well as ClinVar entries and general population frequencies ( Supplementary Fig. S3).
2 NM_001080420.1 record has been removed from NCBI. This RefSeq was permanently suppressed because currently there is insufficient support for the transcript and the protein. Exon 11 was based on ab initio prediction and is not supported by transcript data.
At the time of submission, the most recent RefSeq NM_01372044, which replaces and updates NM_033517, was not available in the Alamut software.
has not been formally diagnosed with ASD but has reported autism-associated phenotypes. We also detected one female individual with ASD carrying a de novo G deletion (7-G's) at this same site (c.3679del p.Ala1227Profs*57).
38 . has not been diagnosed with ASD but has some autism-associated phenotypes. Graphical representation of these phenotypes is presented in Fig. 2
1 and ALFA might be due to low-quality sequencing with the preliminary description being corrected in gnomAD v3. It is also noteworthy that ~1/100 people will have ASD, so it would be expected to find p.Ala1227Glyfs*69 variant carriers in control populations. Based on our findings described here they would likely have ASD, but additional studies will be required to further assess this. We have analyzed the genomic conservation of this variant with GERP 64, UCSC PhyloP, and phastCons for primates, placental mammals, and 100 vertebrates 65. GERP identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA but did not occur because the element has been under functional constraint. The p.Ala1227Glyfs*69 variant has a GERP score of 5.2 (p = 0), suggestive of having a large deleterious effect 66. The PhyloP score was 0.6 for primates, 1.35 for mammals, and 2.13 considering 100 vertebrates, suggesting high evolutionary conservation. The phastCons scores were also higher than 0.98 for primates, mammals, and vertebrates, which indicates strong negative selection on this variant.

Genotype and phenotype correlation
In all 17 p.Ala1227Glyfs*69 carriers evaluated for ASD, the diagnosis was confirmed by review of the ASD gold standard diagnostic tests available in the databases or as reported in the original manuscripts. The majority of participants described are reported to have an intellectual disability, defined as an IQ score below 70 with impairments in adaptive functioning, although the spectrum of severity is wide (Table 3; Fig. 2). Four individuals were ascertained for Phelan-McDermid Syndrome; three of these are among the 17 who received a formal ASD diagnosis, and one was never assessed for autism. Language deficits are also prevalent and often severe. We were cautious about making claims on other associated conditions, as they have not been universally and systematically ascertained. However, hypotonia and gait abnormalities are common, which is also consistent with animal model data 67. Seizures were reported in 3/18 participants. Other neurodevelopmental concerns include ADHD, anxiety, Developmental Coordination Disorder, and mood disorders. Gastrointestinal distress and sleep dysfunction were also reported. Last, both dysmorphia and other organ anomalies (conductive hearing loss and coronary artery fistula) were reported. Within pairs of siblings sharing a variant, there is a similarity of phenotype, with some variability in the severity of the intellectual disability.
Different de novo mutations in SHANK3 have also been associated with other developmental/neuropsychiatric disorders and genetic syndromes such as schizophrenia 47,68 and Phelan-McDermid Syndrome (PMS) 69 . The majority of children diagnosed with PMS also have ASD, and both conditions are often associated with intellectual and language delay, hypotonia, seizures, and sleep disorders, although children with PMS also often have other organ involvement. We also examined the whole genomes from the MSSNG and SSC p.Ala1227Glyfs*69 carriers and assessed for other clinically relevant variants that could be contributing to the varying phenotypic presentation, but none were identified. Additionally, no other clinically relevant variants were highlighted in those individuals described in the literature 36,38,[69][70][71] .
To evaluate whether common genetic variants may be contributing to the ASD phenotype in the p.Ala1227Glyfs*69 SHANK3 variant carriers, we calculated the ASD polygenic risk score (PRS) for all accessible individuals of European ancestry in MSSNG (db6) and SSC. PRS in the probands analyzed in this study varied between −1.167 and 15.606 (Table 2), showing no clear pattern between the presence of the clinically significant SHANK3 variant and the polygenic risk from common variants. PRS in all subjects with autism in MSSNG and SSC ranges between −18.580 and 20.626.

DISCUSSION
Our data indicate that 17/17 carriers (from 15 independent families) of the p.Ala1227Glyfs*69 variant affecting SHANK3 who have been formally tested carry a diagnosis of ASD. Our analysis did not identify any other obvious rare or common genetic variants, or combinations thereof, in the genomes of these individuals that could be contributing to the phenotypes reported in these individuals. Given the nature of neurobehavioral complexity, perhaps not surprisingly, there is phenotypic heterogeneity exhibited amongst p.Ala1227Glyfs*69 carriers, which is a hallmark of autism 72,73 , as well as other related brain disorders that may share overlapping clinical features and contributory susceptibility genes 74,75 . It is instructive for future "genotype-first" queries that the discovery of this recurrent p.Ala1227Glyfs*69 variant was missed in our early analyses. It was only detected here upon careful consideration of the different naming schemes of the various isoforms (and exons within them) in SHANK3, which also varied between different software tools, as well as the various genome builds being compared against (Table 1) 76,77 .
In addition, we searched for p.Ala1227Glyfs*69 SHANK3 variants in unpublished data from the SPARK cohort 41. Among 8744 ASD-affected individuals for whom sequencing data from both parents were available, the variant was detected in two male individuals, both de novo. The variant was also detected in three out of 13,156 ASD-affected individuals (two males and one female) for whom parental sequences were not available and thus inheritance could not be determined. In addition, from a private database we identified a female teenager with ASD who, based on the Vineland, would be described as severe, with severe language delay and severe global developmental delay. As highlighted on continuous measures of emotional difficulties (CBCL), she also presents with attention difficulties. This individual was not included in Table 3 since gold standard ASD measures were not available and this phenotype description is based on available assessments. We mention these data only to demonstrate that the variant is found in other collections, as would be expected, and we await the presentation of more detailed phenotype data from these participants.
Fig. 2 Phenotypic heterogeneity in individuals (X-axis) carrying the SHANK3 p.Ala1227Glyfs*69 variant reported in MSSNG 6, SSC 32,33, and in published papers 4,36,[38][39][40][41][42],71. Individuals in the same family are grouped within the black boxes. Gray spaces indicate the absence of the phenotype. White spaces indicate that the phenotype might not have been assessed in the proband. Phenotypic categories are described in Table 3. Individual S7 was not reported as being formally tested for ASD. *Caution is needed in the interpretation of these frequencies since some phenotypes were not assessed for some individuals.
Two independently created murine models with an insertion of a guanine nucleotide into the analogous mouse base pair position, which we refer to here as Shank3InsG3680, have also demonstrated changes in cellular, circuit, and behavioral phenotypes 67,78 (Supplementary Material; Table S2). Specifically, these Shank3InsG3680 mouse models demonstrated changes to baseline neurotransmission and/or impairments in long-term depression (LTD) and long-term potentiation (LTP), the synaptic basis of learning and memory. Overall, homozygous Shank3InsG3680 +/+ mice exhibited more significant changes than heterozygous Shank3InsG3680 mice, suggesting that one functioning normal Shank3 copy may be sufficient to support some of its function.
Regional differences in synaptic deficits and synaptic composition were observed, and the extent of the impact may have been modulated by other Shank family genes. In the adult hippocampus, expression of the reversible Shank3InsG3680 variant cassette 67 produced a truncated Shank3 protein and loss of the major high molecular weight isoforms at the synapse. This was associated with impaired hippocampal mGluR-dependent LTD, intact LTP, and changes to baseline NMDA receptor (NMDAR)-mediated synaptic function. In the striatum, Zhou et al. 78 showed a significant decrease in the levels of Shank3 mRNA in the Shank3InsG3680 strain compared with the wild type, suggesting a reduced level of mRNA through nonsense-mediated decay. This finding suggests that the InsG3680 variant results in a near-complete loss of SHANK3 protein, concomitant with synaptic transmission deficits in juvenile and adult homozygous mutant Shank3InsG3680 +/+ mice. Post-translational modifications, modulated by nitric oxide, were also found in both young and adult Shank3InsG3680 +/+ mice.
In assessments of general cognitive function, Shank3InsG3680 +/+ mice showed mild spatial learning impairments in the Morris Water Maze task and motor learning deficits in the accelerating rotarod task, while heterozygous mice did not 67. ASD-associated behaviors in these two models also showed mixed outcomes in both social interaction impairments and repetitive behaviors that, similar to human assessments, may be dependent on age and sex. Speed et al. 67 reported statistically different effects in some of their assessments when comparing male and female adult mice. This group did not observe social interaction deficits in the three-chamber task with mixed-sex adult mutant mice, nor did they observe repetitive behaviors, but instead suggested aversion to novel objects. However, in large all-male cohorts, Zhou et al. 78 showed deficits in social behaviors in both juvenile and adult mice. In addition, in adults there was increased anxiety, repetitive grooming behaviors, and sensory processing differences 78. On balance, the mouse data seem to generally recapitulate the learning impairments and behavioral differences seen in patients with the p.Ala1227Glyfs*69 SHANK3 variant.
Table 1). C Normalized impact of the variant for the three isoforms using FAIDR, a tool that identifies physical features and the presence of consensus protein recognition motifs in intrinsically disordered protein regions 56. (*Note that SCD, sequence charge decoration, a measure of charge patterning associated with phase separation, has values significantly above 2: 5.4, 7.0, and 10.2 for the three isoforms.)
Highly penetrant alleles such as p.Ala1227Glyfs*69 in neurodevelopmental disorders are under severe negative selection and are constantly being removed from the population 79,80 . However, recurrent mutations are always being added to the gene pool and while typically occurring randomly, the intrinsic 81 and extrinsic characteristics 82 may also have an influence 83 . Experimental investigations have shown that guanine bases can be targets for oxidative damage in DNA, while mutability in other bases is more variable 84 . Moreover, the locus under study is within 8 guanines, which constitutes a homopolymer run (HR). HRs are sequences with six or more identical nucleotides and are associated with >10-fold enrichment of mutation compared to the genomic average 85 . It is noteworthy that there are three other G homopolymer runs in SHANK3, but no recurrent variants were found at these sites.
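The definition of a homopolymer run used above (six or more identical nucleotides) is simple to apply to a sequence. The following is a minimal sketch using a regular expression; the toy sequence is hypothetical and is not the SHANK3 sequence, and the function name is chosen only for illustration.

```python
import re

MIN_RUN = 6  # threshold from the text: six or more identical nucleotides


def homopolymer_runs(seq: str, min_len: int = MIN_RUN) -> list:
    """Return (start, length, base) for every run of at least min_len
    identical bases; start is a 0-based index into seq."""
    # One captured base followed by at least (min_len - 1) repeats of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), len(m.group(0)), m.group(1))
        for m in pattern.finditer(seq.upper())
    ]


# Hypothetical toy sequence: an 8-G run, a 5-T run (too short), a 6-A run.
toy = "ACGT" + "G" * 8 + "CA" + "T" * 5 + "AGC" + "A" * 6 + "C"
runs = homopolymer_runs(toy)
print(runs)
```

In this toy input the 8-G and 6-A stretches are reported, while the 5-T stretch falls below the threshold and is ignored.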
The CpG content of DNA has also been shown to influence the mutation rate in non-CpG-containing sequences, suggesting that intrinsic properties of DNA sequences may be more important than the chromosomal environment in determining mutation rates and genome integrity. Evidence indicates that because of the propensity for methyl-CpGs to deaminate and produce mismatches, it is plausible that error-prone repair mechanisms may have a role in hypermutability. CpG methylation might also have epigenetic effects by promoting chromatin states that make DNA more susceptible to mutations 86 .
Although exceedingly rare (0.075% frequency in the ASD families studied by WGS), the finding that this p.Ala1227Glyfs*69 variant in SHANK3 is, so far, concordant with an ASD diagnosis, and that it will surely continue to recur sporadically in the population, has important implications for genetic counseling. It will also be important to continue to search for the p.Ala1227Glyfs*69 variant in SHANK3 to see if it confers risk in other disorders, including perhaps under a multiple-variant model 87. Defining a specific mutational mechanism underlying an ASD outcome may also focus strategies for the development of therapeutic interventions.

Genome sequence analysis
We searched ASD-specific genomic databases, in which the participants had a diagnosis of ASD upon recruitment, for damaging de novo sequence-level variants affecting exactly the same genomic location in different families. A variant was defined as damaging if it caused loss of function (stop gain, frameshift, or canonical splice site disruption) or was a predicted deleterious missense variant based on American College of Medical Genetics guidelines 29. Initially, we examined rare (frequency less than 0.001 in gnomAD and 1000 Genomes) de novo variants identified from MSSNG data release DB6 (release date June 24, 2020), which were detected as previously described 6. After identifying this recurrent variant in SHANK3, we then searched our in-house databases and performed literature searches for the same variant. Ethical review of these cohort studies was approved by institutional review boards and included accessing datasets through applications to Data Access Committees.
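The search criteria described above can be summarized as a filter followed by a grouping step. The sketch below is only an illustration of that logic under stated assumptions: the `Variant` record, the consequence labels, and the toy records are all hypothetical, and the real analysis ran on the MSSNG and SSC variant calls rather than on a structure like this one.

```python
from collections import defaultdict
from dataclasses import dataclass

MAX_POPULATION_FREQ = 0.001  # rarity threshold from the text
# Hypothetical consequence labels standing in for the annotation terms.
LOF_CONSEQUENCES = {"stop_gained", "frameshift", "splice_site"}


@dataclass(frozen=True)
class Variant:
    family: str
    gene: str
    position: int            # genomic coordinate after consistent alignment
    consequence: str
    population_freq: float
    de_novo: bool
    damaging_missense: bool = False  # predicted deleterious missense


def is_damaging(v: Variant) -> bool:
    """Loss-of-function, or missense predicted to be deleterious."""
    if v.consequence in LOF_CONSEQUENCES:
        return True
    return v.consequence == "missense" and v.damaging_missense


def recurrent_sites(variants, min_families: int = 2) -> dict:
    """Group rare, damaging de novo variants by (gene, position) and keep
    sites seen in at least min_families different families."""
    families_by_site = defaultdict(set)
    for v in variants:
        if v.de_novo and v.population_freq < MAX_POPULATION_FREQ and is_damaging(v):
            families_by_site[(v.gene, v.position)].add(v.family)
    return {
        site: fams
        for site, fams in families_by_site.items()
        if len(fams) >= min_families
    }


# Entirely made-up toy records.
toy_variants = [
    Variant("F1", "GENE_A", 100, "frameshift", 0.0, True),
    Variant("F2", "GENE_A", 100, "frameshift", 0.0, True),
    Variant("F3", "GENE_B", 200, "missense", 0.0, True, damaging_missense=False),
    Variant("F4", "GENE_A", 100, "frameshift", 0.01, True),   # too common
    Variant("F5", "GENE_C", 300, "stop_gained", 0.0, False),  # inherited
]
hits = recurrent_sites(toy_variants)
print(hits)
```

Grouping by exact position only works if every caller reports the variant at the same coordinate, so indels in repetitive sequence need to be aligned consistently before this step.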

Phenotyping measures
Phenotypic data were extracted either from the original manuscripts, in which case we attempted to stay close to the original descriptions, or from the reference databases. In the latter case, a clinical diagnosis of autism spectrum disorder was reported in the databases and was supported by ADI-R/ADOS. Intellectual disability was reported as a clinical diagnosis, and in most cases formal IQ testing was available for confirmation. Language delay was available as a clinical diagnosis, often with characterizations such as "minimally verbal" or "nonverbal", and in many cases formal language measure scores were available for review. Information on psychiatric/neurological comorbidities was extracted from the original manuscripts, or was available as a clinician diagnosis or clinical concern based on available continuous measures of such symptomatology (e.g., CBCL, RCADS).

Confirming representation of exon 21 in exome and WGS datasets
Given the high GC content of SHANK3, which can influence exon capture and sequencing 52, we thought it was critical when assessing mutational frequency to confirm that there were no biases in read coverage at the site of the target variant within exon 21 (Supplementary Material; Fig. 1). Using whole-exome sequences from 298 patients and 462 controls from our internal dataset, for which we used the Agilent SureSelect Clinical Research Exome V1 for exome sequence analysis, we show that the coverage around the G duplication region is at the anticipated 120x (Supplementary Material; Fig. 1). This analysis also indicates that diagnostic exome sequencing will more than adequately capture and accurately genotype this position. WGS analysis of probands from MSSNG and SSC also confirms that exon 21 in SHANK3 is uniformly covered.

Protein and evolutionary conservation analysis
We used the DISOPRED3 predictor 49

Polygenic risk score analysis (PRS)
PRS was calculated for all individuals of European ancestry in MSSNG (db6) and SSC, merged with the 1000 Genomes European population, using GWAS summary statistics derived from the iPSYCH Autism project, including 13,076 cases and 22,664 controls from Denmark 88. This included probands MSSNG00342-003, MSSNG00342-004, 1-1047-003, 2-1774-003, and 14470.p1. A total of 25,837 SNPs were included in the PRS calculation. Since proband 7-0527-003 was part of a later version of the MSSNG cohort (db7), he was not included in the initial PRS calculation. This individual's PRS was calculated separately with his parents (7-0527-001 and 7-0527-002) using the same 25,837 SNPs included in the PRS calculations for the others and was centered by the mean of the whole MSSNG/SSC/1000 Genomes European population. However, of the 25,837 SNPs, 1496 were missing due to sample quality in this family, and caution is needed in comparison with the other subjects. The approach for interpretation of the PRS data was based on previous studies 18,88,89.
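As a rough illustration of the calculation described above, a PRS can be written as a weighted sum of allele dosages, centered by the mean score of a reference population. The sketch below makes several assumptions that are not taken from the study: SNP identifiers, effect sizes, and dosages are invented, missing genotypes are simply skipped, and no scaling beyond mean-centering is applied. It is not the pipeline used for the scores in Table 2.

```python
from statistics import mean
from typing import Dict, Optional, Tuple


def raw_prs(
    dosages: Dict[str, Optional[float]], weights: Dict[str, float]
) -> Tuple[float, int]:
    """Sum effect size x allele dosage over the scored SNPs.
    Returns the raw score and the number of SNPs with no genotype."""
    score = 0.0
    missing = 0
    for snp, beta in weights.items():
        dosage = dosages.get(snp)
        if dosage is None:      # SNP absent or not genotyped in this sample
            missing += 1
            continue
        score += beta * dosage
    return score, missing


# Hypothetical GWAS effect sizes for three SNPs.
weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.125}

# Hypothetical reference population used for centering.
reference = [
    {"rs_a": 1, "rs_b": 0, "rs_c": 2},
    {"rs_a": 0, "rs_b": 2, "rs_c": 0},
]
reference_mean = mean(raw_prs(sample, weights)[0] for sample in reference)

# Hypothetical proband with one missing genotype.
proband = {"rs_a": 2, "rs_b": 1, "rs_c": None}
score, n_missing = raw_prs(proband, weights)
centered = score - reference_mean
print(score, n_missing, centered)
```

Returning the count of missing SNPs alongside the score makes it easy to flag samples, such as the family with 1496 missing SNPs, whose scores should be compared with the others cautiously.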

Study recruitment
This study has complied with all relevant ethical regulations including obtaining informed consent from all participants and was approved by the Research Ethics Board at The Hospital for Sick Children.

DATA AVAILABILITY
Access to the whole-genome sequence and phenotype information from MSSNG and SSC can be obtained by completing data access agreements (https://research.mss.ng and https://www.sfari.org/resource/sfari-base, respectively), as was done for this study. These two well-established and stable whole-genome sequence and phenotype resources are utilized by approved investigators worldwide. The 1000 Genomes sequencing data are publicly available via Amazon Web Services (https://docs.opendata.aws/1000genomes/readme.html). Access to data through other publications or resources is described in the main text and is outlined in Table 2. The whole-genome sequence for 7-0527-003 will be available in the MSSNG database in its next release but can be requested in advance by contacting the corresponding author. The relevant variant information from the exome or direct Sanger sequencing data for the individuals for whom whole-genome sequencing data do not exist and who are described for the first time in this paper (HNDS_0130-01; 1505221080) is found in Table 2. Additional data can also be requested by contacting the corresponding author.