Introduction

Autism Spectrum Disorder (ASD) is a heterogeneous condition, both in clinical presentation and in terms of the underlying etiology. Individuals with ASD are increasingly being seen in clinical genetics1,2. More than 100 genetic disorders that can exhibit features of ASD (e.g., Fragile X, Phelan-McDermid, and Rett syndromes)3, dozens of rare susceptibility genes (e.g., NLGN, NRXN, SHANK family genes), and copy number variation (CNV) loci (e.g., 1q21.1 duplication, 15q11-q13 duplication, 16p11.2 deletion) have been identified, which combined can facilitate a molecular diagnosis in ~5–40% of ASD cases4,5,6,7. The likelihood of a genetic finding in ASD is dependent on the complexity of the phenotype (e.g., idiopathic or syndromic, with or without intellectual disability)8,9, the genomic technology used (e.g., microarrays, exome sequencing, genome sequencing, or combinations thereof)10, as well as the annotation pipeline and “gene lists” used for interpretation11,12.

There are examples of how understanding the genetic subtypes of ASD can assist early identification, enabling earlier behavioral intervention, and can inform prognosis, medical management, and assessment of familial recurrence risk13,14. Moreover, genomic data promise to facilitate pharmacologic-intervention trials through stratification based on pathway profiles15,16. To support these applications, there is growing interest in performing robust genetic analyses, often in families and in unique populations, linked to deep phenotyping17,18,19.

The largest datasets available for genotype/phenotype correlations in ASD studies are based on CNV assessment, since microarrays became the first-tier clinical diagnostic test20,21. The most relevant finding from this vast literature is that even for recurrent CNVs (i.e., genomic disorders) involved in ASD, which typically affect the same genes, there is variable expression of phenotypes relevant to the core features of autism and of other medical features22,23,24,25.

More recently, genotype and phenotype studies of sequence-level variation (single-nucleotide variants, or SNVs, and insertion/deletion, or indel, events) affecting individual genes are starting to reveal clinical correlations in ASD. For example, loss-of-function variants in the SCN2A sodium channel gene impair glutamatergic neuronal excitability, leading to ASD and/or intellectual disability, while gain-of-function variants potentiate excitability, leading to infantile-onset seizure phenotypes26. Different germline dominant-acting mutations in the phosphatase and tensin homolog (PTEN) gene found in ASD lead to an increased average head circumference in children27. Loss-of-function variants in the chromodomain helicase DNA-binding protein 8 (CHD8) gene are also found in overgrowth and intellectual disability forms of ASD28. Despite some progress in resolving genotype-phenotype correlations, the vast genetic complexity and variable expressivity of genes involved in ASD continue to confound most predictive studies.

Following a genotype-first approach, here we initially searched available ASD-specific, controlled-access, genome-wide sequence databases, such as MSSNG (https://research.mss.ng) and the Simons Simplex Collection (SSC) (https://www.sfari.org/resource/sfari-base), as well as our own in-house data (available in the next MSSNG data release), to identify recurrent sequence-level damaging variants (de novo loss-of-function or missense variants predicted to be damaging based on the American College of Medical Genetics guidelines29) affecting the same site (genomic location) in the same gene in different families. The database searches were then followed by a literature survey to identify additional individuals reported to have the same variant. In our most compelling finding, we identified a mutational ‘hotspot’ in a string of 8 Gs in exon 21 (p.Ala1227Glyfs*69) of the SHANK3 gene that was present in 17 individuals from 15 unrelated families with ASD, as well as one individual with several autistic features and Phelan-McDermid Syndrome (but who was not tested for ASD). The individuals identified in both the ASD-specific databases and the published manuscripts had varying levels of phenotype detail available, which we have summarized. We were able to contact the families that are described for the first time in this paper to gather additional information. Using these available data, we assessed the intra- and inter-familial phenotypic variation (as well as all other genetic information) within these individuals and discuss the findings in the context of genotype-phenotype comparison, including variable expression of ASD core symptoms and related features.

Results

Identification of the recurrent p.Ala1227Glyfs*69 variant

To achieve the most comprehensive genomic representation (including difficult-to-sequence exons and splice site boundaries) for variant detection, we initially searched for recurrent mutations in the Autism Speaks MSSNG whole-genome sequencing (WGS) cohort (https://research.mss.ng/), which has 11,359 samples, including 5102 affected individuals and 3567 with family data, typically belonging to trios or quads (two parents and two affected children). Secondly, we tested the Simons Simplex Collection (SSC) WGS collection (https://www.sfari.org/resource/simons-simplex-collection/), which comprises 9,205 samples, including 2419 affected individuals and 2393 with family data (typically two parents, one affected child, one unaffected child). Previous studies have extensively reported on MSSNG6,17,30,31 and SSC32,33. Probands from both cohorts met the criteria for ASD based on scores from standardized diagnostic tools, typically the Autism Diagnostic Observation Schedule (ADOS)34 and the Autism Diagnostic Interview–Revised (ADI-R)35, and/or the diagnosis was supported by clinical criteria. Many individuals were also assessed with standardized measures of intelligence (IQ), including verbal and nonverbal ability, language, social behavior, adaptive functioning, and physical measurements6,32,33. All of these phenotype data are available from the respective databases.

From the genome sequences analyzed, our most interesting finding identified five probands in MSSNG (four males and one female) from four families and one proband in SSC (male) carrying a heterozygous guanine duplication in SHANK3 (NCBI: NM_033517.1; ENSEMBL: ENST00000262795.5; c.3679 or c.3676 depending on the transcript) (Table 1; the reference sequence NM_033517.1 was selected as the appropriate transcript for this study as this was the reference sequence used in the original publication of this variant in Durand et al.36). We also found other recurrent sequence-level de novo heterozygous damaging missense variants in the PTEN, CAMK2A, SPTAN1, MECP2, and CSNK1E genes, but in each of these instances no more than two unrelated individuals were found in the combined MSSNG and SSC data (Supplementary Material; Table S1).

Table 1 Genome annotation of the SHANK3 guanine duplication (rs797044936).

The discovery of this recurrent guanine duplication variant in SHANK3 was confirmed using Sanger sequencing (Fig. 1). We then scanned the literature, including using Varicarta37, and found that this same guanine duplication was reported in 12 probands affected by ASD4,36,38,39,40,41,42, and in one proband within the ASD borderline range who had Phelan-McDermid syndrome, significantly delayed language, and speech and visual-motor deficits38. We carefully examined all genotypes and found that one was the same individual as in the SSC cohort (14470.p1)40; therefore, we removed this duplicate individual. Considering the new cases reported here and the cases reported in the literature, the p.Ala1227Glyfs*69 variant has been observed in a total of 18 cases from 16 families, identified using different genome-testing approaches (Table 2). Nearly all of these probands (17/18) were ascertained for ASD, although the general phenotype, as discussed below, varies somewhat among individuals (Table 3; Fig. 2). We also detected one female individual with ASD (with mild intellectual disability) carrying a de novo G deletion (leaving 7 Gs) at this same site (c.3679del p.Ala1227Profs*57).

Fig. 1: Pedigrees of MSSNG families reported for the first time in this study and their Sanger sequencing confirmation.

A Pedigree MSSNG00342; B Pedigree 1-1047 (the unaffected sibling was tested by targeted Sanger sequencing but was not whole-genome sequenced); C Pedigree 2-1774 (unaffected sibling sample was not available); D Pedigree 7-0574 (will be available in MSSNG DB7). Gray shapes indicate individuals who have an ASD diagnosis and carry the SHANK3 variant.

Table 2 ASD probands identified in MSSNG, SSC, and other publications containing the p.Ala1227Glyfs*69 SHANK3 variant.
Table 3 Phenotype of ASD probands identified in MSSNG, SSC, and other publications containing the SHANK3 p.Ala1227Glyfs*69 variant. Thicker lines box individuals from the same family.
Fig. 2: Phenotypic heterogeneity in individuals (X-axis) carrying the SHANK3 p.Ala1227Glyfs*69 variant reported in the MSSNG6, SSC32,33, and in published papers4,36,38,39,40,41,42,71.

Individuals in the same family are grouped within the black boxes. Gray spaces indicate the absence of the phenotype. White spaces indicate that the phenotype might not have been assessed in the proband. Phenotypic categories are described in Table 3. Individual S7 was not reported as having been formally tested for ASD. *Caution is needed in the interpretation of these frequencies since some phenotypes were not assessed for some individuals.

Genome annotation of the p.Ala1227Glyfs*69 variant

The SHANK3 guanine duplication is located within a segment of 8 Gs on chromosome 22q13 at genomic location [hg38]g.50,721,505dup or g.50,721,512dup, depending on where within the guanine run the variant is annotated (Table 1; Fig. 2). Some tools annotate the first G as the duplicated base, and others annotate the final G (Supplementary Material; Fig. 3). The sequencing technology might also affect the variant annotation, with Sanger sequencing conventionally placing the G duplication at the most 3′ position, as the first point of amino acid change, and next-generation sequencing usually left-aligning the variant. Independent of the position of the base insertion in the 8 Gs, the frameshift starting in exon 21 results in a new reading frame ending with a stop codon at position 69, producing a truncated protein lacking the C-terminal region (Fig. 3). We also confirmed that both exome sequencing and WGS reliably captured this 8-G string genomic segment in the short-read sequence (see Methods).
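The annotation ambiguity described above can be illustrated with a minimal sketch. It is not part of the original analysis: the flanking bases, the function names, and the 0-based "slot" convention are assumptions made for illustration. It shows that inserting a G anywhere within or at the edges of a run of 8 Gs yields the same sequence, so the left-aligned and right-aligned "dup" positions differ by 7 bases, as do g.50,721,505dup and g.50,721,512dup.

```python
def equivalent_insertion_slots(ref: str, slot: int, base: str) -> tuple[int, int]:
    """Return (leftmost, rightmost) insertion slots giving the same sequence.

    A slot s means "insert `base` immediately before ref[s]" (0-based).
    """
    left = slot
    while left > 0 and ref[left - 1] == base:
        left -= 1
    right = slot
    while right < len(ref) and ref[right] == base:
        right += 1
    return left, right


def insert_base(ref: str, slot: int, base: str) -> str:
    """Return `ref` with `base` inserted immediately before ref[slot]."""
    return ref[:slot] + base + ref[slot:]


# Toy reference: hypothetical flanks around a run of 8 Gs (indices 3..10).
REF = "ACT" + "G" * 8 + "CTA"

left, right = equivalent_insertion_slots(REF, 6, "G")
products = {insert_base(REF, s, "G") for s in range(left, right + 1)}

# In "dup" notation the duplicated base is the first G of the run
# (left-aligned) or the last G of the run (right-aligned).
first_dup_index, last_dup_index = left, right - 1
print(left, right, len(products), last_dup_index - first_dup_index)
# prints: 3 11 1 7
```

Any tool comparison that matches variants by exact coordinate would treat these equivalent representations as different variants unless they are first normalized to a common alignment.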

Fig. 3: Impact of the SHANK3 p.Ala1227Glyfs*69 variant on the protein.

A (top left) Guanine string containing 8 Gs found in non-affected individuals; (top right) Guanine string containing nine Gs found in ASD-affected individuals and parents with somatic mutations; (bottom) Location of the frequent guanine duplication in the SHANK3 gene. ANK ankyrin repeats, SH3 SRC homology 3 domain, PDZ postsynaptic density 95/Discs large/zona occludens, HBS homer binding site, CBS cortactin binding site, SAM sterile alpha motif domain. B Alignment of wild type protein sequences, for each of three highly expressed splice isoforms, to the protein sequence of the variant around the position of the mutation (note: in this figure the first transcript presented is ENST00000262795.5 and the protein change for this is p.Ala1226Glyfs*69, as shown in Table 1). C Normalized impact of the variant for the three isoforms using FAIDR, a tool that identifies physical features and the presence of consensus protein recognition motifs in intrinsically disordered protein regions56. (*Note that SCD, sequence charge decoration, a measure of charge patterning associated with phase separation, has values significantly above 2: 5.4, 7.0, and 10.2 for the three isoforms.)

Segregation and population frequency of the recurrent p.Ala1227Glyfs*69 variant

All the probands identified in this study carried de novo variants, with the exception of five individuals. One family with two brothers, first reported in the initial SHANK3 ASD-discovery paper36, inherited the variant from their mother, who was found to be mosaic. Two siblings within the MSSNG cohort (MSSNG00342-003 and MSSNG00342-004) inherited the variant from their father, who was also shown to be mosaic (Table 2). In this latter case, the variant was present in only 8 of 50 reads in the father’s WGS data and was verified using a TA cloning kit (Invitrogen cat number 45-0046). Proband 1-1047-003 also seems to have inherited the variant from his mother, who appears to be a somatic mosaic: the variant was present in 1 of 32 reads of her WGS data. Exome sequencing analysis was also performed in this mother, with the variant being observed in 2 of 110 reads. To search for additional potentially relevant somatic mutations43, we tested the original alignment files in both cohorts using DeNovoGear’s dng-call method for the SHANK3 locus44, using 0.8 as the posterior probability of a de novo mutation (ppDNM) threshold, but we did not find any other candidates. Considering the families studied in MSSNG and SSC (our most trusted datasets), 6/7,521 (0.08%) ASD-affected individuals, from 5/6,681 (0.07%) families, carried the p.Ala1227Glyfs*69 variant. The Fisher’s exact test of the association between the frequency in heterozygous individuals in ASD cases and control population databases has a P value of 0.029.
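The read fractions and carrier frequencies quoted above can be reproduced with a short arithmetic sketch. The counts are taken from the text; the variable names, labels, and the helper function are illustrative assumptions. The control counts used for the Fisher's exact test are not given in this section, so that test is not reproduced here.

```python
def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


# Variant-supporting reads / total reads in the mosaic parents (from the text).
mosaic_reads = {
    "father of MSSNG00342 siblings (WGS)": (8, 50),
    "mother of 1-1047-003 (WGS)": (1, 32),
    "mother of 1-1047-003 (exome)": (2, 110),
}
mosaic_fraction = {label: percent(alt, total) for label, (alt, total) in mosaic_reads.items()}

# Carrier frequencies in the combined MSSNG and SSC data (from the text).
carrier_individuals_pct = percent(6, 7521)
carrier_families_pct = percent(5, 6681)

for label, pct in mosaic_fraction.items():
    # A germline heterozygous variant is expected near 50% of reads;
    # much lower fractions are consistent with mosaicism.
    print(f"{label}: {pct:.1f}% of reads")
print(f"individuals: {carrier_individuals_pct:.2f}%  families: {carrier_families_pct:.2f}%")
```

All three parental read fractions fall well below the roughly 50% expected for a constitutional heterozygous variant, which is the basis for describing these parents as mosaic.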

Consequences of p.Ala1227Glyfs*69 on the SHANK3 protein

Nonsense mutations and frameshifts in SHANK3 can lead to reduced expression, and SHANK3-deficient neurons were found to have an altered phospho-proteome that may explain their decreased dendritic spine density45. However, SHANK3 mRNA is still expressed in truncation mutant-containing induced pluripotent stem cells (iPSCs)46, and truncated SHANK3 proteins may have dominant-negative effects in neurons47,48. We therefore explored the consequences of p.Ala1227Glyfs*69 on the SHANK3 protein. We annotated the positions of amino acids to which the variant is mapped according to ENSEMBL and the UCSC genome browser. Using the DISOPRED3 predictor49 and the consensus of eight predictors from MobiDB-lite50, we identified where the mutation falls with respect to intrinsically disordered regions (IDRs) of the protein, which may influence protein folding and binding51. With both predictors, the position of interest was found to be embedded within a large IDR, which maps to multiple isoforms (Fig. 3B). Mutations that create frameshifts and stop codons in this region of SHANK336,52 truncate two proline-rich binding sites for Homer and Cortactin (Fig. 3A) and affect function, including altering neuronal morphology in cell-based experiments46,47. The SHANK3 protein serves as a scaffold to connect membrane receptors to the actin cytoskeleton in the postsynaptic density (PSD), a protein-rich sub-compartment considered to be a biomolecular condensate formed by phase separation53,54 due to multivalent interactions46. In each of the isoforms, these truncations are expected to impair canonical PSD formation and stability.

The variant isoforms were also analyzed using Feature Analysis of Intrinsically Disordered Regions (FAIDR), a tool that identifies the presence of consensus protein recognition motifs in IDRs55,56, and using PScore57, which predicts phase separation propensity via IDR planar pi-contacts (Fig. 3C; Supplementary Material; Fig. S2). A number of specific short linear interaction motifs were found to be altered. Of particular interest is the increase in SH3 domain class I-binding motifs, given that SHANK3 is known to interact with numerous SH3 domains. The variants significantly increase the number of arginine-glycine and arginine-arginine dipeptide instances, which are associated with mRNA binding and phase separation, and increase the cysteine content of the sequence. A reduction in SHANK3 protein due to the frameshift (e.g., through nonsense-mediated decay; discussed below) could also affect the phase separation of the PSD, which is known to be concentration dependent58.

p.Ala1227Glyfs*69 as a pathogenic variant

The p.Ala1227Glyfs*69 variant is classified in ClinVar as “Pathogenic for ASD, NDD, and others” and is exceptionally rare or absent in control populations (ClinVar; https://www.ncbi.nlm.nih.gov/clinvar/variation/208759/). In the gnomAD v2.1.1 dataset59, which uses hg37 as the reference genome, it has an allele frequency of 16/160,994 alleles = 0.000099 (0.0099%). In ALFA60, this variant is also reported in 0.02% of European control samples. However, this variant is not present in gnomAD v3, the 1000 Genomes Project (using hg38 as the reference genome), TOPMed61, two unpublished pediatric control cohorts from our group (INOVA and CHILD), the Personal Genome Project Canada62, or the Medical Genome Reference Bank63. In combination, this suggests that the presence of the variant in gnomAD v2.1.1 and ALFA might be due to low-quality sequencing, with the preliminary description being corrected in gnomAD v3. It is also noteworthy that ~1/100 people will have ASD, so it would be expected to find p.Ala1227Glyfs*69 variant carriers in control populations. Based on our findings described here, such carriers would likely have ASD, but additional studies will be required to further assess this.
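As a worked check of the gnomAD v2.1.1 figure quoted above, the following sketch recomputes the allele frequency from the stated allele count and allele number. The approximate carrier frequency is an added illustration that assumes every observed allele sits in a different heterozygous individual; that assumption is not stated in the source.

```python
# Counts quoted in the text for gnomAD v2.1.1.
allele_count = 16
allele_number = 160_994

allele_frequency = allele_count / allele_number
allele_frequency_pct = 100.0 * allele_frequency

# Assumption for illustration: each individual contributes two alleles and
# every carrier is heterozygous, so carriers / individuals = AC / (AN / 2).
approx_carrier_frequency_pct = 100.0 * allele_count / (allele_number / 2)

print(f"allele frequency: {allele_frequency:.6f} ({allele_frequency_pct:.4f}%)")
print(f"approximate carrier frequency: {approx_carrier_frequency_pct:.3f}%")
```

Under that assumption the carrier frequency is simply twice the allele frequency, which is the quantity to compare against per-individual carrier rates in case cohorts.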

We analyzed the genomic conservation of this variant with GERP64, UCSC PhyloP, and phastCons for primates, placental mammals, and 100 vertebrates65. GERP identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA but did not occur because the element has been under functional constraint. The p.Ala1227Glyfs*69 variant has a GERP score of 5.2 (p = 0), suggestive of a large deleterious effect66. The PhyloP score was 0.6 for primates, 1.35 for mammals, and 2.13 considering 100 vertebrates, suggesting high evolutionary conservation. The phastCons scores were also higher than 0.98 for primates, mammals, and vertebrates, which indicates strong negative selection at this site.

Genotype and phenotype correlation

In all 17 p.Ala1227Glyfs*69 carriers evaluated for ASD, ASD was confirmed by review of the ASD gold standard diagnostic tests available in the databases or as reported in the original manuscripts, and the majority of participants described are reported to have an intellectual disability, defined as an IQ score below 70 and impairments in adaptive functioning, although the spectrum of severity is wide (Table 3; Fig. 2). Four individuals were ascertained for Phelan-McDermid Syndrome; three of these are among the 17 who received a formal ASD diagnosis, and one was never assessed for autism. Language deficits are also prevalent and often severe. We were cautious about making claims on other associated conditions as they have not been universally and systematically ascertained. However, hypotonia and gait abnormalities are common, which is also consistent with animal model data67. Seizures were reported in 3/18 participants. Other neurodevelopmental concerns include ADHD, anxiety, Developmental Coordination Disorder, and mood disorders. Gastrointestinal distress and sleep dysfunction were also reported. Last, both dysmorphia and other organ anomalies were reported (conductive hearing loss and coronary artery fistula). Within pairs of siblings sharing a variant, there is a similarity of phenotype, with some variability in the severity of the intellectual disability.

Different de novo mutations in SHANK3 have also been associated with other developmental/neuropsychiatric disorders and genetic syndromes such as schizophrenia47,68 and Phelan-McDermid Syndrome (PMS)69. The majority of children diagnosed with PMS also have ASD, and both conditions are often associated with intellectual and language delay, hypotonia, seizures, and sleep disorders, although children with PMS also often have other organ involvement. We also examined the whole genomes from the MSSNG and SSC p.Ala1227Glyfs*69 carriers and assessed for other clinically relevant variants that could be contributing to the varying phenotypic presentation, but none were identified. Additionally, no other clinically relevant variants were highlighted in those individuals described in the literature36,38,69,70,71.

To evaluate whether common genetic variants may be contributing to the ASD phenotype in the p.Ala1227Glyfs*69 SHANK3 variant carriers, we calculated the ASD polygenic risk score (PRS) for all accessible individuals of European ancestry in MSSNG (db6) and SSC. PRS in the probands analyzed in this study varied between −1.167 and 15.606 (Table 2), showing no clear pattern between the presence of the clinically significant SHANK3 variant and the polygenic risk from common variants. PRS in all subjects with autism in MSSNG and SSC ranges between −18.580 and 20.626.

Discussion

Our data indicate that 17/17 carriers (from 15 independent families) of the p.Ala1227Glyfs*69 variant affecting SHANK3 who have been formally tested carry a diagnosis of ASD. Our analysis did not identify any other obvious rare or common genetic variants, or combinations thereof, in the genomes of these individuals that could be contributing to their reported phenotypes. Given the nature of neurobehavioral complexity, it is perhaps not surprising that there is phenotypic heterogeneity among p.Ala1227Glyfs*69 carriers, which is a hallmark of autism72,73, as well as of other related brain disorders that may share overlapping clinical features and contributory susceptibility genes74,75. It is instructive for future “genotype-first” queries that this recurrent p.Ala1227Glyfs*69 variant was missed in our early analyses. It was only detected here upon careful consideration of the different naming schemes of the various isoforms (and exons within them) in SHANK3, which also varied between different software tools, as well as the various genome builds being compared against (Table 1)76,77.

In addition, we searched for p.Ala1227Glyfs*69 SHANK3 variants in unpublished data from the SPARK cohort41. Among 8744 ASD-affected individuals for whom sequencing data from both parents were available, the variant was detected in two male individuals, both de novo. The variant was also detected in three out of 13,156 ASD-affected individuals (two males and one female) for whom parental sequences were not available and thus inheritance could not be determined. In addition, from a private database we identified a female teen with ASD whose presentation, based on the Vineland, would be described as severe, with severe language delay and severe global developmental delay. As highlighted on continuous measures of emotional difficulties (CBCL), she also presents with attention difficulties. This individual was not included in Table 3 since gold standard ASD measures were not available and this phenotype description is based on the available assessments. We mention these data only to demonstrate that the variant is found in other collections, as would be expected, and await the presentation of more detailed phenotype data from these participants.

Two independently created murine models with an insertion of a guanine nucleotide into the analogous mouse base pair position, which we refer to here as Shank3InsG3680, have also demonstrated changes in cellular, circuit, and behavioral phenotypes67,78 (Supplementary Material; Table S2). Specifically, these Shank3InsG3680 mouse models demonstrated changes to baseline neurotransmission and/or impairments in long-term depression (LTD) and long-term potentiation (LTP), the synaptic basis of learning and memory. Overall, homozygous Shank3InsG3680 +/+ mice exhibited more significant changes than heterozygous Shank3InsG3680 mice, suggesting that one normal Shank3 copy may be sufficient to support some of its function.

Regional differences in synaptic deficits and synaptic composition were observed, and the extent of the impact may have been modulated by other Shank family genes. In the adult hippocampus, expression of the reversible Shank3InsG3680 variant cassette67 produced a truncated Shank3 protein and loss of the major high molecular weight isoforms at the synapse. This was associated with impaired hippocampal mGluR-dependent LTD, intact LTP, and changes to baseline NMDA receptor (NMDAR)-mediated synaptic function. In the striatum, Zhou et al.78 showed a significant decrease in the levels of Shank3 mRNA in the Shank3InsG3680 strain compared with the wild type, suggesting a reduced level of mRNA through nonsense-mediated decay. This finding suggests that the InsG3680 variant results in a near-complete loss of SHANK3 protein, concomitant with synaptic transmission deficits in juvenile and adult homozygous mutant Shank3InsG3680 +/+ mice. Post-translational modifications, modulated by nitric oxide, were also found in both young and adult Shank3InsG3680 +/+ mice.

In assessments of general cognitive function, Shank3InsG3680 +/+ mice showed mild spatial learning impairments in the Morris Water Maze task and motor learning deficits in the accelerating rotarod task, while heterozygous mice did not67. ASD-associated behaviors in these two models also showed mixed outcomes in both social interaction impairments and repetitive behaviors that, similar to human assessments, may be dependent on age and sex. Speed et al.67 reported statistically different effects between male and female adult mice in some of their assessments. This group did not observe social interaction deficits in the three-chamber task with mixed-sex adult mutant mice, nor did they observe repetitive behaviors, but instead suggested aversion to novel objects. However, in large all-male cohorts, Zhou et al.78 showed deficits in social behaviors in both juvenile and adult mice. In addition, in adults there was increased anxiety, repetitive grooming behaviors, and sensory processing differences78. On balance, the mouse data seem to generally recapitulate the learning impairments and behavioral differences seen in patients with the p.Ala1227Glyfs*69 SHANK3 variant.

Highly penetrant alleles such as p.Ala1227Glyfs*69 in neurodevelopmental disorders are under severe negative selection and are constantly being removed from the population79,80. However, recurrent mutations are always being added to the gene pool and while typically occurring randomly, the intrinsic81 and extrinsic characteristics82 may also have an influence83. Experimental investigations have shown that guanine bases can be targets for oxidative damage in DNA, while mutability in other bases is more variable84. Moreover, the locus under study is within 8 guanines, which constitutes a homopolymer run (HR). HRs are sequences with six or more identical nucleotides and are associated with >10-fold enrichment of mutation compared to the genomic average85. It is noteworthy that there are three other G homopolymer runs in SHANK3, but no recurrent variants were found at these sites.

The CpG content of DNA has also been shown to influence the mutation rate in non-CpG-containing sequences, suggesting that intrinsic properties of DNA sequences may be more important than the chromosomal environment in determining mutation rates and genome integrity. Evidence indicates that because of the propensity for methyl-CpGs to deaminate and produce mismatches, it is plausible that error-prone repair mechanisms may have a role in hypermutability. CpG methylation might also have epigenetic effects by promoting chromatin states that make DNA more susceptible to mutations86.

Although exceedingly rare (0.075% frequency in the ASD families studied by WGS), the finding that this p.Ala1227Glyfs*69 variant in SHANK3 is, so far, concordant with an ASD diagnosis, and that it will surely continue to sporadically recur in the population, has important implications for genetic counseling. It will also be important to continue to search for the p.Ala1227Glyfs*69 variant in SHANK3 to see if it confers risk in other disorders, including perhaps under a multiple-variant model87. Defining a specific mutational mechanism underlying an ASD outcome may also focus strategies for the development of therapeutic interventions.

Methods

Genome sequence analysis

We searched ASD-specific genomic databases, in which the participants had a diagnosis of ASD upon recruitment, for damaging de novo sequence-level variants affecting exactly the same genomic location in different families. A variant was defined as damaging if it caused loss of function (stop gain, frameshift, or canonical splice site-disrupting) or was a predicted deleterious missense variant based on American College of Medical Genetics guidelines29. Initially, we examined rare (frequency less than 0.001 in gnomAD and 1000 Genomes) de novo variants identified from MSSNG data release DB6 (release date June 24, 2020), which were detected as previously described6. After identifying this recurrent variant in SHANK3, we then searched our in-house databases and performed literature searches for the same variant. Ethical review of these cohort studies was approved by institutional review boards and included accessing datasets through applications to Data Access Committees.
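The search logic described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the record fields, effect labels, family and sample identifiers, gene names, and coordinates are all hypothetical, and the deleterious-missense call is represented by a precomputed boolean flag.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical effect labels for the loss-of-function classes named in the text.
LOSS_OF_FUNCTION = {"stop_gained", "frameshift", "splice_site"}
MAX_POPULATION_FREQUENCY = 0.001  # rarity threshold stated in the text


@dataclass(frozen=True)
class Variant:
    family: str
    sample: str
    gene: str
    chrom: str
    pos: int
    alt: str
    effect: str
    de_novo: bool
    population_frequency: float
    missense_predicted_damaging: bool = False  # assumed precomputed flag


def is_damaging(v: Variant) -> bool:
    """Loss-of-function, or missense flagged as predicted deleterious."""
    return v.effect in LOSS_OF_FUNCTION or (
        v.effect == "missense" and v.missense_predicted_damaging
    )


def recurrent_sites(variants, min_families: int = 2):
    """Group rare, damaging, de novo variants by exact site and allele,
    keeping sites observed in at least `min_families` different families."""
    families_by_site = defaultdict(set)
    for v in variants:
        if v.de_novo and v.population_frequency < MAX_POPULATION_FREQUENCY and is_damaging(v):
            families_by_site[(v.gene, v.chrom, v.pos, v.alt)].add(v.family)
    return {
        site: families
        for site, families in families_by_site.items()
        if len(families) >= min_families
    }


# Hypothetical toy records; none of these correspond to real study data.
toy_variants = [
    Variant("FAM_A", "A-3", "GENE1", "chr1", 1000, "dupG", "frameshift", True, 0.0),
    Variant("FAM_B", "B-3", "GENE1", "chr1", 1000, "dupG", "frameshift", True, 0.0),
    Variant("FAM_B", "B-4", "GENE1", "chr1", 1000, "dupG", "frameshift", True, 0.0),
    Variant("FAM_C", "C-3", "GENE2", "chr2", 500, "T", "missense", True, 0.0),
    Variant("FAM_D", "D-3", "GENE1", "chr1", 1000, "dupG", "frameshift", True, 0.01),
]
hits = recurrent_sites(toy_variants)
print(hits)
```

Counting families rather than samples keeps affected siblings from inflating recurrence. Because grouping is by exact coordinate, variants in homopolymer runs need to be normalized to a common alignment before this step.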

Phenotyping measures

Phenotypic data were extracted either from the original manuscripts, in which case we attempted to stay close to the original descriptions, or from the reference databases. In the latter case, clinical diagnosis of autism spectrum disorder was reported in the databases and was supported by ADI/ADOS. Intellectual disability was reported as a clinical diagnosis, and in most cases formal IQ testing was available for confirmation. Language delay was available as a clinical diagnosis, often with characterizations such as “minimally verbal” or “nonverbal”, and in many cases formal language measure scores were available for review. Information on psychiatric/neurological comorbidities was extracted from the original manuscripts, or was available as a clinician diagnosis or clinical concern based on available continuous measures of such symptomatology (e.g., CBCL, RCADS).

Confirming representation of exon 21 in exome and WGS datasets

Given the high GC content of SHANK3, which can influence exon capture and sequencing52, we thought it was critical when assessing mutational frequency to confirm that there were no biases in read coverage of the site of the target variant within exon 21 (Supplementary Material; Fig. 1). Using whole-exome sequences from 298 patients and 462 controls from our internal dataset, we ran the Agilent Sureselect Clinical research exome V1 for exome sequence analysis and show that the coverage around the G duplication region is at the anticipated 120x coverage (Supplementary Material; Fig. 1). This analysis also indicates that diagnostic exome sequencing will more than adequately capture and accurately genotype this position. WGS analysis of probands from MSSNG and SSC also confirms that exon 21 in SHANK3 is uniformly covered.

Protein and evolutionary conservation analysis

We used the DISOPRED3 predictor49 and the consensus of eight predictors from MobiDB-lite50 to map where the p.Ala1227Glyfs*69 variant falls with respect to intrinsically disordered regions (IDRs) of the protein. The variant isoforms were also analyzed using Feature Analysis of Intrinsically Disordered Regions55,56 and using PScore57. We analyzed the genomic conservation of the p.Ala1227Glyfs*69 variant with GERP64, UCSC PhyloP, and phastCons for primates, placental mammals, and 100 vertebrates65. The main text, tables, and figures (including Supplemental) have additional details relevant to the presentation of the results.

Polygenic risk score analysis (PRS)

PRS was calculated for all individuals of European ancestry in MSSNG (db6) and SSC, merged with the 1000 Genomes European population, using GWAS summary statistics derived from the iPSYCH Autism project, including 13,076 cases and 22,664 controls from Denmark88. This included probands MSSNG00342-003, MSSNG00342-004, 1-1047-003, 2-1774-003, and 14470.p1. A total of 25,837 SNPs were included in the PRS calculation. Since the proband 7-0527-003 was part of a later version of the MSSNG cohort (db7), he was not included in the initial PRS calculation. This individual’s PRS was calculated separately with his parents (7-0527-001 and 7-0527-002) using the same 25,837 SNPs included in the PRS calculations for the others and centered by the mean of the whole MSSNG/SSC/1000 Genomes European population. However, of the 25,837 SNPs, 1496 were missing due to sample quality in this family, and caution is needed in comparison with the other subjects. The approach for interpretation of the PRS data was based on previous studies18,88,89.
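A generic centered PRS can be sketched as a weighted sum of allele dosages minus the mean score of a reference population. This is a minimal illustration, not the study's implementation: the source states only that scores were centered by the population mean, so the SNP names, weights, dosages, handling of missing genotypes, and function names below are all assumptions.

```python
def raw_prs(dosages: dict, weights: dict) -> float:
    """Sum of effect-size weight times allele dosage over SNPs with a
    non-missing genotype (missing dosages are given as None and skipped)."""
    return sum(
        weights[snp] * dosage
        for snp, dosage in dosages.items()
        if dosage is not None and snp in weights
    )


def centered_prs(target: dict, reference_panel: list, weights: dict) -> float:
    """Raw score of the target minus the mean raw score of the reference panel."""
    reference_scores = [raw_prs(sample, weights) for sample in reference_panel]
    reference_mean = sum(reference_scores) / len(reference_scores)
    return raw_prs(target, weights) - reference_mean


# Hypothetical toy weights (GWAS effect sizes) and dosages (0, 1, or 2 copies).
toy_weights = {"snp1": 0.5, "snp2": -0.25, "snp3": 0.125}
toy_reference = [
    {"snp1": 0, "snp2": 1, "snp3": 2},
    {"snp1": 2, "snp2": 1, "snp3": 0},
]
toy_target = {"snp1": 1, "snp2": 0, "snp3": None}  # snp3 genotype missing

toy_score = centered_prs(toy_target, toy_reference, toy_weights)

# Share of SNPs missing in family 7-0527, from the counts given in the text.
missing_pct = 100.0 * 1496 / 25837

print(toy_score, round(missing_pct, 1))
```

Skipping missing genotypes shifts a score relative to samples scored on the full SNP set, which is one reason a score computed with about 5.8% of SNPs missing should be compared with the others cautiously.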

Study recruitment

This study has complied with all relevant ethical regulations including obtaining informed consent from all participants and was approved by the Research Ethics Board at The Hospital for Sick Children.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.