Autism spectrum disorders (ASDs) include a broad range of developmental brain disorders that share a complex and heterogeneous etiology characterized by fundamental deficits in social reciprocity, impaired language and communication skills, as well as repetitive and stereotypic behavior. About 1% of the population are directly affected and many more as family members. Despite the high heritability of ASD, with near-perfect concordance in monozygotic twins,1 a >50% heritability among siblings2 and a >25% risk for developing ASD in a male sibling,3 the genetic risk factors are still poorly understood.4 In the absence of reliable and feasible biomarkers, ASDs are still diagnosed exclusively according to behavioral criteria. ASD-specific therapeutic approaches are urgently needed to meet the challenge of an increasing prevalence, yet genome-wide association studies (GWAS) have not met the need for a better understanding of the etiology of ASD.5

Many GWAS have been marred by both low sensitivity and specificity. The first three studies reported SYT17 (s=−log10PP=6.72), DMD (6.57),6 the moesin pseudogene MSNP1AS (9.67)7 (>750 kB from either of the cadherins CDH9 or CDH10),6 an independent finding in MSNP1AS (5.47),8 and the taste receptor TAS2R1 (6.68).9 The ‘largest (GWAS) of psychiatric illness so far in 33 332 subjects with psychiatric disorders, including 4949 subjects with ASD,10 pointed to two calcium (Ca2+) channel subunits (CACNA1C and CACNB2, five disorder meta-analysis only), consistent with previously suggested involvement of Ca2+ signaling in psychiatric disorders, but also to an intron in AS3MT, a gene involved in arsenic metabolism. The models with the best fit to ASD suggested TCF4 (a transcription factor, which ‘may have an important role in nervous system development’), but also DPYD (a pyrimidine catabolic enzyme) and PCGEM1 (non-protein coding). In an even larger study of educational attainment in 126 558 subjects, only ‘three independent SNPs were genome-wide significant’.11

Our results are based on one of the largest studies of ASD in the United States, which included 2705 children with ASD from the Autism Genome Project (AGP).12 In the original analysis, both stages confirmed CNTNAP2 as a risk factor for ASD,13 but the ‘highly associated MACROD2 SNP from primary stage 1 analysis showed little if any signal in the stage 2 sample (P-value 0.206)’.12 Even a score combining several putative risk alleles could not account for more than 1% of the variance.12 Another analysis of the first stage (AGP I)14 found seven rare de novo copy-number variations (CNV) unique to cases, four among them in a cluster of genes related to neurexin/neuregulin signaling comprising SHANK2 (two cases), a region containing five genes, including SYNGAP1 (one) and DLGAP2 (one), as well as a group of 7 among 219 rare inherited CNVs (P=3.1 × 103) in a 300 -kB X-linked region containing PTCHD1, a gene with unknown function deleted in two related boys with intellectual disability.15 Overall, CNVs were enriched in genes ‘involved in cellular proliferation, projection and motility, and GTPase/Ras signalling’, yet no ‘connected pathways’ could be postulated.14

Despite evidence for a likely involvement of de novo and environmental or epigenetic risk factors, including maternal antibodies16 or stress during pregnancy17 and paternal age,18,19 we contend that coding variations contribute substantially to the heritability of ASD and can be successfully detected and assembled into connected pathways with GWAS—if the experimental design, the primary outcome, the statistical methods used, and the decision rules applied were better targeted toward the particulars of non-randomized studies of common diseases. With u-statistics for genetically structured wide-locus data comprising several neighboring SNPs (μGWAS) addressing the former two conditions, we have recently confirmed axonal guidance and Ca2+ signaling as key pathways in childhood absence epilepsy (CAE)20 from 185 cases and publicly available controls only. As shared genetic risk factors have been suggested for neurodevelopmental disorders, in general,21 and epilepsies and ASD, in particular,22, 23, 24 we hypothesized that these pathways are involved in ASD as well. In this study, we took advantage of the higher power of μGWAS as a wide-locus approach and its higher specificity as a non-parametric method, and compared more vs less severe cases, to elucidate risk factors for particular features of ASD. Finally, we present a novel objective decision rule for study-specific genome-wide significance, which adjusts for a GWAS-specific bias in determining cutoffs for enrichment with disease-related genes.

Materials and methods

Study subjects/genotyping

The study was approved by the IRB of The Rockefeller University. No human participants were involved in the research. The samples were genotyped on Human1Mv1_C and Human1M-Duov3_B Illumina chips. The genomic data were downloaded from dbGaP (data set phs000267.v2.p2) and details of the study population are described elsewhere.12,25 In the preparation of the data, we retained only SNPs with an rsID. During quality control, we removed SNPs according to the following criteria (AGP I/AGP II): MAF<2% (851/1066), >20% missing genotypes (5179/791), one-sided Hardy–Weinberg P-values for lack of heterozygosity <1 × 104 (0/0), LD with their neighbors >0.98 (149 512/149 982), data quality μ-score20 among the bottom 10% (67 979/68 380).

Study design

We aimed at risk factors specific to strict definition autism (SDA) by comparing case subpopulations meeting the definition of SDA and milder cases with ASD (excluding SDA), for which we here use the term ‘high-functioning autism’ (HFA). To reduce variance, we included only subjects of European ancestry genotyped on the more frequently used platform in either stage. In AGP II, we also excluded female cases because of confounding between chip platform and disease severity. The total number of subjects included (m:male/f: female) was 547/98 (SDA) and 358/68 (HFA) in AGP I and 375 (SDA) and 201 (HFA) in AGP II.

Wide-locus approach

To overcome several of the shortcomings seen in previous applications of single-SNP GWAS (ssGWAS) to common diseases, we combined several strategies at different stages of the analysis process. We aimed at wide loci of up to six neighboring SNPs as a primary outcome and applied the same non-parametric GWAS approach based on u-statistics for multivariate data26 with genotypic structures (μGWAS)20 as in the previous CAE study.20 For the AGP I data, we stratified the analysis by sex,27 and selected sex-specific results, if either sex, after Bonferroni correction for two sexes,28 was more significant than the stratified analysis. To avoid spurious findings, we excluded loci outside of linkage-disequilibrium (LD) blocks containing genes with known function or adjacent to their 5′-end and also loci highly influenced by a single SNP only, unless these SNPs were implicated in both stages or had been implicated in other studies.

MAF significance correlation

With any finite sample size, the significance of a u-/rank test is limited and more significant results can only be obtained for SNPs with sufficiently high MAF. We performed ssGWAS simulations with 2 500 000 permuted phenotypes, comparing two groups of equal size n for various MAFs. The 1–10−5 quantile of the permutation distribution drops from the expected s=5.26 cutoff, which is routinely met for MAF >0.33, to 4.9 (n=1000 subjects), 4.7 (n=500) and 4.5 (n=300) for a MAF of 0.05. For the 7.5 level, this bias is projected to be even larger. Because of this MAF significance correlation, the expected diagonal in a ssGWAS quantile-quantile (QQ) plot under the null hypothesis that ‘no SNP is associated with the trait’,29 turns into a curve dropping below the diagonal towards the end.

Estimating the expected s-value distribution from >108 permutations to obtain stable estimates of the 1–10−7.5 quantile is neither practical nor sufficient to avoid a biased selection of SNPs for limited tests. Because of the MAF significance correlation, any SNP that is ‘significant’ when comparing observed phenotypes, is also more likely to be ‘significant’ with random phenotype permutations (see Supplementary Figure 12).

Non-randomization bias

The reason for this curvature often not being recognized in QQ plots is that GWAS subjects are deterministically categorized based on their outcome (here: SDA vs HFA), rather than randomly assigned to interventions (as in clinical trials). Any deterministically categorized populations, however, are expected to differ systematically in aspects related to neither the condition of interest nor common ancestry factors (which could potentially be accounted for through principal component analysis). When the downward trend from using a limited test and the upward bias from deterministic selection are similar, the s-values may still appear to follow the diagonal, hiding loci suggesting ‘true association’.29

Multipicity adjustments for diplotype length

For multivariate tests of overlapping diplotypes, the estimated quantile-rank (QR) curve needs to be elevated above the diagonal throughout to account for multiple tests conducted around the same SNP. Because most of these tests are highly dependent, the elevation of the estimated QR curve compared with the estimated QQ curve (Figure 1) is limited, but the distance is likely to vary across diseases and populations.

Figure 1
figure 1

μGWAS QR plot (curved) vs traditional ssGWAS QQ plot (straight), left: AGP I, right: AGP II. Each point represents the most significant result among all diplotypes centered at the same SNP ranked by significance (low to high). Dashed blue curve: projection. Solid blue curve: loess estimation (see Materials and Methods). Vertical lines connect the most significant s-values (−log10 P) of a gene (dot) with its expected value (solid blue line). Light red and gray vertical lines indicate genes with unknown function and results with low reliability (either low μIC or reliance on a single SNP), respectively. Top and bottom gene list (by significance, right to left, excluding genes with unknown function): μGWAS and ssGWAS results, respectively. Shaded genes are among the genes highlighted in Figure 2. Full and open triangles mark genes with an identical match or family member of SFARI genes, respectively (see Supplementary Table 1 for details). The dotted horizontal lines represent the projected WG apex (6.272 and 6.064) and an exploratory 100 gene cutoff (4.835 and 4.480) for AGP I and AGP II, respectively (Supplementary Table 1). The horizontal solid blue line indicates the proposed study-specific GWS.

Projected QR curves

The diagonal of the traditional QQ plot does not depend on any data, including the most ‘significant’ data. The s-values are expected to fit the diagonal for the most part (except for the most significant results),29 (Figure 1A), as the vast majority of SNPs are expected not to be associated with the disease. In direct analogy, the QR curve for a multivariate test should be ‘smooth’, with upward deviations indicating ‘true association’, related or not. On the basis of the above rationale and the simulation results shown below, we propose to estimate the highest point of the projected QR curve (apex) for each chromosome from a smooth projection of the s-values after truncating as many of the highest values as needed for the projection to have a monotone increase and, conservatively for a limited test, a non-positive second derivative. Fitting against the data also reduces the effect of population stratification29 (Figure 1B). (For computational convenience, we have chosen locally weighted polynomial regression,30 as implemented in S+, loess.smooth (…, degree=2, family=’gaussian’).)

Estimated WG QR apex

While chromosomes may differ with respect to their content of related and unrelated risk factors (take for example, the HLA region in autoimmune diseases), random errors are expected to have the same distribution across all autosomes. Hence, we can estimate the expected WG apex as the (winsorized) median projected apex among autosomes with the smallest deviation of s-values from the projection. (Here, we selected ten autosomes based on maximum norm and median for robustness, but the strategy to determine the optimal number, including the criteria for ‘optimality’, remains to be determined.)

Estimated QR curves

The estimated curve for each chromosome is then calculated as the loess projection30 of this chromosome’s s-values with as many of the highest values replaced with the estimated WG apex until the curve’s apex is at or below that level. Applied to the WG projection (Figure 1, Supplementary Figures 5-11, bottom right), this procedure yields the estimated WG curve. The simulation results in Supplementary Figure 12 (bottom right vs bottom left) demonstrate the low variance of the estimates from phenotype permutations and the similarity of their median apex with the winsorized median apex estimated from the observed s-values. Details of this approach are discussed in the Supplementary Material in the context of Supplementary Figure 5.

Study-specific GWS

For studies aiming to confirm individual SNPs as associated with a phenotype, the ‘confirmatory’ paradigm31 requires adjustment for multiplicity. When applied to GWAS, these adjustments are typically based on a ‘customary’ fixed 0.05 level, irrespective of study size or relative risk of type I over type II errors (see, Fisher32 p. 358 and Gigerenzer33 for a discussion), and the assumption of 1 000 000 independent SNPs, irrespective of chip density.29 Moving from individual SNPs to overlapping diplotypes increases the dependency of any formal multiplicity adjustment on assumptions with questionable biological validity.

In most GWAS, however, we do not aim to confirm hypotheses regarding specific SNPs. Instead, we aim at picking likely candidates from >40 000 (pseudo-) genes, whose relative importance and epistatic interactions are unknown. As graphical procedures are particularly useful for such ‘exploratory’ studies,34 we chose QR plots to guide with interpretation. Unfortunately, exact cutoffs for deviation of s-values from the estimated curve are unknown. When ‘the knowledge [is] at best approximate[,] an approximate answer to the right question, which is often vague, [is far better] than an exact answer to the wrong question, which can always be made precise’,35 pp 13–14). Hence, we present a heuristic approach that relies on fewer unrealistic assumptions than typical attempts to quantify a particular error rate.

As the expected WG curve needs to be estimated, the s-values have a complex dependency structure, and the appropriate level of significance (α) for the given sample size is unknown, we propose a heuristic decision rule based on weak assumptions only. In the long run, one would expect most s-values above the apex to be significant at any α > 0 (consistency) and regions with the strongest association to have the highest odds at being included (unbiasedness). Hence, for a particular α, one could lower the cutoff, but, to account for variance in estimating the apex, one would need to raise it. As a compromise, we propose the estimated WG apex as a cutoff for study-specific GWS.


Traditional GWS cutoffs

The commonly used cutoffs tend to have low sensitivity for enriched genes. In an analysis of AGRE/NIMH data,9 for instance, 'excess of independent regions associated at P<105', had been observed, even though ‘no SNP met criteria for genome-wide [permutation] significance [of] P<2.5 × 10−7'. None of our results exceeds this cutoff, either, even though our ssGWAS results (Figure 1, bottom) and, in particular, μGWAS results (Figure 1, top list) of both stages are also highly enriched with genes collected in the SFARI Gene database (Figure 1, red triangles).

The WG projection apices for ssGWAS of ~6.0 (Figure 1, bottom) are clearly exceeded only once, by MMP10 in AGP I. A noticeable deviation from the expected distribution is commonly used as a decision criterion for selecting candidate genes.29 On the basis of the projected WG curve, only three AGP I genes (Table 1a), but none of the AGP II genes, fulfill this criterion in ssGWAS, compared with eleven and two genes (Table 1a/b, excluding NTMT1), respectively (Figure 1, top), in μGWAS.20

Table 1 Overview of genes in GW analyses meeting the significance criteria and relationships with functional clusters (symbols highlighted in bold) mentioned in the Results section

Study-specific GWS

The proposed more flexible cut-offs (see Materials and Methods) account for the MAF significance correlation. In both AGP stages, ~100 genes deviate from the estimated curve (Figure 1, solid curve) as an heuristic criterion for expected enrichment37 (Supplementary Table 1). The set of genes deviating sufficiently, however, can be difficult to determine objectively. We are proposing the estimated WG QR apex as a more formal study-specific criterion, which here increases the number of significant regions from none (when compared against a fixed GWS of 7.5) to 18 and 8 for AGP I and II, respectively (Figure 1, solid horizontal lines).

In all μGWAS included, that is, Figure 1 (four analyses), Figure 4, Supplementary Figures 7,9 and 12 (three analyses), as well as in numerous others (results not shown), <20 genes or gene regions exceeded study-specific GWS. The high enrichment with pathway genes even below the WG apex in μGWAS of AGP II (Figure 1, top right) attests to the proposed approach being conservative. Further support comes from the number of selected genes being smaller with randomized vs observed phenotypes (3–7 vs 14, Supplementary Figure 12), albeit not zero, consistent with the above MAF significance correlation resulting in SNPs with a high MAF not only to be more likely to be significant with observed phenotypes, but also with random phenotype permutations. In addition, the number of selected genes is smaller with comparable populations of smaller size (Figure 1, Figure 4 vs Figure 1), as expected in selection procedures.38,39

The previous CAE study and the additional comparison of HFA cases vs parental controls (Figure 4) had a study-specific cutoff of 7.20 (Supplementary Figure 7, 21 functional regions, including CNTNAP2, DLGAP1 and NALCN as 19th to 21st), and 4.91 (Supplementary Figure 8, 25 regions, including ARHGAP24, SLC25A21 and PTENP1 as 25th, 22nd and 20th), respectively.

Specificity of the proposed approach in the current study

A common problem with many ‘pathway analysis’ approaches is that sufficiently many inconsistent findings may be present in the published literature for at least some pathways to fit (almost) any set of genes generated by GW screening. Hence, a major strength of the current study is that the primary hypothesis about Ras/Ca2+ signaling being involved had been stated a priori based on our published CAE results (Figure 2, bottom), increasing confidence in the current ASD results (‘prioritized subset’42) and allowing the specificity of the proposed cutoff for study-specific GWS to be discussed.

Figure 2
figure 2

Ras/Ca2+ signaling in ASD and CAE. (a) AGP I, (b) AGP II, (c) CAE. Pathway-related genes among the top 20, 50 and 100 are circled in bold, double and thin blue lines, respectively. Genes included in SFARI Gene (ASD) and CarpeDB (CAE), respectively, are shaded in red (see Supplementary Table 1 for details); the five genes identified in previous GWAS (see Introduction) are indicated in turquoise and underlined. Upon GF binding to cell-surface receptors (for example, IGFR, MET, PDGFR, ERBBn), formation of receptor complexes initiates proliferation, cytoskeletal organization and survival along Ras downstream effectors. GFs are immediately deactivated by PTPRs. Downstream activities are modulated by agonists binding to G-protein-coupled receptors (GPCR) activating phospholipase C (PLC) to form membrane diacylglycerol (DAG) and inositol trisphosphate (IP3). While DAG activates Ras directly, IP3 stimulates (‘winged’ arrows) the release of Ca2+ from the endoplasmic reticulum (ER), starting a process of Ca2+-dependent activation of Ras involving several feedback loops. The fall of Ca2+ concentration in internal stores (dotted areas) leads, via STIM1, to the opening of store-operated Ca2+ channels (SOCC) in the plasma membrane. ITPKB phosphorylates IP3 into IP4, which opens voltage-operated Ca2+ channels (VOCC). CaCCs can either directly activated by Ca2+ elevation or through Ca2+/calmodulin kinase II (CaMKII)-mediated phosphorylation. Other plasma membrane ion channels involved are Ca2+ channels operated by NMDA and kainate ligands, voltage-operated potassium channels (VOPC). GABA-operated Cl channels reverse from excitatory efflux to inhibitory influx during maturation.43 Overall Ca2+ levels are limited by plasma membrane Ca2+ ATPase (PMCA). Known drug interactions are indicated in green.

Of the top 100 genes selected in AGP I and II, 57 and 47 genes, respectively, could be related to Ras/Ca2+ signaling (Supplementary Table 1 and Figure 1), matching a targeted false discovery rate of 50%.37 The increasing enrichment toward the top 50 and top 20 genes (Supplementary Table 1), reaching 100% for the top twelve regions in AGP I, attests to the high specificity of the results. Additional support comes from the replication of the results in two independent populations (see below). In an unrelated autoimmune disease, psoriasis, in contrast, the majority of genes identified were located in the HLA region or interleukins (data not shown). The lack of overlap between these unrelated diseases further attests to the specificity of the proposed approach.

To guide with interpretation, the subset of genes among the top 100 reported in our CAE study20 that let to the Ras/Ca2+ hypothesis (Figure 2, bottom) and the matching genes among the top 100 genes from either of the two stages of the current study are arranged in Figure 2 around a putative ‘consolidated pathway’ derived from several ‘canonical pathways’. Although many variations of such a consolidated pathway could be constructed, we contend that there is sufficient consensus among canonical pathways for genes in close proximity to be functionally related.

Replication across independent populations

With complex diseases, independent studies are not expected to show more than the functional equivalence seen in the close overlap between stages in Figure 2 (top and center). When hundreds of genes contribute,44 few, if any, would be expected to be among the most significant in any two independent studies, even in the absence of selection and ascertainment bias. The two AGP populations, however, were collected consecutively in different sets of locations. Female cases could only be included in AGP I, due to imbalances in disease severity and chip platform usage om AGP II. While of limited value for confirming the Ras/Ca2+ signaling hypothesis, the results of the exploratory pathway analysis (Supplementary Table 2) suggest that AGP I and II patients vary more with respect to behavior (‘schizophrenia’) and developmental risk factors (‘neuritogenesis’), respectively.

Still, seven genes among the top 100 in both stages (Table 1c) can be directly related to the hypothesized pathway (ranks and Fisher’s45 combined s-values sF in parentheses): SORCS2 (10th/36th, 9.85) binds NGFR and mediates apoptosis46 as well as responses to proneurotrophins.47 CDH13 (45th/25th, 8.86) is an atypical cadherin involved in cell signaling, rather than adhesion. It colocalizes with αvβ3 integrin,48 downregulates neural cell growth49 and was disrupted by a microdeletion in an ASD case.50 The membrane progesterone receptor36 PGR (66th/18th, 8.78) drives ERK/MAPK signaling51 and contributes to neuron excitability through steroids52 in the brain.53 GRK5 (36th/59th, 8.68) controls neuronal morphogenesis54 by phosphorylating G-protein-coupled receptors (GPCRs) and initiates β-arrestin-mediated downregulation in a Ca2+/calmodulin-dependent manner. PZP (34th/93rd, 8.56) interacts with the target of minocycline, MMP9,55 which cleaves56 the extracellular component of CD44,57 whose expression has been implicated in ASD58 and whose intracellular component interacts with RAS59 via both ERBB2 and PLC.60,61 PTPRT (90th/20th, 8.57) will be discussed below.

Among the top genes with sF > 8.5 (Table 1d) are several more ASD-related genes. KCNMA1 (795th/1st, 9.38) and DOK5 (2nd/497th, 9.29), also listed among the individual genes, are a Maxi-K channel in which rare mutations have been identified62 and a gene that mediates neurite outgrowth, respectively. HIVEP2 (110th/4th, 9.22) is also known as Schnurri-2 and Shn-2(−/−) mice exhibited hypersensitivity to stress accompanied by anxiety-like behavior.63 Mutations in SEPT9 (123rd/8th, 9.01) cause hereditary neuralgic amyotrophy.64 DMD (17th/88th, 8.97) is a member of a glycoprotein complex, which accumulates at a variety of neuronal synapses. Dystrophin is associated with Duchenne and Becker muscular dystrophies (DMD), where it is implicated in signaling events and synaptic transmission. DMD is comorbid to ASD.65 Two studies found a genetic association of DMD with ASD,6,66 and one deletion in DMD was found in a CNV analysis of the AGP I data.14 The ortholog of SHROOM3 (23rd/109th, 8.66) in mice is required for proper neurolation.67 The combined results of PPFIBP1 (11th/197th, 9.17), which is also included in the univariate results above, will be discussed in the context of PTPRs below.

Functional clusters of genes

Overall, the results (see Supplementary Figure 1 for a Manhattan plot) are highly consistent with previously proposed aspects of the etiology of ASD. The clusters of genes implicated in both of the independent stages (Figure 2a/b) consistently overlap with our published CAE results (Figure 2c), confirming the involvement of ion channels (top right) and signaling downstream of RAS (bottom left), with two noticeable additional gene clusters in ASD. Both stages implicate several genes involved in deactivation of growth factor (GF) receptors (Figure 2a/b, top left) as ASD-specific risk factors and chloride (Cl) signaling, either through Ca2+ activated Cl channels (CaCC, Figure 2a, top right) or through the lysosomal Cl/H+ exchange transporter CLCN7 (Figure 2b, top right), whose disruption leads to widespread degeneration in the CNS of mice.68 In CAE, in contrast, several genes related to ER and Golgi as well as DST interacting with F-actin69 support the hypothesis that assembly/trafficking and lateral diffusion, respectively, of GABAA receptors, potentially mediated through RhoA and CDC42,70 may be more specific to the etiology of epilepsies.71

Broad evidence for involvement of PTPRs

One of the most striking observations is the involvement of at least five PTPRs in ASD (Figure 2, 10 o’clock position). PTPRs (Table 1e) regulate GF signaling through reversible protein tyrosine dephosphorylation.72 PTPRT (90th/20th, 8.57) was implicated in ASD by a deletion73 (Table S2 AU018704) and a somatic mutation.74 It is the PTPR most frequently mutated in colon cancer, where all five missense mutations identified reduced phosphatase activity.75 PTPRD (519th/84th, 7.26), for which rare CNVs were previously reported,14 and its ligand76 IL1RAPL2 (10th in AGP II), which is associated with X-linked non-syndromic mental retardation, are also implicated. De novo disruptions in PPFIA1 and the neighboring SHANK2 were recently reported in a person with autistic behavior77 and, here, PTPRF is implicated through the association of its interacting binding protein 1 PPFIBP1 (11th/197th, 9.14) and ERC2 (49th in AGP II). PTPRG is known to bind both CNTN6 (13th in AGP II) and CNTN4 (837th/78th, 7.06),78 which play an important role in postnatal brain development.79 PTPRB (21st/880th, 7.77) binds CNTN1, which is involved in axonal expression and neurite extension.80

Replication of PTPR wide loci across the independent stages

Notably, the region of high significance in two of the PTPRs, PTPRT (Figure 3, top/left) and PTPRB (21st/880th, 9.11) (Figure 3, bottom/left), comprises the same SNPs in both independent stages. Moreover, the PTPRT region is located in the same LD block as a known somatic mutation (rs146825584).74

Figure 3
figure 3

Extended Manhattan plot of μGWAS results for PTPRT (top), PPFIBP1 (bottom/left), and PTPRB (bottom/right) by AGP stage. The X-axis shows base pairs within chromosome. Black dots indicate significance in ssGWAS, lines indicate significance in diplotypes of width 2 (dotted) .. 6 (solid), red color indicates low μ-scores for reliability, suggesting a potential artifact, unless supported in both populations. Green dots and s-values indicate univariate results for SNPs within the most significant region. s-Values in gray indicate nearby results. Below the panels are gene annotations, LD blocks, and recombination rate from HapMap.81 The PTPRT region comprises rs6102794, rs6072693, rs6072694, rs6102795, rs6016759 and rs6102798. The ‘x’ and box at the bottom indicate a somatic mutation at rs14682558474 and a deletion at 41,036,259–41,300,52173 (Supplementary Table S2, AU018704), respectively. The PTPRB: region comprises rs3782377, rs2567137, rs2567133, rs2278342, rs2116209 and rs2278341. KCNMB4 results driven by a single SNP in one population only (rs787931, red ‘x’) are indicated as a potential artifact, but the related KCNMA1 was the most significant gene in AGP II (Table 1).

Evidence for PTPR risk being epistatic

To further explore the risk conveyed by PTPRs, we scored male subjects combined with 1047 male controls from a melanoma study genotyped on the same chip platform (see Supplementary Material) stratified by stages,27 see Figure 5 (PTPRT) and Supplementary Figure 13 (PTPRB, PTPRD, and PPFIBP1). The polarized diplotypes with the best discrimination by stage are highly consistent, indicating that the populations agree not only in the location of the risk factors, but also in the high risk alleles.

Figure 5
figure 5

Comparison of PTPRT allelotype profiles between SDA cases, melanoma controls, and HFA cases. Rows indicating individual subjects’ SNP profile (orange/green: homozygous; yellow: heterozygous) are sorted within each population by diplotype μ-score (dark green to dark read) computed from the three consensus SNPs (rs6102794, rs6072694 and rs6102795 out of the six-SNP PTPRT region of Figure 3), which are highlighted as more saturated. Dotted lines are added for visual guidance.

For PTPRT and PTPRB in both stages and for PTPRD and PPFIBP1 in AGP I, SDA and HFA cases scored higher and lower than controls, respectively, so that no difference could have been detected by comparing all cases against controls. This result are consistent with the hypothesis that that PTPR variations, in general, merely affect body size (and, thus, are not selected against), but in the presence of other genetic risk factors contribute significantly to deciding the fate of an ASD case towards either HFA or SDA.

K+ and Cl ion channels as drug targets

Aside from PTPRs (Figure 2, 10 o’clock) as a risk factor for protracted GF signaling, our results suggest a second functional cluster of genes, involved in Cl transport and signaling, as specific to ASD (Table 1f). In AGP I, the CaCCs ANO4 and ANO7 scored 1st and 70th, respectively. In AGP II, the lysosome membrane H+/Cl exchange transporter CLCN7 scored 21st, followed by CAMK2A, which regulates ion channels, including anoctamins82 (55th), and LRRC7 (densin-180), which regulates CAMK2A83 (Figure 2a/b, 2 o’clock). The role of the anoctamins in pathophysiology is not well understood, except that CaCC activity in some neurons is predicted to be excitatory84 and to have a role in neuropathic pain or nerve regeneration. More recently, CaCCs have also been suggested as involved in ‘neurite (re)growth’.85

Finally, we compared the HFA and SDA cases as separate groups against all parental controls in the larger AGP I population. Overall, the level of significance is lower and the enrichment is less pronounced, especially for the SDA cases (Supplementary Figure 9), as expected when cases and some controls are related. For the HFA cases (Figure 4, and Supplementary Figure 8), however, a second anoctamin, ANO2, located on the other arm of chromosome 12, competes with ANO4 (Figure 1, left), for the most significant gene among the result. Hence, drugs targeting anoctamins might have broader benefits for the treatment of ASD than in preventing progression to more severe forms of autism.

Figure 4
figure 4

Comparison of HFA cases against all parental controls. Genes shaded in dark and light blue are members of or associated with the Ras/Ca2+ pathway (Figure 2), respectively (see Figure 1 for legend). PFN2 inhibits the formation of IP3 and DAG by binding to PIP2, LIMK1 is phosporylated by ROCK1 and PAK1, downstream of RHOA and RAC1, respectively. PTENP1 acts as a decoy for PTEN-targeting miRNAs.40 SLC25A21 may be involved in 2-oxodipate acidemia, which is accompanied by mental retardation and learning disabilities.41 Cytogenic bands: ANO2: 12p13.3, ANO4: 12q23.3.

ANO2 and ANO6 are associated with panic disorder and major depressive disorder, respectively. ANO3, ANO4, ANO8 and ANO10, but not ANO1, are also expressed in neuronal tissue.86 As ‘druggable channels’, anoctamins ‘may be ideal pharmacological targets to control physiological function or to correct defects in diseases’.87 Few drugs, however, target individual anoctamins or even exclusively CaCCs. Cl channel blockers such as fenamates, for instance, may decrease neuronal excitability primarily by activating Ca2+-dependent outward rectifying K+ channels.


ASDs are complex diseases involving many genes along common pathways.23 Based, in part, on mouse studies88,89 and enrichment of CNVs in a previous analysis of the AGP I data,14 there is an emerging consensus building that dysregulation of the Ras pathway is involved in ASDs. Ca2+ signaling has an excitatory impact on the Ras pathway90 and abnormal Ca2+ signaling has been implicated in ASD.91, 92, 93 Still, ssGWAS have largely failed to elucidate the precise mechanism by which Ras and Ca2+ signaling interact and how to determine effective therapies.

While wide-locus GWAS is known to have the potential of higher power over ssGWAS with common diseases,94,95 practical problems abound. Many traditional multivariate methods96 including simple linear/logistic regression, gene-based97 approaches combining ‘individual marker P-values’ across a gene, and gene-centric98 approaches ‘counting the number of minor alleles for each sample at each SNP’ assume independence and additivity/multiplicativity of risk factors. As a downside, meaningful non-linear relationships may be overlooked (false negatives), while random errors, not subject to biological constraints, may occasionally fulfill any assumption, so that many ‘significant’ results are often false positives due to model misspecification. Increasing degrees of freedom with logistic regression by adding sequential interaction terms for neighboring SNPs may increase likely ‘noise’.20

With the advent of mainframe computers, more complex calculations (for example, factor analysis) became feasible. Personal computers triggered the development of resampling methods. Recently, increases in memory to gigabytes and massive parallel computing have spurred the methodological advances making wide-locus GWAS based on a nonparametric approach (u-statistics for multivariate data, μGWAS) feasible.99 Making only biologically plausible assumptions (additional risk variants within a wide-locus increase risk, albeit to an unknown extent) avoids typical model misspecification biases with traditional methods, whose assumptions (independence and additivity of the risk conferred by the SNPs within a scan statistic window100) primarily aim at computational simplicity.101

Several alternative strategies have limitations when applied to GWAS of common diseases. Increasing the sample size cannot guard against systematic, though unrelated differences between non-randomized samples taken from outbred populations.102 While the power for true disease-related differences increases with sample size, so does the power for equally true selection biases. Analyses based on predefined sets of genes comprising a pathway, including gene-set enrichment103,104 have low power when many relevant genes are only indirectly associated with the pathway. One of our most striking observations in this study (and in the previous CAE study) is the lack of association with non-redundant members of the Ras pathway itself, consistent with clinical observations that Rasopathies, such as Costello syndrome, cause more severe phenotypes than most forms of ASD and are routinely selected against. Moreover, our results in both CAE and ASD suggest a role for pseudogenes (EEF1A1P12 and PTENP1, respectively) in support of the previously identified pseudogen MSNP1.7 Finally, exome sequencing may overlook the variations in introns or promoter regions typical for common diseases20 (see Supplementary Figure 3) and the advantage of whole genome (WG) sequencing over GWAS based on SNPs is limited for common, compared to familial risk factors.

Individual de novo variations conferring noticeable risk are typically selected against during evolution. Epistatic variations (as originally defined by Fisher105) within the same intragenic or promoter region (wide-locus), however, could persist if each variation’s contribution per se were small. Hence, while principal component analysis can guard against subsets of SNPs related to common ancestry factors, wide-locus GWAS reduces the impact of individual SNPs, in general, and, thus, may guard against a broader range of artifacts. With the risks of artifacts from population stratification reduced as part of the statistical approach, the advantage of family-based association tests (FBAT) to control for population stratification by using hypothetical siblings as controls may pale against the disadvantage of low power in complex diseases,106,107 where related subjects are expected to share most, if not all, genetic risk factors. On the other hand, comparisons of cases against unrelated controls could identify risk factors that distinguish ASD cases, in general, from unaffected subjects, which were not present in the AGS population, but the results might more likely be confounded by factors unrelated to the disease (population stratification).

μGWAS increases power by comprehensively analyzing information from several neighboring SNPs, drawing on expected LD from HapMap81 and the spatial structure of SNPs within an LD block20 without introducing biases through unrealistic assumptions (independence and additivity). The proposed study-specific genome wide significance cutoffs bypass the unattainable goal of guarding against systematic biases in GWAS in favor of guarding against random errors only. Shifting the focus from individual SNPs, which could easily be false positives, to wide loci, which are more likely functional, also shifts the burden to avoid false positives from the decision strategy (increasing the level of significance) to the statistical method (integrating information from neighboring SNPs).

When applied to the AGP data, the consistent results from this hypothesis-driven prioritized subset analysis in two independent populations strongly confirm the Ras/Ca2+ hypothesis, and provide, for the first time, evidence-based insights into the etiology of ASD, a novel treatment paradigm, and additional approved drugs that might be repurposed for ASD.

Our results also suggest a reinterpretation of several previously reported findings. The educational attainment study11 mentioned in the Introduction, for instance, reported only three apparently unrelated loci reaching conventional GWS: LRRN2, LOC150577, and LOC100129158. Six of the top ten loci, however, point to genes closely related to the Ras/Ca2+ pathway (Figure 2): mir2113 (a micro-RNA located in the same LD block as LOC100129158, which has GRIK2 and PIK3C2A among the predicted targets with highest confidence), PIK3C2B (in the same LD block as LRRN2), STK24 (containing a CDC42 binding domain), ATXN2L and ITPR3 (both involved in regulating Ca2+ efflux from the ER), and GPM6A (involved in NGF-dependent Ca2+ influx).

Four additional genetic findings relate directly to ASD. First, being able to identify more narrowly defined regions, μGWAS pinpoints SMAP2, which encodes a GAP that acts on ARF1, a member of a Ras superfamily. Both SMAP2 and RIMS3 are located in the same 3.3 Mb region in chromosome 1 (Supplementary Figure 2) which was identified as a microdeletion. RIMS3 had been selected based on ‘literature review and bioinformatics analyses’,59 but our results suggest SMAP2 as the more likely candidate for a gene involved in ASD. Second, μGWAS confirms the involvement of CNTNAP2,108 with the strongest signal (in AGP II, Supplementary Figure 2) between exons 14 and 15, the same intronic region as rs2710117 previously associated with developmental language disorders109 and major depression.110 The significant findings in NXPH1, NLGN2, DAOA, and GRIK5 support its role in ‘localization of potassium channels within differentiating axons’ (Figure 2, 1 o’clock), consistent with the rare CNVs seen in DLGAP2, SHANK2, and SYNGAP1,14 as well as rare mutations in NLGN3.111 However, the known functional relationship between CNTN1, a binding partner of PTPRB, and CNTN2,112,113 a binding partner of CNTNAP2, raises the possibility that CNTNAP2 may also be involved through its role in ‘mediat[ing] interactions between neurons and glia during nervous system development’ (Figure 2, 10 o’clock). Third, ASD has been associated with HRAS, although no functional mutation has been identified. In fact, the original publications cautioned that ‘the TH and HRAS-1 genes are molecularly close, might be in linkage disequilibrium, and could reasonably [both] be considered as good candidate genes’ for the ‘positive association between autism and two [(3′ and exon 1)] HRAS markers’.114 Our results suggest that ANO9, located within only 100 kb of HRAS and in the same LD block might be an even better candidate gene than TH, which is located 1.5 Mb and several LD blocks away. Finally, our results are consistent with several of the canonical pathways identified in a previous study,115 but also with little overlap among individual SNPs across populations.116

Overall, our results strongly suggest that ASD is in large part a neurodevelopmental disease. While it had originally been suggested that symptoms of ASD are present from birth or shortly thereafter,117 there is now consensus that symptoms emerge gradually over the first 18 months of life.118, 119, 120 Our results lead to a clinical hypothesis testable in a phase II trial for interventions based on a tentative functional interaction between Ras and Ca2+ signaling. Our findings are consistent with increased brain volume,121 brain connectivity,122,123 and skeletal growth correlated with severity of symptoms124 and suggest impaired inhibition of neuronal growth due to defects in PTPRs upstream of RAS (Figure 2, top left) as a distinctive critical aspect in the etiology of SDA and a hypothesis to ‘elucidate the ‘dark matter’ [relating] the [PI3K-AKT-mTOR] pathway’125 downstream of RAS (Figure 2, bottom left).

The suggested role of PTPR variations in protracted GF signaling suggests the time of accelerated growth during 6–12 months of age for beginning pharmaceutical interventions targeting Ca2+ signaling, from the time where a decline in eye fixation and atypical pattern in scanning of faces and social scenes can be observed120,126,127 to the time where language regression is seen in some children128 at the begin of the ‘stranger anxiety’ period,129 while interventions targeting downstream mTOR signaling130 might be most effective when started even earlier, before symptoms are seen.

We posit a counterproductive maladaptive socio-emotional response131 to exposure to unfamiliar faces and voices, caused by sensory overload in response to disorganized perception of salient social features, potentially leading to experiences as intolerable as migraines, which are also known to share genetic risk factors,24 as a possible explanation for the observed differential shift in attention between unfamiliar and familiar faces132 and the hyporesponsiveness to social sensory stimuli in very young children with ASD.133 This association might also explain the differences seen in electroencephalography (EEG) measurements at 6–18 months between siblings of ASD children and normal controls.134 Structures for ‘social intelligence’ may be pruned135 in favor of structures for ‘analytical intelligence’ during this critical period where connectivity of the brain is validated and refined under the influence of environmental experiences. Over time, the brain adjusts to the over-excitation—as in CAE—so that children outgrow hypersensitivity to social cues before being old enough to report them, but at that time the window of opportunity for developing cortical structures may already be closed.

This reasoning may also explain both the limited success with interventions at later age and the ‘savant skills’ in some ASD cases. Along with increased community and clinical awareness of ASD and changing diagnostic standards, demographic shifts from rural to urban environments136 and children’s increased exposure to television since the 1950s137 may have also led to cases of sensory overstimulation in response to social information in children with genetic predisposition and, thus, contributed to the increase in incidence of ASD.121,138

This testable clinical hypothesis serves to challenge our current thinking about interventions. A shift in focus may be required from intervention in school aged children to early prevention starting around 12 months of age,139 during which time children shape and refine their neural circuitry in response to social stimuli.140 Secondly, although the American Academy of Pediatrics’ recommendation against television in children under the age of two years141 stems from studies in a more general population of children, unfavorable neurodevelopmental and behavioral outcomes in children with ASD might be even more compounded by early media exposure. Furthermore, early behavioral and educational interventions may need to favor personnel familiar to the child (Figure 6).

Figure 6
figure 6

Hypothesized interventions to prevent regression in children with ASD. During the critical period of developing cortical structures for social interactions the risk of stress-induced regression might be reduced through a combination of strategies including a protective environment where exposure to unfamiliar faces is limited and pharmacological interventions to reduce hyperexcitability related to Ca2+ signaling by targeting ion channels determined through genetic testing of genes known to be involved in Ca2+ signaling among children with a risk phenotype.

Our results are consistent with a wide spectrum of genes having mutations contributing to the risk of ASD. Drugs that target ion channels may decrease hyperexcitation to a level where a child does not feel the need to withdraw from social interaction. Testing at marker diplotypes in ion channels or related genes could serve to personalize the choice of the medication most effective in reducing excessive excitation of the Ras pathway during the critical period. Memantine is already used in the treatment of ASD, including in children from 2.5 years of age,142 based on a stimulatory effect on neurogenesis in a mouse model for fragile X syndrome (FXS),143 the most commonly inherited form of mental retardation, and its limited success in Alzheimer’s disease.144 Memantine could be most effective in children with mutations in and downstream of NMDA receptors. Gabapentin, approved by the FDA for the treatment of partial seizures in children from 3 years of age145 and tested in population-pharmacokinetic studies included subjects starting from age 1 month,146 might be repurposed for children with mutations in VOCCs.

Especially for children with mutations involving K+ and Cl signaling, fenamates, which have so far been considered in pain, in general, and in (juvenile) arthritis and (menstrual) migraines147 in particular, as well as in epilepsies for decreasing excitatory synaptic activity and reducing neuronal excitability,148,149 but not in ASD, might be more effective. Mefenamic acid has been used in preterm children150 and, in the EU, is approved for use in infants starting at 6 months of age.151 The diuretic bumetanide, which inhibits Cl influx via the Na+–K+–2Cl co-transporter NKCC1, has been shown to improve symptoms of ASD in some 3- to 11-year-old children152 as well as the efficiency of gabaergic drugs (including barbiturates and benzodiazepines) in neonates,153 where intracellular concentration of Cl is higher than in the mature brain and efflux of Cl through GABAA receptors is excitatory.43 As fenamates target a variety of K+ and Cl channels,154 they may have a lower risk for systemic (hypokalaemia155) and, in particular, neurodevelopmental156 side effects in younger children.

To reduce the duration of exposure during trials, event-related responses to visual stimuli such as familiar and unfamiliar faces, measured by EEG,134,157 magnetoencephalography (MEG),158 skin conductance,159 or eye tracking,127 might serve as surrogate endpoints to test the predictive association between genetic risk factors and treatment effectiveness.

Our results attest to a broad spectrum of genetic risk factors contributing to ASD. In particular, other factors than variations in PTPRs might sensitize the Ras pathway to hyperexcitation by interfering with growth factor downregulation. The translational repressor FMR1, for instance, directly targets multiple Ras and Ca2+ signaling pathway components160 and loss of FMR1 expression may cause FXS through aberrant Ras signaling.161 Similarly, reduced repression of mRNA involved in GF signaling might be an alternative mechanism to increase the severity of ASD. Activation of the PI3K/Akt pathway through mutations in PTEN, consistent with our findings from comparing HFA cases against parental controls (PTENP1, Figure 4) might be involved in a broader range of the autism spectrum.162 The gain-of-expression variation of MSNP1AS in ASD cases7 is also expected to cause ‘overproliferation’ through ‘increased RhoA activity’,163 possibly by competing with miRNAs downregulating MSN,164 interacting with the cytoplasmic CD44-ERBB2 complex (Figure 2, 11 o’clock).61 Finally, mutations in SYNGAP1 might restricting the length of time over which RAS remains activated.165 Hence, while the proposed interventions are unlikely to capture the whole spectrum of risk factors, they might prevent a substantial proportion of children with various risk factors for ASD from developing along the more severe spectrum of this heterogenic disease. The overlap in genetic risk factors between ASD and CAE (Figure 2) suggests another potential benefit of the proposed early intervention. As neonatal seizures per se may cause long-term neurological problems,166 preventing the postulated intolerable experiences may positively affect a wider range of ASD symptoms. The ability to hone in with μGWAS on specific genetic risk factors in small populations will enable us to develop better diagnostics and to identify subpopulations with other risk factors.

Of course, this multidisciplinary, translational approach is not restricted to ASD, but will further enhance our knowledge of many other complex disorders, thus allowing for the development of a broad range of novel therapeutic modalities with the hope of improved survival and quality of life for many other populations of patients.

Our results suggest that a relevant portion of genetic risk for common diseases is determined through coding variations, as was widely expected after deciphering the human genome ten years ago and could be detected—if GWAS were better adapted to the specifics of common diseases. The paucity of cogent GWAS results in ASD (see the Introduction) compared to some autoimmune and metabolic diseases167 may, at least in part, be due to the choice of data source (exome sequencing, rather than whole genome genotyping), study design (comparison against related or hypothetical controls, rather than unrelated cases with different phenotypes), primary outcome (individual SNPs, rather than regions comprising several SNPs), study objective (confirmation of individual SNPs, rather than enrichment of genes along pathways), statistical model (linear model, rather than u-statistics), and determination of ’genome-wide significance’ cutoffs (WG, rather than selective chromosome estimation). With more appropriate statistical strategies, reanalyzing the data already collected, including data from publicly available repositories (such as the NIH’s dbGaP), could finally lead to the insights sought for and the therapies urgently needed.