The substantial phenotypic heterogeneity in autism limits our understanding of its genetic etiology. To address this gap, here we investigated genetic differences between autistic individuals (nmax = 12,893) based on core and associated features of autism, co-occurring developmental disabilities and sex. We conducted a comprehensive factor analysis of core autism features in autistic individuals and identified six factors. Common genetic variants were associated with the core factors, but de novo variants were not. We found that higher autism polygenic scores (PGS) were associated with lower likelihood of co-occurring developmental disabilities in autistic individuals. Furthermore, in autistic individuals without co-occurring intellectual disability (ID), autism PGS are overinherited by autistic females compared to males. Finally, we observed higher SNP heritability for autistic males and for autistic individuals without ID. Deeper phenotypic characterization will be critical in determining how the complex underlying genetics shape cognition, behavior and co-occurring conditions in autism.
The core diagnostic criteria for autism consist of social communication difficulties, unusually restricted and repetitive behavior, and sensory difficulties that are present early in life and affect social, occupational and other important domains of functioning1,2. However, these criteria are broad, leading to substantial heterogeneity. Two individuals with very different phenotypic features, co-occurring conditions, support needs or outcomes may both be diagnosed as autistic1,3.
Heterogeneity in autism can arise from multiple, partly overlapping sources. This includes differences in core diagnostic features (core features)1,3,4 and associated features such as IQ, adaptive behavior and motor coordination, all of which have an impact on life outcomes3,5,6. Furthermore, sex and gender7,8 and co-occurring ID and developmental, behavioral and medical conditions9,10 alter the presentation and measurement of core autism features. While a few studies have attempted to investigate the genetic influences on this heterogeneity11,12,13,14,15,16,17,18, substantial gaps remain. First, existing studies investigating genotype–phenotype associations have been limited to summed scores of core autism features in smaller sample sizes19,20,21 rather than the underlying latent dimensions. This distinction is important given that autism is phenotypically dissociable12,22,23, and some associations may emerge only when latent traits are considered. Second, while the impact of de novo genetic variants on co-occurring developmental disabilities is reasonably well characterized17,20,21, the impact of common genetic variants is unknown. Third, although sex differences in autism vary by the presence of ID17,24,25, the sex-differential impact of common genetic variants in autistic individuals with and without ID is unknown. Finally, the impact of latent core autism phenotypes, sex and de novo variants on the common variant heritability also warrants investigation with large sample sizes.
Here, we address these four questions by combining genetic and phenotypic data from up to 12,893 autistic individuals from four different datasets. We focus on de novo protein-truncating and missense variants in constrained genes (high-impact de novo variants)17,26 and PGS for autism and genetically correlated phenotypes16. Finally, this larger sample size alongside more detailed information on genes underlying severe developmental disorders27 also allows us to revisit and provide deeper insights into two additional important issues relevant to heterogeneity in autism: the association of high-impact de novo variants with (1) co-occurring developmental disabilities and (2) sex.
Identifying latent phenotypes in core autism features
A critical challenge in identifying sources of heterogeneity in autism is understanding the latent structure of core autism phenotypes. To this end, we combined two widely used parent-reported measures of autistic traits (Repetitive Behavior Scale—Revised (RBS)28 and Social Communication Questionnaire—Lifetime version (SCQ)29) for 24,420 autistic individuals from the Simons Simplex Collection (SSC)30 and the Simons Foundation Powering Autism Research for Knowledge (SPARK)31 cohorts.
In exploratory factor analyses (Methods), we tested 42 different factor models, including bifactor models (Supplementary Table 1 and Supplementary Fig. 1). We identified a correlated six-factor model with good theoretical interpretation (Supplementary Fig. 2), and confirmatory factor analyses identified fair fit indices (comparative fit indices (CFI), 0.92–0.94; Tucker–Lewis indices (TLI), 0.92–0.94; root mean square errors of approximation (RMSEA), 0.056–0.060). Fit indices increased modestly when including orthogonal method factors in the model (Supplementary Table 1). The explained common variances and hierarchical Ω values for the bifactor models were low (<0.8), suggesting that general factors may not explain the data well (Supplementary Table 2). The six identified factors are (1) insistence on sameness (F1), (2) social interaction at the age of five years (F2), (3) sensory–motor behavior (F3), (4) self-injurious behavior (F4), (5) idiosyncratic repetitive speech and behavior (F5) and (6) communication skills (F6) (Supplementary Table 3). These broadly correspond to four restricted, repetitive and sensory behavior factors, that is, non-social factors (insistence on sameness, sensory–motor behavior, self-injurious behavior and idiosyncratic repetitive speech and behavior) and two social factors (social interaction and communication skills).
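The fit indices quoted above can be illustrated with their textbook definitions. The sketch below computes CFI, TLI and RMSEA from a model chi-square and a baseline (independence-model) chi-square. It is only an illustration: the function name and all input values are hypothetical, and the estimator used for the ordinal questionnaire items in the study may apply robust corrections that this simple version does not.

```python
import math

def fit_indices(chi2_model, df_model, chi2_baseline, df_baseline, n_obs):
    """Textbook CFI, TLI and RMSEA from model and baseline chi-square statistics."""
    # Estimated non-centrality of each model, floored at zero.
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, 0.0)
    denom = max(d_model, d_baseline)
    cfi = 1.0 - d_model / denom if denom > 0 else 1.0
    # TLI compares chi-square/df ratios of the baseline and fitted models.
    ratio_baseline = chi2_baseline / df_baseline
    ratio_model = chi2_model / df_model
    tli = (ratio_baseline - ratio_model) / (ratio_baseline - 1.0)
    # RMSEA with the (N - 1) convention; some software divides by N instead.
    rmsea = math.sqrt(d_model / (df_model * (n_obs - 1)))
    return cfi, tli, rmsea

# Hypothetical inputs chosen for illustration; these are not values from the study.
cfi, tli, rmsea = fit_indices(300.0, 100, 2600.0, 120, 1001)
```

Higher CFI and TLI and lower RMSEA indicate better fit, which is how the ranges reported for the six-factor model should be read.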
All interfactor correlations were significant and moderate to high in magnitude, with higher correlations within the non-social factors and within the social factors than between social and non-social factors (Fig. 1a). Sex differences were minimal (Cohen’s d < 0.1; Fig. 1b and Supplementary Table 4a). All factors were negatively correlated with full-scale IQ (Fig. 1c, Supplementary Fig. 3 and Supplementary Table 4b). In these cross-sectional data, older participants had lower factor scores (that is, fewer difficulties), with the exception of ‘social interaction’ (Fig. 1d), in line with previous research32. Alternatively, this could reflect diagnostic bias. However, of the 21 items in the ‘social interaction’ factor, 19 specifically ask about behavior between the ages of 4 and 5 years (Methods), and this trajectory likely reflects recall bias, as caregivers are likely to report more severe behaviors retrospectively33. Similar trends were observed in both males and females (Supplementary Fig. 4). Of the six factors and RBS and SCQ, only insistence on sameness (F1) and self-injurious behavior (F4) had significant SNP heritability (Supplementary Table 5). There were moderate to high genetic correlations among the six factors (Supplementary Table 6).
Common genetic variants are associated with core autism features
We next conducted association analyses between 19 different core and associated features and different classes of genetic variants (Methods). We first investigated the association between the 19 features and PGS for autism (iPSYCH autism data freeze), intelligence34, educational attainment35, attention-deficit–hyperactivity disorder (ADHD)36 and schizophrenia37 and, as a negative control, hair color38 (n = 2,421–12,893, Supplementary Table 7). In multiple regression analyses, ADHD PGS were associated with increased non-social core autism features (total scores on the RBS, insistence on sameness, sensory–motor behavior and self-injurious factor scores) (Fig. 2 and Supplementary Table 8). Intelligence PGS were associated with increased full-scale and nonverbal IQ. Educational attainment PGS were associated with increased full-scale and verbal IQ and reduced scores on core autism features. Schizophrenia PGS were associated with reduced adaptive behavior, measured using the composite score of the Vineland Adaptive Behavior Scales. Moderate heterogeneity (I2 > 50%) was observed only for 10% of the associations. The majority of the significant associations (12 of 15) had concordant effect directions in all cohorts (Supplementary Fig. 5). We did not identify any significant genotype–phenotype association using hair color (blonde versus other) as a negative control (Supplementary Table 8).
In line with previous results17,20,21, the number of high-impact de novo variants (protein-truncating single-nucleotide variants (SNVs) and structural variants and missense variants with missense badness, PolyPhen-2 and constraint (MPC) score >2, n = 2,863–4,442) was associated with reduced measures of IQ, adaptive behavior and motor coordination but not core autism features (Fig. 2 and Supplementary Table 9). The effect sizes of the PGS were not attenuated after controlling for the presence of high-impact de novo variants (Supplementary Table 9), which was true even for full-scale IQ.
In autistic individuals, full-scale IQ decreased with increasing number of high-impact de novo variants but increased with increasing PGS for intelligence (Fig. 3a). No strong evidence of interaction between PGS for intelligence and high-impact de novo variants was observed, suggesting their additive effects on full-scale IQ. Among the significant genotype–phenotype associations, accounting for full-scale IQ did not attenuate the effects of PGS on core autism features (Fig. 3b and Supplementary Table 10), which was supported by minimal and statistically non-significant genetic correlations between full-scale IQ and the core autism features (Supplementary Table 6). By contrast, associations between high-impact de novo variants and associated autism features were attenuated, partly because of the moderate phenotypic correlations between these features and full-scale IQ (Fig. 4c).
Core autism phenotypes in high-impact de novo carriers
While high-impact variants in some autism-associated genes lead to core autistic features, notably in animal models (for example, refs. 39,40), as a group, they were not robustly associated with core autism features in this study (Fig. 2). It is unclear whether the latent structure of core phenotypes differs in autistic individuals with high-impact de novo variants (henceforth, carriers) compared to autistic individuals without any known high-impact de novo variant (henceforth, non-carriers). We thus investigated differences in the latent structure of core autism phenotypes between carriers (n = 325) and non-carriers (n = 2,727). Although likelihood-ratio tests identified significant configural invariance violation (that is, the factor structure was dissimilar across groups, P < 2 × 10−16), this was due to the relatively large sample size: the fit indices and visual inspections of the latent structure suggested that the differences were minimal (Supplementary Table 11).
Given this, we first investigated whether autistic carriers had higher PGS for autism than non-carriers, which may account for core autism features in carriers (additivity). As demonstrated previously but with a different set of PGS19, autistic carriers had lower PGS for autism than autistic non-carriers (βPGS = −0.16, s.e. = 0.045, P = 3.67 × 10−4, linear regression; Fig. 4a). This difference was not observed for PGS for educational attainment, IQ or schizophrenia (Supplementary Table 12). However, while autistic non-carriers had higher PGS than non-autistic siblings (βPGS = 0.19, s.e. = 0.023, P = 2.68 × 10−15, logistic regression), autistic carriers (n = 579) were indistinguishable from non-autistic siblings (n = 3,681) based on autism PGS (βPGS = 0.028, s.e. = 0.045, P = 0.53, logistic regression; Supplementary Fig. 6).
The PGS in a trio with an affected child can be summarized as the parental mean PGS (henceforth, midparental PGS) and the deviation of the affected child’s PGS from the midparental PGS. As previously reported14, with this expanded sample size, we identified an overtransmission of autism PGS to autistic individuals (mean = 0.17, s.e. = 0.01, n = 6,981, P < 2 × 10−16) and, curiously, a modest undertransmission to unaffected siblings (mean = −0.03, s.e. = 0.02, n = 3,832, P = 0.034) (Fig. 4b and Supplementary Table 13). This likely reflects both reproductive stoppage41 and underdiagnosis of autism in the parental generation42. Carriers had a modest overtransmission of autism PGS (mean = 0.08, s.e. = 0.04, n = 579, P = 0.02), while this was substantially higher in non-carriers (mean = 0.18, s.e. = 0.01, n = 4,997, P < 2 × 10−16). Notably, while carriers had significantly lower overtransmission than non-carriers (P = 0.02), they had a significantly higher overtransmission than siblings (PGS; P = 9.1 × 10−3), providing additional support for additivity of common and rare genetic variants.
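The trio summary described above (midparental PGS and the child's deviation from it) can be sketched as follows. This is a minimal illustration rather than the study's pipeline: the function name and toy scores are hypothetical, and scaling the deviation by the standard deviation of the midparental PGS is an assumption based on the usual pTDT formulation.

```python
import math
import statistics

def ptdt_deviation(child_pgs, mother_pgs, father_pgs):
    """Mean standardized deviation of child PGS from midparental PGS, with s.e. and t statistic."""
    # Midparental PGS is the mean of the two parental scores in each trio.
    midparent = [(m + f) / 2.0 for m, f in zip(mother_pgs, father_pgs)]
    # Assumption: deviations are scaled by the s.d. of the midparental PGS.
    sd_mid = statistics.stdev(midparent)
    deviations = [(c - mp) / sd_mid for c, mp in zip(child_pgs, midparent)]
    mean_dev = statistics.fmean(deviations)
    se_dev = statistics.stdev(deviations) / math.sqrt(len(deviations))
    # One-sample test of the mean deviation against zero (no over- or undertransmission).
    return mean_dev, se_dev, mean_dev / se_dev

# Toy trios for illustration only; not data from the study.
mothers = [0.0, 1.0, -1.0, 0.5]
fathers = [0.0, -1.0, 1.0, 1.5]
children = [0.5, 0.0, 0.5, 1.0]
mean_dev, se_dev, t_stat = ptdt_deviation(children, mothers, fathers)
```

A positive mean deviation corresponds to overtransmission and a negative one to undertransmission; with thousands of trios the t statistic is effectively a Z statistic.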
A second hypothesis is that the effect of high-impact de novo variants on core autism features is partly mediated by associated autism features, given the modest negative correlation between them (Fig. 4c). Given that high-impact de novo variants are associated with a relatively sizeable reduction in both full-scale IQ and motor coordination, we reasoned that there would be a knock-on effect on core autism features. The fact that we did not observe a significant association between high-impact de novo variants and core autism features (Fig. 2b) may be due to attenuated correlations between core and associated features in carriers compared to non-carriers21. However, tests of matrix correlation equivalence suggested no differences in the phenotypic correlation structures of carriers and non-carriers (P = 9.25 × 10−4, Jennrich test for matrix equivalency). This was supported by the finding of no differences in pairwise Pearson’s correlation coefficients between each of the three associated features and the six factors, SCQ and RBS between carriers and non-carriers (Fisher’s Z-test, all P > 0.05).
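The pairwise comparison of correlation coefficients between carriers and non-carriers can be sketched with the standard Fisher r-to-z test for two independent samples. The function name and the correlation values below are hypothetical; only the group sizes (325 carriers and 2,727 non-carriers) are taken from the text, and the actual per-pair sample sizes in the study may differ.

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Two-tailed test of equality of two Pearson correlations from independent samples."""
    # Fisher transformation makes the sampling distribution approximately normal.
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-tailed P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Hypothetical correlations between an associated feature and a core factor.
z_stat, p_value = fisher_z_test(-0.20, 325, -0.25, 2727)
```

Because the carrier group is much smaller, its sample size dominates the standard error, so only fairly large differences in correlation would be detectable.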
An alternative explanation is that we are underpowered to observe this effect. We used simulations to investigate whether we had sufficient statistical power to identify associations between high-impact de novo variants and core autism features. Assuming that all effects are completely mediated by only one of the three associated features (full-scale IQ, adaptive behavior or motor coordination), power calculations indicate that we had less than 80% power for all core autism features tested (Fig. 4d). Larger samples may identify significant effects between high-impact de novo variants and core autism features, but it will be important to investigate whether the associations are mediated by associated autism features. However, neither of these two hypotheses excludes the possibility that different classes of de novo variants (for example, missense versus protein-truncating, de novo variants in specific functional categories) may be associated with core autism features.
Autism PGS and co-occurring developmental disabilities
Multiple co-occurring developmental disabilities are another source of heterogeneity among autistic individuals. While co-occurring developmental disabilities are associated with high-impact de novo variants15,17,20, it is unclear whether they are impacted by PGS for autism. In the SPARK study, in line with previous research15,17,20, carriers of high-impact de novo variants had increased counts of co-occurring developmental disabilities (βde novo = 0.31, s.e. = 0.05, P = 1.55 × 10−8, n = 3,089; quasi-Poisson regression). By contrast, higher PGS for autism was associated with reduced count of co-occurring developmental disabilities (βPGS = −0.037, s.e. = 0.009, P = 3.91 × 10−5, n = 13,435, quasi-Poisson regression), even after accounting for the other three PGS (Fig. 5a and Supplementary Table 14a). Leave-one-out analyses indicated that the results were not driven by any one developmental disability (Supplementary Fig. 7). Notably, autistic individuals with five or more co-occurring developmental disabilities did not have statistically higher autism PGS than non-autistic siblings (Fig. 5a and Supplementary Table 14b). By contrast, even when restricting to autistic individuals with no co-occurring developmental disabilities, individuals with a high-impact de novo variant were more likely to be autistic than non-autistic siblings (Fig. 5a and Supplementary Table 14b).
The apparent negative association between autism PGS and co-occurring developmental disabilities has not, to our knowledge, been reported earlier. This can reflect both a true negative association (for example, PGS for autism increase IQ in both the general population16,43 and in autistic individuals as seen in Fig. 2a) and the negative correlation between high-impact de novo variants and autism PGS. To better delineate this, we investigated the association between the two classes of genetic variants and two well-characterized developmental phenotypes: age of walking independently and age of first words. In autistic individuals, autism PGS were associated with earlier age of walking (βPGS = −0.012, s.e. = 0.003, P = 3.2 × 10−5, negative binomial regression) and earlier age of first words (βPGS = −0.0125, s.e. = 0.005, P = 0.01, negative binomial regression), while high-impact de novo variants increased the age for both phenotypes (Fig. 5b and Supplementary Table 15b). The association between autism PGS and age of walking but not age of first words remained statistically significant after accounting for high-impact de novo variants and full-scale IQ (Supplementary Table 15a). Similarly, the association between high-impact de novo variants and age of walking but not age of first words remained significant after accounting for full-scale IQ (Supplementary Table 15a). However, autism PGS were not significantly associated with either age of walking or age of first words in siblings (Supplementary Table 15a). Despite the negative association between autism PGS and the two phenotypes, even autistic individuals in the highest decile of autism PGS had higher mean age of walking and age of first words than siblings, as did autistic non-carriers (Fig. 5b) and autistic individuals with no co-occurring developmental disability, suggesting other sources of variation in these phenotypes (Supplementary Table 15b).
There is likely heterogeneity even within the broad class of constrained genes, with differential impact on autism vis-à-vis co-occurring developmental disabilities. Previous research has attempted to disentangle this heterogeneity by comparing counts of disrupting de novo variants in autism versus those in severe developmental disorders (genetically undiagnosed developmental disorders with accompanying ID and/or developmental delays)17. The lack of detailed phenotypic information in the cohorts assessed renders the previous research difficult to interpret44. Here we take a different approach to revisit this question. Using the more detailed data on co-occurring developmental disabilities in the SPARK study, we investigated whether constrained genes robustly associated with severe developmental disorders (DD genes)27 have differential effects on co-occurring developmental disabilities in autistic individuals compared to other constrained genes (non-DD genes). We use the term ‘non-DD genes’ for convenience as this list is also likely to contain genes associated with severe developmental disorders that may be discoverable at larger sample sizes but are likely less penetrant (that is, lower effect size) or lead to increased prenatal or perinatal death (that is, rarer) compared to variants in the DD genes27.
In the SPARK cohort, 35.6% of the carriers had high-impact de novo variants in DD genes. Autistic individuals were more likely to be carriers of either set of genes than non-autistic siblings, which was observed even when restricting to autistic individuals without any known co-occurring developmental disability (Fig. 5c and Supplementary Table 14c,d). However, while the risk for the count of co-occurring developmental disabilities was elevated in carriers of DD genes (βde novo = 0.54, s.e. = 0.08, P = 6.48 × 10−12; quasi-Poisson regression), this was much more modest for carriers of non-DD genes (βde novo = 0.15, s.e. = 0.07, P = 0.035; quasi-Poisson regression). Supporting this, autistic carriers of high-impact de novo variants in DD genes started walking independently and using words ~3 months later than autistic carriers of high-impact de novo variants in non-DD genes (P < 0.05 in both; Fig. 5b and Supplementary Table 15b). These results support a broad phenotypic distinction between the two sets of genes. We ran sensitivity analyses using a larger but overlapping list of genes identified from a highly curated database, Developmental Disorder Gene-to-Phenotype45, and identified consistent results (Supplementary Tables 14 and 15).
Sex differences in common and high-impact de novo variants
We next turned to another potential source of heterogeneity: sex. Autistic females are more likely to have high-impact de novo variants than autistic males17,26,46,47, which is thought to support the ‘female protective effect’ in autism13,46. However, a similar effect is observed in severe developmental disorders more generally and is entirely explained by a relatively small number of genes significantly associated with severe developmental disorders (that is, DD genes)48. We thus revisited sex differences in high-impact de novo variants using data from the SPARK and SSC studies (Supplementary Table 16), restricting our analyses to autosomal genes. Across all high-impact de novo variants, autistic females were more likely to be carriers than males (relative risk (RR) = 1.52; 95% confidence interval, 1.27–1.81). However, this was explained entirely by high-impact de novo variants in DD genes (DD genes, RR = 2.53, 95% confidence interval = 1.91–3.35; non-DD genes, RR = 1.14, 95% confidence interval = 0.89–1.46) (Fig. 6a). This sex difference in DD genes remained and was not attenuated after accounting for the total number of co-occurring developmental disabilities in the SPARK cohort (unconditional estimates, βde novo = 0.83, s.e. = 0.21, P = 8.15 × 10−5; conditional estimates, βde novo = 0.82, s.e. = 0.22, P = 3.53 × 10−4; logistic regression) and after accounting for full-scale IQ and motor coordination scores in the SSC and SPARK cohorts (unconditional estimates, βde novo = 1.10, s.e. = 0.15, P = 3.42 × 10−13; conditional estimates, βde novo = 1.31, s.e. = 0.20, P = 8.19 × 10−11; logistic regression). We did not observe sex differences for either gene set in siblings (P > 0.05). These results suggest that sex differences in high-impact de novo variants are driven by a relatively small set of highly constrained genes that also increase the likelihood of co-occurring developmental disabilities in autism.
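The relative risks above can be illustrated with a standard calculation from carrier counts, using a Wald interval on the log scale. The counts in the example are hypothetical, because the underlying counts are given in the supplementary tables rather than in the text, and whether the study used this exact interval is an assumption. The second part checks only that the reported intervals are approximately symmetric around the point estimate on the log scale, which is what such an interval would produce.

```python
import math

def relative_risk(carriers_a, total_a, carriers_b, total_b, z_crit=1.959964):
    """Relative risk of group A versus group B with a log-scale Wald 95% interval."""
    rr = (carriers_a / total_a) / (carriers_b / total_b)
    # Standard error of log(RR) for two independent binomial proportions.
    se_log = math.sqrt(1 / carriers_a - 1 / total_a + 1 / carriers_b - 1 / total_b)
    lower = math.exp(math.log(rr) - z_crit * se_log)
    upper = math.exp(math.log(rr) + z_crit * se_log)
    return rr, lower, upper

# Hypothetical counts: 30 of 100 carriers in group A versus 20 of 100 in group B.
rr, ci_lower, ci_upper = relative_risk(30, 100, 20, 100)

# Reported estimates from the text: (RR, lower bound, upper bound).
reported = {
    "all genes": (1.52, 1.27, 1.81),
    "DD genes": (2.53, 1.91, 3.35),
    "non-DD genes": (1.14, 0.89, 1.46),
}
# For a log-symmetric interval, sqrt(lower * upper) recovers the point estimate.
geometric_means = {name: math.sqrt(lo * hi) for name, (_, lo, hi) in reported.items()}
```

An interval that excludes 1, as for all genes and DD genes, indicates a sex difference in carrier frequency, whereas the non-DD interval spans 1.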
Both the contribution of PGS (Fig. 5a) and the male:female ratio are higher in autistic individuals without ID than in those with ID, suggesting that polygenic likelihood for autism may differ between sexes at IQ scores of 70 or above. Recent studies have found higher PGS for autism in females than in males19 and greater overtransmission of PGS for autism in female non-carriers than in male carriers49, yet neither has stratified by ID. We conducted sex-stratified polygenic transmission disequilibrium tests (pTDT) to investigate this (nmax = 6,981 autistic trios). While PGS for autism were overtransmitted in both male and female probands, this overtransmission did not differ by sex (Fig. 6 and Supplementary Table 17). However, in autistic individuals without ID (IQ > 70), females had ~75% higher overtransmission of autism PGS than males (P = 0.02, two-tailed Z-test; Fig. 6b), which was observed even when using the sex-stratified autism genome-wide association study (GWAS) (Supplementary Table 17). When additionally removing individuals with borderline intellectual functioning (IQ < 90), females had double the overtransmission of autism PGS compared to males (females, mean = 0.34, s.e. = 0.06, n = 276; males, mean = 0.17, s.e. = 0.03, n = 1,328; difference, P = 0.01, two-tailed Z-test). We did not find any sex difference in overtransmission for autistic individuals with ID or autistic carriers of a high-impact de novo variant or non-autistic siblings. This sex difference in overtransmission was not observed for PGS for educational attainment and intelligence, suggesting that the results are not due to differences in IQ scores between sexes. Furthermore, there was no difference in midparental PGS scores, family income or parent education by sex or ID (P > 0.05 for all comparisons), factors correlated with participation in research50. This suggests that these results are unlikely to be explained by sex differences in participation.
We cannot, however, distinguish the female protective effect due to common or rare variants from diagnostic bias in the current study24,51.
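The two-tailed Z-test used for the sex comparison of overtransmission can be reproduced from the rounded means and standard errors quoted above for autistic individuals after removing those with ID or borderline intellectual functioning (females, 0.34 and 0.06; males, 0.17 and 0.03). The function name is hypothetical, and treating the two groups as independent is an assumption of this sketch.

```python
import math

def two_sample_z(mean_a, se_a, mean_b, se_b):
    """Two-tailed Z-test for the difference between two independent estimates."""
    # The variance of the difference is the sum of the squared standard errors.
    z = (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Rounded values reported in the text: females versus males.
z_stat, p_value = two_sample_z(0.34, 0.06, 0.17, 0.03)
```

With these rounded inputs the P value is close to the reported 0.01; small discrepancies are expected because the published summary statistics are rounded to two decimal places.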
Sex and ID impact SNP heritability
Finally, we investigated the impact of this heterogeneity on SNP heritability calculated using GREML52,53 and phenotype correlation–genotype correlation (PCGC)54 with individuals from the ABCD cohort as population controls (Methods). All heritability estimates are reported on the liability scale (Fig. 7a and Supplementary Table 18).
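For readers unfamiliar with the liability scale, the sketch below shows the standard transformation of an observed-scale heritability estimate from a case-control sample, which depends on the population prevalence and the proportion of cases in the sample. It is offered as an illustration of the general approach, not as a description of the exact options used in this study; the function name is hypothetical and no prevalence value is implied.

```python
import math
from statistics import NormalDist

def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Convert observed-scale SNP heritability to the liability scale."""
    standard_normal = NormalDist()
    # Liability threshold above which individuals are affected.
    threshold = standard_normal.inv_cdf(1.0 - prevalence)
    # Height of the standard normal density at the threshold.
    height = standard_normal.pdf(threshold)
    factor = (prevalence ** 2 * (1.0 - prevalence) ** 2) / (
        height ** 2 * case_fraction * (1.0 - case_fraction)
    )
    return h2_observed * factor

# Sanity check with prevalence and case fraction both 0.5: the factor equals pi / 2.
check_factor = observed_to_liability(1.0, 0.5, 0.5)
```

Because the conversion factor changes with the assumed prevalence, comparisons between strata are more convincing when they hold across a range of prevalence values.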
We identified a modest SNP heritability for autism (GCTA, h2SNP = 0.29, s.e. = 0.02; PCGC, h2SNP = 0.29, s.e. = 0.03), which is higher than estimates from iPSYCH16 but lower than estimates from the AGRE55 and PAGES56 cohorts. Autistic individuals with ID had lower SNP heritability than autistic individuals without ID (P = 1.6 × 10−3, two-tailed Z-test). SNP heritability for autism in autistic carriers compared to general population controls (agnostic of carrier status) was modest (GCTA, h2SNP = 0.20, s.e. = 0.05; PCGC, h2SNP = 0.14, s.e. = 0.08), which is similar to the SNP heritability observed for autistic individuals with ID. However, when comparing autistic high-impact de novo carriers with autistic non-carriers, the SNP heritability was not statistically significant (GCTA, h2SNP = 0.14, s.e. = 0.14; PCGC, h2SNP = 0.15, s.e. = 0.19), suggesting that the observed SNP heritability for autistic carriers reflects autism rather than factors associated with the generation of germline mutations57,58. This result is in line with our pTDT analyses, which identify an overtransmission of PGS in carriers, and previous research that has identified a smaller yet significant heritability for severe developmental disorders59.
Stratifying by sex had the largest effect on SNP heritability (Fig. 7b). Males had approximately 70% higher SNP heritability than females (P = 9.3 × 10−3, two-tailed Z-test). This difference was observed across a range of prevalence estimates (Fig. 7c and Supplementary Table 19), after downsampling the number of autistic males to match the number of autistic females (Supplementary Table 18) and after varying the male:female ratio to 3.3:1 to account for diagnostic bias51 (Supplementary Table 18). By contrast, stratifying individuals by high scores (1 s.d. above the mean) on the core autism phenotypes or a combination of two core autism phenotypes modestly reduced or did not alter the SNP heritability for autism (Fig. 7a and Supplementary Table 18).
Individual differences among autistic individuals in core and associated features are complex and genetically multifactorial. High-impact de novo variants and PGS have differential and often independent effects on these features. There is additivity between common and high-impact de novo variants in autism. These represent the most widely studied classes of genetic variants in autism thus far, yet emerging evidence suggests a role for other classes (for example, rare inherited and de novo tandem repeats) of genetic variants as well17,19,60,61. However, the negative correlation that we observed between high-impact de novo variants and autism PGS may not extend to the general population. Because we have focused only on autistic individuals and not the general population, we may have induced a negative correlation between them, as individuals need either a high PGS or high-impact de novo variants to cross the diagnostic threshold.
The two classes of genetic variants do not have the same effects on either the core or associated autism phenotypes nor on co-occurring developmental disabilities. The negative association between autism PGS and co-occurring developmental disabilities reflects both a true negative association (for example, for IQ43) and the additivity between rare and common variants.
We observe sizeable differences in both common and high-impact de novo variants based on sex and ID. While these results may be interpreted as providing support for the female protective effect13,46, this interpretation is not straightforward. First, the increased likelihood of being a carrier of high-impact de novo variants was observed only with genes associated with severe developmental disorders, not for other constrained genes, despite both sets of genes increasing the likelihood for autism. This suggests that the female protective effect may be for severe developmental disorders rather than for autism specifically, which warrants further investigation. Second, the higher overtransmission of autism PGS must be interpreted alongside the reduced SNP heritability of autism in females. Assuming high genetic correlation between males and females, reduced SNP heritability in females suggests that higher PGS are required to reach the equivalent levels of genetic likelihood in males62. Yet this raises another important question: why do autistic females have lower SNP heritability than autistic males? Does this reflect ascertainment bias in the GWAS cohorts, diagnostic bias, diagnostic overshadowing, camouflaging or masking and/or social stigma7,24,51? Several social factors can influence diagnosis in a sex-differential manner, and investigating this is paramount to understanding sex-differential genetic effects.
In conclusion, our findings have important implications for using genetics to understand autism. We need deeper phenotyping at scale and need to account for the evolving diagnostic criteria for autism63.
For factor analyses, we restricted our analyses to autistic individuals from the SSC and SPARK cohorts. Participants had to have completed the two phenotypic measures (details are below) to be included in the factor analyses. We also excluded autistic individuals with incomplete entries in either of the two measures (n = 5,754 only in the SPARK cohort). This resulted in 1,803 participants (n = 1,554 males) in the SSC, 14,346 participants (n = 11,440 males) in SPARK version 3 and 8,271 participants (n = 6,262 males) in extra entries from SPARK version 5 (SSC, mean age = 108.75 months, s.d. = 43.29 months; SPARK version 3, mean age = 112.11 months, s.d. = 46.43 months; SPARK version 5, mean age = 111.22 months, s.d. = 48.19 months). Only the SCQ was available for siblings in the SPARK study.
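As a quick arithmetic check, the three cohort subsets listed above can be summed to confirm that they give the 24,420 autistic individuals quoted for the factor analyses. The dictionary labels are shorthand introduced here for illustration.

```python
# (participants, males) for each subset, as reported in the text.
cohorts = {
    "SSC": (1803, 1554),
    "SPARK v3": (14346, 11440),
    "SPARK v5 (extra entries)": (8271, 6262),
}

total_participants = sum(n for n, _ in cohorts.values())
total_males = sum(m for _, m in cohorts.values())
# Proportion of males in the combined factor-analysis sample.
male_fraction = total_males / total_participants
```

The combined sample is predominantly male, which is relevant when interpreting the sex comparisons of factor scores.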
We conducted analyses using data from four cohorts of autistic individuals: the SSC (n = 8,813)30, the Autism Genetic Resource Exchange (AGRE, CHOP sample) (nmax = 1,200)64, the AIMS-2-TRIALS Longitudinal European Autism Project (LEAP) sample (nmax = 262)65 and SPARK (n = 29,782)31. For sibling comparisons, we included siblings from the SSC (n = 1,829) and SPARK (n = 12,260) cohorts. For trio-based analyses, we restricted to complete trios in the SSC (n = 2,234) and SPARK (n = 4,747) cohorts. For all analyses, we restricted the sample to autistic individuals who passed genetic quality control (QC) and who had phenotypic information.
We conducted factor analyses using the SCQ29 and the RBS28. The SCQ is a widely used caregiver report of autistic traits capturing primarily social communication difficulties and, to a lesser extent, repetitive and restricted behaviors29. There are 40 binary (yes-or-no) questions in total, with the first question focusing on the individual’s ability to use phrases or sentences (total score, 0–39). We used the Lifetime version rather than the current version as this was available in both the SPARK and SSC studies. Of note, in the Lifetime version, questions 1–19 are about behavior over the lifetime, while questions 20–40 refer to behavior between the ages of 4 and 5 years or in the last 12 months if the participant is younger. We excluded participants who could not communicate using phrases or sentences (n = 217 in the SSC and n = 17,092 in SPARK) as other questions in the SCQ were not applicable to this group of participants. The RBS is a caregiver-reported measure of presence and severity of repetitive behaviors over the last 12 months. It consists of 43 questions assessed on a four-point Likert scale (total score, 0–129). Higher scores on both measures indicate greater autistic traits.
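The score ranges described above follow directly from the item counts, as the sketch below shows. The function names are hypothetical, and the code assumes that item responses have already been coded so that higher values indicate greater autistic traits; any reverse-keyed items would need to be recoded beforehand.

```python
def scq_total(item_scores):
    """Total SCQ score from 40 binary items; item 1 (phrase speech) is not counted."""
    if len(item_scores) != 40:
        raise ValueError("SCQ expects 40 item responses")
    if any(score not in (0, 1) for score in item_scores):
        raise ValueError("SCQ items must be coded 0 or 1")
    # Items 2-40 contribute to the total, giving a range of 0-39.
    return sum(item_scores[1:])

def rbs_total(item_scores):
    """Total RBS score from 43 items rated on a four-point scale (0-3)."""
    if len(item_scores) != 43:
        raise ValueError("RBS expects 43 item responses")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("RBS items must be coded 0 to 3")
    # 43 items with a maximum of 3 each give a range of 0-129.
    return sum(item_scores)

max_scq = scq_total([1] * 40)
max_rbs = rbs_total([3] * 43)
```

Requiring complete item vectors mirrors the exclusion of participants with incomplete entries in either measure.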
Exploratory factor analyses
We conducted exploratory factor analysis on a random half of the SSC (n = 901 individuals, of which 782 were males) using ‘promax’ rotation to identify correlated factors as implemented by ‘psych’ (ref. 66) in R. We conducted three sets of exploratory correlated factor analyses: for all items, for social items and for non-social items. Previous studies have provided support for a broad dissociation between social and non-social autism features12,23 and have conducted separate factor analyses of social (for example, refs. 67,68) and non-social autism features (for example, refs. 69,70). Thus, we reasoned that separating items into social and non-social categories might aid the identification of covariance structures that may not be apparent when analyzing all items together. We divided the data into social (all of the SCQ except item 1 and nine other items and item 28 from the RBS) and non-social (nine items from the SCQ (items 8, 11, 12 and 14–18) and all items from RBS except item 28) items, which was carried out after discussion between V.W. and X.Z. The ideal number of factors to be extracted was identified from examining the scree plot (Supplementary Fig. 2), parallel analyses and theoretical interpretability of the extracted factors. However, we examined all potential models using confirmatory factor analyses as well to obtain fit indices, and the final model was identified using both exploratory and confirmatory factor analyses.
We then applied the model configurations from ‘promax’ rotated exploratory factor analysis for bifactor models to explore the existence of general factor(s). In addition to a single general factor bifactor model, we divided the data into social and non-social items as mentioned earlier and applied bifactor models separately for the social and non-social items. Hierarchical Ω values and explained common variances were then calculated for potential models as extra indicators of the feasibility of bifactor models, but hierarchical Ω values were not greater than 0.8 for most of the models tested, and explained common variances were not greater than 0.7 (refs. 71,72,73) for any of the models tested (Supplementary Table 2).
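For readers unfamiliar with the two bifactor indices, the sketch below shows how explained common variance and hierarchical Ω are commonly computed from standardized loadings. It assumes an orthogonal bifactor model in which every item loads on the general factor and on exactly one group factor. The function name and the toy loadings are hypothetical and are not taken from Supplementary Table 2.

```python
def bifactor_indices(items):
    """Return (explained common variance, hierarchical omega).

    items: list of (general_loading, group_label, group_loading) tuples,
    one per questionnaire item, with standardized loadings.
    """
    # Explained common variance: share of common variance due to the general factor.
    general_sq = sum(g ** 2 for g, _, _ in items)
    group_sq = sum(s ** 2 for _, _, s in items)
    ecv = general_sq / (general_sq + group_sq)

    # Hierarchical omega: general-factor variance over total score variance.
    general_sum_sq = sum(g for g, _, _ in items) ** 2
    group_sums = {}
    for _, label, s in items:
        group_sums[label] = group_sums.get(label, 0.0) + s
    group_sum_sq = sum(v ** 2 for v in group_sums.values())
    uniqueness = sum(1.0 - g ** 2 - s ** 2 for g, _, s in items)
    omega_h = general_sum_sq / (general_sum_sq + group_sum_sq + uniqueness)
    return ecv, omega_h


# Toy example: six items, two group factors of three items each.
toy_items = [(0.6, "social", 0.4)] * 3 + [(0.6, "non_social", 0.4)] * 3
ecv, omega_h = bifactor_indices(toy_items)
supports_bifactor = omega_h > 0.8 and ecv > 0.7  # thresholds quoted in the text
```

In this toy case both indices are about 0.69, so the model would not meet either threshold.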
Confirmatory factor analyses
Three rounds of confirmatory factor analyses (CFAs) were conducted: first for the second half of the SSC, followed by analysis of SPARK participants whose phenotypic data were available in version 3 of the data release and, finally, analysis of SPARK participants whose phenotypic data were available only in version 4 or version 5 of the data release and not in the earlier releases. To evaluate the models, multiple widely adopted fit indices were considered, including the comparative fit index (CFI), the Tucker–Lewis index (TLI) and the root mean square error of approximation (RMSEA). In CFAs, items were assigned only to the factor with the highest loading to attain parsimony. We conducted three broad sets of confirmatory factor analyses: (1) confirmatory factor analyses of all correlated factor models, (2) confirmatory factor analyses of the autism bifactor model and (3) confirmatory factor analyses of social and non-social bifactor models. For each of these confirmatory factor models, we limited the number of factors tested based on the slope of the scree plots and based on the number of items loading onto the factor (five or more). For the confirmatory factor analyses of social and non-social bifactor models, we iteratively combined various numbers of social and non-social group factors. In bifactor models, items that did not load onto the general factor in the corresponding exploratory factor analysis were excluded. Items were allocated to different group factors, which were identified based on the highest loading (items with loading <0.3 were excluded). Given the ordinal nature of the data, all CFAs were conducted using the diagonally weighted least-squares estimator in the R package lavaan 0.6-5 (ref. 74). We identified the model most appropriate for the data at hand as the one with TLI and CFI > 0.9 (TLI and CFI > 0.95 for bifactor models), low RMSEA and good theoretical interpretability based on discussions between V.W. and X.Z.
Additionally, as sensitivity analyses, the identified model (correlated six-factor model) was run again with two orthogonal method factors mapping onto SCQ and RBS-R to investigate if the fit indices remained high after accounting for covariance between items derived from the same measure, as the two measures differ subtly in the time period that they evaluate. We also reanalyzed the identified model after removing items that loaded onto multiple factors (>0.3 on two or more factors) to provide clearer theoretical interpretation of the model. For genetic analyses, we used factor scores from the correlated six-factor model without including the orthogonal method factors and without dropping the multi-loaded items.
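The fit indices used to compare the confirmatory models can be written in terms of the χ² statistics of the fitted model and of the baseline (independence) model. The sketch below uses the textbook definitions and hypothetical input values. It does not reproduce the scaled or robust variants that lavaan reports for the diagonally weighted least-squares estimator, and software packages differ on whether N or N − 1 enters the RMSEA formula.

```python
import math


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index from model and baseline chi-square values."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)
    denom = max(d_model, d_base)
    return 1.0 if denom == 0 else 1.0 - d_model / denom


def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index from model and baseline chi-square values."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)


def rmsea(chi2_model, df_model, n):
    """Root mean square error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n - 1)))


def meets_fit_thresholds(cfi_value, tli_value, bifactor=False):
    """Thresholds quoted in the text: > 0.9, or > 0.95 for bifactor models."""
    cutoff = 0.95 if bifactor else 0.9
    return cfi_value > cutoff and tli_value > cutoff


# Hypothetical chi-square values, for illustration only.
example_cfi = cfi(300.0, 200, 5000.0, 231)
example_tli = tli(300.0, 200, 5000.0, 231)
example_rmsea = rmsea(300.0, 200, 901)
```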
Genetic quality control
QC was conducted for each cohort separately by array. We excluded participants with genotyping rate <95%, excessive heterozygosity (±3 s.d. from the mean), non-European ancestry as detailed below, mismatched genetic and reported sex and, for families, those with Mendelian errors >10%. SNPs were excluded if they had a genotyping rate <90% or deviated from Hardy–Weinberg equilibrium (P < 1 × 10−6). Given the ancestral diversity in the SPARK cohort, Hardy–Weinberg equilibrium and heterozygosity were calculated in each genetically homogeneous population separately. Genetically homogeneous populations (corresponding to five super-populations: African, East Asian, South Asian, admixed American and European) were identified using the first five genetic principal components calculated using SPARK and 1000 Genomes Phase 3 populations75 and clustered using UMAP76. Principal components were calculated using linkage disequilibrium-pruned SNPs (r2 = 0.1, window size = 1,000 kb, step size = 500 variants, after removing regions with complex linkage disequilibrium patterns) using GENESIS77, which accounts for relatedness between individuals, calculated using KING78.
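The participant-level exclusions can be illustrated with a small filter. This is a sketch under stated assumptions rather than the pipeline used in the study: the record fields (`id`, `call_rate`, `het`, `sex_match`) are hypothetical, the toy values are invented, and the function is meant to be applied within one genetically homogeneous group at a time. Ancestry and Mendelian-error filters are omitted.

```python
from statistics import mean, stdev

MIN_CALL_RATE = 0.95  # participants below this genotyping rate are excluded
HET_SD_LIMIT = 3.0    # heterozygosity must lie within +/-3 s.d. of the mean


def filter_samples(samples):
    """Return the IDs of samples that pass the participant-level filters."""
    het_values = [s["het"] for s in samples]
    het_mean, het_sd = mean(het_values), stdev(het_values)
    kept = []
    for s in samples:
        if s["call_rate"] < MIN_CALL_RATE:
            continue  # low genotyping rate
        if abs(s["het"] - het_mean) > HET_SD_LIMIT * het_sd:
            continue  # excessive or deficient heterozygosity
        if not s["sex_match"]:
            continue  # genetic sex differs from reported sex
        kept.append(s["id"])
    return kept


# Toy data: 19 clean samples, one low call rate, one heterozygosity outlier.
toy_samples = [
    {"id": f"s{i}", "call_rate": 0.99, "het": 0.30, "sex_match": True}
    for i in range(19)
]
toy_samples.append({"id": "low_call", "call_rate": 0.90, "het": 0.30, "sex_match": True})
toy_samples.append({"id": "het_outlier", "call_rate": 0.99, "het": 0.60, "sex_match": True})
kept_ids = filter_samples(toy_samples)
```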
Imputation was conducted using the Michigan Imputation Server79 with 1000 Genomes phase 3 version 5 as the reference panel49 (for AGRE and SSC), with the HRC r1.1 2016 reference panel80 (for AIMS-2-TRIALS) or using the TOPMed imputation panel81 (for both releases of SPARK). Details of imputation have been previously reported82. SNPs were excluded from polygenic risk scores if they had minor allele frequency <1%, had an imputation r2 < 0.4 or were multi-allelic.
We restricted our PGS associations to four GWAS. First, we included a GWAS of autism from the latest release from the iPSYCH cohort (iPSYCH-2015)83. This includes 19,870 autistic individuals (15,025 males and 4,845 females) and 39,078 individuals without an autism diagnosis (19,763 males and 19,315 females). All individuals included in this GWAS were born between May 1980 and December 2008 to mothers who were living in Denmark. GWAS was conducted on individuals of European ancestry, with the first ten genetic principal components included as covariates using logistic regression as provided in PLINK. Further details are provided elsewhere49. We additionally included GWAS for educational attainment (n = 766,345, excluding the 23andMe dataset)35, intelligence (n = 269,867)34, ADHD (n = 20,183 individuals diagnosed with ADHD and 35,191 controls)36 and schizophrenia (69,369 individuals diagnosed with schizophrenia and 236,642 controls)37. These GWAS were selected given the relatively large sample size and modest genetic correlation with autism. Additionally, as a negative control, we included PGS generated from a GWAS of hair color (blonde versus other, n = 43,319 blondes and n = 342,284 others) from the UK Biobank, which was downloaded from https://atlas.ctglab.nl/traitDB/3495. This phenotype has SNP heritability comparable to that of the other GWAS used (h2 = 0.15, s.e. = 0.014), is unlikely to be genetically or phenotypically correlated with autism and related traits, and has a sample size large enough to be a reasonably well-powered negative control.
PGS were generated for three phenotypes using polygenic risk scoring with continuous shrinkage (PRS-CS)84, which is among the best-performing polygenic scoring methods using summary statistics in terms of variance explained85. In addition, this method bypasses the step of identifying a P-value threshold. We set the global shrinkage prior (φ) to 0.01, as is recommended for highly polygenic traits. Details of the SNPs included are provided in Supplementary Table 3.
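Conceptually, once PRS-CS has produced posterior SNP effect sizes, an individual's PGS is the weighted sum of their effect-allele dosages. The sketch below illustrates only that final scoring step. The dictionary layout, SNP identifiers and weights are hypothetical, and in practice this step is usually performed with dedicated genetics software rather than hand-written code.

```python
from statistics import mean, stdev


def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages over SNPs present in both inputs.

    dosages: dict mapping SNP ID to effect-allele dosage (0 to 2).
    weights: dict mapping SNP ID to posterior effect size.
    """
    shared = dosages.keys() & weights.keys()
    return sum(dosages[snp] * weights[snp] for snp in shared)


def standardize(scores):
    """Scale scores to mean 0 and s.d. 1 before using them as predictors."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]


# Hypothetical SNP IDs and weights, for illustration only.
toy_weights = {"rs1": 0.010, "rs2": -0.005, "rs4": 0.500}
toy_dosages = {"rs1": 2, "rs2": 1, "rs3": 0}
toy_score = polygenic_score(toy_dosages, toy_weights)  # rs3 and rs4 are ignored
```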
De novo variants were obtained from Antaki et al.19. De novo variants (structural variants and SNVs) were called for all SSC samples and a subset of the SPARK samples (phase 1 genotype release, SNVs only). To identify high-impact de novo SNVs, we restricted data to variants with a known effect on protein. These are damaging variants: ‘transcript_ablation’, ‘splice_acceptor_variant’, ‘splice_donor_variant’, ‘stop_gained’, ‘frameshift_variant’, ‘stop_loss’, ‘start_loss’ or missense variants with MPC86 scores >2. We further restricted data to variants in constrained genes with a LOEUF score <0.37 (ref. 87), which represent the topmost decile of constrained genes. For SVs, we restricted data to SVs affecting the most constrained genes, that is, those with LOEUF score <0.37, representing the first decile of most constrained genes. We did not make a distinction between deletions or duplications. To identify carriers, non-carriers and parents, we restricted our data to samples from the SPARK and SSC studies that had been exome sequenced and families in which both parents and the autistic proband(s) passed the genotyping QC.
For genes associated with severe developmental disorders (DD genes), we obtained the list of constrained genes significantly associated with severe developmental disorders from Kaplanis et al.27. To investigate the association of this set of genes with autism and developmental disorders, we first identified autistic carriers with a high-impact de novo variant and then divided this group into carriers who had at least one high-impact de novo variant in a DD gene and carriers with high-impact de novo variants in other constrained genes.
Only individuals with undiagnosed developmental disorders are recruited into the Deciphering Developmental Disorders study, and, as such, known genes associated with developmental disorders that are easy for clinicians to recognize and diagnose may be omitted from the genes identified by Kaplanis et al.27. To account for this bias, we ran sensitivity analyses using a larger but overlapping list of genes identified from the Developmental Disorder Gene-to-Phenotype database (DDG2P). From this database, we used constrained genes that are either ‘confirmed’ or ‘probable’ developmental disorder genes and genes for which heterozygous variants lead to developmental phenotypes (that is, mono-allelic or X-linked dominant).
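The variant and carrier definitions above amount to a set of filters, sketched below. The consequence labels, MPC threshold and LOEUF threshold are taken from the text. The record layout, the gene names and the `missense_variant` label used to recognize missense variants are assumptions for illustration, and the sketch does not distinguish SNVs from structural variants.

```python
DAMAGING = {
    "transcript_ablation", "splice_acceptor_variant", "splice_donor_variant",
    "stop_gained", "frameshift_variant", "stop_loss", "start_loss",
}
MPC_MIN = 2.0     # missense variants need an MPC score above this value
LOEUF_MAX = 0.37  # genes below this LOEUF score are treated as constrained


def is_high_impact(variant, loeuf_by_gene):
    """True for damaging or high-MPC missense variants in constrained genes."""
    loeuf = loeuf_by_gene.get(variant["gene"])
    if loeuf is None or loeuf >= LOEUF_MAX:
        return False
    if variant["consequence"] in DAMAGING:
        return True
    return (variant["consequence"] == "missense_variant"
            and variant.get("mpc", 0.0) > MPC_MIN)


def carrier_group(variants, loeuf_by_gene, dd_genes):
    """Classify one proband as 'dd', 'other_constrained' or 'non_carrier'."""
    hits = [v for v in variants if is_high_impact(v, loeuf_by_gene)]
    if not hits:
        return "non_carrier"
    if any(v["gene"] in dd_genes for v in hits):
        return "dd"
    return "other_constrained"


# Hypothetical genes and variants, for illustration only.
toy_loeuf = {"GENE_A": 0.20, "GENE_B": 0.30, "GENE_C": 0.90}
toy_dd_genes = {"GENE_A"}
proband_1 = [{"gene": "GENE_A", "consequence": "stop_gained"}]
proband_2 = [{"gene": "GENE_B", "consequence": "missense_variant", "mpc": 2.5}]
proband_3 = [{"gene": "GENE_C", "consequence": "frameshift_variant"}]
```

A proband with at least one qualifying variant in a DD gene is placed in the DD group even if they also carry variants in other constrained genes, mirroring the "at least one" wording above.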
Core and associated autism features
We identified 19 autism core and associated features that (1) are widely used in studies related to autism; (2) are a combination of parent-, self- and other-reported and performance-based measures to investigate if reporter status affects the PGS association; (3) are collected in all three cohorts; and (4) cover a range of core and associated features in autism. The core features are
ADOS88: social affect
ADOS88: restricted and repetitive behavior domain total score
ADI89: communication (verbal) domain total score
ADI89: restricted and repetitive behavior domain total score
ADI89: social domain total score
Parent-reported Social Responsiveness Scale-2 (ref. 90): total raw scores
Insistence of sameness factor (F1)
Social interaction factor (F2)
Sensory–motor behavior factor (F3)
Self-injurious behavior factor (F4)
Idiosyncratic repetitive speech and behavior (F5)
Communication skills factor (F6).
The associated features are
Vineland Adaptive Behavior Scales91: composite standard scores
Developmental Coordination Disorders Questionnaire92.
Measures of IQ were quantified using multiple methods across the range of IQ scores in the AGRE, SSC and LEAP studies. In the SPARK study, IQ scores were available based on parent reports on ten IQ score bins (Fig. 1c). We used these as full-scale scores. For analyses involving the SPARK and SSC cohorts, we converted full-scale scores from the SSC into IQ bins to match what was available from the SPARK study and treated them as continuous variables based on examination of the frequency histogram (Supplementary Fig. 8). For the six factors, we excluded individuals who were minimally verbal (Factor analyses), but these individuals were not excluded for analyses with other autism features.
We identified seven questions relating to developmental delay in the SPARK medical screening questionnaire. These are all binary questions (yes or no). Summed scores ranged from 0 to 7. The developmental phenotypes include the presence of
ID, cognitive impairment, global developmental delay or borderline intellectual functioning
Language delay or language disorder
Learning disability (learning disorder, including reading, written expression or math; or nonverbal learning disability)
Motor delay (for example, delay in walking) or developmental coordination disorder
Social (pragmatic) communication disorder (as included in DSM IV TR and earlier)
Speech articulation problems.
We included the age of first words and the age of walking independently for further analyses. This was recorded using parent-reported questionnaires in the SPARK study and in ADI-R89 in the SSC study. While other developmental phenotypes are available, we focused on these two, as they represent major milestones in motor and language development and are relatively well characterized.
Note on distribution of phenotypes and statistical analyses
Before any statistical analyses, we visually inspected the distributions of the variables. All continuous variables were approximately normally distributed with the exception of the ‘age of first words’, the ‘age of walking independently’ and the count of co-occurring developmental disabilities. For these three variables, we used quasi-Poisson or negative binomial regression to account for overdispersion in the data (that is, the variance was much greater than the mean). These models produced the same estimate but modestly different standard errors. Both have two parameters. However, while quasi-Poisson regression models the variance as a linear function of the mean, the negative binomial models the variance as a quadratic function of the mean. Of the two, we chose the model that produced the lower residual deviance. For all other continuous variables, we used linear regression and parametric tests. For binary data, we used logistic regression as there was not a large imbalance in the case:control ratio.
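The difference between the two count models lies in how the variance depends on the mean. The sketch below writes out both variance functions and a simple overdispersion check. The negative binomial form uses the parameterization with dispersion parameter θ (as in the MASS package), which is an assumption about the fitted models, and the toy counts are invented.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by the mean; values well above 1 suggest overdispersion."""
    return variance(counts) / mean(counts)


def quasi_poisson_variance(mu, phi):
    """Quasi-Poisson: variance is a linear function of the mean."""
    return phi * mu


def negative_binomial_variance(mu, theta):
    """Negative binomial: variance is a quadratic function of the mean."""
    return mu + mu ** 2 / theta


# Toy counts, for illustration only.
toy_counts = [0, 0, 1, 1, 2, 8]
toy_ratio = dispersion_ratio(toy_counts)
```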
Genetic association analyses
For each cohort, PGS and high-impact de novo variants were regressed against the autism features with sex and the first ten genetic principal components as covariates in all analyses, with all continuous independent variables standardized. In addition, array was included as a covariate in SSC and AGRE datasets. This was performed using linear regression for standardized quantitative phenotypes, logistic regression for binary phenotypes (for example, association between PGS and the presence of a high-impact de novo variant), quasi-Poisson regression for count data (number of developmental disorders or delays, not standardized) and negative binomial regression for the age of walking independently or the age of first words (not standardized; MASS93 package in R).
For the association between genetic variables and core and associated autism phenotypes, we first conducted multivariate linear regression analyses for the four PGS using data from SPARK (waves 1 and 2), SSC, AGRE and AIMS-2-TRIALS LEAP. This is of the form:
where EA is educational attainment and 10PCs are ten principal components. For the negative control, we added the negative control as an additional independent variable in equation (1):
For the AGRE and SPARK studies, we ran equivalent mixed-effects models with family ID modeled as random intercepts to account for relatedness between individuals. This was carried out using the lme4 (ref. 94) package in R.
For high-impact de novo variants, we included the count of high-impact de novo variants as an additional independent variable in equation (1) and ran regression analyses for SPARK (wave 1 only) and SSC. To ensure interpretability across analyses, we retained only individuals who passed the genotypic QC, which included only individuals of European ancestries. Family ID was included as a random intercept:
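Effect sizes were meta-analyzed across the three cohorts using inverse-variance-weighted meta-analyses with the following formula:
$$w_i = \frac{1}{\mathrm{SE}_i^{2}}, \qquad \beta_{\mathrm{meta}} = \frac{\sum_i w_i \beta_i}{\sum_i w_i}, \qquad \mathrm{SE}_{\mathrm{meta}} = \sqrt{\frac{1}{\sum_i w_i}} \tag{4}$$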
where βi is the standardized regression coefficient of the PGS, SEi is the associated standard error and wi is the weight. P values were calculated from Z scores. Given the high correlation between the autism features and phenotypes, we used Benjamini–Yekutieli false discovery rates to correct for multiple testing (corrected P < 0.05). We calculated heterogeneity statistics (Cochran’s Q and I2 values) for the PGS meta-analyses but not for the associations with high-impact de novo variants, as the latter were calculated using only two datasets (SSC and SPARK).
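The meta-analysis and multiple-testing steps can be sketched as follows. The code uses the standard inverse-variance-weighted estimator with weights of 1/SE², a two-sided P value from the Z score, Cochran's Q and I², and the Benjamini–Yekutieli adjustment. The function names and the toy inputs are assumptions for illustration, and the authors' own scripts may differ in detail.

```python
import math
from statistics import NormalDist


def ivw_meta(betas, ses):
    """Inverse-variance-weighted meta-analysis with heterogeneity statistics."""
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))  # Cochran's Q
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"beta": beta, "se": se, "z": z, "p": p, "q": q, "i2": i2}


def benjamini_yekutieli(pvals):
    """Benjamini-Yekutieli adjusted P values, returned in the input order."""
    m = len(pvals)
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # step down from the largest P value
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy inputs, for illustration only.
toy_meta = ivw_meta([0.10, 0.20], [0.05, 0.05])
toy_adjusted = benjamini_yekutieli([0.01, 0.04, 0.03])
```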
For the SPARK and SSC studies, we investigated the association between PGS (equation (1)) and being a carrier of a high-impact de novo variant (equation (3)) and the age of first walking and first words using negative binomial regression and conducted inverse-variance meta-analyses (equation (4)). We ran the same analyses for the SPARK study to investigate the association between PGS (equation (1)) and high-impact de novo variants (equation (3)) and counts of co-occurring developmental disabilities (quasi-Poisson regression). Leave-one-out analyses were conducted by systematically excluding one of seven co-occurring developmental disabilities and reconducting the analyses.
To investigate additivity between common and high-impact de novo variants, we conducted logistic regression with carrier status as a dependent binary variable and all PGS included as independent variables and genetic principal components, sex and age included as covariates. This was carried out separately for SPARK (wave 1) and SSC and meta-analyzed as outlined earlier.
Statistical significance of differences in factor scores between sexes was computed using t-tests. Associations with age and IQ bins were tested using linear regressions with sex included as a covariate.
Matrix equivalency tests were conducted using the Jennrich test in the psych66 package in R. Statistical differences between pairwise correlation coefficients (carriers versus non-carriers) in core and associated features were tested using the package cocor95 in R. Power calculations were conducted using simulations. Using scaled existing data on full-scale IQ, adaptive behavior and motor coordination, we generated correlated simulated variables at a range of correlation coefficients to reflect the correlation between the six core factors and the three associated features. We then ran regression analyses using the simulated variable and high-impact de novo variants as provided in equation (3). We repeated this 1,000 times and counted the fraction of outcomes for which the association between high-impact de novo variant count and the simulated variable had P < 0.05 to obtain statistical power. Differences in the age of walking and the age of first words between groups of autistic individuals and siblings were calculated using Wilcoxon rank-sum tests.
Sex differences: polygenic transmission disequilibrium tests
Polygenic transmission deviation was tested using polygenic transmission disequilibrium tests (pTDT)14. To allow comparisons with midparental scores, residuals of the autism PGS were obtained after regressing out the first ten genetic principal components. These residuals were standardized by using the parental mean and standard deviations. We obtained similar results using PGS that had not been residualized for the first ten genetic principal components. We defined individuals without co-occurring ID as individuals whose full-scale IQ is above 70 in the SSC and SPARK studies. Additionally, in the SPARK cohort, we excluded any of these participants who had a co-occurring diagnosis of ‘intellectual disability, cognitive impairment, global developmental delay or borderline intellectual functioning’. Analyses were conducted separately in the SSC and SPARK cohorts and meta-analyzed using inverse-variance-weighted meta-analyses. We additionally conducted pTDT analyses on non-autistic siblings to investigate differences between males and females.
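As an illustration of the transmission test, the sketch below computes a pTDT-style deviation for each trio as the child's PGS minus the midparental PGS, scaled by the standard deviation of the midparental PGS, and then forms a one-sample t statistic against zero. This follows the general definition in ref. 14 as we understand it. The exact standardization used in the study may differ, and the toy trios are invented.

```python
import math
from statistics import mean, stdev


def ptdt(trios):
    """Return (mean deviation, standard error, t statistic).

    trios: list of (child_pgs, mother_pgs, father_pgs) tuples, with PGS
    already residualized for genetic principal components if desired.
    """
    midparent = [(mother + father) / 2.0 for _, mother, father in trios]
    sd_midparent = stdev(midparent)
    deviations = [
        (child - mid) / sd_midparent
        for (child, _, _), mid in zip(trios, midparent)
    ]
    mean_dev = mean(deviations)
    se = stdev(deviations) / math.sqrt(len(deviations))
    return mean_dev, se, mean_dev / se


# Toy trios, for illustration only.
toy_trios = [
    (0.5, 0.0, 0.2),
    (0.1, -0.2, 0.4),
    (0.9, 0.6, 0.2),
    (-0.1, -0.4, -0.2),
]
toy_mean_dev, toy_se, toy_t = ptdt(toy_trios)
```

A mean deviation above zero indicates that children carry, on average, higher PGS than expected from their parents, which is the sense in which scores are described as overinherited.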
Sex differences: high-impact de novo variants
For sex differences in high-impact de novo variants, we calculated relative risk in autistic females versus males based on (1) all carriers, (2) carriers of DD genes and (3) carriers of non-DD genes (SPARK wave 1 and SSC). For sensitivity analyses, we conducted logistic regression with sex as the dependent variable and carrier status for DD genes and either full-scale IQ and motor coordination scores (in SPARK wave 1 and SSC) or number of developmental disorders (only in SPARK wave 1) as covariates. For each sensitivity analysis, we provide the estimates of the unconditional analysis as well (that is, without the covariates).
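Relative risk here is the proportion of carriers among autistic females divided by the proportion among autistic males. The sketch below computes it with an approximate 95% confidence interval on the log scale. The interval method is a common choice rather than one stated in the text, and the counts are invented.

```python
import math


def relative_risk(carriers_f, n_f, carriers_m, n_m):
    """Relative risk of carrier status in females versus males, with 95% CI."""
    risk_f = carriers_f / n_f
    risk_m = carriers_m / n_m
    rr = risk_f / risk_m
    # Standard error of log(RR) from the usual large-sample approximation.
    se_log = math.sqrt(1 / carriers_f - 1 / n_f + 1 / carriers_m - 1 / n_m)
    lower = math.exp(math.log(rr) - 1.96 * se_log)
    upper = math.exp(math.log(rr) + 1.96 * se_log)
    return rr, lower, upper


# Hypothetical counts, for illustration only.
toy_rr, toy_lower, toy_upper = relative_risk(30, 200, 60, 800)
```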
We opted to conduct heritability analyses using unscreened population controls rather than family controls (that is, pseudocontrols or unaffected family members), as family controls likely reduce SNP heritability96 owing to parents having higher genetic likelihood for autism compared to unselected population controls55 and due to assortative mating97. Case–control heritability analyses were conducted using the ABCD cohort as population controls; specifically, the ABCD child cohort in the USA, recruited at the age of 9 or 10 years. This cohort is reasonably representative of the US population in terms of demographics and ancestry. As such, it represents an excellent comparison cohort for the SPARK and SSC cohorts. The ABCD cohort was genotyped using the Smokescreen genotype array, a bespoke array designed for the study containing over 300,000 SNPs. Genetic QC was conducted identically as for SPARK. Genetically homogeneous groups were identified using the first five genetic principal components followed by UMAP clustering with the 1000 Genomes data. We restricted our analyses to 4,481 individuals of non-Finnish European ancestries in the ABCD cohort. Scripts for this are available at https://github.com/vwarrier/ABCD_geneticQC. Imputation was conducted, similar to the analysis of SPARK data, using the TOPMed imputation panel.
For case–control heritability analyses, we combined genotype data from the ABCD cohort and from autistic individuals from the SPARK and SSC cohorts. We restricted the analysis to 6,328,651 well-imputed SNPs (r2 > 0.9) with minor allele frequency >1% in all datasets. Furthermore, we excluded multi-allelic SNPs, SNPs with a minor allele frequency difference of >5% between the three datasets and SNPs that, in the combined dataset, deviated from Hardy–Weinberg equilibrium (P < 1 × 10−6) or had genotyping rate <99%. We additionally excluded related individuals, identified using GCTA-GREML, and individuals with genotyping rate <95%. We calculated genetic principal components for the combined dataset using 52,007 SNPs with minimal linkage disequilibrium (r2 = 0.1, 1,000 kb, step size of 500 variants, removing regions with complex long-range linkage disequilibrium). Visual inspection of the principal-component plots did not identify any outliers (Supplementary Fig. 9). While our QC procedure is stringent, we note that there will be unaccounted-for effects in SNP heritability due to fine-scale population stratification, differences in genotyping array and participation bias in the autism cohorts. However, our focus is on the differences in SNP heritability between subgroups of autistic individuals, and unaccounted-for case–control differences will not affect this.
We calculated SNP heritability for autism and additionally in subgroups stratified for the presence of ID, sex, sex and ID together, and the presence of high-impact de novo variants. We also estimated SNP heritability in subgroups of autistic individuals with scores >1 s.d. above the mean for each of the six factors, autistic individuals with F1 scores > F2 scores and autistic individuals with F2 scores > F1 scores.
We calculated the observed-scale SNP heritability (baseline and subgroups) using GCTA-GREML52,53 and, additionally, using PCGC54. In all models except for the sex-stratified models, we included sex, age in months and the first ten genetic principal components as covariates. In the sex-stratified models, we included age in months and the first ten genetic principal components as covariates. For sex-stratified heritability analyses, both case and control data were from the same sex. For GCTA-GREML, the observed-scale SNP heritability was converted into liability-scale SNP heritability using equation (23) from Lee et al.98. PCGC estimates SNP heritability directly on the liability scale using the prevalence rates from Maenner et al.99. For all analyses, we ensured that the number of cases did not exceed the number of controls, with a maximum case:control ratio of 1.
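The conversion from the observed scale to the liability scale depends on the population prevalence K and the proportion of cases in the sample P. The sketch below implements the commonly used form of this transformation for ascertained case–control data, in which z is the height of the standard normal density at the liability threshold. The function name and the input values are assumptions for illustration.

```python
from statistics import NormalDist


def liability_scale_h2(h2_observed, prevalence, case_fraction):
    """Convert observed-scale SNP heritability to the liability scale.

    prevalence: population prevalence K.
    case_fraction: proportion of cases in the analyzed sample P.
    """
    normal = NormalDist()
    threshold = normal.inv_cdf(1.0 - prevalence)  # liability threshold
    z = normal.pdf(threshold)                     # density at the threshold
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))


# Hypothetical observed-scale estimate with equal numbers of cases and controls.
toy_h2_liability = liability_scale_h2(0.30, prevalence=0.018, case_fraction=0.5)
```

Because the conversion factor is a constant for a given K and P, the same observed-scale estimate maps to different liability-scale values under different prevalence assumptions, which is why estimates are recalculated across a range of prevalence values.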
We used prevalence rates from Maenner et al.99, which provides prevalence of autism among 8 year olds (1.8%). The study also provides prevalence rates by sex and by the presence of ID. However, there is wide variation in autism prevalence. We thus recalculated the SNP heritability across a range of state-specific prevalence estimates obtained from Maenner et al.99. For estimates of liability-scale heritability for subtypes defined by factor scores >1 s.d. from the mean, we estimated a prevalence of 16% of the total prevalence. For F1 > F2 and F2 > F1, prevalence was estimated at 50% of the total autism prevalence. Estimating the approximate population prevalence of autistic carriers of high-impact de novo variants is difficult due to ascertainment bias in existing autism cohorts. However, a previous study has demonstrated that the mutation rate for rare protein-truncating variants is similar between autistic individuals and siblings from the SSC and autistic individuals and population controls from the iPSYCH sample in Denmark, which does not have a participation bias100, implying that the de novo mutation rate in autistic individuals from the SPARK and SSC cohorts may be generalizable. Using the sex-specific proportion of de novo variant carriers and autism prevalence, we calculated a prevalence of 0.2% for being an autistic carrier of a high-impact de novo variant.
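The subgroup prevalence values follow from multiplying the total prevalence by the fraction of autistic individuals expected in each subgroup. The sketch below assumes that factor scores are approximately normally distributed, so that about 16% of individuals score more than 1 s.d. above the mean. That normality assumption and the function names are ours.

```python
from statistics import NormalDist

TOTAL_PREVALENCE = 0.018  # autism prevalence among 8 year olds used in the text


def upper_tail_fraction(sd_cutoff=1.0):
    """Fraction of a normal distribution lying above the given s.d. cutoff."""
    return 1.0 - NormalDist().cdf(sd_cutoff)


def subgroup_prevalence(total_prevalence, fraction):
    """Population prevalence of a subgroup making up `fraction` of all cases."""
    return total_prevalence * fraction


high_factor_prevalence = subgroup_prevalence(TOTAL_PREVALENCE, upper_tail_fraction(1.0))
f1_gt_f2_prevalence = subgroup_prevalence(TOTAL_PREVALENCE, 0.5)
```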
For sex-stratified SNP heritability analyses, we additionally calculated SNP heritability for a range of state-specific prevalence estimates to better model state-specific factors that contribute to autism diagnosis. In addition, using a total prevalence of 1.8%, we estimated SNP heritability using a male:female ratio of 3.3:1 (ref. 51) to account for diagnostic bias that may inflate the ratio.
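To show how a male:female ratio translates into sex-specific prevalence, the sketch below splits a total prevalence between the sexes. It assumes equal numbers of males and females in the population, which is our simplifying assumption and may not match the calculation used in the study.

```python
def sex_specific_prevalence(total_prevalence, male_female_ratio):
    """Return (male prevalence, female prevalence).

    Assumes a 50:50 population sex ratio, so that the mean of the two
    sex-specific values equals the total prevalence.
    """
    r = male_female_ratio
    male = total_prevalence * 2.0 * r / (r + 1.0)
    female = total_prevalence * 2.0 / (r + 1.0)
    return male, female


male_prev, female_prev = sex_specific_prevalence(0.018, 3.3)
```

With a total prevalence of 1.8% and a ratio of 3.3:1, this gives roughly 2.8% for males and 0.8% for females.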
We used GCTA-GREML to also estimate SNP heritability for the six factors, full-scale IQ and the bivariate genetic correlation between them. We used the same set of SNPs used in the case–control analyses. We were unable to conduct bivariate genetic correlation for the case–control datasets due to limitations of sample size.
We received ethical approval to access and analyze de-identified genetic and phenotypic data from the three cohorts from the University of Cambridge Human Biology Research Ethics Committee.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Genetic and phenotypic data for SFARI and SPARK are available upon application and approval from the Simons Foundation (https://www.sfari.org/resource/autism-cohorts/). Approved researchers can obtain the SPARK and SSC population datasets described in this study by applying at https://base.sfari.org. Data for AGRE are available upon application and approval from Autism Speaks (https://www.autismspeaks.org/agre). Data for EU-AIMSLEAP are available upon application and approval to the EU-AIMSLEAP committee (https://www.eu-aims.eu/the-leap-study). DDG2P phenotypes can be obtained here: https://www.deciphergenomics.org/ddd/ddgenes. GWAS data are available for hair color (https://atlas.ctglab.nl/traitDB/3495), schizophrenia and ADHD (https://www.med.unc.edu/pgc/download-results/), intelligence (https://ctg.cncr.nl/software/summary_statistics/) and educational attainment (https://thessgac.com/).
All scripts used in this study are available as follows: genetic QC and imputation in SSC (https://github.com/vwarrier/SSC_liftover_imputation; basic scripts used for imputing the SSC genotyped datasets), genetic QC and imputation in SPARK (https://github.com/vwarrier/SPARK_QC_imputation; QC and imputation of the SPARK dataset), genetic QC and imputation in the ABCD cohort (https://github.com/vwarrier/ABCD_geneticQC), bespoke genetic analyses (https://github.com/vwarrier/autism_heterogeneity; this repository has the code for the heterogeneity in autism project). We used the following software packages: PRScs (https://github.com/getian107/PRScs; polygenic prediction via continuous shrinkage priors), the TOPMed Imputation Server (https://imputation.biodatacatalyst.nhlbi.nih.gov/), PLINK 2.0 (https://www.cog-genomics.org/), GCTA-GREML (https://cnsgenomics.com/) and PCGC regression (https://dougspeed.com/). The following R packages were used: psych 2.1.6, cocor 1.1-3, lavaan 0.6-5, MASS 7.3-54, lme4 1.1-27.1.
Lai, M.-C., Lombardo, M. V. & Baron-Cohen, S. Autism. Lancet 383, 896–910 (2014).
American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders 5th edn (American Psychiatric Association, 2013).
Lord, C. et al. Autism spectrum disorder. Nat. Rev. Dis. Primers 6, 5 (2020).
Geschwind, D. H. Advances in autism. Annu. Rev. Med. 60, 367–380 (2009).
Mandell, D. S., Novak, M. M. & Zubritsky, C. D. Factors associated with age of diagnosis among children with autism spectrum disorders. Pediatrics 116, 1480–1486 (2005).
Kanne, S. M. et al. The role of adaptive behavior in autism spectrum disorders: implications for functional outcome. J. Autism Dev. Disord. 41, 1007–1018 (2011).
Lai, M.-C. & Szatmari, P. Sex and gender impacts on the behavioural presentation and recognition of autism. Curr. Opin. Psychiatry 33, 117–123 (2020).
Warrier, V. et al. Elevated rates of autism, other neurodevelopmental and psychiatric diagnoses, and autistic traits in transgender and gender-diverse individuals. Nat. Commun. 11, 3959 (2020).
Frazier, T. W. et al. Demographic and clinical correlates of autism symptom domains and autism spectrum diagnosis. Autism 18, 571–582 (2014).
Havdahl, K. A. et al. Multidimensional influences on autism symptom measures: implications for use in etiological research. J. Am. Acad. Child Adolesc. Psychiatry 55, 1054–1063 (2016).
Havdahl, A. et al. Genetic contributions to autism spectrum disorder. Psychol. Med. 51, 2260–2273 (2021).
Warrier, V. et al. Social and non-social autism symptoms and trait domains are genetically dissociable. Commun. Biol. 2, 328 (2019).
Robinson, E. B., Lichtenstein, P., Anckarsäter, H., Happé, F. & Ronald, A. Examining and interpreting the female protective effect against autistic behavior. Proc. Natl Acad. Sci. USA 110, 5258–5262 (2013).
Weiner, D. J. et al. Polygenic transmission disequilibrium confirms that common and rare variation act additively to create risk for autism spectrum disorders. Nat. Genet. 49, 978–985 (2017).
Robinson, E. B. et al. Genetic risk for autism spectrum disorders and neuropsychiatric variation in the general population. Nat. Genet. 48, 552–555 (2016).
Grove, J. et al. Identification of common genetic risk variants for autism spectrum disorder. Nat. Genet. 51, 431–444 (2019).
Satterstrom, F. K. et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell 180, 568–584 (2020).
Chaste, P. et al. A genome-wide association study of autism using the Simons Simplex Collection: does reducing phenotypic heterogeneity in autism increase genetic homogeneity? Biol. Psychiatry 77, 775–784 (2015).
Antaki, D. et al. A phenotypic spectrum of autism is attributable to the combined effects of rare variants, polygenic risk and sex. Nat. Genet. https://doi.org/10.1038/s41588-022-01064-5 (2022).
Buja, A. et al. Damaging de novo mutations diminish motor skills in children on the autism spectrum. Proc. Natl Acad. Sci. USA 115, E1859–E1866 (2018).
Bishop, S. L. et al. Identification of developmental and behavioral markers associated with genetic abnormalities in autism spectrum disorder. Am. J. Psychiatry 174, 576–585 (2017).
Happé, F., Ronald, A. & Plomin, R. Time to give up on a single explanation for autism. Nat. Neurosci. 9, 1218–1220 (2006).
Frazier, T. W. et al. Validation of proposed DSM-5 criteria for autism spectrum disorder. J. Am. Acad. Child Adolesc. Psychiatry 51, 28–40 (2012).
Lai, M.-C., Lombardo, M. V., Auyeung, B., Chakrabarti, B. & Baron-Cohen, S. Sex/gender differences and autism: setting the scene for future research. J. Am. Acad. Child Adolesc. Psychiatry 54, 11–24 (2015).
Werling, D. M. & Geschwind, D. H. Sex differences in autism spectrum disorders. Curr. Opin. Neurol. 26, 146–153 (2013).
Kosmicki, J. A. et al. Refining the role of de novo protein-truncating variants in neurodevelopmental disorders by using population reference samples. Nat. Genet. 49, 504–510 (2017).
Kaplanis, J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020).
Lam, K. S. L. & Aman, M. G. The Repetitive Behavior Scale—Revised: independent validation in individuals with autism spectrum disorders. J. Autism Dev. Disord. 37, 855–866 (2007).
Rutter, M., Bailey, A. & Lord, C. SCQ: the Social Communication Questionnaire (Western Psychological Services, 2003).
Fischbach, G. D. & Lord, C. The Simons Simplex Collection: a resource for identification of autism genetic risk factors. Neuron 68, 192–195 (2010).
SPARK Consortium et al. SPARK: a US cohort of 50,000 families to accelerate autism research. Neuron 97, 488–493 (2018).
Pender, R., Fearon, P., Heron, J. & Mandy, W. The longitudinal heterogeneity of autistic traits: a systematic review. Res. Autism Spectr. Disord. 79, 101671 (2020).
Jones, R. M. et al. How interview questions are placed in time influences caregiver description of social communication symptoms on the ADI-R. J. Child Psychol. Psychiatry 56, 577–585 (2015).
Savage, J. E. et al. Genome-wide association meta-analysis in 269,867 individuals identifies new genetic and functional links to intelligence. Nat. Genet. 50, 912–919 (2018).
Lee, J. J. et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat. Genet. 50, 1112–1121 (2018).
Demontis, D. et al. Discovery of the first genome-wide significant risk loci for attention deficit/hyperactivity disorder. Nat. Genet. 51, 63–75 (2019).
The Schizophrenia Working Group of the Psychiatric Genomics Consortium, Ripke, S., Walters, J. T. R. & O’Donovan, M. C. Mapping genomic loci prioritises genes and implicates synaptic biology in schizophrenia. Preprint at medRxiv https://doi.org/10.1101/2020.09.12.20192922 (2020).
Watanabe, K. et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat. Genet. 51, 1339–1348 (2019).
Peça, J. et al. Shank3 mutant mice display autistic-like behaviours and striatal dysfunction. Nature 472, 437–442 (2011).
Katayama, Y. et al. CHD8 haploinsufficiency results in autistic-like phenotypes in mice. Nature 537, 675–679 (2016).
Hoffmann, T. J. et al. Evidence of reproductive stoppage in families with autism spectrum disorder: a large, population-based cohort study. JAMA Psychiatry 71, 943–951 (2014).
Lai, M.-C. & Baron-Cohen, S. Identifying the lost generation of adults with autism spectrum conditions. Lancet Psychiatry 2, 1013–1027 (2015).
Clarke, T.-K. et al. Common polygenic risk for autism spectrum disorder (ASD) is associated with cognitive ability in the general population. Mol. Psychiatry 21, 419–425 (2016).
Myers, S. M. et al. Insufficient evidence for ‘autism-specific’ genes. Am. J. Hum. Genet. 106, 587–595 (2020).
Thormann, A. et al. Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP. Nat. Commun. 10, 2373 (2019).
Jacquemont, S. et al. A higher mutational burden in females supports a ‘female protective model’ in neurodevelopmental disorders. Am. J. Hum. Genet. 94, 415–425 (2014).
Sanders, S. J. et al. Insights into autism spectrum disorder genomic architecture and biology from 71 risk loci. Neuron 87, 1215–1233 (2015).
Deciphering Developmental Disorders Study. Prevalence and architecture of de novo mutations in developmental disorders. Nature 542, 433–438 (2017).
Wigdor, E. M. et al. The female protective effect against autism spectrum disorder. Preprint at medRxiv https://doi.org/10.1101/2021.03.29.21253866 (2021).
Pirastu, N. et al. Genetic analyses identify widespread sex-differential participation bias. Nat. Genet. 53, 663–671 (2021).
Loomes, R., Hull, L. & Mandy, W. P. L. What is the male-to-female ratio in autism spectrum disorder? A systematic review and meta-analysis. J. Am. Acad. Child Adolesc. Psychiatry 56, 466–474 (2017).
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569 (2010).
Golan, D., Lander, E. S. & Rosset, S. Measuring missing heritability: inferring the contribution of common variants. Proc. Natl Acad. Sci. USA 111, E5272–E5281 (2014).
Klei, L. L. et al. Common genetic variants, acting additively, are a major source of risk for autism. Mol. Autism 3, 9 (2012).
Gaugler, T. et al. Most genetic risk for autism resides with common variation. Nat. Genet. 46, 881–885 (2014).
Gao, Z. et al. Overlooked roles of DNA damage and maternal age in generating human germline mutations. Proc. Natl Acad. Sci. USA 116, 9491–9500 (2019).
Kong, A. et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488, 471–475 (2012).
Niemi, M. E. K. et al. Common genetic variants contribute to risk of rare severe neurodevelopmental disorders. Nature 562, 268–271 (2018).
Trost, B. et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature 586, 80–86 (2020).
Mitra, I. et al. Patterns of de novo tandem repeat mutations and their role in autism. Nature 589, 246–250 (2021).
Dudbridge, F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348 (2013).
Happé, F. & Frith, U. Annual Research Review: looking back to look forward—changes in the concept of autism and implications for future research. J. Child Psychol. Psychiatry 61, 218–232 (2020).
Geschwind, D. H. et al. The Autism Genetic Resource Exchange: a resource for the study of autism and related neuropsychiatric conditions. Am. J. Hum. Genet. 69, 463–466 (2001).
Charman, T. et al. The EU-AIMS Longitudinal European Autism Project (LEAP): clinical characterisation. Mol. Autism 8, 27 (2017).
Revelle, W. & Revelle, M. W. psych: Procedures for Psychological, Psychometric, and Personality Research. R package version 2.16 https://cran.r-project.org/package=psych (2021).
Bishop, S. L., Havdahl, K. A., Huerta, M. & Lord, C. Subdimensions of social-communication impairment in autism spectrum disorder. J. Child Psychol. Psychiatry 57, 909–916 (2016).
Zheng, S. et al. Extracting latent subdimensions of social communication: a cross-measure factor analysis. J. Am. Acad. Child Adolesc. Psychiatry 60, 768–782 (2021).
Grove, R., Begeer, S., Scheeren, A. M., Weiland, R. F. & Hoekstra, R. A. Evaluating the latent structure of the non-social domain of autism in autistic adults. Mol. Autism 12, 22 (2021).
Richler, J., Bishop, S. L., Kleinke, J. R. & Lord, C. Restricted and repetitive behaviors in young children with autism spectrum disorders. J. Autism Dev. Disord. 37, 73–85 (2007).
Heise, D. R. & Bohrnstedt, G. W. Validity, invalidity, and reliability. Sociol. Methodol. 2, 104–129 (1970).
Bentler, P. M. Alpha, dimension-free, and model-based internal consistency reliability. Psychometrika 74, 137–143 (2009).
Reise, S. P., Moore, T. M. & Haviland, M. G. Bifactor models and rotations: exploring the extent to which multidimensional data yield univocal scale scores. J. Pers. Assess. 92, 544–559 (2010).
Rosseel, Y. lavaan: an R package for structural equation modeling and more. J. Stat. Softw. 48, 1–36 (2012).
Gibbs, R. A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).
Conomos, M. P. & Thornton, T. Genetic Estimation and Inference in Structured samples (GENESIS): statistical methods for analyzing genetic data from samples with population structure and/or relatedness. R package v.2 (Bioconductor, 2016).
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
Howie, B. N., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959 (2012).
McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Warrier, V. et al. Gene–environment correlations and causal effects of childhood maltreatment on physical and mental health: a genetically informed approach. Lancet Psychiatry 8, 373–386 (2021).
Bybjerg-Grauholm, J. et al. The iPSYCH2015 case–cohort sample: updated directions for unravelling genetic and environmental architectures of severe mental disorders. Preprint at medRxiv https://doi.org/10.1101/2020.11.30.20237768 (2020).
Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776 (2019).
Pain, O. et al. Evaluation of polygenic prediction methodology within a reference-standardized framework. PLoS Genet. 17, e1009021 (2021).
Samocha, K. E., Kosmicki, J. A. & Karczewski, K. J. Regional missense constraint improves variant deleteriousness prediction. Preprint at bioRxiv https://doi.org/10.1101/148353 (2017).
Karczewski, K. J. et al. Author Correction: the mutational constraint spectrum quantified from variation in 141,456 humans. Nature 590, E53 (2021).
Lord, C. et al. Autism diagnostic observation schedule: a standardized observation of communicative and social behavior. J. Autism Dev. Disord. 19, 185–212 (1989).
Lord, C. et al. Autism Diagnostic Interview—Revised: a revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. J. Autism Dev. Disord. 24, 659–685 (1994).
Constantino, J. N. & Gruber, C. P. Social Responsiveness Scale: SRS-2 (Western Psychological Services, 2012).
Sparrow, S. S., Balla, D. A., Cicchetti, D. V. & Harrison, P. L. Vineland Adaptive Behavior Scales (American Guidance Service, 1984).
Wilson, B. N., Kaplan, B. J., Crawford, S. G. & Roberts, G. The Developmental Coordination Disorder Questionnaire 2007 (DCDQ’07). Phys. Occup. Ther. Pediatr. 29, 267–272 (2007).
Ripley, B. et al. MASS. R package version 7.3-54 https://cran.r-project.org/package=MASS (2021).
Bates, D., Sarkar, D., Bates, M. D. & Matrix, L. lme4. R package version 1.1-27.1 https://cran.r-project.org/package=lme4 (2021).
Diedenhofen, B. & Musch, J. cocor: a comprehensive solution for the statistical comparison of correlations. PLoS ONE 10, e0121945 (2015).
Peyrot, W. J., Boomsma, D. I., Penninx, B. W. J. H. & Wray, N. R. Disease and polygenic architecture: avoid trio design and appropriately account for unscreened control subjects for common disease. Am. J. Hum. Genet. 98, 382–391 (2016).
Baron-Cohen, S. The hyper-systemizing, assortative mating theory of autism. Prog. Neuropsychopharmacol. Biol. Psychiatry 30, 865–872 (2006).
Lee, S. H., Wray, N. R., Goddard, M. E. & Visscher, P. M. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 88, 294–305 (2011).
Maenner, M. J. et al. Prevalence of autism spectrum disorder among children aged 8 years—Autism and Developmental Disabilities Monitoring Network, 11 sites, United States, 2016. MMWR Surveill. Summ. 69, 1–12 (2020).
Satterstrom, F. K. et al. Autism spectrum disorder and attention deficit hyperactivity disorder have a similar burden of rare protein-truncating variants. Nat. Neurosci. 22, 1961–1965 (2019).
S.B.-C. received funding from the Wellcome Trust (214322/Z/18/Z). For the purpose of open access, we have applied a CC BY public copyright licence to any author accepted manuscript version arising from this submission. S.B.-C. also received funding from the Autism Centre of Excellence, the SFARI, the Templeton World Charitable Fund, the MRC and the National Institute for Health Research Cambridge Biomedical Research Centre. The research was supported by the National Institute for Health Research Applied Research Collaboration East of England. Any views expressed are those of the author(s) and not necessarily those of the funder. Some of the results leading to this publication have received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement no. 777394 for the project AIMS-2-TRIALS. This joint undertaking receives support from the European Union’s Horizon 2020 research and innovation program and the EFPIA and Autism Speaks, Autistica and the SFARI. V.W. is funded by St. Catharine’s College, Cambridge. T.B. has received funding from the Institut Pasteur, the CNRS, the Bettencourt–Schueller and the Cognacq–Jay Foundations, the APHP and the Université de Paris Cité. We acknowledge with gratitude the generous support of D. and M. Gillings in strengthening the collaboration between S.B.-C. and T.B. and between Cambridge University and the Institut Pasteur. The iPSYCH team was supported by grants from the Lundbeck Foundation (R102-A9118, R155-2014-1724 and R248-2017-2003), the NIMH (1U01MH109514-01 to A.D.B.) and the universities and university hospitals of Aarhus and Copenhagen. The Danish National Biobank resource was supported by the Novo Nordisk Foundation. High-performance computer capacity for handling and statistical analysis of iPSYCH data on the GenomeDK HPC facility was provided by the Center for Genomics and Personalized Medicine and the Centre for Integrative Sequencing, iSEQ, Aarhus University, Denmark (grant to A.D.B.).
We thank J. Sebat for sharing the de novo variant calls in the SPARK and SSC datasets. We are grateful to all families at the participating SSC sites as well as the principal investigators (A. Beaudet, R. Bernier, J. Constantino, E. Cook, E. Fombonne, D. Geschwind, R. Goin-Kochel, E. Hanson, D. Grice, A. Klin, D. Ledbetter, C. Lord, C. Martin, D. Martin, R. Maxim, J. Miles, O. Ousley, K. Pelphrey, B. Peterson, J. Piggot, C. Saulnier, M. State, W. Stone, J. Sutcliffe, C. Walsh, Z. Warren and E. Wijsman). We are grateful to all families in the SPARK study, the SPARK clinical sites and SPARK staff.
M.E.H. is a cofounder of and consultant to and holds shares in Congenica Ltd., a genetics diagnostics company.
Peer review information
Nature Genetics thanks Philip Jansen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Warrier, V., Zhang, X., Reed, P. et al. Genetic correlates of phenotypic heterogeneity in autism. Nat Genet (2022). https://doi.org/10.1038/s41588-022-01072-5