Autism occurs in approximately 1–2% of the population [1] and autistic individuals’ mental health difficulties are a major public health issue. In economic terms, the lifetime individual cost of autism is estimated at $2.4 (£1.5) million in the United States and United Kingdom, and annual population costs are around $268 billion in the United States [2, 3]. While interest in and science investigating autism have been growing rapidly, progress towards translating scientific knowledge into high-impact clinical practice has been slow and modest. We are still far from delivering more effective intervention for unwanted symptoms, more precise and earlier diagnosis, better understanding and prediction of prognosis and development, and personalization of support and intervention. All of these goals are within the scope of stratified psychiatry [4] and precision medicine [5]. Our contention is that, to make progress towards them, we will first need to grapple with an important issue holding back the field—heterogeneity within the autistic population.

The field is currently addressing this issue. Some have argued that we are at a crossroads and must acknowledge that the concept of autism as a single entity lacks validity at a biological level [6, 7] and that autism must be taken apart [8]. This idea relates to what others have discussed regarding autism as an umbrella label referring to many different kinds of ‘autisms’ [9] and to calls for the scientific community to abandon attempts to characterize all of autism under a single theory [10]. Research has begun along these new directions but is highly fractionated, because heterogeneity is discussed across multiple levels of analysis, from genetics [11] and neural systems [12,13,14] to cognition [15], behavior and development [16,17,18], and clinical topics (e.g., response to treatment or outcome [19, 20]). Approaches differ in how heterogeneity should be decomposed, ranging from the use of theoretically motivated a priori stratifiers or dimensions [21,22,23] to data-driven approaches [12, 24,25,26]. Models for understanding heterogeneity also differ, with some conceptualizing distinctions as categorical/qualitative, others as continuous/dimensional, and still others allowing distinctions or similarities to cut across diagnostic boundaries [26,27,28]. Work also differs in aim: some studies seek to understand heterogeneity within one level of analysis [29, 30], while others attempt to explain heterogeneity across levels [31,32,33,34,35,36].

The purpose of this paper is not to provide an in-depth review of the literature in these areas. Rather, we see a need to provide organizing principles for framing these diverse areas of research, so that future synthesis and theoretical development about heterogeneity can be facilitated. Specifically, we first discuss how commonly used terminology such as the ‘spectrum’ or the ‘autisms’ can imply different types of models for understanding heterogeneity in autism. Next, we discuss how heterogeneity arises within the context of historical changes in diagnostic criteria. Third, we argue why understanding heterogeneity is critical for furthering progress towards precision medicine. Fourth, we discuss some of the problems with the dominant paradigm in the field—the case–control paradigm. In discussing these issues, we point towards the problems of small sample studies and the need for bigger data. This leads into a discussion of the characteristics of big data that are important for studying heterogeneity in autism. We follow this with organizing principles for how one might attempt to understand multi-level heterogeneity. We then discuss the role of transdiagnostic viewpoints, which go beyond understanding heterogeneity just within autism. Finally, we conclude with discussions of realistic challenges, mitigating strategies, and clinical implications of big data approaches.

Terminology behind ‘heterogeneity’ and its impact on building and evaluating models

The concept of heterogeneity in autism dates back to the original conception of an ‘autistic spectrum’ by Wing [37]. Since then, the concept has been applied beyond just the clinical, behavioral, and/or cognitive levels. A hallmark of heterogeneity in autism is its multi-level presentation (Fig. 1c), applicable from genotype through phenotype [9, 10], throughout development [16, 38], and manifesting as important clinical differentiation (e.g., outcome [20], response to treatment [19], etc.). Thus, the concept of heterogeneity applies not only to how individuals differ at one level of analysis, but also to when and at which levels those differences arise, and potentially to how heterogeneity is coordinated across levels. While the idea of heterogeneity itself has a longstanding history, better explanations are needed for why heterogeneity manifests across different levels and how these manifestations are connected across levels and within or between individuals. Borrowing concepts from developmental psychopathology, terms such as equifinality and multifinality [39] may be helpful. For example, a diversity of different developmental starting points or causal mechanisms in the genome may reach similar endpoints (equifinality) at levels more proximate to clinical outcomes or behavior [40]. Conversely, very similar mechanisms at one level could result in a diversity of endpoints (multifinality) [41]. Currently, the mapping of multi-level heterogeneity in autism is unclear, but understanding these mappings is imperative, as they are likely to yield explanations useful for precision medicine goals.

Fig. 1

Approaches to decomposing heterogeneity in autism. a A population of interest is shown, and autism cases are colored in green, pink, and blue. The different colors represent different autism subtypes. b The impact of ignoring heterogeneity on effect size. With a typical case–control model, we ignore these possible subtype distinctions and compare autism to controls on some dependent variable. In this example scenario there is no clear case–control difference, but the autism group shows higher variability (indicated by the larger error bars). An approach towards decomposing heterogeneity might be to construct a stratified model whereby we model the subtype labels instead of one autism label, and then re-examine differences on the hypothetical dependent variable of interest. In this example, the autism subtypes show contradictory effects. These effects are masked in the case–control model, as the averaging cancels out the interesting, differing effects across the subgroups. c Heterogeneity in autism is shown as a multi-level phenomenon. This panel also visualizes the difference between broad versus deep big data characteristics and labels the top-down versus bottom-up approaches to understanding heterogeneity in this multi-level context. Finally, this panel shows how development is another important dimension of heterogeneity to consider at each level of analysis (i.e., ‘chronogeneity’). In this example, chronogeneity is represented by different trajectories for different types of autistic individuals

There are many ways to talk about how autistic individuals are similar to or different from each other [42]. On one hand, we can understand phrases like the ‘spectrum’ as referring to heterogeneity as graded continuous change between individuals. ‘Spectrum’ can apply both to the clinically diagnosed autism population and to the whole population, including those with the ‘broader autism phenotype’ [43,44,45,46]. The idea of a spectrum can thus serve as a model for understanding heterogeneity between autistic individuals—a model we would refer to as a ‘dimensional model’. Dimensional models can also cut across traditional diagnostic boundaries, the most prominent example being the National Institute of Mental Health (NIMH) Research Domain Criteria (RDoC) model [47]. On the other hand, we also use heterogeneity as a way of conceptualizing categorical or qualitative differences between autistic individuals. The term ‘spectrum’ could be read as implying such qualitative rather than quantitative differences, but terms that pluralize autism as the ‘autisms’ may be more applicable here, as the idea of multiple kinds of autisms lends itself to categorical ways of thinking about individuals as ‘subgroups’ or ‘subtypes’. A subtype model for explaining heterogeneity in autism can also be called a ‘stratified model’.

Since we have different ways of talking about heterogeneity, the question naturally arises as to which conceptualization is best. Are categorical ‘subtype’ models better than continuous ‘dimensional’ models, or vice versa? This may be an ill-posed question, since these concepts and models need not be mutually exclusive. First, we could theoretically imagine an important blending of the two types of models, and such blending can be tested statistically (e.g., with factor mixture models [48]). For instance, one could first subtype the autistic population, and then further characterize between-individual variability through continuous models within each subtype. Second, the answer may differ depending on the aim of the model. For example, a subtype model might be better at predicting treatment response, whereas a dimensional model might be better at predicting basic biological mechanisms, or vice versa. As we build a literature on understanding heterogeneity in autism, it will be important to be clear about how different models conceptualize heterogeneity, and to recognize that different models may be important for different types of aims. George Box’s aphorism that ‘all models are wrong, but some are useful’ is applicable here [49]. Models are simplified explanations that typically account for only a portion of the variability in a phenomenon. Even when models differ considerably in their explanatory and predictive power, they can still be useful for a variety of different aims. Therefore, a pragmatic approach to evaluating heterogeneity models will be important for moving forward, since it is unlikely that we will converge on single explanations (models) that can account for the wide array of multi-level heterogeneity in autism.
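To make this subtype-then-dimension idea concrete, here is a minimal Python sketch under stated assumptions: the data and variable names are hypothetical, and a Gaussian mixture followed by within-subtype regression is a loose stand-in for a formal factor mixture model, not a prescribed analysis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Hypothetical data: rows are autistic individuals, columns are phenotypic features.
X = rng.normal(size=(500, 4))
outcome = rng.normal(size=500)  # hypothetical continuous outcome of interest

# Categorical step: fit a mixture model and assign each individual to a subtype.
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
subtype = gmm.predict(X)

# Dimensional step: model continuous variation separately within each subtype.
for k in range(gmm.n_components):
    mask = subtype == k
    within = LinearRegression().fit(X[mask], outcome[mask])
    print(f"subtype {k}: n = {mask.sum()}, within-subtype R^2 = "
          f"{within.score(X[mask], outcome[mask]):.2f}")
```

In practice, the number of components and the within-subtype model would be chosen by formal model comparison (e.g., information criteria) rather than fixed in advance, as in the factor mixture modeling literature cited above.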

Heterogeneity and the evolution of the diagnostic concept

The evolution of the nosology and diagnostic concept of autism changes the definition of autism—who counts as being ‘on the spectrum’ and who gets a clinical diagnosis [50]. This evolution also contributes to the discussion about heterogeneity in autism. When ‘autism’ was first defined as ‘autistic disturbances of affective contact’, the core features were considered to be ‘extreme self-isolation’ and ‘obsessive insistence on the preservation of sameness’ [51, 52]. At the cognitive level, language impairments or peculiarities were seen as secondary to ‘basic disturbances in human relatedness’ [52]. Moreover, both Kanner [51] and Asperger [53] recognized good cognitive potential in their child patients, and therefore autism was not necessarily tied to intellectual disability. However, at the next stage of nosological evolution, language and cognitive impairments began to be considered ‘core’ [54], and this conceptualization directly shaped the first operationalization of autism in the Diagnostic and Statistical Manual of Mental Disorders (DSM)-III [55], in which language deficits were core to diagnosis. Individuals identified as having autism in the 1970s and 1980s were therefore mostly those with marked difficulties in verbal communication, and many were considered to have intellectual disability. In the 1980s, Wing [56] and colleagues not only introduced the work of Hans Asperger into the English-speaking world, but also conducted epidemiological studies demonstrating the heterogeneity in social, language, motor, and cognitive abilities in the autistic and developmentally delayed population [57, 58]. Wing’s ideas of the ‘triad of social, communication and imagination impairments and repetitive behavior’, the lack of a clear division between Kanner’s autism and less extreme forms, and the shift of core social impairment from ‘extreme autistic aloneness’ to ‘deficits in the use and understanding of unwritten rules of social behavior’ clearly broadened what autism encompassed. All of these ideas were subsequently adopted into versions of the diagnostic systems, including DSM-III-R, DSM-IV, and ICD-10 (International Statistical Classification of Diseases and Related Health Problems, 10th Revision). Phenotypic heterogeneity therefore increased, allowing an autistic individual to be verbal or minimally verbal; ‘active but odd’, ‘passive’, ‘aloof’, or a ‘loner’ [59]; and to show various combinations of repetitive and stereotyped behaviors. The DSM-5’s exclusion of language impairments from, and inclusion of atypical sensory responses into, the core symptoms reflects how the concept of autism nowadays is much broader than how it was initially conceptualized. The most recent revision of the ICD criteria (ICD-11) further emphasizes specific diagnostic subgroups that qualify whether an autistic individual has impairments in functional language and/or intellectual development. With the changing and broadening diagnostic concept comes increased heterogeneity, inevitably at the behavioral phenotypic level, and possibly also at other levels of analysis.

This history of the evolving diagnostic concept is an important yet often not fully acknowledged caveat for interpreting research on autism. Research from several decades ago may have been isolating phenomena in altogether different types of individuals than more recent research does. Since the spectrum of diagnosed individuals is wider today than in the past, interpretations of non-replication or inconsistencies across studies should take this into account, rather than assuming the population under study has not changed over time. As the diagnostic concept continues to change, we must remain mindful of this issue when comparing current research with work that may be several decades old.

Shifting from the ‘one-size-fits-all’ paradigm towards understanding heterogeneity

Perhaps the most prominent justification for why understanding heterogeneity is important is that autistic individuals differ widely in response to treatment. While the predominant treatment approaches are early intensive behavioral intervention and naturalistic developmental behavioral intervention, the existing literature suggests that their effectiveness is variable and that in some cases they may not significantly affect core autism features such as social-communication difficulties [60,61,62,63,64]. Currently, there are also no medical treatments that significantly affect the core characteristics of autism [65, 66]. Rather than advocating a ‘one-size-fits-all’ approach to treatment, the most recent best practice recommendations specifically highlight the critical need for future research to identify factors that explain heterogeneity in response to treatment, in order to better individualize treatment and intervention approaches and to better target changes in core or functionally impairing symptomatology [60, 64]. A separate ethical issue raised by the neurodiversity movement is that autism itself should not be a target for treatment, since it may be part of the individual’s genetic make-up and identity. Rather, treatment should target specific co-occurring symptoms and difficulties in adaptive functioning that cause suffering and disability. Such co-occurring symptoms and maladaptation (in many cases the contributing factors are not solely within the autistic person but also arise from environmental contexts) comprise a critical, yet under-developed, angle on stratification of the autism spectrum that will guide ethical and personalized intervention.

Heterogeneity also limits basic scientific progress towards understanding autism. To understand why, it is important to first make salient the problems with the dominant paradigm, which is ill-equipped to reveal heterogeneity—the case–control paradigm. The case–control paradigm exemplifies the ‘one-size-fits-all’ approach, since all cases are treated identically by virtue of carrying the same diagnostic label. Studies that attempt to identify ‘biomarkers’ via case–control designs implicitly assume that if a strong biomarker existed, it would completely differentiate cases from controls. We have yet to isolate any biomarkers for autism that can reliably and consistently reach this high bar [7, 67]. One reason why case–control research has fallen short of identifying high-impact biomarkers could be that we are looking at the wrong features. However, an alternative explanation is that high-impact biomarkers are likely exclusive to specific subsets of autistic individuals. That is, a high-impact biomarker may be informative for one subtype of autism, but not others (Fig. 1b). In order to identify such stratification or dimensional biomarkers [68], one will have to change the approach from the case–control model to a stratified and/or dimensional model. This is not to say that case–control studies are not useful. Isolation of consistent and reliable case–control differences is useful for identifying on-average differences, though typically with a substantial degree of overlap in the distributions. However, if we are searching for biomarkers that could help us move towards precision medicine, we will need to pivot away from case–control studies as the dominant paradigm and towards stratified and/or dimensional models that could yield larger, higher-impact effects.

As an illustrative example, we take our own recent work on mentalizing ability in autistic adults. From a case–control perspective, autistic adults perform on average lower on the ‘Reading the Mind in the Eyes’ Test (RMET) than matched typically developing controls [69]. However, taking a stratified approach, we find that the autistic adult population can be reliably split into subtypes that are completely unimpaired on the RMET versus subtypes that are highly impaired [25] (Fig. 2). Thus, in this example, although replicable on-average case–control effects appear, a stratified approach that takes heterogeneity into account can yield higher-impact and more precise conclusions about mentalizing as measured by the RMET in the adult autistic population.

Fig. 2

Case–control vs stratified model example with adult autism and mentalizing ability. This figure reports data from Lombardo et al. [25] on two independent datasets of adults with autism and performance on an advanced mentalizing test, the Reading the Mind in the Eyes Test (RMET). a (Discovery), b (Replication) Case–control differentiation and the standardized effect size for each dataset are shown. c–f RMET scores and standardized effect sizes from the same two datasets after unsupervised data-driven stratification into five distinct autism subgroups and four distinct TD subgroups. Autism subgroups 1–2 are highly impaired on the RMET, while autism subgroups 3–5 are completely overlapping in RMET scores with the TD population

Imprecise effect size estimates and lack of power in small sample size studies

Compounding the problem of utilizing ‘one-size-fits-all’ models like the case–control paradigm is the issue of small sample size studies. Over the last several decades, it has been common practice to conduct and publish small sample size studies. Small sample studies are problematic because statistical power is low for all but the largest effects. Small sample size also means that estimated sample statistics vary considerably around their population parameters due to more pronounced sampling variability. In Fig. 3, we show simulations that make the issues of low power and imprecise effect size estimates clear and salient. A common case–control study with n = 20 per group yields an effect size estimate that varies considerably around the true population effect, and this variability at small sample sizes is similar irrespective of what the true population effect is. Only with very large sample sizes (e.g., n > 1000) does the sample effect size home in with some precision on the true population effect size. The histograms shaded in red in Fig. 3 also show the limited statistical power one has at smaller effect sizes and small sample sizes.

Fig. 3

Simulation of sample effect size estimates at different sample sizes and across a range of true population effects for a hypothetical case–control study. In this simulation we set the population effect size to a range of different values, from very small (e.g., d = 0.1) to very large (e.g., d > 1.0) (panels a–e show simulation results for effect sizes ranging from d = 0.1 to d = 0.9 in steps of 0.2). We then simulated data from two populations (cases and controls), each with n = 10,000,000, that had a case–control difference at these population effect sizes. Next, we simulated 10,000 experiments in which we randomly sampled different sample sizes from these populations (n = 20, n = 50, n = 100, n = 200, n = 1000, n = 2000) and computed the sample effect size estimate (standardized effect size, Cohen’s d) for the case–control difference. The gray histograms show how variable the sample effect size estimates are (black lines show 95% confidence intervals) relative to the true population effect size (green line). Visually, it is quite apparent that small sample sizes (e.g., n = 20) produce wildly varying sample effect size estimates and that this variability is similar irrespective of the true population effect size. Overlaid on each gray histogram are red histograms showing the distribution of sample effect size estimates for which the hypothesis test (e.g., independent samples t-test) passes statistical significance at p < 0.05. The rightward shift of this red distribution relative to the true population effect size (green line) illustrates the phenomenon of effect size inflation. The problem is much more pronounced at small sample sizes and when true population effects are smaller. We then computed the average effect size inflation for this red distribution and plotted it as a percentage increase relative to the true population effect in f. Each line in panel f refers to simulations with a different sample size. This plot directly quantifies the degree of effect size inflation across a range of true population effects and across a range of sample sizes. The code for implementing and reproducing these simulations is available at https://github.com/mvlombardo/effectsizesim

Effect size inflation in small sample studies

Our simulations also make salient another common characteristic of small sample size studies—the possibility of vast effect size inflation when statistically significant effects are identified [70]. Inflated effects occur because effect size estimates that reach statistical significance in small studies benefit from noise in the direction of the effect. Such inflated effects present an over-optimistic view of the identified effects and are prone to the winner’s curse [70]. Inflated effects look attractive and may be easier to publish because they appear to indicate large effects. However, in subsequent replication attempts, investigators will likely fail to identify effects as large as in the original small sample study, because the original effect size was inflated by some degree [71]. We can see effect size inflation and its interaction with true population effect size in Fig. 3. At very small true population effect sizes, sample effect size estimates that reach statistical significance (the red histograms in Fig. 3a–e) are wildly inflated, and this problem is most pronounced for small sample size studies. For example, a tiny population effect size of 0.1 standard deviations shows, on average, greater than 300% or 350% effect size inflation when a study observes a statistically significant effect at p < 0.05 with n = 50 or n = 20, respectively (Fig. 3f). If the true population effect size is much larger (e.g., d > 0.5), effect size inflation is attenuated, and at relatively large sample sizes (n > 100 per group) there is very little effect size inflation on average for such effects. Of course, the simulations here are simplistic examples of studies with only one statistical comparison. In reality, studies typically make multiple comparisons, sometimes on a massive scale (e.g., neuroimaging, genetics). In these situations, inflated effect sizes become an even bigger problem [72].
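For readers who wish to experiment with these ideas, the following condensed Python sketch re-implements the core logic described above. It is illustrative only and is not the authors’ code (which is available at the repository linked in the Fig. 3 caption); the seed, effect size, and number of simulated experiments are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.3            # true population effect size (Cohen's d); try 0.1-0.9
n_experiments = 10_000  # simulated studies per sample size

for n in (20, 50, 100, 200, 1000, 2000):
    d_hat = np.empty(n_experiments)
    sig = np.empty(n_experiments, dtype=bool)
    for i in range(n_experiments):
        cases = rng.normal(true_d, 1.0, n)   # cases shifted by true_d SD
        controls = rng.normal(0.0, 1.0, n)
        pooled_sd = np.sqrt((cases.var(ddof=1) + controls.var(ddof=1)) / 2)
        d_hat[i] = (cases.mean() - controls.mean()) / pooled_sd
        sig[i] = stats.ttest_ind(cases, controls).pvalue < 0.05
    # Inflation: how much larger the 'significant' estimates are than the truth
    inflation = 100 * (d_hat[sig].mean() - true_d) / true_d if sig.any() else float("nan")
    print(f"n={n:5d}  power={sig.mean():.2f}  "
          f"SD of d estimates={d_hat.std():.3f}  "
          f"inflation among significant={inflation:+.0f}%")
```

Running this shows the same qualitative pattern as Fig. 3: estimates stabilize, power rises, and inflation among significant results shrinks only as n grows large.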

Why is this characteristic important for discussions of case–control paradigms versus paradigms that acknowledge heterogeneity? The pervasiveness of small sample sizes and effect size inflation in case–control studies tends to give an over-optimistic view of the utility of case–control designs. Over time, replication attempts typically dampen the enthusiasm for many such effects, because the reality is likely that most case–control effect sizes are much smaller than published small sample size studies would suggest. When initial novel case–control studies portray large effects, we may be less inclined to ask whether heterogeneity is involved. Furthermore, small case–control effects may reflect complicated heterogeneity in the autism population that hides potentially large effects restricted to specific subtypes. By focusing on heterogeneity, we are more likely to identify true population effects of much larger magnitude. If such research identifies truly large effects in relatively large samples, effect size inflation may be much less of an issue (as the simulations here demonstrate). However, any model with low statistical power can show inflated effect sizes. Therefore, models that try to explain heterogeneity can be prone to effect size inflation as well, hence the need for very large samples and high statistical power in stratified or dimensional models.

Sampling bias across strata nested in the autism population

Small sample size case–control studies that do not acknowledge heterogeneity in the autism population are also particularly problematic because increased sampling variability has a substantial biasing impact by enriching specific strata of the population over others. Ideally, to obtain a generalizable sample of the population in a case–control paradigm, one hopes that if there are unknown strata nested in the population, the sample prevalence of each stratum reflects its true prevalence in the population. If this criterion is not achieved, samples can be biased by the enrichment of certain strata of the population over others. If different strata are enriched across different studies, the studies may paint a confusing and potentially contradictory picture of the phenomenon. A primary example is the systematic over-enrichment of males over females in most case–control studies, particularly intervention and biological studies [73,74,75], which may lead to male-biased inferences about autism [76]. Another simulation, shown in Fig. 4, illustrates that small samples are much more prone to this bias of enrichment of specific strata over others. In this simulation, there are five subtypes in the autism population, each with different effects relative to the control population; therefore, enrichment of different subtypes can have dramatic effects on the results of a study. Our simulation used equal population prevalence for each subtype (i.e., 20% of the autism population), which means that from study to study, which strata are enriched is random. Obviously, in the likely scenario where population prevalence rates are asymmetrical across subtypes, enrichment would favor the subtypes with higher population prevalence.

Fig. 4

Simulation showing sampling variability and bias from enrichment of specific strata in small sample size studies. In this simulation we generated a control population (n = 1,000,000) with a mean of 0 and a standard deviation of 1 on a hypothetical dependent variable (DV). We then generated an autism population (n = 1,000,000) with 5 different autism subtypes, each with a prevalence of 20% (i.e., n = 200,000 per subtype). These subtypes vary from the control population in effect size in units of 0.5 standard deviations, ranging from −1 to 1. This was done to simulate heterogeneity in the autism population reflecting very different types of effects. For example, autism subtype 5 shows a pronounced increased response on the DV, whereas autism subtype 1 shows a pronounced decreased response on the DV. Across 10,000 simulated experiments, we then randomly sampled from the autism population at sample sizes of n = 20, n = 200, and n = 2000, and computed the sample prevalence of each autism subtype. The ideal, unbiased result would be sample prevalence rates of around 20% for each subtype. This 20% sample prevalence is approached at n = 2000, and to some extent at n = 200. However, small sample sizes such as n = 20 show large variability in the sample prevalence rates of the subtypes, and this can markedly bias the results of a case–control comparison. The code for implementing and reproducing these simulations is available at https://github.com/mvlombardo/effectsizesim
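A condensed Python sketch of this second simulation is given below. Again, this is an illustrative re-implementation rather than the authors’ code; the seed and the use of the sample mean as a stand-in for the apparent case–control effect are our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five autism subtypes at equal population prevalence (20% each), with
# subtype effects on the DV ranging from -1 to +1 in steps of 0.5 SD.
subtype_effects = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
pop_n = 1_000_000
subtype = rng.integers(0, 5, size=pop_n)
autism_dv = rng.normal(subtype_effects[subtype], 1.0)

n_experiments = 10_000
for n in (20, 200, 2000):
    prevalence = np.empty((n_experiments, 5))
    mean_dv = np.empty(n_experiments)
    for i in range(n_experiments):
        idx = rng.choice(pop_n, size=n, replace=False)
        prevalence[i] = np.bincount(subtype[idx], minlength=5) / n
        # Controls have mean 0 and SD 1, so the sample mean of the DV
        # approximates the apparent case-control effect size of that study.
        mean_dv[i] = autism_dv[idx].mean()
    print(f"n={n:5d}  SD of subtype prevalence: "
          f"{np.round(prevalence.std(axis=0), 3)}  "
          f"SD of apparent case-control effect: {mean_dv.std():.3f}")
```

At n = 20, the spread of both the subtype mix and the apparent effect is large, mirroring the study-to-study contradictions described next.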

Such biases due to sampling variability across subtypes have considerable importance for replicability. To illustrate, we give a simple example indicative of many cases in the current literature. Study 1 may unknowingly possess a sample enriched with autism subtypes that show a decreased response on some dependent variable. Study 2 unknowingly has a different autism sample enriched with subtypes that show a contradictory increased response on the same dependent variable. Both studies are published, and the authors of each may get into a heated debate, each claiming that the other is wrong. Then a third study comes out with perhaps a less biased (and possibly larger) sample, and given that the overall population effect could be near zero for a case–control comparison (as in the simulations in Fig. 4), this third study finds no difference and claims that both studies 1 and 2 are false positives. While the third study may be the clearest indication of the overall case–control effect, it too may be missing the point completely—the population under investigation is not homogeneous but stratified. Each study could therefore have merit, if better contextualized and with some attempt to grapple with heterogeneity. It is thus clear from these examples that the combined practices of running case–control studies, utilizing small sample sizes, and not fully confronting heterogeneity in autism can compound problems, produce a conflicted literature, and delay scientific progress. Given these considerations, our recommendation is to move away from small-sample case–control models and towards stratified and/or dimensional models that take into account important heterogeneity in autism.

Essential big data characteristics for studying heterogeneity

While the idea of heterogeneity in autism has been around for some time, it is understandable why autism research as a field has made only limited progress on it. Conducting research on heterogeneity can be difficult because of a lack of datasets large enough to sufficiently answer such questions. As the preceding discussion of small sample sizes suggests, we would argue that one key ingredient to successfully studying heterogeneity in autism is ‘big data’. When we use the phrase ‘big data’, we are not necessarily referring to the feature dimension of the data—that is, massively multivariate ‘feature-rich’ data (e.g., neuroimaging or genomics data). Feature-rich aspects of big data are indeed important in their own right, including for the purposes of understanding heterogeneity. Rather, the dimensions we would emphasize about big data are the participant dimension (i.e., large sample size) and the depth of the features measured on those participants. Put another way, we need big data that are both ‘broad’ and ‘deep’ [77] (Fig. 1c).

Broad data refer to the participant or sample dimension of the dataset (as opposed to the feature dimension) and are characterized by massive sample size. Such a broad spread over individuals should ideally provide good coverage of the population of interest and allow for sufficient sampling of each stratum of interest. Broad data are an essential ingredient for decomposing heterogeneity in autism, since many problems arise with data that are not sufficiently large or do not allow such broad coverage of the population. Sufficiently broad data also open up opportunities for replication, since experimental designs can be planned ahead of time to set aside a sufficiently large validation set for replicating findings from an initial broad discovery set. As data sharing and open data initiatives become more available, we should see more investigations of heterogeneity that meet this big data requirement. Some resources are immediately available to meet such needs (e.g., the ABIDE datasets [78], the National Database for Autism Research (NDAR) [79], the Simons Simplex Collection [80], SPARK [81], the Healthy Brain Network [82], and see refs. [83, 84]), and we expect many more in the coming years. As we get better at detecting the relevant dimensions and/or subtypes explaining important heterogeneity in autism, we may become better able to design high-powered targeted studies in which the requirement for massive sample size is substantially reduced. However, for most topics we are not yet at this stage, and thus broad data (i.e., massive sample size) are necessary.

Developing models to explain aspects of heterogeneity at one level is only the first step. Once we have built good models that explain heterogeneity at one level, we will need to ask the next translational question: ‘What else are these models good for?’ Put differently, stratified or dimensional models can be good at predicting phenomena at one level of analysis, but because autism is heterogeneous at multiple levels, could such models help us make sense of heterogeneity outside the domain the model was originally built upon? Answering this question has considerable relevance for precision medicine goals. For instance, a geneticist may have identified a unique biological subtype of autism based on a certain genetic mechanism. Such a genetic stratifier would already be useful for pinpointing a specific discrete cause for some proportion of the autism population. However, working towards precision medicine, we would next want to know whether such a genetic subtype differs from other autistic individuals on clinically relevant aspects such as prognosis, response to treatment, symptomatology, cognition, etc. To ask this type of question, we need big data that are not only broad but also ‘deep’ [77]. Deep data are data collected on the same individuals that penetrate through multiple levels of analysis (Fig. 1c). Deep data allow stratified or dimensional models to be built at one level while the important tests of those stratifications are done at other levels. An example can be seen in recent work on the Simons Simplex Collection, in which the authors made stratifications on the phenotype and then asked whether these stratifications increased power for detecting genome-wide association effects at the genetic level [31]. Thus, to best answer questions with stratified or dimensional models, we require big data that are both broad and deep, as this combination allows both discovery of explanations of autism heterogeneity and immediate tests of the utility of such models for explaining the multi-level complexity inherent in autism. New multi-site studies such as the EU-AIMS Longitudinal European Autism Project (LEAP) are designed to directly address both broad and deep data needs [85,86,87], and we need other efforts along these lines.

Approaches to decomposing heterogeneity in autism: top-down, bottom-up, and chronogeneity

Since decomposing heterogeneity in autism in service of precision medicine is a matter of identifying clinically and mechanistically useful models, it is helpful to make salient some different routes towards these goals. A common circumstance might be one where a researcher makes a stratification at a level higher up in the hierarchy presented in Fig. 1c. The translational next step may then be to work down towards understanding how a stratified and/or dimensional model at this higher level of analysis can explain some phenomenon at a lower level. We refer to this as a top-down approach. For example, a clinically important stratification can be made in the early development of autism regarding language outcome at 4–5 years of age. Some children keep up with age-appropriate norms in expressive and receptive language development, whereas others fall far behind in their language abilities across these domains. The empirical question after making such a stratification could be whether these autism language–outcome subtypes differentiate at the level of neural systems organization, particularly neural systems developing specialization of function for speech and language processes [22]. More recent work has also shown that variation at the level of gene expression in blood leukocytes is associated with large-scale speech-related functional neural responses, and that these gene expression–neuroimaging associations differ across autism language outcome subtypes [23]. In this example, the stratifications were made at a level of analysis above the level later interrogated for mechanistic understanding. Thus, while early language outcome is itself a clinically important stratifier, this top-down work indicates that the stratifier may also be mechanistically useful for pointing towards different underlying biology. Other examples of a top-down approach may be based on cognitive characteristics [88], sex/gender [76], and co-occurring medical and psychiatric conditions (e.g., epilepsy [89], attention-deficit/hyperactivity disorder (ADHD) [27], etc.). This type of top-down approach may ultimately motivate future work that identifies previously unknown biology behind a subset of the autism population.

In contrast to top-down approaches, an approach that works from the bottom up can be highly complementary. As the phrase implies, a bottom-up approach starts by identifying and building useful models at a lower level in the hierarchy, and then asks how such lower-level models can explain phenomena higher up. For example, in the ‘genetics first’ approach, an investigator may be interested in identifying how different high-impact genetic causes of autism are similar or different at a phenotypic or cognitive level of analysis [90,91,92,93]. In another example, an investigator may compare autism subtypes at the level of neural systems or structural brain features (e.g., with or without early brain enlargement), and then ask whether such a stratification provides a meaningful indicator of differentiation at a clinical level [14]. Both top-down and bottom-up approaches can be useful, depending on the particular research question, and each can highlight different aspects of important heterogeneity in autism. In order to link such multi-level complexity into explanations of heterogeneity in autism, it will be imperative to have work from both approaches.

A final approach to decomposing heterogeneity deals with the lifespan developmental dimension across any level of analysis, or ‘chronogeneity’ [38]. Several large longitudinal studies consistently indicate that there are autism subtypes with different developmental trajectories [16,17,18, 38, 94]. Regression (the loss of previously acquired skills), a developmental feature seen in some autistic individuals, is another key stratifier that is surprisingly under-studied but has plausible unique biological bases [95, 96]. Within the developmental dimension, heterogeneity can be assessed as both inter- and intra-individual variability, but can also cover individualized deviation from group trajectories over time [38] or from age-specific norms [97, 98]. Chronogeneity thus offers a unique vantage point on multi-level heterogeneity not covered by understanding heterogeneity at static time points.
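To illustrate what an individualized, age-referenced deviation metric can look like, here is a minimal sketch. It assumes, for simplicity, a linear age norm estimated from synthetic typically developing (TD) data; published normative models are typically nonlinear and adjust for additional covariates such as sex and site.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical TD normative data: a measure that changes with age.
td_age = rng.uniform(2, 18, 400)
td_score = 10 + 1.5 * td_age + rng.normal(0, 3, 400)

# Fit a simple linear age norm and estimate its residual spread.
slope, intercept = np.polyfit(td_age, td_score, 1)
resid_sd = np.std(td_score - (intercept + slope * td_age), ddof=2)

def normative_z(age, score):
    """Deviation from the age-specific norm, in SD units."""
    return (score - (intercept + slope * age)) / resid_sd

# An individual's score is referenced to same-age peers rather than to a
# single group average, so the same raw score can be typical at one age
# and markedly deviant at another.
print(normative_z(age=6.0, score=12.0), normative_z(age=16.0, score=12.0))
```

Repeating such measurements over time for the same individual then yields a trajectory of deviations, which is one simple way to operationalize chronogeneity.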

Approaches to decomposing heterogeneity in autism: supervised versus unsupervised

In addition to conceptualizing stratified and/or dimensional models via top-down, bottom-up, or developmental approaches, it is also important to clarify how the process of understanding heterogeneity itself is built up. Ultimately, the scientific process of better understanding heterogeneity in autism is a learning problem. Taking ideas from statistical or machine learning, we can broadly divide learning processes into supervised and unsupervised learning [99]. Supervised learning deals with a priori knowledge about a topic (i.e., known labels) and seeks to derive a model that best predicts that known information. With regard to understanding heterogeneity in autism, the analogy of supervised learning applies to all instances where the experimenter uses their own knowledge and justifications to dictate where the stratifications are made (e.g., top-down, bottom-up, or developmental). In other words, knowledge from a supervised source (e.g., an investigator, a theory) informs the stratification or dimension to be modeled. This type of approach has the advantage of being theory-driven and/or building on the expert knowledge of the investigator (e.g., clinical intuition or experience), who may already have highlighted a distinction that is meaningful and justified in a variety of ways.

The disadvantage of relying solely on a ‘supervised’ approach is that the investigator and/or theory may miss other important distinctions about how to model heterogeneity for the question of interest. In this case, the learning process can be helped by some type of ‘unsupervised’ statistical learning that uncovers distinctions not readily apparent from a priori knowledge. Because big data are a key ingredient for building models to explain heterogeneity, we can utilize the feature-rich aspects of big data to embark on data-driven discovery of potentially complex multivariate patterns that distinguish different types of individuals. We refer to this data-driven approach as ‘unsupervised’ since, computationally, the learning occurs without any expert a priori knowledge or justifications and relies solely on statistical distinctions embedded in the data itself. This approach will likely rely on advanced computational techniques from machine learning that are tailored to identify complex multivariate distinctions. For example, we utilized clustering methods taken from systems biology and applied them to item-level patterning of behavioral responses on the RMET. This unsupervised approach yielded five autism subtypes that could be replicably identified in an independent replication set (Fig. 2) [25]. In other work, Ellegood et al. [100] applied clustering to neuroanatomical phenotypes across a range of different mouse models of autism, illustrating that heterogeneous starting points (e.g., different genetic mutations highly associated with autism) can converge and diverge at the level of neuroanatomical phenotypes [100]. Using structural magnetic resonance imaging measures of cortical morphometry, Hong et al. [12] used clustering to identify three autism subtypes with different anatomical profiles. These anatomically defined subtypes were then found useful for increasing the performance of supervised learning models predicting symptom severity on measures such as the Autism Diagnostic Observation Schedule (ADOS) [12].
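Stripped to its essentials, the unsupervised recipe can be sketched as follows. The data here are hypothetical, and k-means with silhouette-based selection of the number of clusters is a deliberately simple stand-in for the consensus and hierarchical clustering methods used in the cited studies.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Hypothetical item-level response patterns (individuals x items).
X = StandardScaler().fit_transform(rng.normal(size=(300, 36)))

# Let the data, not a priori labels, suggest the number of subtypes.
best_k, best_score = 2, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(f"best k = {best_k}, silhouette = {best_score:.2f}")
```

Whatever algorithm is used, two safeguards matter: testing whether the clustering outperforms what would be found in unstructured null data, and replicating the subtypes in an independent sample, as was done in the RMET example above.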

It should be noted that supervised and unsupervised approaches each have advantages and disadvantages, and they can be complementary. An example of this complementarity can be seen in a hybrid supervised–unsupervised approach from Feczko et al. [15]. In this study, the authors utilized a supervised ensemble learning model called a Functional Random Forest (FRF) to classify autistic versus typically developing children based on cognitive features from a neuropsychological test battery. In addition to classifying the two groups, the FRF model produces a proximity matrix that indicates similarity between individuals. The authors then used this proximity matrix to identify subgroups in an unsupervised manner, using a community detection algorithm of the kind typically used in network science to discover ‘modules’. This hybrid approach to cognitive subtyping proved useful for identifying different patterns of resting state functional connectivity across the subtypes. Thus, through the scientific process of building knowledge about important stratified or dimensional models, unsupervised and supervised approaches can inform each other, and in some cases may be utilized together in a hybrid fashion.
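The proximity-matrix idea at the core of this hybrid strategy can be sketched in a few lines. The features and labels below are hypothetical, and hierarchical clustering on proximity-derived distances is used as a simple stand-in for the community detection step of the original FRF work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))   # hypothetical cognitive features
y = rng.integers(0, 2, 200)      # hypothetical diagnostic labels (0/1)

# Supervised step: train a forest to classify diagnosis.
rf = RandomForestClassifier(n_estimators=500, random_state=3).fit(X, y)

# Proximity: fraction of trees in which two individuals fall in the same leaf.
leaves = rf.apply(X)             # shape (n_samples, n_trees), leaf indices
prox = np.array([(leaves == row).mean(axis=1) for row in leaves])

# Unsupervised step: cluster individuals on proximity-derived distances.
dist = 1.0 - prox
condensed = dist[np.triu_indices_from(dist, k=1)]  # condensed distance form
subgroups = fcluster(linkage(condensed, method="average"),
                     t=3, criterion="maxclust")
print(np.bincount(subgroups)[1:])  # sizes of the three subgroups
```

The design choice worth noting is that the similarity metric is learned in a supervised step (trees split on diagnostically informative features), so the subsequent unsupervised subgrouping is anchored to clinically relevant structure rather than raw feature geometry.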

Decomposing heterogeneity in relation to transdiagnostic constructs

Although we have so far treated autism as an entity and focused on heterogeneity within it, this diagnostic construct is human-made, cumulative, and evolving [101, 102]. Phenotypically, autism frequently co-occurs with other neurodevelopmental (e.g., ADHD, tic disorders) and psychiatric (e.g., anxiety, depression, obsessive-compulsive disorder, psychotic disorders) conditions [1, 103], and heightened autistic traits often cut across other categorical diagnoses as well [104]. Underlying this may be multi-level processes cutting across sets of frequently co-occurring diagnoses [105], which can potentially be delineated by transdiagnostic approaches such as the RDoC framework [47]. In this respect, we should acknowledge that heterogeneity in autism is part of the broader heterogeneity existing across neurodevelopmental and (physical and mental) health conditions. In the same vein, the reasons, principles, and approaches described above for tackling heterogeneity in autism can be similarly applied when autism is studied within a transdiagnostic framework cutting across multiple diagnoses. Against this background of high co-occurrence, a transdiagnostic framework is necessary for deepening our understanding of heterogeneity both within and beyond autism.

Challenges for big data approaches in autism science and clinical practice

With all these advantages of big data in mind, we acknowledge that obtaining such data is easier said than done. There are key practical challenges to overcome. First, conducting studies with very large sample sizes is challenging, and perhaps only the most well-funded laboratories and/or consortia can regularly conduct such work. When investigating phenomena with stratified models, this problem is magnified, since one now needs large sample sizes within each stratum being investigated. The practical issues are further compounded by the need to replicate—a need that is absolutely necessary to build confidence in identified effects. Second, broad and deep data are both desirable, but there are inevitable tradeoffs when considering feasibility. Current initiatives to collate existing data from smaller-scale studies (e.g., ABIDE [78] and NDAR [79]) have stimulated the field of autism research to move towards broad data, and there are increasing consortium efforts taking a prospective, coordinated approach to synchronize the acquisition of broad and deep data (e.g., EU-AIMS LEAP [85,86,87], the Healthy Brain Network [82], the POND Network [106]). Continuous exchange across research teams to establish shared methodologies and measurements is critical. Yet for the field to move forward, it is also important to sustain flexibility and openness in incorporating new findings, methodologies, and measures, especially because the samples enrolled in future research must be more representative of the autistic population at large: truly diverse and inclusive (e.g., in terms of age and life stage, ethnic background, genetic make-up, socioeconomic status, cultural context, sex, and gender) in order to represent the full spectrum of individuals around the globe. Fundamental to these large-scale and long-term efforts is advocacy for funding support that encourages coordinated study designs and data merging efforts to achieve broad and deep data.

In the meantime, we believe there is still room for ‘smaller science’ in the era of big data. Contributions towards delineating heterogeneity can still be made by studies with moderately sized but adequately powered samples. We do not define ‘moderately sized samples’ in absolute terms (e.g., by some rule-of-thumb sample size applied irrespective of context); rather, what counts as moderately sized will need to be defined for each research context. At the very least, however, we intend ‘moderately sized’ to mean sample sizes that are sufficiently statistically powered (e.g., >80%) for reasonably sized effects of interest (e.g., medium or large effects). Such sample sizes differ from what we consider ‘large sample sizes’, namely situations where the sample size offers more than enough statistical power (e.g., 90–100%) even for very small effect sizes and likely homes in on point estimates of the population effect size with high precision. Moderately sized studies can make substantial progress in autism research in several ways. First, such studies could focus on examining well-defined subgroups in the autism spectrum, derived either from hypothesis-driven strata (e.g., individuals with a specific behavioral profile, specific neurobiological status, specific developmental characteristics, specific etiological factor, etc.) or from strata discovered via prior big (broad) data studies. In this scenario, case–control models could be meaningful given evidence of independent replication. However, such studies will likely yield more information if they also use stratified and/or dimensional models to capture aspects of important heterogeneity within autism. Such studies could help isolate effects specific to subsets of the autism population that are larger than the typically small effects found in case–control studies, and could help canalize research in specific directions towards better understanding such medium or large effects. Second, moderately sized studies could hone their focus on well-defined mechanisms in a hypothesis-driven manner, or conduct clinical trials that target specific mechanisms (instead of treating autism as a single category driven by a ubiquitous cause). In these scenarios, moderately sized studies are not broad, but they can dive into deep data as a way to reveal more mechanistic insight and connect multiple levels of analysis. In sum, practical limitations will likely require the field to alternate between large-N, broad investigations and more moderately sized studies that feature deep data. This strategy may facilitate future work until opportunities arise that truly allow for big data that are both broad and deep.
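To make the ‘adequately powered’ criterion concrete, required per-group sample sizes for a simple two-group comparison can be checked with a standard power calculation. A short sketch using statsmodels; the effect size values are Cohen’s conventional benchmarks, used here only for illustration and not as recommendations.

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
# Per-group n needed for 80% power at alpha = 0.05 (two-sided independent t-test)
for d in (0.2, 0.5, 0.8):  # small, medium, large effects (Cohen's benchmarks)
    n = power_analysis.solve_power(effect_size=d, power=0.80, alpha=0.05)
    print(f"d = {d}: n ≈ {n:.0f} per group")
```

Roughly 26 per group suffices for a large effect (d = 0.8), but about 64 and 394 per group are needed for medium (d = 0.5) and small (d = 0.2) effects, respectively, illustrating why ‘moderately sized’ must always be defined relative to the expected effect.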

Although big data approaches can move research closer to precision medicine goals, translating this work into real-world individualized care and support is an even bigger challenge. As in other fields of health care, person-level information that parses heterogeneity and achieves individual-level accuracy as a biomarker or predictor (e.g., BRCA gene mutations, the utility of which comes from big data science in oncology) is only one part of the whole decision-making process in health care. Optimal care and support for autistic individuals have to be embedded in a person-centered, lifespan perspective that incorporates shared decision making and collaborative action planning [107]. Big data bring clarity to our understanding of individual differences on the autism spectrum and beyond it, yet in daily clinical practice, care and support can be improved only when such clarity is integrated with a perspective that respects the individuality of the autistic person and their personal contexts.

Conclusions

Understanding how heterogeneity manifests in autism is among the biggest challenges in our field. As we continue to develop models for explaining this heterogeneity, the organizing concepts laid out here could be useful in synthesizing very diverse areas of research. Heterogeneity must be interpreted relative to the zeitgeist, particularly as it pertains to how diagnostic concepts evolve. Models for explaining heterogeneity manifest in many ways, depending on whether the researcher conceptualizes the differences between individuals as quantitative and dimensional or as qualitative and categorical. There is room for both, and for models that fuse dimensional and categorical distinctions. In general, we need to move beyond one-size-fits-all models such as case–control models, and we need to be stringent with respect to methodology, since practices such as small sample size research cannot live up to the challenges that heterogeneity creates. Small samples cannot adequately cover heterogeneity in the autism population in a generalizable fashion; hence the need for ‘big data’ when studying heterogeneity. Big data should be both broad and deep, not only to sample adequately across different strata of the population but also to examine how strata defined at one level may be relevant for explaining variability at other levels. Heterogeneity can be parsed via multiple approaches that capitalize on information from levels of analysis either most proximate to or most distal from the clinical phenotype and work their way down or up through the hierarchy, or via examination of change across development. Also important for conceptually organizing work on this topic is whether we use a priori knowledge to build heterogeneity models or whether we allow computational methods to inform us about data-driven distinctions that may be hidden and not readily apparent to most researchers. Models for understanding heterogeneity can move beyond those with clinical diagnoses of autism, and, in the future, transdiagnostic approaches utilizing similar organizing concepts may provide complementary information. Overall, the push to understand heterogeneity is critical as we attempt to move towards precision medicine, which will need to be embedded in person-centered, lifespan-informed, shared decision making and collaborative care planning to provide holistic support for each unique autistic individual.