Rates of autism spectrum disorder (ASD) continue to climb, now impacting 1 in 68 individuals in the United States.1 Despite important progress in understanding the genetics of ASD,2, 3 ASD remains diagnosed through behavioral examination. The diagnosis of ASD is currently made using instruments designed to measure impairments in the two core domains of ASD, as defined by the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-V): (1) communication and social interaction and (2) restricted interests and repetitive behaviors. The Autism Diagnostic Observation Schedule (ADOS)4 is one of the most widely used instruments to assist in ASD diagnosis. The ADOS consists of a series of semi-structured activities designed to elicit specific behaviors of social interaction, communication, imaginative use of objects, restricted interests and repetitive behaviors. The diagnostic test is split into four modules, each tailored to specific individuals based on their language and developmental level to ensure coverage of a diverse set of behavioral manifestations.4 A certified professional at a clinical facility first administers the ADOS examination and then scores the individual based on his or her observations to determine the final diagnosis. The initial assessment can take between 30 and 60 minutes, and the scoring increases the total time to between 60 and 90 minutes. Due to variance in inter-rater reliability, additional professionals may re-score the individual, further increasing the time between testing and receipt of the official clinical diagnosis.4

Even ignoring the geographic and logistical hurdles in finding a certified professional to administer the ADOS, the time required for the exam and the rise in the number of children at risk for ASD have contributed to increasing bottlenecks in the healthcare system.5 The average age of diagnosis in the United States hovers stubbornly around 4 years,5 and families may wait as long as 13 months for the diagnosis after the initial screening,6 and even longer if they are from a minority population or are of lower socioeconomic status.7 Such delays impede early intervention speech and behavioral therapies that provide substantial benefits to children.8, 9 For the estimated 27% of individuals undiagnosed at 8 years of age,5 opportunities for therapeutic intervention have dissipated. Therefore, risk assessment and triage tools that can reach families earlier and enable them to receive the care they need are badly needed.

Given the promising findings from our previous work on the first module of the ADOS10, 11 and the ADI-R,12 we postulated that we might obtain similar results when examining records from the other two modules of the ADOS, which apply to a large portion of the population suspected of having an ASD.13 Improving upon our previous work, here we utilized the best-estimate clinical diagnosis when possible and incorporated stepwise backward feature selection into our machine learning pipeline to quantitatively select the optimal set of significant behavioral features that can accurately detect ASD risk in a large population of individuals. We assembled a collection of ADOS evaluations for 4540 individuals and developed a classifier for each module that exhibited optimal performance in classification of individuals both on and off the spectrum. Each classifier was trained on over 600 individuals and tested independently on more than 1000 individuals. The resulting classifiers contained fewer items than the ADOS-2 (ref. 14) and pinpointed several behaviors that could help guide future efforts focused on expeditious observation-based screening both in and out of clinical settings.

Materials and methods

Data sets

Data for modules 2 and 3 came from five separate repositories: Boston Autism Consortium (AC), Simons Simplex Collection v14 (SSC),15 Autism Genetic Resource Exchange (AGRE),16 National Database of Autism Research (NDAR)17 and the Simons Variation in Individuals Project (SVIP)18 (Table 1). The ADOS examination classified individuals into three discrete categories (autism, autism spectrum, and non-spectrum) by summing the scores from a subset of items from the ADOS and cross-referencing this total score with the thresholds for autism, autism spectrum and non-spectrum. ADOS scores for each item fall on an integer scale of 0–3, with scores of 7 or 8 reserved for behaviors not exhibited during the test. In a preprocessing step, the ADOS algorithm recodes scores of 3 to 2 and scores of 7 or 8 to 0 to improve reliability and validity.14 For our analyses, we recoded scores of 7 and 8 as 0, but elected to leave scores of 2 and 3 as distinct answer codes to increase granularity in the classification. In addition, we grouped strict autism and autism spectrum categories together into one autism spectrum cohort, leaving only two classes for machine learning, an autism spectrum class and a non-spectrum class.

Table 1 Training and testing data description

Recruitment varied by study. Individuals in AC, AGRE, NDAR and SSC were recruited with a suspicion of having ASD, and individuals in the SVIP were required to have or be related to an individual with the 16p11.2 duplication/deletion.19 Gender remained consistent across both modules; males comprised 82–86% of individuals with ASD and 61–63% of individuals without ASD. The intelligence quotient (IQ) was consistent across both modules and between individuals with and without ASD (Table 2). Due to the diverse phenotypic effects of the 16p11.2 duplication/deletion, individuals in SVIP were enriched for comorbidities, including ADHD, developmental coordination disorder, phonological disorder and others. Thus, the individuals in the SVIP proved useful for testing the specificity of the algorithms (i.e., differentiating between ASD and other behavioral disorders and developmental delays). A complete description of the phenotypic diversity of the samples used is provided in Supplementary Table S1.

Different versions of the ADOS were used in each data set, namely ADOS Version 1 (ref. 4) (AC, SSC, AGRE and NDAR), and ADOS version 2 (ref. 14) (SVIP). To ensure consistency across data sets, we computed the ADOS-2 diagnosis for all individuals in AC, AGRE, NDAR and SSC using the ADOS-2 algorithms. We elected to do this because the ADOS-2 incorporates repetitive and restrictive behaviors, and it has been shown to more accurately identify cases from non-spectrum controls in lower-functioning populations.14 Not all individuals had either a clinician’s diagnosis or the best-estimate clinical diagnosis. Specifically, 76% of the ASD cases and 46% of the non-autism controls had a recorded clinician’s diagnosis or the best-estimate clinical diagnosis. Therefore, we elected to use the diagnosis provided by the ADOS-2 algorithm for our classifier labels in the training processes.

Table 2 Sample description

Machine learning

We used machine learning to develop two classifiers: one derived from ADOS module 2 and the other from ADOS module 3. For each module, our strategy involved training eight different machine learning algorithms (Table 3) using stepwise backward feature selection, and testing the final classifier on four independent data sets. We chose stepwise backward feature selection over stepwise forward feature selection to allow for interactions between features.20 We used each module’s items as features, and the individuals’ ADOS-2 diagnoses as our prediction class. All machine learning analyses were performed in R and Weka21 (version 3-7-9). As the number of individuals with ASD outnumbered those without in both module 2 (~4:1) and module 3 (~5:1) across all data sets, we selected the data set with the highest number of individuals without ASD as our training set. Module 2 classifiers were trained from an NDAR collection of 362 with ASD and 282 individuals without ASD. Module 3 classifiers were trained on AGRE, with 510 individuals with ASD and 93 individuals without ASD (Table 1).

Table 3 Machine learning algorithms used in training

The 28 features for each of module 2 and module 3 were ranked using a support vector machine (SVM) based on their ability to differentiate between individuals with and without ASD. We used stepwise backward feature selection with 10-fold cross-validation in all eight machine learning algorithms. This feature selection procedure determined the optimal number of features by first training a classifier with all 28 features, iteratively removing the lowest-ranked feature, and building a new model using 90% of the data for training and the remaining 10% for testing. The process ended once a single feature remained, yielding a final set of 28 classifiers, which could each be assessed for their sensitivity and specificity. By plotting the sensitivity, specificity and accuracy of each classifier versus the number of features, the best classifier was identified as the one with the highest performance and smallest number of features (Figure 1). We aimed to maximize specificity (the true negative rate) over sensitivity (the true positive rate) because of the large class imbalance (Table 1).

Figure 1
figure 1

Module 2 logistic regression and logistic model tree (LMT) training results. Sensitivity and specificity of the module 2 logistic regression and LMT classifiers based on the number of features used during training on the National Database of Autism Research are provided in Table 1. The nine-feature logistic regression classifier (blue dot) was used in testing.


After finding the optimal classifiers for modules 2 and 3, we validated these classifiers on the remaining four data sets not used for training. The module 2 classifier was tested on AC, AGRE, SSC and SVIP, totaling 1089 individuals with ASD and 66 individuals without ASD (Table 1). The module 3 classifier was tested on AC, NDAR, SSC and SVIP, totaling 1924 individuals with ASD and 214 individuals without ASD (Table 1).


Module 2 results

Two algorithms using the same nine features displayed optimal performance on the NDAR training data (98.90% sensitivity, 98.58% specificity and 98.76% accuracy), a logistic regression22 and a logistic model tree (LMT)23 (Table 3; Figure 1). LMTs combine decision trees with logistic regression, thereby allowing the incorporation of nonlinear patterns into the model. When such nonlinear patterns exist and help explain additional variance in the data, LMTs outperform logistic regression.23 However, in our data, no such patterns were detected and the nine-feature LMT consisted of just the root node with a logistic regression model. Thus we chose logistic regression over LMT for use in further testing and validation.

For independent validation of the nine-feature logistic regression classifier, we collated score sheets for module 2 from the AC, AGRE, SSC and SVIP (Table 1) to determine whether the classifier could recapitulate the sensitivity and specificity of training data on held-out test data. Across our four test sets, the logistic regression classifier misclassified 13 out of 1089 individuals with ASD (98.81% sensitivity) and 7 out of 66 individuals without ASD (89.39% specificity), resulting in 98.27% accuracy (Supplementary Table S2). Of the 13 misclassified individuals with autism, 6 had a clinical diagnosis of autism, 3 had a clinical diagnosis of pervasive developmental disorder–not otherwise specified and 1 had a best-estimate clinical diagnosis of non-spectrum. For the seven misclassified individuals without autism, three had a non-spectrum clinical diagnosis, three had an autism best-estimate clinical diagnosis and one individual had a clinical diagnosis of broad spectrum. For a subset of individuals, their best-estimate clinical diagnosis was available (autism N=618, non-spectrum N=35). When independently predicting the best-estimate clinical diagnosis, the sensitivity and specificity of the nine-feature logistic regression model was 98.38% and 88.57%, respectively.

Although the ADOS-2 module 2 uses different algorithms for individuals based on their age, our nine-feature logistic regression classifier does not.14 Because age and the log-odds of the prediction were significantly correlated (r=0.45; P<2.2 × 10−16), we hypothesized that adding age as a covariate to the regression might explain additional variance in the outcome. However, the effect of age on the classifier was negligible (β 0.015, odds ratio 1.055), and adding it to the model slightly decreased sensitivity (−0.28%) and accuracy (−0.16%). Therefore, we elected not to incorporate age into the regression. IQ measures were also significantly correlated after controlling for gender, including full-scale IQ (r=−0.37; P<2.2 × 10−16), verbal IQ (r=−0.42; P<2.2 × 10−16) and nonverbal IQ (r=−0.27; P<3.8 × 10−15).

The behaviors tested assessed by the module 2 classifier segregated into the two domains associated with ASD: (1) social communication and social interactions and (2) restricted interests and repetitive behaviors. Feature A5 (stereotyped/idiosyncratic use of words or phrases), A8 (descriptive, conventional, instrumental or informational gestures), B1 (unusual eye contact), B3 (shared enjoyment in interaction), B6 (spontaneous initiation of joint attention), B8 (quality of social overtures) and B10 (amount of reciprocal social communication) correspond to the domain of social communication and interaction. D2 (hand and finger and other complex mannerisms) and D4 (unusual repetitive interests or stereotyped behaviors) stem from the domain of restricted interests and repetitive behaviors.

Module 3 results

Of the eight machine learning algorithms trained for module 3, the radial kernel SVM24 performed best overall on the AGRE training data (100% sensitivity, 98.92% specificity and 99.83% accuracy) (Table 3; Figure 2) and contained 12 behavioral features. This 12-feature SVM classifier was tested on the four data sets not used in training: AC, NDAR, SSC and SVIP. Across the four test sets, our classifier misclassified 44 out of 1924 individuals with ASD and 6 out of 214 individuals without ASD (97.71% sensitivity, 97.20% specificity and 97.66% accuracy) (Supplementary Table S3). Of the 44 individuals with ASD who were misclassified, clinical diagnoses were available for 30. Six had a confirmed autism diagnosis, six had Asperger’s disorder and the remaining 18 had pervasive developmental disorder–not otherwise specified. For the six individuals without autism that were misclassified, three had a non-spectrum clinical diagnosis, and the remaining three individuals had no recorded clinical or best-estimate clinical diagnosis. For the individuals for whom a best-estimate clinical diagnosis was available (autism N=1568; non-spectrum N=175), the 12-feature SVM displayed 99.11% sensitivity and 70.86% specificity (Figure 3).

Figure 2
figure 2

Module 3 SVM training results. Sensitivity and specificity of the module 3 SVM classifier based on the number of features used during training on Autism Genetic Resource Exchange are provided in Table 1. The 12-feature SVM classifier was used in testing. SVM, support vector machine.

Figure 3
figure 3

Module 3 SVM test results. The 12-feature SVM decision values from testing data for the two classes: autism (red) and non-spectrum (blue). Forty-four misclassified individuals with autism (red triangles), and six individuals without autism (blue circles) contributed to 97.71% sensitivity and 97.20% specificity. ADOS, Autism Diagnostic Observation Schedule; SVM, support vector machine.

Similar to the module 2 classifier, the features in the module 3 SVM classifier aligned with the two core domains of ASD. Feature A7 (reporting of events), A8 (conversation), A9 (descriptive, conventional, instrumental or informational gestures), B1 (unusual eye contact), B2 (facial expressions directed to others), B7 (quality of social overtures), B8 (quality of social response) and B9 (amount of reciprocal social interaction) correspond to the domain of social communication and interaction. A4 (stereotyped/idiosyncratic use of words or phrases), D1 (unusual sensory interest in play material/person), D2 (hand and finger and other complex mannerisms) and D4 (excessive interest in unusual or highly specific topics or objects) stem from the domain of restricted interests and repetitive behaviors.


Despite significant evidence for the genetic heritability of ASD,25 it remains diagnosed through behavior. Although use of standard instruments for ASD diagnosis has been effective, the practice remains difficult to scale and time intensive, contributing to the growing waiting times between initial warning signs and diagnosis. Machine learning techniques have been previously applied by our group and others to test whether ASD10, 11, 12 and ADHD26 detection can be achieved with smaller numbers of behavioral measurements. Here, we sought to expand upon our previous work to a wider range of ages and levels of vocabulary by applying machine learning techniques to recorded clinical evaluations of individuals using modules 2 and 3 of the ADOS. We implemented stepwise backward feature selection with eight machine learning algorithms to create small but robust classifiers that retained levels of sensitivity and specificity similar to those of the full ADOS. The logistic regression algorithm produced the top-performing classifier for module 2 using nine features that exhibited 98.81% sensitivity and 89.39% specificity when tested across 1089 individuals with ASD and 66 individuals without ASD. A SVM consisting of 12 behavioral items showed the optimal performance when run on score sheets from module 3, exhibiting 97.71% sensitivity and 97.20% specificity when tested across 1924 individuals with ASD and 214 individuals without ASD.

Both the module 2 and module 3 classifiers contained a large number of items found on the ADOS-2 algorithms, suggesting that our abbreviated classifiers preserve much of the diagnostic validity of the original algorithm. However, we cannot discount the inherent bias in features used in the ADOS-2 algorithm, as those features are used in forming the diagnosis. Despite this, several features in both the ADOS-2 module 2 and module 3 algorithms ranked low in their classification ability. In module 2, A7 and B11 were ranked 13th and 25th, whereas in module 3, B4 and B10 ranked 13th and 14th, respectively, out of the 28 features. The low ranking of these features can be explained by lack of variation in responses between individuals with and without ASD. Of the 9 and 12 features used in the module 2 and 3 classifiers, five behaviors overlapped between the two machine learning classifiers identified in our study, namely unusual eye contact, quality of social overtures, amount of reciprocal social interaction, descriptive, conventional, instrumental and informational gestures, and hand, finger and other complex mannerisms. Since each module of ADOS is designed for a specific level of developmental ability, the inclusion of these five features in both classifiers may reflect their relative importance to the classification of ASD independent of the language and developmental level of the individual.

When performing a clinical evaluation of an individual with ASD using ADOS modules 2 and 3, the clinician uses 14 prescribed activities designed to elicit specific behaviors by the subject under evaluation. It is possible that the smaller number of behaviors represented in our classifiers may correspond to a compensatory reduction in the number of activities needed for an ASD risk assessment. For example, 3 of the 14 activities in module 2 (Table 4) and 6 of the 14 activities in module 3 (Table 5) would no longer be required to measure the behaviors used in the classifiers (Supplemental Discussion). Further examination and testing of this possibility is certainly needed, but it supports the possiblity that use of fewer behaviors may translate to shorter timeframes for observation. We have previously tested the potential for detection of risk for ASD in short home videos,27 and we hope in future studies to test whether the behaviors used in the classifiers presented here may also be adequately measured in short home video clips.

Table 4 Module 2 activities
Table 5 Module 3 activities

Lastly, the output of the module 2 logistic regression classifier provides a quantitative score of the log-odds of the confidence in the classification. Borderline log-odds indicate lower confidence, and therefore need for more testing, before arriving at a risk score and/or diagnosis. The ability to quantitatively measure risk provides another dimension to understand the prediction from the classifier itself. Disagreements among diagnostic exams are not uncommon.28 By providing the probability of the classification, the module 2 logistic regression classifier could assist in instances of uncertainty. In additionally, if such a scoring system could be used as a pre-clinical screening method, it may be possible to prioritize individuals based on the log-odds of the classification—enabling brief appointments for individuals with clear risk, and longer appointments for individuals that prove clinically challenging.


Given that our study focused on analysis of archival records, we were limited by the content of these preexisting data sets. Due to the nature of recruitment, there was a large imbalance in favor of individuals with ASD versus those who tested negative for ASD in AC, AGRE, NDAR and SSC (Table 1). Although the AC and SSC were family-based studies and collected detailed phenotype data for all family members, the ADOS and the ADI-R were administered only to the child with risk for ASD and not to the parents (N=2760) or unaffected siblings (N=2278). Therefore, the individuals without a confirmed ASD in this study were all at least initially suspected of having ASD and administered an ADOS. As such, these non-spectrum individuals served as valuable controls for our study, helping to support the possibility that our classifiers can distinguish between individuals with ASD and those with other developmental delays (Supplementary Table S1). To further measure the specificity of such classification tools, more effort is needed both to balance the number of individuals with and without ASD and to recruit individuals confirmed to have other developmental and/or learning delays.

Defining an appropriate 'truth set' for classifier construction and validation is an important challenge in the field. For ASD, the choice of the truth set is typically among the ADOS, ADI-R, the clinician’s diagnosis and the best-estimate clinical diagnosis or some combination thereof.14 However, none of the potential truth sets are truly independent, as the ADOS and ADI-R can (and often should) influence the clinician’s diagnosis and all three can contribute to the best-estimate clinical diagnosis.28 In the present study, we used the ADOS-2 diagnosis for our truth set during the machine learning trainining processes, given the class imbalance and the fact that 54% of the individuals who tested negative for ASD by the ADOS-2 were missing both the clinician’s and best-estimate clinical diagnosis. Yet in our independent validation procedures, we tested the classifiers’ performance against all available best-estimate clinical diagnoses. Both analyses provided encouraging results, suggesting that measurement of fewer behaviors can achieve results similar to a full ADOS exam and/or a clinical decision. Nevertheless, it is important to note that the high performance exhibited by the classifiers is based on a truth set that contains subjective observations, and therefore potential biases.14


Time-intensive behavioral examinations and questionnaires are currently the primary methods used in the diagnosis of ASD. Using machine learning, we created classifiers from two modules of one of the most universally administered behavioral tests, the ADOS. The logistic regression classifier based on analysis of archival records from ADOS module 2 consisted of nine items, 67.86% fewer than the complete ADOS module 2, and performed with 98.81% sensitivity and 89.39% specificity in independent testing. The SVM module 3 classifier based on analysis of archived ADOS module 3 records consisted of 12 items, 57.14% fewer than the complete ADOS module 3, and performed with more than 97% sensitivity and specificity in testing. These results support the notion that fewer behaviors when measured using machine learning tools can achieve high levels of accuracy in autism risk prediction. Furthermore, these results may help encourage future efforts to develop screening-based instruments for ASD detection and mobile health approaches that ultimately enable individuals to receive more expedient care than is possible under the current paradigms.