Using 2D video-based pose estimation for automated prediction of autism spectrum disorders in young children

Clinical research in autism has recently witnessed promising digital phenotyping results, mostly focused on the extraction of single features such as gaze, head turn on name call, or visual tracking of a moving object. The main drawback of these studies is their focus on relatively isolated behaviors elicited by largely controlled prompts. We recognize that while the diagnostic process entails the identification of specific behaviors, ASD also involves broad impairments that often transcend single behavioral acts. For instance, atypical nonverbal behaviors manifest through global patterns of atypical postures and movements, and through fewer gestures, often decoupled from visual contact, facial affect and speech. Here, we tested the hypothesis that a deep neural network trained on the non-verbal aspects of social interaction can effectively differentiate between children with ASD and their typically developing peers. Our model achieves an accuracy of 80.9% (F1 score: 0.818; precision: 0.784; recall: 0.854), with the prediction probability positively correlated to the overall level of autism symptoms in the social affect and restricted and repetitive behaviors domains. Given the non-invasive and affordable nature of computer vision, our approach holds reasonable promise that reliable machine-learning-based ASD screening may become a reality in the not-too-distant future.

Autism spectrum disorders (ASD) are a group of lifelong neurodevelopmental disorders characterized by impairments in social communication and interactions, and the presence of restricted, repetitive patterns of interests and behaviors 1 . Despite advances in understanding the neurobiological correlates of these disorders, there is currently no reliable biomarker for autism, and the diagnosis relies solely on the identification of behavioral symptoms. Although ASD can be detected as early as 14 months 2 and with high certitude before two years of age 3 , the latest prevalence reports reveal that more than 70% of affected children are not diagnosed before the age of 51 months 4 . Earlier diagnosis is associated with a significantly better outcome, even in the absence of a highly specialized intervention program. Indeed, specific strategies can be deployed to optimally support the child's development during a period of enhanced brain plasticity 5 . Previous studies have demonstrated a linear relationship between age at diagnosis and cognitive gain 6,7 , whereby children diagnosed before the age of two can gain up to 20 intellectual quotient (IQ) points on average over the first year following diagnosis, while children diagnosed after the age of four do not show any substantial cognitive gain even with adequate intervention 7 . Efficient early screening, followed by early diagnosis, is the cornerstone of timely intervention. Most currently used screening tests are questionnaire-based and perform with low to moderate accuracy 8 . Further, they are prone to recall and subjectivity biases 9 . To overcome these limitations, tools that can deliver objective and scalable quantification of behavioral atypicalities are needed, particularly for the early detection of signs indicative of autism.
A growing number of studies focus on the objective quantification of behavioral patterns relevant to autism, using advances in machine learning (ML) and computer vision (CV) (for a review see 10 ). For instance, Hashemi and colleagues 11 developed an automated CV approach measuring two components of an early screening test for autism 12 . By tracking facial features, they automatically measured head turns to disengage attention from an object and head turns to visually track a moving object, behaviors that had previously been scored only manually. Another study, using a name-calling protocol coupled with CV, corroborated the well-established clinical finding that toddlers with ASD respond less frequently when their name is called 13 . Beyond the automation of well-established behavioral coding procedures, the use of these advanced technologies has allowed more subtle, dynamic measures of behavioral markers that would otherwise elude standard human coding. Indeed, applying CV to the name-calling protocol revealed that, when children with ASD respond to their name, they tend to do so with an average 1-s delay compared to typically developing children 13 . In other studies, the use of motion capture and CV made it possible to measure the reduced complexity of emotional expression in children with ASD, especially in the eye region 14 . Additionally, the combined use of motion capture and CV has provided insights into (1) atypical midline postural control in autism 11,15 , (2) highly variable gait patterns in ASD 16 and (3) unique spatio-temporal dynamics of gestures in girls with ASD 17 that had not been highlighted in standard clinical assessments. Altogether, these studies demonstrate how computer vision and machine learning technologies can advance the understanding of autism, as they have the potential to provide precise characterizations of complex behavioral phenotypes.
Studies using ML and CV have made a substantial contribution to the understanding of the disorder, offering precise, objective measures of behavioral features that were traditionally assessed mostly qualitatively, if at all. However, important work remains to enhance the scope and scalability of this approach. Most studies in this domain used fairly small samples, addressed rather specific questions focusing on one individual at a time, and measured behaviors elicited in controlled contexts 10 . A recent study undertook an effort to deploy a more holistic approach and, besides measuring the unique signature of the child's behavior pattern, also focused on the child's relation to the immediate social context 18 . The authors used motion tracking to measure approach and avoidance behaviors and the directedness of children's facial affect during the diagnostic assessment, the Autism Diagnosis Observation Schedule (ADOS 19,20 ). With these objective measures, the authors accounted for 30% of the variance in the standardized scores measuring the severity of autistic symptoms, from only 5-min excerpts of free-play interaction with the examiner. These results are promising, as they do not focus on an individual in isolation but are the product of a more complex effort: the dynamic measurement of the child's relatedness to the social world. There is a critical need for a more holistic stance to tackle the complex task of measuring how the child with autism interacts socially in settings close to everyday situations, advancing towards a fully ecological and scalable approach.
Here, we present a machine learning algorithm to discriminate between children with ASD and typically developing (TD) children. From videos acquired in our larger study on early development in autism, featuring social interactions between a child (with ASD or TD) and an adult, we trained a deep neural network on recordings of the gold-standard diagnostic assessment 19,20 . The dimensionality of the input videos was reduced by applying the multi-person 2D pose estimation technology OpenPose 21 to extract skeletal keypoints for all persons present in the video (see Fig. 1). Following 22 , we then applied a CNN-LSTM architecture suited to action recognition. Our goal was to explore the potential of purely non-verbal social interactions to inform automated diagnostic class attribution. The data included in this study comprised a Training Set (34 videos from TD children and 34 from children with ASD, age range 1.2-5.1 years old) and two validation samples, namely Testing Set 1 (34 videos from TD children and 34 from children with ASD, age range 1.2-5.1 years old) and Testing Set 2 (n = 101, uniquely children with ASD, age range 2-6.9 years old) (see Table S1). The trained model distinguished children with ASD from TD children with an accuracy exceeding 80%. These results hold potential for accelerating and automating the autism screening process, in a manner that is robust and only minimally influenced by video recording conditions.

Results
The final model architecture was obtained upon testing various configurations (see Fig. S2 and Supplementary section). The retained model was trained on the Training Set videos (68 ADOS videos, equally split between the ASD and TD groups; see Table S1), which contained solely skeletal information on a black background, without sound (see Fig. 1). Figure 2, Figure S1, the Methods and the Supplementary sections detail the different stages of model training and validation. Predictions were obtained for individual 5-s video segments and aggregated over the entire ADOS video for each subject in the two testing sets (see Fig. 2). We further examined the stability of the diagnosis prediction as a function of the video input length. Finally, we explored the potential of a non-binary, continuous value of ASD probability to capture meaningful clinical characteristics, examining whether standardized scores obtained from various gold-standard clinical assessments related to the ASD probability extracted from the neural network.
Prediction of autism. Our model achieved an F1 score of 0.818 and a prediction accuracy of 80.9% over the sample of 68 videos in Testing Set 1 (Table 1). The same trained model achieved a prediction accuracy of 80.2% over Testing Set 2, comprising 101 videos from children with ASD, thus confirming the model's stability.
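To illustrate how these summary statistics fit together, the sketch below reconstructs one confusion matrix that is approximately consistent with the reported Testing Set 1 metrics. The counts (TP = 29, FP = 8, FN = 5, TN = 26, for 34 children per group) are an illustrative assumption, not values taken from this study.

```python
# Hypothetical confusion-matrix counts chosen to approximately reproduce the
# reported accuracy (80.9%), precision (0.784), recall (0.854) and F1 (0.818)
# for 34 ASD and 34 TD children; the counts themselves are an assumption.

def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

metrics = classification_metrics(tp=29, fp=8, fn=5, tn=26)
```

With these counts, accuracy is 55/68 ≈ 0.809 and F1 is 58/71 ≈ 0.817, matching the reported values up to rounding.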
Consistency of the ASD prediction over the video length. We further tested the extent to which video length influenced our model's prediction accuracy. By varying the length of the video input in Testing Set 1, we demonstrated that an average accuracy of 70% is already obtained with 10-min video segments (see Fig. 3A). As shown in Fig. 3B, the prediction consistency is also very high across the video of a single individual, even with relatively short video segments. For instance, for half of the ASD sample, our method achieves 100% consistency in prediction based on randomly selected 10-min segments. These results strongly advocate for the feasibility of video-based automated identification of autism symptoms. Moreover, the ADOS videos used in the present study were acquired using different recording systems, yet the classification accuracy was robust to this variability in context, again highlighting the generalization potential of this type of approach (see Supplementary section and Fig. S4).

Figure 2.
Neural network architecture. The pretrained convolutional network VGG16 22 was used to extract the characteristics of all videos split into 5-s segments. The output of this feature extraction step was fed into an LSTM network operating with 512 LSTM units. Finally, the output of the LSTM was followed by a fully connected layer of 512 ReLU-activated units and a softmax layer yielding two prediction outputs for the given segment. The segment-wise classifications were aggregated over the video's duration to obtain a final prediction value (ranging from 0 to 1) that we denote "ASD probability". The video was classified as belonging to a child with ASD if the mean ASD probability was superior to 0.5.

The above correlations were based on ADOS severity scores, representing a summarized measure of the degree of autistic symptoms. In addition, we were interested in understanding how the automatically derived ASD probability related to individual autistic symptoms, potentially informing us about the symptoms most closely related to ASD class attribution. After applying Bonferroni corrections, the ASD probability was positively related to three symptoms in the communication domain of the ADOS, namely gestures (rs(68) = 0.435, p < 0.001), pointing (rs(68) = 0.540, p < 0.001) and intonation (rs(68) = 0.426, p < 0.001) (Fig. S6, panels A-C). In the social interaction domain of the ADOS, the ASD probability was related to unusual eye contact (rs(68) = 0.500, p < 0.001), directed facial expressions (rs(68) = 0.488, p < 0.001), spontaneous initiation of joint attention (rs(68) = 0.450, p < 0.001) and the integration of gaze and other behaviors. To account for the collinearity among ADOS symptoms, we deployed a multivariate analysis, Partial Least Squares Correlation (PLS-C) 23,24 (see "Methods" section). This analysis is particularly suitable when there is no assumption of independence among the tested variables.
In this manner, alongside taking into account the correlations among the symptoms, we were able to show the symptom pattern that best explained the variation in the NN-derived ASD probability. The PLS-C yielded a significant latent component (p = 0.001) best explaining the cross-correlation pattern between the ASD probability and the symptom patterns. As shown in Fig. 5, the symptoms that contributed most to the pattern were predominantly non-verbal (e.g. Facial Expressions, Quality of Social Overtures, Gestures, etc.). These symptoms depend heavily on the coordination of various non-verbal behaviors, such as directing facial affect (through head orientation) or using various communicative gestures. The symptoms contributing least to the pattern were verbal symptoms, including Immediate Echolalia and Stereotyped Language, which are especially discriminative in children with more fluent verbal language, which was not characteristic of the current sample. Intonation was the only symptom with predominantly vocal components that loaded highly on the current LC. However, in our clinical experience, higher scores on intonation are usually accompanied by higher scores on many non-verbal behaviors. Atypical intonation is seen in more severely affected children, who are usually minimally verbal and produce many undirected vocalizations, simultaneously bringing them higher scores in many other symptom domains, such as atypical social overtures, atypical social response and lack of directed facial expressions.

Discussion
Our neural network model operated on a low-dimensional postural representation derived from videos of social interaction between a child and an adult, and robustly distinguished whether the child has autism, with a prediction accuracy of 80.9% (precision: 0.784; recall: 0.854). We chose a model that operates on a relatively reduced set of characteristics, targeting non-verbal features of social interaction and communication, which are particularly relevant in very young children 19,20 . We deployed an LSTM network learning the temporal dependencies of relations between skeletal keypoints in 5-s interaction segments to perform the classification. The clinical validity of our findings was corroborated by positive correlations between the neural-network-derived ASD probability and levels of autistic symptoms, and negative correlations between the same measure and the cognitive and everyday adaptive functioning of these children. Moreover, we showed that a classification accuracy of around 70% was achieved based on only 10 min of filmed social interaction, opening avenues for developing scalable screening tools using shorter video excerpts.
The motivation for reducing the dimensionality of the input videos (2D estimated postural representations) was twofold: to allow a pure focus on non-verbal interaction patterns and to ensure de-identification of the persons involved (skeletal points plotted against a black background). To further promote the approach's scalability, videos were not manually annotated; thus, we removed the human factor from the initial feature breakdown. Our design aligns with and expands on findings from a small number of studies that probed automated video data classification 25,26 . Zunino et al. 25 were able to determine a specific signature of the grasping gesture in individuals with autism using raw video inputs. This approach achieved a classification accuracy of 69% using 12 examples of a grasping action per individual, obtained in well-controlled filming conditions. In contrast, in our study, the input videos were remarkably heterogeneous in terms of recorded behaviors. They included moments of interaction between a child and an examiner in the presence of caregiver(s). Moreover, different examiners performed the assessments, as the videos were acquired as part of an ongoing longitudinal study. Finally, regarding purely physical aspects, these assessments took place in several different rooms and were filmed with different camera systems. Nevertheless, our study's classification accuracy is superior to that reported in the study using controlled videos of a very precise grasping action.
The clinical validity of our findings is supported by the significant correlations observed at the level of individual autistic symptoms (single items from the ADOS). The inability to adequately integrate gaze with other means of communication and the impaired use of visual contact, together with the reduced orientation of facial affect, were the symptoms most related to the neural-network-derived probability of ASD class attribution. Other symptoms strongly linked to the probability of receiving ASD class attribution comprise aberrant gesture use, unusual social overtures, repetitive patterns of behaviors and unusual sensory interests. These findings emphasize the potential of these low-dimensional social interaction videos to convey the atypicality of non-verbal interaction patterns in young children with ASD. Indeed, out of the 27 selected behaviors, the 15 that were significantly related to the neural-network-derived ASD probability were symptoms with a predominantly non-verbal component (e.g., giving, showing, spontaneous initiation of joint attention). This is in line with findings from a study applying ADOS-like ratings to home videos 27 , which found that the aberrant use of visual contact was the feature most determinant of ASD classification. Another study, building an automated screener for autism symptoms based on annotated home videos, reported that the screener capitalized on non-verbal behaviors (e.g. eye contact, facial expressions) in younger participants while relying more on verbal features in older participants 28 . Indeed, clinically, the aberrant use of visual contact and aberrant gesture use are among the most striking and early emerging features of the disorder 19,20,29 .
The major contribution of automated identification of behaviors indicative of autism lies in enhancing the efficiency and sensitivity of the screening process. The diagnostic process is complex and delicate and is unlikely to be fully automated any time soon. However, more informed, more objective and more widely available screening is crucial to catalyze diagnosis referrals, hopefully leading to earlier intervention onset. Early interventions are of life-changing importance for individuals with autism: they improve cognitive and adaptive functioning and reduce the severity of ASD symptoms 6,30 . Children who receive early intervention in the year following the diagnosis and start developing language skills before the age of 3 show the most important gains as young adults 31,32 .
Our results speak in favor of more objective, holistic, automated methods as complementary tools to those used in clinical practice. In ASD, the availability of standardized measures of autistic symptoms has been crucial in informing both the clinic and research 33 . Nevertheless, these gold-standard measures still rely on somewhat coarse descriptions of symptoms. Individual symptoms of autism are assessed on a 3- or 4-point scale 19,20,29 , even though the phenotypic differences between behaviors receiving the same score can be very pronounced. The development and improvement of quantitative measures leading to a more fine-grained "digital phenotyping" 34 can be a tremendous asset for the early detection of signs of the disorder and the follow-up of its manifestation through development. Besides being more objective than human coding, such measures allow the processing of larger quantities of data, at a spatio-temporal resolution that is off limits to human coding. Moreover, these precise and continuous measures may uncover behavioral manifestations of the disorder that were previously not evidenced. They may also help define subtypes of the disorder, allowing more precise clinical action 35 . The finding that we were able to achieve robust classification accuracy based on a limited set of characteristics derived from social interaction videos is very promising. This approach would further benefit from the implementation of a spatio-temporal attentional mechanism 36 , revealing which specific elements in space and time the network uses to inform the diagnostic classification and improving our understanding of the manifestation of the disorder. Additionally, building on the evidence from the present study, our next goal is to perform fine-grained annotation of the behaviors along the non-verbal continuum that contributed most to the discrimination between the two groups.
Thus, precise annotation of the incidence of communicative gestures, shared affective states and atypical social overtures would be highly beneficial, providing insight into the content of the neural network's learning and a more sensitive measure of the disorder's manifestation.

Methods
Participants. The initial sample included sixty-eight children with autism (2.80 ± 0.92 years) and 68 typically developing children (2.55 ± 0.97 years), evenly distributed to compose the Training Set and Testing Set 1, matched for diagnosis, age, gender and ADOS module (see Table S1). To validate the robustness of our classification method, we included an additional testing sample comprising 101 videos from children with ASD (3.46 ± 1.16 years) that we denote Testing Set 2. All data used in this study were acquired as part of a larger study on early development in autism. Detailed information about cohort recruitment is given elsewhere [37][38][39] . The current study and protocols were approved by the Ethics Committee of the Faculty of Medicine of the Geneva University, Switzerland. The methods used in the present study were performed in accordance with the relevant guidelines and regulations of the Geneva University. For all children included in this study, informed written consent was obtained from a parent and/or legal guardian. Children with ASD were included based on a clinical diagnosis according to DSM-5 criteria 1 , and the diagnosis was further corroborated using gold-standard diagnostic assessments (see Clinical Measures subsection and Supplementary section).

Video data. To probe diagnosis classification using machine learning on social interaction videos, we used filmed ADOS assessments acquired in the context of our study. Practical reasons drove this choice, the ADOS being the most frequent video-based assessment in our study (systematically administered to all included children). Moreover, the ADOS provides a standardized and rich context to elicit and measure behaviors indicative of autism across broad developmental and age ranges 19 .
Its latest version (ADOS-2) encompasses five modules covering ages from 12 months to adulthood and language levels ranging from no expressive use of words to fluent complex language. To best fit younger participants' developmental needs, Modules Toddler, 1 and 2 are conducted while moving around the room using a variety of attractive toys, while Modules 3 and 4 take place mostly at a table and involve more discussion, with lesser use of objects. In this work, we focused uniquely on Modules Toddler, 1 and 2, as these require fewer language abilities and are more sensitive to the non-verbal aspects of social communication and interaction that we target using machine learning. Clinical findings on the prevalence of non-verbal symptoms in younger children 19,20,29 drove our choice to focus uniquely on non-verbal aspects of communication and interaction.

Pose estimation.
To focus purely on social interaction, and essentially its non-verbal aspects, we extracted skeletal information on the people present in the ADOS videos using a deep-learning-based multi-person 2D pose estimator, OpenPose 21 . OpenPose estimates keypoints of the persons detected in an image/video independently for each frame. It uses a simple 2D camera input, not requiring any external markers or sensors, thus allowing the retrospective analysis of videos. It is also robust to the variations in resolution and setting that images and videos may present. For the OpenPose output, we opted for the COCO model, providing 2D coordinates of 18 keypoints (13 body and 5 facial keypoints; see Fig. 1). The ordering of the keypoints is constant across persons and frames. Consistent with our focus on the non-verbal features of interaction during the semi-standardized ADOS assessment, we removed the background from all videos and preserved only the skeletal information for further analysis. To obtain feature vectors invariant to rigid-body and affine transformations, and to increase the generalizability of our approach, we based our calculation on the image output and not on raw keypoint coordinates (Fig. 1) 42 . Of note, informed written consent for publication of identifying information/images in an online open-access publication was obtained from the adults and from a parent and/or legal guardian of the child featured in the illustrations in Fig. 1 and Fig. S1.
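The de-identification step above can be sketched as follows: given per-frame 2D keypoint coordinates (as a pose estimator such as OpenPose's 18-keypoint COCO model produces), only the skeleton is rasterized onto a black frame. This is a minimal numpy-only illustration; the frame size, keypoint values and the convention that undetected keypoints carry negative coordinates are assumptions, and OpenPose itself (and the limb connections it draws) is not reproduced here.

```python
import numpy as np

def render_skeleton(keypoints: np.ndarray, height: int = 240, width: int = 320) -> np.ndarray:
    """Plot 2D keypoints as white pixels on a black background.

    keypoints: array of shape (n_points, 2) holding (x, y) pixel coordinates;
    points with negative coordinates (assumed to mean "not detected") are skipped.
    """
    frame = np.zeros((height, width), dtype=np.uint8)  # black background
    for x, y in keypoints:
        if x < 0 or y < 0:
            continue  # skip undetected keypoints
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < width and 0 <= yi < height:
            frame[yi, xi] = 255  # white skeletal point
    return frame

# Example: one person with 18 keypoints, two of them undetected.
rng = np.random.default_rng(0)
pts = rng.uniform(0, 200, size=(18, 2))
pts[3] = (-1, -1)
pts[7] = (-1, -1)
frame = render_skeleton(pts)
```

In a full pipeline, one frame like this would be rendered per video frame and per detected person, preserving the posture while discarding all identifying appearance.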
Building the neural network. The OpenPose-processed videos were down-sampled from 696 × 554 to 320 × 240 pixels and split into segments of 5 s (see Table 2). To estimate the training and validation loss, we used a categorical cross-entropy loss function with an RMSprop optimizer. We found that 5-s video segments were optimal for model training and resulted in lower validation loss compared to longer segments (10 s or 15 s) (see Fig. S3). We opted for a CNN-LSTM architecture for our model, as it has previously shown good performance in video-based action classification 43 . We used a VGG16 convolutional neural network 44 , pretrained on the ImageNet dataset 45 , to extract high-dimensional features from individual frames of the 5-s video clips. The VGG16 is a 16-layer convolutional neural network that takes a 224 × 224 pixel, 3-channel (RGB) input frame extracted from the video segment. The resolution then decreases along successive convolution and pooling layers: 64 @ 112 × 112, 128 @ 56 × 56, 256 @ 28 × 28, 512 @ 14 × 14 and 512 @ 7 × 7 after the last convolution/pooling stage, which has 512 feature maps. The extracted high-dimensional features are flattened and fed into an LSTM with 512 hidden units and 0.5 dropout, at a batch size of 128 46 , followed by fully connected dense layers with ReLU activation and 0.5 dropout, and a softmax classification layer giving a 2-dimensional output (corresponding to the two classes, ASD and TD). The input training data of 68 ADOS videos were split 80-20: the model used 80% of the data for training and 20% for validation. We then analyzed the model's training and validation loss to avoid overfitting and performed hyperparameter tuning. The training and validation loss over a varied number of epochs is shown in Figure S3. The model with the least validation loss was deployed to predict over the 5-s segments of the videos from the two testing sets (Testing Set 1 and Testing Set 2).
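The tensor shapes flowing through this architecture can be traced with mock stand-ins for the trained components. The sketch below is a shape-level illustration only: the number of frames sampled per 5-s segment (10) is an assumption, the "features" are placeholders rather than real VGG16 activations, and the LSTM head is replaced by a deterministic mock so only the dimensionality (frames → 25 088-dimensional features → 512-unit state → 2-class softmax) is faithful to the description above.

```python
import numpy as np

def mock_vgg16_features(frames: np.ndarray) -> np.ndarray:
    """Stand-in for VGG16: (T, 224, 224, 3) frames -> (T, 7*7*512) features."""
    return np.zeros((frames.shape[0], 7 * 7 * 512), dtype=np.float32)

def mock_lstm_then_softmax(features: np.ndarray, hidden: int = 512) -> np.ndarray:
    """Stand-in for the 512-unit LSTM + dense ReLU + softmax head."""
    h_last = np.tanh(features[-1, :hidden])       # mock final hidden state
    rng = np.random.default_rng(0)
    w_out = rng.standard_normal((hidden, 2))      # mock dense layer to 2 classes
    logits = np.maximum(h_last, 0) @ w_out        # ReLU then linear projection
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                        # softmax over (ASD, TD)

segment = np.zeros((10, 224, 224, 3), dtype=np.float32)  # one 5-s segment
features = mock_vgg16_features(segment)                   # (10, 25088)
probs = mock_lstm_then_softmax(features)                  # (2,), sums to 1
```

With all-zero mock features, the softmax output is uniform over the two classes; with real weights the same shapes carry the learned per-segment prediction.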
We averaged the predictions over all video segments to obtain a final prediction value denoted "ASD probability". We trained the neural network model over 5-s video segments with a batch size of 128, 100 epochs and an 80-20 training-validation split, and used the trained model to make predictions over Testing Set 1 and Testing Set 2 to check the accuracy of the prediction results across the different testing sets.
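The aggregation step just described can be sketched in a few lines: segment-wise softmax outputs are averaged over the whole video to yield the "ASD probability", and the video is labeled ASD when that mean exceeds 0.5. The per-segment probabilities below are illustrative stand-ins, not model outputs.

```python
# Hypothetical per-segment ASD probabilities for one video (one value per
# 5-s segment); the numbers are illustrative only.

def asd_probability(segment_probs: list[float]) -> float:
    """Mean ASD probability over all 5-s segments of one video."""
    return sum(segment_probs) / len(segment_probs)

def classify(segment_probs: list[float], threshold: float = 0.5) -> str:
    """Binary class attribution from the video-level mean probability."""
    return "ASD" if asd_probability(segment_probs) > threshold else "TD"

video_segments = [0.9, 0.7, 0.4, 0.8, 0.6]
label = classify(video_segments)  # mean 0.68 > 0.5, so "ASD"
```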
The model training and validation was performed at University of Geneva High Performance Computing cluster, Baobab (Nvidia RTX 2080Ti, Nvidia Titan X and Nvidia Quadro P1000 GPUs).
Statistical analyses. The measure of ASD probability derived from the neural network was further related to standardized behavioral phenotype measures in the children included in Testing Set 1. We calculated Spearman rank correlations with measures of severity of autistic symptoms (Total CSS, SA CSS and RRB CSS), adaptive functioning (VABS-II Total and scores across four subdomains) and cognitive functioning (BEIQ) (Supplementary section). Furthermore, to obtain more fine-grained insight into the relation between the ASD probability across the entire video and specific symptoms of autism, this measure was correlated with raw scores on a selected set of 27 items that repeat across the three modules we used (for more details, please refer to the Supplementary section). Results were considered significant at p < 0.05. The significance level was adjusted using Bonferroni correction for multiple comparisons: results concerning the two ADOS subdomains were considered significant at p < 0.025 and the four VABS-II subdomains at p < 0.013. For the analyses involving the 27 individual ADOS items (for a full list please refer to Table S2), results were considered significant at p < 0.002. Considering the collinearity between our ADOS measures, we employed Partial Least Squares Correlation (PLS-C) 23,24 to model their relation with the ASD probability. This method finds the optimal least-squares approximation to the cross-correlation matrix. A cross-correlation matrix R is computed between the NN-derived ASD probability (denoted as X, or Imaging variable) and the 27 individual ADOS symptoms (denoted as Y, or Design variables): R = Y^T X. R is then decomposed using the singular value decomposition, R = UΔV^T, yielding the singular values (the diagonal of Δ) and two sets of singular vectors, U and V (also denoted saliences).
The singular vectors U represent the symptom patterns that best characterize R, while the singular vectors V represent the dependent variable pattern that best characterizes R. To test the generalizability of the pattern, we performed permutation testing, generating 1000 permutation samples to yield a sampling distribution of the singular values under the null hypothesis. The stability of the obtained saliences U and V was assessed by creating 1000 bootstrap samples (sampling with replacement the observations in X and Y). To derive the latent variable (LV) pairs reflecting the maximal covariance between the ASD probability on one side and a pattern of individual ASD symptoms on the other, the singular vectors U and V were projected onto the original Behavior (Y) and Imaging (X) matrices, respectively. The patterns thus obtained are, by definition, independent of alternative LV ASD probability-symptom pairings.
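The PLS-C computation described above can be sketched on synthetic data: 68 subjects, 27 item scores (Y) and one ASD probability (X). The data below are random stand-ins, not the study's measurements, and the bootstrap stability step is omitted; only the cross-correlation, its SVD and the permutation test on the leading singular value are shown.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_items = 68, 27
X = rng.standard_normal((n, 1))        # stand-in for NN-derived ASD probability
Y = rng.standard_normal((n, n_items))  # stand-in for 27 individual ADOS items

# Column-standardize both blocks so R is a cross-correlation matrix.
Xz = (X - X.mean(0)) / X.std(0)
Yz = (Y - Y.mean(0)) / Y.std(0)
R = Yz.T @ Xz / n                      # (27, 1) cross-correlations, R = Y^T X

# SVD: R = U diag(s) V^T. U holds the symptom saliences, V the ASD-probability
# salience.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Permutation test: shuffle subjects in X to break the X-Y correspondence and
# build a null distribution for the leading singular value.
null_s = []
for _ in range(1000):
    Xp = Xz[rng.permutation(n)]
    null_s.append(np.linalg.svd(Yz.T @ Xp / n, full_matrices=False)[1][0])
p_value = (np.sum(np.array(null_s) >= s[0]) + 1) / (1000 + 1)
```

With real data, a small `p_value` would indicate that the leading ASD-probability-symptom covariance pattern generalizes beyond chance pairings.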
For comparisons of the Training and Testing samples with regard to clinical assessments (ADOS, VABS-II and BEIQ), we used Student t-tests, or Mann-Whitney tests when measures did not follow a normal distribution according to the D'Agostino & Pearson test (see Table S1). Statistical analyses were conducted using GraphPad Prism v.8 (https://www.graphpad.com/scientific-software/prism/) and SPSS v.25 (https://www.ibm.com/analytics/spss-statistics-software).
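The Bonferroni-adjusted thresholds used in the statistical analyses above follow directly from dividing the nominal alpha by the number of comparisons in each family (the paper reports the VABS-II threshold rounded to 0.013):

```python
# Bonferroni correction: adjusted per-test significance threshold.
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    return alpha / n_comparisons

ados_subdomains = bonferroni_threshold(0.05, 2)   # two ADOS subdomains -> 0.025
vabs_subdomains = bonferroni_threshold(0.05, 4)   # four VABS-II subdomains -> 0.0125
ados_items = bonferroni_threshold(0.05, 27)       # 27 ADOS items -> ~0.00185
```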
Relation of video length to prediction accuracy. Our final aim was to determine the length of video required for stable class attribution, thus probing the approach's scalability. To this end, we employed a sliding-window approach, starting with a window length of 20 s and increasing it stepwise by 20 s until the window length matched the video duration. In each window, ASD probability values were averaged over its constituent 5-s segments, for each of the 68 videos in Testing Set 1 (Fig. 3A). Using this method, we also tested the prediction consistency within single individuals' videos by identifying the sliding-window length required for stable class attribution (Fig. 3B).
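This sliding-window check can be sketched as follows: per-segment ASD probabilities are averaged inside windows of a given length (a 20-s window spans 4 of the 5-s segments), each window votes, and consistency is the fraction of windows agreeing with the whole-video classification. The per-segment values below are synthetic stand-ins for one video, and the exact window-scoring convention of the study is an assumption.

```python
import numpy as np

def windowed_predictions(segment_probs: np.ndarray, window_segments: int) -> np.ndarray:
    """Mean ASD probability in each sliding window of `window_segments` 5-s segments."""
    kernel = np.ones(window_segments) / window_segments
    return np.convolve(segment_probs, kernel, mode="valid")

def prediction_consistency(segment_probs: np.ndarray, window_segments: int) -> float:
    """Fraction of sliding windows agreeing with the whole-video classification."""
    whole_video_asd = segment_probs.mean() > 0.5
    window_asd = windowed_predictions(segment_probs, window_segments) > 0.5
    return float(np.mean(window_asd == whole_video_asd))

# Synthetic per-segment probabilities for ~20 min of video (240 x 5-s segments).
rng = np.random.default_rng(1)
probs = np.clip(rng.normal(0.7, 0.15, size=240), 0, 1)
consistency_20s = prediction_consistency(probs, window_segments=4)  # 20-s windows
```

Increasing `window_segments` toward the full video length reproduces the idea behind Fig. 3: longer windows yield votes that agree increasingly often with the whole-video attribution.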