Introduction

Visual observation and analysis of children’s natural behaviours are instrumental to the early detection of developmental disorders, including autism spectrum disorder (ASD). While gold standard observational tools are available, they have limitations that hinder the early screening of ASD in children. Interpretative coding of child observations, parent interviews and manual testing1 are costly and time-consuming2. In addition, the reliability and validity of the results obtained from a clinician’s observations can be subjective3, owing to differences in professional training, resources and cultural context. Furthermore, behavioural ratings typically do not capture data from children in their natural environments. These limitations, combined with rising incidence rates, call for the development of new methods of ASD diagnosis that do not compromise accuracy, in order to reduce waiting periods for access to care. This is critical because diagnosis and intervention within the first few years of life can provide long-term improvements for the child and can have a greater effect on outcomes4.

Early behavioural risk markers of ASD have been discovered with the help of retrospective analysis of home videos5,6,7. Research studies have documented ASD-related behavioural markers that emerge within the first months of life; these include diminished social engagement and joint attention8,9, atypical visual attention such as difficulty during the response-to-name protocol10, longer latencies to disengage from a stimulus when multiple stimuli are presented11, and non-smooth visual tracking12. Furthermore, children with ASD may exhibit atypical social behaviours such as decreased attention to social scenes, decreased frequency of gaze to faces13 and decreased expression of emotion. In addition, evidence suggests that differences in motor control are an early feature of ASD14,15,16,17.

Over the past decade, computer vision has been used in the field of automated medical diagnosis as it can provide unobtrusive, objective information on a patient’s condition. A recent study has shown that computer vision methods that automatically detect symptoms can support the pre-diagnosis of over 30 conditions18. For example, computer vision-based facial analysis can be used to monitor vascular pulse, assess pain, detect facial paralysis, diagnose psychiatric disorders and even distinguish ASD individuals from individuals with typical development (TD) through behaviour imaging19. The main rationale for using computer vision for clinical purposes is to remove potential bias, provide a more objective approach to analysis, increase trust in diagnosis and decrease errors related to human factors in the decision-making process. Furthermore, computer vision-based systems provide a low-cost and non-invasive approach, potentially reducing healthcare expenditure compared to medical examinations.

Computer vision techniques have been effectively exploited in recent years to automatically and consistently assess existing ASD biomarkers, as well as to discover new ones20. To further examine how computer vision has been useful in ASD research, a systematic review of published studies was conducted on computer vision techniques for ASD diagnosis, therapy and autism research in general. First, eligible papers are categorised based on the quantified behavioural/biological markers. In addition, different publicly available ASD datasets suitable for computer vision research are reviewed. Finally, promising research directions are outlined. To this end, this systematic review can serve as an effective summary resource that researchers can consult when developing computer vision-based assessment tools for automatically quantifying ASD-related markers.

Materials and methods

Eligibility criteria

All titles and abstracts were initially screened to include studies that met the following inclusion criteria: (1) the study focussed on autism in humans (i.e. animal studies were excluded); (2) the study mainly focussed on the use of computer vision techniques in autism diagnosis, autism therapy or autism research in general; (3) the study explained how behavioural/biological markers can be automatically quantified; and (4) the study included an experiment, a pilot study or a trial with at least one group of individuals with ASD. Finally, results in the form of a review, meta-analysis, keynote, narrative, editorial or magazine article were excluded.

Search process

An electronic database search of PubMed, IEEE Xplore and ACM Digital Library was conducted using simple terms and Medical Subject Headings terms for the keywords [‘autis*’ AND (‘computer vision’ OR ‘behavio* imaging’ OR ‘behavio* analysis’ OR ‘affective computing’)] in all fields (title, abstract, keywords, full text and bibliography) from January 1, 2009 to December 31, 2019. A snowballing approach was also used to identify additional papers. The selection of peer-reviewed articles followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement21. Duplicates were removed and the title and abstract of each article were scanned for relevance. The full text of potentially relevant studies was assessed for eligibility against the criteria detailed above. A PRISMA flow diagram was constructed and is shown in Appendix A.

Data items and analysis

Identical variables in eligible studies were extracted where possible into an Excel spreadsheet: (1) quantified behavioural/biological markers; (2) application focus; (3) child diagnosis and size of participant groups; (4) age range of the participants or age mean and standard deviation; (5) input data and devices used; (6) computer vision method applied in the study; and (7) dataset used in the study. The 94 eligible studies were categorised based on the behavioural/biological markers that were quantified.

Results

Overview of behavioural/biological markers used in eligible papers

The findings of this survey show that there is an increase in the number of significant contributions of computer vision methods to autism research. Over the last decade, computer vision has been used to capture and quantify different types of information, such as: (a) magnetic resonance imaging (MRI)/functional MRI (see Table 1); (b) facial expression/emotion (see Table 2); (c) eye gaze data (see Table 3); (d) motor control/movement pattern (see Table 4); (e) stereotyped behaviours (see Table 5); and (f) multimodal data (see Table 6). The variables discussed in ‘Data items and analysis’ were reported for each type of quantified information.

Table 1 Magnetic resonance imaging (MRI)/functional MRI (fMRI).
Table 2 Facial expression/emotion.
Table 3 Eye gaze data.
Table 4 Motor control/movement pattern.
Table 5 Stereotyped behaviours.
Table 6 Multimodal data.

This review presents consolidated evidence on the effectiveness of using computer vision techniques in (1) determining behavioural/biological markers for diagnosis and characterisation of ASD, (2) developing assistive technologies that aid in emotion recognition and expression for ASD individuals and (3) augmenting existing clinical protocols with vision-based systems for ASD therapy and automatic behaviour analysis. The following subsections discuss in detail how each quantified marker was utilised in autism research.

Magnetic resonance imaging (MRI)/functional MRI (fMRI)

The need for a more quantitative approach to ASD diagnosis has pushed research towards analysing brain imaging data, such as MRI and fMRI. Generally, MRI and fMRI techniques scan different parts of the brain to provide images, which are then used as input for further processing. These images have been used to identify potential biomarkers that show differences between ASD and TD subjects. For example, Samson et al.22 used fMRI scans to explore differences in complex non-social sound processing between ASD and TD subjects. With increasing temporal complexity, TD subjects showed greater activity in the anterolateral superior temporal gyrus, while ASD subjects showed greater effects in Heschl’s gyrus. Abdelrahman et al.23 used MRI scans to generate a 3D model of the brain and accurately calculate the volume of white matter in the segmented brain. Using the white matter volume as a discriminatory feature in a classification step with the k-nearest neighbour algorithm, their system reached an accuracy of 93%. Durrleman et al.24 examined MRI scans to find differences in the growth of the hippocampus between children with ASD and control subjects. Their findings suggest that group differences may be better identified by maturation speed rather than by shape differences at a given age. Ahmadi et al.25 used independent component analysis to show that within-network connections in fMRI images of ASD subjects are lower when compared to TD subjects.
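As a minimal illustration of this style of analysis (not the published pipeline of Abdelrahman et al.23), the sketch below classifies ASD versus TD from a single volumetric feature with a k-nearest neighbour classifier; the white matter volumes are synthetic placeholders.

```python
# Illustrative sketch: k-nearest-neighbour classification of ASD vs. TD from a
# single volumetric feature (e.g. segmented white matter volume).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical white matter volumes (cm^3); real values would come from MRI segmentation.
wm_volume_asd = rng.normal(420, 25, size=40)
wm_volume_td = rng.normal(400, 25, size=40)

X = np.concatenate([wm_volume_asd, wm_volume_td]).reshape(-1, 1)
y = np.concatenate([np.ones(40), np.zeros(40)])  # 1 = ASD, 0 = TD

clf = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f}")
```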

The remaining eligible studies developed new techniques for diagnosing ASD using MRI26,27 and fMRI28,29,30 data from the ABIDE repository. Based on their recent findings, Chaddad et al.26,27 demonstrated the potential of hippocampal texture features extracted from MRI scans as biomarkers for the diagnosis and characterisation of ASD. They applied a Laplacian-of-Gaussian filter31 across a range of resolution scales and performed statistical analysis to identify regions exhibiting significant textural differences between ASD and TD subjects. They identified asymmetrical differences in the right hippocampus, left choroid plexus and corpus callosum, and symmetrical differences in the cerebellar white matter.
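A minimal sketch of multi-scale Laplacian-of-Gaussian (LoG) texture features computed over a segmented region of interest is given below; the scales and summary statistics are illustrative assumptions rather than the parameters used by Chaddad et al.26,27.

```python
# Multi-scale LoG texture features from a region of interest (illustrative only).
import numpy as np
from scipy.ndimage import gaussian_laplace

def log_texture_features(roi_volume, sigmas=(0.5, 1.0, 2.0, 4.0)):
    """Return simple first-order statistics of LoG responses at several scales."""
    features = []
    for sigma in sigmas:
        response = gaussian_laplace(roi_volume.astype(float), sigma=sigma)
        features.extend([response.mean(), response.std(),
                         np.percentile(response, 10), np.percentile(response, 90)])
    return np.array(features)

# Example with a synthetic 3D patch; a real pipeline would pass the voxels
# inside a hippocampal segmentation mask.
roi = np.random.default_rng(1).normal(size=(32, 32, 32))
print(log_texture_features(roi).shape)  # (16,) = 4 scales x 4 statistics
```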

Some of the techniques are based on conventional machine learning methods, such as Support Vector Machines (SVMs). For example, Chanel et al.32 used a multivariate pattern analysis approach in two different fMRI experiments with social stimuli. Their method, based on a modified version of the SVM Recursive Feature Elimination algorithm33, is trained independently on each experiment and the outputs are then combined to obtain a final classification (ASD or TD). Their results revealed classification accuracies between 69% and 92.3%. Crimi et al.30 used a constrained autoregressive model followed by an SVM to differentiate individuals with ASD from TD individuals. Zheng et al.34 constructed multi-feature-based networks (MFN) and used an SVM to classify individuals from the two groups. Their results showed that using MFN significantly improved the classification accuracy, by almost 14%, compared to using morphological features. Their findings also demonstrated that variations in cortico-cortical similarities can be used as biomarkers in the diagnostic process.
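The general SVM-based recursive feature elimination recipe can be sketched as follows on synthetic, connectivity-style features; the feature dimensions, fold count and number of retained features are assumptions for illustration only.

```python
# SVM-RFE sketch: select discriminative features with a linear SVM, then classify.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))      # 60 subjects x 200 candidate features (placeholder)
y = rng.integers(0, 2, size=60)     # 1 = ASD, 0 = TD (placeholder labels)

model = make_pipeline(
    RFE(estimator=SVC(kernel="linear"), n_features_to_select=20, step=10),
    SVC(kernel="linear"),
)
print(cross_val_score(model, X, y, cv=5).mean())
```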

Deep learning techniques have also been proposed for automating ASD diagnosis by extracting discriminative features from fMRI data and feeding them to a classifier28. To increase the number of training samples and avoid overfitting, Eslami and Saeed28 used the Synthetic Minority Over-sampling Technique (SMOTE)35. They also investigated the effectiveness of the extracted features using an SVM classifier. Their model achieved more than 70% classification accuracy on four fMRI datasets, with a highest accuracy of 80%. Attaining similar performance, Li et al.29 adopted a deep transfer learning neural network model for ASD diagnosis. Compared to traditional models, their approach led to improved performance in terms of accuracy, sensitivity, specificity and area under the receiver operating characteristic curve.
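A hedged sketch of this class-balancing strategy is shown below, oversampling the minority class with SMOTE before training an SVM; the feature matrix stands in for features extracted from fMRI and is not data from the cited work.

```python
# Balance a small, imbalanced feature set with SMOTE, then train an SVM.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))       # 100 subjects x 64 placeholder features
y = np.array([1] * 30 + [0] * 70)    # imbalanced: 30 ASD, 70 TD

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # oversample minority class

clf = SVC(kernel="rbf").fit(X_res, y_res)
print(accuracy_score(y_te, clf.predict(X_te)))
```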

Facial expression/emotion

Emotion classification focusses on the development of algorithms that produce an emotion label (e.g. happy or sad) from a face in a photo or a video frame. Recent advances in the field of computer vision have contributed to the development of various emotion classifiers that can potentially play a significant role in mobile screening and therapy for ASD children. However, most classifiers are biased towards neurotypical adults and can fail to generalise to children with ASD. To address this, Kalantarian et al.36,37 presented a framework for semi-automatic labelled-frame extraction to crowdsource emotion data from children. The labels consist of six emotions: disgust, neutral, surprise, scared, angry and happy. To improve the generalisation of expression recognition models to children with ASD, Han et al.38 presented a transfer learning approach based on a sparse coding algorithm. Their results showed that their method can more accurately identify the emotional expression of children with ASD. Tang et al.39 proposed a convolutional neural network (CNN)-based method for smile detection of infants during mother–infant interaction. Their results showed that their approach can achieve a mean accuracy of 87.16% and an F1-score of 62.54%.
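For orientation, a minimal CNN for frame-level expression classification (e.g. smile versus no smile) is sketched below; the architecture, input size and class count are illustrative assumptions, not the networks used in the cited studies.

```python
# Tiny CNN for binary expression classification on face crops (illustrative only).
import torch
import torch.nn as nn

class SmileNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):  # x: (batch, 3, 64, 64) face crops
        return self.classifier(self.features(x).flatten(1))

logits = SmileNet()(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```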

Several papers have focussed on using computer vision to develop assistive technologies for ASD children40,41,42,43. For example, researchers40,42,43 developed and evaluated a wearable assistive technology to help ASD children with emotion recognition. Vahabzadeh et al.44 provided initial evidence for the potential of wearable assistive technologies to reduce hyperactivity, inattention and impulsivity in school-aged children, adolescents and young adults with ASD. Leo et al.45 and Pan et al.46 developed automatic emotion recognition systems for robot–child interaction in ASD treatment. Their results suggest that computer vision could help to improve the efficiency of behaviour analysis during interactions with robots.

Most research focusses on the qualitative recognition of facial expressions, as the computational analysis of facial expression production is still an emerging research topic. There have been a few attempts to automatically quantify the facial expression production of ASD children47,48,49,50,51,52. For example, Leo et al.47 proposed a framework to computationally analyse how ASD and TD children produce facial expressions. Guha et al.52 investigated differences in the overall and local facial dynamics of TD and ASD children. Their observations showed that there is reduced complexity in the dynamic facial behaviour of ASD children, arising primarily from the eye region. Computer vision has also been used to predict engagement and learning performance. For example, Ahmed and Goodwin53 analysed facial expressions from video recordings obtained while children interacted with a computer-assisted instruction programme. Their results showed that emotional and behavioural engagement can be quantified automatically using computer vision analysis.

Harrold et al.54,55 developed a mobile application that allows children to learn emotions with instant feedback on performance through computer vision. White et al.56 presented results showing that children with ASD found their system to be acceptable and enjoyable. Similarly, Garcia-Garcia et al.57 presented a system that incorporates emotion recognition and tangible user interfaces to teach children with ASD to identify and express emotions. Jain et al.58 proposed an interactive game that can be used for autism therapy. The system tracks facial features to recognise the participant’s facial expressions and to animate an avatar, and attempts to teach children how to recognise and express emotions through facial expressions.

Deep learning approaches have also been applied to recognise developmental disorders from facial images. For example, Li et al.59 introduced an end-to-end CNN-based system for ASD classification using facial attributes. Their results show that different facial attributes are statistically significant and improve classification performance by about 7%. A deep convolutional neural network (DCNN) for feature extraction, followed by an SVM for classification, was trained by Shukla et al.60 to detect whether a person in an image has ASD, cerebral palsy, Down syndrome, foetal alcohol spectrum disorder, progeria or another intellectual disability. Their results indicate that their model has an accuracy of 98.80% and performs better than the average human in distinguishing between these disorders.
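The ‘deep features followed by an SVM’ recipe can be sketched as below; the backbone choice, input size and random data are assumptions for illustration and do not reproduce the model of Shukla et al.60.

```python
# A convolutional backbone produces an embedding per face image; an SVM is
# trained on those embeddings (illustrative sketch).
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import SVC

backbone = models.resnet18()          # in practice a pretrained network would be used
backbone.fc = torch.nn.Identity()     # drop the classification head to expose features
backbone.eval()

images = torch.randn(8, 3, 224, 224)  # placeholder face crops
with torch.no_grad():
    feats = backbone(images).numpy()  # (8, 512) deep features

labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # placeholder diagnostic labels
clf = SVC(kernel="linear").fit(feats, labels)
print(clf.predict(feats[:2]))
```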

Eye gaze data

Analysing the attention and psychological factors encoded in the eye movements of individuals could help in ASD diagnosis. Computer vision has been used to automatically analyse children’s gaze and distinguish ASD-related characteristics present in a video61. Research has shown that there is a significant difference in gaze patterns between children with ASD and TD children. Eye-tracking technology provides automatic assessment of gaze behaviour in different contexts. For example, Balestra et al.62 showed that it can be used to study language impairments, text comprehension and production deficits. In addition, it can be used to identify fixations and saccades63, recognise affective states64 and even reveal early biomarkers associated with ASD65,66. Furthermore, eye tracking can be used to detect saliency differences between ASD and TD children. Researchers67,68,69,70 showed that there is a difference in preference for both social and non-social images. This finding is consistent with a similar published study by Syeda et al.71, which examined face scanning patterns in a controlled experiment. By extracting and analysing gaze data, the study revealed that children with autism spend less time looking at core features of faces (e.g. eyes, nose and mouth). Chrysouli et al.72 proposed a deep learning-based technique to recognise the affective state (e.g. engaged, bored or frustrated) of an individual (ASD or TD) from a video sequence.
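One common preprocessing step referred to above, segmenting raw gaze samples into fixations and saccades, can be implemented with a simple velocity threshold (I-VT); the sampling rate and threshold below are assumed values, not parameters from the cited studies.

```python
# Velocity-threshold (I-VT) labelling of gaze samples as fixation or saccade.
import numpy as np

def ivt_segments(x, y, hz=60.0, velocity_threshold=30.0):
    """Label each gaze sample as fixation (True) or saccade (False).

    x, y are gaze positions in degrees of visual angle; velocity_threshold in deg/s.
    """
    vx = np.gradient(x) * hz
    vy = np.gradient(y) * hz
    speed = np.hypot(vx, vy)
    return speed < velocity_threshold

# Synthetic example: a steady fixation followed by a rapid shift to a new location.
t = np.arange(120)
x = np.where(t < 60, 0.0, 10.0) + np.random.default_rng(0).normal(0, 0.1, 120)
y = np.zeros(120)
is_fixation = ivt_segments(x, y)
print(f"fixation samples: {is_fixation.sum()} / {len(is_fixation)}")
```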

Building upon the knowledge of previous research, several studies have concentrated on using the visual attention preferences of children with ASD for diagnosis. For example, Liu et al.73,74 proposed a machine learning-based system to capture discriminative eye movement patterns related to ASD. They also presented a comprehensive set of effective feature extraction methods, prediction frameworks and corresponding scoring frameworks. Vu et al.75 examined the impact of visual stimuli and exposure time on the quantitative accuracy of ASD diagnosis. They showed that using a ‘social scene’ stimulus with a 5-s exposure time gave the best performance, at 98.24%. Also using visual attention preference, Jiang and Zhao76 leveraged recent advances in deep learning for superior performance in ASD diagnosis. In particular, they used a DCNN and SVM to achieve an accuracy of 92%. Higuchi et al.77 developed a novel system that provides visualisation of automatic gaze estimation and allows experts to perform further analysis.

Most of these studies were conducted in highly controlled environments in which the subjects were asked to view a screen for a short period of time. Recently, Chong et al.78 presented a novel deep learning architecture for eye contact detection in natural social interactions. In their study, eye contact detection was performed during adult–child sessions in which the adult wears a point-of-view camera. Their results showed significant improvement over existing methods, with a reported precision and recall of 76% and 80%, respectively. Toshniwal et al.79 proposed an assistive technology that tracks attention using a mobile camera and uses haptic feedback to recapture attention. Their evaluation study with users with various intellectual disabilities showed that it can provide better learning with less intervention.

Motor control/movement pattern

The use of computer vision has also shown potential for more precise, objective and quantitative assessment of early motor control variations. For example, Dawson et al.80 used computer vision analysis to examine differences in midline head postural control, as reflected in the rate of spontaneous head movements, between toddlers with and without ASD. Their study followed a response-to-name protocol in which a series of social and non-social stimuli (in the form of a movie) were shown on a smart tablet while the child sat on a caregiver’s lap. During the protocol, the examiner, standing behind the child, calls the child’s name and the child’s reaction is recorded using the smart tablet. Afterwards, a fully automated computer vision algorithm detects and tracks 49 facial landmarks and estimates head pose angles. Their study revealed that toddlers with ASD exhibited a significantly higher rate of head movement compared to their typically developing counterparts. Using the same approach, Martin et al.81 examined the head movement dynamics of a cohort of children. They found that there is an evident difference in lateral (yaw and roll) head movement between children with ASD and TD children.
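As a hedged sketch of how such a head-movement measure could be derived from per-frame head pose angles, the snippet below computes a mean angular speed; the frame rate and the displacement-based definition of ‘rate’ are assumptions rather than the exact metric used in the cited studies.

```python
# Mean angular speed of the head from per-frame yaw/pitch/roll estimates.
import numpy as np

def head_movement_rate(yaw, pitch, roll, fps=30.0):
    """Mean angular speed (degrees/second) of the head across a recording."""
    angles = np.stack([yaw, pitch, roll], axis=1)              # (n_frames, 3)
    frame_displacement = np.linalg.norm(np.diff(angles, axis=0), axis=1)
    return frame_displacement.mean() * fps

rng = np.random.default_rng(0)
n = 300                                                        # ~10 s of video at 30 fps
yaw, pitch, roll = (np.cumsum(rng.normal(0, 0.2, n)) for _ in range(3))
print(f"{head_movement_rate(yaw, pitch, roll):.1f} deg/s")
```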

Deep learning has also been employed to develop novel screening tools that analyse gestures captured in video sequences. For example, Zunino et al.82 used a CNN to extract features, followed by a long short-term memory (LSTM) model with an attention mechanism. They demonstrated that it is possible to determine whether a video sequence contains a grasping action performed by an ASD or a TD subject. In another study, Vyas et al.83 estimated children’s pose over time by retraining a state-of-the-art pose estimator (2D Mask R-CNN) and trained a CNN to categorise whether a given video clip contains typical (normal) or atypical (ASD) behaviour. Their approach, with an accuracy of 72%, outperformed conventional video classification approaches.
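A minimal CNN + LSTM sketch for clip-level classification in this spirit is given below: per-frame CNN features are summarised over time by an LSTM with simple attention pooling. All architecture details (layer sizes, input resolution, clip length) are illustrative assumptions.

```python
# Clip-level classifier: per-frame CNN features -> LSTM -> attention pooling -> label.
import torch
import torch.nn as nn

class ClipClassifier(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, feat_dim, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                          # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).flatten(1).view(b, t, -1)
        states, _ = self.lstm(feats)                   # (B, T, hidden)
        weights = torch.softmax(self.attn(states), dim=1)
        pooled = (weights * states).sum(dim=1)         # attention-weighted summary
        return self.head(pooled)

print(ClipClassifier()(torch.randn(2, 16, 3, 64, 64)).shape)  # torch.Size([2, 2])
```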

Computer vision has also been used to develop motion-based touchless games for ASD therapy. For example, Piana et al.84 conducted an evaluation study of a system designed to help ASD children recognise and express emotions by means of their full-body movement, captured by RGB-D sensors. Their results showed an increase in task (recognition) accuracy from the beginning to the end of the training sessions. Bartoli et al.85 showed the effectiveness of using embodied touchless interaction to promote attention skills during therapy sessions. Similarly, Ringland et al.86 developed SensoryPaint, a system that allows whole-body interactions, and showed that it is a promising therapeutic tool. Magrini et al.87 developed an interactive vision-based system that reacts to movements of the human body to produce sounds. Their system has been evaluated by a team of clinical psychologists and the parents of young patients.

Computer vision has also been used to develop robot-mediated assistive technologies for ASD therapy. Dickstein-Fischer and Fischer88 developed a robot, named PABI (Penguin for Autism Behavioural Interventions), with augmented vision to interact meaningfully with an autistic child during therapy. Similarly, Bekele et al.89 developed a robot with augmented vision to automatically adapt itself in an individualised manner and to administer joint attention prompts. Their study suggests that robotic systems with augmented vision may be capable of enhancing skills related to attention coordination. This confirms an earlier study by Dimitrova et al.90 in which adaptive robots showed potential for educating children in various complex cognitive and social skills, eventually producing a substantial developmental impact.

Stereotyped behaviours

In the context of autism research, atypical behaviours are assessed during screening using different clinical tools and protocols. For example, the Autism Observation Scale for Infants (AOSI) consists of a set of protocols designed to assess specific behaviours91. During the last decade, research has been growing towards behavioural imaging to create new capabilities for the quantitative understanding of behavioural signs, such as those outlined in the AOSI. For example, Hashemi et al.92,93 examined the potential benefits that computer vision can provide for measuring and identifying ASD behavioural signs based on two components of the AOSI. In particular, they developed a computer vision tool to assess: (1) disengagement of attention, the ability of children to disengage their attention from one of two competing visual stimuli; and (2) visual tracking, the ability to visually follow a moving object laterally across the midline. Similarly, computer vision analysis has been explored to automatically detect and analyse atypical attention behaviours in toddlers in a response-to-name protocol. A proof-of-concept system that used marker-less head tracking was presented by Bidwell et al.94, and scalable applications were developed by Hashemi et al.95, Campbell et al.96 and Hashemi et al.97. The latter systems run on a mobile application designed to elicit ASD-related behaviours (e.g. social referencing, smiling while watching a movie and pointing) and use computer vision analysis to automatically code behaviours related to early risk markers of ASD. When compared to a human analyst, computer vision analysis was found to be as reliable in predicting child response latency. Using the response-to-name protocol, Wang et al.98 proposed a non-contact vision system that achieved an average classification score of 92.7% for assisting in the screening of ASD. The results of these studies show that computer vision tools can capture critical behavioural observations and potentially augment clinical behavioural observations when using the AOSI.

Bovery et al.99 also used a mobile application and movie stimuli to measure the attention of toddlers. They used computer vision algorithms to detect head and iris positions and determine the direction of attention. Their results showed that toddlers with ASD paid less attention to the movie, showed less attention to the social as compared to the non-social visual stimuli and often directed their attention to one side of the screen.
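To make one of the quantities above concrete, the sketch below estimates the response latency in a response-to-name protocol as the time between the name call and the first frame in which the head yaw turns past a threshold; the threshold, frame rate and synthetic yaw trace are assumptions for illustration.

```python
# Response latency from a per-frame head yaw trace (illustrative definition).
import numpy as np

def response_latency(yaw_deg, name_call_frame, fps=30.0, turn_threshold=25.0):
    """Seconds from the name call until the head turns past the threshold, or None."""
    post = np.abs(yaw_deg[name_call_frame:])
    turned = np.flatnonzero(post > turn_threshold)
    return turned[0] / fps if turned.size else None

# Synthetic yaw trace: the child looks at the screen, then turns ~1 s after the call.
yaw = np.concatenate([np.zeros(90), np.linspace(0, 60, 60), np.full(60, 60.0)])
print(response_latency(yaw, name_call_frame=60))  # ~1.8 (seconds)
```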

Behaviours other than those outlined by the AOSI have also been quantified using computer vision. For example, self-stimulatory behaviours refer to stereotyped, repetitive movements of body parts. Also known as ‘stimming’ behaviours, they are often manifested when a person with autism engages in actions such as rocking, pacing or hand flapping. Researchers100,101,102 have introduced a dataset of stimming behaviours and used computer vision to detect whether these behaviours are present in a video stream. Another quantified behaviour is social interaction and communication among individuals with ASD. Winoto et al.103 developed an unobtrusive sensing system to observe, sense and annotate behavioural cues, which can be reviewed by specialists and parents for better-tailored assessment and interventions. Similarly, children’s responses when interacting with robots have been quantified using computer vision techniques. Feil-Seifer and Matarić104 showed that computer vision can be used to study the behaviours of ASD children towards robots in free-play settings. Moghadas and Moradi105 proposed a computer vision approach to analyse human–robot interaction sessions and extract features that can be used for ASD diagnosis.

Multimodal data

Over the last decade, there has been increasing interest in incorporating multiple behavioural modalities to achieve superior performance and even outperform previous state-of-the-art methods that utilise only a single modality for ASD screening. For example, Chen and Zhao106 proposed a privileged modality framework that integrates information from two different tasks: (1) a photo-taking task, in which subjects freely move around the environment and take photos, and (2) an image-viewing task, in which their eye movements are recorded by an eye-tracking device. They used a CNN and an LSTM to integrate features extracted from these two tasks for more accurate ASD screening. Their results showed that the proposed models can achieve new state-of-the-art results. They also demonstrated that utilising knowledge across the two modalities dramatically improved performance, by more than 30%.
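The general idea of combining two behavioural modalities can be illustrated with a simple two-branch, late-fusion classifier, as sketched below; the feature dimensions and fusion-by-concatenation design are assumptions and do not reproduce the privileged modality framework of Chen and Zhao106.

```python
# Two-branch late fusion: embed each modality, concatenate, then classify.
import torch
import torch.nn as nn

class LateFusionScreen(nn.Module):
    def __init__(self, dim_a=128, dim_b=64, hidden=64, num_classes=2):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, feats_a, feats_b):
        fused = torch.cat([self.branch_a(feats_a), self.branch_b(feats_b)], dim=1)
        return self.head(fused)

model = LateFusionScreen()
print(model(torch.randn(4, 128), torch.randn(4, 64)).shape)  # torch.Size([4, 2])
```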

Wang et al.107 presented a standardised screening protocol, namely Expressing Needs with Index Finger Pointing (ENIFP), to assist in ASD diagnosis. The protocol is administered via a novel non-invasive system trained using deep learning to automatically capture the eye gaze and gestures of the participant. Their results showed that the system can record the child’s performance and reliably check mutual attention and gestures during the ENIFP protocol. Computer vision techniques have also been used during the robotic social therapy sessions proposed by Mazzei et al.108.

Computer vision systems that incorporate multimodal information have also been used to detect behavioural features during interaction with a humanoid robot. For example, Coco et al.109 proposed a technological framework to automatically build a quantitative report that could help therapists to better perform either ASD diagnosis or assessment tasks. Furthermore, computer vision has been used to support autism therapy through social robots that automatically adapt their behaviours. For example, researchers110,111,112,113 have presented systems that simultaneously include eye contact, joint attention, imitation and emotion recognition in an intervention protocol for ASD children. Egger et al.114 presented the first study showing the feasibility of computer vision techniques for automatically coding behaviours in natural environments. Another assistive technology, introduced by Peters et al.115, assists people with cognitive disabilities in brushing their teeth. It uses behaviour recognition and a machine learning network to provide automatic assistance in task execution.

Rehg et al.116 proposed a new action recognition dataset for the analysis of children’s social and communicative behaviours based on video and audio data. Their preliminary experimental results demonstrated the potential of this dataset to drive multimodal activity recognition. Similarly, Liu et al.117 proposed a ‘Response-to-Name’ dataset and a multimodal ASD auxiliary screening system based on machine learning. Marinoiu et al.118 introduced one of the largest existing multimodal datasets of its kind (i.e. autistic interaction rather than genetic or medical data). They also proposed fine-grained action classification and emotion prediction tasks based on recordings of robot-assisted therapy sessions of children with ASD. Their results showed that machine-predicted scores align closely with professional human diagnosis.

Computer vision has also been applied to multimodal data, such as fMRI and eye gaze information, to test differences in the response selectivity of the human visual cortex between individuals with ASD and TD individuals. Schwarzkopf et al.119 showed that sharper spatial selectivity in the visual cortex is not a characteristic of individuals with ASD.

Datasets used in eligible papers

Dataset requirements typically depend on the target behavioural/biological marker and the computer vision methods to be employed. In this section, the publicly available datasets used by eligible papers are reviewed, with a focus on those containing autistic samples.

Magnetic resonance imaging datasets

The Autism Brain Imaging Data Exchange (ABIDE) initiative has aggregated functional and structural brain imaging data collected from different laboratories to accelerate understanding of the neural basis of autism. ABIDE I represents the first ABIDE initiative120. This effort yielded a total of 1112 records (sets of magnetic resonance imaging (MRI) and functional MRI), including 539 from individuals with ASD and 573 from TD individuals. ABIDE II was established to further promote discovery of the brain connectome in ASD121. It consists of 1114 records from 521 individuals with ASD and 593 TD individuals. Hazlett et al.122 conducted an MRI study with 51 children with ASD and 25 control children (including both developmentally delayed and TD children) between 18 and 35 months of age.
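As a convenience for researchers, ABIDE I can also be accessed programmatically; the sketch below assumes the fetch_abide_pcp helper provided by the nilearn library and the standard DX_GROUP phenotypic column (1 = ASD, 2 = TD), and downloads only a small subset for illustration.

```python
# Fetch a few preprocessed ABIDE I records and inspect the diagnostic labels
# (assumes nilearn is installed; column names follow the ABIDE phenotypic file).
from nilearn.datasets import fetch_abide_pcp

abide = fetch_abide_pcp(n_subjects=5)   # small subset for illustration
pheno = abide.phenotypic
print(len(pheno), "records downloaded")
print(pheno["DX_GROUP"][:5])            # diagnostic group per subject (1 = ASD, 2 = TD)
```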

Autism spectrum disorder detection dataset

This dataset consists of a set of video clips of reach-to-grasp actions performed by children with ASD and TD82. In the protocol, children were asked to grasp a bottle and perform different subsequent actions (e.g. placing, pouring, passing to pour, and passing to place). A total of 20 children with ASD and 20 TD children participated in the study.

DE-ENIGMA dataset

The DE-ENIGMA dataset is a free, large-scale, publicly available multi-modal (audio, video and depth) database of autistic children’s interactions that is suitable for behavioural research123. A total of 128 children on the autism spectrum participated in the study. During the experiment, children within each age group were randomly assigned to either a robot-led or a researcher/therapist-led teaching intervention, which was implemented across multiple short sessions. The dataset includes ~13 TB of multi-modal data, representing 152 h of interaction. Furthermore, the data of 50 children have been annotated by experts for emotional valence, arousal, audio features and body gestures. The annotated data are in effect ready for future autism-focussed machine learning research.

Multimodal behaviour dataset

The Multimodal Dyadic Behaviour (MMDB) dataset is a unique collection of 160 multimodal (video, audio and physiological) recordings and annotations of the social and communicative behaviours of 121 children aged 15–30 months, gathered in a protocol known as the Rapid-ABC sessions116. This play protocol is an interactive assessment (3–5 min) consisting of five semi-structured play interactions in which the examiner elicits social attention, interaction and non-verbal communication from the child.

Saliency4ASD dataset

The Saliency4ASD Grand Challenge aims to align the visual attention modelling community around the application of ASD diagnosis and to provide an open dataset of eye movements recorded from children with ASD and TD children. The database consists of 300 images of various animals, buildings, natural scenes and combinations of different visual stimuli124. Each image has corresponding eye-tracking data collected from 28 participants.

Self-stimulatory behaviour dataset

Due to the lack of a database containing self-stimulatory behaviours, Rajagopalan et al.101 searched for and collected videos on public domain websites and video portals (e.g. YouTube). They classified the videos into three categories: arm flapping, head banging and spinning. Compared to other datasets, their dataset is recorded in natural settings. The dataset contains 75 videos with an equal number of videos for each category.

Other datasets

Until recently, autism datasets have been relatively small compared to datasets in other domains in which machine learning has seen tremendous application. As a result, earlier published research has resorted to using subsets of videos of neurotypical individuals from human action recognition datasets [UCF101125, Weizmann126], facial expression datasets [Cohn-Kanade(CK)127, CK+128, FERET129, Hollywood2130, HELEN131, CelebA132, AffectNet133, EmotioNet134, BU-3D Facial Expression135] and gesture recognition datasets [Oxford Hand Dataset136, Egohands137] to help train systems that analyse autistic behaviours.

Limitations

This review has some limitations: one is linked to the number of papers included and the other to the quality of the papers included. Although an effort was made to make the review as inclusive as possible by following the PRISMA checklist, some studies may not have been included because of the chosen keywords and time period. Nevertheless, to the best of the authors’ knowledge, this is the first systematic review of the current state of computer vision approaches in autism research.

As this is a relatively new field of research, few of the published papers report longitudinal studies and many included small cohorts of participants; thus, the quality of the results may change as more clinical trials are conducted. Nonetheless, this systematic review suggests that these advances in computer vision are applicable to the ASD domain and can stimulate further research into using computer vision techniques to augment existing clinical methods. However, these approaches require further evaluation before they can be applied in clinical settings.

Discussion

In this work, a systematic review of the use of computer vision techniques in autism research has been provided. Although there has been considerable research in this area, factors such as the reliance on controlled experiments in clinical settings mean that quantifying human behaviours from image or video streams in real-world scenarios remains challenging. Publicly available datasets relevant to behaviour analysis have also been reviewed, in order to rapidly familiarise researchers with datasets applicable to their field and to accelerate both new behavioural and technological work on autism. The primary conclusions of this study on computer vision approaches in autism research are provided below:

1. Different behavioural/biological markers have already been quantified, to some extent, using computer vision analysis, with performance comparable to a human analyst.

2. For feature extraction and classification tasks, deep learning-based approaches have shown superior performance compared to traditional computer vision approaches.

3. The growing number of large-scale publicly available datasets provides the scale of data needed to further machine learning and deep learning developments.

4. Multimodal methods attain superior performance by combining knowledge across different modalities.

Given the current state of the art, it is evident that computer vision analysis is useful for the quantification of behavioural/biological markers and can lead to non-invasive, objective and automatic tools for autism research. It can also be used to provide effective interventions using robots with augmented vision during therapy. In addition, it can be used to develop technologies that assist individuals with ASD in certain tasks, such as emotion recognition.

To date, most published studies have used computer vision in a clinical setting. However, in complex scenes outside of clinical protocols, there are many issues with feature learning in single or even multimodal data. In addition, it is challenging to compare the performance of the eligible studies due to the lack of benchmark datasets that researchers have ‘agreed’ upon for use with deep learning138. Until recently, there were no large-scale datasets that researchers could use to compare their results. Given the current state of research, researchers in this area should address the following problems:

1. Multimodal approaches based on multimodal fusion methods. In current research, most studies have focussed on RGB data from image or video streams. However, an increasing number of studies has shown that superior performance can be achieved through a combination of multimodal information.

2. Researchers should agree on a benchmark dataset and evaluate their models on it for a more reliable comparison of performance. The datasets reviewed in this paper serve as a starting point for researchers to use in computer vision research. Experts can borrow knowledge gained from existing state-of-the-art human activity recognition models trained on neurotypical individuals, apply them to these datasets, and build models that can generalise to individuals with ASD.

3. Computer vision approaches that address fully unconstrained scenarios. Most published studies require participants to be in clinical settings, which typically do not capture data from children in their natural environments.

4. Longitudinal studies or collections of large cohorts of individuals with ASD and TD individuals should be conducted to evaluate the performance of future computer vision systems. This requires careful and systematic empirical validation to ensure accuracy, reliability, interpretability and true clinical utility. It would also help determine whether these systems can generalise across different participant groups (e.g. multiple ages, cultural differences) and demonstrate fairness and unbiasedness.

5. It is also important to gain a deeper understanding of the human factors, user experience and ethical considerations surrounding the application of vision-based systems. This would help develop usable and useful systems and determine whether these systems can truly augment existing behavioural observations in clinical settings.