Identifying activity level related movement features of children with ASD based on ADOS videos

Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects about 2% of children. Due to the shortage of clinicians, there is an urgent demand for a convenient and effective tool based on regular videos to assess its symptoms. Computer-aided technologies have become widely used in clinical diagnosis, simplifying the diagnostic process while saving time and standardizing the procedure. In this study, we proposed a computer vision-based motion trajectory detection approach assisted with machine learning techniques, providing an objective and effective way to extract participants’ movement features (MFs) and to identify and evaluate children’s activity levels in correspondence with clinicians’ professional ratings. The designed technique includes two key parts: (1) extracting MFs of participants’ different body key points in various activities segmented from autism diagnostic observation schedule (ADOS) videos, and (2) identifying the most relevant MFs through established correlations with existing datasets of participants’ activity level scores evaluated by clinicians. The research investigated two types of MFs, i.e., pixel distance (PD) and instantaneous pixel velocity (IPV); three body key points, i.e., neck, right wrist, and middle hip; and five activities, i.e., Table-play, Birthday-party, Joint-attention, Balloon-play, and Bubble-play, segmented from ADOS videos. Among the different combinations, high correlations with the activity level scores evaluated by the clinicians (greater than 0.6 with p < 0.001) were found in the Table-play activity for both the PD-based MFs of all three studied key points and the IPV-based MFs of the right wrist key point. These MFs were identified as the most relevant ones and could be utilized as an auxiliary means for automating the evaluation of activity levels in ASD assessment.

Table S1 lists key setting parameters for the 20 selected videos.

S2. Classification model selection.
The training process of the person classification model for Video 1 is used as an example to illustrate how the optimal classification model is selected. As shown in Fig. S2, among the epochs where both training accuracy and validation accuracy are optimal (reach 100%), the model with the lowest validation loss is selected as the optimal model. The best model for Video 1 is at Epoch 198; its training accuracy, validation accuracy, training loss, and validation loss are 100%, 100%, 0.016659, and 0.037091, respectively. The training and validation performance for the 20 selected videos is listed to provide more details (see Table S2). We also report testing results: for the 200 randomly selected masked samples, each video achieves a testing accuracy above 96%, and the average testing accuracy is 99.06%.
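The selection rule described above can be sketched as follows. The `history` structure here is a hypothetical example, not the study's actual training log; only the Epoch 198 values are taken from the text.

```python
# Sketch of the model-selection rule: among checkpoints whose training and
# validation accuracy both reach 100%, pick the one with the lowest
# validation loss.
def select_best_epoch(history):
    """history: list of dicts with keys
    'epoch', 'train_acc', 'val_acc', 'train_loss', 'val_loss'."""
    candidates = [h for h in history
                  if h["train_acc"] == 1.0 and h["val_acc"] == 1.0]
    if not candidates:               # fall back to lowest validation loss overall
        candidates = history
    return min(candidates, key=lambda h: h["val_loss"])

history = [
    {"epoch": 197, "train_acc": 1.0, "val_acc": 1.0,
     "train_loss": 0.020000, "val_loss": 0.041000},   # illustrative values
    {"epoch": 198, "train_acc": 1.0, "val_acc": 1.0,
     "train_loss": 0.016659, "val_loss": 0.037091},   # values from Video 1
    {"epoch": 199, "train_acc": 1.0, "val_acc": 0.99,
     "train_loss": 0.015000, "val_loss": 0.050000},   # illustrative values
]
best = select_best_epoch(history)
print(best["epoch"])  # → 198
```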

S3. Key point filtering algorithm.
The ends of a person's limbs may be cut off in the mask image, so the bounding box image corresponding to the mask image is used to identify the key points. When cropping the bounding box image, we extend the original border outward by 50 pixels to avoid losing part of the limbs. Because multiple persons may overlap, other persons can appear in the child's bounding box image, so OpenPose may output key points for more than one person. We elaborate the filtering algorithm as follows. 1) Mask R-CNN identifies each person's mask image and bounding box coordinates on the original image. The person classification model labels these mask images with categories automatically. Each mask image's information is then stored in an array, including the mask image name, category, classification probability, and bounding box coordinates (denoted as (x1, y1) and (x2, y2)).
2) If a mask image is classified as the child, its corresponding bounding box image is pasted onto a black background image for key point recognition in the next step. As shown in Figure S3A of the supplementary materials, two mask images are identified as the child, and their bounding box images, i.e., Bounding box 1 with a classification probability of 0.97 and Bounding box 2 with a classification probability of 0.74, are pasted onto the black background image. Bounding box 1 is selected as the child's bounding box because of its higher classification probability.
3) All persons' key points are extracted by OpenPose. In Figure S3A, the key points of two persons (the parent and the child) are displayed. The total number of non-zero key points for Person i in the masked image is denoted as Ki, and the number of key points in the bounding box for Person i is denoted as ki. We define the proportion of key points as ki/Ki. As shown in Figure S3A, the total number of non-zero key points for Person 1, K1, is 20, and the number of key points in the child's bounding box (i.e., Bounding box 1), k1, is 19, giving a proportion k1/K1 of 0.95 (19/20). Similarly, the total number of non-zero key points for Person 2, K2, is 24, and the number of key points in the child's bounding box, k2, is 9, giving a proportion k2/K2 of 0.375 (9/24). Comparing the two proportions, the key points of the person with the higher value are identified as the child's key points, as shown in Figure S3B.
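Step 3 above can be sketched as follows. The function and coordinate layout are illustrative; OpenPose marks undetected key points with zero coordinates, which is the convention assumed here.

```python
import numpy as np

# Sketch of the key point filtering rule: for each person detected by
# OpenPose, count the non-zero key points (K_i) and the subset lying inside
# the child's bounding box (k_i); the person with the highest proportion
# k_i / K_i is taken as the child.
def child_keypoints(persons, child_box):
    """persons: list of (N, 2) arrays of key points, (0, 0) meaning undetected.
    child_box: (x1, y1, x2, y2) of the child's bounding box."""
    x1, y1, x2, y2 = child_box
    best, best_ratio = None, -1.0
    for kp in persons:
        kp = np.asarray(kp, dtype=float)
        nonzero = kp[(kp[:, 0] != 0) | (kp[:, 1] != 0)]
        K = len(nonzero)
        if K == 0:
            continue
        inside = ((nonzero[:, 0] >= x1) & (nonzero[:, 0] <= x2) &
                  (nonzero[:, 1] >= y1) & (nonzero[:, 1] <= y2))
        ratio = inside.sum() / K          # k_i / K_i
        if ratio > best_ratio:
            best, best_ratio = kp, ratio
    return best, best_ratio
```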

S4. Mismatch of children's key points.
Limited by time and human resources, up to 100 images were randomly selected from each video to estimate accuracy. For some videos, random selection produced duplicates, so fewer than 100 unique images were evaluated. The statistics include the number of images, the number of mismatched images, the mismatch rates before and after applying the filtering algorithm, and the improvement achieved by the filtering algorithm.
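The per-video statistics described above can be sketched as follows. The numbers are illustrative placeholders, not values from the study, and "improvement" is assumed here to be the absolute reduction in mismatch rate.

```python
# Sketch of the mismatch statistics: mismatch rates before and after applying
# the filtering algorithm, and the improvement as their difference.
def mismatch_stats(n_images, mismatched_before, mismatched_after):
    rate_before = mismatched_before / n_images
    rate_after = mismatched_after / n_images
    return rate_before, rate_after, rate_before - rate_after

# Illustrative example: 100 evaluated images, 12 mismatched before filtering,
# 3 mismatched after.
before, after, improvement = mismatch_stats(100, 12, 3)
print(f"{before:.2%} -> {after:.2%}, improvement {improvement:.2%}")
# → 12.00% -> 3.00%, improvement 9.00%
```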

S5. The comparison of pose estimation software
We compared the average precision (AP) of different body parts for the software listed in Table S4 on the MPII Multi-Person Dataset. As Table S4 shows, OpenPose is among the best pose estimation software, and its image processing speed is also the best among those compared.
Table S4. Comparison of different pose estimation software on the MPII Multi-Person Dataset S1, S2

S6. The performance of Mask R-CNN
In our work, we use the Mask R-CNN benchmark developed by Facebook to segment people in the video frames. As Table S5 shows, Mask R-CNN performs very well in instance segmentation, although it is slightly inferior to the newer D-SOLO method S4 . The average precision (AP) for person segmentation is nevertheless high: under the Mask-only model, the person bounding box AP is 53.6 and the person mask AP is 45.8 S3 . Additionally, the Mask R-CNN benchmark is based on PyTorch 1.0, which is compatible with our person classification algorithm and greatly simplifies the integration and automation of our framework. Its pre-trained model for people and common objects also supported extended research on ASD. The backbone of the Mask R-CNN used in our study is ResNet-152-FPN. Mask R-CNN is a two-stage method, i.e., detect-and-mask. In our application, we set the threshold for Mask R-CNN to recognize people to 0.8, which eliminates interference such as mask images of a hand, a leg, or hair.
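The 0.8 threshold and person-class filtering can be sketched as post-processing on detector output. The dictionary layout below mirrors a torchvision-style Mask R-CNN output (`labels`, `scores`, `boxes`, `masks`) and is an illustrative stand-in for the maskrcnn-benchmark output used in the study.

```python
import numpy as np

PERSON_CLASS = 1          # COCO class id for "person"
SCORE_THRESHOLD = 0.8     # drops partial detections (a hand, a leg, hair)

# Sketch: keep only detections that are (a) classified as a person and
# (b) scored at or above the 0.8 confidence threshold.
def keep_confident_people(output):
    """output: dict of equally sized arrays, e.g. 'labels', 'scores', 'boxes'."""
    keep = (output["labels"] == PERSON_CLASS) & \
           (output["scores"] >= SCORE_THRESHOLD)
    return {k: v[keep] for k, v in output.items()}
```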

S7. Movement features characteristics at two key points.
Fig. S4 and Fig. S5 show the characteristics of four MFs at two body key points, i.e., the right wrist and the middle hip, in the five activities, including the range and standard deviation of the MFs.
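The two MF families can be sketched from a key point trajectory. The paper's exact definitions are not reproduced here: PD is assumed to be the Euclidean pixel distance a key point moves between consecutive frames, and IPV that distance scaled by the frame rate; the frame rate and track values are illustrative.

```python
import numpy as np

# Hedged sketch of PD and IPV extraction from a key point trajectory, plus
# the range and standard deviation summaries shown in Fig. S4 and Fig. S5.
def movement_features(track, fps=30.0):
    """track: (T, 2) array of a key point's pixel coordinates over T frames."""
    track = np.asarray(track, dtype=float)
    pd = np.linalg.norm(np.diff(track, axis=0), axis=1)   # pixel distance
    ipv = pd * fps                                        # pixels per second
    stats = {"pd_range": pd.max() - pd.min(), "pd_std": pd.std(),
             "ipv_range": ipv.max() - ipv.min(), "ipv_std": ipv.std()}
    return pd, ipv, stats
```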

S8. Pearson's correlation coefficient (PCC).
We carried out PCC calculations between different MFs in the Table-play activity for the neck key point; the PCC values lie in the range of 0.5 to 0.8, as shown in Table S6. We also computed the PCC between the same MFs across different activities for the neck key point; the results lie in the range of 0 to 0.7, see Table S7.
Table S6. Pearson's correlation between various MFs of the neck key point in the Table-play activity
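The pairwise PCC computation can be sketched as follows; the feature names and values are illustrative placeholders, not data from the study.

```python
import numpy as np
from scipy.stats import pearsonr

# Sketch of the pairwise Pearson correlation table between movement features.
def pcc_matrix(features):
    """features: dict mapping MF name -> 1-D array of per-participant values.
    Returns {(name_a, name_b): (r, p)} for every unordered feature pair."""
    names = list(features)
    table = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r, p = pearsonr(features[a], features[b])
            table[(a, b)] = (r, p)
    return table
```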

S9. Correlation power analysis.
We conducted a post hoc power analysis to confirm whether the sample size was sufficient to achieve the study's goal. We used the current sample size together with the strongest correlations found in the study to calculate power, setting a minimum threshold of r = 0.50 for the power analysis. The significance level was set at p < 0.0125 (one quarter of the conventional p < 0.05 significance level). The results are shown in Table S8.
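One standard way to compute such power, shown here as a hedged sketch rather than the study's exact procedure, uses the Fisher z approximation for a correlation test; the sample size below is illustrative, while r = 0.50 and α = 0.0125 follow the thresholds stated above.

```python
import math
from scipy.stats import norm

# Hedged sketch: post hoc power for testing H0: rho = 0 against an observed
# correlation r with n samples, via the Fisher z approximation.
def correlation_power(r, n, alpha=0.0125):
    zr = math.atanh(r) * math.sqrt(n - 3)   # noncentrality under H1
    zcrit = norm.ppf(1 - alpha / 2)         # two-sided critical value
    return norm.sf(zcrit - zr) + norm.cdf(-zcrit - zr)

# Illustrative sample size (not the study's): power at the r = 0.50 threshold.
power = correlation_power(r=0.50, n=40, alpha=0.0125)
```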