Neural network modeling of altered facial expression recognition in autism spectrum disorders based on predictive processing framework

The mechanism underlying the emergence of emotional categories from visual facial expression information during the developmental process is largely unknown. Therefore, this study proposes a system-level explanation for understanding the facial emotion recognition process and its alteration in autism spectrum disorder (ASD) from the perspective of predictive processing theory. Predictive processing for facial emotion recognition was implemented as a hierarchical recurrent neural network (RNN). The RNNs were trained to predict the dynamic changes of facial expression movies for six basic emotions without explicit emotion labels as a developmental learning process, and were evaluated by the performance of recognizing unseen facial expressions for the test phase. In addition, the causal relationship between the network characteristics assumed in ASD and ASD-like cognition was investigated. After the developmental learning process, emotional clusters emerged in the natural course of self-organization in higher-level neurons, even though emotional labels were not explicitly instructed. In addition, the network successfully recognized unseen test facial sequences by adjusting higher-level activity through the process of minimizing precision-weighted prediction error. In contrast, the network simulating altered intrinsic neural excitability demonstrated reduced generalization capability and impaired emotional clustering in higher-level neurons. Consistent with previous findings from human behavioral studies, an excessive precision estimation of noisy details underlies this ASD-like cognition. These results support the idea that impaired facial emotion recognition in ASD can be explained by altered predictive processing, and provide possible insight for investigating the neurophysiological basis of affective contact.

Impaired affective contact is a core symptom of autism spectrum disorder (ASD), as reported by Kanner 1 , for which the recognition of facial emotion is an essential skill 2 . In fact, a previous meta-analysis found that individuals with ASD have difficulty in facial emotion recognition 3 , and a substantial number of studies have reported atypical processing of facial stimuli or difficulties with real-life emotional recognition in ASD 4 . The neural basis of facial emotion recognition has been intensively investigated by functional neuroimaging studies in healthy subjects [5][6][7] . These previous studies have revealed a hierarchical structure among several brain regions; namely, the activities in the visual cortex correspond to the processing of lower-level sensory information, including features of faces, and the activity patterns in higher-level brain areas such as the fusiform gyrus or superior temporal sulcus correspond to the emotion category 6,7 . In addition, clinical studies have demonstrated that the activity pattern in the facial emotion recognition network is altered in ASD 4,8 . Despite these findings regarding the anatomical neural networks related to facial emotion expression, the developmental process responsible for categorizing visual facial expression information into emotional groups and its alteration in ASD are largely unknown. Although the emergence of emotional categories was hypothesized to be a natural process based on the similarity among facial feature patterns 9,10 , this hypothesis and its alterations in ASD have rarely been

Methods and materials
Facial expression movie datasets and preprocessing. The facial expression movies were obtained from the CK+ public database 25,26 , which included movies in which the face changed from neutral to peak emotions. Written informed consent was obtained for analysis and publication of the images. The movies consisted of image frames captured at 30 frames per second. Each movie in the CK+ database was labeled based on criteria regarding the movements of facial landmarks (i.e., facial action coding system) 27 and the perceptual judgments 25,26 . The current study used the movies of the six basic emotions (Anger, Disgust, Fear, Happiness, Sadness, and Surprise) 2 . In the CK+ database, only a few of the six emotional expressions are included for most subjects. However, an imbalance in the number of emotions in training and test data would hinder the investigation of the relationship among the higher-level neuronal representations of the six emotions. Therefore, we used eightfold cross-validation with an equal number of emotions in each group to evaluate the model (see Supplementary Methods for details). From each movie, we extracted the X-Y coordinates of 68 facial landmarks (136 features) using an automatic face detection and feature tracking system 25 . Thereafter, owing to the limitation of computational cost, features with very strong correlations with other features and almost immobile features were removed; the remaining nine features were used for subsequent analysis (i.e., the X-coordinate of the lip corner, and the Y-coordinates of the middle of the eyebrow, the inner eyebrow, the ala of the nose, the central upper lip, the upper lip vermillion, the lip corner, the central lower lip, and the lower lip vermillion, on the right side of the face). See also Fig. 1A and Supplementary Methods.
Figure 1. (A) The preprocessing of facial expression movies. The trajectories of the 9 facial features corresponding to the facial expressions were utilized as target sequences after mapping to normal face normalization, which is detailed in the methods and materials section. The face image used in this conceptual diagram is that of one of the authors (YT). (B) Overview of S-CTRNNPB. The S-CTRNNPB is a hierarchical recurrent neural network model implementing top-down prediction and bottom-up modulation processes, aiming to minimize precision-weighted prediction error. To simulate autistic cognition, we manipulated the heterogeneity of intrinsic neural excitability by changing the K parameter (i.e., the variance of the activity thresholds in lower-level neurons). It is noteworthy that the emotional labels were not provided to the S-CTRNNPB in the learning process, and higher-level neuronal representations were self-organized based on the similarity among the sensory inputs. S-CTRNNPB, stochastic continuous time recurrent neural network with parametric bias; PB, parametric bias.
The preprocessing of the sequence data was performed as follows. Suppose that the $i$th sequence has $T^{(i)}$ time steps. First, the value at the first step of the sequence is subtracted from the value at every step, so that each feature's first-step value becomes zero.
Next, the values were scaled into the range [−0.9, 0.9] for each feature over all target sequences.
The subtraction in Eq. (1) is referred to as "mapping to normal face normalization" because this process unifies the positions of the features in the first step among all the target sequences. The resulting sequence data are referred to as "target sequences" and were subjected to the following analysis.
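As a minimal sketch of the two preprocessing steps, the following NumPy code subtracts the first frame of each sequence and then rescales every feature over all sequences. It assumes that the scaling is a linear min-max rescaling (the text only states the target range), and the function name `preprocess` and its arguments are illustrative rather than taken from the study.

```python
import numpy as np

def preprocess(sequences, lo=-0.9, hi=0.9):
    """Subtract each sequence's first frame, then rescale every feature
    to [lo, hi] using the min/max taken over all sequences.

    sequences: list of arrays, each of shape (T_i, n_features).
    Assumption: linear min-max scaling; the paper only gives the range.
    """
    # Step 1: "mapping to normal face" normalization. After this step the
    # first frame of every sequence is the zero vector.
    shifted = [seq - seq[0] for seq in sequences]

    # Step 2: per-feature min/max over all time steps of all sequences.
    stacked = np.concatenate(shifted, axis=0)
    fmin, fmax = stacked.min(axis=0), stacked.max(axis=0)
    span = np.where(fmax > fmin, fmax - fmin, 1.0)  # guard constant features

    # All sequences share the same affine map, so their first frames stay
    # identical to each other after scaling.
    return [lo + (hi - lo) * (s - fmin) / span for s in shifted]
```

Because one affine map is applied to all sequences, the first frames remain aligned across sequences after scaling, which is the property the normalization is meant to provide.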
Neural network model. In the current study, the main component of the facial emotion recognition process based on the predictive processing framework was modeled using S-CTRNNPB (Fig. 1B) 21,22 . This framework explains cognition through two processes: (i) top-down prediction of the next sensory input based on the current sensory input and the internal states of the network (i.e., prior belief) and (ii) bottom-up modulation based on precision-weighted prediction error minimization. While the lower-level neurons in S-CTRNNPB represent the dynamics (i.e., short-term sensory processing) of the target sequence, the higher-level neurons represent the abstract meaning for all steps of each sequence. The higher-level neurons in the S-CTRNNPB model are referred to as the parametric bias (PB). The inputs to the S-CTRNNPB are the sensory states corresponding to the facial image at the current time step, and the S-CTRNNPB outputs the prediction of the sensory states for the next time step, corresponding to the changes in facial images. As a result, S-CTRNNPB can generate a sequence corresponding to the dynamic facial expression of a particular emotion (i.e., target sequence). The S-CTRNNPB predicts not only the value of the target sequence in the next step, but also its precision, expressed as the estimated variance of the target values under the assumption of a normal distribution including noise (a lower variance corresponds to a higher precision).
In the top-down prediction of S-CTRNNPB, the internal state of the $i$th neuron at time step $t$ was calculated from its weighted inputs, its time constant, and its activity threshold, where $I_P$, $I_L$, $I_I$, $I_M$, and $I_V$ are the index sets of the PB, lower-level, input, predicted mean, and estimated variance neurons, respectively; $w_{ij}$ is the synaptic connection weight from the $j$th neuron to the $i$th neuron; $x^{(s)}_{t,j}$ is the $j$th external input value at time step $t$ of the $s$th sequence; $l^{(s)}_{t,j}$ is the $j$th lower-level neuron activity; $p^{(s)}_{t,j}$ is the $j$th PB activity; $\tau_i$ is the time constant of the $i$th neuron; and $a_i$ is the activity threshold of the $i$th neuron.
The output of each neuron is then calculated from its internal state by an activation function.
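To make the roles of the symbols defined above concrete, one prediction step can be sketched as follows. This is an illustration under stated assumptions, not the study's exact equations: it assumes a leaky-integrator update for the lower-level neurons, a tanh activation for the lower-level and predicted-mean neurons, and an exponential activation for the estimated-variance neurons (which keeps the variance positive). The function name `step` and the separate weight matrices are hypothetical.

```python
import numpy as np

def step(x_t, u_prev, pb, W_in, W_rec, W_pb, W_mean, W_var,
         a_low, a_mean, a_var, tau):
    """One top-down prediction step (illustrative form only).

    x_t    : current sensory input, shape (n_in,)
    u_prev : previous internal states of lower-level neurons, shape (n_low,)
    pb     : parametric bias activity, constant within a sequence, shape (n_pb,)
    a_*    : activity thresholds; tau: time constant(s) of lower-level neurons
    """
    l_prev = np.tanh(u_prev)  # previous lower-level activity

    # Weighted input from the sensory, recurrent, and PB neurons, plus threshold.
    drive = W_in @ x_t + W_rec @ l_prev + W_pb @ pb + a_low

    # Assumed leaky integration: a larger tau gives slower dynamics.
    u = (1.0 - 1.0 / tau) * u_prev + drive / tau
    l = np.tanh(u)

    # Predicted mean of the next input and its estimated variance.
    mean = np.tanh(W_mean @ l + a_mean)
    var = np.exp(W_var @ l + a_var)  # exponential keeps the variance > 0
    return u, mean, var
```

Calling `step` repeatedly, feeding each returned `u` back in as `u_prev`, generates a predicted sequence for a given PB vector.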
Experimental procedure. The experimental procedure consisted of training and test phases. The training phase is analogous to the developmental learning process and aims to optimize the network structure (i.e., synaptic weights) of the S-CTRNNPB and the PB activity associated with each target sequence. It is noteworthy that the network was trained only for predicting the changes in the sensory states of facial expression movies, but the labels of emotions were not provided to the model. After the training, each target sequence was associated with particular activities of PBs, and the relationships among the target sequences (similarity and differences) were expected to be "self-organized" in the state-space of PB activities.
In the test phase, the network was required to predict an unseen target sequence. In this test phase, while the network structure was fixed, the PB activities were updated to minimize the precision-weighted prediction error for an unseen test sequence. This PB update process for an unseen test sequence was regarded as "emotion recognition" based on the similarity of the PB activity for a test sequence to the PB clusters for the training sequences of a particular emotion category. A detailed explanation of the clustering index for PB spaces is provided in the Supplementary Methods section.
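The test-phase procedure, in which the weights are fixed and only the PB activity is updated, can be sketched as below. The sketch assumes that the precision-weighted prediction error is the Gaussian negative log-likelihood (squared error divided by the predicted variance, plus a log-variance term) and uses finite-difference gradient descent purely to stay self-contained; the optimizer used in the study is not specified in this section. The names `pw_error`, `infer_pb`, and `predict` are hypothetical.

```python
import numpy as np

def pw_error(target, mean, var):
    """Precision-weighted prediction error, assumed here to be the Gaussian
    negative log-likelihood with the constant term dropped."""
    return 0.5 * np.sum(np.log(var) + (target - mean) ** 2 / var)

def infer_pb(target, predict, pb0, lr=0.05, steps=200, eps=1e-5):
    """Adjust only the PB vector so that the fixed network `predict`
    reproduces `target` as well as possible.

    predict: callable mapping a PB vector to (mean, var) arrays shaped like
             `target`; it stands in for the trained network.
    """
    pb = np.array(pb0, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(pb)
        for k in range(pb.size):
            d = np.zeros_like(pb)
            d[k] = eps
            up = pw_error(target, *predict(pb + d))
            down = pw_error(target, *predict(pb - d))
            grad[k] = (up - down) / (2.0 * eps)  # central difference
        pb -= lr * grad  # gradient descent on PB only
    return pb
```

The PB vector returned for an unseen sequence can then be compared with the PB clusters obtained for the training sequences.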
Parameter manipulations to simulate autistic cognition. We simulated two pathological conditions of network structures based on biological or computational findings in ASD. First, the intrinsic heterogeneity of network excitability is important for efficient information processing [29][30][31] , and its alterations have been suggested to be related to ASD, i.e., altered "excitatory-inhibitory balance" 24 . In the current experiment, as shown in Eq. (11), the activity threshold of lower-level neurons (i.e., a i in Eq. (5)) is initialized to follow a Gaussian distribution and fixed without being updated by learning.
The K parameter in Eq. (11) determines the heterogeneity of intrinsic neuronal excitability: as K increases, the excitability of the network becomes more heterogeneous. In the current analysis, the heterogeneous network with K = 1000 was regarded as a typical developmental model, and homogeneous networks with K = 0.001 and K = 1 were assumed to be possible ASD-like models. Second, because biological studies have shown that the brains of individuals with ASD have a greater number of cortical minicolumns (i.e., structures that constitute basic functional assemblies of neurons) 32, 33 , the influence of an increased number of lower-level neurons on the performance of the models was also evaluated in this study. To investigate the influences of the parameter K and the number of lower-level neurons on the model performance, the four representative network structures in Table 1 were subjected to the analysis described earlier.
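The threshold manipulation can be illustrated with a short sketch. It treats K as the variance of a Gaussian from which the activity thresholds are drawn once and then kept fixed; the zero mean, the seed, and the function name `init_thresholds` are assumptions made for illustration.

```python
import numpy as np

def init_thresholds(n_lower, K, seed=0):
    """Draw fixed activity thresholds for the lower-level neurons.

    K is treated as the variance of the Gaussian (assumption: zero mean).
    The thresholds are not updated during learning.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=np.sqrt(K), size=n_lower)

# Heterogeneous (typical development) versus homogeneous (ASD-like) settings.
a_heterogeneous = init_thresholds(100, K=1000.0)
a_modest = init_thresholds(100, K=1.0)
a_excessive = init_thresholds(100, K=0.001)
```

With K = 1000 the thresholds spread over tens of units, so many neurons are pushed far from the sensitive range of a saturating activation function, whereas with K = 0.001 all thresholds are nearly identical.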

Results
The performance of neural networks of typical development model. The learning curves indicated that the prediction errors for both the training and test sequences decreased substantially (Fig. 2A), suggesting that the model not only succeeded in reproducing the training target sequences, but also acquired the generalization capability to predict the unseen test target sequences. The target sequences, which differed according to emotions, were well reproduced by network prediction (Fig. 2B). The activity pattern in the lower-level neurons corresponded to the short-term dynamics of the target sequences. Among the lower-level neurons, the activities of several neurons changed over time, while the activities of other neurons remained almost constant. On the other hand, the activity patterns of PBs seemed to correspond to a more abstract level of characteristics of a sequence independent of the short-term changes in sensory states (Fig. 2B).
The PB activities for all target sequences are illustrated in Fig. 2C. In Fig. 2C, the PBs corresponding to the training target sequences (outline symbols) seemed to be clustered according to the emotional categories. Based on this finding, although the emotion labels were not provided in the training, the emotional clusters successfully emerged through the predictive processing framework. In addition, the PB activities for the test sequences (filled symbols) that were optimized through the emotion recognition process by minimizing the precision-weighted prediction error were located close to the PB clusters for the training sequences of the same emotion (Fig. 2C), indicating that the models successfully recognized facial expressions of unseen test sequences.
Predictive performances among various models. The influence of the network characteristics on predictive processing was investigated using various network structures, including the intrinsic heterogeneity of network excitability and the size of the network (Table 1). Compared with the heterogeneous (typical development) network model, the excessively or modestly homogeneous network models had smaller training errors and larger test errors, while the network with an increased number of neurons (large network model) showed larger training and test errors ( Fig. 3A; see Supplementary Figs. S1A,B, S2A for more details).
In relation to the aberrant precision hypothesis in ASD, the estimated variance of the target values was also investigated among the various models. Compared with the heterogeneous (typical development) network, the homogeneous network models tended to estimate lower variance (i.e., higher precision), and the large network estimated larger variance (i.e., lower precision) (Fig. 3B; see Supplementary Fig. S2B for more details). Based on the abovementioned findings, the excessively or modestly homogeneous network models estimated excessive precision on sensory input, resulting in overfitting (i.e., low training error and high test error), which is consistent with the ASD-like cognition based on the aberrant precision hypothesis.
[Figure 3, partial caption: the network conditions are those in Table 1, and the Y axis indicates the MSEs, which are averaged over the sequence length, feature numbers, and sequence numbers. An enlarged view is shown in Supplementary Fig. S1A.]
[Figure 4, partial caption: the network conditions are those in Table 1, and the Y axis indicates average silhouette widths based on eightfold cross-validation. The silhouette width is a measure of the similarity of an object to its own cluster compared to other clusters (detailed in Supplementary Methods). PB, parametric bias; MNF normalization, mapping to normal face normalization.]
Emotion recognition among various models. Emotion recognition (i.e., the degree of clustering of PB activities for training sequences and optimized PB activities for unseen test sequences) was also compared among the various networks. For each model condition, the distribution of PB activities acquired for the training and test datasets is illustrated in Fig. 4A-E, and the clustering index is shown in Fig. 4F. In the excessively homogeneous network (Fig. 4A), the PB activities for training sequences were not clustered, and those for the test sequences were located far from those for the training sequences. Therefore, the excessively homogeneous network model failed to acquire the PB representation corresponding to the emotional categories or to recognize the similarity between training and test sequences within the same emotion. The clustering index showed that not only excessively but also modestly homogeneous networks showed a tendency toward weaker clustering for the training and test sequences (Fig. 4F). On the other hand, in the heterogeneous network (Fig. 4C,F) and the large network (Fig. 4D,F), the PB activities for training and test sequences were successfully clustered according to emotion. Additionally, to evaluate the effect of mapping to normal face normalization on emotion recognition, the PB activities optimized for the target sequences in the heterogeneous network without this normalization are shown in Fig. 4E,F.
Remarkably, without this normalization, the PB activities for training sequences were not clustered and those for the test sequences were remotely located from those for training sequences, suggesting that the process of mapping to a normal face was essential for the emergence of emotional categories and facial emotion recognition in the predictive processing framework.
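The clustering index reported above is the average silhouette width. A minimal sketch of the standard silhouette computation, applied to PB vectors labeled by emotion, is given below; Euclidean distance and the function name `mean_silhouette` are assumptions, and the exact procedure used in the study is described in its Supplementary Methods.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Average silhouette width of labeled points (Euclidean distance).

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster; the
    width is (b - a) / max(a, b). Requires at least two distinct labels.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    clusters = np.unique(labels)
    widths = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            widths.append(0.0)  # singleton cluster: width defined as 0
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(widths))
```

Values near 1 indicate that PB activities of the same emotion lie close together and far from other emotions, whereas values near or below 0 indicate overlapping or misassigned clusters.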
Tolerance of higher-level neural representation. As mentioned above, the heterogeneous (typical development) network acquired the generalization capability to predict test target sequences using PB activity similar to those for training datasets with the same emotion. We then hypothesized that the PB representations in a heterogeneous network would tolerate subtle differences among the target sequences for prediction, while the homogeneous network would be fragile to subtle differences due to overfitting to a particular sequence of the training targets. To investigate this hypothesis, we evaluated the number of training sequences that could be predicted well (i.e., average mean squared error < 0.005) by changing the levels of PB activities for various network conditions. This number of successfully predicted sequences would reflect the tolerability of PB representation for application to different sequences. Figure 5A-D shows the number of well-predicted sequences for each PB activity with various network conditions in Table 1. Excessively or modestly homogeneous networks reproduced a smaller number of target sequences for each PB activity, which implies that the homogeneous network cannot tolerate the subtle difference among the target sequences even in the same emotion category (Fig. 5A,B). This intolerance to the subtle difference supported the idea that the homogeneous network was overfitted to the training sequence, in addition to the findings of these models' high test error and low training error. On the other hand, the heterogeneous network could predict a larger number of sequences by each PB activity and tolerated the difference within the emotion category (Fig. 5C). For the large network, no PB activities enabled the generation of the training sequence with a sufficiently small prediction error (< 0.005) (Fig. 5D), which is consistent with the findings of the relatively large training error of these models in Fig. 3A. 
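The tolerance measure described above amounts to counting, for one PB setting, how many training sequences are reproduced with an average mean squared error below 0.005. A minimal sketch follows; comparing over the shared length when sequence lengths differ is an assumption, and `count_well_predicted` is a hypothetical name.

```python
import numpy as np

def count_well_predicted(generated, targets, threshold=0.005):
    """Count the target sequences that one generated sequence predicts well.

    generated: array (T, n_features) produced by the network for one PB value.
    targets:   list of arrays (T_i, n_features).
    A target counts as well predicted if its mean squared error is below
    `threshold` (0.005 in the text).
    """
    count = 0
    for tgt in targets:
        T = min(len(generated), len(tgt))  # assumption: compare shared length
        mse = np.mean((generated[:T] - tgt[:T]) ** 2)
        if mse < threshold:
            count += 1
    return count
```

Evaluating this count over a grid of PB values gives a map of how many sequences each PB activity can account for.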
The distribution of lower-level neuron activity depending on the intrinsic heterogeneity of network excitability. Considering that many lower-level neurons showed almost constant levels of activity during sequence generation (Fig. 2B), the anatomical size of the network (total number of neurons) and the number of neurons actually recruited to embed the dynamics of facial expression (functional size of network) could be dissociated. To investigate the relationship between the functional size of the network and the intrinsic heterogeneity of network excitability, the distributions of the range of activity (i.e., the difference between the maximum and minimum outputs) were plotted as a function of the activity threshold of each lower-level neuron in the homogeneous and heterogeneous network conditions (Fig. 5E). In the excessively and modestly homogeneous networks, the activity thresholds were tightly distributed and the range of activity was widely distributed over non-zero values, indicating that all of the lower-level neurons were activated. On the other hand, the heterogeneous network had a wider distribution of activity thresholds, and the activity ranges of a substantial number of lower-level neurons were nearly zero. Therefore, compared with the homogeneous network, the heterogeneous network was characterized by not only the large variance in activity thresholds but also the smaller size of the functional network.
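The distinction between anatomical and functional network size can be made concrete with a short sketch. The range of activity is computed per neuron as the maximum minus the minimum output over time, and the functional size is the number of neurons whose range exceeds a small tolerance; the tolerance value and the function names are assumptions for illustration.

```python
import numpy as np

def activity_range(lower_activity):
    """Range of activity per neuron (max minus min output over time).

    lower_activity: array of shape (time_steps, n_neurons).
    """
    return lower_activity.max(axis=0) - lower_activity.min(axis=0)

def functional_size(lower_activity, tol=1e-2):
    """Number of neurons whose activity actually changes over time.

    tol is an illustrative cut-off for "nearly zero" range.
    """
    return int(np.sum(activity_range(lower_activity) > tol))

# Toy illustration: a neuron with an extreme threshold saturates under tanh,
# so its output stays constant although it is anatomically present.
drive = np.linspace(-1.0, 1.0, 50)
thresholds = np.array([0.0, 0.2, 50.0])
activity = np.tanh(drive[:, None] + thresholds[None, :])
n_functional = functional_size(activity)
```

In this toy case the third neuron is saturated throughout, which mirrors how a wide threshold distribution can shrink the functional network without changing the total number of neurons.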

Discussion
To the best of our knowledge, the current study is the first to evaluate facial emotion recognition based on a predictive processing framework related to ASD. The current study succeeded in showing the following: First, the perceptual categories of emotions can emerge in a hierarchical neural system through a learning process based on precision-weighted prediction error minimization. Second, the cognitive process to estimate the emotion of unseen facial expressions can be understood as the process of adjusting higher-level neural states based on minimizing precision-weighted prediction error. Third, altered facial emotion recognition in ASD can be simulated by homogeneous intrinsic neural excitability in lower-level neurons.
Using a hierarchical predictive processing framework, we demonstrated that predictive learning of facial features is sufficient for the self-organization of emotional categories in the higher level of network hierarchy without explicit emotional labels provided. Related to this finding, a previous study reported that self-organized higher-level neural representation can be used to discriminate genuine and fake emotions from facial movies using a hierarchical RNN with PB 34 . Combined with our findings, this suggests that the extraction of abstract information from dynamic facial expressions can be understood using the predictive processing framework.
In the current study framework, emotional categorization or recognition and generalization were influenced by the intrinsic heterogeneity of neural excitability, which could be understood by the functional size of the neural network (i.e., the number of neurons whose activity changes over time) rather than the anatomical size of the neural network (i.e., total number of neurons). The relationship between model performance and functional network structures is summarized in Fig. 6. The excessively or modestly homogeneous networks showed altered generalization capability and emotional categorization. The increased (anatomical) number of lower-level neurons decreased the predictive accuracy for both training and test sequences, but the emotional categories emerged after learning and the network could recognize the emotion of unseen sequences, which is different from characteristic ASD cognition. Indeed, there are conflicting results in studies using artificial neural networks to simulate ASD cognitive patterns by manipulating the number of neurons [35][36][37] . The current study suggests that, rather than anatomical network size, functional network size would be more closely related to autistic cognitive traits (i.e., altered generalization capability and perceptual emotional categorization).
Since the current neural networks model the firing frequency of a neuron population in a living brain, it is important to note that the number of neurons in our model does not directly correspond to that in the biological brain. Therefore, based on the results of the current neural network, it is difficult to directly discuss the specific number of neurons in an organism. However, we believe that we have succeeded in showing a decreasing tendency of the model's generalization ability for excessively large functional networks. This trend, demonstrated computationally in this study, is consistent with biologically confirmed findings that ASD patients have larger brains 38 , more minicolumns 32,33 , and an excitatory/inhibitory imbalance 39 .
We also demonstrated that mapping to normal face normalization was essential for facial emotion recognition with self-organized emotional categories in higher-level neurons. There is accumulating evidence supporting that specific brain areas (i.e., fusiform face area or superior temporal sulcus) are involved in preprocessing of facial expression information, which is different from that of the other visual objects [40][41][42][43] . Given our findings, these brain areas mediating facial information processing are likely involved in the "mapping to normal face" function to extract the emotion from the dynamic facial stimuli.
In the current study, the predictive processing framework achieved clustering by emotion using only perceptual information and also achieved generalization, but the higher-level neural representations showed overlap between PB clusters representing different emotions. This is not surprising, however, given that the higher-level neuronal representations were formed solely by predicting the visual features of facial expression images. This may reflect the fact that the visual features of facial expression images are similar, to some extent, for each emotion, and that they are not clearly divided into emotional clusters. Even in healthy subjects, confusion can occur when emotion recognition is based only on visual information of facial expressions 44 . Therefore, a clearer emotional category would be created by integrating not only facial expressions but also body posture, voice, and background information.
The spatial relationship between emotional clusters in higher-level neural representations is interesting, but there are limitations to what can be discussed from this study. In the current study, it was difficult to find a clear correspondence between the arousal and valence axes and the locations of emotions in higher-level neural representations. One reason may be the limitation in the information that can be read from the visual information of facial images alone, as mentioned above. Another reason could be that the valence and arousal levels in the six basic emotions used in this study are not balanced, with more negative-valence and high-arousal emotions. Future studies with training data prepared for this purpose could investigate the relationship between the emotion clusters and the valence and arousal axes in higher-level neuron representations.
In the current study, we focused on the variance of activity thresholds in the lower-level neurons (i.e., K parameter) for simulating autistic symptoms by referring to the excitatory-inhibitory imbalance hypothesis 39 and previous studies on network heterogeneity and efficient coding [29][30][31] . In this study, we achieved the best emotional clustering and generalization with K = 1000, but the optimal K value depends on various other experimental conditions [29][30][31] . This study demonstrated that a small activity threshold variance tends to produce an ASD-like phenotype, but the range of K values should be considered for each experimental condition. Currently, there is a lack of biological studies available to determine the specific K value for typical development and ASD.
The current study demonstrated that the facial emotion recognition process and its alterations in ASD can be understood using a predictive processing framework based on computational psychiatry methods. Computational psychiatry methods using a predictive processing framework have been suggested to be useful in understanding autistic behavior or perception in previous studies 23,24 , while our study is the first to suggest that these methods could also be applied to investigate the social interaction defects of ASD symptoms (i.e., affective contact). Our findings may open the door to future studies investigating the relationship between network characteristics and various components of psychiatric symptoms by simulating system-level information processing using computational psychiatry methods.