Background & Summary

Faces contain rich information useful for social interaction1. Researchers have widely used facial stimuli to explore cognitive and emotional processing in both healthy individuals and those with disorders2,3,4. Emotional faces are among the most common stimuli in emotion research, and standardized emotional face datasets, such as the Radboud Faces Database5, FACES6, and the American Multiracial Faces Database7, have been created to provide research materials with relatively uniform facial features and good image quality. However, researchers have found significant cross-cultural differences in emotion recognition across racial groups8.

To overcome the influence of culture on facial expression recognition, researchers in China have established localized emotional face databases based on the basic emotion model (comprising happiness, anger, fear, sadness, disgust, and surprise). The most widely used database is the Chinese Affective Face Picture System9, which was collected from recruited actors posing different emotions. However, some of the posed expressions may appear exaggerated or artificial to observers. Furthermore, because some volunteers could not pose all six emotions well, many subjects in these datasets are missing some expressions. Therefore, a more natural and standardized Chinese emotional face dataset is needed.

Our faces change as we grow older, yet age information is rarely considered in face-related studies. Initial research showed that social judgments of young faces are primarily organized along two dimensions: trustworthiness and dominance10,11. However, when age was included as a factor and observers evaluated faces of diverse ages, attractiveness emerged as a third dimension of facial social judgment12. Moreover, classical face models propose that age, along with emotion and sex, plays a critical role in facial perception13. These findings underscore the importance of facial age as a critical factor in social judgments. Nevertheless, facial age information is frequently unavailable in current facial datasets, which impedes age-related research on faces.

Additionally, the facial expressions encountered in social interaction are typically dynamic. Dynamic facial expressions evoke stronger emotional responses than static ones and are recognized more accurately14,15. Dynamic face datasets therefore help explain how people perceive the dynamic properties of faces. Although researchers have created a dynamic face dataset based on Caucasian women and men14, a Chinese counterpart is still lacking. A Chinese dynamic facial expression dataset is needed to capture the characteristics of dynamic facial expression changes in Chinese individuals and to support cross-cultural comparisons.

Although several Chinese facial datasets are available9,16,17, the authenticity of the expressions posed by volunteers and the diversity of facial ages are limited, which compromises the credibility and validity of research findings based on such datasets. Furthermore, recruiting a large number of volunteers across diverse ages is challenging and requires a substantial investment of time and resources to train them to display the required emotions. Recent artificial intelligence (AI) technologies can help overcome this bottleneck in data collection18.

Compared with collecting real faces, using AI-generated faces offers advantages in experimental control, standardization, and the ease of obtaining novel stimuli7. We propose a method that introduces facial action units into a pre-trained StyleGAN to achieve high-quality expression editing. The approach produces naturally synthesized expressions without artifacts. Furthermore, we trained our model on Chinese faces with well-controlled identities, so that consistent basic emotions can be generated for each individual. Our method also supports age progression and dynamic attribute editing. The proposed method can thus extend currently available facial datasets, enhancing their quality, authenticity, and diversity.

In this study, a Generative Adversarial Network (GAN), namely the StyleGAN model, was employed to generate facial images. Our contribution is the creation of a comprehensive face dataset, SZU-EmoDage, which comprises facial images of 120 individuals (equally divided between men and women) with six basic emotions, various ages, and dynamic emotions. Specifically, the StyleGAN model enables the manipulation of facial expressions and age, producing all six basic facial emotions for each individual. To meet the growing interest in facial age perception, facial images representing ages from 10 to 70 years in 10-year increments were also generated. Notably, the SZU-EmoDage dataset incorporates dynamic and continuous changes in facial expressions, providing a valuable resource for further research in the field.

In summary, we present SZU-EmoDage, the first facial dataset synthesized using AI technologies for face perception research. Notably, the authenticity of the expressions and the diversity of faces across age groups surpass those of existing face datasets. This dataset makes a valuable contribution to the field of facial perception, particularly in areas such as cross-cultural analysis, dynamic facial perception, and facial age perception. Additionally, the extensive variation in face material can serve as an effective tool for detecting mental disorders. The dataset generated in this study represents a significant expansion of currently available facial materials and, owing to its improved quantity and diversity, is likely to have a substantial impact on related research.

Methods

Participants

We recruited 120 participants (60 men and 60 women, aged 18 to 28 years; M ± SD: 20.47 ± 1.83) for the study. All participants reported no history of mental illness and had normal or corrected-to-normal vision. All participants signed an informed consent form, and we followed the principles of voluntary withdrawal and no harm. Participants were paid 100 RMB after completing the experiment. The study was performed in accordance with the Declaration of Helsinki and approved by the local ethics committee of Shenzhen University.

Procedure

The procedure comprised three parts. (1) To ensure that the generated faces align with Chinese facial features, we used open face datasets5,9,17,19,20 to train a StyleGAN-based editing model and applied it to transform a neutral face into six different expressions. All data used in this research were obtained with the informed consent of the participants. (2) During the transformation of a neutral face into different expressions, interpolation of latent vectors was employed, enabling the generation of dynamic expressions with varying intensities. (3) Finally, to generate neutral faces of different ages, the open-source SAM (Style-based Age Manipulation) model21 was used. Starting from the neutral face of a subject, this model generated faces ranging from 10 to 70 years old.

Specifically, we used StyleGAN22-based AU (action unit) editing to change the expression of facial images. An AU corresponds to the contraction or relaxation of one or more facial muscles. Because facial expressions can be decomposed into combinations of multiple AUs23, changing a group of AUs can synthesize a desired expression on a facial image, as illustrated in the sketch below.
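To make the AU representation concrete, the following sketch builds a 17-dimensional AU intensity vector of the kind used later in the Methods. The specific AU list shown here follows the intensity outputs of a commonly used extractor (OpenFace 2.0) and is an assumption for illustration only, as is the happiness prototype (AU6 cheek raiser plus AU12 lip corner puller, a standard FACS example); in the actual pipeline the target AU vectors are extracted from labelled reference images rather than specified by hand.

```python
import numpy as np

# 17 action units whose intensities are estimated by common AU extractors
# such as OpenFace 2.0 (an assumed convention; the extractor used in this work is ref. 26).
AU_NAMES = ["AU01", "AU02", "AU04", "AU05", "AU06", "AU07", "AU09", "AU10",
            "AU12", "AU14", "AU15", "AU17", "AU20", "AU23", "AU25", "AU26", "AU45"]

def au_vector(active, intensity=3.0):
    """Build a 17-dimensional AU intensity vector with the given AUs switched on."""
    vec = np.zeros(len(AU_NAMES), dtype=np.float32)
    for au in active:
        vec[AU_NAMES.index(au)] = intensity
    return vec

# Illustrative FACS prototype for happiness: cheek raiser (AU6) + lip corner puller (AU12).
au_happiness = au_vector(["AU06", "AU12"])
print(dict(zip(AU_NAMES, au_happiness.tolist())))
```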

Our model comprises three main modules: the StyleGAN encoder, the AU fusion module, and the StyleGAN generator. The StyleGAN encoder uses the encoder architecture and pretrained model from Pixel2Style2Pixel24 and remains frozen throughout training. Its primary function is to extract image features and encode them into the latent space of StyleGAN, yielding the latent vector corresponding to the image. The AU fusion module consists of the AU encoder, the Style extractor, and the Style fusioner. The AU encoder maps the input target AU intensity vector into the latent-vector space, capturing specific AU attributes and target expression information; a 5-layer multi-layer perceptron (MLP) is employed for this mapping. The Style extractor and the Style fusioner are also 5-layer MLPs. The Style extractor extracts features such as identity and background from the latent vector, which are then concatenated with the target AU latent vector produced by the AU encoder. The concatenated vector is fed into the Style fusioner, which combines the style attribute features with the expression features and generates a new latent vector carrying the desired AUs. Through the AU fusion module, AUs and expressions can thus be manipulated in the latent space. The StyleGAN generator uses the StyleGAN FFHQ pretrained model25 and remains frozen throughout training; given a latent vector carrying the target AUs, it outputs the face with the desired expression.
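As a concrete illustration of the AU fusion module, the PyTorch sketch below wires together the three 5-layer MLPs described above. The hidden widths, activation functions, W+ latent layout (18 × 512), and the 0–5 AU intensity range are our assumptions for illustration; the description above specifies only the 5-layer MLP structure, and random tensors stand in for the outputs of the frozen pSp encoder and AU extractor.

```python
import torch
import torch.nn as nn

LATENT_DIM = 512    # dimensionality of one StyleGAN latent layer
N_LAYERS = 18       # W+ latent layers for a 1024x1024 StyleGAN (assumption)
N_AU = 17           # number of action-unit intensities

def mlp(in_dim, out_dim, hidden=512, depth=5):
    """A 5-layer MLP, as used for the AU encoder, Style extractor and Style fusioner."""
    layers, dim = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(dim, hidden), nn.LeakyReLU(0.2)]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

class AUFusionModule(nn.Module):
    """Maps (latent vector w, target AU intensities) -> edited latent vector."""
    def __init__(self):
        super().__init__()
        self.au_encoder = mlp(N_AU, LATENT_DIM)                            # target AUs -> AU latent
        self.style_extractor = mlp(N_LAYERS * LATENT_DIM, LATENT_DIM)      # w -> style features
        self.style_fusioner = mlp(2 * LATENT_DIM, N_LAYERS * LATENT_DIM)   # fused -> new w

    def forward(self, w, au):
        style = self.style_extractor(w.flatten(1))    # identity/background features
        au_feat = self.au_encoder(au)                 # expression features
        fused = torch.cat([style, au_feat], dim=1)
        return self.style_fusioner(fused).view(-1, N_LAYERS, LATENT_DIM)

# Stand-ins for the outputs of the frozen pSp encoder and the AU extractor.
w_source = torch.randn(1, N_LAYERS, LATENT_DIM)   # latent of the source face
au_target = torch.rand(1, N_AU) * 5.0             # target AU intensities (assumed 0-5 scale)

# The edited latent would be decoded by the frozen StyleGAN generator.
w_edited = AUFusionModule()(w_source, au_target)
print(w_edited.shape)                             # torch.Size([1, 18, 512])
```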

In the training process, we paired different expression images of the same person to obtain an original expression image I1 and a target expression image I2. We then obtained the latent vector w1 corresponding to image I1 using the StyleGAN encoder24, and an AU vector au2 representing the contraction or relaxation of 17 AUs of face image I2 using an AU extractor26. The latent vector w1 was fed into the Style extractor to extract style features, which were concatenated with the output of the AU encoder applied to the target AU vector au2; the concatenated vector was then fed into the Style fusioner to obtain a new latent vector w2’ encoding the target expression. Finally, w2’ was fed into the StyleGAN generator22 to synthesize the face image I2’ with the target expression. To generate different expressions of a face image Is, a set of AU vectors AUt = (aut1, …, aut7) for the 7 target expressions (including neutral) was extracted from reference images with the seven expression labels. The latent vector ws of Is was then input, together with each target AU vector auti (i ∈ [1, 7]), into the trained model to obtain the latent vector wt, which was then used by the StyleGAN generator to synthesize a face image It with the target expression (Fig. 1).

Fig. 1
figure 1

The workflow of facial expression editing.

All images were mapped by StyleGAN into a smooth latent space, W. Two latent vectors that are close in this space generate similar images. As a result, interpolation in the latent space W can be used to generate intermediate expressions between the face with the original expression Is and the target expression It. Specifically, we performed linear interpolation between the original expression latent vector ws and the target expression latent vector wt to generate multiple intermediate latent vectors. The closer an intermediate latent vector is to ws, the more similar the expression image generated by the StyleGAN generator is to Is, and vice versa. In this way, we obtained many faces with intermediate expressions interpolated between the two expression images, which were concatenated in sequence to form a dynamic expression clip.
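A minimal sketch of this interpolation step is given below. The latent shape (18 × 512) and the number of interpolation steps are illustrative assumptions, and random tensors stand in for the encoded source latent and the AU-edited target latent; decoding each interpolated latent with the frozen StyleGAN generator would yield one frame of the dynamic clip.

```python
import torch

def interpolate_latents(w_s, w_t, steps=30):
    """Linearly interpolate between a source latent w_s and a target latent w_t.
    Early frames stay close to the source expression; later frames approach
    the target expression."""
    alphas = torch.linspace(0.0, 1.0, steps)
    return [(1.0 - a) * w_s + a * w_t for a in alphas]

# Stand-ins for the encoded neutral-face latent and an AU-edited target latent.
w_neutral = torch.randn(18, 512)
w_target = torch.randn(18, 512)

frames = interpolate_latents(w_neutral, w_target, steps=30)
print(len(frames), frames[0].shape)   # 30 torch.Size([18, 512])
```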

For age synthesis, SAM21 was used to obtain images with the desired age, Iage, which can be mapped into latent vectors wage in the StyleGAN latent space. The GAN prior embedded network (GPEN)27 was further used to increase the resolution of the facial images. Similar to expression interpolation, dynamic age changes can be realized by interpolating latent vectors between faces of different ages. In total, we generated the seven basic expressions, aging faces, and dynamic emotional faces for 180 individuals (half men, half women).
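For the dynamic age changes, the interpolation can be chained through the latents of consecutive age anchors. The sketch below illustrates this piecewise-linear scheme; the latent shape, the number of frames per segment, and the random tensors standing in for the SAM-produced latents are assumptions, and the GPEN super-resolution step is omitted.

```python
import torch

AGES = list(range(10, 71, 10))   # anchor ages: 10, 20, ..., 70 years

def aging_sequence(age_latents, steps_per_segment=15):
    """Build a smooth aging clip by piecewise-linear interpolation through the
    latent vectors of consecutive age anchors (10 -> 20 -> ... -> 70 years).
    Decoding each latent with the StyleGAN generator yields one frame."""
    frames = []
    for w_a, w_b in zip(age_latents[:-1], age_latents[1:]):
        for a in torch.linspace(0.0, 1.0, steps_per_segment):
            frames.append((1.0 - a) * w_a + a * w_b)
    return frames

# Stand-ins for one subject's SAM-produced latents at each anchor age.
age_latents = [torch.randn(18, 512) for _ in AGES]
clip = aging_sequence(age_latents)
print(len(clip))   # (len(AGES) - 1) * 15 = 90 frames
```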

To validate the efficacy of the proposed data generation method, our method was compared with several state-of-the-art expression editing methods, including HiSD28, GANimation29, Expression-Manipulator (ExprMAN)30, and InterfaceGAN31 (Fig. 2). Each method was used to generate the neutral and six basic expressions for the same individual.

After using StyleGAN to generate the facial images, we recruited participants to rate the morphed faces using 9-point scales9; the development process followed that of a related facial dataset9. Eight participants were first invited to evaluate the emotional representation of the pictures and to perform a preliminary screening. Finally, the faces of 60 men and 60 women were selected as the formal experimental materials. These 120 individuals yielded 840 emotional faces in total, which were then evaluated for emotional category and emotional dimensions (valence, arousal, and dominance). To prevent fatigue from judging numerous faces, we divided the assessment into three parts and recruited 40 college students (20 men and 20 women) for each part. The first group of participants, aged 18 to 23 (M ± SD: 19.88 ± 1.65), evaluated the emotional category of the presented faces. The second group, aged 18 to 25 (M ± SD: 20.50 ± 1.88), evaluated valence (positive, neutral, or negative), arousal (from 1 = “not at all excited” to 9 = “very excited”), dominance (from 1 = “weak sense of dominance” to 9 = “strong sense of dominance”), and authenticity (from 1 = “not authentic at all” to 9 = “very authentic”) of the faces. The third group, aged 19 to 28 (M ± SD: 21.18 ± 1.77), evaluated the ages of the faces with neutral expressions (Fig. 3).

Fig. 2
figure 2

Facial expressions generated by different algorithms.

Fig. 3
figure 3

Overview of face acquisition. The dataset includes faces with dynamic expressions, different ages, and different emotions.

Data Records

The face dataset is freely available at https://osf.io/7a5fs/ under a CC license32. The face images and videos of different emotions, ages, and dynamic expressions are stored in three separate compressed folders. Within each folder, the face images or videos generated from the same individual are organized into a subfolder named “<gender> <id>”, where “gender” and “id” refer to the gender and ID of the individual. Face images are named according to the corresponding expression or age, while videos are named according to the corresponding expression and video duration.
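For convenience, a short Python sketch for indexing a local copy of the extracted archives is given below. The root path and the subset folder name used here (“SZU-EmoDage”, “emotions”) are placeholders rather than a guaranteed listing of the OSF repository; only the “<gender> <id>” subfolder convention is taken from the description above.

```python
from pathlib import Path

ROOT = Path("SZU-EmoDage")   # placeholder path to the extracted archives

def index_subjects(subset):
    """Group files by subject within one of the three subsets
    (e.g. the emotion, age, or dynamic-expression folder)."""
    subset_dir = ROOT / subset
    if not subset_dir.is_dir():
        raise FileNotFoundError(f"Subset folder not found: {subset_dir}")
    index = {}
    for subject_dir in sorted(subset_dir.iterdir()):
        if subject_dir.is_dir():
            # Subfolders follow the "<gender> <id>" convention described above.
            index[subject_dir.name] = sorted(p.name for p in subject_dir.iterdir())
    return index

if __name__ == "__main__":
    for subject, files in index_subjects("emotions").items():   # "emotions" is a placeholder name
        print(subject, len(files), "files")
```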

Technical Validation

We conducted a comparative analysis between our method and several state-of-the-art expression editing methods, including HiSD, GANimation, ExprMAN, and InterfaceGAN. Both HiSD and GANimation have difficulty editing the expressions accurately and generate low-quality images with noticeable artifacts. InterfaceGAN generates fewer artifacts but produces expressions that appear unnatural. In comparison, our method produces high-quality images with minimal artifacts and natural expressions, thereby outperforming the other methods (Fig. 2).

We compared the intended expression categories of the 840 faces in our dataset with the categories labeled by the recruited volunteers; the matching proportions are listed in Table 1. On average, the matching percentage exceeds 70%. Happiness has the highest matching rate (100%), followed by neutral (98%), surprise (83%), sadness (82%), disgust (71%), anger (57%), and fear (51%). Furthermore, a confusion matrix was computed to illustrate the matching rate of each type of facial expression (Fig. 4).

Table 1 The percentage of different matching rates of seven emotions (%).
Fig. 4
figure 4

Confusion matrix of rated facial expressions. Columns represent the facial expressions perceived by raters, while rows represent the intended (generated) expressions.
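Such a confusion matrix can be reproduced from the raw rating records with a few lines of code. The sketch below uses randomly generated stand-in labels purely for illustration; in the actual analysis, `intended` would hold the expression each face was generated with and `perceived` would hold each rater's chosen category.

```python
import numpy as np
import pandas as pd

EMOTIONS = ["neutral", "happiness", "surprise", "sadness", "disgust", "anger", "fear"]

# Stand-in data: random labels with roughly 80% agreement (for illustration only).
rng = np.random.default_rng(0)
intended = rng.choice(EMOTIONS, size=500)
perceived = np.where(rng.random(500) < 0.8, intended, rng.choice(EMOTIONS, size=500))

# Row-normalized confusion matrix: rows = intended expression,
# columns = perceived expression; diagonal entries are the matching rates.
confusion = pd.crosstab(pd.Series(intended, name="intended"),
                        pd.Series(perceived, name="perceived"),
                        normalize="index").reindex(index=EMOTIONS, columns=EMOTIONS, fill_value=0)

print((100 * confusion).round(1))
```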

We compared the accuracy of basic emotion recognition in SZU-EmoDage with that in other Chinese expression databases, including the facial-expression database of Chinese (FEDC)-Han20, FEDC-Hui20, FEDC-Tibetan20, the Tsinghua facial-expression database17, the first version of CAFPS (CAFPS1)16, and the updated version of CAFPS (CAFPS2)33. The accuracy of basic emotion recognition in SZU-EmoDage was similar to that in the other databases for neutral, happy, surprised, disgusted, and sad expressions. The accuracy for disgusted and fearful expressions in the two versions of the Chinese Facial Affective Picture System was below 30%, whereas in SZU-EmoDage it was 51% or higher (see Table 2 and Fig. 5). These results demonstrate the potential of deep learning to generate facial expressions that are recognized reliably and accurately.

Table 2 The accuracy rate of basic emotion recognition in different databases (%).
Fig. 5
figure 5

The accuracy rate of basic emotion recognition in SZU-EmoDage, the Facial-Expression Database of Chinese Han, Hui, and Tibetan people, the Tsinghua facial expression database, and the two versions of the Chinese Facial Affective Picture System.

Table 3 shows the percentage of emotional valence ratings for each emotion. The negative emotions anger, disgust, and sadness were predominantly rated as negative in valence, with percentages ranging from 65.35% to 68.67%. Fear was also rated as negative in valence, but by a lower percentage of raters (37.96%). In contrast, happiness expressions were rated as positive in valence in 98.08% of cases. Neutral and surprise were predominantly rated as neutral in valence (94.33% and 67.31%, respectively).

Table 3 The percentage of the emotional valence rating (%).

We compared arousal and dominance ratings across the different emotions. Happiness was rated as the most arousing emotion, while neutral and disgust were rated as the least arousing. Anger was rated as the most dominant emotion, while neutral faces were rated as the least dominant. To assess the extent to which the emotions were expressed naturally, participants also rated the authenticity of each facial expression. The average authenticity rating for every emotion was above five, indicating that participants perceived the facial expressions as at least somewhat genuine. Happy expressions were rated as the most authentic (Table 4).

Table 4 The degree of arousal, dominance and authenticity of seven emotions.
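The dimension scores in Table 4 are simple aggregates of the second rating group's responses. The sketch below shows this aggregation on randomly generated stand-in data (for illustration only; the real analysis would load the collected rating records), assuming one row per rater-face pair with 9-point arousal, dominance, and authenticity scores.

```python
import numpy as np
import pandas as pd

EMOTIONS = ["neutral", "happiness", "surprise", "sadness", "disgust", "anger", "fear"]

# Stand-in rating records: one row per (rater, face) pair (illustration only).
rng = np.random.default_rng(1)
ratings = pd.DataFrame({
    "emotion": rng.choice(EMOTIONS, size=1000),
    "arousal": rng.integers(1, 10, size=1000),       # 9-point scale
    "dominance": rng.integers(1, 10, size=1000),
    "authenticity": rng.integers(1, 10, size=1000),
})

# Mean and SD per emotion: the aggregation underlying a table such as Table 4.
summary = ratings.groupby("emotion")[["arousal", "dominance", "authenticity"]].agg(["mean", "std"])
print(summary.round(2))
```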

To assess the stability and reliability of the expression ratings, we analyzed the internal consistency of each emotion category for arousal, dominance, and authenticity. All seven emotion categories demonstrated high reliability, with Cronbach’s alpha values all above 0.9 (see Table 5), indicating that the evaluation of the selected faces in the database was highly stable and reliable.

Table 5 The Cronbach alpha internal consistency reliability coefficient of each facial expression in the dimension of arousal, dominance and authenticity.
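For reference, Cronbach’s alpha can be computed as sketched below. The function follows the standard formula; the random stand-in scores, the matrix dimensions, and the treatment of raters as rows and the faces of one emotion category (on one dimension, e.g. arousal) as columns are assumptions for illustration, since the exact analysis layout is not detailed above.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an observations x items score matrix
    (here assumed: raters as rows, the faces of one emotion category
    rated on a single dimension as columns)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_vars = scores.var(axis=0, ddof=1).sum()     # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of total scores
    return k / (k - 1) * (1.0 - item_vars / total_var)

# Stand-in data: 20 raters scoring 120 faces on a 9-point scale (illustration only;
# random scores naturally give a low alpha, unlike the real ratings).
rng = np.random.default_rng(2)
scores = rng.integers(1, 10, size=(20, 120))
print(round(cronbach_alpha(scores), 3))
```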

The current dataset also includes faces aged from 10 to 70 years, in 10-year intervals. The rating results indicate that the proportions of faces rated as falling in the age ranges of 10–20, 30–50, and 60–70 years were 25.2%, 34.1%, and 40.7%, respectively.

Usage Notes

The SZU-EmoDage dataset and the proposed method contribute significantly to face perception research. Deep-learning models serve as powerful tools for balancing experimental control and ecological validity18, ultimately helping to generate naturalistic and standardized datasets. Researchers can leverage our AU-integrated StyleGAN model to generate a large number of faces as required. However, using the method requires some basic technical knowledge, including deep-learning fundamentals and proficiency in Python programming, as well as access to computational resources such as GPUs with large memory to accelerate image generation. Additionally, StyleGAN can be further developed to model new Chinese facial datasets related to social attributes, including facial attractiveness, trustworthiness, and dominance10,11,12. This would allow the investigation of further scientific questions related to social cognition and the development of new face models for improving facial-perception technology. The generated datasets can also serve as stimuli for detecting individual differences in facial expression recognition, particularly those related to emotional disorders, and for investigating cross-cultural disparities in facial perception.