Background & Summary

Recognizing human emotions is crucial to people’s survival and social communication. There are various carriers and channels of emotional expression. Existing work in both psychology1 and computer science2,3 mainly focuses on human faces and voices as well as emotional scenes (contexts, situations, and conditions). Although classical theories of emotion have highlighted the significance of body movements since the 19th century4,5, as de Gelder1 said, “bodily expressions never occupied centre stage in emotion research.” In recent decades, psychologists have taken up this issue and found that body movements can provide recognition accuracy comparable to that of facial expressions, under both static and dynamic conditions6,7. Moreover, body movements may show advantages in emotional integration, where multiple emotional stimuli affect each other. For example, when someone is experiencing the most intense emotion of their life (e.g., winning an Olympic gold medal), others cannot recognize that emotion from the face alone and have to rely on body cues8. Similar complexities arise in various situations because of emotional interactions with voice9 and scene10,11. Therefore, human body indicators are essential for thorough emotion recognition.

In previous studies, researchers have created several emotional body movement sets using motion capture techniques. These stimulus sets record people’s emotional expression while dancing, walking, and performing other actions7,12,13,14,15,16,17,18,19,20,21,22,23,24. However, most of these researchers have published finished products (e.g., point-light displays, video clips) without the raw kinematic data, and those sets contain fewer recordings than comparable sets of facial indicators25. Some sets consist only of arm movements and do not cover the whole body26. Moreover, some studies use only one scenario for a particular emotion (e.g., fear with a pursuing attacker)6,27, or the same situational guidance across all emotions (e.g., walking)14,28. In daily life, people do not always express their emotions in a single situation or through a single physical posture. Kinematic data include rich information, such as the position and rotation of body segments, joint angles, and spatio-temporal gait parameters. These limitations have kept researchers from studying how emotion is encoded in body movements and from relating quantitative measures of each joint or body segment to emotion dimensions (e.g., pleasant-unpleasant, i.e., valence; deactivated-activated, i.e., arousal)29. Therefore, public human body kinematic datasets expressing various emotions are lacking. To address this gap, we aimed to produce a comprehensive dataset that supports the recognition of emotional cues from all parts of the body in richer, more ecologically valid daily situations.

Taken together, we report a human body kinematic dataset consisting of 1402 trials in which six basic emotions (happiness, sadness, anger, fear, disgust, surprise) and a neutral state were expressed. Twenty-two semi-professional actors (half male) participated in this study. A low-cost, validated inertial motion capture system with 17 sensors was used30,31,32. The actors performed according to standardized instructions and carefully screened daily-event scenarios. The resulting kinematic dataset contains the position and rotation data of 72 anatomical nodes, stored in the BioVision Hierarchy (BVH) format. This work expands the scope of emotion recognition and can help us better understand human emotions conveyed via body movements. To the best of our knowledge, this is at present the largest emotional kinematic dataset based on the whole human body. These data are expected to be used repeatedly and to foster progress in several fields, including social neuroscience, psychiatry, human-computer interaction, computer vision, and biometric and information forensics.

Methods

Preparation phase

Equipment and environment

The kinematic data were collected using a wireless motion capture system (Noitom Perception Neuron, Noitom Technology Ltd., Beijing, China) with 17 wearable sensors. This apparatus was connected to the Axis Neuron software (version 3.8.42.8591, Noitom Technology Ltd., Beijing, China) on a laptop computer (Terrans Force T5 SKYLAKE 970 M 67SH1, Windows 10 operating system, Intel Core i7 6700HQ processor). The sensors, sampling at 125 Hz, were attached to the head, spine, and hips and bilaterally to the shoulders, upper arms, lower arms, hands, upper legs, lower legs, and feet (see Fig. 1a). The tasks were conducted in a quiet laboratory (see Fig. 1b). The actors performed each trial on a 1 m × 1 m square stage located 0.5 m from a wall. This limited performance space also controlled the horizontal distance between the actor and the object of emotion, so that the horizontal displacement did not differ greatly across recordings.

Fig. 1 Sensor locations and laboratory environment. (a) Seventeen sensors were attached to the head, spine, and hips and bilaterally to the upper arms, lower arms, hands, shoulders, upper legs, lower legs, and feet. (b) The actors were required to perform on a 1 m × 1 m stage.

Scenarios

To guide the actors to perform in a typical and natural manner, we created 70 daily-event scenarios (10 for each emotion and neutral; see Supplementary File 1) according to the basic concepts of emotions33,34,35 and previous research6,7,20,21,22,26,27,36. To test the validity of the scenarios, 70 college students (mean age = 23.10 years, SD = 1.64, 42 females) were asked to classify the emotion expressed in each scenario, presented in random order, using a seven-alternative forced-choice questionnaire (see Supplementary File 2). The five most recognizable scenarios for each emotion were retained (see Online-only Table 1). Note that, for the emotional but not the neutral recordings, free (non-scripted) performances were added, in which the performance was spontaneous and the actors were free to interpret and express the emotions as they saw fit, without being restricted by the scenarios. Thus, 35 scenarios were used in the recording phase.

Actors

Another group of 24 college students (mean age = 20.75 years, SD = 1.92; mean height = 1.69 m, SD = 0.07; mean weight = 58.52 kg, SD = 12.42; mean BMI = 20.25, SD = 3.30) was recruited from the drama and dance clubs of Dalian University of Technology to perform as actors for this study (see Table 1). Actors F04 and F13 were excluded because they dropped out. All actors were physically and mentally healthy and right-handed. Each actor gave written informed consent before performing and was told that their motion data would be used only for scientific research. The study was approved by the Human Research Institutional Review Board of Liaoning Normal University in accordance with the Declaration of Helsinki (1991). After the recording phase, the actors were compensated appropriately.

Table 1 Demographic information of the actors.

Recording phase

The actors wore black tights, and the 17 Neuron sensors were attached to the corresponding body locations. A four-step calibration procedure using four successive static poses was performed in the Axis Neuron software before the performances and whenever necessary (e.g., after a poor Wi-Fi signal or a rest) (see Fig. 2; for details, see https://neuronmocap.com/content/axis-neuron). To prevent inconsistencies in interpretation by different actors, standardized instructions were given before each recording (see Supplementary File 3). The actors started in a neutral stance (i.e., facing forward with arms naturally at the sides). For each emotion, we successively asked the actors to give a six-second free performance based on their own understanding and six-second scenario performances. The order of emotions was random for each actor, as was the order of scenarios within each emotion. When the actors were ready and we said “start”, the Axis Neuron software began recording the motion data. After each performance, we reviewed it and evaluated the signal quality; hence, some performances needed to be repeated several times. The recording phase took approximately two hours, during which the actors could rest whenever they felt tired.

Fig. 2 Four-step calibration procedure. (a) Steady pose. (b) A-pose. (c) T-pose. (d) S-pose. The figures are reproduced with permission from https://neuronmocap.com/content/axis-neuron.

Data Records

A total of 1406 trials were collected. The commercial Axis Neuron software stores data in a RAW file format. We exported the RAW trial files to BVH files, as BVH is a standard file format that can be analyzed with various software packages (e.g., 3ds Max, https://www.autodesk.com/products/3ds-max; MotionBuilder, https://www.autodesk.com/products/motionbuilder). Four raw files were corrupted during this process (i.e., F09D0V1, M04F4V2, M06N1V1, and M06SU1V1). Therefore, the resulting human body kinematic dataset, available from https://doi.org/10.13026/kg8b-1t4937, consists of 1402 trials (mean duration = 7.22 s, SD = 1.57 s) expressing six emotions and neutral.

Each actor has a folder, named with the actor ID (see Table 1), containing the BVH files for all emotions. Each trial is named systematically as “<actor_ID><emotion><scenario_ID><version>” (for details, see file_info.csv in https://doi.org/10.13026/kg8b-1t4937), where “actor_ID” is the actor ID; “emotion” is happiness (H), sadness (SA), neutral (N), anger (A), disgust (D), fear (F), or surprise (SU); “scenario_ID” is 0 for the free performance or the corresponding numeral from 1 to 5 for the scenario performances; and “version” denotes the repetition number.
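
Because the file names encode all trial metadata, they can be parsed programmatically. The following is a minimal Python sketch; the pattern is only inferred from the naming convention described above and the example names given in this paper, and has not been validated against every file in the dataset.

```python
import re

# Pattern inferred from the naming convention <actor_ID><emotion><scenario_ID><version>,
# e.g., "F07SA0V1" is actor F07, sadness (SA), free performance (0), first version (V1).
TRIAL_PATTERN = re.compile(
    r"^(?P<actor>[FM]\d{2})"           # actor ID, e.g., F07 or M04
    r"(?P<emotion>SA|SU|H|N|A|D|F)"    # emotion code (two-letter codes listed first)
    r"(?P<scenario>\d)"                # 0 = free performance, 1-5 = scenario performances
    r"V(?P<version>\d+)$"              # repetition number
)

def parse_trial_name(stem):
    """Split a BVH file stem such as 'F07SA0V1' into its four fields."""
    match = TRIAL_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected trial name: {stem!r}")
    return match.groupdict()

print(parse_trial_name("F07SA0V1"))
# {'actor': 'F07', 'emotion': 'SA', 'scenario': '0', 'version': '1'}
```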

A BVH file is ASCII text with two sections (i.e., HIERARCHY and MOTION). The HIERARCHY section, beginning with the keyword HIERARCHY, defines the joint tree, the name of each node, the number of channels, and the relative position of each joint with respect to its parent (i.e., the bone length of each part of the human body). This section contains 72 nodes in total (i.e., 1 Root, 58 Joints, and 13 End Sites; see Fig. 3), which are calculated by the commercial Axis Neuron software from the 17 sensors (see Fig. 1a). The MOTION section records the motion data: for each frame, the position and rotation of every joint node are given in the order defined by the joint hierarchy. The main keywords in a BVH file are listed below (a minimal parsing sketch follows the list):

  • HIERARCHY: beginning of the header section

  • ROOT: the root node of the skeleton, located at the Hips (see Fig. 3)

  • JOINT: a skeletal joint, defined relative to its parent joint (see Fig. 3)

  • CHANNELS: the number and type of channels (position and rotation) recorded for the node

  • OFFSET: X, Y, and Z offsets of the segment relative to its parent joint

  • End Site: the end of a joint chain, i.e., a node with no child joint (see Fig. 3)

  • MOTION: beginning of the second section

  • FRAMES: the number of frames

  • Frame Time: the duration of each frame in seconds
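
As an illustration of this structure, the minimal Python sketch below reads a BVH file, counts the declared nodes, and extracts the frame count and frame time; the file name in the comment is hypothetical, and the sketch relies only on the keywords listed above rather than on a full BVH parser.

```python
# Minimal sketch for inspecting the two sections of a BVH file. It only counts
# the ROOT/JOINT/End Site nodes declared in HIERARCHY and reads the frame
# count and frame time from MOTION; it is not a complete BVH parser.
def summarize_bvh(path):
    nodes = {"ROOT": 0, "JOINT": 0, "End Site": 0}
    nframes, frame_time = None, None
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if line.startswith("ROOT "):
                nodes["ROOT"] += 1
            elif line.startswith("JOINT "):
                nodes["JOINT"] += 1
            elif line.startswith("End Site"):
                nodes["End Site"] += 1
            elif line.lower().startswith("frames:"):
                nframes = int(line.split(":", 1)[1])
            elif line.lower().startswith("frame time:"):
                frame_time = float(line.split(":", 1)[1])
    duration = nframes * frame_time if nframes and frame_time else None
    return nodes, nframes, frame_time, duration

# Example with a hypothetical trial file name:
# nodes, nframes, frame_time, duration = summarize_bvh("F01H1V1.bvh")
# sum(nodes.values()) should be 72 (1 Root + 58 Joints + 13 End Sites), and
# duration = nframes * frame_time gives the trial length in seconds.
```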

Fig. 3 BVH skeletal structure. The approximate anatomical positions and corresponding descriptions of the 72 nodes are given. Details of the hands are shown in the rectangles.

Technical Validation

Scenario

To improve the realism and recognizability of the actors’ performances, we conducted a pilot study (see Methods) and, for each emotion, selected the five daily-event scenarios with the highest recognition accuracy (mean ± SD, happiness: 0.994 ± 0.013; anger: 0.917 ± 0.040; sadness: 0.957 ± 0.026; fear: 0.971 ± 0.023; disgust: 0.906 ± 0.041; surprise: 0.871 ± 0.030; neutral: 0.920 ± 0.042).

Calibration

As described in the recording phase (see Methods), the motion capture system was calibrated with the four-step calibration procedure before the performances and as needed. We also reviewed and visually checked the quality of the motion signal and the naturalness of the performances trial by trial. When all sensors are well calibrated, the pose of the model in the recording software matches the actor’s initial stance (i.e., facing forward with arms naturally at the sides; see Fig. 4a), and the spatial position of the mass center is relatively stable across recordings. Otherwise, the model is deformed, and the initial spatial position of the mass center is inconsistent across performances. Therefore, the spatial position of the mass center in the first frame of each recording reflects the calibration quality. To evaluate it, we used the Axis Neuron software to extract the X, Y, and Z positions of the mass center (see Fig. 4a) in the first frame of each recording (see https://physionet.org/content/kinematic-actors-emotions/2.1.0/). The distributions along these three dimensions were relatively concentrated, showing that the actors’ initial states were consistent (see Fig. 4b–e) and suggesting good calibration quality in our study.
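
For readers who wish to reproduce this check, the sketch below summarizes the per-axis spread of the first-frame mass-center positions. The file name and column names are assumptions for illustration only and should be adjusted to match the table distributed with the dataset on PhysioNet.

```python
import pandas as pd

# Hypothetical file and column names; adjust them to the first-frame
# mass-center table provided with the dataset on PhysioNet.
centers = pd.read_csv("first_frame_mass_center.csv")

# Per-axis mean, standard deviation, and range: small standard deviations
# indicate that the initial positions are concentrated, as in Fig. 4b-e.
print(centers[["X", "Y", "Z"]].describe())
```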

Fig. 4 The location of the mass center and the three-dimensional distribution of the first-frame mass center. (a) The large transparent red ball represents the mass center of the model in the recording software (this figure is reproduced with permission from https://neuronmocap.com/content/axis-neuron). (b) 3D scatter plot of the first-frame mass center. (c) Distribution of the X position. (d) Distribution of the Y position. (e) Distribution of the Z position. After the calibration procedure, the initial spatial positions of the actors across all performances are relatively concentrated and consistent, reflecting the good calibration quality of the present study.

Usage Notes

BVH files can be imported directly into 3ds Max (https://www.autodesk.com/products/3ds-max), MotionBuilder (https://www.autodesk.com/products/motionbuilder), and other 3D applications. Therefore, these data can be used to animate different avatars in virtual reality and augmented reality products. Previous studies on emotion recognition in computer and information science have mainly focused on human faces and voices; hence, the dataset created in this study can improve current technologies and contribute to scientific advancements. In psychiatry and psychology, researchers can also create experimental stimuli based on the present dataset, such as emotional point-light displays that contain biological motion information38. Such materials have been applied in research on social cognitive impairment and can contribute to the clinical diagnosis of autism spectrum disorder, schizophrenia, and other psychiatric conditions39,40.
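
Beyond 3D authoring tools, the files can also be accessed programmatically. The sketch below shows one possible approach, assuming the third-party Python package bvh (installable with pip install bvh) and a hypothetical trial file name; neither is part of the dataset, and building full point-light displays would additionally require forward kinematics through the joint hierarchy.

```python
from bvh import Bvh  # third-party package, assumed installed via: pip install bvh

# Hypothetical trial file name; replace with any BVH file from the dataset.
with open("F01H1V1.bvh") as f:
    mocap = Bvh(f.read())

print(mocap.nframes, mocap.frame_time)   # frame count and seconds per frame
print(mocap.get_joints_names()[:5])      # first few joint names in the hierarchy

# Per-frame position channels of the root (Hips), e.g., for plotting the
# actor's trajectory or as a starting point for stimulus construction.
hips_xyz = [
    mocap.frame_joint_channels(i, "Hips", ["Xposition", "Yposition", "Zposition"])
    for i in range(mocap.nframes)
]
```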

Among the trials produced, some special cases should be noted. Although we asked the actors to complete each performance within six seconds, some trials are in fact much shorter or longer because of differences in human time perception or operator error (e.g., 21.84 s for F07SA0V1, 2.688 s for M01D1V2). Future studies should aim for greater consistency by avoiding such discrepancies where possible.
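
When selecting stimuli, such trials can be screened by computing each trial’s duration as the frame count multiplied by the frame time. The sketch below walks a local copy of the dataset and flags trials outside an example window; the folder name and the 4-10 s window are assumptions, not recommendations from the dataset itself.

```python
import os

def bvh_duration(path):
    """Duration in seconds: Frames x Frame Time from the MOTION section."""
    nframes, frame_time = None, None
    with open(path) as f:
        for raw in f:
            line = raw.strip().lower()
            if line.startswith("frames:"):
                nframes = int(line.split(":", 1)[1])
            elif line.startswith("frame time:"):
                frame_time = float(line.split(":", 1)[1])
                break  # the MOTION header ends here
    return nframes * frame_time

# "kinematic-actors-emotions" is an assumed name for a local copy of the dataset.
for root, _, files in os.walk("kinematic-actors-emotions"):
    for name in sorted(files):
        if name.endswith(".bvh"):
            duration = bvh_duration(os.path.join(root, name))
            if not 4.0 <= duration <= 10.0:   # example screening window
                print(f"{name}: {duration:.2f} s")
```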

Although creating more daily scenarios and recruiting more actors yielded richer data that reflect the variability of body movements and give researchers more options when selecting experimental materials, the differences among the scenarios within the same emotion may introduce some “noise”. For example, if the object of anger is a nearby dog, the expression may be directed at a lower vertical level than if the object is a car speeding away. We will further examine the specific relationship between scenario-induced movement and subjective emotional experience (e.g., emotional intensity, valence, and arousal).

Because all the actors were told before performing that their motion data would be used only for scientific research, we followed the recommendation of PhysioNet and chose a suitable license as the data use agreement (Restricted Health Data License 1.5.0, https://physionet.org/content/kinematic-actors-emotions/view-license/2.1.0/). Therefore, users need to sign this agreement online before downloading and using the present dataset.