K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations

Recognizing emotions during social interactions has many potential applications with the popularization of low-cost mobile sensors, but a challenge remains with the lack of naturalistic affective interaction data. Most existing emotion datasets do not support studying idiosyncratic emotions arising in the wild as they were collected in constrained environments. Therefore, studying emotions in the context of social interactions requires a novel dataset, and K-EmoCon is such a multimodal dataset with comprehensive annotations of continuous emotions during naturalistic conversations. The dataset contains multimodal measurements, including audiovisual recordings, EEG, and peripheral physiological signals, acquired with off-the-shelf devices from 16 sessions of approximately 10-minute long paired debates on a social issue. Distinct from previous datasets, it includes emotion annotations from all three available perspectives: self, debate partner, and external observers. Raters annotated emotional displays at intervals of every 5 seconds while viewing the debate footage, in terms of arousal-valence and 18 additional categorical emotions. The resulting K-EmoCon is the first publicly available emotion dataset accommodating the multiperspective assessment of emotions during social interactions.

recognition of human drivers' emotions by autonomous vehicles would lead to more safety as autonomous vehicles can better judge human drivers' intentions 12 .
Now for machines to become emotionally intelligent, they must first learn to recognize emotions, and the prerequisite to learning is data. However, there lie several challenges in the acquisition of emotion data. While emotions are prevalent, their accurate measurement is difficult. Most commonly, emotions are viewed as psychological states expressed through faces, with distinct categories 13 , but research evidence claims the contrary. Rather than distinct, facial expressions are compound 14 , relative 15 , and misleading 16 . A recent review of scientific evidence also presses against the common view, suggesting that facial expressions lack reliability, specificity, and generalizability 17 , together with past studies on contextual dependency [18][19][20] and individual variability of emotions 21,22 .
Such inherent elusiveness of emotion renders many existing emotion datasets inapplicable for studying emotions in the wild. The majority of emotion datasets consist of emotions induced with selected stimuli in a static environment, i.e., a laboratory [23][24][25][26][27][28][29] . This method provides experimenters with full-control over data collection, allowing assessment of specific emotional behaviors 30,31 and acquiring fine-grained data with advanced techniques like neuroimaging. Nevertheless, lab-generated data may generalize poorly to realistic scenarios as they frequently contain intense expressions of prototypical emotions, which are rarely observed in the real world 32,33 , acquired from only a subset of the population 34 .
An alternative approach utilizes media contents [35][36][37][38] and crowdsourcing 39 , compensating for the shortcomings of the conventional method. The abundance of contents available online, such as TV-shows and movies, allows researchers to glean rich emotion data representative of various contexts efficiently. Crowdsourcing further supports inexpensive data annotation while serving as another data source 40,41 . Datasets of this type have advantages in sample size and the diversity of subjects, but generalizability remains an issue. Datasets based on media contents often contain emotional displays produced by trained actors supposing fictitious situations. To what extent such emotional portrayals resemble spontaneous emotional expressions is debatable [42][43][44] . They also provide no access to physiological signals, which are known to carry information vital for detecting less visible changes in emotional states [45][46][47][48][49][50] .
To amend this lack of a dataset for recognition of emotions in their natural forms, we introduce K-EmoCon, a multimodal dataset acquired from 32 subjects participating in 16 paired debates on a social issue. It consists of physiological sensor data collected with three off-the-shelf wearable devices, audiovisual footage of participants during the debate, and continuous emotion annotations. It contributes to the current literature on emotion recognition as, to our knowledge, it is the first dataset with emotion annotations from all three available perspectives: the subject him/herself, the debate partner, and external observers.

Methods
Dataset design. Intended usage. Inspired by previous works that set out to investigate emotions during conversations 38,51-53 , K-EmoCon was designed in consideration of a social interaction scenario involving two people and wearable devices capable of unobtrusive tracking of physiological signals. The dataset aims to allow a multi-perspective analysis of emotions with the following objectives: 1. Extend the research on how having multiple perspectives on emotional expressions may improve their automatic recognition. 2. Provide a novel opportunity to investigate how emotions can be perceived differently from multiple perspectives, especially in the context of social interaction.
Previous research has shown that having multiple sources for emotion annotations can increase their recognition accuracy 54,55 . However, no research that we are aware of employs all three available perspectives in the annotation of emotions (i.e., subject him/herself, interacting partner, and external observers). Having multiple perspectives relates to the issue of establishing ground truth in emotion annotations. Emotions are inherently internal phenomena, and their mechanism is unavailable for external scrutiny, even for the person who is experiencing them. As a result, there may not be a ground truth for emotions. Should we consider as the ground truth what is most agreed upon by external observers of emotions, or what the person who experiences the emotions reports to have felt 56 ? The two views are likely to match if emotions are intense and pure, but as discussed, such emotions are rare. Instead, self-reported and observed emotions are likely to disagree for a variety of reasons. People often conceal their true emotions; sometimes, people are not fully mindful of their internal states; and some have difficulties interpreting or articulating emotions 57,58 .
With K-EmoCon, we intend to enable the comprehensive examination of such cases where perceptions of emotions do not match, by bringing all three available perspectives into the annotation of emotions, in the context of a social interaction involving three parties of: 1. The subject -is the source who experiences emotions firsthand and produces self annotations, particularly the "felt sense" 55 of the emotions. 2. The partner -is the person who interacts with the subject, experiencing the subject's emotions secondhand; thus, he or she has a contextual knowledge of the interaction that induced the subject's emotions and produces partner annotations based on that. 3. The external observers -are people who observe the subject's emotions without the exact contextual knowledge of the interaction that induced the emotions, producing external observer annotations.
Note that while our definition of perspectives involved in emotion annotation is similar to definitions previously used by other researchers (self-reported vs. perceived 55 /observed 59 ), we further segment observer annotations based on whether the contextual information of the situation in which the emotion was generated is available to an observer, as we wish to consider the role of contextual knowledge in emotion perception and recognition.
Existing datasets of emotions in conversations provide a limited scope on this issue as they at most contain emotion annotations from subjects and external observers 51 , leaving out annotations from other people who engaged in the conversation (whom we call partners). Or, they either only consider a particular type of annotations that is sufficient to serve their research goal 53 or their designs do not allow acquiring multi-perspective annotations 38,52 (e.g., a dataset is constructed upon conversations from a TV-show, only allowing the collection of external observer annotations). Refer to Table 1 to see how K-EmoCon is distinguished from existing emotion datasets.
Context of data collection. In this regard, we chose a semi-structured, turn-taking debate on a social issue with randomly assigned partners as the setting for data collection. This setting is appropriate for collecting emotions that may naturally arise in a day, as it is similar to a social interaction that one could engage in at a workplace.
Also, the setting is particularly suitable for studying the misperception of emotions. It is sufficiently formal and spontaneous as it involves randomly assigned partners. We expect such formality and spontaneity of the setting compelled participants to regulate their emotions in a socially appropriate manner, allowing us to observe less pronounced emotions from participants, which were more likely to be misperceived by their partners 60 .
Data collection apparatus. Our choice of mobile, wearable, and low-cost devices to collect affective physiological signals together with audiovisual recordings, while primarily aimed at making findings based on our data more reproducible and expandable, was also made in consideration of our goal of investigating mismatches in perceptions of emotions in the wild. Research has shown that fusing implicit and explicit affective information can result in more accurate recognition of subtle emotional expressions from professional actors 61 . However, no work we are aware of has shown that a similar result can be achieved for subtle emotions collected from in-the-wild social interactions of individuals without professional training in acting. Therefore, our dataset provides an opportunity to examine if emotions of lower intensity, produced by non-actors during communication, can be recognized accurately.
It is also interesting to examine whether subtle emotions could signal instances where emotions are misperceived during communication if their accurate detection is possible. In the same vein, to what extent the intensity of emotions influences their decoding accuracy during a social interaction, where a broader array of contextual information is present, is also worth exploring. K-EmoCon could enable an in-depth investigation of such issues.
Further, we considered the use case of mobile and wearable technologies for facilitating emotional communication. Researchers are actively exploring the potential for using expressive biosignals collected via wearables to communicate one's emotional and psychological states with others [62][63][64][65][66] . Our dataset can contribute to the research of biosignal-based assistive technologies to enable affective communication by providing insights into which moments are apposite for communicating emotions.
Ethics statement. The construction of the K-EmoCon dataset was approved by the Korea Advanced Institute of Science and Technology (KAIST) Institutional Review Board. KAIST IRB also reviewed and approved the consent form, which contained information on the following: the purpose of data collection, data collection procedure, types of data to be collected from participants, compensation to be provided for participation, and the protocol for the protection of privacy-sensitive data.
Participants were given the same consent forms upon arriving at the data collection site and were asked to provide written consent after fully reading the form indicating that they are willing to participate in data collection. Since K-EmoCon is to be open to public access, a separate consent was obtained for the disclosure of the data that contains personally identifiable information (PII), which is the audiovisual footage of participants during debates, including their faces and voices. Participants were also notified that their participation is voluntary, and they can terminate the data collection at any point. The resulting K-EmoCon dataset includes the audiovisual recordings of 21 participants, out of 32, who agreed to disclose their personal information, excluding the 11 who did not agree.
Participant recruitment and preparation. 32 participants were recruited between January and March of 2019. An announcement calling for participation in an experiment on "emotion-sensing during a debate" was posted on an online bulletin board of a KAIST student community. The post stated that participants would have a debate on the issue of accepting Yemeni refugees on Jeju Island of South Korea for 10 minutes. It also stated that the debate must be in English, and participants should be capable of speaking competently in English, but not necessarily at the level of a native speaker. Specifically, participants were required to have at least three years of experience living in an English-speaking country, or have achieved a score above criteria in any one of standardized English speaking tests listed here: TOEIC speaking level 7, TOEFL speaking score 27, or IELTS speaking level 7.
Once participants were assigned a date and time to participate in data collection, they were provided four news articles on the topic of the Jeju Yemeni refugee crisis via email. The email included two articles with neutral views on the issue 67,68 , one in favor of refugees 69 , and one in opposition to refugees 70 . We instructed the participants to read the articles beforehand to familiarize themselves with the debate topic.
All selected participants were students at KAIST, but their ages varied from 19 to 36 years old (mean = 23.8 years, stdev. = 3.3 years), as well as their gender and nationality. We randomly paired participants into 16 dyads based on their available times. See Table 2 for the breakdown of participants' gender, nationality, and age.
Data collection setup. All data collection sessions were conducted in two rooms with controlled temperature and illumination. Two participants sat across a table facing each other with a distance in between for a
Table 1 (notes): Similarly, induced emotions are when a set of selected stimuli is used for their elicitation. For annotation types, S = self annotations, P = partner annotations, and E = external observer annotations. † A dataset was considered to contain induced emotions if scripted interaction was involved in the data collection, even though no artificial stimuli (such as an emotion-inducing video clip) were used. ‡ Predefined emotion categories of stimuli and success rates of participants in a set of purposefully selected cognitive tasks were used as ground-truth labels.
www.nature.com/scientificdata
During a debate, participants wore a suite of wearable sensors, as shown in Fig. 2, which includes:
1. Empatica E4 Wristband - captured photoplethysmography (PPG), 3-axis acceleration, body temperature, and electrodermal activity (EDA). Heart rate and the inter-beat interval (IBI) were derived from Blood Volume Pulse (BVP) measured by a PPG sensor.
2. Polar H7 Bluetooth Heart Rate Sensor - detected heart rates using an electrocardiogram (ECG) sensor and was used to complement a PPG sensor in E4, which is susceptible to motion.
3. NeuroSky MindWave Headset - collected electroencephalogram (EEG) signals via two dry sensor electrodes, one on the forehead (Fp1 channel, 10/20 system, at the frontal lobe) and one on the left earlobe (reference).
4. LookNTell Head-Mounted Camera - with a camera attached at one end of a plastic circlet, was worn on participants' heads to capture videos from a first-person POV.
All listed devices can operate in a mobile setting. Empatica E4 keeps the data on the device, and the collected data is later uploaded to a computer. Polar H7 sensor and MindWave headset can communicate with a mobile phone via Bluetooth Low Energy (BLE) to store data. Table 3 summarizes sampling rates and signal ranges of data collected from each device.
Data collection procedure. Administration. All data collection sessions were conducted in four stages of (1) onboarding, (2) baseline measurement, (3) debate, and (4) emotion annotation. Two experimenters administered each session (see Table 4 for the overview of a data collection procedure). One experimenter served as a moderator during debates, notifying participants of the remaining time and intervening under any necessary circumstances, such as when a debate gets too heated, or a participant exceeds an allotted time of 2 minutes in his or her turn.
Onboarding. Upon their arrival, participants were each provided a consent form asking for two written consents, first for the participation in data collection that was mandatory, and second for the disclosure of privacy-sensitive data collected during the session, which participants could opt out of without any disadvantage.
Once they agreed to participate in the research, participants decided whether they would argue for or against admitting the Yemeni refugees in Jeju. Participants could either briefly discuss to settle on their preferred positions or toss a coin to decide at random. The same procedure was followed for deciding who goes first in the debate.
Next, participants were given up to 15 minutes to prepare their arguments. Each participant was given a pen, paper, and prints of the articles that they previously received via email. After they finished preparing, experimenters equipped participants with wearable devices. Participants wore E4 wristbands on their non-dominant hand, as arm movements may impede an accurate measurement of PPG. Experimenters ensured that wristbands were tightly fastened and that electrodes were in good contact with participants' skin. Experimenters also ensured that the EEG headsets and head-mounted cameras were well fitted on participants' heads, and manually adjusted the lenses of head-mounted cameras to make sure the captured views were similar to participants' subjective views. Participants wore Polar H7 sensors attached to flexible bands underneath their clothes, so that the electrodes were in contact with their skin, and placed the sensors above their solar plexus.
Baseline measurement. With all devices equipped, sensor measurements were taken from participants while they watched a short clip. This step was to establish a baseline that constitutes a neutral state for each participant. Establishing a neutral baseline is commonly used in the construction of emotion datasets to account for individual biases and reduce the effect of previous emotional states, especially when repeated measurements are taken.
A procedure for a baseline measurement varies across researchers and is often dependent on the purpose of an experiment 71 . In stimuli-based experiments, researchers take measurements as their subjects watch a stimulus intended to induce a neutral emotional state 23,24 or measure resting-state activities between stimuli if they are taking multiple consecutive measurements 25 . Similarly, for K-EmoCon, participants watched the Color Bars clip, which was previously reported in the work of Gross et al. to induce a neutral emotion 72 . Experimenters also ensured that no devices were malfunctioning during the baseline measurement.

Debate.
A debate began at the sign of the moderator and lasted approximately 10 minutes. Participants' facial expressions, movements in their upper body, and speeches were recorded throughout a debate. Participants were allowed to speak consecutively up to two minutes during their turns, with turns alternating between two participants. However, participants were also notified that they could intervene during an opponent's turn, to allow a more natural communication. The moderator notified participants 30 and 60 seconds before the end of their turns and intervened if they exceeded two minutes. A debate stopped at the ten-minute mark with some flexibility to allow the last speaker to finish his or her argument.

Table 4. Overview of the data collection procedure.
| Step | Allocated time | Description |
| Read and sign consent forms | 10 min | Experimenters provided consent forms to participants, and two written consents each for participation and the collection of privacy-sensitive data were obtained. |
| Choose sides and the order | 5 min | Participants were assigned to either argue in favor of or against accepting refugees and decided on the first speaker. |
| Prepare debate | 15 min | Participants were provided with supplementary materials to prepare their arguments. |
| Equip sensors | 10 min | Experimenters explained wearable devices to participants and assisted them in wearing devices. |
| Measure baseline | 2 min | A baseline corresponding to a neutral state was measured for each participant. |
| Overview debate | 5 min | The moderator explained the debate rules and notified participants that they are allowed to intervene. |
| Debate | 10 min | Participants could speak for two consecutive minutes during their turns and they were notified twice at 30 and 60 seconds before the end of the debate. |
| Annotate emotions | 60 min | Participants annotated emotions at intervals of every 5 seconds, watching footage of themselves and their partners. |
Emotion annotation. Participants took a 15-minute break upon finishing a debate. Participants then were each assigned to a PC and annotated their own emotions and their partner's emotions during the debate. Specifically, each participant watched one audiovisual recording of him/herself and another recording of his/her partner (both recordings from 2nd-person POV, including facial expressions, upper body movements, and speeches), to annotate emotions at intervals of every 5 seconds from the beginning to the end of a debate. We chose 5 seconds based on the report of Busso et al. that the average duration of the speaker turns in IEMOCAP was about 4.5 s 51 , and findings from linguistics research also support this number [73][74][75] .
This annotation method we employed, a retrospective affect judgment protocol, is widely used in affective computing to collect self-reports of emotions, especially in studies where an uninterrupted engagement of subjects during an emotion induction process is essential [76][77][78][79] . Likewise, we opted for this method as participants' natural interaction was necessary for acquiring quality emotion data.
Note that we did not provide 1st-person POV recordings captured from head-mounted cameras to participants, and they only had 2nd-person POV recordings to annotate felt emotions. One may have a reasonable concern regarding this choice, that participants watching their faces likely caused them to occupy a perspective similar to an observer. Hence, this might have resulted in an unnatural measurement of felt emotions. Indeed, the headcam footage could have been a more naturalistic instrument, as we intuitively take an embodied perspective to recall how we felt at a specific moment in the past.
However, we found the extent of information captured by the headcam footage insufficient for accurate annotation of felt emotions. Experimenters manually adjusted headcam lenses, so the recordings resembled participants' subjective views, but the headcam footage was missing fine-grained information such as participants' gazes. Also, past research on memories for emotions has shown that they are prone to biases and distortion [80][81][82] . In that regard, it seemed headcam videos, which contain limited information compared to frontal face recordings, would only result in an incorrect annotation of felt emotions, especially in retrospect. Further, we noted that it is not uncommon for people to infer emotions from their faces, as they frequently do when looking in a mirror or taking a selfie.
As a result, participants were given 2nd-person recordings of themselves for the retrospective annotation of felt emotions. In total, participants annotated emotions with 20 unique categories, as shown in Table 5. Experimenters assisted participants throughout the annotation procedure. Before participants began annotating, experimenters explained individual emotion categories to participants, so they correctly understood the meaning of each item and the specific annotation procedure for it. Experimenters also explicitly instructed participants to report felt emotions, not perceived emotions on their faces. Lastly, experimenters ensured that the start time and end time for two participants matched to obtain synchronized annotations.
External emotion annotation. Additionally, we recruited five external raters to annotate participants' emotions during debates (see Table 6). We applied the same criteria we used for recruiting participants in data collection to recruit the raters. The raters were provided the 2nd-person POV recordings of participants during debates and annotated emotions following the same procedure our participants followed. External raters performed their tasks independently, and the experimenters communicated remotely with the raters. Once a rater finished annotating, an experimenter checked completed annotations for incorrect entries and requested a rater to review annotations if there were any missing values or misplaced entries.

Data Records
Dataset summary. The resulting K-EmoCon dataset contains multimodal data from 16 paired-debates on a social issue, which sum to 172.92 minutes of dyadic interaction. It includes physiological signals measured with three wearable devices, audiovisual recordings of debates, and continuous annotations of emotions from three distinct perspectives of the subject, the partner, and the external observers.
Preprocessing. For the time-wise synchronization across data, we converted all timestamps from Korea Standard Time (UTC +9) to UTC +0 and clipped raw data such that only parts of data corresponding to debates and baseline measurements are included. For debate audios and the footage, subclips corresponding to debates were extracted from the raw footage. Audio tracks containing participants' speeches were copied and saved separately as WAV files. Physiological signals were clipped from the respective beginnings of data collection sessions to the respective ends of debates, as the initial 1.5 to 2 minutes immediately after a session begins corresponds to a baseline measurement for a neutral state. Parts in between baseline measurements and debates correspond to debate preparations, which may be excluded from the analysis. Note that we do not provide unedited audio/video recordings and raw log-level data, nor code for preprocessing this data, as they contain privacy-sensitive information outside the boundary of information we have been permitted to share. See Code Availability section for further detail. 83
4. neuro_polar_zeros.csv - same as above. Note that zero values for NeuroSky data (Attention, BrainWave, Meditation) indicate the inability of a device at a given moment to obtain a sufficiently reliable measurement due to various reasons.
5. e4_outliers.csv - contains the number of outliers in each file. Chauvenet's criterion was used for outlier detection (refer to Code Availability section for its implementation in Python).
6. e4_completeness.csv - contains the completeness of each file as a ratio in the range of [0.0, 1.0]. 1.0 indicates a file without any missing value or an outlier. The completeness ratio was calculated as completeness = (total number of values − (number of outliers + number of zeros)) / total number of values.
7. neuro_polar_completeness.csv - same as above, with completeness calculated as completeness = (total number of values − number of zeros) / total number of values.
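The outlier count and completeness ratio described above can be sketched as follows. This is a minimal illustration, not the authors' implementation (for that, refer to the Code Availability section). It assumes the common form of Chauvenet's criterion, in which a value is rejected when the expected number of samples at least as extreme, under a normal distribution fitted to the data, falls below 0.5. The function names and the use of the sample standard deviation are assumptions of this sketch.

```python
import math
from statistics import mean, stdev


def chauvenet_outliers(values):
    """Return one boolean per value; True marks an outlier (assumed rule)."""
    n = len(values)
    if n < 2:
        return [False] * n
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (an assumption)
    if sigma == 0:
        return [False] * n
    flags = []
    for v in values:
        z = abs(v - mu) / sigma
        # Two-tailed probability of a deviation at least this large.
        prob = math.erfc(z / math.sqrt(2))
        flags.append(n * prob < 0.5)
    return flags


def completeness(values, count_outliers=True):
    """(total - (outliers + zeros)) / total, as in the formulas above.

    Set count_outliers=False for the NeuroSky/Polar variant, which only
    subtracts zeros. A zero that is also flagged as an outlier would be
    counted twice here, exactly as the stated formula does.
    """
    total = len(values)
    zeros = sum(1 for v in values if v == 0)
    outliers = sum(chauvenet_outliers(values)) if count_outliers else 0
    return (total - (outliers + zeros)) / total
```

For example, a series of nineteen readings of 10.0, one zero, and one reading of 100.0 has one zero and one flagged outlier, giving a completeness of 19/21 with outliers counted and 20/21 without.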

Dataset contents. The K-EmoCon dataset
debate_audios.tar.gz. contains 16 audio recordings of debates in the WAV file format. The name of each file follows the convention of p<X>.p<Y>.wav, where <X> and <Y> stand for IDs of two participants appearing in the audio. The start and the end of each recording correspond to startTime and endTime values in the subjects.csv file, respectively.
debate_recordings.tar.gz. contains 2nd-person POV video recordings of 21 participants during debates in the MP4 file format. The name of a file p<X>_<T>.mp4 indicates that the file is the recording of participant <X> that is <T> seconds long.
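As a small illustration of the two naming conventions above, the sketch below extracts participant IDs and clip length from file names. The function names are hypothetical, the example file names are made up, and treating <X>, <Y>, and <T> as non-negative integers is an assumption rather than something the dataset description states.

```python
import re

# Patterns mirror the conventions p<X>.p<Y>.wav and p<X>_<T>.mp4.
# Integer-valued IDs and durations are an assumption of this sketch.
AUDIO_NAME = re.compile(r"^p(\d+)\.p(\d+)\.wav$")
VIDEO_NAME = re.compile(r"^p(\d+)_(\d+)\.mp4$")


def parse_audio_name(filename):
    """Return (participant X, participant Y) for a debate audio file."""
    match = AUDIO_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected audio file name: {filename}")
    return int(match.group(1)), int(match.group(2))


def parse_video_name(filename):
    """Return (participant X, length T in seconds) for a debate recording."""
    match = VIDEO_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected video file name: {filename}")
    return int(match.group(1)), int(match.group(2))
```

Raising on a mismatch, rather than returning a default, makes files that do not follow the convention visible early.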
neurosky_polar_data.tar.gz. includes subdirectories for each participant, from P1 to P32, which may contain up to four files as the following:
1. Attention.csv - contains eSense Attention ranging from 1 to 100, representing how attentive a user was at a given moment. Attention values can be interpreted as the following: 1 to 20 - "strongly lowered", 20 to 40 - "reduced", 40 to 60 - "neutral", 60 to 80 - "slightly elevated", and 80 to 100 - "elevated". 0 indicates that the device was unable to calculate a sufficiently reliable value, possibly due to a signal contamination with noises.
e4_data.tar.gz. contains subdirectories for each participant (except P2, P3, P6, and P7), which may contain up to six files as the following:
1. E4_ACC.csv - measurements from a 3-axis accelerometer sampled at 32 Hz in the range [−2g, 2g] under columns x, y, and z. Multiply raw numbers by 1/64 to convert them into units of g (i.e., a raw value of 64 is equivalent to 1g).
2. E4_BVP.csv - PPG measurements sampled at 64 Hz.
3. E4_EDA.csv - EDA sensor readings in units of μS, sampled at 4 Hz.
4. E4_HR.csv - the average heart rates calculated in 10-second windows. The values are derived from the BVP measurements, and the values are entered at the frequency of 1 Hz. The first 10 seconds of data after the beginning of a recording is not included as the derivation algorithm requires the initial 10 seconds of data to produce the first value.
5. E4_IBI.csv - IBI measurements in milliseconds computed from the BVP. From the second row onwards, one row is separated from the previous row by an amount equal to the distance between two peaks (i.e., t_{i+1} − t_i = IBI_i). Note that HR in terms of BPM can be derived from IBI by taking 60/IBI * 1000.
6. E4_TEMP.csv - a body temperature measured in the Celsius scale at the frequency of 4 Hz.
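The unit conversions stated for the E4 files can be written out as a short sketch. The function names are hypothetical, and the code only restates the arithmetic given above: raw accelerometer values times 1/64 give g, HR in BPM is 60/IBI * 1000 for IBI in milliseconds, and consecutive peak times differ by one IBI.

```python
def acc_raw_to_g(raw):
    """Convert a raw E4 accelerometer value to units of g (raw 64 == 1 g)."""
    return raw / 64.0


def ibi_ms_to_bpm(ibi_ms):
    """Heart rate in beats per minute from one inter-beat interval in ms."""
    return 60.0 / ibi_ms * 1000.0


def peak_times(first_peak_time, ibis):
    """Rebuild peak times from t_{i+1} - t_i = IBI_i.

    first_peak_time and ibis must use the same time unit.
    """
    times = [first_peak_time]
    for ibi in ibis:
        times.append(times[-1] + ibi)
    return times
```

For instance, a raw accelerometer value of −128 corresponds to −2g, the lower end of the stated range, and an IBI of 1000 ms corresponds to about 60 BPM.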
Note that E4 data entries for P29, P30, P31, and P32 are entered with each row designated with either one of two unique device_serial values. It is necessary that the dataset users only use rows corresponding to a single device_serial. We further recommend using rows with the following device_serial values:
• P29, P31 - A013E1 for all files, except A01525 for IBI.
The first row in a valid file has annotations for the first five seconds, and rows coming afterward contain annotations for the next consecutive five-second intervals, non-overlapping. Also, each row in a valid file contains 10 non-empty values (eight numeric values, including seconds column, and two x's). Note that annotation files for a participant may not have an equal number of rows (e.g., there may be more self-annotations than partner/external annotations for some participants). In that case, longer files should be truncated from the start such that they have the same number of rows as shorter files since the extra annotations at the beginning are possibly from participants mistakenly annotating emotions during baseline measurements.
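Both recommendations above, keeping rows from a single device_serial and truncating longer annotation files from the start, can be sketched as below. The helper names are hypothetical, and rows are assumed to be already loaded into Python lists (for example, as dictionaries from csv.DictReader); only the device_serial column name comes from the dataset description.

```python
def rows_for_serial(rows, serial):
    """Keep only the rows recorded by one device_serial."""
    return [row for row in rows if row["device_serial"] == serial]


def truncate_from_start(*annotation_files):
    """Drop leading rows so that all annotation lists share one length.

    Each argument is the list of rows of one annotation file (self,
    partner, or external). The last rows are kept because the extra
    rows are expected at the beginning.
    """
    shortest = min(len(rows) for rows in annotation_files)
    return [rows[len(rows) - shortest:] for rows in annotation_files]
```

With a self-annotation file of seven rows and a partner-annotation file of five rows, the first two self-annotation rows are dropped and the partner file is left unchanged.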
Technical Validation
Emotion annotations. Distribution and frequency of emotions. The distributions and the frequencies of emotion annotations are as shown in Fig. 3. Overall, annotations for emotions measured on Likert scales (arousal, valence, cheerful, happy, angry, nervous, and sad) are biased towards neutral, with only a minuscule fraction of annotations for non-neutral states. Categorical emotion annotations (common and less common BROMP affective categories) are similarly biased, with a predominant portion of annotations falling under only two categories of concentration and none. This imbalance in annotations is as expected, as emotion data is commonly imbalanced by its nature in the wild (i.e., people are more often neutral than angry or sad) [84][85][86] .
Inter-rater reliability. As individual-level information is missing in aggregated data, we used Krippendorff's alpha 87 , a generalized statistic of agreement applicable to any number of raters, to measure the inter-rater reliability (IRR) of emotion annotations from different perspectives for each participant. Figure 4 shows heatmaps of alpha coefficients computed for seven emotions measured on ordinal scales (arousal, valence, cheerful, happy, angry, nervous, and sad).
All annotation values were interpreted as rank-ordered (ordinal scaled) for the IRR computation. The Likert scales we used are not interval or ratio scales with meaningful distances between values. While participants and raters were provided numeric scales labeled with semantic meanings (see Table 5), the individual interpretations of the scales were likely disparate.
Given that, before the computation, annotation values were scaled relative to a neutral, by estimating modes of columns as neutrals and deducting them from respective column values (i.e., if the mode of a cheerful column for a particular participant was one, then one was subtracted from all values in that cheerful column). This mode-subtraction step was necessary to prevent the underestimation of IRRs.
Annotations in our dataset for scaled emotions are highly biased, as shown in Fig. 3. However, while arousal and valence are explicitly centered at zero (which corresponds to 3 = neutral), the five emotions measured on a scale of 1 = very low to 4 = very high (cheerful, happy, angry, nervous, and sad) are systematically biased without a zero neutral. All of their values indicate that some emotion is present, and this absence of a zero results in widely varying interpretations of scale values by our participants and raters.
Consider the following scenario, which further elaborates this issue: a subject rates her cheerfulness as 1 for the first half of a debate and as 2 for the rest, but her debate partner rates her cheerfulness as 3 for the first half and as 4 for the rest. In this example, self and partner annotations both imply that the subject was less cheerful during the first half of the debate. However, without subtracting modes, the IRR of the two sets of annotations is close to zero. Indeed, it is possible that the partner perceived the subject as more cheerful overall, compared to the subject herself. In that case, a low IRR correctly measures the difference between the emotion perceptions of the subject and the partner. Nevertheless, this assumption cannot be confirmed, as there is no neutral baseline. Therefore, we applied the proposed mode-subtraction to emotion annotations such that alpha coefficients measure raters' agreement on relative changes in emotions rather than their absolute agreement with each other. This adjustment mitigates spuriously low alpha coefficient values obtained from raw annotations (refer to the Code Availability section for the code implementing the mode-subtraction and the plotting of heatmaps).
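As an illustration of the mode-subtraction step applied to the scenario above, consider the minimal sketch below. It is not the authors' implementation (which is referenced in the Code Availability section): the function name is hypothetical, and breaking ties between equally frequent values by taking the smallest one is an assumption made here so that the result is deterministic.

```python
from collections import Counter


def subtract_mode(values):
    """Rescale one annotation column relative to its estimated neutral.

    The mode of the column is taken as the neutral and subtracted from
    every value. Assumption: ties are broken by the smallest value.
    """
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    mode = min(v for v, c in counts.items() if c == top)
    return [v - mode for v in values]


# The scenario from the text: one value per annotated interval.
self_cheerful = [1, 1, 2, 2]
partner_cheerful = [3, 3, 4, 4]

adjusted_self = subtract_mode(self_cheerful)
adjusted_partner = subtract_mode(partner_cheerful)
```

After the adjustment, both columns become `[0, 0, 1, 1]`, so an agreement statistic computed on them reflects the shared relative change rather than the offset between the two raters.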
These adjusted alpha coefficients are low in general. In particular, a noticeable pattern emerges when comparing the alpha coefficients of self-partner (SP) annotations and self-external (SE) annotations. As shown in the last rows of the heatmaps (Diff. [SE - SP]) in Fig. 4, the differences between the IRRs of SE annotations and SP annotations tend to be above zero (for 22 out of 32 participants for arousal: mean = 0.145, stdev. = 0.279). This pattern possibly indicates a meaningful difference in the perception of emotions from different perspectives, although further study is required to validate its significance.
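The summary reported for the SE - SP differences (count above zero, mean, and standard deviation) could be computed along the following lines. This is a sketch only: the function name is hypothetical, the use of the sample standard deviation is an assumption (the text does not say which estimator was used), and the input values are made-up placeholders, not coefficients from the dataset.

```python
from statistics import mean, stdev


def summarize_se_minus_sp(alpha_se, alpha_sp):
    """Summarize per-participant differences between SE and SP alphas."""
    diffs = [se - sp for se, sp in zip(alpha_se, alpha_sp)]
    return {
        "n_above_zero": sum(d > 0 for d in diffs),
        "mean": mean(diffs),
        # Assumption: sample standard deviation (n - 1 denominator).
        "stdev": stdev(diffs),
    }


# Placeholder alphas for three hypothetical participants.
summary = summarize_se_minus_sp(
    alpha_se=[0.5, 0.25, 0.0],
    alpha_sp=[0.25, 0.5, -0.25],
)
```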
Physiological signals. Data quality. The quality of physiological signal measurements in the dataset has been thoroughly examined. The examination results are included as a part of the dataset in the data_quality_tables.tar.gz archive file.
Missing data. E4 data of four participants (P2, P3, P6, and P7) were excluded due to a device malfunction during data collection. While physiological signals in the dataset are mostly error-free, with most of the files more than 95% complete, a portion of the data is missing due to issues inherent to the devices or to human error:
• IBI - data from P26 is missing, as the internal algorithm of E4 that derives IBI from BVP automatically discards an obtained value if its reliability is below a certain threshold.
• EDA - data from P17 and P20 is missing, possibly due to poor contact between the device and a participant's skin.
• NeuroSky (Attention, Meditation) - measurements from P1 and P20 are missing due to a poorly equipped device. A portion of data is missing for P19 (∼32%), P22 (∼59%), and P23 (∼36%) for the same reason. No BrainWave data was lost.
• Polar HR - data from seven participants (P3, P12, P18, P20, P21, P29, and P30) are missing due to a device error during data collection. Parts of data are missing from P4 (∼38%) and P22 (∼38%) due to poor contact.
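When selecting participants for an analysis, the missing-data list above can be encoded as a small lookup, as in the sketch below. The participant numbers and approximate fractions are taken from the list; the modality keys (such as `"E4_EDA"`) and the function name are hypothetical labels chosen for this illustration.

```python
ALL_PARTICIPANTS = set(range(1, 33))  # P1 to P32
E4_EXCLUDED = {2, 3, 6, 7}            # device malfunction

# Participants whose data for a modality is missing entirely.
FULLY_MISSING = {
    "E4_other": E4_EXCLUDED,          # ACC, BVP, HR, TEMP
    "E4_IBI": E4_EXCLUDED | {26},
    "E4_EDA": E4_EXCLUDED | {17, 20},
    "NeuroSky": {1, 20},              # Attention, Meditation
    "Polar_HR": {3, 12, 18, 20, 21, 29, 30},
}

# Participants with an approximate fraction of data missing.
PARTLY_MISSING = {
    "NeuroSky": {19: 0.32, 22: 0.59, 23: 0.36},
    "Polar_HR": {4: 0.38, 22: 0.38},
}


def usable_participants(modalities, drop_partial=False):
    """Return sorted participant numbers with data for all modalities."""
    usable = set(ALL_PARTICIPANTS)
    for modality in modalities:
        usable -= FULLY_MISSING[modality]
        if drop_partial:
            usable -= set(PARTLY_MISSING.get(modality, {}))
    return sorted(usable)
```

For instance, `usable_participants(["E4_EDA"])` leaves 26 participants, and requiring Polar HR as well leaves 21.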

Usage Notes
Potential applications. Beyond the intended usage of the dataset discussed above, it remains uncertain how physiological markers of an individual's capacity for flexible physiological reactivity relate to experiences of positive and negative emotions. Our dataset could be used to examine the role of physiological signal-based markers in assessing an individual's use of emotion regulation strategies, such as cognitive appraisal. Additionally, various data mining and machine learning techniques could be applied to build models of an individual's emotional profile based on sensor-based physiological and behavioral recordings. Such models could further be transferred to various positive computing use cases 88 , such as helping children with autism in their social communication 89,90 , helping people who are blind to read facial expressions and obtain emotion information about their peers 91 , finding opportune moments for conversational user interactions 92,93 , assisting social anxiety disorder patients in overcoming their conditions 94 , allowing robots to interact more intelligently with people 95,96 , and monitoring signs of frustration and emotional saturation that affect attention while driving, to enhance driver safety 97,98 .
Limitations. Data collection apparatus. Contact-based EEG sensors are known to be susceptible to noise; for example, frowning or eye movements might have caused peaks in the data. Other devices may also have been subject to similar systematic errors.
Data collection context. The context of the turn-taking debate may have caused participants to regulate or even suppress their emotional expressions, as an unrestrained display of emotions is often regarded as undesirable during a debate. This may have contributed to a deflated level of agreement between self-reports and partner/external perceptions of emotions, which may not be the case for more natural interactions in the wild.
Retrospective emotion annotation with 2nd-person footage. We used a retrospective affect judgment protocol in which our participants annotated the emotions they felt during debates while watching 2nd-person footage of themselves. This approach may have introduced unintended effects on self-ratings of emotions, which pertain to the interaction between interoception 99 , emotional reasoning, and self-perception. Nonetheless, we clearly illustrate our rationale for choosing this annotation method on page 5, under Emotion annotation. Further, our dataset includes annotations of participant emotions from debate partners and external raters who watched the same footage. Therefore, rather than being flawed, our dataset opens a window for investigating the effects mentioned above while also enabling a comprehensive study of emotions by comparing their perceptions across multiple perspectives.
Mode-subtraction in IRR computation. With the mode-subtraction, inter-rater reliability values represent the agreement of raters on relative emotion changes rather than perceived emotions in an absolute sense (see