Introduction

Nonverbal cues are often heralded as the main source of social information during conversations. Despite the many decades social scientists have studied gestures, however, there are only a handful of large sample studies in which the body movements of interactants are measured in detail over time and associated with various communication outcomes. Hence, this experiment capitalizes on dramatic advancements in virtual reality (VR) technology to track and quantify the facial expressions and body movements of over 200 people speaking to one another while embodied in an avatar.

Steuer1 defines VR as “a real or simulated environment in which a perceiver experiences telepresence.” Under this definition, VR includes immersive and non-immersive experiences involving technologies that contribute to feelings of vividness and interactivity, the two core dimensions of telepresence1. Multiple companies have launched avatar-mediated social VR platforms, which allow users to connect with others using customized avatars (i.e., digital representations of users controlled in real-time2) in virtual environments. One development that has made avatar-mediated communication particularly attractive has been the possibility to achieve unprecedented levels of behavioral realism3. Optical tracking systems (e.g., HTC Vive, Microsoft Kinect, Oculus Rift CV1) can measure users’ physical movements in real-time with great accuracy4 and render virtual representations accordingly. Although less common in consumer products, developments in computer vision allow for facial tracking through information extracted from RGB and/or infrared cameras. While facial tracking is yet to be widely available on social VR platforms, there has been a growing interest in developing technology that allows for a more seamless facial tracking experience5,6,7.

Despite the significant interest in adding nonverbal cues to VR, little is known about the impact of incorporating nonverbal channels in avatar-mediated environments. While current industrial trends appear to revolve around the belief that ‘more is better’, studies show that technical sophistication does not necessarily lead to more favorable outcomes8,9. Furthermore, considering that even minimal social cues are enough to elicit social responses10 and that verbal strategies are sufficient to communicate emotional valence11, it is unclear whether incorporating additional nonverbal cues will linearly improve communication outcomes.

Understanding the impact of facial expressions and bodily movements within avatar-mediated environments can help further our understanding of the significance of these channels in face-to-face (FtF) contexts. While there are a handful of studies that lend insight into the independent and joint contributions of various nonverbal channels during FtF interactions, the majority of these studies were conducted with either static images12,13 or posed expressions14,15,16, rather than FtF interactions. In addition, the limited number of studies that did examine the impact of different nonverbal cues in FtF dyadic contexts asked participants to wear sunglasses17,18 or covered parts of their bodies19,20, which inevitably alters the appearance of the target individual and reduces both the ecological validity and generalizability of results. By using identical avatars across conditions and only allowing the nonverbal information to differ, the present study offers an ideal balance between experimental control and ecological validity3.

Behavioral realism and interpersonal outcomes

The extant literature offers a mixed picture regarding the relationship between nonverbal cues and interpersonal outcomes within avatar-mediated contexts. On the one hand, studies show that increasing behavioral realism can improve communication outcomes21,22. Moreover, past studies have demonstrated that increasing behavioral realism by augmenting social cues exhibited by avatars (e.g., eye gaze and facial expressions) can enhance collaboration and produce meaningful interactions23,24,25. It is important to note, however, that these studies often manipulated responsive behaviors (e.g., mutual gaze, nodding), which are associated with positive outcomes26,27. As such, it is uncertain whether the purported benefits of behavioral realism were due to the addition of nonverbal cues or to perceptions of favorable nonverbal behavior.

In contrast, other studies28,29 found that general levels of behavioral realism do not uniformly improve communication outcomes. For instance, two studies30,31 found that adding facial expressions or bodily gestures to avatar-mediated virtual environments did not consistently enhance social presence or interpersonal attraction. However, both of these studies employed a task-oriented interaction without time limits and a casual social interaction, which may have given participants enough time and relevant social information to reach a ceiling effect regardless of the nonverbal cues available. This is a reasonable conjecture, considering that increased interaction time can allow interactants to overcome the lack of nonverbal cues available in computer-mediated communication (CMC)32. As such, the effects of nonverbal cues independent of increased time or availability of social content are unclear. In addition, despite ample research pointing to the association between nonverbal behavior and interpersonal judgments33, most studies did not utilize the automatically tracked nonverbal data to explore its association with interpersonal outcomes, which could further our understanding of the sociopsychological implications of automatically tracked nonverbal cues.

Taking these limitations into account, the present study attempts to elucidate the unique influences of including facial expressions and bodily gestures on interaction outcomes (i.e., interpersonal attraction, social presence, affective valence, impression accuracy) by employing a goal-oriented task with time constraints. The present study also offers a less constricted representation of participants’ nonverbal behavior including expressions of negative and/or neutral states, rather than limiting the available nonverbal cues related to feedback or friendliness (e.g., head nodding, reciprocity, smiling).

Predicting interpersonal attraction with automatically detected nonverbal cues

Nonverbal cues not only influence impression formation, but also reflect one’s attitude toward one’s communication partner(s)34,35, such as interpersonal attraction31, bonding36, and biased attitudes37. In addition to nonverbal cues that are isolated to the individual, studies have shown that interactional synchrony is associated with more positive interpersonal outcomes38,39,40,41. Interactional synchrony is defined as “the temporal linkage of nonverbal behavior of two or more interacting individuals”42. Under this definition, synchrony refers to the motion interdependence of all participants during an interaction, focusing on more than a single behavior (e.g., posture or eye gaze). This view of synchrony is consistent with Ramseyer and Tschacher’s39 characterization of synchrony and is grounded within the dynamical systems framework43. Interactional synchrony has been associated with the ability to infer the mental states of others44 and rapport45. For example, spontaneous synchrony was related to Theory of Mind46 for participants with and without autism, such that increased synchrony was associated with a higher ability to infer the feelings of others47.

While research has consistently found that nonverbal behavior is indicative of interpersonal outcomes38, the vast majority of these studies quantified nonverbal behavior either by using human coders who watched video recordings of an interaction and recorded the target nonverbal behaviors, or by using Motion Energy Analysis (MEA; automatic and continuous monitoring of the movement occurring in pre-defined regions of a video). Coding nonverbal behavior by hand is not only slow and vulnerable to biases42,48, but also makes it difficult to capture subtle nonverbal cues that are not easily detectable by the human eye. While MEA is more efficient than manual coding, it is limited in that it is based on a frame-by-frame analysis of regions of interest (ROI) and is thus susceptible to region-crossing (i.e., movement from one region being confused with that of another region49). That is, MEA does not track individual parts of the body, but pixels within ROI. Given these limitations, researchers have recently turned to the possibility of automating the quantification of nonverbal behavior by capitalizing upon dramatic improvements in motion detection technology (e.g., tracking with RGB-D cameras) and computational power (e.g., machine learning)36,42,50. While these methods are also prone to tracking errors, they have the advantage of tracking nonverbal cues in a more targeted manner (i.e., specific joints, facial expressions) and offer higher precision by utilizing depth data in addition to color (RGB) data.

While researchers have started to employ machine learning algorithms to determine the feasibility of using automatically detected nonverbal cues to predict interpersonal outcomes, they have relied either solely on isolated nonverbal behaviors36 or entirely on nonverbal synchrony42,51, rather than on both isolated and interdependent nonverbal cues. In addition, previous studies have employed relatively small sample sizes (Ndyad range: 15–53). Perhaps for this reason, prior machine learning classifiers either performed above chance level only when dataset selection was exclusive42,51 or showed unreliable performance in terms of validation and testing set accuracy rates36. Consequently, there is inconclusive evidence as to whether automatically tracked nonverbal cues can reliably predict interpersonal attitudes. By employing machine learning algorithms to explore whether nonverbal behaviors can predict interpersonal attitudes, the present study aims to address whether and, if so, how automatically tracked nonverbal cues and synchrony are associated with interpersonal outcomes through an inductive process.

Methods

Study design

The present study adopted a 2 Bodily Gestures (Present vs. Absent) × 2 Facial Expressions (Present vs. Absent) between-dyads design. Dyads were randomly assigned to one of the four conditions, and gender was held constant within a dyad. There was an equal number of male and female dyads within each condition. Participants only interacted with each other via their avatars and did not meet or communicate directly with each other prior to the study. The nonverbal channels that were rendered on the avatar were contingent on the experimental condition. Participants in the ‘Face and Body’ condition interacted with an avatar that veridically portrayed their partner’s bodily and facial movements. Participants in the ‘Body Only’ condition interacted with an avatar that veridically represented their partner’s bodily movements, but did not display any facial movements (i.e., static face). In contrast, participants in the ‘Face Only’ condition interacted with an avatar that veridically portrayed their partner’s facial movements, but did not display any bodily movements (i.e., static body). Finally, participants in the ‘Static Avatar’ condition interacted with an avatar that did not display any movements. A graphical representation of each condition is available in Fig. 1.

Figure 1
figure 1

Graphical representations of the four conditions: static avatar (A), body only (B), face only (C), body and face (D).

Participants

Participants were recruited from two medium-sized Western universities (Foothill College, Stanford University). Participants were granted either course credit or a $40 Amazon gift card for their participation. A total of 280 participants (140 dyads) completed the study. Dyads that included participants who failed the manipulation check (Ndyad = 10) and/or participants who recognized their partners (Ndyad = 6) were excluded from the final analysis. To determine if participants who were part of a specific condition were more likely to fail the manipulation check or to recognize their interaction partners, two chi-square tests were conducted. Results indicated that there were no differences between conditions for either dimension (manipulation check failure: χ2(3) = 1.57, p = 0.67; partner recognition: χ2(3) = 1.78, p = 0.62).

Materials and apparatus

A markerless tracking device (Microsoft Kinect for Xbox One with adapter for Windows) was used to track participants’ bodily gestures. Using an infrared emitter and sensor, the Microsoft Kinect provides positional data for 25 skeletal joints at 30 Hz in real-time, allowing unobtrusive data collection of nonverbal behavior. Studies provide evidence that the Kinect offers robust and accurate estimates of bodily movements52. While even higher levels of accuracy can be achieved with marker-based systems, this study employed a markerless system to encourage more naturalistic movements53. The joints that are tracked by the Kinect are depicted in Fig. 2. The present study used the 17 joints that belong to the upper body, as studies have suggested that the Kinect tends to show poorer performance for lower body joints52 (i.e., left hip, right hip, left knee, right knee, left ankle, right ankle, left foot, right foot), which can result in “substantial systematic errors in magnitude” of movement54.

Figure 2
figure 2

Joints tracked by the Kinect: only colored joints were mapped to the avatar.

Participants’ facial expressions were tracked in real-time using the TrueDepth camera on Apple’s iPhone XS. The TrueDepth camera creates a depth map and infrared image of the user’s face, which represent the user’s facial geometry55. More specifically, the TrueDepth camera captures an infrared image of the user’s face and projects and analyzes approximately 30,000 points to create a depth map of the user’s face, which are subsequently analyzed by Apple’s neural network algorithm. Among other parameters, Apple’s ARKit SDK can extract the presence of facial expressions from the user’s facial movements. A full list of the 52 facial expressions that are tracked by ARKit is included in “Appendix 1”. The value of each facial expression (i.e., blendshape) ranges from 0 to 1 and is determined by the current position of a specific facial movement relative to its neutral position55. Each blendshape was mapped directly from the participant’s facial movements. While we do not have a quantitative measure for tracking accuracy, qualitative feedback from pilot sessions with 40 participants suggested that participants found the facial tracking to be accurate.
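For illustration, a single facial-tracking sample might be represented as follows. This is a hypothetical record format (the study's actual logging schema is not described); the blendshape names shown are a small subset of the 52 coefficients, each reported by ARKit as a value between 0 (neutral) and 1 (fully expressed).

```python
# Hypothetical example of one facial-tracking sample; field names and layout are
# assumptions for illustration, not the study's actual data format.
sample = {
    "timestamp": 12.433,      # seconds since the start of the interaction
    "jawOpen": 0.18,          # blendshape coefficients range from 0 (neutral) to 1
    "mouthSmileLeft": 0.62,
    "mouthSmileRight": 0.57,
    "eyeBlinkLeft": 0.03,
    "eyeBlinkRight": 0.02,
}
```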

Discord, one of the most commonly used Voice over Internet Protocol (VoIP) platforms56, was used for verbal communication. Participants were able to hear their partner’s voice through two speakers (Logitech S120 Speaker System) and their voices were detected by the microphone embedded in the Kinect sensor. Participants were able to see each other’s avatars on a television (Sceptre 32" Class FHD (1080P) LED TV (X325BV-FSR)), which was mounted on a tripod stand (Elitech). The physical configuration of the study room can be seen in Fig. 3. The person pictured in Fig. 3 gave informed consent to publish this image in an online open-access publication. The avatar-mediated platform in which participants interacted was programmed using Unity version 2018.2.2. Further details of the technical setup are available in “Appendix 2” and information regarding the system’s latency can be seen in “Appendix 3”.

Figure 3
figure 3

Configuration of study room (left): (A) iPhone XS for facial tracking, (B) Kinect for Xbox One for body tracking, (C) person being tracked during visual referential task.

Procedure

All study procedures and materials received approval from the Institutional Review Board of Stanford University. All methods were performed in accordance with relevant guidelines and regulations. Participants in each dyad were asked to come to two separate locations to prevent them from seeing and interacting with each other prior to the study. Participants were randomly assigned to one of the two study rooms, which were configured identically (Fig. 3). Once participants gave informed consent to participate in the study, they completed a pre-questionnaire that measured their personality across five dimensions57 (extraversion, agreeableness, neuroticism, conscientiousness, openness to experience). After each participant completed the pre-questionnaire, the experimenter explained that two markerless tracking systems would be used to enable the participant and their partner to interact through the avatar-mediated platform. The participant was then asked to stand on a mat measuring 61 cm × 43 cm that was placed 205 cm away from the Kinect and 20 cm away from the iPhone XS. After the participant stood on the mat, the experimenter asked the participant to confirm that the phone was not obstructing their view. If the participant said that the phone was blocking their view, the height of the phone was adjusted. Upon confirming that the participant was comfortable with the physical setup of the room and that the tracking systems were tracking the participant, the experimenter opened the avatar-mediated platform and let the participants know that they would be completing two interaction tasks with a partner. After answering any questions that the participants had, the experimenter left the room.

Prior to the actual interaction, participants went through a calibration phase. During this time, participants were told that they would be completing a few calibration exercises to understand the physical capabilities of the avatars. This phase helped participants familiarize themselves with the avatar-mediated platform and allowed the experimenter to verify that the tracking system was properly sending data to the avatar-mediated platform. Specifically, participants saw a ‘calibration avatar’ (Fig. 4) and were asked to perform facial and bodily movements (e.g., raise hands, tilt head, smile, frown). The range of movement that was visualized through the calibration avatar was consistent with the experimental condition of the actual study. All participants were asked to do the calibration exercises regardless of condition in order to prevent differential priming effects stemming from these exercises and to demonstrate the range of movements that could be expected from their partner’s avatars.

Figure 4
figure 4

Avatar used during calibration phase.

After completing the calibration exercises, participants proceeded to the actual study. Participants were informed that they would collaborate with each other to complete two referential tasks: an image-based task (i.e., visual referential task) and a word-based task (i.e., semantic referential task). The order in which the tasks were presented was counterbalanced across all conditions.

The image-based task was a figure-matching task adapted from Hancock and Dunham58. Each participant was randomly assigned the role of the ‘Director’ or the ‘Matcher’. The Director was asked to describe a series of images using both verbal and nonverbal language (e.g., tone/pitch of voice, body language, facial expressions). The Matcher was asked to identify the image that was being described from an array of 5 choices and one “image not present” choice and to notify the Director once he or she believed the correct image had been identified (Fig. 5). Both the Matcher and Director were encouraged to ask and answer questions during this process. The Matcher was asked to select the image that he or she believed was a match for the image that the Director was describing; if the image was not present, the Matcher was asked to select the “image not present” choice. After 7 min or after participants had completed the entire image task (whichever came first), participants switched roles and completed the same task one more time.

Figure 5
figure 5

Examples of stimuli for visual referential task.

The word-based task was a word-guessing task adapted from the ‘password game’ used in Honeycutt, Knapp, and Powers59. Each participant was randomly assigned the role of the ‘Clue-giver’ or the ‘Guesser’. The Clue-giver was asked to give clues about a series of thirty words using both verbal and nonverbal language. The Guesser was asked to guess the word that was being described. Both the Clue-giver and the Guesser were encouraged to ask and answer questions during this process. Given the open-ended nature of the task, participants were told that they were allowed to skip words if they thought that the word was too challenging to describe or guess. After 7 min or after they had completed the word task (whichever came first), participants switched roles and completed the same task one more time; the Clue-giver became the Guesser and the Guesser became the Clue-giver. The words used in the word-based task were chosen from A Frequency Dictionary of Contemporary American English60, which provides a list of 5,000 of the most frequently used words in the US; 90 words were chosen from the high, medium, and low usage nouns and verbs from this list. The selected words were presented in a random order for the Clue-giver to describe.

These tasks were chosen for the following reasons: first, two types of referential tasks (i.e., visual and semantic) were employed in order to reduce the bias of the task itself toward verbal or nonverbal communication. That is, the visual task was selected as a task more amenable to nonverbal communication, while the semantic task was selected as one more amenable to verbal communication. Second, we adopted a task-oriented social interaction to avoid ceiling effects of the interpersonal outcome measures, given that purely social exchanges are more likely to support personal self-disclosures, which are associated with interpersonal attraction and facilitate impression formation.

After the interaction, participants completed the post-questionnaire which assessed perceptions of interpersonal attraction, affective valence, impression accuracy, and social presence. Participants’ bodily and facial nonverbal data were tracked and recorded unobtrusively during the interaction. As noted in “Methods”, participants gave consent for their nonverbal data to be recorded for research purposes. Once they completed the post-questionnaire, participants were debriefed and thanked.

Measures

Interpersonal attraction

Based on McCroskey and McCain61, two facets of interpersonal attraction were measured, namely social attraction and task attraction. Social attraction was measured by modifying four items from Davis and Perkowitz62 to fit the current context and task attraction was assessed by modifying four items from Burgoon63. Participants rated how strongly they agreed or disagreed with each statement on a 7 point Likert-type scale (1 = Strongly Disagree, 7 = Strongly Agree). The wording for all questionnaire measures is included in “Appendix 4”.

Due to the similarity of the social and task attraction scales, a parallel analysis64 (PA) was run to determine the correct number of components to extract from the eight items. PA results indicated that the data loaded onto a single component, as shown in Fig. 6. A confirmatory factor analysis with varimax rotation showed that 56% of the variance was explained by the single component, and that the standardized loadings for all items were greater than 0.65 (Table 1). Thus, the two subscales of interpersonal attraction were collapsed into a single measure of interpersonal attraction. The reliability of the scale was good, Cronbach’s α = 0.89. Greater values indicated higher levels of interpersonal attraction (M = 5.84, SD = 0.61); the minimum was 3.75 and the maximum was 7.

Figure 6
figure 6

Parallel analysis scree plots of actual and resampled interpersonal attraction data.

Table 1 Factor analysis of interpersonal attraction with varimax rotation.
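The parallel analysis described above can be approximated with a short script. The minimal sketch below assumes the eight attraction items are stored in a participants × items NumPy array named items and compares eigenvalues of the observed correlation matrix against those from resampled random data; the study's actual implementation (e.g., a dedicated statistical package) may differ.

```python
# A minimal sketch of Horn's parallel analysis: compare eigenvalues of the observed
# item correlation matrix with those obtained from random data of the same shape.
import numpy as np

def parallel_analysis(items, n_iter=1000, percentile=95, seed=0):
    rng = np.random.default_rng(seed)
    n_obs, n_items = items.shape
    observed = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))[::-1]
    random_eigs = np.empty((n_iter, n_items))
    for i in range(n_iter):
        noise = rng.standard_normal((n_obs, n_items))
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)
    # Retain components whose observed eigenvalue exceeds the random-data threshold
    return int(np.sum(observed > threshold))
```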

Affective valence

A Linguistic Inquiry and Word Count65 (LIWC) analysis was performed on responses to an open-ended question that asked participants to describe their communication experience. LIWC has been used as a reliable measure for various interpersonal outcomes, including the prediction of deception66, personality67, and emotions68. Affective valence was computed by subtracting the percentage of negative emotion words from the percentage of positive emotion words yielded by the LIWC analysis69. Greater values indicated relatively more positive affect than negative affect (M = 3.59, SD = 3.4); the minimum was − 2.94 and the maximum was 20.
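For concreteness, the valence score amounts to a simple difference of two LIWC percentages; the sketch below uses hypothetical values, with posemo and negemo standing in for the LIWC positive- and negative-emotion outputs.

```python
# Hypothetical illustration: affective valence = % positive emotion words - % negative emotion words.
def affective_valence(posemo: float, negemo: float) -> float:
    return posemo - negemo

affective_valence(5.2, 1.6)  # returns 3.6, i.e., relatively more positive than negative affect
```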

Impression accuracy

Participants completed a self and an observer version of the short 15-item Big Five Inventory70,71 (BFI-S). Participants rated themselves and their partner on 15 items that were associated with five personality dimensions (i.e., extraversion, agreeableness, conscientiousness, neuroticism, and openness to experience) on a 7 point Likert-type scale (1 = Strongly Disagree, 7 = Strongly Agree). Participants were given the option to select “Cannot make judgment” for the observer version of the BFI-S.

Impression accuracy was defined as the profile correlation score, which “allows for an examination of judgments in regard to a target's overall personality by the use of the entire set of […] items in a single analysis”72; that is, impression accuracy was assessed by computing the correlation coefficient across the answers that each participant and their partner gave for the 15 items72,73. Greater values indicated more accurate impressions (M = 0.39, SD = 0.36); the minimum was − 0.64 and the maximum was 0.98.
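A minimal sketch of this profile-correlation computation is shown below, assuming self_ratings and partner_ratings are length-15 arrays holding a participant's self-reported BFI-S answers and the partner's ratings of that participant, with "Cannot make judgment" responses coded as NaN (a coding assumption for illustration).

```python
# A minimal sketch of impression accuracy as a profile correlation across the 15 BFI-S items.
import numpy as np

def impression_accuracy(self_ratings, partner_ratings):
    self_ratings = np.asarray(self_ratings, dtype=float)
    partner_ratings = np.asarray(partner_ratings, dtype=float)
    # Skip items where the observer could not make a judgment (coded as NaN here)
    valid = ~np.isnan(self_ratings) & ~np.isnan(partner_ratings)
    return float(np.corrcoef(self_ratings[valid], partner_ratings[valid])[0, 1])
```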

Social presence

Social presence was measured with items selected from the Networked Minds Measure of Social Presence74,75, one of the most frequently used scales to measure social presence. To reduce cognitive load, 8 items were selected from the scale, which consisted of statements that assessed co-presence, attentional engagement, emotional contagion, and perceived comprehension during the virtual interaction. Participants rated how strongly they agreed or disagreed with each statement on a 7 point Likert-type scale (1 = Strongly Disagree, 7 = Strongly Agree). The reliability of the scale was acceptable, Cronbach’s α = 0.77. Greater values indicated higher levels of social presence (M = 5.47, SD = 0.65); the minimum was 3.38 and the maximum was 6.75.

Nonverbal behavior

Participants’ bodily movements were tracked with the Microsoft Kinect. Because the raw tracking data had non-uniform time intervals, linear interpolation was used to resample the data to a uniform rate of 30 Hz. Then, a second-order, zero-phase bidirectional, Butterworth low-pass filter was applied with a cutoff frequency of 6 Hz to provide smooth estimates76. Participants’ facial expressions were tracked in real-time using the TrueDepth camera on Apple’s iPhone XS, and these data were also interpolated to 30 Hz.
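The post-processing described above can be sketched as follows; this is a minimal illustration assuming timestamps (in seconds) and a single joint's x/y/z positions are available as arrays, not the study's exact code.

```python
# A minimal sketch: resample one joint's trajectory to a uniform 30 Hz grid, then apply a
# second-order, zero-phase (forward-backward) Butterworth low-pass filter at 6 Hz.
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_joint(timestamps, positions, fs=30.0, cutoff=6.0):
    uniform_t = np.arange(timestamps[0], timestamps[-1], 1.0 / fs)
    interpolated = np.column_stack(
        [np.interp(uniform_t, timestamps, positions[:, axis]) for axis in range(3)]
    )
    b, a = butter(N=2, Wn=cutoff / (fs / 2.0), btype="low")
    return filtfilt(b, a, interpolated, axis=0)  # forward-backward filtering yields zero phase distortion
```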

Synchrony of bodily movement

Synchrony of bodily movements is defined as the correlation between the extent of bodily movements of the two participants, with higher correlation scores indicating higher synchrony. More specifically, the time series of the extent of bodily movements of the two participants were cross-correlated for each 100 s interval of the interaction. Cross-correlation scores were computed for both positive and negative time lags of five seconds, in accordance with Ramseyer and Tschacher39, which accounted for both ‘pacing’ and ‘leading’ synchrony behavior. Time lags were incremented at 0.1 s intervals, and cross-correlations were computed for each interval by stepwise shifting one time series in relation to the other39. While the Kinect can capture frames at 30 Hz, the sampling rate varies and the resulting data are noisy. During post-processing, we addressed both shortcomings by filtering and downsampling to a uniform frequency. As noted above, a Butterworth low-pass filter with a cutoff frequency of 6 Hz was applied to remove signal noise, and the filtered data were then downsampled to 10 Hz to achieve a uniform sampling rate across the body and face. In instances wherein less than 90% of the data were tracked within a 100 s interval, the data from that interval were discarded. Participants’ synchrony scores were computed by averaging the cross-correlation values.
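A minimal sketch of this windowed cross-correlation is given below. It assumes a and b are the two partners' movement time series already filtered and downsampled to 10 Hz (so a 0.1 s lag step equals one sample), with missing samples coded as NaN; the study's actual implementation may differ in detail.

```python
# A minimal sketch of windowed cross-correlation synchrony: for each 100 s window,
# correlate the two time series at every lag from -5 s to +5 s in 0.1 s steps,
# skip windows with more than 10% missing data, and average the coefficients.
import numpy as np

def windowed_synchrony(a, b, fs=10, window_s=100, max_lag_s=5):
    win, max_lag = window_s * fs, max_lag_s * fs
    correlations = []
    for start in range(0, min(len(a), len(b)) - win + 1, win):
        x, y = a[start:start + win], b[start:start + win]
        if np.isnan(x).mean() > 0.1 or np.isnan(y).mean() > 0.1:
            continue  # discard windows with less than 90% valid data
        for lag in range(-max_lag, max_lag + 1):
            if lag < 0:
                r = np.corrcoef(x[-lag:], y[:lag])[0, 1]
            elif lag > 0:
                r = np.corrcoef(x[:-lag], y[lag:])[0, 1]
            else:
                r = np.corrcoef(x, y)[0, 1]
            correlations.append(r)
    return float(np.nanmean(correlations))
```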

Synchrony of facial expressions

Synchrony of facial expressions is similarly defined as the correlation between the time series of facial movements. Once again, the time series of facial movements of the two participants were cross-correlated for each 100 s interval of the interaction. Cross-correlations were computed for both positive and negative time lags of 1 s, in accordance with Jaques et al.36. Time lags were incremented at 0.1 s intervals, and cross-correlations were computed for each interval by stepwise shifting one time series in relation to the other. The facial tracking data were downsampled to 10 Hz to compensate for gaps that were introduced after the data were mapped from a continuous to a uniformly spaced time scale (Fig. 7). Once again, if less than 90% of the data were tracked within a given 100 s interval, the data from that interval were discarded. Participants’ synchrony scores were computed by averaging the cross-correlation values.

Figure 7
figure 7

Illustration of post-processing sequence for facial movement data.

Extent of bodily movement

To assess the extent to which participants moved their body, the frame-to-frame Euclidean distance for each joint was computed across the interaction; at 30 Hz, this corresponds to the distance each joint traveled during each 0.03 s interval. The average Euclidean distance per 0.03 s interval for each joint was then averaged across the 17 joints to form a single composite score.
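A minimal sketch of this composite, assuming joints is an array of shape (n_frames, 17, 3) holding the filtered 30 Hz positions of the 17 upper-body joints:

```python
# A minimal sketch of the extent-of-bodily-movement composite.
import numpy as np

def extent_of_bodily_movement(joints):
    # Frame-to-frame (~0.03 s at 30 Hz) Euclidean displacement of each joint
    displacement = np.linalg.norm(np.diff(joints, axis=0), axis=2)  # shape: (n_frames - 1, 17)
    # Average over time, then across the 17 joints, for a single composite score
    return float(displacement.mean(axis=0).mean())
```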

Extent of facial movement

To assess the extent of facial movement during the interaction, the confidence scores for each facial movement (i.e., the deviation of each facial movement from the neutral point) were sampled at a rate of 30 Hz and averaged to form a single composite score. Facial expressions that had a left and right component (e.g., Smile Left and Smile Right) were averaged to form a single item. Finally, facial movements that showed low variance during the interaction were excluded to avoid significant findings due to spurious tracking values.
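A sketch of the facial composite is shown below, assuming blendshapes is a pandas DataFrame of 30 Hz samples with one column per blendshape (values 0–1); the left/right column names and the variance threshold are illustrative assumptions.

```python
# A minimal sketch of the extent-of-facial-movement composite: average left/right pairs,
# drop near-constant channels, then average over time and channels.
import pandas as pd

def extent_of_facial_movement(blendshapes: pd.DataFrame, var_threshold: float = 1e-4) -> float:
    df = blendshapes.copy()
    # Example of collapsing one left/right pair into a single channel
    if {"mouthSmileLeft", "mouthSmileRight"} <= set(df.columns):
        df["mouthSmile"] = df[["mouthSmileLeft", "mouthSmileRight"]].mean(axis=1)
        df = df.drop(columns=["mouthSmileLeft", "mouthSmileRight"])
    df = df.loc[:, df.var() > var_threshold]  # exclude low-variance channels
    return float(df.mean(axis=0).mean())
```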

Machine learning

Machine learning is defined as “a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty”77. Machine learning is an inductive method that can be used to process large quantities of data to produce bottom-up algorithms42. This makes machine learning suitable for discovering potential patterns within millions of quantitative nonverbal data points. Two machine learning algorithms—a random forest and a neural network model (multilayer perceptron; MLP)—that used the movement data as the input layer and interpersonal attraction as the output layer were constructed. To allow the machine learning algorithms to function as classifiers, participants were divided into high and low interpersonal attraction groups based on a median split78. Then, the dataset was randomly partitioned into a training (70%) and a test (30%) dataset.
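The classification setup can be sketched as follows, assuming features is a participants × features matrix and attraction holds each participant's interpersonal attraction score (both hypothetical variable names); the random seed is arbitrary.

```python
# A minimal sketch of the median split and the 70/30 train-test partition.
import numpy as np
from sklearn.model_selection import train_test_split

labels = (attraction > np.median(attraction)).astype(int)  # 1 = high attraction, 0 = low
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.30, random_state=42
)
```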

There were 827 candidate features for the input layer: bodily synchrony among the 17 joints and 10 joint angles42; facial synchrony among the 52 facial expressions (“Appendix 1”), with four different types of nonverbal synchrony included as candidates (mean cross-correlation score, absolute mean of cross-correlation scores, mean of non-negative cross-correlation scores, and maximum cross-correlation score); the mean, standard deviation, mean of the gradient, standard deviation of the gradient, maximum of the gradient, and maximum of the second gradient for each joint coordinate (i.e., X, Y, Z); the mean and standard deviation of the Euclidean distance for each joint for each 0.1 s interval; the mean, standard deviation, mean of the absolute gradient, and standard deviation of the absolute gradient for the joint angles; the means and standard deviations of the head rotation (i.e., pitch, yaw, roll); the means and standard deviations of the gradient of the head rotation; the means and standard deviations of the 52 facial expressions; the mean and standard deviation of the X and Y coordinates of the point of gaze; the percentage of valid data and the number of consecutive missing data points; and gender.

Two methods of feature selection were explored for the training set. First, features were selected using a correlation-based feature selection method, wherein features that correlated highly with the outcome variable, but not with each other, were retained79. Then, support vector machine recursive feature elimination80 was used to reduce the number of features and identify those that offered the most explanatory power. The test dataset was not included in the data used for feature selection. In total, 23 features were selected using this method (Table 2).

Table 2 Features selected.
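Continuing the hypothetical X_train/y_train from the sketch above, the second feature-selection step might be expressed with scikit-learn's recursive feature elimination wrapped around a linear support vector machine; this is an illustrative sketch, not the study's exact pipeline.

```python
# A minimal sketch of SVM recursive feature elimination on the training data only.
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=23, step=1)
selector.fit(X_train, y_train)
selected_columns = selector.get_support(indices=True)  # indices of the retained features
```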

Using five-fold cross-validation, the selected features were used to train two different machine learning models (i.e., random forest, MLP) in order to assess initial model performance. More specifically, five-fold cross-validation was used to validate and tune the model performance given the training dataset prior to applying the classifier to the holdout test data. Five-fold cross-validation divides the training set into five samples that are roughly equal in size. Among these samples, one is held out as a validation dataset, while the remaining samples are used for training; this process is repeated five times to form a composite validation accuracy score (i.e., the percentage of correctly predicted outcomes).
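A minimal sketch of this validation step is shown below, using scikit-learn defaults in place of the tuned hyperparameters reported in Table 5 and a hypothetical X_train_sel (the training data restricted to the selected features).

```python
# A minimal sketch of five-fold cross-validation for the two classifiers.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

models = [("random forest", RandomForestClassifier(random_state=42)),
          ("MLP", MLPClassifier(max_iter=2000, random_state=42))]
for name, model in models:
    scores = cross_val_score(model, X_train_sel, y_train, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.3f} (SD = {scores.std():.3f})")
```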

Statistical analyses

Data from participants who communicate with each other are vulnerable to violating the assumption of independence and are thus less appropriate for ANOVA and standard regression approaches81. Multilevel analysis “combines the effects of variables at different levels into a single model, while accounting for the interdependence among observations within higher-level units”82. Because neglecting intragroup dependence can bias statistical estimates, including error variance, effect sizes, and p values83,84, a multilevel model was used to analyze the data. Random effects arising from individual participants nested within dyads were accounted for, and a compound symmetry structure was used for the within-group correlation structure. Gender was included as a control variable, as previous research has found that females tend to report higher levels of social presence than their male counterparts85. In line with these studies, correlation analyses (Table 3) showed that gender correlated with several of the dependent variables. A summary of the results of the multilevel analyses is available in Table 4.

Table 3 Bivariate Pearson correlations of variables.
Table 4 Summary of multilevel analyses.
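The multilevel model described above can be sketched with a random intercept per dyad, which induces the compound-symmetry (exchangeable) within-dyad correlation structure; the column names below are hypothetical and the study's actual model specification may differ.

```python
# A minimal sketch of a multilevel model with participants nested within dyads.
import statsmodels.formula.api as smf

model = smf.mixedlm(
    "attraction ~ face * body + gender",  # hypothetical column names in a long-format DataFrame `df`
    data=df,
    groups=df["dyad_id"],                 # random intercept for each dyad
)
result = model.fit()
print(result.summary())
```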

Results

Manipulation check

To confirm that the manipulation of the nonverbal variables was successful, participants were asked if the following two sentences accurately described their experience (0 = No, 1 = Yes): “My partner's avatar showed changes in his/her facial expressions, such as eye and mouth movements” and “My partner's avatar showed changes in his/her bodily gestures, such as head and arm movements”. Eleven participants who belonged to 10 separate dyads failed the manipulation check; these participants and their partners were removed from the final data analyses (Ndyad = 10, Nparticipant = 20).

An additional 7 participants who belonged to 6 separate dyads reported that they recognized their interaction partners. These participants and their partners (Ndyad = 6, Nparticipant = 12) were also removed from data analyses, resulting in a final sample size of 248 participants (Ndyad = 124).

Interpersonal attraction

There was a significant main effect of facial movements on interpersonal attraction (Fig. 8), such that dyads that were able to see their partner’s facial movements mapped on their avatars felt higher levels of interpersonal attraction than those that were unable to see these facial movements (b = 0.09, p = 0.02, d = 0.30). In contrast, the availability of bodily movements did not significantly influence interpersonal attraction (b = − 0.02, p = 0.57). The interaction effect between facial and bodily movements was also non-significant (b = 0.05, p = 0.17).

Figure 8
figure 8

Mean interpersonal attraction by condition.

Affective valence

There was a significant interaction between facial and bodily movements (b = 0.46, p = 0.03, Fig. 9). Simple effects tests showed that while dyads that could see their partner’s facial movements described their experience more positively, this was only true when their partner’s bodily movements were also visible (b = 0.84, p = 0.01, d = 0.50); in contrast, the positive effect of facial movement on affective valence was non-significant when bodily movements were not visible (b = − 0.07, p = 0.80). These results suggest that dyads only described their experiences most positively when they were able to see both their partner’s bodily movements and their facial movements, lending partial support to studies that showed a preference for representation consistency86.

Figure 9
figure 9

Mean affective valence by condition.

Impression accuracy

Impression accuracy was significantly and positively influenced by the availability of facial movements (b = 0.06, p = 0.02, d = 0.34, Fig. 10). In contrast, being able to see one’s partner’s bodily movements did not influence impression accuracy (b = − 0.01, p = 0.60). The interaction between facial and bodily movements was also non-significant (b = 0.03, p = 0.27).

Figure 10
figure 10

Mean impression accuracy by condition.

Social presence

Neither the availability of facial movements (b = 0.04, p = 0.29) nor the availability of bodily movements (b = 0.04, p = 0.31) had a significant effect on social presence. The interaction effect between facial and bodily movements was also non-significant (b = 0.06, p = 0.16).

Extent of bodily movement

Dyads who were able to see their partner’s bodily movements being mapped on to their partner’s avatars moved their body more (b = 0.02, p < 0.0001), although this main effect was qualified by a significant interaction effect (b = 0.01, p = 0.048). Simple effects tests showed that dyads who could see their partner’s bodily movements moved more when their partner’s facial movements were also visible (b = 0.04, p < 0.001, d = 0.89); this effect of bodily movement was only marginally significant when their partner’s facial movements were not visible (b = 0.01, p = 0.09).

Extent of facial movement

In contrast to bodily movements, the visibility of one’s partner’s facial movements did not influence the extent to which dyads moved their faces (b = − 0.0004, p = 0.79). Neither the main effect of bodily movements (b = 0.001, p = 0.60) nor the interaction effect between facial and bodily movements were significant (b = 0.002, p = 0.18).

Nonverbal synchrony

The visibility of facial movements positively predicted synchrony in facial movements (b = 0.01, p < 0.001), while the presence of bodily movement did not predict facial synchrony (b = − 0.0002, p = 0.95); the interaction term between face and body was also non-significant (b = 0.00004, p = 0.99). Gender significantly predicted facial synchrony, such that females displayed higher facial synchrony than males (b = 0.02, p < 0.001).

Dyads that were able to see their partner’s bodily movements exhibited marginally higher levels of bodily synchrony compared to those that were unable to see these movements (b = 0.002, p = 0.09, d = 0.28). Neither the presence of facial movement nor gender significantly predicted synchrony in bodily movement (both ps > 0.10). The interaction term was also non-significant (b = − 0.001, p = 0.62).

To assess the robustness of the synchrony measure, we explored synchrony patterns across different time lags (Fig. 11) and found that synchrony scores decreased as the time lag increased for both facial and bodily synchrony, which suggests that the scores are representative of true synchrony42. That is, as the time lag between the two streams of each participant’s nonverbal data increases, the synchrony score approaches zero, which is the expected pattern, given that nonverbal synchrony is defined as the “temporal co-occurrence of actions”87. T-tests also showed that both synchrony scores were significantly different from zero (bodily synchrony: t(245) = 14.72, p < 0.001; facial synchrony: t(244) = 14.66, p < 0.001), with large effect sizes (Cohen’s d = 0.939 and 0.937 for bodily and facial synchrony, respectively).

Figure 11
figure 11

Averaged correlations of bodily (left) and facial (right) movements: represents changes in synchrony scores based on offset interval*.

Movement data and interpersonal attraction

Both classifiers were able to predict interpersonal attraction at an accuracy rate higher than chance, suggesting that automatically detected nonverbal cues can be used to infer interpersonal attitudes. After tuning the hyperparameters (Table 5) based on the cross-validation performance of the training set, the random forest model achieved a cross-validation accuracy of 67.33% (SD = 8.28%) and a test accuracy of 65.28%; the MLP model achieved a cross-validation accuracy of 68.67% (SD = 5.63%) and a test accuracy of 65.28% (majority class baseline: 51.39%). Confusion tables that depict sensitivity and specificity assessments for the two models are in Fig. 12.

Table 5 Hyperparameters and values.
Figure 12
figure 12

Confusion table for random forest model (left) and multi-layer perceptron model (right).

Discussion

The present study aimed to understand the relative and joint influence of facial and bodily cues on communication outcomes. Contrary to hypotheses based on behavioral realism, the inclusion of bodily gestures alone did not have a significant main effect on interpersonal attraction, social presence, affective valence, or impression accuracy. Additionally, when facial cues were not available, LIWC data suggested that participants felt more positively when bodily gestures were not available, compared to when they were. These results are in line with studies that did not find support for the conjecture that avatar movement would increase social presence or improve interpersonal outcomes30,31. At the same time, they appear to contradict previous research and theories suggesting that additional social cues and/or social realism lead to higher levels of social presence and more positive communication outcomes21,22,88,89. In contrast to the null effect of including bodily gestures, the present study found evidence that the presence of facial expressions can moderately improve communication outcomes across multiple dimensions, including interpersonal attraction, affective valence, and impression accuracy.

The null main effect of bodily gestures on interpersonal outcomes may, at least in part, be explained by the following mechanisms. First, participants may have been able to compensate for the lack of bodily cues with the other cues at their disposal (e.g., verbal cues). This explanation is in line with previous CMC theories (e.g., Social Information Processing Theory32), which hold that increased interaction time allows interactants to overcome the lack of nonverbal cues available. At the same time, the positive interpersonal effects of facial cues suggest that, at minimum, facial cues offered a unique value to participants within the current avatar-mediated context that bodily cues did not.

Second, bodily movements may have been less relevant than facial movements and speech within the context of the present study. Although we adopted visual and semantic referential tasks to encourage both nonverbal and verbal communication, the presence (or absence) of bodily movements was not an integral part of completing the tasks. In addition, because the participants were not immersed in the same virtual space (i.e., they communicated from separate rooms through a screen), it is possible that they lacked the common ground to effectively communicate via gestures. Considering that the interaction context heavily influences the communicational value of gestures90,91, the inclusion of gestures may have yielded more positive outcomes if participants had been communicating within a context where gestures carried higher semantic and practical value.

In addition to the specific requirements of the tasks performed by the participants, the experimental setup itself may have encouraged participants to focus on the avatar’s face, rather than its body. As depicted in Fig. 2, participants interacted with an avatar whose representation was limited to the upper body. This was an intentional choice primarily due to the limitations of the Kinect in tracking lower body joints. However, it is possible that the lack of ‘full body representation’ led to a cognitive bias favoring the face. Taken together with the results of the present study, it appears that upper body gestures within separate (‘non-shared’) virtual spaces may be relatively less important for dyadic interactions.

A final explanation for the null—and in some cases, negative—impact of bodily movements, however, may be that the technical limitations of the systems led to poor body tracking. While plausible, the fact that participants who were able to see their partner’s facial expressions and bodily movements described their experience the most positively suggests that, at the very least, technical limitations were not uniquely responsible for the negative impact of bodily movements on affective valence. That is, even when considering the technical limitations, having access to bodily gestures had a positive impact on affective valence when they were coupled with facial expressions. This is consistent with Aviezer and colleagues12 who argue that facial and bodily cues are processed as a unit rather than independently.

While the accuracy rate of the machine learning models was modest (approximately 65%), it is important to note that interpersonal attitudes are difficult for even human judges to predict. For example, judges who viewed videotaped interactions between two individuals were able to rate interpersonal rapport at an accuracy rate that was higher than chance, but the effect size was fairly small92 (i.e., r = 0.24). In addition, it is important to note that previous studies showed inconclusive evidence that machine learning could be applied to consistently predict interpersonal attitudes for a non-selective data set. For instance, the accuracy rates of previous studies42,51 were at chance level when the classifier was applied to the entire dataset, and were above chance only when data set selection was exclusive (i.e., increasingly removing interaction pairs that scored closer to the median). Similarly, the validation accuracy rate for Jaques and colleagues36 was close to chance level (approximately 5% higher than baseline), which is a relatively large difference from the testing set accuracy (approximately 20% higher than baseline), a limitation that is also noted by the authors. Although low, the validation and test accuracy rates of the present study are both approximately 15% higher than the baseline, offering stronger evidence that machine learning can be applied to the prediction of more complex interpersonal outcomes.

Investigating which cues most strongly influence avatar-mediated interactions can help researchers isolate the cues that people rely on to form affective and cognitive judgments about others and communication experiences using an inductive process. While the majority of extant studies have used deductive processes to test whether specific nonverbal cues would affect user perceptions of virtual interactions30,93,94, only a select number of studies have jointly relied on inductive processes (e.g., machine learning) to isolate cues that contribute most strongly to interpersonal outcomes36. Machine learning can help identify significant nonverbal cues for interpersonal outcomes through feature selection processes and model comparisons. Identifying and testing these cues can help inform theories of person perception and impression formation. Recent advancements in facial and motion tracking technology and computing power render this bottom-up approach particularly attractive for nonverbal theory development.

From a practical standpoint, identifying nonverbal cues with the strongest social influence can help VR designers and engineers prioritize features that should be available within virtual environments. Given the amount of resources that are being invested into developing social VR platforms, understanding where to focus development efforts can aid in allocating resources more effectively. For instance, the present study suggests that facial animations are critical for positive avatar-mediated interactions, especially when bodily movements are also rendered. As such, the development of avatars that are able to both express realistic facial expressions and credibly transition between expressions, coupled with technologies that can accurately track the user’s facial expressions in real time, could improve interpersonal outcomes and human–machine interactions. Within the context of immersive VR, however, most of the tracking technology has thus far focused on body tracking (e.g., Oculus Touch, HTC Vive Lighthouse). This bias is likely due to the fact that most of these systems rely on bodily nonverbal behavior as input to render the virtual environment appropriately. Additionally, the use of head-mounted displays makes it challenging to track facial expressions. The current findings offer some evidence that social VR platforms, immersive or not, may benefit from investing in technologies that can capture (or infer) and map facial expressions within avatar-mediated environments.

This investigation employed a novel technical setup that allowed for the activation and deactivation of specific nonverbal channels to study their individual and joint effects on interpersonal outcomes. Our setup differentiates itself from prominent social VR applications, which are generally limited to body tracking. While a small number of applications do support face tracking, these have remained relatively costly solutions that are not widely available. We demonstrate a solution capable of tracking both the face and body by combining ubiquitously available consumer electronics.

Outside the study of avatar-mediated environments, this setup could be adapted by nonverbal communication researchers to further understand the impact of specific nonverbal channels during FtF interaction and help address methodological challenges associated with manually coding nonverbal behavior or reduced ecological validity (e.g., having to block out specific body parts19). Additionally, with the increasing availability of large data sets of automatically detected nonverbal behavior, inductive processes can be leveraged to produce bottom-up algorithms42 that can help identify nonverbal patterns during specific interactions that cannot be perceived by the human eye.

Limitations

It is important to note the limitations associated with the present study. First, the technical setup of the present study focused on the tracking and rendering of nonverbal cues, but did not account for dimensions such as stereoscopic viewing or perspective dependency. This limits the generalizability of our findings to contexts wherein different VR technologies are utilized. Future studies would benefit from exploring the interplay between different technological affordances and the availability of nonverbal cues.

Second, our focus was limited to two nonverbal channels: body and face. As such, we were unable to explore the effects of additional nonverbal cues such as tone or intonation. While this is beyond the scope of the present study, future research should explore the impact of these vocal cues along with facial and bodily behavior to better understand the effects of various nonverbal channels on interaction outcomes.

Another limitation of the study lies in the relatively specific interaction context wherein participants were asked to collaborate on one visual and one semantic referential task. This decision was made primarily to avoid ceiling effects on impression formation58 and to control for the variance in communication content (e.g., extent of self-disclosure) that can influence interpersonal outcomes. However, it is likely that the task-centered nature of the interaction context restricted the social and affective aspects of the interaction, which may have limited the role of nonverbal communication. Furthermore, due to the collaborative nature of the task, participants may have been more prone to display favorable nonverbal cues. The specificity of the current context also reduces the generalizability of the current findings, as everyday interactions are characterized by a combination of both task-oriented and social content95,96. Future studies should employ different interaction contexts to understand potential boundary conditions.

Additionally, while we simultaneously varied facial and bodily cues for the visual referential task (see “Methods”), it is possible that participants found this task to be biased toward facial expressions because the image stimuli resembled emojis, rendering facial expressions more salient than bodily cues. Follow-up studies should thus sample different tasks to account for stimuli effects97.

Finally, the technical limitations associated with markerless tracking need to be addressed. While the present study used two of the most precise motion tracking systems that are currently available, there were still limitations in terms of the range of movements that the systems could track. For instance, participants needed to stay within a specific distance from the facial tracking camera in order to ensure smooth tracking (see “Methods”), and touching the face or turning the head completely away from the camera resulted in tracking errors. In addition, while our latency was within the established range for video-based communication (“Appendix 3”), it is unlikely that our system was able to reliably capture and render micro-expressions.

The Kinect was also limited in its tracking when there was an overlap between joints (e.g., when the participant crossed his or her arms) and for certain rotation angles. Because this tracking data was used to animate the avatars, it is probable that these technical limitations led to instances wherein the movements of the avatar appeared unnatural. While this was an inevitable limitation given the current state of the technology, more studies should be conducted as motion tracking technology continues to advance.

Conclusion

The present study found that people who are able to see their partner’s facial cues mapped on their avatars like their partners more and form more accurate impressions of their partner’s personality. Contrary to hypotheses, the availability of bodily cues alone did not improve communication outcomes. In addition, we found that machine learning classifiers trained with automatically tracked nonverbal data could predict interpersonal attraction at an accuracy rate approximately 15% higher than chance. These findings provide new insights into the individual and joint effects of two nonverbal channels in avatar-mediated virtual environments and expand on previous research suggesting that the automatic detection of nonverbal cues can be used to predict emotional states. This is particularly pertinent as technology makes it increasingly easy to automatically detect and quantify nonverbal behavior.