Culture modulates face scanning during dyadic social interactions

Recent studies have revealed significant cultural modulation of face scanning strategies, thereby challenging the notion of universality in face perception. Current findings are based on screen-based paradigms, which offer high degrees of experimental control but lack critical characteristics common to social interactions (e.g., social presence, dynamic visual saliency), and complementary approaches are therefore required. The current study used head-mounted eye tracking techniques to investigate the visual strategies for face scanning in British/Irish (in the UK) and Japanese adults (in Japan) who were engaged in dyadic social interactions with a local research assistant. We developed novel computational data pre-processing tools and data-driven analysis techniques based on Monte Carlo permutation testing. The results revealed, for the first time, significant cultural differences in face scanning during social interactions: British/Irish participants showed increased mouth scanning, whereas the Japanese group engaged in greater eye and central face looking. Both cultural groups further showed more face orienting during periods of listening relative to speaking, and during the introduction task compared to a storytelling game, thereby replicating previous studies testing Western populations. Altogether, these findings point to the significant role of postnatal social experience in specialised face perception and highlight the adaptive nature of the face processing system.

face tracking, the user was required to set the minimum number of feature points (threshold) that are used to estimate the bounding box. Although typically only a few points are required (e.g., 5 points), the threshold was increased to 15 points for the current study, given that the spatial accuracy and precision of face regions were crucial for investigating scanning behaviour. In addition, the user was required to specify the maximum number of frames to be processed, to avoid a decline in tracking quality over time since points can be lost across frames. For the present study, a maximum of 150 frames were processed before the script returned to the face detection stage. Furthermore, a pushbutton was included to manually trigger the return to the face detection stage at any time, in case the bounding box location could no longer be estimated accurately using the automatic algorithm. The flowchart in Supplementary Figure S2 summarises the face detection and tracking process.
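The detect-track loop described above can be sketched as follows. This is a minimal Python illustration of the control flow only (the study's pipeline was implemented in MATLAB); the helper names `detect_face`, `track_points`, and `reset_requested` are hypothetical placeholders for the detection stage, the point-tracking stage, and the manual pushbutton.

```python
MIN_POINTS = 15   # threshold used in the current study (a typical default would be ~5)
MAX_FRAMES = 150  # frames tracked before forcing a return to face detection

def process_video(frames, detect_face, track_points, reset_requested):
    """Return a bounding box (or None) for every frame.

    Tracking restarts whenever fewer than MIN_POINTS feature points
    survive, MAX_FRAMES have been processed, or a manual reset is
    requested.
    """
    boxes = []
    points = []          # currently tracked feature points
    frames_tracked = 0
    for frame in frames:
        needs_detection = (
            len(points) < MIN_POINTS
            or frames_tracked >= MAX_FRAMES
            or reset_requested(frame)
        )
        if needs_detection:
            points = detect_face(frame)           # face detection stage
            frames_tracked = 0
        else:
            points = track_points(frame, points)  # point tracking stage
            frames_tracked += 1
        boxes.append(bounding_box(points) if len(points) >= MIN_POINTS else None)
    return boxes

def bounding_box(points):
    """Axis-aligned box around the tracked feature points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

The key design point is that any of the three conditions (too few points, too many frames, manual reset) forces a fresh detection before tracking resumes.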

Data extraction
The location of the face region was now known for every frame and given by the edge coordinates of the bounding box. The face area was subdivided into an upper and a lower part as a proxy for the eye and mouth regions, respectively. This was done by splitting the bounding box at the midline (Supplementary Figure S1). The eye tracking data was loaded into MATLAB to extract the x- and y-coordinates of the gaze points, and each gaze point was associated with its corresponding scene frame. Each gaze point was then classified by checking whether its coordinates fell within the upper or lower face region (using the inpolygon function). For each participant, a binary event timeline was created: an entry was coded '1' if the gaze point fell within the lower/upper face, and '0' if not. A speech timeline was added to annotate periods as listening (coded '0') or speaking (coded '1'); this information was manually coded offline. Finally, an additional timeline indicated whether the gaze point was associated with a fixation (coded '1') or not (coded '0').
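The classification and timeline steps can be sketched as follows. The study used MATLAB's inpolygon on (possibly tilted) polygonal face regions; this simplified Python sketch assumes an upright rectangular bounding box split at its vertical midline, and the function names are illustrative only.

```python
def classify_gaze(gaze, box):
    """Classify one gaze point against an upright bounding box.

    gaze : (x, y) gaze coordinates in the scene frame
    box  : (x_min, y_min, x_max, y_max) face bounding box
    Returns 'upper', 'lower', or 'off-face'. The box is split at its
    vertical midline, with the upper half serving as a proxy for the
    eye region and the lower half for the mouth region.
    """
    x, y = gaze
    x_min, y_min, x_max, y_max = box
    if not (x_min <= x <= x_max and y_min <= y <= y_max):
        return "off-face"
    y_mid = (y_min + y_max) / 2.0
    # image coordinates: smaller y is higher up in the frame
    return "upper" if y <= y_mid else "lower"

def binary_timeline(gaze_points, boxes, region):
    """Per-frame binary event timeline: 1 if the gaze point for that
    frame falls in `region`, else 0 (also 0 when no box was tracked)."""
    return [int(b is not None and classify_gaze(g, b) == region)
            for g, b in zip(gaze_points, boxes)]
```

The speech and fixation timelines described above would be additional binary vectors of the same length, aligned frame by frame with this one.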

Coding performance
Manual checks were performed for 20% of data (10% per cultural group) collected for a separate study with the same paradigm. The mean accuracy was 99.02% (SD = 1.37%) for the upper face and 99.35% (SD = 0.97%) for the lower face. To code the upper and lower face and non-face regions for 1 minute of recording time, the semi-automatic method required 5 minutes and 16 seconds. Using a fully manual approach, gaze annotation took 11 minutes and 29 seconds (i.e., more than double the time).

Data-driven analysis of head-mounted eye tracking data
For screen-based eye tracking data, iMap 4,5 represents a data-driven method that aggregates gaze data across time and stimuli to produce density maps. Head-mounted eye tracking data, however, cannot simply be collapsed, given that the position, size, and angle of the face change with every frame. We applied linear transformations to re-map gaze points onto a normalised face template in a fully automatic fashion. Monte Carlo permutation testing (also named approximate permutation test or random permutation test) 6 was then used to identify cultural differences in gaze clustering.

Data extraction
To collapse gaze points across time and participants, the original absolute gaze coordinates that fell within the face region were re-expressed as relative coordinates with respect to the bounding box (rather than the scene frame), making them independent of the location, size, or angle of the face. This was achieved using the following steps: (1) Non-rectangular bounding boxes (non-tilted, four-point polygonal shapes) were first transformed by fitting a minimal bounding rectangle around the four vertices.
(2) Each bounding box and its corresponding gaze points were then rotated such that the top and bottom edges were aligned in parallel with the x-axis of the scene frame, i.e. such that the face was no longer tilted. Rotations did not need to be performed around a specific point. The angle α between the bottom edge of the bounding box and the x-axis was first computed to set up a rotation matrix R = [cos α, -sin α; sin α, cos α]. To perform clockwise rotations, the angle α is negative in this rotation matrix. R was then used to rotate the bounding box and the original gaze coordinates to obtain the new coordinates of each shifted vertex (vx', vy') and the new gaze coordinates (x', y'). (3) The new gaze coordinates were then expressed relative to the rotated bounding box by setting its vertices to v1 = (-1, -1), v2 = (1, -1), v3 = (1, 1), and v4 = (-1, 1), i.e. the origin represented the centre of the face (nose tip). (4) For every participant and for each condition, a grid was set up to map all relative gaze coordinates into a unified coordinate space. For this study, a 100 x 100 grid was used with the same vertices as the bounding box, i.e., v1 = (-1, -1), v2 = (1, -1), v3 = (1, 1), and v4 = (-1, 1). (5) Each relative gaze coordinate was then mapped into the grid by finding its location within the grid and filling the corresponding entry. This produced the density maps with gaze collapsed across time. (6) The density maps were smoothed to account for both measurement error and the fact that foveal visual attention is not confined to the precise coordinate position but is distributed within 1.5º to 2º of visual angle 7. In this study, smoothing was performed using a two-dimensional isotropic Gaussian kernel, with a kernel width corresponding to 2º (using imgaussfilt).
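Steps (2)-(5) can be sketched as follows. This is an illustrative Python translation of the described MATLAB procedure (the Gaussian smoothing of step (6), done with imgaussfilt in the study, is omitted here); the function names are hypothetical.

```python
import math

def rotate(point, alpha):
    """Rotate a point about the origin by angle alpha (radians), using
    R = [cos a, -sin a; sin a, cos a]. Positive alpha rotates
    counter-clockwise; a negative alpha undoes a counter-clockwise tilt."""
    x, y = point
    return (x * math.cos(alpha) - y * math.sin(alpha),
            x * math.sin(alpha) + y * math.cos(alpha))

def to_relative(gaze, box):
    """Express a gaze point relative to an upright bounding box, mapping
    the box corners to (-1, -1)..(1, 1) so that the origin is the face
    centre (nose tip)."""
    x, y = gaze
    x_min, y_min, x_max, y_max = box
    rx = 2.0 * (x - x_min) / (x_max - x_min) - 1.0
    ry = 2.0 * (y - y_min) / (y_max - y_min) - 1.0
    return (rx, ry)

def density_map(rel_points, n=100):
    """Accumulate relative gaze coordinates into an n x n grid spanning
    (-1, -1)..(1, 1), i.e. the unsmoothed density map of step (5)."""
    grid = [[0] * n for _ in range(n)]
    for rx, ry in rel_points:
        if -1.0 <= rx <= 1.0 and -1.0 <= ry <= 1.0:
            col = min(int((rx + 1.0) / 2.0 * n), n - 1)
            row = min(int((ry + 1.0) / 2.0 * n), n - 1)
            grid[row][col] += 1
    return grid
```

Because every gaze point is expressed in the (-1, 1) box coordinates, maps from different frames, participants, and face sizes can be summed cell by cell.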

Statistical analysis
Comparing each pixel individually when contrasting 100 x 100 pixel maps would result in 10,000 independent t-tests and introduce the multiple comparison problem. If the alpha-level is set to 0.05 for a single t-test, 500 pixels would be flagged as significant by chance alone. To adjust the alpha-level from a local scale (i.e., a single pixel) to a global scale (i.e., the entire map), several approaches are available. The Bonferroni correction method approximates an adjusted significance threshold by dividing the alpha-value by the number of tests (i.e., 0.05 / 10,000 = 0.000005 in the above example). This threshold, however, is too conservative because of spatial correlation: given the spatial smoothing, gaze points are to an extent spatially dependent, whereas the Bonferroni correction assumes independence between pixels, such that the adjusted threshold is overly strict. An alternative method is based on Random Field Theory (RFT) 8,9, which also provided the framework for iMap 4. The smoothness underlying the activation maps is estimated, and the Euler characteristic (the number of clusters or "blobs" after thresholding) 10 is determined at varying thresholds. The threshold at which 5% (0.05 alpha-level) of equivalent statistical maps would occur under the null hypothesis can then be computed. RFT requires a Gaussian distribution and sufficient smoothness, and represents a powerful method when these assumptions are met. However, RFT may produce unreliable results when data is not normally distributed, or for paradigms with a low number of participants, since maps may not necessarily be sufficiently smooth 10.
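The arithmetic above amounts to two one-line calculations:

```python
n_tests = 100 * 100     # one t-test per pixel of a 100 x 100 map
alpha = 0.05

# pixels flagged as significant purely by chance at the uncorrected level
expected_false_positives = n_tests * alpha   # 500 pixels

# Bonferroni-adjusted per-pixel threshold
bonferroni_alpha = alpha / n_tests           # 0.000005
```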
Another approach, and the one chosen here, is non-parametric permutation testing 6, which does not require data to be normally distributed, and has previously been implemented in two screen-based studies 11,12. In contrast to previous studies 11,12, however, we applied a cluster-based approach to correct for multiple comparisons (as opposed to, e.g., FDR). Permutation testing uses the observed data itself to generate a null distribution that describes a gaze distribution that is entirely random. This is obtained by exchanging the data across conditions or groups in all possible arrangements to compute the frequency distribution of test statistics (e.g., t-score). Consider a between-subject design with Participants A and B in one group, and Participants C and D in another group. By shuffling participants into all possible combinations, test statistics are calculated for AB (Group 1) vs CD (Group 2), AC (Group 1) vs BD (Group 2), and AD (Group 1) vs BC (Group 2) to obtain the null distribution, i.e. the distribution of test statistics if group allocations were random. Naturally, these permutations are typically conducted on data sets with larger participant numbers, and computing all possible permutations is time-consuming and computationally demanding. The Monte Carlo method 13 can approximate the null distribution by running many permutations, typically on the order of several thousand iterations. Once the null distribution is computed, the proportion of test statistics that result in larger values than the observed statistic (the Monte Carlo significance probability) can be calculated. For a difference to be deemed significant, this proportion should be small (e.g., less than 5%, or p < 0.05). Permutation testing only assumes exchangeability 6, i.e. data needs to be exchangeable across conditions or groups, and this assumption is met when exchanging data sets from different participants.
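The logic of the AB/CD example above can be sketched with a small exact permutation test on group means. This is an illustration of the general principle, not the study's cluster-based implementation; the study approximated the null distribution with 10,000 Monte Carlo iterations rather than full enumeration.

```python
from itertools import combinations

def mean(xs):
    return sum(xs) / len(xs)

def permutation_p_value(group1, group2):
    """Exact permutation test on the absolute difference of group means.

    All participants are pooled and reassigned to two groups of the
    original sizes in every possible way; the p-value is the proportion
    of reassignments whose statistic is at least as large as the
    observed one.
    """
    pooled = group1 + group2
    n1 = len(group1)
    observed = abs(mean(group1) - mean(group2))
    null = []
    for idx in combinations(range(len(pooled)), n1):
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in idx]
        null.append(abs(mean(g1) - mean(g2)))
    exceed = sum(1 for t in null if t >= observed)
    return exceed / len(null)
```

With two participants per group, as in the AB vs CD example, only a handful of reassignments exist, which is why real applications use larger samples and Monte Carlo sampling of the permutations.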
The Monte Carlo permutation test was implemented in MATLAB using the CoSMoMVPA toolbox 14 and FieldTrip toolbox 15. The statistical analysis involved cluster-based permutation tests, whereby a clustering procedure was applied to the original data set and to each permuted data set that was obtained by swapping participants between the cultural groups. Specifically, the clustering procedure involved grouping neighbouring pixels whose test statistic exceeded the critical value tcrit associated with a specified p-value threshold. This threshold was required to be set by the user, and a moderately strict threshold of 0.01 was chosen for the current study. To examine which clusters in the original map were significant, a cluster statistic was selected and compared against each permuted map. We chose the size of the cluster as the statistic for the present analyses. For every iteration, the statistic of each cluster in the original map was compared against that in the permuted map. After all iterations were performed (here the number of iterations was set to 10,000), the Monte Carlo significance probability was calculated; in other words, the proportion of test statistics that resulted in a larger value than the actual observed statistic of the original cluster was obtained. If this only occurred very few times, i.e. less than 5% (0.05) of times, this cluster was flagged as significant.
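The two core ingredients of this procedure, identifying suprathreshold clusters and comparing an observed cluster's size against permuted maps, can be sketched as follows. This is an illustrative Python simplification (using 4-connectivity on plain nested lists), not the CoSMoMVPA/FieldTrip implementation used in the study.

```python
def clusters_above_threshold(tmap, tcrit):
    """Return the sizes of 4-connected clusters of pixels whose test
    statistic exceeds tcrit, found via flood fill."""
    rows, cols = len(tmap), len(tmap[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if tmap[r][c] > tcrit and not seen[r][c]:
                size = 0
                stack = [(r, c)]
                seen[r][c] = True
                while stack:
                    i, j = stack.pop()
                    size += 1
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and tmap[ni][nj] > tcrit and not seen[ni][nj]):
                            seen[ni][nj] = True
                            stack.append((ni, nj))
                sizes.append(size)
    return sizes

def cluster_p_value(observed_size, permuted_maps, tcrit):
    """Monte Carlo significance probability for one observed cluster:
    the proportion of permuted maps whose largest cluster is at least
    as large as the observed cluster."""
    exceed = 0
    for pm in permuted_maps:
        sizes = clusters_above_threshold(pm, tcrit)
        if sizes and max(sizes) >= observed_size:
            exceed += 1
    return exceed / len(permuted_maps)
```

In the study, the permuted maps were generated by swapping participants between the cultural groups, and a cluster was flagged as significant when this proportion fell below 0.05.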

Supplementary Analysis
To strengthen the interpretations of the current study, the following briefly presents the methods and results of a subset of the data obtained from a separate screen-based face scanning study 16. The screen-based study aimed to examine cultural differences in face scanning, and also included other cognitive tasks and tested infant and adult age groups. Here, only the relevant face scanning data collected from the adult sample is presented. Most participants in this sample also took part in the present dyadic interaction study, and the results revealed cultural differences in face scanning. This suggests that the observed group differences in face scanning in the current dyadic interaction study are unlikely to be attributable solely to the local research assistant's individual-specific behaviour.

Methods
Thirty-one British (16 female) and 30 Japanese adults (17 female) participated in the screen-based study, of which 24 British and 26 Japanese participants also took part in the current dyadic interaction study. The same participant criteria as in the dyadic interaction study were applied.

Apparatus
A Tobii TX300 eye tracker (Tobii Technology, Sweden) was used to record eye movements at a sampling rate of 120 Hz. Face stimuli were presented on a 23" monitor.

Procedure
Participants sat in front of the monitor at a distance of 65 cm. A five-point calibration was conducted prior to the start of the free-viewing experiment. Face stimuli were interleaved with other cognitive tasks, and presented in either static conditions (image of a face) or dynamic conditions (video of a face speaking the syllables do re mi fa sol la ti do). In each condition, every participant was shown four female face identities with a neutral facial expression (two faces of White-British ethnicity and two of Japanese ethnicity). Each face trial started with a gaze-contingent central stimulus before the face stimulus (measuring 16.5˚ in height and 12.0˚ in width) was displayed for 18 seconds in colour at 1920 x 1080 resolution (see Supplementary Figure S3 for an example). Sound was muted and replaced with instrumental music. Face identities were never repeated and were matched in location and speech timings.

Results
Given that the original study included eye movement data from both infants and adults, fixations were obtained using GraFIX, a semi-automatic tool designed to parse eye-tracking data of varying quality 17. Regions-of-interest (ROIs) included the eyes, bridge, nose, and mouth (see Supplementary Figure S3), and fixation time for each ROI was expressed as a proportion of total inner-face fixation time.

Supplementary Figure S3: Regions-of-interest superimposed onto a face.
For the purpose of this Supplementary Analysis, a three-way ANOVA was conducted with factors Group (British, Japanese), Face Ethnicity (White-British, Japanese), and ROI (eyes, bridge, nose, mouth), separately for static and dynamic faces. Greenhouse-Geisser estimates were used when the sphericity assumption was not met.