Four dimensions characterizing comprehensive trait judgments of faces

People readily attribute many traits to faces: some look beautiful, some competent, some aggressive. Modern psychological theories argue that the hundreds of different words people use to describe others from their faces are well captured by only two or three dimensions, such as valence and dominance, a highly influential framework that has been the basis for numerous studies across social and developmental psychology, social neuroscience, and engineering applications. However, all prior work has used only a small number of words (12 to 18) to derive underlying dimensions, limiting conclusions to date. Here we employed deep neural networks to select a comprehensive set of 100 words that are representative of the trait words people use to describe faces, and to select a representative set of 100 faces. In two large-scale, preregistered studies we asked participants to rate the 100 faces on the 100 words (obtaining 2,850,000 ratings from 1,710 participants), and discovered a novel set of four psychological dimensions that best explain trait judgments of faces: warmth, competence, femininity, and youth. We reproduced these four dimensions in different regions around the world, in both aggregated and individual-level data. These results provide the most comprehensive characterization of face judgments to date.

Here we argue that to understand the true dimensionality of trait judgments from faces, it is essential to investigate a more comprehensively sampled set of trait words. To meet this challenge, we assembled an extensive list of trait words that people use to describe faces from multiple sources 12-15,19,21,25-31,33,38,39 and applied a pre-trained neural network to derive a representative subset of 100 traits (Fig. 1a-d). Similarly, we combined multiple extant face databases and applied a pre-trained neural network to derive a representative subset of 100 face images (Fig. 1e-h) [see Methods]. We verified that the 100 selected traits were representative of the trait words people spontaneously generated for the selected 100 face images (Fig. 2a-b), and that the 100 selected face images were representative of the structural physiognomy of natural faces (Fig. 2c-d), although we note that only Caucasian faces with no emotional expressions were included [see Methods]. We collected ratings of the 100 faces on the 100 traits both sparsely online (Study 1) [750,000 ratings from 1,500 participants, with repeated ratings for assessing within-subject consistency for every trait] and densely on-site (Study 2) [10,000 ratings from each of 210 participants across North America, Latvia, Peru, the Philippines, India, Kenya, and Gaza]. All experiments were preregistered on the Open Science Framework (see Methods).

Results
Four dimensions underlie comprehensive trait judgments of faces

Study 1 examined the underlying dimensions of the ratings that participants had given to the faces (ratings aggregated across participants), first applying an exploratory method (exploratory factor analysis [EFA]; preregistered) and subsequently a confirmatory method with cross-validation (an autoencoder artificial neural network [ANN]). We confirmed that these ratings showed sufficient variance. We determined the optimal number of factors to retain in EFA using five widely recommended methods 50,51 (see Methods), as solutions are considered most reliable when multiple methods agree.
Four methods (Horn's parallel analysis, Cattell's scree test, optimal coordinates, and empirical BIC) all indicated that the optimal number of factors to retain was four (Supplementary Fig. 3a).
EFA was thus applied to extract four factors using the minimum residual method, and the solutions were rotated with oblimin for interpretability. The four factors explained 31%, 31%, 11%, and 12% of the common variance in the data (85% in total; 87% if five factors were extracted) and were weakly correlated (r13 = -0.33, r14 = -0.23, r23 = 0.21, r24 = 0.33 [ps < 0.05]; r12 = -0.15, r34 = 0.12 [ps > 0.05]). None of the factors were biased by words with particularly low or high within-subject consistency or between-subject consensus, and the trait words occupied the four-dimensional space fairly homogeneously (Fig. 3). We interpreted these four factors as describing judgments of warmth, competence, femininity, and youth.

To corroborate the four dimensions discovered with EFA, we applied an approach with minimal assumptions: artificial neural networks (ANNs) with cross-validation to compare different factor structures (see Methods). Autoencoder ANNs with one hidden layer that differed in the number of neurons (ranging from 1 to 10) were constructed (Fig. 4a). These ANNs were trained on half of the data (i.e., aggregated ratings across half of the participants) and tested on the other, held-out half (the Adam optimization algorithm 52 and a mean squared error loss function with a batch size of 32 and 1,500 epochs were used to train the ANNs, repeated for 50 iterations). Both linear and nonlinear activation functions were examined (Fig. 4b). Performance of the best configuration (i.e., linear activation functions in both the encoder and decoder layers) increased substantially as the number of neurons in the hidden layer increased from 1 to 4 (explained variance on the test data increased by 18%, 5%, and 5%, respectively); the improvement was trivial beyond 4 neurons (less than 1%) [Fig. 4c].
Critically, the four-dimensional representation learned by the ANN reproduced the four dimensions discovered with EFA (mean rs = 0.98, 0.92, 0.91, 0.94 [SDs = 0.01, 0.05, 0.02, 0.05] between factor loadings from EFA and the ANN's decoder-layer weights after varimax rotation) and confirmed good performance (explained variance with linear activation functions was 75% [SD = 0.6%] on the test data, comparable to PCA).

Comparison with existing dimensional frameworks
Prior work 1,13,14,35,37 suggests that the various words people use to describe faces can be represented by two or three dimensions. Our findings support the general idea of a low-dimensional space, but revealed four dimensions that differ from those previously proposed. This discrepancy was not explained by methodological differences: we reanalyzed our data using principal components analysis (PCA), a method used in prior work 13,14,37, and reproduced the same four dimensions as reported above (Supplementary Fig. 5a).
Instead, the four-dimensional space did not appear in previous studies because of limited sampling of traits: we interrogated two subsets of our data, each consisting of 13 traits that corresponded to those used in the discovery of the two most popular prior dimensional frameworks (the 2D and 3D frameworks 13,37). The four-dimensional space was not evident when analyses were restricted to these two small subsets of traits; instead, we reproduced the prior 2D and 3D frameworks, respectively (Table 1).

Table 1. Factor loadings from EFA on subsets of 13 traits used in previous studies. a, Factor loadings from EFA on the subset of data corresponding to 13 traits (first column) that are the same as or most similar to those used in a prior study that discovered the popular 2D framework 13 (first column, in brackets). Two factors (the optimal number of factors as indicated by both Cattell's scree test and empirical BIC) were extracted and rotated with oblimin. The largest absolute loading across factors for each trait is highlighted in bold. b, Factor loadings from EFA on the subset of data corresponding to 13 traits (first column) that are the same as or most similar to those used in a prior study that discovered the popular 3D framework 37 (first column, in brackets). Three factors (the optimal number of factors as indicated by Cattell's scree test, the optimal coordinates index, Velicer's MAP test, and empirical BIC) were extracted and rotated with oblimin. The largest absolute loading across factors for each trait is highlighted in bold.
We next showed that the first two dimensions (warmth, competence) discovered from comprehensive trait judgments were different from the two dimensions of the popular prior 2D framework (valence, dominance). Analyses with the subset of 13 traits (Table 1a) showed that, replicating prior findings 13, judgments of traits such as sociable, trustworthy, responsible, and weird were represented by the valence dimension (absolute rs = 0.94, 0.88, 0.86, 0.85 between factor scores on the valence dimension and ratings for these traits across the 100 faces), but less well represented by the warmth dimension (absolute rs = 0.47, 0.67, 0.44, 0.23; see also Supplementary Fig. 4a); the valence and warmth dimensions were moderately correlated (absolute r = 0.41 between factor scores). Since using PCA we replicated both the four dimensions from our full dataset and the 2D framework from the subset of 13 traits, we repeated the above analyses using PCA scores, which confirmed that the valence and warmth dimensions best represented different types of trait judgments (r = 0.09, p = 0.370). Similarly, as previously found 13, judgments of aggressive and submissive were represented by the dominance dimension (absolute rs = 0.94, 0.95), but not by the competence dimension (absolute rs = 0.15, 0.14, ps > 0.05; see also Supplementary Fig. 4a); the two dimensions were not significantly correlated (r = 0.01, p = 0.894). These results were corroborated by analyses using PCA scores (r = -0.09, p = 0.383).
Finally, we directly compared how well different frameworks characterized trait judgments of faces. Using linear combinations of the traits with the highest loadings on each dimension as regressors (two per dimension, because only two traits loaded on one of the dimensions in the 3D framework; Table 1b), we found that the four-dimensional framework explained the variance of 82% of the trait judgments (those not part of the linear combinations) better than did any of the existing frameworks (Supplementary Fig. 5b; mean adjusted R-squared across all predictions was 0.81 for the four-dimensional framework, 0.72 for the 3D framework, and 0.72 for the 2D framework).

Robustness of the four dimensions
We quantified the robustness of our results across both different numbers of trait words and different numbers of participants. First, we removed trait words one by one and repeated EFA, extracting four factors as before (all pairs of trait words were ranked from most to least similar, and the trait with the lower clarity rating was removed from each pair). The four dimensions discovered from the full set versus the subsets of traits were highly correlated (Fig. 5a; see Supplementary Table 2a for the complete list of correlations). Second, we randomly removed participants one by one (50 randomizations each) and used the new aggregated ratings for EFA, showing that the four dimensions discovered from the full dataset were robust to participant sample size (Fig. 5b; Tucker indices of factor congruence > 0.95 for all sub-datasets with no fewer than 19 participants per trait).
Finally, we extracted the smallest subset of specific trait words that still yielded the four-dimensional space discovered from the full dataset: a subset of 18 trait words that could be used most efficiently in future studies when collecting ratings for a larger set of traits is not feasible (Supplementary Table 2b).

Generalizability across different countries and regions
Prior work has reported both common 3 and discrepant 13,14,24,35,37 dimensions in different cultures. To test the generalizability of our findings, we conducted a second preregistered study to collect data across seven different regions of the world. We first analyzed the aggregate-level ratings for each sample (preregistered; we confirmed these ratings had satisfactory consistency and consensus, see Methods).
We began by asking whether the seven samples shared a similar correlation structure (the Pearson correlation matrix across trait ratings) with the sample in Study 1, using representational similarity analysis (RSA) 33. Parallel analysis, optimal coordinates, and empirical BIC all showed that a four-dimensional space was most common across samples (in 5 of 7 samples: North America, Latvia, Peru, the Philippines, and India) [Fig. 6a and Supplementary Fig. 3b-h]. We therefore applied EFA to extract four factors from each sample. The warmth, competence, femininity, and youth dimensions emerged in multiple samples (interpreted on the basis of the factor loadings shown in Supplementary Fig. 6).
We further computed Tucker indices of factor congruence (the cosine similarity between pairs of factor-loading vectors), which confirmed that the four-dimensional space was largely reproduced across samples (Fig. 6b); as expected, however, reproducibility was attenuated by data quality (as assessed by within-subject consistency; Fig. 6c).
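For concreteness, a Tucker congruence index between two factor-loading vectors can be computed in a few lines. The sketch below uses made-up loadings; the function and variable names are ours, not the study's:

```python
import numpy as np

def tucker_congruence(x, y):
    """Tucker's congruence coefficient: the cosine of the angle between
    two factor-loading vectors (1.0 = identical loading pattern)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

# Hypothetical loadings of five traits on a "warmth" factor in two samples
sample_a = [0.81, 0.75, 0.62, -0.40, -0.55]
sample_b = [0.78, 0.80, 0.58, -0.35, -0.60]
print(round(tucker_congruence(sample_a, sample_b), 3))
```

Values above roughly 0.95 are conventionally read as indicating that two factors are essentially equivalent, which is the criterion the robustness analyses above rely on.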

Reproducibility across individual participants
So far, we have reproduced the four-dimensional space across samples, but we have not ruled out the possibility that this space might be an artifact of aggregating data across participants. Could the same four-dimensional space be reproduced in a single participant? This important question has been difficult to address, since one needs complete data per participant. We met this challenge by collecting ratings on all traits for all faces from every participant in Study 2 (requiring approximately 10 hours of testing per participant; see Methods).
We first performed RSA to investigate whether single participants (n = 86 who had complete datasets for all traits after data exclusion; see Methods) shared the correlation structure of our Study 1 sample. RSA values varied considerably across participants (range = [0.14, 0.85], M = 0.56, SD = 0.16) and, as expected, were attenuated by data quality as assessed by within-subject consistency (Fig. 7a-b).
We next analyzed the dimensionality of each individual dataset. Parallel analysis (preregistered) showed that a four-dimensional space was most common (Fig. 7c) but, again, attenuated by data quality (four-dimensional spaces were found for data with higher within-subject consistency than data that produced spaces of other dimensionality [unpaired t-test, t(34.57) = 3.29, p = 0.001]). We therefore applied EFA to extract four factors from each participant's dataset and computed their factor congruence with the data from Study 1. The four-dimensional space was successfully reproduced in individual participants (see examples of factor loading matrices in Supplementary Fig. 7a, and Tucker indices for all participants in Supplementary Fig. 7b), but we also found considerable individual differences, in line with prior research 54.

Discussion
Across two large-scale, preregistered studies we found that comprehensive trait judgments of faces are best described by a four-dimensional space (Figs. 3 and 4), with dimensions interpreted as warmth, competence, femininity, and youth (Supplementary Fig. 4). We showed that our divergence from prior work was due not simply to methodological differences (Supplementary Fig. 5a), but to the prior lack of comprehensively and representatively sampled trait words (Figs. 1-2 and Table 1).
We showed that the warmth and competence dimensions reported here differ from the valence and dominance dimensions previously proposed. These findings help reconcile studies of face perception with the broader social cognition literature, which has long theorized that warmth and competence are two universal dimensions of social cognition 11. The other two dimensions we found, femininity and youth, are likely linked to overgeneralization 27 and corroborate recent neuroimaging findings on social categorization from face perception 32,55.
This four-dimensional space was largely reproduced across samples from different regions, even using different languages (Spanish in Peru) [Fig. 6 and Supplementary Fig. 6], as well as in individual participants (although this was more difficult to assess, due to data quality) [Fig. 7 and Supplementary Fig. 7]. However, despite the predominance of the four-dimensional space, we also found notable variation across samples and individuals (Figs. 6b, 7c). Since the sources of this variation are unknown and may largely reflect measurement error (Figs. 6c, 7b), we refrain from drawing any specific conclusions about cultural differences, for which larger-scale studies focusing on cultural effects will be needed. Similarly, conclusions about individual differences will require future studies that collect much denser, and likely longitudinal, data in individual participants.
Face stimuli incorporating various races or emotional expressions will likely modify the dimensions of face judgments 15,24,27, as will viewing angle, background, and other context effects. Our findings provide the most comprehensive characterization of trait judgments from the physiognomy of faces alone, yielding candidate mental dimensions to investigate with respect to all these further variables, as well as in neuroimaging studies of face judgments 56.

Sampling of trait words
Here we follow the definition of a biological trait as a temporally stable characteristic. Traits in our study include personality traits as well as other temporally stable characteristics that people spontaneously infer from faces, such as age, gender, race, socioeconomic status, and social evaluative qualities (Supplementary Fig. 1a; e.g., "young", "female", "white", "educated", "trustworthy"). By contrast, we excluded state attributions, such as "smiling" or "thinking" (words that can describe both trait and state variables were not excluded; e.g., we included "happy," but disambiguated its usage as a trait in our instructions to participants: "A person who is usually cheerful").
Our goal was to representatively sample a comprehensive list of trait words that are used to describe people from their faces. We derived a final set of 100 traits (Supplementary Table 1) through a series of combinations and filters (detailed below; also in our preregistration at https://osf.io/6p542). These 100 traits were further verified to be representative of words that people freely generate to describe trait judgments of our face stimuli (Fig. 2a-b).
To derive the final set of trait words, we first gathered an inclusive list of 482 adjectives and 6 nouns that covered all major categories of trait judgments of faces (demographic characteristics, physical appearance, social evaluative qualities, personality, and emotional traits) from multiple sources 12-15,19,21,25-31,33,38,39. Many of the 482 adjectives were synonyms or antonyms. To avoid redundancy while conserving semantic variability, we sampled these adjectives according to three criteria: semantic similarity (detailed below), clarity in meaning (from an independent set of 29 MTurk participants), and frequency of usage (detailed below). For words with similar meanings, clarity was the second selection criterion (the word with the highest clarity was retained). For words with similar meanings and clarity, usage frequency was the third selection criterion (the word with the highest usage frequency was retained).
To quantify the semantic similarity between the 482 adjectives, we represented each of them as a vector of 300 computationally extracted semantic features, using the neural network for word embeddings and text classification provided within the FastText library 40; this network had been trained on Common Crawl data of 600 billion words to predict the identity of a word given its context. We then applied hierarchical agglomerative clustering (HAC) to the word vectors, based on their cosine distances, to visualize their semantic similarities. To quantify clarity of meaning, we obtained clarity ratings from an independent set of participants tested via MTurk (N = 31, 17 males; age M = 36 years, SD = 10).
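The clustering step can be sketched as follows. This is a minimal illustration only: random stand-in vectors replace the 300-dimensional FastText embeddings, and the word count, dimensionality, and cluster count are arbitrary choices of ours:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Stand-in for FastText embeddings: the study used 482 adjectives x 300
# dimensions; here we fake 10 "words" x 8 dimensions for illustration.
word_vectors = rng.normal(size=(10, 8))

# Pairwise cosine distances, then average-linkage agglomerative clustering
dist = pdist(word_vectors, metric="cosine")
tree = linkage(dist, method="average")

# Cut the dendrogram into 3 clusters of semantically similar words
labels = fcluster(tree, t=3, criterion="maxclust")
print(labels)
```

The resulting cluster labels group words by embedding similarity; in the study, such groupings guided which near-synonyms were candidates for removal.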
To quantify usage frequency, we obtained the average monthly Google search frequency for the bigram formed by each adjective (i.e., the adjective followed by the word "person"), using the keyword research tool Keywords Everywhere (https://keywordseverywhere.com/).
The 94 adjectives representatively sampled using the above procedures, together with the additional 6 nouns, constituted our final set of 100 trait words. To verify the representativeness of these 100 trait words, we compared the distribution of our selected words with that of 973 words that human subjects freely generated to describe their spontaneous impressions of the same faces (see Supplementary Fig. 1a and Methods below), using the 300 computationally extracted semantic dimensions (Fig. 2a-b).
To ensure that the dimensionality of the meanings of the words that we used was not limiting the dimensionality of the four factors we discovered, we derived a similarity matrix among our 100 words using the FastText vectors of their meanings in the specific one-sentence definitions we gave to participants (Supplementary Table 1; basic stop-words such as "a", "about", "by", "can", "often", "others" were removed from the one-sentence definitions for the computation of vector representations), and then conducted factor analysis on the similarity matrix. Parallel analysis, the optimal coordinates index, and Kaiser's rule all suggested 13 dimensions; Velicer's MAP suggested 14, and empirical BIC suggested 5 (empirical BIC penalizes model complexity). We used EFA to extract 5 and 13 factors using the same method as for the trait ratings (13 factors explained the same common variance as 14 factors, 70%; 5 factors explained 60%; factors were extracted with the minimum residual method and rotated with oblimin to allow for potential factor correlations). None of the dimensions obtained bore resemblance to our four reported dimensions, arguing that the mere semantic similarity structure of our 100 trait words did not constrain the derivation of the four factors that we report.

Sampling of face images
Our goal was to derive a representative set of neutral, frontal, white faces of high quality (clear, direct gaze, frontal, unoccluded, and high resolution) that are diverse in facial structure. We aimed to maximize variability in facial structure while controlling for factors such as race, expression, viewing angle, gaze, and background, which our present project did not intend to investigate. We first combined 909 high-resolution photographs of male and female faces from three publicly available face databases: the Oslo Face Database 43, the Chicago Face Database 42, and the Face Research Lab London Set 41. We then excluded faces that were not front-facing, lacked direct gaze, or wore glasses or other adornments obscuring the face. We further restricted the set to images of Caucasian adults with neutral expressions. This yielded a set of 426 faces from the three databases.
To reduce the size of the stimulus set while conserving variability in facial structure, we sampled from the 426 faces using maximum variation sampling. For each image, the face region was first detected and cropped using the dlib library 44, and then represented with a vector of 128 computationally extracted facial features for face recognition, using a neural network provided within the dlib library that had been trained to identify individuals across millions of faces of all different aspects and races with very high accuracy 44. Next, we sampled 50 female faces and 50 male faces that respectively maximized the sum of the Euclidean distances between their face vectors. Specifically, a face image was first randomly selected from the female or male sampling set, and then other images of the same gender were selected so that each newly selected image had the farthest Euclidean distance from the previously selected images. We repeated this procedure with 10,000 different initializations and selected the sample with the maximum sum of Euclidean distances. We repeated the whole sampling procedure 50 times to ensure convergence of the final sample. All 100 images in the final sample were high-resolution color images with a uniform grey background, with the eyes at the same height across images, and were cropped to a standard size. See preregistration at https://osf.io/6p542.
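One reading of the greedy procedure described above is farthest-point sampling, which can be sketched with simulated vectors standing in for the 128-dimensional dlib face descriptors (the function name, data, and the nearest-neighbor distance criterion are our illustrative choices, not the published code):

```python
import numpy as np

def max_variation_sample(vectors, k, seed=0):
    """Greedy farthest-point sampling: start from a random item, then
    repeatedly add the item farthest (in Euclidean distance) from the
    already-selected set."""
    rng = np.random.default_rng(seed)
    n = len(vectors)
    selected = [int(rng.integers(n))]
    while len(selected) < k:
        # Distance from each candidate to its nearest already-selected item
        d = np.min(
            np.linalg.norm(vectors[:, None, :] - vectors[selected][None, :, :], axis=2),
            axis=1,
        )
        d[selected] = -np.inf  # never re-pick a selected item
        selected.append(int(np.argmax(d)))
    return selected

# Toy stand-in for face vectors: 30 "faces" x 4 features
rng = np.random.default_rng(1)
faces = rng.normal(size=(30, 4))
print(max_variation_sample(faces, 5))
```

In the study, this kind of greedy pass was restarted from many random initializations and the sample with the maximum summed distance was retained; the sketch shows a single pass only.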
To verify the representativeness of our selected 100 face images, we again performed UMAP analysis 46 to compare the distribution of our selected faces with (a) N = 632 neutral, frontal, white faces from a broader set of databases 47-49 (Fig. 2c-d) and (b) N = 5,376 white faces "in the wild" 57,58 that varied in angle, gaze, facial expression, lighting, and background (Supplementary Fig. 1b), using the 128 computationally extracted facial identity dimensions 44 as well as 30 traditional facial-metric dimensions 42 (Supplementary Fig. 1c).

Freely generated trait words
To verify that our selected 100 trait words were indeed representative of the trait judgments people spontaneously make from faces, we collected an independent dataset from participants who freely generated words about the person that came to mind upon viewing each face. As preregistered, 30 participants were recruited via MTurk (see preregistration at http://bit.ly/osfpre4); departing from the preregistration, we decided to include participants of any race rather than only Caucasian participants (27 participants were white, 3 were black).
Participants viewed the 100 face images one by one, each for 1 second, and typed in the words (preferably single-word adjectives) that came to mind about the person whose face they just saw.
Participants could type in up to ten words and were encouraged to type in at least four (the number of words entered per trial, i.e., by a participant for a face, ranged from 0 words [for 8 trials] to 10 words [for 190 trials], with a mean of 5 words). There was no time limit; participants clicked "confirm" to move on to the next trial when they had finished entering all the words they wanted for the current trial. All data can be accessed at https://osf.io/4mvyt/.

Study 1 Participants
All studies in this report were approved by the Institutional Review Board of the California Institute of Technology, and informed consent was obtained from all participants. We predetermined our sample size for Study 1 based on a recent study that investigated the point of stability for trait judgments of faces 59: across 24 traits, a stable average rating could be obtained from a sample of 18 to 42 participants (ratings were elicited on a 7-point scale, the acceptable corridor of stability was ±0.5, and the confidence level was 95%). Based on these findings, we preregistered a sample size of 60 participants for each trait (at https://osf.io/6p542).
Participants were recruited via MTurk (N = 1,500 (800 males); age M = 38 years, SD = 11; the median educational attainment was "some post-high-school, no bachelor's degree"). All participants were required to be native English speakers located in the U.S., 18 years of age or older, with normal or corrected-to-normal vision, an educational attainment of high school or above, and a good MTurk participation history (approval rating ≥ 95%).
We also collected data about whether our participants were currently being treated for psychiatric or neurological illness. The majority of our participants (79.7%) were not. All dimensional analyses reported in the main text on the full sample were repeated on those 79.7% of participants, and the results corroborated all findings from the full dataset (Tucker indices of factor congruence for the four dimensions = 1.00, 1.00, 0.99, 0.99).

Study 1 Procedures
All experiments in Study 1 were completed online via MTurk. Considering the large amount of time it would take for a participant to complete ratings for all 100 traits and 100 faces, we divided the experiment into 25 modules: the 100 traits were randomly shuffled once and divided into 25 modules, each consisting of 4 traits. Each participant completed one module.
To encourage participants to use the full range of the rating scale, we briefly showed all faces (in five arrays of 20 each) at the beginning of a module, so that participants had a sense of the range of faces they were going to rate. In each module, participants rated all faces on each of the four traits (in random order) in the first four blocks; in the last (fifth) block they re-rated all faces on the trait they had been assigned in the first block, thus providing sparse within-subject consistency data.
At the beginning of each block, participants were instructed on the trait they were asked to evaluate and were provided with a one-sentence definition of the trait (Supplementary Table 1). Participants viewed the faces one by one in random order (each for 1 second) and rated each face on a trait using a 7-point scale (by pressing the number keys on the computer keyboard). Participants could enter their ratings as soon as the face appeared, or within four seconds after the face disappeared. The orientation of the rating scale in each block was randomized across participants. At the end of the experiment, participants completed a brief questionnaire on demographic information. See preregistration at https://osf.io/6p542.

Measures of reliability in Study 1
Data were first processed following three preregistered exclusion criteria (see preregistration at https://osf.io/6p542): of the full sample with a registered size of N = 1,500 participants and L = 750,000 ratings, n = 48 participants and l = 27,491 ratings were excluded from further analysis. Each of the 100 traits was rated twice for all faces by non-overlapping subsets of participants (ca. n = 15 per trait). As preregistered, we applied linear mixed-effects modeling to assess within-subject consistency, which adjusted for non-independence among repeated individual ratings by incorporating both fixed effects (constant across participants) and random effects (varying across participants). Ratings from every participant for every face collected at the second time point were regressed on those collected at the first (ca. l = 1,445 pairs of ratings per trait) while controlling for the random effect of participants.
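As a simplified illustration of test-retest (within-subject) consistency: the study fitted a linear mixed-effects model with a random effect of participant; the sketch below instead simply correlates each simulated participant's first and second ratings and averages, which ignores the random-effect adjustment but conveys the basic quantity. All data here are simulated; the sample sizes and noise level are our arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated test-retest data: 15 participants x 100 faces on a 7-point
# scale; second rating = first rating plus Gaussian noise
first = rng.integers(1, 8, size=(15, 100)).astype(float)
second = np.clip(first + rng.normal(0, 1, size=first.shape), 1, 7)

# Per-participant correlation of first vs second ratings, then average
# (a crude stand-in for the preregistered mixed-effects regression)
per_participant_r = [np.corrcoef(f, s)[0, 1] for f, s in zip(first, second)]
print(round(float(np.mean(per_participant_r)), 2))
```

A mixed-effects model additionally pools information across participants and corrects for participant-level shifts, so the two approaches need not agree exactly on real data.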
As preregistered, we assessed the between-subject consensus for each trait with intraclass correlation coefficients (ICC(2,k)), using ratings of every face by every participant (ca. n = 58 participants and l = 5,780 ratings per trait). A high intraclass correlation coefficient indicates that the total variance in the ratings is mainly explained by the variance across faces rather than across participants. We observed excellent between-subject consensus (ICCs greater than 0.75) for 93 of the 100 traits, and good between-subject consensus for the remaining 7 traits (ICCs greater than 0.60) [see Fig. 3].
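ICC(2,k) can be computed directly from two-way ANOVA mean squares (following the standard Shrout-Fleiss convention). The sketch below uses simulated ratings with a strong face effect, so consensus comes out high; the function and data are ours, not the study's:

```python
import numpy as np

def icc_2k(x):
    """ICC(2,k) from a targets x raters matrix, via two-way ANOVA mean
    squares (Shrout & Fleiss convention: two-way random, average measures)."""
    x = np.asarray(x, float)
    n, k = x.shape
    grand = x.mean()
    msr = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)  # targets (faces)
    msc = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)  # raters
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)

# Simulated ratings: 100 faces x 58 raters, each face with its own mean
rng = np.random.default_rng(3)
face_effect = rng.normal(4, 1, size=(100, 1))
ratings = np.clip(face_effect + rng.normal(0, 1, size=(100, 58)), 1, 7)
print(round(float(icc_2k(ratings)), 2))
```

With many raters, ICC(2,k) approaches 1 whenever there is any reliable face-level signal, which is why average-measures ICCs are the natural reliability index for aggregated ratings.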
Determination of the optimal number of factors

As recommended 50,51,60,61, five methods were used to determine the optimal number of factors to retain in EFA. No single method is regarded as the best; solutions are considered most reliable when multiple methods agree. Parallel analysis retains factors that are not simply due to chance, by comparing the eigenvalues of the observed data matrix with those of multiple randomly generated data matrices matching the sample size of the observed data matrix. Prior studies have shown that parallel analysis estimates the number of factors accurately and consistently across different conditions (e.g., the distributional properties of the data) 60,61. Cattell's scree test retains factors to the left of the point from which the plotted, ordered eigenvalues can be approximated with a straight line (i.e., it retains factors "above the elbow"). The optimal coordinates index provides a non-graphical solution to Cattell's scree test based on linear extrapolation. The empirical Bayesian information criterion (eBIC) retains factors that minimize the overall discrepancy between the population's and the model's predicted covariance matrices while penalizing model complexity. Velicer's minimum average partial (MAP) test is "most appropriate when component analysis is employed as an alternative to, or a first-stage solution for, factor analysis" 62; it was included in our study due to its popularity. MAP retains the number of components that, once partialed out, yields the lowest average squared partial correlation.
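Horn's parallel analysis, the workhorse among these methods, can be sketched as follows. The simulated dataset is generated from two latent factors, so the expected answer is two; the 95th-percentile threshold and all sizes are our illustrative choices:

```python
import numpy as np

def parallel_analysis(data, n_sims=200, seed=0):
    """Horn's parallel analysis: count the leading observed eigenvalues
    that exceed the 95th percentile of eigenvalues from random data of
    the same shape."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    sims = np.empty((n_sims, p))
    for i in range(n_sims):
        noise = rng.normal(size=(n, p))
        sims[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = np.percentile(sims, 95, axis=0)
    k = 0
    while k < p and obs[k] > threshold[k]:
        k += 1
    return k

# Simulated ratings: 100 observations of 10 "traits" driven by 2 factors
rng = np.random.default_rng(4)
latent = rng.normal(size=(100, 2))
data = latent @ rng.normal(size=(2, 10)) + 0.3 * rng.normal(size=(100, 10))
print(parallel_analysis(data))
```

Implementations differ in details (mean vs. percentile thresholds, principal components vs. factor eigenvalues), which is one reason the study cross-checked the retained number against four other criteria.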

Labeling of Dimensions
Dimensionality reduction methods do not provide labels for the factors discovered; these must instead be interpreted by the investigators. We note that our third and fourth dimensions describe stereotypes related to gender (femininity stereotypes) and age (youth stereotypes) commonly reported in the literature 11. In fact, essentially all trait judgments based on faces, and therefore all of our dimensions, are a reflection of people's stereotypes of some sort, since in our study nothing else is known about the people whose faces serve as stimuli, and no ground truth is therefore provided. We therefore omitted "stereotypes" from our labeling of all dimensions, since it implicitly applies to all of them.

Confirmatory analyses with artificial neural networks and cross-validation
To compare different theoretical models and test for potential nonlinearity in our data, we employed an artificial neural network approach, in particular autoencoders 63, with cross-validation. The aim of an autoencoder model is to learn a lower-dimensional representation of the data. We constructed different autoencoders based on the different models we wished to test (the existing 2D and 3D frameworks 13,37, and the 4D framework from EFA). We trained these autoencoders on half of the data (for each trait, 50% of the individuals were randomly selected and their ratings were used to compute new aggregated ratings per face per trait) and tested them on the held-out other half of the data. We used the Adam optimization algorithm 52 and a mean squared error loss function, with a batch size of 32 and 1,500 epochs, to train the neural networks (the loss converged after 1,000 epochs in all our models). We repeated this process for 50 iterations and compared the performance of the different models. For completeness, both linear and nonlinear activation functions were explored for model fitting (linear, tanh, sigmoid, rectified linear unit, L1-norm regularization; Fig. 4b-c); a simple linear activation function yielded the best results.
Existing frameworks 13,37 suggest that all face-impression dimensions are of the same order (i.e., no dimension is a higher- or lower-order dimension of the others), but that the number of dimensions varies. Therefore, we first constructed different autoencoder models with only one hidden layer, varying the number of neurons in this hidden layer, which corresponds to the number of underlying dimensions (from 1 to 10). The input and output layers were the same for all models: each face was represented by a vector of ratings across the 92 traits, with each trait corresponding to a neuron. All layers were densely connected. We trained these different models and compared their performance (assessed as the explained variance on the held-out test data).
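Because a linear autoencoder trained with MSE loss converges to the PCA subspace of its training data (Eckart-Young), the single-hidden-layer model comparison above can be sketched in closed form: fit the bottleneck on one half of the data and score explained variance on the held-out half while sweeping the bottleneck width. This is an illustrative simplification of the Adam-trained models, not the study's code, and for simplicity the toy split is over faces rather than over raters as in the study.

```python
import numpy as np

def heldout_r2(train, test, n_dims):
    """Fit a linear bottleneck (PCA subspace) on `train`, then measure
    variance explained when reconstructing `test` through it."""
    mu = train.mean(axis=0)
    # Principal axes of training data = optimum of a linear autoencoder
    _, _, vt = np.linalg.svd(train - mu, full_matrices=False)
    basis = vt[:n_dims]                          # (n_dims, n_traits)
    recon = (test - mu) @ basis.T @ basis + mu   # encode, then decode
    ss_res = np.sum((test - recon) ** 2)
    ss_tot = np.sum((test - test.mean(axis=0)) ** 2)
    return 1 - ss_res / ss_tot

# Toy data: 100 "faces" x 20 "traits" driven by 4 latent dimensions
rng = np.random.default_rng(2)
latent = rng.normal(size=(100, 4))
weights = rng.normal(size=(4, 20))
ratings = latent @ weights + 0.1 * rng.normal(size=(100, 20))
train, test = ratings[:50], ratings[50:]
scores = [heldout_r2(train, test, k) for k in range(1, 11)]
```

On such data, held-out explained variance climbs with each added dimension up to the true dimensionality (here 4) and then plateaus, which is the pattern the model comparison looks for.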
In addition, we tested for a potential hierarchical factor structure in our data by adding one hidden encoder layer with various numbers of neurons (from 1 to 10) before the middle hidden layer (also with 1 to 10 neurons); since autoencoder models are by definition symmetric, these hierarchical latent structures were mirrored in the decoder layers (i.e., three hidden layers in total). Results showed that adding hidden layers did not increase model performance.

Study 2 Participants
The study was approved by the Institutional Review Board of the California Institute of Technology, and informed consent was obtained from all participants. We preregistered to recruit participants through Digital Divide Data, a social enterprise that delivers research services, in seven countries/regions of the world: North America (U.S. and Canada), Latvia, Peru, the Philippines, India, Kenya, and Gaza. All participants were required to be between 18 and 40 years old, proficient in English (except participants in Peru, where all materials were translated into Spanish), educated at least through high school, trained in basic computer skills, and to have never visited or lived in Western-culture countries (except participants in North America and Latvia). In addition, we aimed for a roughly equal sex ratio of participants at all locations.
The sample size for each location was predetermined to be 30 participants, based on two criteria: first, the sample size should be large enough to ensure stable average trait ratings (for a corridor of stability of +/-1.00 and a level of confidence of 95%, the point of stability

Eighty of the 100 trait words were used in Study 2; twenty words were excluded for their low correlations with other traits as found in Study 1 (sarcastic, white, thrifty, shallow, homosexual, nosey, conservative, and reserved), for their ambiguity or similarity in meaning according to feedback from Study 1 (trustful, natural, passive, reasonable, strict, enthusiastic, affectionate, and sincere), or for their potential offensiveness in some cultures (idiot, loser, criminal, and abusive).

Measures of reliability in Study 2
Data were first processed following our preregistered exclusion criteria A to C (see preregistration at https://osf.io/tbmsy): of the full sample with a preregistered size of N = 30 participants and L = 300,000 ratings at each of 7 locations (N = 210 total), we excluded from further analysis n = 1 participant in India, and l = 24,236 ratings in North America, l = 2,507 ratings in Latvia, l = 16,366 ratings in Peru, l = 3,178 ratings in the Philippines, l = 14,389 ratings in India, l = 9,117 ratings in Kenya, and l = 4,096 ratings in Gaza. Exclusion criterion D was not applied for the analyses of within-subject consistency and between-subject consensus, because it imposes a strict lower bound on within-subject consistency to ensure data quality, which could lead to an overestimation of the reliability of the data.
All participants at all locations rated a subset of twenty traits twice for all faces. Analyses of within-subject consistency identical to those in Study 1 were performed for each of the seven datasets (l = 100 pairs of ratings across faces per participant for ca. n = 28 participants per location). We found acceptable within-subject consistency at all locations (rs > 0.20; Figs. 3-4).
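Within-subject consistency of this kind can be sketched as a rank correlation across faces between a participant's first and second pass over the same trait. The SciPy-based example below is a generic illustration on simulated ratings, not the study's code.

```python
import numpy as np
from scipy.stats import spearmanr

def within_subject_consistency(first_pass, second_pass):
    """Spearman correlation across faces between two ratings of the
    same trait by the same participant."""
    rho, _ = spearmanr(first_pass, second_pass)
    return rho

# Toy example: 100 faces rated twice on a 1-7 scale with rating noise
rng = np.random.default_rng(3)
true_impression = rng.uniform(1, 7, size=100)
pass1 = np.clip(np.round(true_impression + rng.normal(0, 1, 100)), 1, 7)
pass2 = np.clip(np.round(true_impression + rng.normal(0, 1, 100)), 1, 7)
rho = within_subject_consistency(pass1, pass2)
```

A rank correlation is used rather than Pearson's r so that consistency does not depend on each participant using the rating scale linearly.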
Assessment of between-subject consensus at each location used data from all participants within the same location (l = 100 ratings per participant for the 100 faces from ca. n = 28 participants per trait per location). As hypothesized in our preregistration, traits concerning physical appearance, such as feminine, youthful, beautiful, and baby-faced, showed high between-subject consensus in all seven locations (all ICCs > 0.86). At the other extreme, some locations had trait ratings with near-zero consensus within that location (the ratings of compulsive in Gaza, prudish in India and Kenya, and self-critical in Gaza and the Philippines). This stood in contrast to the findings from Study 1, where ICCs > 0.61 for all one hundred traits (Fig. 3), and to the samples from North America (ICCs > 0.61 for all traits) and Latvia (ICCs > 0.50 for all traits).
Data processing for RSA and dimensionality analysis in Study 2
To ensure high-quality and complete data from individuals, we registered four exclusion criteria (A-D) while data collection was underway and the data had not yet been analyzed (see registration at https://osf.io/tbmsy), in addition to those planned in our original preregistration (https://osf.io/qxgmw).
Analyses of representational similarity and dimensionality for both aggregated and individual data used data processed with exclusion criteria A-D. Following those criteria, thirty-one participants across the seven locations were excluded from further analysis (n = 3 for North America, n = 2 for Latvia, n = 7 for Peru, n = 3 for the Philippines, n = 10 for India, n = 2 for Kenya, and n = 4 for Gaza). Among the remaining participants, n = 86 had complete data for all 80 traits; data from these 86 participants were used in the individual-level analyses (Fig. 7).

Data and code availability
All data, code, and materials are available at the Open Science Framework: https://osf.io/4mvyt/ and https://osf.io/xeb6w/.

Figure 1 Sampling. Representativeness of the sampled traits (a-b) and face images (c-d). a, Distributions of word similarities. The similarity between two words was assessed with the cosine distance between the 300-feature vectors 40 of the two words. The blue histogram plots the pairwise similarities among the 100 sampled traits. The red histogram plots the similarities between each of the words freely generated during spontaneous face judgments (n = 973, see Supplementary Fig. 1a) and its closest counterpart among the 100 sampled traits. Dashed lines indicate means. All freely generated words were similar to at least one of the sampled traits (all similarities greater than the mean similarity among the sampled traits, except for the words "moving" and "round"). Eighty-five freely generated words were identical to words among the 100 sampled traits.
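The word-similarity measure described in this caption can be sketched as cosine similarity between word-embedding vectors, with the closest-counterpart similarity taken as the maximum over the sampled traits. The 8-dimensional toy vectors below merely stand in for the 300-feature embeddings; this is an illustration, not the study's code.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def closest_counterpart(word_vec, sampled_vecs):
    """Similarity between a freely generated word's vector and its
    nearest neighbour among the sampled trait vectors."""
    return max(cosine_similarity(word_vec, s) for s in sampled_vecs)

# Toy stand-ins for 300-feature word embeddings
rng = np.random.default_rng(4)
sampled = rng.normal(size=(100, 8))      # 100 sampled trait vectors
free_word = sampled[17] + 0.05 * rng.normal(size=8)  # near-duplicate word
sim = closest_counterpart(free_word, sampled)
```

A freely generated word that nearly duplicates a sampled trait, as in this toy example, yields a closest-counterpart similarity near 1.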

Supplementary Files
Supplemental.docx