Introduction

Generalized trust, or default expectations about the trustworthiness of people in general, is the foundation of well-functioning communities, markets, and hierarchies1,2,3. Despite long-standing interest in generalized trust, the concept has proven difficult to measure: research consistently shows that common measures of generalized trust are inaccurate and non-invariant4,5,6,7,8,9,10. This is because the terms used in traditional measures mean different things to different people6,9,11,12,13. A prominent example is the “most-people trust” question, which in its classic form asks, “Some people say that most people can be trusted. Others say you can’t be too careful in your dealings with people. How do you feel about it?” In particular, the literature shows that interpretations of “most people”6,11,12,13 and what “people can be trusted” to do14,15 vary from one respondent to the next, leading to measurement error and biased responses.

To address these issues, Robbins7 recently developed a new instrument, the Stranger Face Trust scale (SFT), that measures generalized trust as the average of trust ratings across a small set of trust domains and human faces. By presenting respondents with a standardized set of specific strangers and particular matters, Robbins9,10 finds that the measurement equivalence problems plaguing traditional measures of generalized trust are mitigated with SFT. While various psychometric tests also provide strong empirical support for the reliability and validity of SFT7,8,9,10, there is little evidence on whether facial appearance biases previous tests. Assessing facial heterogeneity is important because, if SFT is valid and reliable across different sets of faces, it would indicate that SFT accurately measures generalized trust regardless of the faces being assessed, and that SFT can be modified for different settings without sacrificing validity and measurement equivalence.

Here, we test whether facial heterogeneity affects the psychometric properties (validity, reliability, measurement equivalence) of SFT using an online survey experiment. In September of 2022, we recruited 5001 Qualtrics web panel members living in the United States. Respondents were randomly assigned to one of five conditions or sets of faces (see Fig. 1 for an illustrative example of the faces in question): (i) six “original” faces from SFT, (ii) six “low trust” faces, (iii) six “high trust” faces, (iv) six “single race” faces, and (v) six “AI-synthesized” faces. Irrespective of the condition, respondents rated each face along four trust domains: keeping a secret, repaying a loan, watching a loved one, and providing financial advice. To assess consistency across psychometric tests, we also measured respondents’ demographic characteristics (e.g., age, gender), economic preferences including positive reciprocity and unconditional altruism16, and prior trusting behaviors such as lending money to friends and leaving one’s door unlocked5.

Figure 1

Examples of faces (Black females) from each set of faces. Panels A to D (Original Face, Low-Trust Face, High-Trust Face, and Single-Race Face) come from the Chicago Face Database18, which is a publicly available database consisting of high-resolution photographs of male and female human faces of various ages and ethnicities. Panel E (AI-Synthesized Face) comes from www.thispersondoesnotexist.com, which is a publicly available database of synthetic faces that were created using a generative adversarial network trained on a large dataset of real images of human faces. More information about the sets of faces can be found in the Supplemental Materials online.

We find that facial heterogeneity has negligible effects on the measurement validity and measurement equivalence of SFT. Tests evaluating measurement invariance, or the extent to which a latent construct is measured in the same way across groups, show that the factor loadings and item intercepts of SFT are invariant across sets of faces. Psychometric tests that assess the extent to which SFT correlates with closely related constructs with which it should correlate (convergent validity), does not correlate with different constructs with which it should not correlate (discriminant validity), and predicts criterion variables that it should predict (concurrent validity) yield largely consistent results across sets of faces. The one exception is the “low trust” faces. For this set of faces, we observe slight mean differences, as well as minor differences in the regression coefficients for two of the five criterion variables used to assess convergent validity. These results provide further evidence for the accuracy and consistency of SFT, and the utility of using AI-synthesized faces in social science research.

Results

Measurement equivalence

Nested model comparisons between configural invariance (N = 4753, SRMR = 0.019, RMSEA = 0.071, CFI = 0.974, TLI = 0.957), metric invariance (N = 4753, SRMR = 0.035, RMSEA = 0.071, CFI = 0.962, TLI = 0.957), and scalar invariance (N = 4753, SRMR = 0.044, RMSEA = 0.079, CFI = 0.939, TLI = 0.946) models yield inconsistent changes to SRMR, RMSEA, CFI, and TLI: in neither comparison do all four indices change by more than their cutoff values. These tests indicate that SFT’s factor loadings and item intercepts are invariant across sets of faces.

Convergent validity

Table 1 shows that the effect sizes (or standardized βs) of the original SFT faces parallel previously published estimates7,10. Joint or familywise tests of the equality of coefficients fail to reject the null hypothesis that interaction effects are equal to zero, except for IST, F(4, 4650) = 3.88, p = 0.003, and PST, F(4, 4709) = 2.41, p = 0.046. In both cases (i.e., IST and PST), the coefficients of the “low trust” faces are statistically significantly smaller than those of most other sets of faces (Δβs ≤ 0.10). In all other cases, the effect size differences between the coefficients are relatively trivial and statistically non-significant (see the Supplemental Materials online for coefplots showing the regression coefficients and their confidence intervals).

Table 1 Familywise tests of interactions between SFT and sets of faces.

Discriminant validity

Across a number of economic preferences16, including measures of risk preferences, time preferences, and social preferences, we find that effect sizes are statistically equivalent between the five sets of faces. For every measure of economic preferences, joint tests of the equality of coefficients fail to reject the null hypothesis that interaction effects are equal (see Table 1). As Table 1 suggests, the effect size differences between the coefficients are relatively trivial and statistically non-significant (see the Supplemental Materials online for the regression coefficients and their confidence intervals). Effect sizes of the original SFT faces replicate estimates from previous research10.

Concurrent validity

The original SFT faces are moderately related to trusting behavior, β = 0.378, p < 0.001, supporting previous findings7. A joint test of the equality of coefficients fails to reject the null hypothesis that interaction effects are equal, F(4, 4720) = 0.91, p = 0.457, which is further supported by the regression coefficients and their confidence intervals (see the Supplemental Materials online).

Descriptive analysis

Figure 2 shows that the distributions of SFT across sets of faces are roughly equivalent, with the exception of “low trust” faces, which have a higher density of scores below values of 1 than the other sets of faces. Row-mean scales developed by Robbins7 show that the mean of “low trust” faces is significantly different from all other faces, F(4, 4748) = 15.23, p < 0.001.

Figure 2

Density plots of SFT (row-mean scales) across five different sets of faces.

Discussion and conclusion

Research shows that traditional measures of generalized trust are inaccurate and non-equivalent4,5,6,7,8,9,10,11,12,13. Because of these measurement issues, it is difficult to discern which scientific discoveries are real and which are an artifact of measurement. Some scholars have offered solutions to these longstanding measurement problems7,8,9,10, but the proposed solution, SFT, rests on assumptions about the consistency of responses across different faces. We show that ratings of SFT yield reliable scores for five different sets of faces, including AI-synthesized faces, and that very little, if any, bias is introduced by manipulating the facial heterogeneity of SFT. We do, however, observe slight mean differences for “low trust” faces, as well as small differences in some tests of convergent validity, suggesting that care should be taken in selecting faces for SFT within and between different settings, or researchers run the risk of obtaining lower-bound estimates of generalized trust. These findings have two important implications.

First, our results extend previous research on the measurement of generalized trust7,8,9,10. While SFT has been shown to be more valid, reliable, and invariant than traditional measures, we additionally demonstrate the versatility and robustness of SFT to changes in its core stimuli: human faces. Tests of factorial validity (factor loadings and item intercepts), convergent validity, discriminant validity, and concurrent validity yield comparable results regardless of the faces assessed (with the exception of a few criterion variables). These results suggest that faces can be modified in SFT without sacrificing validity or measurement equivalence. However, the extent to which this recommendation applies to other cultural contexts with different ethnoracial groups outside of the United States requires further investigation. One possible avenue for future research is to evaluate and compare the validity, reliability, and measurement equivalence of drastically different sets of faces, such as all Asian faces or all African faces, within and between countries that vary in their levels of ethnic heterogeneity. This would help determine which sets of faces have broader cross-cultural applicability than others.

Second, our results are consistent with recent work on synthetically generated faces, where respondents judge AI-synthesized faces and real faces as comparably trustworthy17. While there are growing risks associated with this technology, such as online fake profiles, fraud, and disinformation campaigns, AI-synthesized images and videos could find a home in social science research. Synthetically generated images and videos provide social scientists with opportunities to better understand human sociality without posing threats to the privacy of individuals who offer their images for purposes of research.

Data and methods

Sample and procedures

In September 2022, we recruited 5001 Qualtrics web panel members to participate in an online survey experiment. We included respondents who passed an attention check, lived in the United States, were 18 years of age or older, and who met our quotas for age, gender, and education. The quota sample was proportionally representative of the college-educated, U.S. population by age cohort, gender, and education. Age quotas included the following: at least 27% 18 to 34 years old, 22% 35 to 49 years old, 23% 50 to 64 years old, and 19% 65 years old or older. Gender quotas were at least 45% male and 45% female. Quotas for educational attainment were at least 8% less than high school, 25% high school diploma or equivalent, 25% some college or Associate’s degree, 20% Bachelor’s degree, and 10% graduate degree (i.e., Master’s, Doctoral, etc.). Respondents were, on average, 47.80 years old (SD = 17.09); 53.2% were female, 69.9% were non-Hispanic White, 43.7% had a high school diploma or equivalent, and 43.5% were married.

The survey experiment consisted of 9 blocks: some blocks were fixed at the beginning of the experiment (e.g., consent form, demographics), while the remaining blocks were presented in random order from respondent to respondent. These blocks were organized thematically, such as a block for the Imaginary Stranger Trust scale (IST), a block for SFT (where respondents were randomly assigned to one of five conditions or sets of faces), a block for traditional measures of generalized trust, a block for measures of political trust, a few blocks for measures of economic preferences, and a block for measures of trusting behavior. The survey completion rate was 33.33% (Qualtrics contacted 15,100 web panelists, and 5,001 of them completed the study). Participants who completed the study received an incentive of various types (e.g., cash, airline miles, etc.). The median length of the study was 7.43 min.

Instruments and measures

Six different strangers were initially selected for inclusion in the “original” SFT scale7,8,9,10. The strangers were drawn from the Chicago Face Database18, which is a publicly available database consisting of high-resolution photographs of male and female human faces of various ages and ethnicities. Each face in the database is represented by a neutral expression photograph that has been normed by an independent rater sample (> 1000 independent raters). The six faces were selected based on four criteria: age (median U.S. age), race (Caucasian, African-American, and Latino/a), gender (male and female), and perceived trustworthiness (neutral ratings).

For three other sets of faces, including the “low trust”, “high trust”, and “single race” faces, we again selected faces from the Chicago Face Database, but used different selection criteria. The faces represent the genders (male and female) and races (Caucasian, African-American, and Latino/a) that received the lowest perceived trustworthiness ratings (“low trust”), the highest perceived trustworthiness ratings (“high trust”), or the most consensus on perceived race (“single race”) from the independent rater sample for a given gender × race combination. Unlike the “original” faces, age was not a criterion for selecting these faces.

For the AI-synthesized faces, we selected images from www.thispersondoesnotexist.com. The website uses an algorithm trained on a large dataset of real images of human faces. It then uses a generative adversarial network to fabricate faces. The authors selected the six “AI-synthesized” faces given their perceptions of gender (male and female) and race (Caucasian, African-American, and Latino/a). More information about each face can be found in the Supplemental Materials online.

For all five sets of faces, SFT asks respondents, “Imagine meeting the following stranger for the first time. Please identify how much you would trust this stranger for each of the following.” In contrast, IST, a short-form scale developed by Robbins7, does not show respondents faces. Instead, respondents are asked, “Imagine meeting a total stranger for the first time. Please identify how much you would trust this stranger for each of the following.” SFT and IST contain four domains (or matters) of trust on which respondents rate each of the six human faces (SFT) and the imaginary stranger (IST): (i) “To keep a secret that is damaging to your reputation” (SECRET); (ii) “To repay a loan of one thousand dollars” (LOAN); (iii) “To look after a child, family member, or loved one while you are away” (CHILD); and (iv) “To provide advice about how best to manage your money” (ADVICE). Each trust domain is measured using a 4-point scale, ranging from Do not trust at all to Trust completely, with Do not trust very much and Trust somewhat between the anchors, and a Don’t know option at the end of the scale. For SFT, the order of the six faces is randomized across respondents, meaning that for this study, respondents are randomly assigned to a set of faces and the order of the faces within each set is randomized from respondent to respondent. The order of the four trust domains is also randomized across faces. For IST, only the order of the four domains is randomized across respondents.
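As an illustration of how a row-mean SFT score could be computed from these ratings, the following minimal sketch averages one respondent's answers across six faces and four domains. The numeric coding (0 to 3), the treatment of Don't know as missing, and the names `CODES` and `sft_row_mean` are assumptions made for this example; they are not taken from the instrument's documentation.

```python
from typing import Optional, Sequence

# Assumed numeric coding of the four response options (not given in the text).
CODES = {
    "Do not trust at all": 0,
    "Do not trust very much": 1,
    "Trust somewhat": 2,
    "Trust completely": 3,
}
DOMAINS = ("SECRET", "LOAN", "CHILD", "ADVICE")


def sft_row_mean(ratings: Sequence[dict]) -> Optional[float]:
    """Average all answered face-by-domain ratings for one respondent.

    `ratings` holds one dict per face, mapping each domain to a response
    label. "Don't know" (or any unlisted label) is treated as missing,
    which is an assumption of this sketch.
    """
    values = [
        CODES[face[d]]
        for face in ratings
        for d in DOMAINS
        if face.get(d) in CODES
    ]
    return sum(values) / len(values) if values else None


# One hypothetical respondent who gives the same answers to all six faces.
one_face = {
    "SECRET": "Trust somewhat",
    "LOAN": "Do not trust very much",
    "CHILD": "Do not trust at all",
    "ADVICE": "Don't know",
}
respondent = [one_face] * 6
score = sft_row_mean(respondent)
print(score)
```

Averaging over whichever items were answered keeps respondents with a few Don't know responses in the sample; a stricter rule, such as requiring a minimum number of answered items, would be an equally plausible design choice.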

Convergent validity is determined by assessing the degree to which an operationalization is similar to (or converges on) other operationalizations to which it should theoretically be similar. In addition to IST, four other instruments common to the General Social Survey and the World Values Survey were used to establish convergent validity: a 3-item Misanthropy Scale (MST); a 3-item Generalized Social Trust scale (GST); a 3-item Particularized Social Trust scale (PST); and a 4-item Political Trust scale (POT). We selected these four scales because they have been used as criterion variables in previous work evaluating the validity of SFT7,9,10. Information on the wording and scaling of each item can be found in the Supplemental Materials online.

Discriminant validity is determined by assessing the degree to which an operationalization is not similar to (or diverges from) other operationalizations to which it should not theoretically be similar. Many theories in the social sciences assume that a set of preferences motivates behavior, such as preferences for risk, the timing of costs and benefits, and reciprocity and altruism. Recent research has shown that generalized trust is weakly correlated with several kinds of economic preferences, such as positive reciprocity and unconditional altruism10,16. These findings are intuitive because trust is a belief, not a preference. As a result, generalized trust should be weakly correlated with economic preferences like positive reciprocity10. To measure economic preferences, we used 9 items from Falk et al.16 that capture time preferences, risk preferences, positive and negative reciprocity, and unconditional altruism. Information on the wording and scaling of each item can be found in the Supplemental Materials online.

Concurrent validity is determined by assessing the degree to which an operationalization predicts an outcome that it should theoretically predict. Concurrent validity, in other words, identifies the strength of a relationship between the operationalization and a criterion variable at the time the operationalization is administered (or measured). Note that concurrent validity is different from predictive validity, which determines the ability of an operationalization to predict an outcome (i.e., criterion variable) in the future. By this logic, self-report measures of trust should be correlated with measures of trusting behavior, such as lending money and personal possessions to friends5. A 3-item scale (Trusting Behavior) adapted from Glaeser et al.5 was used to establish concurrent validity. Information on the wording and scaling of each item can be found in the Supplemental Materials online.

Analysis

We investigate measurement equivalence by comparing three nested models with multiple group analysis19. In multiple group analysis, each successive model includes the previous model restrictions plus additional constraints. Model 1, the configural invariance model, tests the equivalence of the factor structure. Model 2, the metric invariance model, tests the equivalence of the factor loadings. Model 3, the scalar invariance model, tests the equivalence of measurement intercepts. Nested models can be tested with χ² difference tests, but since the test is sensitive to sample size we rely on changes to tests of absolute and relative model fit.

The tests we use include the root mean squared error of approximation (RMSEA), the standardized root mean square residual (SRMR), the comparative fit index (CFI), and the Tucker-Lewis index (TLI). RMSEA and SRMR are tests of absolute model fit, while CFI and TLI are tests of relative model fit. The RMSEA is an index that measures the difference between the hypothesized model and the population covariance matrix. RMSEA ranges between 0 and 1, with values less than 0.08 indicating adequate model fit. The SRMR is a standardized measure of the square root of the difference between the sample covariance matrix and the model covariance matrix. Like the RMSEA, SRMR ranges between 0 and 1, with values less than 0.08 indicating adequate model fit. By contrast, the CFI adjusts for issues of sample size inherent to the χ² test of model fit, and measures the relative improvement in model fit going from the baseline model (i.e., a model with the worst fit) to the hypothesized model. CFI ranges from 0 to 1, with values greater than 0.90 suggesting adequate model fit. The TLI measures the relative reduction in misfit per degree of freedom for the baseline model and the hypothesized model. TLI ranges from 0 to 1 but can occasionally be negative or exceed 1, with values greater than 0.90 indicating adequate model fit. We use criteria of a 0.015 change in RMSEA and SRMR paired with changes in CFI and TLI of 0.01. If nested model comparisons yield ΔRMSEA > 0.015, ΔSRMR > 0.015, ΔCFI > 0.01, and ΔTLI > 0.01, then the null hypothesis of equivalence should be rejected20,21.
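The decision rule can be made concrete with a short sketch that applies the cutoffs to the fit indices reported in the Results section. It assumes a literal reading of the rule, in which equivalence is rejected only when all four indices worsen by more than their cutoffs; the names `FIT`, `CUTOFFS`, `compare`, and `reject_equivalence` are hypothetical and introduced only for this illustration.

```python
# Fit indices as reported in the Results section for the three nested models.
FIT = {
    "configural": {"SRMR": 0.019, "RMSEA": 0.071, "CFI": 0.974, "TLI": 0.957},
    "metric": {"SRMR": 0.035, "RMSEA": 0.071, "CFI": 0.962, "TLI": 0.957},
    "scalar": {"SRMR": 0.044, "RMSEA": 0.079, "CFI": 0.939, "TLI": 0.946},
}
CUTOFFS = {"SRMR": 0.015, "RMSEA": 0.015, "CFI": 0.01, "TLI": 0.01}


def compare(less_constrained: str, more_constrained: str) -> dict:
    """Return, per index, the worsening in fit and whether it exceeds the cutoff."""
    result = {}
    for index, cutoff in CUTOFFS.items():
        a = FIT[less_constrained][index]
        b = FIT[more_constrained][index]
        # SRMR and RMSEA worsen by increasing; CFI and TLI worsen by decreasing.
        delta = b - a if index in ("SRMR", "RMSEA") else a - b
        delta = round(delta, 3)  # avoid floating-point noise in the comparison
        result[index] = (delta, delta > cutoff)
    return result


def reject_equivalence(comparison: dict) -> bool:
    """Assumed rule: reject only if every index exceeds its cutoff."""
    return all(exceeds for _, exceeds in comparison.values())


metric_step = compare("configural", "metric")
scalar_step = compare("metric", "scalar")
for name, step in (("metric", metric_step), ("scalar", scalar_step)):
    print(name, step, "reject:", reject_equivalence(step))
```

Under this reading, some indices exceed their cutoffs in each comparison (for example, ΔSRMR and ΔCFI in the metric step) while others do not, so neither comparison leads to rejecting equivalence.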

We report nested model comparisons between models that estimated configural invariance, metric invariance, and scalar invariance by groups (i.e., sets of faces). To identify the configural invariance model, we followed Vandenberg and Lance19 and constrained the factor means and variances to 0 and 1, respectively, across groups. Factor loadings as well as item intercepts and residual variances were freely estimated across groups. To identify the metric invariance model, we constrained the factor loadings to equality across groups, constrained the factor variance to 1 for the first group (i.e., “original” faces) but freely estimated the factor variances for all other groups, constrained the factor means to 0 across groups, and freely estimated item intercepts and residual variances across groups. To identify the scalar invariance model, we constrained the factor loadings and item intercepts to equality across groups, constrained the factor variance and mean to 1 and 0, respectively, for the first group (i.e., “original” faces) but freely estimated the factor variances and means for all other groups, and freely estimated item residual variances across groups.

For all measurement validity tests found in Table 1, we estimated the following models:

$$CriterionVariable = \beta_0 + \beta_1 SFT + \beta_2 FaceDummies + \beta_3 (SFT \times FaceDummies) + \varepsilon$$

where CriterionVariable is a validation variable, such as IST or MST, SFT is a single continuous variable consisting of latent factor scores of each set of faces (Original Faces, …, AI-Synthesized Faces), FaceDummies is a vector of dummy variables for each set of faces (Original Faces is the referent category), and SFT × FaceDummies is a vector of interaction effects between latent factor scores and dummy variables of each set of faces.
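To illustrate how a joint test of the interaction terms in this model can work, the sketch below fits the model with and without the interactions by ordinary least squares and compares the two fits with an F statistic. All data are simulated, the variable names are hypothetical, and the classical nested-model F test is an assumption of this example; the text does not specify the estimator or standard errors behind the reported tests.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sets = 1000, 5

# Simulated stand-ins: SFT factor scores and a random face-set assignment.
sft = rng.normal(size=n)
face_set = rng.integers(0, n_sets, size=n)
dummies = np.eye(n_sets)[face_set][:, 1:]   # set 0 plays the referent category
interactions = dummies * sft[:, None]       # SFT x FaceDummies

# Simulated criterion with a common slope, i.e. no true interaction.
criterion = 0.5 + 0.4 * sft + rng.normal(size=n)


def rss(X: np.ndarray, y: np.ndarray) -> float:
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


ones = np.ones((n, 1))
X_restricted = np.hstack([ones, sft[:, None], dummies])   # no interactions
X_full = np.hstack([X_restricted, interactions])          # full model

q = interactions.shape[1]          # number of interaction terms tested jointly
df_resid = n - X_full.shape[1]     # residual degrees of freedom, full model
rss_r, rss_f = rss(X_restricted, criterion), rss(X_full, criterion)
f_stat = ((rss_r - rss_f) / q) / (rss_f / df_resid)
print(f"F({q}, {df_resid}) = {f_stat:.2f}")
```

With five sets of faces and one referent category there are four interaction terms, which matches the four numerator degrees of freedom in the F tests reported in the Results.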

Ethics approval

Ethics approval was obtained from the New York University Abu Dhabi Institutional Review Board (Approval Number: HRPP-2022-87). Our work also conforms to the Code of Ethics of the American Sociological Association (ASA), and although our study is not a medical study, we adhere to the World Medical Association’s Code of Ethics (Declaration of Helsinki) for the protection of human research participants.

Consent to participate

Informed consent was obtained from all study participants.