Introduction

People’s success and well-being heavily depend on how their personalities are judged by others and—increasingly—algorithms1. From first impressions based on facial appearance2 to close friends’ well-informed opinions3, others’ perceptions affect one’s personal, educational, and occupational success; social capital; health; wealth; and many other consequential outcomes4. Importantly, others’ perceptions matter regardless of their accuracy3,5, as illustrated by those suffering (or benefiting) from prejudice and stereotypes6,7.

Particularly consequential are perceptions of public figures’ personalities. Politicians’ perceived personality influences their electoral success8, their approval ratings9, and even geopolitics10. CEOs’ perceived personality influences not only their own success but also their companies’ reputation, valuation, and performance11,12. Celebrities’ perceived personality affects the recognition of, consumer attitudes toward, and purchase intentions for the brands they promote13. Musicians’ perceived personality drives their music’s popularity14. Unsurprisingly, public figures invest much effort and many resources into shaping others’ impressions, while researchers and practitioners across many disciplines study how such impressions form and how they can be measured15,16.

Perceptions of public figures’ personalities are typically measured by surveying qualified informants or the general public, a costly and time-consuming approach9,12. Such perceptions are also reflected in public discourse and communications17. As public discourse and journalism increasingly shift to digital environments, people’s views and perceptions are increasingly recorded in written digital sources such as blog posts, tweets, Wikipedia entries, newspaper articles, and books. This signal is further amplified as published perceptions, along with cues to public figures’ actual personalities, shape others’ perceptions, creating self-amplifying feedback loops. Taken together, these phenomena suggest that costly and time-consuming surveys could be supplemented with perceived-personality estimates extracted from digital language samples.

Past research has confirmed that perceptions of others’ personalities can be extracted from texts such as social media posts, biographies, or books18,19,20. The main challenge of this approach is obtaining the text corpora necessary to extract personality perception cues. Yet, this challenge has been addressed by the recent explosion in the size and availability of large language models (LLMs), such as Word2Vec, BERT, or GPT-321,22,23. LLMs are trained on huge and diverse text corpora that include, among other things, language revealing people’s perceptions of public figures’ personalities as well as cues to their actual personalities. For example, GPT-3—the state-of-the-art LLM used here—was trained on the contents of billions of websites, the entire English Wikipedia, and over 10,000 books23. The collection and analysis of such data are beyond the technological capacity of most researchers, not to mention the associated financial and environmental costs: the training of GPT-3 was estimated to cost $12 million and to emit 552 tons of carbon dioxide24,25.

Past work showed that the perceptions of public figures’ warmth and competence can be extracted from an earlier LLM, Google’s Word2Vec26. Here we show that word embeddings extracted from GPT-323 can predict people’s general sentiment toward public figures (likability) as well as perceptions of their Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and emotional stability), traits shown to capture much of the variance in individual differences and to reliably predict a wide range of individual and social outcomes27.

Methods

Our study focused on the 300 most popular public figures from 43 countries, selected from among the 11,341 public figures listed in the Pantheon 1.0 dataset28. Their popularity was approximated by their Wikipedia page views between 2008 and 2013. As artists were particularly popular, we capped their number at 100 to make room for public figures from seven other domains: business and law, exploration, humanities, institutions, science and technology, sports, and others. The dataset includes public figures’ gender and birth year (with some missing data). As raters may have been less familiar with public figures born before 1900, we excluded them from our studies.

Public figures’ names were presented to raters employed on Prolific. Each of the 600 raters rated the likability (on a 200-point scale ranging from extremely negative to extremely positive) and Big Five personality traits (on the Ten-Item Personality Inventory; TIPI29) of 10 randomly selected public figures. Raters could skip targets they were unfamiliar with. Public figures received 18.89 ratings on average (SD = 10.38). We removed 74 public figures who were recognized (and thus rated) by fewer than 10 raters. See Supplementary Materials for the rationale and for the intraclass correlation coefficient (ICC)30, a measure of agreement between two or more raters.

GPT-3 stores knowledge about words’ meaning in a 12,288-dimensional semantic space, a functional equivalent of the semantic memory in humans. The closer two words (or phrases) are in this space, the more similar their meaning31. For example, “Donald Trump” is similar to “arrogant,” while “Mother Teresa” is close to “sympathetic.” The embeddings of public figures’ names, representing their location in this space, were entered into a Ridge regression26 to predict human ratings. Ridge regression is suitable for the analyses of high-dimensional data, as it reduces multicollinearity between predictors by penalizing large coefficients. The embeddings were standardized (by column) to ensure that the penalty was applied equally to each dimension. To prevent overfitting, 20-fold cross-validation was used: Predictions for each public figure were estimated using a model trained on all other public figures. The alpha parameter was tuned within each cross-validation fold using another 20-fold cross-validation.
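A minimal sketch of this pipeline, assuming the name embeddings are already collected in a matrix X (one 12,288-dimensional row per public figure) alongside mean ratings y; the variable names and alpha grid are illustrative rather than our exact settings:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cross_validated_predictions(X, y, n_folds=20,
                                alphas=np.logspace(-2, 5, 15)):
    """Out-of-sample ridge predictions with nested alpha tuning.

    X: (n_figures, 12288) GPT-3 name embeddings
    y: (n_figures,) mean human ratings of one trait
    """
    preds = np.empty_like(y, dtype=float)
    outer = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in outer.split(X):
        # Standardize each embedding dimension so the ridge penalty
        # applies equally to all of them; tune alpha with an inner
        # 20-fold cross-validation on the training folds only.
        model = make_pipeline(StandardScaler(),
                              RidgeCV(alphas=alphas, cv=20))
        model.fit(X[train_idx], y[train_idx])
        # Each figure's prediction comes from a model trained on
        # all other public figures.
        preds[test_idx] = model.predict(X[test_idx])
    return preds  # correlate with y to obtain raw accuracy
```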

Like all measures, human ratings include some error. The split-half reliability of the ratings of the six attributes we measured ranges from 0.79 to 0.88; this range serves as a benchmark for the highest accuracy a predictive model could achieve. Given that our interest lies in accurately predicting actual perceived personalities—not imperfect proxies—we adjusted the correlations using the square root of each scale’s reliability, a procedure known as correction for attenuation32. This adjustment enables a fairer comparison of the model’s performance across traits despite the differing levels of agreement among human raters. For transparency, we also report the raw, uncorrected values.
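As a minimal sketch, the adjustment divides each raw prediction–rating correlation by the square root of the corresponding scale’s split-half reliability; the values in the example are illustrative:

```python
import math

def disattenuate(r_raw: float, reliability: float) -> float:
    """Correct an observed correlation for attenuation caused by
    the unreliability of the human-rating criterion."""
    return r_raw / math.sqrt(reliability)

# For example, a raw accuracy of r = 0.70 on a scale with a
# split-half reliability of 0.79 yields 0.70 / sqrt(0.79) ≈ 0.79.
```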

Results

Figure 1 (green bars) shows that GPT-3’s embeddings accurately predicted human perceptions. The Pearson product-moment correlations between the models’ predictions and human ratings ranged from r = 0.78 for extraversion to r = 0.88 for openness, which translates into a Cohen’s d range of d = 2.49 (huge effect) to d = 3.75 (huge effect)33. Raw accuracy (i.e., the accuracy obtained without correcting for attenuation) was also high, ranging from r = 0.7 to r = 0.8. To put the models’ accuracy in perspective, consider the following well-known diagnostic accuracies: the accuracy of computed tomography when detecting metastases from head and neck cancer equals r = 0.64, and the accuracy of ultrasonography when detecting peripheral artery disease is r = 0.8334. In other words, public figures’ perceived personalities can be inferred from the GPT-3 embeddings of their names with an accuracy comparable to that with which some of their ailments can be diagnosed by modern medical tools. Moreover, given that individual human ratings predicted aggregate ratings with an accuracy of r = 0.56–0.66, the embeddings predict aggregate judgments better than individual judgments do.
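For reference, these effect sizes follow from the standard conversion between a correlation and Cohen’s d for two equal-sized groups, d = 2r / √(1 − r²); for example, r = 0.78 gives d = 1.56 / √(1 − 0.78²) ≈ 2.49.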

Figure 1

The model’s accuracy in predicting public figures’ perceived personality without any controls (green bars) and while controlling for likability and demographics (red bars). Confidence intervals are at the 95% level. Values in parentheses represent raw accuracy (uncorrected for attenuation). All correlations are significant at p < .001.

Predictions were more accurate for more popular public figures: the profile similarity between human ratings and model predictions correlated with the logarithm of the number of Wikipedia pageviews at r = 0.16 (see Supplementary Materials for details). This suggests that the model was more accurate for figures who, presumably, appeared more frequently in GPT-3’s training data.
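A sketch of this check, assuming matrices of human and predicted ratings with one row per figure and one column per trait, plus a vector of Wikipedia pageviews; the names are illustrative:

```python
import numpy as np
from scipy.stats import pearsonr

def popularity_accuracy_link(human, predicted, pageviews):
    """Correlate per-figure profile similarity with log popularity.

    human, predicted: (n_figures, n_traits) rating matrices
    pageviews: (n_figures,) Wikipedia pageview counts
    """
    # Profile similarity: the correlation, across traits, between a
    # figure's human-rated and predicted profiles.
    similarity = np.array([pearsonr(h, p)[0]
                           for h, p in zip(human, predicted)])
    return pearsonr(similarity, np.log(pageviews))
```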

Table 1 shows the top and bottom 10 public figures, arranged according to their predicted perceived personality traits (the full list is available at https://osf.io/854w2). It shows that the embedding-based predictions have high face validity. For example, individuals predicted to be perceived as the most open-minded, liberal, creative, and artistic (i.e., high openness) included mostly artists: Freddie Mercury, David Bowie, Michael Jackson, Lady Gaga, Jimi Hendrix, and Quentin Tarantino. In contrast, those predicted to be perceived as the most conservative and traditional (i.e., low openness) included mostly autocrats (Kim Jong-un and Kim Jong-il); conservative politicians (Richard Nixon, Mitt Romney, George H. W. Bush, Margaret Thatcher, Donald Trump, Dick Cheney, and George W. Bush); and Queen Elizabeth II.

Table 1 Top and bottom 10 public figures according to their predicted perceived traits.

A closer inspection of the names presented in Table 1 suggests a link between the models’ predictions and public figures’ professions. Gender seems to play a role, too, with women dominating the top of the perceived agreeableness ranking and entirely absent from its bottom. Moreover, personality perceptions are linked with likability: many of the least likable figures (e.g., Lee Harvey Oswald, Charles Manson, and both Kims) appear repeatedly at the socially undesirable (i.e., low) extremes of the perceived personality traits. As detailed in Table 2, public figures’ likability, birth year, and gender significantly correlate with perceived personality in our sample. Birth year, for example, correlates strongly and negatively with both human ratings of extraversion (r = −0.35) and GPT-3’s predictions (r = −0.44).

Table 2 Means, standard deviations, and correlations between human judgments, model predictions, and demographic variables.

Such links are not necessarily problematic, as they reflect actual phenomena. Studies show that both actual and perceived personality correlate with profession, gender, age, and likability35,36. Women, for example, tend both to be more agreeable than men37 and to be perceived as such7. People with desirable personalities are more likable, and likable people are perceived to have desirable personalities (the “personality halo effect”)38. Yet, such links also imply that demographics and likability alone could be used to estimate, with some accuracy, perceived personality.

Could the models predict perceived personality beyond what is explained by likability and demographics? To answer this question, we regressed the human perceptions of each personality trait on public figures’ likability and demographics. The residuals of these models represent perceived personality traits cleansed of the influence of these variables. Next, we predicted these residuals from the embeddings of public figures’ names using Ridge regression, as sketched below.
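A minimal sketch of this residualization step, assuming a complete covariates matrix holding likability, birth year, and coded gender; the names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def residualize(ratings, covariates):
    """Strip the variance explained by likability and demographics.

    ratings: (n_figures,) mean ratings of one perceived trait
    covariates: (n_figures, k) likability, birth year, gender codes
    """
    ols = LinearRegression().fit(covariates, ratings)
    # The residuals become the new target for the embedding-based
    # ridge regression described in the Methods.
    return ratings - ols.predict(covariates)
```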

The results presented in Fig. 1 (red bars) show that GPT-3 can accurately predict perceived personality even when controlling for demographics and likability. The accuracy decreased but remained very high, ranging from r = 0.53 for emotional stability to r = 0.70 for extraversion, which translates into a Cohen’s d range of d = 1.25 (very large effect) to d = 1.96 (very large effect). The raw (uncorrected) accuracies range from r = 0.43 to r = 0.61. To put these results in perspective, consider the following well-known diagnostic accuracies: the accuracy of dental X-rays when detecting between-tooth cavities equals r = 0.43, and the accuracy of ultrasound when detecting deep venous thrombosis equals r = 0.6034. In other words, even when controlling for demographics and likability, name embeddings allow for diagnosing public figures’ perceived personalities about as accurately as widely used medical tools diagnose dental cavities or venous thrombosis. For further comparison, the accuracy of models employing people’s language to predict their own self-reported Big Five scores is about r = 0.4039.

The regression models trained here can be further interpreted, as they span GPT-3’s semantic space, which is populated with interpretable words and concepts. For example, the line defined by the regression model predicting perceived extraversion stretches from the region of semantic space occupied by public figures perceived to be the most introverted to the region occupied by those perceived to be the most extraverted. To further interpret the models, we mapped the locations of 525 person-descriptive adjectives obtained from40 onto these regression lines; in other words, we computed the models’ predicted scores for these adjectives.
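In sketch form, “mapping” adjectives onto a trait’s regression line amounts to scoring their embeddings with the trained model, assuming the adjective embeddings come from the same GPT-3 space as the name embeddings; the names are illustrative:

```python
import numpy as np

def score_adjectives(model, adjective_embeddings, adjectives):
    """Project person-descriptive adjectives onto a trait's
    regression line by scoring their embeddings with the fitted
    (scaler + ridge) pipeline from the earlier sketch."""
    scores = model.predict(adjective_embeddings)  # one score per adjective
    order = np.argsort(scores)
    lowest = [adjectives[i] for i in order[:10]]    # minimize the prediction
    highest = [adjectives[i] for i in order[-10:]]  # maximize the prediction
    return lowest, highest
```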

The adjectives maximizing and minimizing the models’ predictions are presented in Table 3 (scores for all 525 adjectives are available at https://osf.io/854w2). These results are highly congruent with the definitions of the Big Five personality traits. For example, adjectives at the bottom of the extraversion scale include “quiet,” “lonely,” “depressed,” “boring,” “lonesome,” “thoughtful,” “withdrawn,” “soft-spoken,” “philosophical,” and “thinking,” while those at the top include “entertaining,” “hilarious,” “lively,” “glamorous,” “comical,” “sexy,” “energetic,” “playful,” and “good-humored.” It seems that public figures’ humor, rather than their sociability, is the most salient cue of extraversion for laypeople.

Table 3 The 10 person-descriptive adjectives maximizing and minimizing the predictions of a model trained to predict human perceptions.

Interestingly, these lists correctly captured behavioral correlates of personality. The regression models were trained on human responses to a 10-item personality questionnaire that never mentioned alcohol. Yet, they ranked “alcoholic” as the third (out of 525) adjective most characteristic of low perceived conscientiousness, the third most characteristic of low perceived emotional stability, and the 42nd most characteristic of low agreeableness. This is consistent with past research linking alcohol addiction with low conscientiousness, low agreeableness, and low emotional stability41. Moreover, the results reflect the correlation between a public figure’s likability and the desirability of their perceived personality (i.e., the “personality halo effect”): for example, “corrupt” was associated with undesirable (low) levels of agreeableness, emotional stability, and openness.

Discussion

Our results indicate that public figures’ perceived personality can be accurately predicted from their names’ locations in GPT-3’s semantic space. Our models remained accurate even when controlling for public figures’ demographics and overall likability. Moreover, the models showed high face validity, as revealed by examining the public figures predicted to score at the top or bottom of each trait, as well as the personality-descriptive adjectives occupying the models’ extremes.

These findings have multiple implications. First, they show that LLMs’ semantic spaces can be used to study and approximate people’s personality perceptions. This could be of interest to researchers and practitioners across disciplines ranging from political psychology to organizational behavior. Second, the research expands our understanding of word embeddings, which bear some similarity to human semantic memory31. They are known to encode words’ meanings42, including information about group stereotypes43,44,45. Our results show that they also capture individual differences, such as individuals’ perceived personality traits. Our studies add to the growing body of social science research utilizing LLMs. For example, recent studies have found that LLMs can predict the directional relationships between ideological attitudes46, approximate the voting choices of different social groups47, and mirror human behavior in economic games48 and reasoning tasks49, as well as pass theory of mind tests50.

Our studies focused on predicting the perceived personality of public figures with a sufficient presence in the sample used to train GPT-3. Yet, a similar approach could be used to measure the perceived personality of people absent from the training data: given a sample of text describing an individual, one could estimate that individual’s location in the model’s semantic space and convert it into perceived-personality scores using regression models trained on public figures. Another limitation of our approach is that it requires collecting human ratings to train the regression models. Yet, given the alignment between our models and personality-descriptive adjectives, similar results could likely be achieved without collecting human ratings. Instead, one could predict public figures’ perceived personality by estimating their distance from personality-descriptive adjectives, or by asking generative language models to describe a person using person-descriptive adjectives, as we did in our follow-up study51. People similar to “outgoing” and dissimilar to “shy,” for example, could be classified as extraverted (see the sketch below). Finally, the models’ predictions reflect the period covered by the training data: if people changed their minds about a given public figure, the change would not be reflected in the embeddings until the next model training cycle.
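A sketch of this rating-free alternative, assuming access to embeddings of a target’s name and of anchor adjectives from the same semantic space; the adjective pair is illustrative:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def anchor_score(target_emb, high_emb, low_emb):
    """Score a target by its relative proximity to trait-anchoring
    adjectives, e.g., "outgoing" (high pole) versus "shy" (low pole)."""
    return cosine(target_emb, high_emb) - cosine(target_emb, low_emb)

# A positive score suggests the target is perceived as closer to the
# extraverted pole than to the introverted pole.
```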

The feasibility of automatically extracting perceived traits exposes a potential privacy threat52. Word embeddings may contain information about traits that the target would prefer to keep private, and even if there are no explicit cues in the training data, models may still be able to infer intimate information. This mirrors privacy threats pertaining to other data types. For example, people are not very accurate when predicting others’ intimate traits from facial images or Facebook Likes and thus do not perceive such data as overly sensitive; yet, computer algorithms achieve high accuracy when extracting personality, political orientation, and even sexual orientation from these data sources52,53,54. The current results show that impressions of public figures’ intimate traits can easily be extracted from widely available LLMs. As the collective impression of an individual often correlates with their actual traits, this could amount to a potential threat to privacy17,55.