Main

Images increasingly pervade the information we consume and communicate daily. The number of images in online search engines has leapt from thousands to billions in just two decades2. Every day, millions of people view and download images from platforms such as Google and Wikipedia5,6, and millions more socialize through hyper-visual platforms such as Instagram, Snapchat and TikTok, which are based predominantly on the exchange of images. This growing trend is widely recognized by the tech and venture capital industries3,4, as well as by news agencies and advertisers who now rely more heavily on images to attract people’s attention online7,8. This trend is also reflected in the changing habits of the average American. A longitudinal survey from the American Academy of Arts and Sciences shows that the amount of time Americans spend reading text is steadily declining1, whereas the time they spend producing and viewing images continues to rise2,4. What consequences does this unprecedented shift towards visual content have on how we ‘see’ the world? At the dawn of photography, Frederick Douglass—esteemed writer and civil rights leader—forewarned of the potential for images to reinforce social biases at large, arguing in his 1861 lecture ‘Pictures and Progress’ that “the great cheapness and universality of pictures must exert a powerful though silent influence on the ideas and sentiment of present and future generations”19. Since Douglass’ time, the internet has made it only cheaper and easier to circulate images on a massive scale3,4, potentially intensifying the impact of their silent influence. In this study, we explore the impact of online images on the large-scale spread of gender bias.

Despite the swelling proliferation of online images, most quantitative research into online gender bias focuses on text13,15,20,21,22. Only a few recent studies examine gender bias in a small sample of Google images16,17,18, without comparing the prevalence of gender bias and its psychological impact across images and text. Yet numerous psychological studies suggest that images may provide an especially potent medium for the transmission of gender bias. Research into the ‘picture superiority effect’ shows that images are often more memorable and emotionally evocative than text9,10,23, and may implicitly underlie the comprehension of text itself11,12,24,25. Images also differ from text in the salience with which they present demographic information. A textual description of a person can easily minimize gender bias by leveraging gender-neutral terminology or by omitting references to gender. For example, the sentence ‘The doctor administered the test’ makes no mention of the doctor’s gender. By contrast, an image of a doctor directly transmits demographic cues that elicit perceptions of the doctor’s gender. In this way, images strengthen the salience of gender in the representation of social categories. These intrinsic differences between images and text point to the prediction that online images amplify gender bias, both in its statistical prevalence and in its psychological impact on internet users.

Comparing gender bias in images and text

In this study, we developed computational and experimental techniques for comparing gender bias and its psychological impact across massive online corpora of images and texts. Our main analyses compared images and text data from the world’s most popular search engine, Google. Our findings were replicated using more than half a million images and billions of words from Wikipedia and Internet Movie Database (IMDb)26,27,28 (Extended Data Figs. 1 and 2; see Supplementary Information sections A.1.1 and A.1.2 for details). We implemented our model at scale by examining the gender biases in images and texts associated with all 3,495 social categories drawn from WordNet, a canonical database of categories in the English language29. These categories include occupations—such as doctor, lawyer and carpenter—and generic social roles, such as neighbour, friend and colleague.

To measure gender bias in online images, we automatically retrieved the top 100 images from Google corresponding to each social category in Google Images (Extended Data Fig. 3; see ‘Data collection procedure for online images’ in Methods). Collecting 100 images for 3,495 categories yielded 349,500 images. In the Supplementary Information, we report analyses showing that our results held when we increased the number of images collected for each category, and when we used gender-specific Google searches for each category (for example, female doctor), which yielded an extra 491,169 images (Supplementary Figs. 1 and 2). The scale of our image dataset is orders of magnitude larger than that of prior studies of gender bias in Google Images, which have typically examined 50 occupations or fewer, using only a few thousand images in total16,17,18. Each search was implemented from a fresh Google account with no prior history to avoid the uncontrolled effects of Google’s recommendation algorithm, which customizes results based on browsing history30. Searches were run by ten distinct data servers in New York City. All image data were collected in August 2020. Our results were replicated when collecting Google images using Internet Protocol (IP) addresses from five further locations around the world: Amsterdam (the Netherlands), Bangalore (India), Frankfurt (Germany), Singapore (Singapore) and Toronto (Canada) (Supplementary Figs. 3 and 4).

To identify the gender of faces in each image, we hired a team of 6,392 human coders from Amazon Mechanical Turk (MTurk). The gender of each face was determined by identifying the majority (modal) gender classification selected by three unique coders who labelled faces as ‘female’, ‘male’ or ‘non-binary’ (2% of classification judgements indicated ‘non-binary’; these were excluded from our analyses). Our focus is not on how people self-identify in terms of gender. Rather, we focus on the gender that internet users perceive in online images. We replicated our findings using a canonical image dataset28 of 72,214 celebrities depicted across IMDb and Wikipedia (511,946 images), where each image is associated with the self-identified gender of the person depicted (Extended Data Fig. 2 and Supplementary Information section A.1.2). All coders were fluent English speakers based in the USA, and our results are robust to controlling for coder demographics and the rate of intercoder agreement (Supplementary Tables 1 and 2; see ‘Demographics of human coders’ in Methods). Coders reached unanimous agreement in their gender classifications for 91% of images. A standard chance-corrected measure of classification agreement (Gwet’s Agreement Coefficient, AC) indicates satisfactory intercoder reliability in our sample (Gwet’s AC1 = 0.48). For each category, we calculated the gender balance of the faces in its top 100 Google Image search results. We normalized this measure such that −1 indicates 100% female representation, 0 indicates perfect gender balance (50%/50%) and 1 indicates 100% male representation.

To measure gender bias in online texts, we leveraged word embedding models that construct a high-dimensional vector space based on the co-occurrence of words (for example, whether two words appear in the same sentence), such that words with similar meanings are closer in this vector space. Harnessing recent advances in natural language processing22,31, we identified a gender dimension in word embedding models that captures the extent to which each category co-occurs with textual references to either women or men. This method allows us to position each category along a −1 (female) to 1 (male) axis, such that categories closer to −1 are more commonly associated with women and those closer to 1 are more commonly associated with men (see ‘Constructing a gender dimension in word embedding space’ in Methods). We focus here on applying this method to the canonical word2vec model32 trained on the 2013 Google News corpus consisting of more than 100 billion words. Our results hold when comparing against our own word2vec model trained on a more recent sample of online news published between 2021 and 2023 (Extended Data Fig. 4). We also replicated our findings when comparing online images with a range of word embedding models, including Global Vectors for Word Representation (GloVe), Bidirectional Encoder Representations from Transformers (BERT), FastText, ConceptNet and Generative Pre-trained Transformer 3 (GPT-3), which vary in their dimensionality, their data sources (including Twitter and a random sample of the web) and the time period during which their training data were collected, ranging from 2013 to 2023 (Supplementary Table 3 and Supplementary Fig. 5).

Both our image-based and text-based measures capture the frequency with which each social category co-occurs with representations of each gender, along a −1 (female) to 1 (male) continuum, where 0 indicates equal association with each gender. To maximize the correspondence between our image-based and text-based measures, we apply minimum–maximum normalization to our text-based measure, so that −1 and 1 represent the most female and male categories, respectively, according to each method (results are robust to alternative normalization procedures; Supplementary Fig. 6). We were able to associate 2,986 social categories in WordNet with word embeddings in the Google News corpus, so we focus our comparisons on these categories (our image results are robust to including all 3,495 categories; Supplementary Fig. 7).

Using these measures, we quantify gender bias as a form of statistical bias along three dimensions. First, we examine the extent to which social categories are associated with a specific gender in images and texts. Second, we examine the extent to which women are represented, compared with men, across all social categories in images and texts. Third, we compare the gender associations in our image and text data with the empirical representation of women and men in public opinion and US census data on occupations. This allows us to test not only whether gender bias is statistically stronger in images than texts, but also whether this bias reflects a distorted representation of the empirical distribution of women and men in society.

Gender bias is stronger in images

To begin, we confirm that the gender associations for each social category are highly correlated across online images (Google Images) and texts (Google News) (P < 0.0001, r = 0.5, Fig. 1a, Pearson correlation, two-tailed, n = 2,986 categories), indicating shared patterns of gender representation across these sources. Yet the gender associations in images from Google Images are statistically more extreme than those in texts from Google News. Figure 1b shows that the magnitude of gender bias is significantly stronger in images than text for both female-skewed (P < 0.0001) and male-skewed categories (P < 0.0001) (Wilcoxon signed-rank test, n = 2,986 categories, two-tailed). This result holds when comparing only categories for which the gender associations agree across images, texts and human judgements (Extended Data Fig. 5). Figure 1c highlights this gap by showing the gender associations in these images and texts for an illustrative sample of occupations.

Fig. 1: Gender bias is more prevalent in online images (from Google Images) than in online texts (from Google News) for both male- and female-typed social categories.

a, The correlation between gender associations in images from Google Images and texts from Google News for all social categories (n = 2,986), organized by deciles. Our image-based measure captures the frequency of female and male faces associated with each category in Google Images (−1 means 100% female; 1 means 100% male). Our text-based measure captures the frequency at which each category is associated with men or women in the Google News corpus (−1 means 100% female; 1 means 100% male; measure is minimum–maximum normalized; ‘Constructing a gender dimension in word embedding space’). Data are shown as mean values, and error bars represent 95% confidence intervals. ***P = 2.2 × 10−16 (Pearson correlation, two-tailed). b, The strength of gender association in these online images and texts for all categories (n = 2,986), split into whether these categories are female- or male-skewed according to each measure separately. Box plots show the interquartile range (IQR), with whiskers extending to ±1.5 × IQR. c, The gender associations for a sample of occupations according to these online images and texts; this sample was manually selected to highlight the kinds of social categories and gender biases examined.


Yet we also find that, on average, women are underrepresented in images, compared with texts (Fig. 2). Figure 2a shows that texts from Google News exhibit a relatively weak bias towards male representation (average bias (µ) = 0.03, P < 0.0001), whereas this male bias is more than four times stronger in images from Google Images (µ = 0.14, P < 0.0001), marking a highly significant increase (mean difference = 0.11, P < 0.0001) (Wilcoxon signed-rank test, two-tailed, n = 2,986 categories). According to Google News, 56% of categories are male-skewed, whereas 62% are male-skewed according to Google Images (P < 0.0001, proportion test, two-tailed, n = 2,986 categories). The underrepresentation of women is accentuated when using a deep learning algorithm to classify gender in these online images (Supplementary Figs. 8–10). This inequality even persists when searching explicitly for ‘female’ and ‘male’ images of each category in Google (Supplementary Figs. 1 and 2).

Fig. 2: The underrepresentation of women is stronger in online images (from Google Images) than in online texts (from Google News), public opinion and US census data on occupations.

a, The distribution of gender associations for social categories (n = 2,986) in images from Google Images and texts from Google News. The image-based measure captures the frequency of female and male faces associated with each category in Google Image search results (−1 means 100% female; 1 means 100% male); the text-based measure captures the frequency with which each category is associated with men or women in the Google News corpus (−1 means 100% female; 1 means 100% male associations). The solid lines indicate the average gender association according to text (green) and images (purple). b, The correlation of gender associations, paired at the category level (n = 2,986), as measured by these online images and texts, as well as by internet users’ (n = 2,500) judgements of each category. Human coders indicated their beliefs about the gender representation of each category by moving a slider along the same −1 (female) to 1 (male) scale (horizontal axis shows the average human judgement across evenly spaced bins). Data points show mean values for each bin, and error bands show 95% confidence intervals for the fitted curve defined by a locally estimated scatterplot smoothing (LOESS)-smoothed regression (span = 0.75). c, The gender association of all matched occupations (n = 685) according to (1) textual patterns in Google News (green), (2) the empirical distribution of gender in the 2019 census by the US Bureau of Labor Statistics (grey) and (3) Google Images (purple). Data are shown as mean values and error bars show 95% confidence intervals calculated using a Student’s t-test (two-tailed).


Our findings continue to hold when controlling for (1) linguistic features of categories, such as ambiguity, word frequency and gender connotation (for example, uncle) (Supplementary Figs. 11 and 12 and Supplementary Table 4); (2) the method for constructing the gender dimension in embedding space (Supplementary Figs. 6 and 13–15); (3) the frequency at which each category is searched in Google Images across the USA (Supplementary Figs. 16 and 17 and Supplementary Table 5); (4) the number of faces (Supplementary Fig. 18) and images (Supplementary Fig. 19) associated with each category, and the number of categories examined (Supplementary Fig. 7); (5) the ranking of images in Google search results (Supplementary Fig. 19 and Supplementary Table 7); (6) whether faces are automatically cropped from images before they are classified by human annotators (Supplementary Fig. 20) or a deep learning classifier (Supplementary Fig. 9); (7) whether images repeat in and across searches (Supplementary Table 8); (8) the number of faces associated with each Google search (Supplementary Table 7); and (9) whether images contain photographed or animated people (Supplementary Table 8).

Although these analyses support our prediction that online gender bias is more prevalent in images than texts, an open question is whether online images present a biased representation of the empirical distribution of gender in society. Next, we show that online images exhibit significantly stronger gender bias than public opinion and 2019 US census data on occupations.

To compare our results with public opinion, we hired a separate panel of 2,500 coders from MTurk who used the same −1 (female) to 1 (male) scale to provide their opinions about the gender they most associate with each category in our dataset (see ‘Collecting human judgements of social categories’ in Methods). Although both our image and text measures are highly predictive of gender associations in public opinion, Fig. 2b shows that texts significantly underestimate male bias in public opinion (by −0.084 on average, P < 0.001), whereas images significantly overestimate it (by 0.025 on average, P < 0.001) (Wilcoxon signed-rank test, two-tailed, n = 2,986 categories).

We also compare our measures with the frequency of genders across occupations according to the 2019 census by the US Bureau of Labor Statistics (n = 685 occupations could be matched between our data and the census). Figure 2c shows that, according to texts from Google News, the gender association of these occupations is neutral (µ = 0, P = 0.65) and significantly less male than the census (census µ = 0.08, P < 0.001) and Google Images (images µ = 0.15, P < 0.001) (Wilcoxon signed-rank test, two-tailed, n = 685 occupations). By contrast, although these occupations are male-skewed in both the census and Google Images, the same occupations are significantly more biased towards male representation in Google Images (mean difference = 0.07, P < 0.001, Wilcoxon signed-rank test, two-tailed, n = 685 occupations). Comparing images and texts separately for female- and male-typed occupations reinforces these findings (Supplementary Fig. 21).

Testing psychological effects of images

What consequences do these biases in online images have on internet users? Here we report the results of a preregistered experiment designed to test the impact of online images on gender bias in people’s beliefs (‘Data availability’). In this experiment, we recruited a nationally representative sample of US participants from the online platform Prolific (n = 450), who were tasked with using Google to search for descriptions of occupations relating to science, technology and the arts (Extended Data Fig. 6; see ‘Participant pool’ in Methods). A total of 423 participants completed the task. Each participant used Google to retrieve descriptions of 22 randomly selected occupations from a set of 54 (see ‘Participant experience’ in Methods). Participants were randomized into either (1) the Text condition, in which participants used Google News to search for and upload textual descriptions of these occupations, or (2) the Image condition, in which participants used Google Images to search for and upload images of occupations. After uploading the description for each occupation, each participant was asked to rate which gender they most associate with the occupation being described, using a −1 (female) to 1 (male) scale. To evaluate these experimental effects, participants were also randomized into the Control condition that used the same task design, except that participants used Google to search for and upload either images or textual descriptions of basic, unrelated categories (for example, apple and guitar) before rating the gender they associate with each occupation. In the Supplementary Information, we report the results of an extra condition in which a separate randomized group of participants were tasked with searching for textual descriptions using the generic Google search bar rather than the Google News search bar; altering the search bar had no effect on the outcomes (Supplementary Fig. 22). Across all conditions, our main outcome variable of interest is the absolute strength of participants’ gender associations for each occupation.

After completing the search task for all occupations, participants undertook an implicit association test (IAT)33, a standard method in psychology for detecting implicit biases (see ‘Measuring implicit bias using the IAT’ in Methods). We adopted an IAT designed to detect the implicit bias towards associating women with liberal arts and men with science (Extended Data Figs. 7 and 8), because prior work demonstrates the ability of this IAT to predict human judgements and behaviours34,35 relating to a consequential pattern of inequality in industry and academic institutions36,37. We administered the IAT to participants immediately after the experiment, and 3 days later. Participants’ implicit bias was measured using the standard IAT D score33; positive D scores indicate that participants are faster at associating women with liberal arts and men with science. We acknowledge important continuing debate about the reliability of the IAT38,39,40. Our specific choice of IAT is supported by (1) prior work demonstrating its stable results across decades34 and (2) a separate preregistered study we conducted that yielded consistent results with a similar design (Methods). We note, however, that the distribution of participants’ implicit bias scores was less stable across our preregistered studies than the distribution of participants’ explicit bias scores. Given these considerations, we view our implicit bias results as suggestive and emphasize our measure of participants’ explicit bias as our primary and most robust outcome of interest.

We begin by examining the extent of gender bias in the descriptions participants uploaded. A team of annotators labelled each textual description as female, male or neutral on the basis of whether it used female or male pronouns or names to describe the occupation (for example, a description referring to a ‘doctor’ as ‘he’ would be coded as ‘male’); textual descriptions were identified as neutral if they did not ascribe a particular gender to the occupation. Similarly, a team of annotators labelled the gender of the focal face in each uploaded image as female, male or neutral; images were coded as neutral if they contained no face or an undecipherable face. Then, for each occupation, we calculated the gender balance of the descriptions provided by participants by computing the average gender association across all descriptions. This approach compares gender associations across images and texts without relying on word embedding models, while also ensuring that the images and texts being compared were collected by users during the same time period.

Images amplify explicit gender bias

Consistent with our observational results, Fig. 3a shows that the descriptions participants uploaded were significantly more gendered in the Image condition than in the Text condition (mean difference = 0.42, P < 0.0001, Wilcoxon signed-rank test, two-tailed). Figure 3b shows that exposure to more gendered stimuli in the Image condition led participants to report significantly stronger explicit gender associations than those in the Text (mean difference = 0.06, P < 0.001) and Control (mean difference = 0.06, P < 0.001) conditions, whereas there was no significant difference between those in the Text and Control conditions (mean difference = 0.001, P = 0.56) (Wilcoxon signed-rank test, two-tailed; Wilcoxon equivalence test, P < 0.05 for all bounds greater than or equal to |0.11|, n = 54 occupations). For example, participants in the Text condition rated the category ‘model’ as female-skewed (µ = −0.32), but the female skew of this rating nearly doubled in intensity among participants in the Image condition (µ = −0.62). These findings hold when controlling for the number of online sources that participants encountered, the amount of time they spent evaluating descriptions and participants’ gender (Supplementary Fig. 23 and Supplementary Tables 9–11). Notably, the gender associations in participants’ uploads and self-reported beliefs are highly correlated with the gender associations detected for the same occupations in our observational analyses of Google Images and textual data from Google News (Extended Data Fig. 9).

Fig. 3: Googling for images rather than textual descriptions of occupations amplifies gender bias in people’s beliefs.

a–f, Participants (n = 423) from a nationally representative sample were randomized to one of the following: the ‘Image’ condition, in which they googled for images of occupations; the ‘Text’ condition, in which they googled for textual descriptions of occupations from Google News; or the ‘Control’ condition, in which they googled for either image-based or text-based descriptions of random categories (for example, ‘apple’) unrelated to occupations. The green, purple and dotted vertical lines indicate the mean results for the Text, Image and Control conditions, respectively. a, The average absolute strength of the gender associations in participants’ uploads for each occupation (n = 54; averaged at the occupation level) in both the Text and Image conditions (not applicable to the Control condition). b, The average absolute strength of the gender associations that participants reported for each occupation (n = 54; averaged at the occupation level) in each condition. c, The linear correlation between the average gender association of the descriptions that participants uploaded and the average gender association they explicitly reported for each occupation, coloured by condition. ***P = 2.2 × 10−16 (Pearson correlation, two-tailed). d, The correlation between the average strength of the gender association of the descriptions that participants uploaded and the average strength of the gender association they explicitly reported for each occupation, coloured by condition. ***P = 6.2 × 10−11. e, The implicit gender bias (D score) that participants (n = 405) exhibited in each condition. f, The correlation between the strength of participants’ self-reported gender associations for each occupation and their implicit bias (D score) towards associating women with liberal arts and men with science (n = 9,167 observations across all participants). Error bars show 95% confidence intervals calculated using a Student’s t-test (two-tailed).


Images prime gender bias more strongly

These findings suggest that exposure to gendered descriptions in the Image condition more strongly primed participants’ explicit gender ratings of occupations. This priming mechanism is supported by Fig. 3c, which shows a high correlation between the gender associations in the descriptions that participants uploaded and the gender associations in their own explicit gender ratings across occupations (r = 0.79, P < 0.0001), and by Fig. 3d, which shows a strong correlation between the absolute strength of gender associations in participants’ uploads and the absolute strength of the average gender associations they explicitly reported across occupations (r = 0.56, P < 0.0001) (Pearson correlation, two-tailed, n = 54 occupations). These results hold across occupations for both the Image and the Text conditions (Supplementary Table 12).

We found further evidence suggesting that images differ from text not only in the prevalence of gender bias they contain, but also in their ability to prime gender bias in people’s beliefs, holding prevalence constant (Extended Data Fig. 10). Participants who uploaded gendered images explicitly reported significantly stronger gender bias (µ = 0.41) than those who uploaded gendered textual descriptions of the same occupations (µ = 0.35; mean difference = 0.06, P < 0.0001, t = 4.58, Student’s t-test, two-tailed, n = 54 occupations). This holds even when controlling for the amount of gender bias in the distribution of images and texts to which participants were exposed (Supplementary Table 13). Thus, even when gender was salient in both text and images, exposure to images led to stronger bias in people’s self-reported beliefs about the gender of occupations.

Images amplify implicit gender bias

Finally, we report suggestive results indicating that extended exposure to online images may have also amplified participants’ implicit gender bias. Participants across all conditions exhibited significant implicit bias towards associating men with science and women with liberal arts (P < 0.0001 in all conditions, Wilcoxon signed-rank test, two-tailed, n = 423). Yet Fig. 3e shows that participants in the Image condition exhibited stronger implicit bias. There was no significant difference between participants’ implicit bias in the Text and Control conditions (mean difference = 0.06, P = 0.24, Wilcoxon rank-sum test, two-tailed; Wilcoxon equivalence test, P < 0.05 for all bounds greater than or equal to |0.13|). However, participants in the Image condition exhibited significantly stronger implicit bias than those in the Control condition (P = 0.005) (mean difference = 0.11, Wilcoxon rank-sum test, two-tailed). The difference in implicit bias between the Image and Text conditions did not reach conventional statistical significance (mean difference = 0.05, P = 0.09, Wilcoxon rank-sum test, two-tailed; Wilcoxon equivalence test, P < 0.05 for bounds greater than or equal to |0.14|). Across conditions, we find a clear correlation between the strength of participants’ self-reported gender associations and the strength of their implicit gender bias, both of which are greater in the Image condition (Fig. 3f; P < 0.0001, Jonckheere–Terpstra test = 19,382,281, two-tailed); this result is robust to a range of statistical controls (Supplementary Table 14). Notably, only participants in the Image condition exhibited significantly stronger implicit bias than control participants 3 days after the experiment (Supplementary Table 15), indicating enduring effects.

Conclusion

The rise of images in popular internet culture may come at a critical social cost. We have found that gender bias online is more prevalent and more psychologically potent in images than in text. The growing centrality of visual content in our daily information diets may exacerbate gender bias by magnifying its digital presence and deepening its psychological entrenchment. This problem is expected to affect the well-being, social status and economic opportunities not only of women, who are systematically underrepresented in online images, but also of men in female-typed categories such as care-oriented occupations41,42.

Our findings are especially alarming given that image-based social media platforms such as Instagram, Snapchat and TikTok are surging in popularity, accelerating the mass production and circulation of images. In parallel, popular search engines such as Google are increasingly incorporating images into their core functionality, for example, by including images as a default part of text-based searches43. Perhaps the apex of these developments is the widespread adoption of text-to-image artificial intelligence (AI) models that allow users to automatically generate images by means of textual prompts, further accelerating the production and circulation of images. Current work identifies salient gender and racial biases in the images that these AI models generate44, signalling that they may also intensify the large-scale spread of social biases. Consistent with related studies45, our work suggests that gender biases in multimodal AI may stem in part from the fact that they are trained on public images from platforms such as Google and Wikipedia, which are rife with gender bias according to our measures.

A promising direction for future research is to investigate the social and algorithmic processes contributing to bias in online images, pertaining not only to gender, but also to race and other demographic dimensions. The Google images we examine stem from various sources, with the most common source being personal blogs, followed by business, news and stock photo websites (Supplementary Fig. 24). The gender bias we observe seems to be driven partly by content that internet users choose to display on their blogs, and also by audiences’ preferences for which news to consume or images to purchase. Our supplementary results regarding celebrities on IMDb and Wikipedia (Extended Data Fig. 2) reflect extra contributing factors relating to status dynamics and hiring biases in entertainment media. In all cases, the human preference for familiar, prototypical representations of social categories is likely to play a role in perpetuating these biases46,47. We further anticipate that the study of online bias will benefit from extending our multimodal framework to analyse other modes of communication, such as audio and video, and to compare human and AI-generated content.

To keep pace with the evolving landscape of bias online, it is important for computational social scientists to expand beyond the analysis of textual data to include other content modalities that offer distinct ways of transmitting cultural information. Indeed, decades of research maintain that images lie at the foundation of human cognition11,12,25,48 and may have provided the first means of human communication and culture24,49. It is therefore difficult to imagine how the science of human culture can be complete without a multimodal framework. Exploring the implications of an image-centric social reality for the evolution of human cognition and culture is a ripe direction for future research. Our study identifies one of many implications of this cultural shift concerning the amplification of social bias, stemming from the salient way in which images present demographic information when depicting social categories. Addressing the societal impact of this ascending visual culture will be essential in building a fair and inclusive future for the internet, and developing a multimodal approach to computational social science is a crucial step in this direction.

Methods

Here we outline the computational and experimental techniques we use to compare gender bias in online images and texts. We begin by describing the methods of data collection and analyses developed for the observational component of our study. Then we detail the study design deployed in our online search experiment. The preregistration for our online experiment is available at https://osf.io/3jhzx. Note that this study is a successful replication of a previous study with a nearly identical design, except that the original study did not include a control condition or the additional versions of the text condition; the preregistration of the previous study is available at https://osf.io/26kbr.

Observational methods

Data collection procedure for online images

Our crowdsourcing methodology consisted of four steps (Extended Data Fig. 1). First, we gathered all social categories in WordNet, a canonical lexical database of English. WordNet contained 3,495 social categories, including occupations (such as ‘physicist’) and generic social roles (such as ‘colleague’). Second, we collected the images associated with each category from both Google and Wikipedia. Third, we used Python’s OpenCV, a popular open-source computer vision library, to extract the faces from each image; this algorithm automatically isolates each face and extracts a square including the entire face and minimal surrounding context. Using OpenCV to extract faces helped us to ensure that each face in each image was separately classified in a standardized manner, and to avoid subjective biases in coders’ decisions about which face to focus on and categorize in each image. Fourth, we hired 6,392 human coders from MTurk to classify the gender of the faces. Following earlier work, each face was classified by three unique annotators16,17, so that the gender of each face (‘male’ or ‘female’) could be identified based on the majority (modal) gender classification across three coders (we also gave coders the option of labelling the gender of faces as ‘non-binary’, but this option was only chosen in 2% of cases, so we excluded these data from our main analyses and recollected all classifications until each face was associated with three unique coders using either the ‘male’ or the ‘female’ label). Although coders were asked to label the gender of the face presented, our measure is agnostic to which features the coders used to determine their gender classifications; they may have used facial features, as well as features relating to the aesthetics of expressed gender such as hair or accessories. Each search was implemented from a fresh Google account with no prior history. Searches were run in August 2020 by ten distinct data servers in New York City. This study was approved by the Institutional Review Board at the University of California, Berkeley, and all participants provided informed consent.
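To illustrate the face-extraction step described above, the following minimal sketch uses OpenCV’s bundled Haar cascade face detector; the specific detector, parameters and padding shown here are illustrative assumptions rather than the exact settings of our pipeline.

```python
# Minimal sketch of the face-extraction step (illustrative settings only).
import os
import cv2

def extract_faces(image_path, out_dir, pad=0.1):
    """Detect faces in an image and save each one as a separate crop."""
    img = cv2.imread(image_path)
    if img is None:
        return []
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = []
    for i, (x, y, w, h) in enumerate(faces):
        # Pad the bounding box slightly so the entire face is retained
        # with minimal surrounding context.
        m = int(pad * max(w, h))
        x0, y0 = max(x - m, 0), max(y - m, 0)
        x1 = min(x + w + m, img.shape[1])
        y1 = min(y + h + m, img.shape[0])
        out_path = os.path.join(out_dir, f"face_{i}.jpg")
        cv2.imwrite(out_path, img[y0:y1, x0:x1])
        crops.append(out_path)
    return crops
```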

To collect images from Google, we followed earlier work by retrieving the top 100 images that appeared when using each of the 3,495 categories to search for images using the public Google Images search engine16,17,18 (Google provides roughly 100 images for its initial search results). To collect images from Wikipedia, we identified the images associated with each social category in the 2021 Wikipedia-based Image Text Dataset (WIT)27. WIT maps all images across Wikipedia to textual descriptions on the basis of the title, content and metadata of the active Wikipedia articles in which they appear. WIT contained images associated with 1,523 social categories from WordNet across all English Wikipedia articles (see Supplementary Information section A.1.1 for details on our Wikipedia analysis). The coders identified 18% of images as not containing a human face; these were removed from our analyses. We also asked all annotators to complete an attention check, which involved choosing the correct answer to the common-sense question “What is the opposite of the word ‘down’?” from the following options: ‘Fish’, ‘Up’, ‘Monk’ and ‘Apple’. We removed the data from all annotators who failed an attention check (15%), and we continued collecting classifications until each image was associated with the judgements of three unique coders, all of whom passed the attention check.
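The sketch below illustrates how the modal (majority) gender classification and the per-category gender-balance score described above can be computed from coder labels; the function names and example labels are illustrative.

```python
# Sketch of the modal (majority) gender classification and the per-category
# gender-balance score on the -1 (all female) to 1 (all male) scale.
from collections import Counter

def modal_gender(labels):
    """Return the majority label among three coder judgements."""
    return Counter(labels).most_common(1)[0][0]

def gender_balance(face_genders):
    """0 indicates a 50/50 split; -1 all female; 1 all male."""
    n_male = sum(1 for g in face_genders if g == "male")
    n_female = sum(1 for g in face_genders if g == "female")
    if n_male + n_female == 0:
        return None  # no classifiable faces for this category
    return (n_male - n_female) / (n_male + n_female)

# Illustrative example: four faces, each labelled by three coders.
coder_labels = [["male", "male", "female"], ["male", "male", "male"],
                ["female", "female", "female"], ["male", "female", "male"]]
faces = [modal_gender(labels) for labels in coder_labels]
print(gender_balance(faces))  # 0.5 -> male-skewed category
```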

Collecting human judgements of social categories

We hired a separate sample of 2,500 human coders from MTurk to complete a survey study in which they were presented with social categories (five categories per task) and asked to evaluate each category by means of the following question (each category was assessed by 20 unique human coders): “Which gender do you most expect to belong to this category?” This was answered as a scalar with a slider ranging from −1 (females) to 1 (males). All MTurkers were prescreened such that only US-based MTurkers who were fluent in English were invited to participate in this task.

Demographics of human coders

The human coders were all adults based in the USA who were fluent in English. Supplementary Table 1 indicates that our main results are robust to controlling for the demographic composition of our human coders. Among our coders, 44.2% identified as female, 50.6% as male and 3.2% as non-binary; the remainder preferred not to disclose. In terms of age, 42.6% identified as being 18–24 years, 22.9% as 25–34, 32.5% as 35–54, 1.6% as 55–74 and less than 1% as more than 75. In terms of race, 46.8% identified as Caucasian, 11.6% as African American, 17% as Asian, 9% as Hispanic and 10.3% as Native American; the remainder identified as either mixed race or preferred not to disclose. In terms of political ideology, 37.2% identified as conservative, 33.8% as liberal, 20.3% as independent and 3.9% as other; the remainder preferred not to disclose. In terms of annual income, 14.3% reported making less than US$10,000, 33.4% reported US$10,000–50,000, 22.7% reported US$50,000–75,000, 14.9% reported US$75,000–100,000, 10.5% reported US$100,000–150,000, 2.8% reported US$150,000–250,000 and less than 1% reported more than US$250,000; the remainder preferred not to disclose. In terms of the highest level of education acquired by each annotator, 2.7% selected ‘Below High School’, 17.5% selected ‘High School’, 29.2% selected ‘Technical/Community College’, 34.5% selected ‘Undergraduate degree’, 14.8% selected ‘Master’s degree’ and less than 1% selected ‘Doctorate degree’; the remainder preferred not to disclose.

Constructing a gender dimension in word embedding space

Our method for measuring gender associations in text relies on the fact that word embedding models use the frequency of co-occurrence among words in text (for example, whether they occur in the same sentence) to position words in an n-dimensional space, such that words that co-occur together more frequently are represented as closer together in this n-dimensional space. The ‘embedding’ for a given word refers to the specific position of this word in the n-dimensional space constructed by the model. The cosine distance between word embeddings in this vector space provides a robust measure of semantic similarity that is widely used to unpack the cultural meanings associated with categories13,22,31. To construct a gender dimension in word embedding space, we adopt the methodology recently developed by Kozlowski et al.22. In their paper, Kozlowski et al.22 construct a gender dimension in embedding space along which different categories can be positioned (for example, their analysis focuses on types of sport). They start by identifying two clustered regions in word embedding space corresponding to traditional representations of females and males, respectively. Specifically, the female cluster consists of the words ‘woman’, ‘her’, ‘she’, ‘female’ and ‘girl’, and the male cluster consists of the words ‘man’, ‘his’, ‘he’, ‘male’ and ‘boy’. Then, for each of the 3,495 social categories in WordNet, we calculated the average cosine distance between this category and both the female and the male clusters. Each category, therefore, was associated with two numbers: its cosine distance with the female cluster (averaged across its cosine distance with each term in the female cluster), and its cosine distance with the male cluster (averaged across its cosine distance with each term in the male cluster). Taking the difference between a category’s cosine distance with the female and male clusters allowed each category to be positioned along a −1 (female) to 1 (male) scale in embedding space. The category ‘aunt’, for instance, falls close to −1 along this scale, whereas the category ‘uncle’ falls close to 1 along this scale. Of the categories in WordNet, 2,986 of them were associated with embeddings in the 300-dimensional word2vec model of Google News, and could therefore be positioned along this scale. All of our results are robust to using different terms to construct the poles of this gender dimension (Supplementary Fig. 18). However, our main analyses use the same gender clusters as ref. 22.
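As an illustration of this procedure, the sketch below positions a category along the female (−1) to male (1) dimension using the pretrained Google News word2vec vectors; the gender clusters follow the description above, whereas the use of gensim’s downloader interface and the sign convention are assumptions made for illustration.

```python
# Sketch of positioning a category along a female-to-male dimension in
# word2vec space. Loading the pretrained vectors downloads ~1.6 GB.
import gensim.downloader as api

FEMALE = ["woman", "her", "she", "female", "girl"]
MALE = ["man", "his", "he", "male", "boy"]

model = api.load("word2vec-google-news-300")  # pretrained Google News vectors

def gender_score(category):
    """Positive values indicate a male association, negative a female one."""
    sim_female = sum(model.similarity(category, w) for w in FEMALE) / len(FEMALE)
    sim_male = sum(model.similarity(category, w) for w in MALE) / len(MALE)
    # Equivalent to the difference between the category's average cosine
    # distance to the female cluster and to the male cluster.
    return sim_male - sim_female

print(gender_score("uncle"))  # positive (male-associated)
print(gender_score("aunt"))   # negative (female-associated)
```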

To compute distances between the vectors of social categories represented by bigrams (such as ‘professional dancer’), we used the Phrases class in the Gensim Python package, which provided a prebuilt function for identifying and calculating distances for bigram embeddings. This method works by identifying an n-dimensional vector of middle positions between the vectors corresponding separately to each word in the bigram (for example, ‘professional’ and ‘dancer’). This technique then treats this middle vector as the singular vector corresponding to the bigram ‘professional dancer’ and is thereby used to calculate distances from other category vectors. This same method was applied to the construction of embeddings for all bigram categories in all models.
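Continuing the previous sketch, the snippet below gives a simplified illustration of this ‘middle vector’ idea by averaging the constituent word vectors of a bigram; it bypasses Gensim’s Phrases machinery and should not be read as the exact implementation.

```python
# Simplified illustration: represent a bigram by the mean of its word
# vectors and reuse the same female-to-male scoring as above.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def bigram_gender_score(model, bigram, female=FEMALE, male=MALE):
    vec = np.mean([model[w] for w in bigram.split()], axis=0)
    sim_f = np.mean([cosine(vec, model[w]) for w in female])
    sim_m = np.mean([cosine(vec, model[w]) for w in male])
    return sim_m - sim_f

print(bigram_gender_score(model, "professional dancer"))
```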

To maximize the similarity between our text-based and image-based measures of gender association, we adopted the following three techniques. First, we normalized our textual measure of gender associations using minimum–maximum normalization, which ensured that a compatible range of values was covered by both our text-based and image-based measures of gender association. This is helpful because the distribution of gender associations for the image-based measure stretched to both ends of the −1 to 1 continuum as a result of certain categories being associated with 100% female faces or 100% male faces. By contrast, although the textual measure described above contains a −1 (female) to 1 (male) scale, the most female category in our WordNet sample has a gender association of −0.42 (‘chairwoman’), and the most male category has a gender association of 0.33 (‘guy’). Normalization ensures that the distribution of gender associations in the image- and text-based measures both equally cover the −1 to 1 continuum, so that paired comparisons between these scales (matched at the category level) can directly examine the relative ranking of a category’s gender association in each measure. Minimum–maximum normalization is given by the following equation:

$$\widetilde{{x}_{i}}=\frac{\left({x}_{i}-{x}_{\min }\right)}{\left({x}_{\max }-{x}_{\min }\right)}$$
(1)

where \(x_i\) represents the gender association of category i ([−1, 1]), \(x_{\min}\) represents the category with the lowest gender score, \(x_{\max}\) represents the category with the highest gender score, and \(\widetilde{{x}_{i}}\) represents the normalized gender association of category i. To preserve the −1 to 1 scale in applying minimum–maximum normalization, we applied this procedure separately for male-skewed categories (that is, all categories with a gender association above 0), such that \(x_{\min}\) represents the least male of the male categories and \(x_{\max}\) represents the most male of the male categories. We applied this same procedure to the female-skewed categories, except that, because the female scale is −1 to 0, \(x_{\min}\) represents the most female of the female categories and \(x_{\max}\) represents the least female. For this reason, after the 0–1 female scale was constructed, we multiplied the female scores by −1 so that −1 represented the most female of the female categories and 0 represented the least. We then appended the female-normalized (−1 to 0) and male-normalized (0 to 1) scales. Both the male and female scales before normalization contained categories with values within four decimal points of zero (|x| < 0.0001), such that this normalization technique had no effect of arbitrarily pushing certain categories towards 0. Instead, the above technique has the advantage of stretching out the text-based measure of gender association to ensure that a substantial fraction of categories reach all the way to the −1 female region and all the way to the 1 male region of the continuum, similar to the distribution of values for the image-based measure.
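The sketch below illustrates one straightforward reading of this normalization, applied separately to male-skewed and female-skewed categories so that the most male category maps to 1 and the most female category maps to −1; the gender scores for ‘nurse’ and ‘carpenter’ in the example are hypothetical.

```python
# Sketch of minimum-maximum normalization applied separately to
# male-skewed (0 to 1) and female-skewed (-1 to 0) categories.
def minmax_normalize(scores):
    """scores: dict mapping category -> raw gender association in [-1, 1]."""
    male = {c: s for c, s in scores.items() if s > 0}
    female = {c: -s for c, s in scores.items() if s < 0}  # work with magnitudes
    out = {c: 0.0 for c, s in scores.items() if s == 0}

    lo, hi = min(male.values()), max(male.values())
    for c, s in male.items():
        out[c] = (s - lo) / (hi - lo)     # least male -> 0, most male -> 1

    lo, hi = min(female.values()), max(female.values())
    for c, m in female.items():
        out[c] = -(m - lo) / (hi - lo)    # least female -> 0, most female -> -1
    return out

example = {"chairwoman": -0.42, "nurse": -0.20, "guy": 0.33, "carpenter": 0.10}
print(minmax_normalize(example))  # chairwoman -> -1.0, guy -> 1.0
```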

Experimental methods

Participant pool

For this experiment, a nationally representative sample of participants (n = 600) was recruited from the popular crowdsourcing platform Prolific, which provides a vetted panel of high-quality human participants for online research. No statistical methods were used to determine this sample size. A total of 575 participants completed the task, exhibiting an attrition rate of 4.2%. We only examine data from participants who completed the experiment. Our main results report the outcomes associated with the Image, Text and Control conditions (n = 423); in the Supplementary Information, we report the results of an extra version of the Text condition involving the generic Google search bar (n = 150; Supplementary Fig. 26). To recruit a nationally representative sample, we used Prolific’s prescreening functionality designed to provide a nationally representative sample of the USA along the dimensions of sex, age and ethnicity. Participants were invited to partake in the study only if they were based in the USA, fluent English speakers and aged more than 18 years. A total of 50.8% of participants were female (no participants identified as non-binary). All participants provided informed consent before participating. This experiment was run on 5 March 2022.

Participant experience

Extended Data Fig. 2 presents a schematic of the full experimental design. This experiment was approved by the Institutional Review Board at the University of California, Berkeley. In this experiment, participants were randomized to one of four conditions: (1) the Image condition (in which they used the Google Image search engine to retrieve images of occupations), (2) the Google News Text condition (in which they used the Google News search engine, that is, news.google.com, to retrieve textual descriptions of occupations), (3) the Google Neutral Text condition (in which they used the generic Google search engine, that is, google.com, to retrieve textual descriptions of occupations) and (4) the Control condition (in which they were asked at random to use either Google Images or the neutral (standard) Google search engine to retrieve descriptions of random, non-gendered categories, such as ‘apple’). Note that, in the main text, we report the experimental results comparing the Image, Control and Google News Text conditions; we present the results concerning the Google Neutral Text condition as a robustness test in the Supplementary Information (Supplementary Fig. 26).

After uploading a description for a given occupation, participants used a −1 (female) to 1 (male) scale to indicate which gender they most associate with this occupation. In this way, the scale participants used to indicate their gender associations was identical to the scale we used to measure gender associations in our observational analyses of online images and text. In the control condition, participants were asked to indicate which gender they associate with a given randomly selected occupation after uploading a description for an unrelated category. Participants in all conditions completed this sequence for 22 unique occupations (randomly sampled from a broader set of 54 occupations). These occupations were selected to include occupations from science, technology, engineering and mathematics, and the liberal arts. Each occupation that was used as a stimulus could also be associated with our observational data concerning the gender associations measured in images from Google Images and the texts of Google News. Here is the full preregistered list of occupations used as stimuli: immunologist, mathematician, harpist, painter, piano player, aeronautical engineer, applied scientist, geneticist, astrophysicist, professional dancer, fashion model, graphic designer, hygienist, educator, intelligence analyst, logician, intelligence agent, financial analyst, chief executive officer, clarinetist, chiropractor, computer expert, intellectual, climatologist, systems analyst, programmer, poet, astronaut, professor, automotive engineer, cardiologist, neurobiologist, English professor, number theorist, marine engineer, bookkeeper, dietician, model, trained nurse, cosmetic surgeon, fashion designer, nurse practitioner, art teacher, singer, interior decorator, media consultant, art student, dressmaker, English teacher, literary agent, social worker, screen actor, editor-in-chief, schoolteacher. The set of occupations that participants evaluated was identical across conditions.

Once each participant completed this task for 22 occupations, they were then asked to complete an IAT designed to measure the implicit bias towards associating men with science and women with liberal arts33,34,35,38. The IAT was identical across conditions (‘Measuring implicit bias using the IAT’). In total, the experiment took participants approximately 35 minutes to complete. Participants were compensated at a rate of US$15 per hour for their participation.

Measuring implicit bias using the IAT

The IAT in our experiment was designed using the iatgen tool33 (https://iatgen.wordpress.com/). The IAT is a psychological research tool for measuring mental associations between target pairs (for example, different races or genders) and a category dimension (for example, positive–negative, science–liberal arts). Rather than measuring what people explicitly believe through self-report, the IAT measures what people mentally associate and how quickly they make these associations. The IAT has the following design (description borrowed from iatgen)33: “The IAT consists of seven ‘blocks’ (sets of trials). In each trial, participants see a stimulus word on the screen. Stimuli represent ‘targets’ (for example, insects and flowers) or the category (for example, pleasant–unpleasant). When stimuli appear, the participant ‘sorts’ the stimulus as rapidly as possible by pressing with either their left or right hand on the keyboard (in iatgen, the ‘E’ and ‘I’ keys). The sides with which one should press are indicated in the upper left and right corners of the screen. The response speed is measured in milliseconds.” For example, in some sections of our study, a participant might press with the left hand for all male + science stimuli and with their right hand for all female + liberal arts stimuli.

The theory behind the IAT is that the participant will be fast at sorting in a manner that is consistent with one’s latent associations, which is expected to lead to greater cognitive fluency in one’s intuitive reactions. For example, the expectation is that someone will be faster when sorting flowers + pleasant stimuli with one hand and insects + unpleasant with the other, as this is (most likely) consistent with people’s implicit mental associations (example borrowed from iatgen). Yet, when the category pairings are flipped, people should have to engage in cognitive work to override their mental associations, and the task should be slower. The degree to which one is faster in one section or the other is a measure of one’s implicit bias.

In our study, the target pairs we used were ‘male’ and ‘female’ (corresponding to gender), and the category dimension referred to science–liberal arts. To construct the IAT, we followed the design used by Rezaei38. For the male words in the pairs, we used the following terms: man, boy, father, male, grandpa, husband, son, uncle. For the female words in the pairs, we used the following terms: woman, girl, mother, female, grandma, wife, daughter, aunt. For the science category, we used the following words: biology, physics, chemistry, math, geology, astronomy, engineering, medicine, computing, artificial intelligence, statistics. For the liberal arts category, we used the following words: philosophy, humanities, arts, literature, English, music, history, poetry, fashion, film. Extended Data Figs. 3–6 illustrate the four main IAT blocks that participants completed (as per standard IAT design, participants were also shown blocks 2, 3 and 4, with the left–right arrangement of targets reversed). Participants completed seven blocks in total, sequentially. The IAT instructions for Extended Data Fig. 3 state, “Place your left and right index fingers on the E and I keys. At the top of the screen are 2 categories. In the task, words and/or images appear in the middle of the screen. When the word/image belongs to the category on the left, press the E key as fast as you can. When it belongs to the category on the right, press the I key as fast as you can. If you make an error, a red X will appear. Correct errors by hitting the other key. Please try to go as fast as you can while making as few errors as possible. When you are ready, please press the [Space] bar to begin.” These instructions are repeated throughout all blocks in the task.

To measure implicit bias based on participants’ reaction times during the IAT, we adopted the standard approach used by iatgen. We combined the scores across all four blocks (blocks 3, 4, 6 and 7 in iatgen). Because some participants are simply faster than others, raw latencies add statistical ‘noise’ that reflects variance in overall reaction times. Thus, instead of comparing within-person differences in raw latencies, this difference is standardized at the participant level by dividing the within-person difference by a ‘pooled’ standard deviation, computed from the practice and critical blocks combined. This yields a D score. In iatgen, a positive D value indicates an association in the form of target A + positive, target B + negative (which in our case is male + science, female + liberal arts), whereas a negative D value indicates the opposite bias (target A + negative, target B + positive, which in our case is male + liberal arts, female + science), and a score of zero indicates no bias.
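The sketch below illustrates the core of this computation for a single participant, omitting the error penalties and latency trimming applied in the full scoring algorithm; the latencies shown are hypothetical.

```python
# Core of the D-score computation: within-person difference in mean latency
# between the incompatible and compatible pairings, divided by the standard
# deviation pooled across both sets of trials (trimming and error penalties
# from the full scoring algorithm are omitted).
import statistics

def d_score(compatible_rts, incompatible_rts):
    """RTs in milliseconds; positive D indicates faster responses on the
    'compatible' pairing (here, male + science / female + liberal arts)."""
    mean_diff = statistics.mean(incompatible_rts) - statistics.mean(compatible_rts)
    pooled_sd = statistics.stdev(compatible_rts + incompatible_rts)
    return mean_diff / pooled_sd

# Hypothetical latencies for one participant.
compatible = [612, 580, 655, 640, 590, 705, 633]
incompatible = [710, 768, 690, 820, 745, 702, 731]
print(round(d_score(compatible, incompatible), 2))  # positive -> implicit bias
```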

Our main experimental results evaluate the relationship between the participants’ explicit and implicit gender associations and the strength of gender associations in the Google images and textual descriptions they encountered during the search task. The strength of participants’ explicit gender associations is calculated as the absolute value of the number they input using the −1 (female) to 1 (male) scale after each occupation they classified (Extended Data Fig. 2). Participants’ implicit bias is measured by the D score of their results on the IAT designed to detect associations between men and science and women and liberal arts. To measure the strength of gender associations in the Google images that participants encountered, we calculated the gender balance of the faces uploaded across all participants who classified a given occupation. For example, we identified the responses of all participants who provided image search results for the occupation ‘geneticist’, and we constructed the same gender dimension as described in the main text, such that −1 represents 100% female faces, 0 represents an equal share of female and male faces and 1 represents 100% male faces. To identify the gender of the faces of the images that participants uploaded, we recruited a separate panel of MTurk workers (n = 500) who classified each face (there were 3,300 images in total). Each face was classified by two unique MTurkers; if they disagreed in their gender assignment, a third MTurk worker was hired to provide a response, and the gender identified by the majority was selected. We adopted an analogous approach to annotating the gender of the textual descriptions that participants uploaded in the text condition. These annotators identified whether each textual or visual description uploaded by participants was female (−1), neutral (0) or male (1). Each textual description was coded as male, female or neutral on the basis of whether it used male or female pronouns or names to describe the occupation (for example, referred to a ‘doctor’ as ‘he’); textual descriptions were identified as neutral if they did not ascribe a particular gender to the occupation described. We were then able to calculate the same measure of gender balance in the textual descriptions uploaded for each occupation as we applied in our image analysis.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.