Introduction

In certain regions and environments, women are still seen rather than heard. Gender inequality persists in how women's speech and opinions are received. For instance, women are challenged and interrupted more often than their male peers while presenting their arguments (Butler and Geis, 1990; A. Feldman and Gill, 2019; Jacobi and Schweers, 2017), have more of their talk-time taken by the audience while giving academic job talks (Blair-Loy et al., 2017), and make significantly fewer speeches than their male peers in parliament (Bäck and Debus, 2019; Bäck et al., 2014). Concomitantly, women report that their opinions are often diminished and do not matter as much (Miller, 2018). This raises the question of whether women's opinions are viewed less favorably than those of their male peers, especially in decision-making. This study explores whether people evaluate women's opinions (views and experiences shared about goods and services) differently from those of their male peers when making buying decisions.

Other people's opinions have long been valuable to individuals making buying decisions (Chakravarty et al., 2010; Chatterjee, 2001), especially in the active-evaluation phase of those decisions (Court et al., 2009). Wives have relied on peers' opinions when buying household goods and food products (Arndt, 1967; Katz and Lazarsfeld, 1966), patients have relied on other people's opinions when choosing physicians for medical care (S. P. Feldman and Spencer, 1965; Pechmann et al., 1989), and moviegoers have relied on the opinions of movie critics and friends (Chakravarty et al., 2010). Given this long reliance on others' opinions in buying decisions, and the steady progress toward gender parity in many areas over the years, it is unclear whether gender still plays a significant role in whose opinions are valued.

Individuals have long used gender as a judgment-making heuristic (Fiske, 1998; Kunda and Spencer, 2003; Wheeler and Petty, 2001), and previous research hints that gender-based differences may persist in the evaluation of opinions. For instance, writings by male authors are rated more highly than writings by female authors (Goldberg, 1968; Levenson et al., 1975), identical entrepreneurial pitches are more likely to receive investment when pitched by men than by women (Brooks et al., 2014), the same lectures are rated more highly when perceived to be written by male rather than female professors (Abel and Meltzer, 2007), and male-voiced computer speech and tutorials are rated more highly, considered more credible, and exert more influence on decision-making than female-voiced versions (E.-J. Lee, 2003, 2008; E. J. Lee et al., 2000; Morishima et al., 2001; Nass et al., 1997).

In the work reported here, over a series of three studies of online consumer-generated opinions on products (goods and services) in the United States (US), we test whether gender-based differences exist in the evaluation of product opinions. In particular, we test (1) whether people are less likely to value a woman's opinion on a product than a man's; and (2) whether the evaluations are likely to favor a specific gender group when the opinions are for products typically associated with that gender. We investigate both the search and experience goods contexts, and also check whether any observed differences in the evaluation of opinions may be driven by in-group bias. If any differences exist, we are not suggesting that they are intentional or stem from a conscious effort to undermine product opinions from any gender; rather, they may be due to implicit or unintended biases.

Online consumer-generated product opinions (hereinafter reviews) are well suited to our research. First, about 97 percent of consumers regularly or occasionally consult reviews, and about 85 percent trust them as much as personal recommendations (Brightlocal, 2017), making reviews an important source of information in buying decisions. Second, a significant percentage of retail shopping is now done online, with the US sales market share of online shopping now higher than that of general merchandise stores (Ouellette, 2020; Rooney, 2019), making the online shopping environment an ideal context for investigating gender-based differences in the evaluation of opinions.

In the rest of the paper, we first present the three studies with their results and then conclude with a discussion and conclusion section. Study 1 uses an experimental design, while studies 2 and 3 use field data retrieved from two online review platforms, Yelp.com and Amazon.com, respectively.

Study 1

The goal of this study is to experimentally test whether there are any gender-based differences in how individuals evaluate reviews contributed by women relative to those contributed by men, and to determine whether the evaluations are likely to favor a specific gender if the reviews are for products typically associated with that gender. The study was conducted on Amazon Mechanical Turk (MTurk) and had Institutional Review Board (IRB) approval.

Stimulus materials

Preparation of the stimuli involved creating product reviews (see Fig. 1 for examples) similar to those found on typical e-commerce sites, with slight modifications to suit the nature of the study. First, each review had either a female-looking or male-looking avatar placeholder (as can be seen in Fig. 1) to aid participants in inferring the gender of the review contributor. The avatar placeholders were matched with corresponding gendered names (e.g., Mary and Grace for the female-looking avatars, Richard and William for the male-looking avatars) placed at the top of the review. Although contributors can have their full names, aliases, or just first names appear in a typical product review, we used only first names to reduce ambiguity and facilitate easy gender inference. Second, we placed an image of the reviewed product in the stimuli, as shown in Fig. 1, to help participants identify the reviewed product; typical reviews on sites like Amazon.com may not have the product image placed right by the review text.

The products used in the stimuli were a mix of gender-typed (products traditionally associated with a specific gender) and non-gender-typed (products not associated with any specific gender) products. To arrive at this selection, 12 participants rated a list of 12 products on a 7-point scale indicating whether they traditionally associated the product with men or women (1 = extremely associated with men, 4 = gender neutral, 7 = extremely associated with women). The initial list of 12 products and the identification process were informed by the literature (Fugate and Phillips, 2010; Morrison and Shaffer, 2003). Products with ratings of 2 or lower were considered associated with men, those with ratings of 6 or higher were considered associated with women, and those with ratings between 3.5 and 4.5 were considered gender neutral. Three of the products were selected and used in the stimuli: toothbrush (gender neutral), baby care kit (associated with women), and tool kit (associated with men).

The variations for each product followed a 2 (positive review vs. negative review) × 2 (women contributed vs. men contributed) design. Although we did not intend to check for the effect of review valence in this study, we included the positive and negative reviews to rule out any potential effect driven by review valence. We pretested the stimuli to check whether participants could infer the gender of the review contributors: 16 participants were asked to identify the gender of the review contributor, and there was perfect agreement (Fleiss kappa = 1.00) on the gender of review contributors in the stimuli.
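
For concreteness, the classification rule and the agreement statistic can be reproduced with standard tools. The sketch below is a minimal Python example; the rating data are hypothetical placeholders, not our actual pretest responses.

```python
# Sketch: classify products by mean gender-association rating and check
# rater agreement with Fleiss' kappa. All ratings below are placeholders.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# 12 raters per product, 7-point scale (1 = men, 4 = neutral, 7 = women)
ratings = {
    "tool kit":      [1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1],
    "toothbrush":    [4, 4, 4, 4, 3, 4, 5, 4, 4, 4, 4, 4],
    "baby care kit": [7, 6, 7, 7, 6, 7, 6, 7, 7, 6, 7, 7],
}

def classify(mean_rating):
    # Thresholds from the study: <=2 men, >=6 women, 3.5-4.5 neutral
    if mean_rating <= 2:
        return "associated with men"
    if mean_rating >= 6:
        return "associated with women"
    if 3.5 <= mean_rating <= 4.5:
        return "gender neutral"
    return "excluded from selection"

for product, r in ratings.items():
    print(product, "->", classify(np.mean(r)))

# Fleiss' kappa expects an items x categories table of rater counts.
# For the gender-inference pretest: 16 raters, 2 gender categories.
counts = np.array([[16, 0], [0, 16], [16, 0]])  # perfect agreement
print(fleiss_kappa(counts))  # 1.0
```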

Fig. 1: Example of online review.

Contributed by (a) women and (b) men. For each of the online reviews in our study, only the avatar and name were changed to vary gender.

We created two filler reviews featuring gender-neutral products (orange juice and flip-flops). The filler reviews had contributor names that were less gendered (e.g., Sam, Remy) and were sandwiched between the treatment reviews in the experiment.

Participants

We recruited 216 adult participants (mean age = 40.6 years, SD = 10.8; 50 percent female) on Amazon MTurk to participate in the study for a payment of $2.80. We obtained informed consent from all participants. All study participants were located in the United States, had some experience with online shopping (mean = 12.6 years, SD = 3.8), and all indicated that they use reviews when making buying decisions. About 75 percent had some form of college education. We set the selection criteria on Amazon MTurk to randomly assign participants to the treatment groups such that we had a balanced sample of male and female participants in each group. We calculated our sample size based on an estimated effect size of d = 0.2, which required a sample size of ~160 participants for a study powered at 90 percent; in the end, 216 participants completed the study task.
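
For reference, a sample-size calculation of this kind can be sketched with statsmodels' power tools. The snippet below is illustrative only: the required n depends on the assumed test, alpha, and design, so a generic independent-samples calculation need not reproduce the ~160 figure, which reflects the study's specific design assumptions.

```python
# Sketch: required sample size for a two-group comparison at d = 0.2 and
# 90 percent power. Illustrative only; the study's repeated-measures
# design and assumptions determine the actual required n.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.2, power=0.9, alpha=0.05)
print(round(n_per_group))  # per-group n for an independent-samples t-test
```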

Procedure

The participants (n = 216) were each presented with a total of five reviews, one at a time, one for each of the five products, all from the same treatment group. In essence, a participant saw all treatment and filler product reviews; however, the reviews they saw were from one of the following four treatment groups: positive and written by women, positive and written by men, negative and written by women, and negative and written by men. Reviews in positions 2 and 4 were the filler reviews. For each set of five reviews a participant saw, we changed the names on the reviews to avoid all five appearing to be written by the same individual; thus, a participant would see all five reviews written by, say, women, but under different names. Participants were asked to read and evaluate each of the reviews. After reading each review, participants indicated their perception of (a) the helpfulness of the review and (b) the likelihood that the review would influence their purchase decision (all measured on a 9-point Likert scale). The survey items for review helpfulness were adapted from Yin et al. (2014) and can be seen in the Supplementary information. The Cronbach alpha for review helpfulness was 0.96, indicating high internal consistency among the three survey items. After the main task, participants completed an attention check.
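
The internal-consistency figure can be computed directly from the item responses. Below is a minimal sketch of Cronbach's alpha with a hypothetical response matrix (one column per survey item).

```python
# Sketch: Cronbach's alpha for a participants x items response matrix.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2D array, rows = respondents, columns = survey items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)  # variance of summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses to the three 9-point helpfulness items
responses = np.array([[8, 9, 8], [5, 5, 6], [7, 7, 7], [3, 4, 3]])
print(cronbach_alpha(responses))
```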

Analysis and results

We employed repeated-measures ANOVA in our analyses. The results, as shown in Fig. 2, revealed that reviews contributed by women were rated significantly lower in helpfulness than reviews contributed by men [meanwomen = 5.53, SDwomen = 2.25 vs. meanmen = 6.06, SDmen = 2.10, F(1,214) = 5.14, p < 0.05, \(\eta_p^2\) = 0.02, Huynh–Feldt-corrected for nonsphericity]. Similarly, the likelihood of a review influencing the purchase decision was also significantly lower when the review was contributed by women than by men [meanwomen = 5.29, SDwomen = 2.63 vs. meanmen = 5.89, SDmen = 2.38, F(1,214) = 4.23, p < 0.05, \(\eta_p^2\) = 0.01, Huynh–Feldt-corrected for nonsphericity]. We also examined whether the results hold for gender-typed products; that is, whether reviews for products typically associated with a specific gender group are perceived as more valuable when individuals from that gender group contribute them. For the product associated with men (tool kit), reviews contributed by men were perceived as significantly more helpful than those contributed by women [meanmen = 6.98, SDmen = 1.89 vs. meanwomen = 6.39, SDwomen = 1.71, F(1,214) = 5.78, p < 0.05, \(\eta_p^2\) = 0.03], while the difference in the likelihood to influence the purchase decision was only marginally significant [meanmen = 6.58, SDmen = 2.31 vs. meanwomen = 6.04, SDwomen = 2.42, F(1,214) = 2.84, p = 0.09, \(\eta_p^2\) = 0.01]. For the product associated with women (baby care kit), there was no significant difference in perceived helpfulness between reviews contributed by men and those contributed by women [meanmen = 5.40, SDmen = 2.35 vs. meanwomen = 4.88, SDwomen = 2.32, F(1,214) = 2.72, p > 0.1, \(\eta_p^2\) = 0.01], and the difference in the likelihood to influence the purchase decision was only marginally significant [meanmen = 5.52, SDmen = 2.39 vs. meanwomen = 4.93, SDwomen = 2.65, F(1,214) = 3.01, p = 0.08, \(\eta_p^2\) = 0.01].
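
For readers who wish to replicate this style of analysis, the sketch below runs a mixed-design ANOVA (contributor gender between subjects, product within subjects) on hypothetical data using the pingouin package; all column names and values are placeholders. Note that pingouin's default sphericity handling differs from the Huynh–Feldt correction reported above, so exactly matching our corrected p-values would require additional steps.

```python
# Sketch: mixed-design ANOVA on helpfulness ratings. Contributor gender is
# between subjects and product is within subjects; all data are hypothetical.
import pandas as pd
import pingouin as pg

products = ["toothbrush", "baby care kit", "tool kit"]
df = pd.DataFrame({
    "participant": [p for p in range(1, 7) for _ in products],
    "product": products * 6,
    "contributor_gender": ["woman"] * 9 + ["man"] * 9,
    "helpfulness": [5, 4, 6, 6, 5, 7, 7, 5, 7, 6, 6, 8, 7, 5, 8, 6, 5, 7],
})

aov = pg.mixed_anova(data=df, dv="helpfulness", within="product",
                     subject="participant", between="contributor_gender")
print(aov)  # reports F, p, and partial eta squared (np2) per effect
```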

Fig. 2: Helpfulness and likelihood to purchase for online reviews contributed by women vs. men in study 1.

Error bars represent SEs.

To rule out the possibility that the observed gender-based differences were driven by in-group bias, whereby men rate reviews contributed by men higher and women rate reviews contributed by women higher, we split the data by participant gender. The results, shown in Fig. 3, reveal that for male participants there were no significant differences between reviews contributed by women and by men in helpfulness [meanwomen = 5.38, SDwomen = 2.17 vs. meanmen = 5.77, SDmen = 2.06, F(1,106) = 1.46, p = 0.23, \(\eta_p^2\) = 0.01, Huynh–Feldt-corrected for nonsphericity] or in likelihood to influence the purchase decision [meanwomen = 5.31, SDwomen = 2.56 vs. meanmen = 5.51, SDmen = 2.36, F(1,106) = 0.24, p = 0.62, \(\eta_p^2\) = 0.00, Huynh–Feldt-corrected for nonsphericity]. For the female participants, however, there were significant differences, with reviews contributed by women rated lower than those contributed by men in helpfulness [meanwomen = 5.68, SDwomen = 2.31 vs. meanmen = 6.36, SDmen = 2.10, F(1,106) = 4.09, p < 0.05, \(\eta_p^2\) = 0.03, Huynh–Feldt-corrected for nonsphericity] and in likelihood to influence the purchase decision [meanwomen = 5.27, SDwomen = 2.70 vs. meanmen = 6.28, SDmen = 2.35, F(1,106) = 5.86, p < 0.05, \(\eta_p^2\) = 0.04, Huynh–Feldt-corrected for nonsphericity]. Although the results show no in-group bias, the gender-based differences appear to manifest more strongly among the women participants.

Fig. 3: Helpfulness and likelihood to purchase for reviews written by women vs. men, split by participant groups in study 1.

Error bars represent SEs.

Study 2

This study aims to examine whether the gender-based difference observed in how individuals evaluate reviews extends to experience goods and services, and to provide some external validity to the experimental study. To do this, we collected and analyzed review data from Yelp.com, a website that provides user ratings and textual reviews for businesses in the service industry, including restaurants, auto services, and home services, among others.

Data collection

We collected reviews posted between January 2015 and May 2015 in the nightlife category of a major city in the Southeastern United States. In total, we extracted 7626 reviews contributed by 3854 unique individuals. For each review, we collected the contributor's name, number of "useful" votes, and review rating. We also collected other contributor-specific and review-specific information, including the contributor's status, number of friends, and length of the review. Table 1 provides the list of all the variables and their descriptions, while Table 2 shows the summary statistics and Pearson correlations.

Table 1 Yelp data variable descriptions.
Table 2 Yelp data variable summary statistics and Pearson correlations.

Inferring review contributor’s gender

To infer a review contributor's gender, we applied machine-learning techniques. Specifically, we used the machine-learning toolkit "genderizeR" (Wais, 2016) to infer the gender of each review contributor from their first name; using an individual's name to infer their gender is an established approach (Atir and Ferguson, 2018; Ruzycki et al., 2019). The process involved matching a contributor's first name in our sample against an existing names database (Wais, 2016) and extracting the name's gender-probability estimate, i.e., the probability that the name belongs to a man or a woman. For example, a name with an 86 percent man gender-probability estimate implies an 86 percent chance that it belongs to a man. After extracting the gender-probability estimates for all the reviews, we dropped all reviews whose contributor's name gender-probability estimate was less than 99 percent. About 31.5 percent (n = 2399, nman = 902, nwoman = 1497) of the reviews were retained after this process. We validated the resulting gender labels by manually checking a randomly selected subsample. We coded the review contributor's gender as a dummy variable, "Women", taking the value one if the contributor is a woman and zero if a man.
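
genderizeR wraps the genderize.io name database; the Python sketch below illustrates the same lookup and the 99-percent retention rule. The names are placeholders, and production use would need batching and rate-limit handling.

```python
# Sketch: infer contributor gender from first names via genderize.io and
# keep only names whose gender probability is at least 0.99.
# Names are placeholders; the free API tier is rate-limited.
import requests

def infer_gender(first_name: str):
    resp = requests.get("https://api.genderize.io", params={"name": first_name})
    resp.raise_for_status()
    data = resp.json()  # e.g. {"name": "mary", "gender": "female", "probability": 1.0, ...}
    return data.get("gender"), data.get("probability") or 0.0

contributors = ["Mary", "Richard", "Sam"]
retained = []
for name in contributors:
    gender, prob = infer_gender(name)
    if gender and prob >= 0.99:
        retained.append((name, gender, int(gender == "female")))  # "Women" dummy

print(retained)
```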

Data analysis and results

Given the count nature of our dependent variable and its over-dispersion, we fit a negative binomial regression model to determine the effect of gender on the number of "useful" votes a review receives. To account for review heterogeneity between women and men review contributors, we created a matched sample that paired each review written by a woman with a similar review written by a man, and reran our analyses. The results are presented in Table 3. To check the robustness of the results, we also fit additional models, whose results can be found in Table S1 of the Supplementary information.
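
A minimal sketch of this specification follows, using simulated data in place of the Yelp sample; the variable names (useful_votes, women, friends, review_length) are hypothetical stand-ins for the variables described in Table 1, and the matching step is omitted.

```python
# Sketch: negative binomial regression of useful votes on contributor gender.
# Data are simulated with mild overdispersion; column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "women": rng.integers(0, 2, n),
    "friends": rng.poisson(50, n),
    "review_length": rng.poisson(120, n),
})
# Gamma-mixed Poisson draws produce negative-binomial-style counts
mu = np.exp(0.8 - 0.2 * df["women"] + 0.003 * df["friends"])
df["useful_votes"] = rng.poisson(mu * rng.gamma(2.0, 0.5, n))

model = smf.negativebinomial(
    "useful_votes ~ women + friends + review_length", data=df
).fit()
print(model.summary())
```

For the matched-sample analysis, each woman-contributed review would first be paired with a similar man-contributed review (e.g., by nearest-neighbor matching on the controls) before refitting the model.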

Table 3 Results of the relationship between gender and review usefulness votes.

The results indicate gender differences in the evaluation of reviews. From the coefficient of Women (β = −0.2297, p < 0.001) in column 1 of Table 3, we observe a significant negative effect of being written by a woman on the number of useful votes a review receives, in the absence of controls. This result is robust to the use of the matched sample, as seen in column 2 (β = −0.1805, p < 0.05). With the inclusion of controls in column 3 (full sample) and column 4 (matched sample), the results (β = −0.2040, p < 0.001 and β = −0.1506, p < 0.01, respectively) remained significant and directionally consistent. The estimated coefficient of the Women variable implies that online reviews written by women received, on average, 0.79 fewer useful votes than online reviews written by men. This result supports the finding in study 1, albeit in the context of services or experience products.
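
Because the negative binomial model is log-linear, the Women coefficient can also be read multiplicatively. The sketch below converts it to an incidence-rate ratio and shows how a count-scale difference of roughly 0.79 votes can arise; the baseline mean used is an assumed illustrative value, not a figure from the paper.

```python
# Sketch: multiplicative reading of the Women coefficient from Table 3.
import math

beta = -0.2040                  # Women coefficient, column 3 of Table 3
irr = math.exp(beta)            # incidence-rate ratio, about 0.82
print(f"women's reviews receive {1 - irr:.1%} fewer useful votes")

# Count-scale difference: baseline_mean * (irr - 1). The baseline mean of
# useful votes below is an assumed illustrative value, not from the paper.
baseline_mean = 4.3
print(round(baseline_mean * (irr - 1), 2))  # about -0.79 votes
```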

Study 3

In study 3, we again test for the presence of gender-based differences in the evaluation of reviews, as in study 1. This time, however, we use field data from an e-commerce website (Amazon.com) to further investigate whether the evaluations are likely to favor a specific gender if the reviews are for products typically associated with that gender. This study lends further external validity to the results reported in study 1 with respect to reviews of gender-typed products.

Data collection

Data for study 3 were obtained from Amazon.com, a popular e-commerce website that allows customers to post and rate product reviews. Amazon.com allows individuals to vote on customer-contributed reviews using the question, "Was this review helpful to you (yes/no)?" We collected data for all reviews posted in the beauty and home-improvement categories between January 1, 2014, and February 28, 2014. We chose these two categories because most of their products are gender-typed (beauty for women and home-improvement for men). For each review, we recorded the name of the review contributor, review rating, number of helpfulness votes, total votes, and other review-specific information. Table 4 provides the list of all the variables and their descriptions; Table 5 shows the summary statistics, including the breakdown by gender; and Table 6 shows the Pearson correlations.

Table 4 Amazon data variable descriptions.
Table 5 Summary statistics of Amazon data by gendered product type and full sample.
Table 6 Pearson correlations of Amazon data full sample.

Inferring review contributor’s gender and data processing

As in study 2, we employed machine-learning techniques to infer each review writer's gender and dropped all reviews whose contributor's name gender-probability estimate was less than 99 percent. Starting from the 15948 reviews in the sample (nbeauty = 8458, nhome-improvement = 7490), we further preprocessed the data by removing all reviews that received zero votes, as has been done in extant studies (Mudambi and Schuff, 2010; Salehan and Kim, 2016), leaving us with 3262 reviews (20.5 percent of the sample). The split across the women's (beauty) and men's (home-improvement) categories was 1759 and 1503 reviews, respectively.

The gender splits across categories were beauty [women (1280 reviews), men (479 reviews)] and home-improvement [women (359 reviews), men (1144 reviews)], as can be seen in Table 5. In total, there were 1639 reviews contributed by women and 1623 reviews contributed by men. We coded the review contributor's gender using a dummy variable, "Women", which takes the value one if the review contributor is a woman and zero if the contributor is a man. The dependent variable, helpfulness, was measured as the proportion of helpful votes out of the total votes received for each review; thus, a review that received 70 'helpful' votes out of a total of 100 votes would have a helpfulness value of 0.7.

Data analysis and results

Given that the dependent variable is a proportion between 0 and 1, we estimated a binomial regression with a logit link for the 3262 reviews in our sample. The results are presented in Table 7. To check the robustness of the results, we also fit additional models, whose results can be found in Table S2 of the Supplementary information.
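
A minimal sketch of this specification follows, using simulated data in place of the Amazon sample; the column names are hypothetical stand-ins for the variables in Table 4. statsmodels' GLM accepts a proportion outcome when the total votes are supplied as variance weights.

```python
# Sketch: binomial (logit-link) regression of the review helpfulness share.
# Data are simulated; column names are hypothetical stand-ins for Table 4.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "women": rng.integers(0, 2, n),
    "review_rating": rng.integers(1, 6, n),
    "total_votes": rng.integers(1, 50, n),
})
# True helpfulness probability with a negative effect of the Women dummy
p = 1 / (1 + np.exp(-(0.5 - 0.3 * df["women"])))
df["helpfulness"] = rng.binomial(df["total_votes"], p) / df["total_votes"]

model = smf.glm(
    "helpfulness ~ women + review_rating",
    data=df,
    family=sm.families.Binomial(),
    var_weights=df["total_votes"],  # weight each proportion by its vote count
).fit()
print(model.summary())
```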

Table 7 Results of the relationship between gender and review helpfulness.

The results in Table 7 suggest the presence of gender-based differences in the evaluation of product reviews. The coefficient of Women in column 1 is significant and negative (β = −0.3483, p < 0.001), implying that reviews contributed by women were rated lower than those contributed by men in the absence of controls. The coefficient (β = −0.3414, p < 0.001) remained directionally consistent and significant with the inclusion of controls, as seen in column 2. Splitting the data along gender-typed product categories, the result in column 3 (βwomen = −0.4127, p < 0.01) suggests that reviews contributed by women were rated as less helpful than those contributed by men in the product category associated with men, while the result in column 4 (βwomen = −0.1586, p > 0.05) suggests no significant difference between reviews contributed by women and those contributed by men in the product category associated with women. Again, these results support our findings in study 1 and provide some external validity, particularly for the evaluation of reviews of gender-typed products.

Discussion and conclusion

How people's opinions influence us and what we do with them is often contingent upon our receptivity to those opinions (Wilson and Peterson, 1989). Across three studies using different research methods, we find evidence of gender-based differences in how individuals evaluate and use product opinions provided by men and women. In study 1, we experimentally tested whether gender-based differences exist in how individuals evaluate reviews contributed by women relative to those contributed by men, and whether the evaluations are likely to favor a specific gender if the opinions are for products typically associated with that gender. Participants in the study rated reviews from men as more helpful than those from women and indicated that reviews from men were more likely to influence their decision to purchase the product. This result reasserts the notion that people are less likely to believe statements made by women than by men (Miller, 2018; Solnit, 2008), even when those statements are opinions about products the women may have purchased, used, or experienced. To the extent that online reviews written by women reflect their experiences with a product, this result aligns with research highlighting that women's experiences are discounted or even considered exaggerated relative to men's (Hoffmann and Tarzian, 2001; Zhang et al., 2021). Interestingly, the gender-based differences in the evaluations of reviews were driven more by the female participants than by the male participants; one might expect that, since the observed difference disfavors women, it would be driven more by the male participants. Further, while participants rated men's reviews as more helpful and more likely to influence their purchase decision than women's reviews for products traditionally associated with men, there was no difference between men's and women's reviews in helpfulness or likelihood to influence the purchase decision for products traditionally associated with women. This suggests that people do not ascribe greater weight to women's opinions about their experiences, even for products traditionally associated with women; women's opinions carry just as much value as men's in that domain, but no more.

The analyses of archival data in studies 2 and 3 provide external validity to the results obtained in study 1. Study 2 investigates gender-based differences in the evaluation of reviews on a services review platform. Although the study examines reviews in the nightlife category (which may be considered gender neutral), the result is consistent with study 1. Study 3 investigates the same questions as study 1, albeit on an e-commerce website, and reinforces the gender-based differences observed in study 1. Apart from confirming that reviews written by women are considered less valuable than those written by men, study 3 also affirms that reviews written by men are considered as valuable as those written by women, even for products traditionally associated with women. Taken together, the results suggest that gender still predicts how people evaluate others' opinions, such that women's opinions are valued less than men's in buying decisions.

What might explain the gender bias in the evaluation and use of opinions? First, the stereotyping of men as more analytical and brilliant and of women as less competent might still persist (Heilman and Eagly, 2008; Moss-Racusin et al., 2012), such that people exhibit gender-stereotypic responses toward presumed competency levels. This may be a legacy of the long history of questioning women's reasoning capacity (Parks, 2000), including viewing their opinions as "unreflective or immature" (Miles and August, 1990). Second, a preexisting subtle bias against women may be at work; prior studies have highlighted its undermining effects on evaluations of women (Goldberg, 1968; Moss-Racusin et al., 2012; Régner et al., 2019). For instance, Abdul-Ghani et al. (2022) document that participants in their study viewed women's opinions as highly emotional relative to men's and therefore discounted them. Third, people may place less value on women's opinions because women are underrepresented and less visible in expert opinion panels and the media (Beaulieu et al., 2016); women's opinions may, therefore, be more easily discounted.

The evidence from the three studies documents a form of implicit gender bias in the evaluation of opinions in buying decisions, with potentially significant implications. Women may continue to gain less visibility, eminence, and benefit on shopping platforms and in environments where the value and quality of one's opinion matter; for instance, many review platforms, merchants, businesses, and organizations reward review contributors based on the value consumers and users assign to their reviews. Given the above, we should continue to pursue endeavors and interventions that help break gender stereotypes and reduce these forms of bias.