Perceived gender and political persuasion: a social media field experiment during the 2020 US Democratic presidential primary election

Women have less influence than men in a variety of settings. Does this result from stereotypes that depict women as less capable, or biased interpretations of gender differences in behavior? We present a field experiment that—unbeknownst to the participants—randomized the gender of avatars assigned to Democrats using a social media platform we created to facilitate discussion about the 2020 Primary Election. We find that misrepresenting a man as a woman undermines his influence, but misrepresenting a woman as a man does not increase hers. We demonstrate that men’s higher resistance to being influenced—and gendered word use patterns—both contribute to this outcome. These findings challenge prevailing wisdom that women simply need to behave more like men to overcome gender discrimination and suggest that narrowing the gap will require simultaneous attention to the behavior of people who identify as women and as men.


Replication Materials
This document describes all materials and methods for the article "Perceived Gender and Political Persuasion: A Social Media Field Experiment during the 2020 Democratic National Primary," by Combs et al. All data, code, and the markdown file used to create this report will be available at this link on the Dataverse upon publication of the manuscript.

Pre-registration
The research design and hypotheses described in the main text were pre-registered using the Open Science Framework. Our pre-registration statement is available at this link.

Deviations from the pre-registration statement
Our study was a novel and complex endeavor that involved many moving parts and was conducted in a political context that was continuously shifting in unpredictable ways. As such, it became clear to us that some changes to our pre-registered methods would be necessary. We summarize these changes here.

Differences in app design
The preregistration contains app design details and screenshots that differ in some places from those of the final app. The most important changes were tweaks to the design of our gendered avatars. The avatars deployed in the study (shown in Section 7) were colored in brighter shades of blue and pink than the avatars included in the preregistration, and the silhouettes were slightly larger. This change was made in response to the results of a pre-test showing that versions with deeper colors prompted a stronger treatment response. In addition, we made small, non-substantive changes to the language in the settings window and discussion prompt questions after the submission of the preregistration to make them more clear, consistent, and concise in response to feedback obtained in final rounds of pre-launch app testing. Further information on the final app design is presented in Section 7.

Conversation completion standard
In our pre-registration statement, we wrote that we would tell respondents they must exchange 14 messages with their discussion partner to receive financial compensation, but would count their conversation as complete if they reached 10 exchanges. In addition, our statement does not mention that we would require everyone to have completed their conversation by Super Tuesday, because our original plan was to launch the study well ahead of Super Tuesday. However, various complications related to our survey firm and last-minute bugs in our mobile platform pushed our launch date back, making this new deadline necessary. Early on Super Tuesday, in an effort to get as many complete post-surveys as possible, we pushed all pairs who had exchanged some messages but not yet completed their conversation to our post-conversation survey. We counted their conversations as complete and included them in the analysis if they completed the survey before the first polls closed at 6pm. This means that a few conversations in our final sample for conversation-level analysis (six conversations in total) exchanged fewer than 10 messages. We believe this makes our results somewhat conservative, however, because these people had less time to influence or be influenced. All treatment effects that are statistically significant at least at the 5% level in our main specification and in models with demographic controls remain significant when excluding the 6 conversations that had fewer than 10 exchanges or the 11 conversations that had fewer than 14 exchanges.

Outcome variables
In the preregistration we discuss two influence-related outcome variables: the difference in the partner's pre- and post-conversation candidate rank and the difference in the partner's thermometer ratings. In the analyses presented here, we include these two measures for the respondent's pre-conversation top-choice candidate as well as a third outcome variable: the partner's self-reported opinion change. We include this third measure in order to capture another more subjective, less specific dimension of opinion change. In addition, to ease interpretation and consider the three measures simultaneously, we combine these variables to create an index measure of aggregate influence.

Level of analysis
Our pre-registration statement stated that we would use individuals as the level of analysis. During our research, we found it was helpful to pair this individual-level perspective with a conversation-level analysis, presented in Figures 3 and 4. This allows us to draw conclusions not only about changes in influence for individuals, but also about changes in the trajectories of conversations based on their labeling conditions.

Modeling strategy and control variables
In our pre-registration statement, we stated that we would analyze our data using linear regression and control for a number of sociodemographic variables. In our main-text analyses, we do not use these control variables, and instead directly compare effects between conditions and participant groups. As our participants were randomly assigned to conditions and our analysis of balance between conditions (Section 5) shows that our randomization succeeded in creating groups that are sociodemographically similar, including control variables was unnecessary and risked overfitting given our relatively small sample. We also run models with these control variables in the analysis reported in Section 11 and find that the results do not differ in substance or significance from the results of models without control variables reported in the main text.

Language
Our hypotheses about gendered language differences in the pre-registration posit differences in thoughtful/cerebral and emotional language between men and women and between experimental conditions. The text analysis described here does not examine rational and emotional language specifically, but instead investigates gender differences in the use of gendered political words (in the main text) and gender differences in receptiveness, features associated with politeness, and "warm and friendly" versus "tough and firm" communication styles (supplemental materials). This is because we had not yet encountered the political dictionary created by Roberts and Utych, which is used in the main text. When we learned about it, it became clear that the cleanest way to test for gendered differences in language in political conversation would be to use a method specifically intended to measure the gendered content of political speech. Other text analyses presented in Section 10 are likewise performed using tools we felt were well validated and got at the spirit of the preregistered hypotheses, namely gendered differences in conversation style.

Recruitment Process
Between February 28th and March 6th, 2020, we hired the survey firm YouGov to recruit members of its online panel who are citizens of the United States and describe themselves as Democrats (or lean Democrat if they identify as independents) to download our app. Respondents were pre-screened to ensure they owned an iOS or Android mobile device or tablet. The sample was stratified such that half of the respondents identified as male when they entered the YouGov panel and half identified as female (see Section 6, below, for more detail on the operationalization of gender). After passing a screener questionnaire to ensure they were eligible to participate in the study (using the criteria above), respondents were shown an informed consent script, in which they were offered the equivalent of $10 in YouGov's points system to participate in the study. The informed consent script explained that respondents were being recruited to test a new social media platform for Democrats called UniteDem, which would require them to engage with (or reply to) another user 14 times over five days. A reply is one "turn" in a conversation, i.e., a shift from a statement by one discussion partner to a statement by the other. YouGov contacted approximately three thousand people, of whom about 1,800 agreed to download the app. In total, 938 people completed the pre-survey and were assigned to a study arm. In this paper, we analyze the 596 people who completed the post-survey. See Section 5 for sample characteristics.

Study Recruitment Language
The text we used to recruit respondents to join the study was as follows: You have been selected for an opportunity to receive an additional 10,000 points for testing a new app called UniteDem. UniteDem is a new chat platform where Democrats can anonymously discuss which candidate for the 2020 presidential election would best represent their party with another member of their party. The app is being used by a team of researchers to study how people make decisions about which candidates to support.
To qualify for the 10,000 point bonus, you must a) install the UniteDem app on an Android or Apple smartphone (or tablet), and b) engage in a sincere conversation with another UniteDem user in which you make at least 14 replies, and answer questions about your views as well as your experience on the new platform. The app will help you monitor how many replies you have made and inform you once you have reached the required threshold for payment. Although the time it takes to reply to these conversations may vary, we estimate it will take less than one hour spread across several days. You will have one week to complete the conversation.
The UniteDem app employs strict privacy and data security standards. The app collects no personally identifiable information about you. Your responses will be linked only by a numeric identifier assigned to you by YouGov that includes demographic data about you and your political preferences. The data that you generate as part of this research may be shared with future researchers to ensure that any scientific findings can be replicated, and to minimize the need for additional studies.
Your participation in this activity is entirely voluntary, and you have the right to decline to participate or stop participating at any point. If you choose to do this, you may request that any information obtained from you be removed by emailing the research sponsor at support@unitedem.org.
We encourage that you print or save this form for your own records. I am at least 18 years of age, and desire of my own free will to participate in this study.

• Yes
• No
If respondents consented to participate, they were shown the following text: Thank you for agreeing to install the UniteDem App. Click this link on your mobile device to be taken to the app store in order to download the app. Once installed, use invite code [code specific to each respondent] to sign in. Remember that users of UniteDem are asked not to disclose any information about themselves. You must complete 14 replies with your discussion partner in order to receive your YouGov points.

Characteristics of Study Population
Table S1 describes the characteristics of the 1835 Democrats over age 18 who agreed to participate in our study compared to the 1201 who did not. As the table shows, those who consented were significantly younger and more interested in the news. Consenters and non-consenters are also differently distributed throughout the country; those who consented are more likely to reside in the South and less likely to reside in the Northeast. Consenting participants were also more likely to be college graduates, have higher incomes, and identify more strongly as liberals, and less likely to be white, though these differences are not statistically significant at the p < .05 level.
Of those who consented to participate in the study, 938 completed the pre-survey in its entirety. Table S2 compares those who did and did not complete that survey. Those who completed it were significantly younger and whiter, had more interest in the news, and identified more strongly as liberal than the 897 who did not. As shown in Table S3, 746 participants sent at least one message to their partner. Note that this is the first point at which treatment (gender mislabeling and same-gender partners) could affect the results, so we add that variable to the tables here. Treated means the participant is in a mislabeled conversation, i.e., either the participant's or their partner's gender is mislabeled. We also include the variable "No. of Candidates Recognized," which is the number of Democratic Primary candidates that the participant recognized in the presurvey. Of those who consented to use the app, completed the pre-survey, and messaged their partner, only those who completed the post-survey before 5pm on March 3rd (the time at which election results were first reported on Super Tuesday) are considered in our final sample. Of the 746 respondents who messaged their partners at least once, 596 are included in our analysis. These individuals tend to recognize a greater number of primary candidates than respondents who did not complete the post-survey. The significant difference by treatment is further analyzed in Section 5.3. Additionally, we show that pre-conversation differences in candidate preferences do not exist for our analysis sample. These data are collected only within the UniteDem platform and are high-dimensional, so we report these statistics only for the final analysis sample. The three tables below show that across all treatment conditions, both overall (Table S6) and separately by gender (Tables S7 and S8), we do not observe significant differences in either the top-ranked candidate or candidate feeling thermometers in the pre-conversation survey. Candidates are ordered by the number of supporters in our sample.
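For readers replicating these comparisons, the following is a minimal sketch of the testing procedure described in the table notes (t-tests for continuous attributes, chi-square tests for categorical attributes); the data frame and column names below are placeholders, not the actual variable names.

```r
# Illustrative sketch of the comparison methodology used in Tables S1-S5.
# `panel` and its columns are stand-ins for the real recruitment data.
set.seed(1)
panel <- data.frame(
  consented = rbinom(3036, 1, 0.6),
  age       = rnorm(3036, 45, 15),
  region    = sample(c("Midwest", "Northeast", "South", "West"), 3036, replace = TRUE)
)

# Continuous attributes: two-sample t-test comparing consenters and non-consenters
t.test(age ~ consented, data = panel)

# Categorical attributes: chi-square test on the cross-tabulation
chisq.test(table(panel$region, panel$consented))
```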

Candidate Preferences by Gender
Another key aspect of our study is participant gender, so we also report imbalances by gender in candidate preferences. Table S9 reports the same candidate preference measures split by men and women. We observe significant differences in top-ranked candidates by gender (p < 0.01) due to larger support for Senator Warren among women: 16% of women ranked her as their top choice while only 8% of men did the same. That imbalance is reflected in more support for Senator Klobuchar, Representative Gabbard, and Tom Steyer among men than among women. However, we see only marginal differences in thermometer ratings for Senator Warren: women rated her 72/100 on average and men 69/100. The only candidate with a significant rating difference is Representative Gabbard, whom men rated 33/100 and women 25/100 on average (p < 0.01). Note that these differences are not threats to causal inference because of the randomization.

Note:
N Incomplete is the number of conversations that began but in which at least one participant did not complete the postsurvey on time. N Complete is the number of conversations that began in which both participants completed the postsurvey on time; all of these complete conversations are used in our main analysis. Completion Rate is the proportion of conversations in each condition that completed the postsurvey.

Sample Attrition
A particular concern here is differential attrition. If mislabeling causes participants to quit using the app, then we will be unable to measure influence in their conversation, which requires completion of the postsurvey. We focus our attrition analysis on conversation-level attrition, i.e., of all conversations that began on the UniteDem platform, which ones can we use in our analysis. Individual-level attrition is harder to define and less relevant for our main results, Figures 3 and 4 in the main paper, which focus on conversation-level outcomes that require both conversation partners to complete the postsurvey. To use an individual in the analysis, we need that individual's partner to complete their postsurvey, because a person's influence is measured by how their partner changed.
Table S10 below shows the number of complete and incomplete conversations along with the completion rate by treatment condition. Table S11 shows logistic regression coefficients that check for significant differences in attrition by treatment, either without adjusting for covariates (first column) or with adjustment for all covariates measured by YouGov external to the in-app surveys (second column). The outcome is whether the conversation was included in our analysis (both participants completed the in-app postsurvey before Super Tuesday results were released). We see that conversations where the woman was mislabeled attrited at statistically significantly higher rates than correctly labeled conversations. The odds of a conversation making it to the analysis in control are about 2:1 (67% of control conversations complete the postsurvey), while the odds of a conversation making it to the analysis when the woman is mislabeled are about 1:1 (52% of conversations where the woman is mislabeled complete the postsurvey). The significant difference holds when adjusting for demographic covariates. Conversations where men are mislabeled, as well as both same-gender conditions, complete at slightly lower rates than control, but the differences are not statistically significant.
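A minimal sketch of the two attrition models follows; the data frame and column names are illustrative stand-ins for the real conversation-level data.

```r
# Sketch of the attrition models in Table S11; `conv` and its columns are
# placeholders for the real data.
set.seed(1)
n <- 500
conv <- data.frame(
  complete         = rbinom(n, 1, 0.6),   # both partners finished the postsurvey in time
  man_mislabeled   = rbinom(n, 1, 0.25),
  woman_mislabeled = rbinom(n, 1, 0.25),
  age_man          = rnorm(n, 45, 15)     # example YouGov covariate
)

# Column 1: treatment indicators only
m_unadj <- glm(complete ~ man_mislabeled + woman_mislabeled,
               family = binomial, data = conv)

# Column 2: adjusting for covariates measured by YouGov outside the in-app surveys
m_adj <- glm(complete ~ man_mislabeled + woman_mislabeled + age_man,
             family = binomial, data = conv)
summary(m_adj)

# Odds referenced in the text: a 67% completion rate corresponds to odds of about 2:1
0.67 / (1 - 0.67)
```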
One complication of this analysis is conversations involving participants who were rematched. If someone left a conversation because their partner was unresponsive, then that conversation could not enter our analysis, and that outcome, whether the conversation attrited from our sample, is dependent across conversations involving the same participants. To that end, Table S12 runs the same regressions as Table S11 but excludes all conversations where either partner was rematched. In this population, the attrition rate among correctly labeled conversations and conversations with mislabeled men is almost identical. Conversations with mislabeled women differentially attrit at a rate similar to that in the overall sample, and the difference is statistically significant only when adjusting for covariates. Note that in Section 11, we show that our results are unchanged when excluding conversations where either partner was rematched.
Differential attrition is concerning; however, we see no differential attrition between control and conversations with mislabeled men, the group for which we see statistically significant treatment effects. The results that we highlight in the main paper are therefore not affected by this issue. We do not observe significant treatment effects for mislabeling women. However, a potential explanation for that lack of effect is this attrition issue. If the conversations with mislabeled women that attrited had (unobservable to researchers) outcomes similar to other treated conversations, then our results would be stronger and potentially significant had we observed them. Of course, if the outcomes were more similar to control, then the results would stay the same (insignificant differences). The latter possibility is less likely, however, because of the differential attrition: if those who attrited in the treatment group were the same as control, there should be no differential attrition.
We also note that conversations where the man is from the South attrit at higher rates than conversations where the man is from the Midwest (the baseline level for Region). However, because region is balanced across treatment conditions in expectation due to randomization, and is confirmed as balanced in our sample, this does not affect the estimated treatment effects.

Operationalization of Gender
Self-identified gender is collected by YouGov as part of their standard demographic profile questionnaires and was not re-collected specifically for this study. They ask "Are you. . .?" with two gender options, "Male" and "Female." We therefore do not have any indication of the prevalence of transgender or gender nonbinary respondents in the subject pool.

Respondent Experience on Study's UniteDem Platform
All respondents who consented to participate in our research were sent detailed instructions about how to install the app on an iOS or Android device. Each respondent was sent an "invite code" via the survey firm and instructed to use it when they logged into the platform for the first time. The first onboarding screen on the UniteDem platform allowed people to enter their invite code and gave them information about how to contact us to address any technical issues they might have experienced during the log-in process. The second screen included a message which explained that the platform was a forum where users could chat anonymously about which candidate should represent their party in the 2020 presidential election. Users were instructed not to disclose any personal information about themselves. On the third onboarding screen, respondents were reminded that they must engage in at least fourteen "exchanges" with their discussion partner to receive financial compensation. However, we considered people as treated if they completed 10 exchanges (see below). On the fourth screen, users were shown the names of all eight candidates currently vying for the presidential nomination in the Democratic Party and asked "Which of these democratic candidates have you heard of? (check all that apply)." On the fifth screen, users saw the names of the candidates they checked off in the previous screen and the following text: "Please rank which candidates you think have the best chance of defeating President Trump in the election this fall by dragging the candidates with the best chance to the top and the lowest chance to the bottom." On the sixth screen, respondents rated each candidate independent of their electability via the following question: "Regardless of whether you think they can beat President Trump, please tell us how you feel about the candidates below using a feeling thermometer. Ratings above 50 indicate favorable/warm feelings; ratings below 50 indicate unfavorable/cold feelings." After completing the survey portion of the app experience, users were matched with another respondent in the study. If the system was unable to match the user, the respondent was shown a screen that said "Searching for match" followed by a screen titled "No match found yet!"
with the text, "We haven't found a chat partner for you yet.But don't worry!We will keep searching and notify you when we do."When a match became available, the user received a push notification on their phone and the next time they opened UniteDem, they were taken to a screen that showed the avatar of the person with whom they have been matched.The avatars constitute the study's main treatment.In the control condition and same sex conditions, both participants viewed avatars associated with the gender their partner reported when they first j oined t he s tudy, w hile in the mislabeling treatment conditions one of the individuals was shown an avatar that did not match the identified gender of their partner.The avatars appeared to the user as part of a short .GIF video: First, blue male avatars and pink female avatars (see Figure S1) spun around and then the avatar representing their partner increased in size as the other avatars faded away, indicating to the user that their discussion partner is either female or male.Immediately afterwards, we assigned one of the following gender-neutral initials to the avatar: L.C., G.C., T.C., E.C, or B.C.In order to strengthen the treatment, the male or female avatar icon also appears next to each message within the app's main "chat" dialogue, described in the following section.The second user to visit the chat interface received a popup window that read "To begin, we asked [partner initials] to share their stance on this prompt: 'Let's find a c andidate f or t he 2 020 P residential E lection that all Democrats can stand behind!Tell your partner more about who you think is up to the task and why.' Write back to complete your first i nteraction o n U niteDem!" A fter t he u ser s ent a m essage, t hey were also shown a popup window titled "Great start!" with text "In order to receive compensation from YouGov, you'll have to reply to messages from [partner initials] at least 14 times."When the users reached 2 exchanges with each other, they received a third pop up notification t itled " Chat R ating" w here t hey a re a sked t o give a "thumbs up" or a "thumbs down" response to their discussion partner's last message.This popup read "You can tell us whether you like what your partner says throughout the conversation.How would you rate [partner initials]'s last message?" Once users reached the app's chat interface, they had access to a settings page via a gear icon at the top right of the page.The settings page gave them the option to report or block their partner, contact the UniteDem team (staffed by us), see how many exchanges they had completed with their partner, and view chat prompts.The chat prompts window was available via a button titled "Conversation stuck?" which showed the following message: "Think about some questions you could ask [partner initials].For example: What is most important to you when choosing a candidate?Which candidate do you think has the best chance of beating Donald Trump in November?Which candidate do you think has the best ideas about climate change?Health care?Foreign policy?Inequality?"These prompts were meant to serve as conversation starters and were optional for participants to use.
If users had not completed 14 replies with their discussion partner within approximately one week of starting, or by the weekend prior to Super Tuesday (whichever came first), the app sent a notification to the last respondent who had not replied to their conversation partner, as follows: "We noticed you haven't replied to [partner initials] in a while, and the deadline is coming up! If you don't respond soon, we'll have to assign them a new partner. Need ideas to keep the conversation going? Visit the settings menu for suggestions." Meanwhile, the other chat partner received the following message: "We've noticed [partner initials] hasn't responded to you in a while and the deadline is coming up. Don't worry, we sent them a reminder. We will assign you a new partner if they don't respond soon." Prior to the weekend before Super Tuesday, people whose partners remained unresponsive were automatically rematched to others whose partners had also become unresponsive. This rematching was done in a way that preserved the experimental conditions: people were still assigned opposite-gender partners in their same experimental condition and retained their same label. In this case, respondents were asked to reply to the new discussion partner's messages the same number of times that would have been required of the member of the pair who had previously completed the most replies. The results we report in the main text of our manuscript hold regardless of whether people who were re-matched (56 respondents) are dropped from the sample (see Section 11). Reminders were still sent to people who became unresponsive or whose partners became unresponsive during the last few days of the study period, but the rematching language was removed because these people would not have had sufficient time to complete a conversation with a new partner before the cutoff date of Super Tuesday.
Once respondents completed 14 replies to each other, or on the morning of Super Tuesday if they had exchanged at least one message by that time, they were redirected to an exit survey within the app. In addition to asking respondents to re-rank their preferred candidates and complete a thermometer rating of each candidate, we asked them how much they agree or disagree with the following statements on a seven-point scale: 1) "[partner's initials] influenced my views on the candidates for the 2020 Democratic primary election"; 2) "I influenced [partner's initials]'s views on the candidates for the 2020 Democratic primary election"; 3) "[partner's initials] was open to hearing my views on who is the best candidate for the 2020 Democratic primary election"; 4) "[partner's initials] was well-informed and knowledgeable about the candidates for the 2020 Democratic primary election." Next, respondents were asked, "Which of these options do you think best describes your partner's gender? [Man, Woman, Not Sure]." If respondents answered "Not sure," they were asked the following question: "Please make your best guess about your partner's gender. [Man, Woman]"
Next, we asked each app user to complete a series of questions designed to measure their perceived and actual knowledge of politics. To measure perceived political knowledge, we asked: "How much do you think you know about politics compared to most Americans?" with the following response options: "Much more than most Americans," "A little more than most Americans," "A little less than most Americans," and "Much less than most Americans." We also asked a battery of six factual questions to gauge actual political knowledge. The political knowledge variable used in our analysis is the number of these questions that were answered correctly, treating skipped questions as incorrect.
Then, respondents were asked about their overall interest in politics with the following question: "Some people seem to follow what's going on in government and public affairs most of the time, whether there's an election going on or not. Others aren't that interested. Would you say that you follow what's going on in government and public affairs. . . [Most of the time, some of the time, only now and then, hardly at all]." Finally, we asked two additional questions designed to capture political engagement: 1) "Do you plan to donate money to any of the candidates? If so, please select the ones you plan to support"; 2) "Do you plan to volunteer for any of the candidates? If so, please select the ones you plan to volunteer for."
For descriptions on the conversations themselves, see Section 10.1.

Additional Text Analysis

Conversation Descriptions
In this section we provide some high-level summary statistics of the conversations themselves. An "exchange" occurs when one participant sends a message and their partner then replies, with no word count requirement. Consequently, we treat a "message" as the concatenation of all text sent by a single respondent without a partner reply, essentially combining all consecutive text sent by a single person. The average message was 43 words long (IQR of 17 to 55 words), and the median time between messages was about 1.5 minutes (IQR of 1 to 15 minutes). The mean number of messages per conversation was 31 and the median was 30. Note that for each participant to have 14 exchanges, at least 28 messages must be sent in the conversation.
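A minimal sketch of how consecutive sends collapse into "messages" under this definition follows; the raw log format (sender, text) is an assumption.

```r
# Collapse consecutive sends by the same participant into one "message".
library(dplyr)

raw <- data.frame(
  sender = c("A", "A", "B", "A", "B", "B"),
  text   = c("Hi!", "I like Warren.", "Why her?", "Electability.", "Hm.", "Tell me more.")
)

msgs <- raw %>%
  mutate(block = cumsum(sender != lag(sender, default = first(sender)))) %>%
  group_by(block, sender) %>%
  summarise(message = paste(text, collapse = " "), .groups = "drop")

word_counts <- lengths(strsplit(msgs$message, "\\s+"))  # message lengths in words
nrow(msgs)  # 14 exchanges per participant implies at least 28 messages
```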
As a heuristic measure of conversation substance, we include a word cloud created from the word counts across all conversations. Figure S2 shows the 200 most common words, scaled and colored by their overall frequency. Trump was the most-used word, reflecting general concern with the upcoming general election across many conversations. Senator Sanders and now-President Biden were also frequently mentioned.
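A minimal sketch of the word-cloud construction follows, assuming the tm and wordcloud packages; the input text vector is a placeholder.

```r
# Sketch of the Figure S2 word cloud; tm provides the SMART stopword lexicon.
library(tm)
library(wordcloud)

text <- c("we need to beat trump in november", "warren or sanders could beat trump")
tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
tokens <- tokens[tokens != "" & !tokens %in% stopwords("SMART")]

freq <- sort(table(tokens), decreasing = TRUE)
freq <- head(freq, 200)  # keep the 200 most common words
wordcloud(names(freq), as.numeric(freq), random.order = FALSE)
```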

Gender Differences in Language
This section details the text analyses we conducted in order to understand gender differences in language. First, we describe the technique developed by Roberts and Utych (2020) that we leverage in the main text of our manuscript. This technique employs a dictionary of 700 words with corresponding ratings that measure how "feminine" or "masculine" a word sounds. The ratings vary from 1.36 (for the word woman) to 6.40 (for the word man). It was developed with the explicit purpose of understanding the nexus between gender, language, and politics. As part of their findings, Roberts and Utych show that dominant and negatively valenced words, as measured by the Affective Norms for English Words (ANEW) database, are more likely to be rated as masculine. Overall, the authors suggest that individual gendered words have a subtle influence over candidate support. The dictionary was created using Amazon Mechanical Turk. As Roberts and Utych describe: We recruited participants from Amazon's Mechanical Turk (MTurk) in June 2018 to rate words on their perceptions of masculinity or femininity. A total of 175 participants were recruited for the study, with each participant rating a randomly assigned hundred words, of the seven hundred total that were rated. Due to random assignment, words were rated by a minimum of fifteen raters, and a maximum of forty-four, with a mean of 25 (SD = 5.19). Participants were paid $1 for completing this task, which took an average of six minutes to complete. Roberts and Utych (2020, 43) This dictionary-based approach is not without limitations. Most importantly, our corpus contains 9,156 unique words (counting nouns, verbs, and adjectives), which means that using this dictionary leads to a significant loss of information. The words contained in this dictionary amount, on average, to 4% of the total vocabulary used within each chat. Furthermore, eleven of the chats did not use any of these words (see Figure S3).
Another notable feature of this dictionary is that, on average, male raters tend to rate words as sounding more "masculine," whereas women rate words as sounding more "feminine." Figure S4 reproduces the result in the main article but separates the averages by the gender of the rater. This does not change the results, but it does highlight the fact that the measurement we relied upon is noisy.
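A minimal sketch of the per-chat scoring follows. Only the endpoint ratings (woman = 1.36, man = 6.40) are taken from the published scale; the other entries and the chat text are invented for illustration.

```r
# Score one chat with a (partial, illustrative) Roberts and Utych dictionary.
ru_dict <- data.frame(
  word   = c("woman", "man", "fight", "caring"),
  rating = c(1.36, 6.40, 5.0, 2.5)   # last two ratings are placeholders
)

chat    <- "we need a man or woman who will fight for caring policies"
tokens  <- unlist(strsplit(tolower(chat), "\\s+"))
matched <- ru_dict$rating[match(tokens, ru_dict$word)]

coverage   <- mean(!is.na(matched))        # dictionary covers ~4% of chat vocabulary on average
chat_score <- mean(matched, na.rm = TRUE)  # higher values = more male-sounding language
```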
To complement the results from Roberts and Utych's dictionary, we employed the "conversational receptiveness" classifier developed by Yeomans et al. (2020) to measure the extent to which participants use language in a way that communicates "one's willingness to thoughtfully engage with opposing views." Their method focuses on the stylistic properties of text, rather than its substance. Messages judged to be receptive often include: positive statements (e.g., "that's great"); explicit acknowledgement (e.g., "I understand," "I see your point"); points of agreement (e.g., "I agree," "you're right"); and hedging or indicators of uncertainty used to soften claims. Finally, they note that people whose language includes markers of receptiveness are rated as more persuasive by others, a metric similar to the outcome of our experiment. The classifier predicts "conversational receptiveness" on the basis of a set of features related to polite speech.

Perceived Gender of Language by Treatment Condition

Figure S4: Patterns of gendered language usage in chats on the social media platform by treatment condition. The x axis shows the average gender connotation of words, based on the Roberts and Utych dictionary; higher scores indicate that the overall pattern of word usage is more male. Across all treatment conditions, men used more male-sounding language and women used more female-sounding language. The results remain substantively the same when accounting for the gender of the rater in the Roberts and Utych dictionary.
Figure S5 suggests that women tend to be more receptive when talking to correctly labeled men, and less receptive when they are mislabeled (i.e., when their conversation partner thinks they are talking to another man). In other words, women tend to be less receptive when their male counterparts talk to them as if they were "one of the guys." Men, on the other hand, tend to be less receptive when they are engaged in conversation with correctly labeled women, though the differences are minimal.

Figure S5: The receptiveness scores are presented as the probabilities that each individual user is "receptive to opposing views" according to the classifier developed in Yeomans et al. (2020). The lines represent 90% confidence intervals. Both the classifier and the code that turns text data into politeness features are available as part of the "politeness" package in R (Yeomans, Kantor, and Tingley 2018).
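For readers who wish to reproduce these scores, a minimal sketch follows. It assumes a recent version of the politeness package that exports a receptiveness() helper; if your version does not, an equivalent projection can be rebuilt from politeness() features with politenessProjection() and labeled training data.

```r
# Score messages for conversational receptiveness (Yeomans et al. 2020).
library(politeness)

msgs <- c("I see your point, and I agree Biden may be more electable.",
          "You're just wrong about Sanders.")

rec  <- receptiveness(msgs)                # predicted receptiveness to opposing views (assumed helper)
feat <- politeness(msgs, parser = "none")  # underlying politeness features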
We also examined whether there are gender differences in the likelihood that a message uses a "warm and friendly" communication style as opposed to a "tough and firm" communication style. This classifier uses the same set of stylistic features related to polite speech described above and is trained on a different dataset, collected by Yeomans, Kantor, and Tingley (2018) using Amazon Mechanical Turk: half of the workers were told to use a "warm and friendly" communication style in a message involving an online economic transaction, while the other half were told to use a "tough and firm" communication style. Figure S6 suggests that men tend to be more "warm and friendly" towards mislabeled women and more "tough and firm" towards correctly labeled women. Women also tend to be more "warm and friendly" towards mislabeled men but "tough and firm" towards mislabeled women.
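A minimal sketch of this classifier follows, using the phone_offers training data shipped with the politeness package; the coding of its condition column should be checked against the package documentation.

```r
# Project chat messages onto the "warm and friendly" vs. "tough and firm" axis.
library(politeness)
data("phone_offers")

train_feat <- politeness(phone_offers$message, parser = "none")
chat_feat  <- politeness("Happy to chat! Who do you like for 2020?", parser = "none")

# Fit a penalized model on the labeled training messages and project our chat onto it
proj <- politenessProjection(df_polite_train = train_feat,
                             covar           = phone_offers$condition,
                             df_polite_test  = chat_feat)
proj$test_proj  # projected communication-style score for the chat message
```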
Figure S6: The warm and friendly communication scores are presented as the probabilities that each individual user communicates using a "warm and friendly" style. The lines represent 90% confidence intervals. This classifier uses the previously mentioned politeness features and a training dataset that is also included in the "politeness" package.

Finally, we examine which polite speech features are predictive of whether the participant is a man or a woman. To do this, we used a standard feature selection algorithm to identify the stylistic features that might be predictive of gender. The resulting coefficients are estimated to maximize the predictive accuracy of the model as a whole, which means that we must caution against interpreting them as "true" coefficients, much less as causal in any way.
The most notable aspect of this analysis is that it is very hard to predict a participant's gender based on these features (the in-sample accuracy is 61%), which means the differences are very subtle. Figure S7 compares messages written by all participants. The features most predictive of men are the expression "by the way" and the use of swear words, whereas the features most predictive of women are direct requests ("can you") and expressions of gratitude ("thank you," "I appreciate," etc.).

Figure S7: Stylistic features that are predictive of the gender of participants. Coefficients of a logistic regression with LASSO and cross-validation to predict whether the participant was male, using the glmnet R package (Friedman, Hastie, and Tibshirani 2010).
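A minimal sketch of this feature-selection step follows; the feature matrix and labels are simulated stand-ins for the real politeness features.

```r
# Cross-validated LASSO logistic regression predicting gender from politeness features.
library(glmnet)

set.seed(1)
X <- matrix(rpois(200 * 12, 1), nrow = 200)   # stand-in politeness feature counts
colnames(X) <- paste0("feature_", 1:12)
y <- rbinom(200, 1, 0.5)                      # 1 = man, 0 = woman (illustrative)

cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1 is the LASSO
coef(cvfit, s = "lambda.min")   # nonzero coefficients are the selected features

pred <- predict(cvfit, X, s = "lambda.min", type = "class")
mean(pred == y)                 # in-sample accuracy (about 61% in our data)
```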

Additional Analyses
This section details several additional analyses and robustness checks for the results reported in Figures 3-5 of the main text of our manuscript. We report regression results for the following: the main paper results, models adding demographic control variables, models dropping conversations where either participant was rematched due to a non-responsive partner, and models dropping conversations where participants saw through the gender mislabeling.
The final two are not properly causal because they require conditioning on a post-treatment variable, so the results should be interpreted with caution.
The final column of Figure 5 in the main paper reports results on same-gender conversations. Because of complexities in interpretation, we address these conversations separately in Section 11.4. The other results presented here apply only to cross-gender conversations.
The regression model is specified as follows: Y_i is the outcome for conversation i, MM_i is an indicator variable that is 1 if the man is mislabeled and 0 otherwise, and WM_i is an indicator variable that is 1 if the woman is mislabeled and 0 otherwise. X_i is a vector of control variables for both individuals in conversation i. We estimate the following model with OLS regression:

Y_i = \beta_0 + \beta_1 MM_i + \beta_2 WM_i + \gamma' X_i + \varepsilon_i

When the outcome is individual-level, as in Figure 5 of the main paper, the same equation form is used, but the treatment indicators are for whether the individual is mislabeled (1 for yes, 0 for no) and whether the individual's partner is mislabeled (1 for yes, 0 for no). In the case of cross-gender conversations, models are estimated only on individuals of the same actual gender, so there is no dependence or interference across units in the same regression. For example, women only talk to men, so by analyzing individual-level results among only women, we preserve independence across observations. The final column of Figure 5 in the main paper reports results on same-gender conversations. Due to the difficulties detailed in Section 11.4, those results are reported without demographic controls and with standard errors clustered to reflect conversation-level dependence.
To ease interpretation of baseline values, we exclude the covariates X_i from results reported in the main text. We show that the results do not change in magnitude or significance when including them. Moreover, dropping conversations where someone was rematched also generally increases the size and significance of the treatment effects. Note that when the outcome variable is influence measured on partner rankings or thermometers, the regressions have smaller sample sizes because that influence measure (the partner's pre-post change in feelings about or ranking of someone's presurvey top candidate) cannot be computed if someone did not rank or rate their partner's top candidate in the presurvey. The survey measure of influence is available for all participants.
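A minimal sketch of the conversation-level specification above follows, with simulated stand-in data.

```r
# OLS models corresponding to the main-text and Section 11 specifications.
set.seed(1)
n <- 300
d <- data.frame(
  Y     = rnorm(n),               # influence outcome for conversation i
  MM    = rbinom(n, 1, 0.25),     # man mislabeled
  WM    = rbinom(n, 1, 0.25),     # woman mislabeled
  age_m = rnorm(n, 45, 15)        # one example element of the control vector X_i
)

fit_main <- lm(Y ~ MM + WM, data = d)          # main-text specification (no controls)
fit_ctrl <- lm(Y ~ MM + WM + age_m, data = d)  # Section 11 specification with controls
summary(fit_main)
```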

Influence Gap Results
This section reports results in Tables S13-S16, focusing on the influence gap metric presented in Figure 3 of the main paper. Table S13 recreates the results presented in the main paper. Next, Table S14 adds demographic covariates as controls; variable definitions are given in the note accompanying Table S14. Finally, Tables S15 and S16 report the main results (without demographic controls) for subpopulations of only conversations where neither individual was rematched (Table S15) or conversations where, if someone was mislabeled, their partner was deceived by the mislabeling (Table S16).
The success of the deception was measured by whether the partner of the mislabeled person guessed the gender of their partner's avatar, i.e., guessed their partner's actual gender incorrectly. Our results are the same for each of these subpopulations.
All significant results presented in Figure 3 in the main text of our manuscript (Table S13 here) stay significant under these robustness checks. In conversations where the man is mislabeled, the influence gap is significantly smaller (the woman has more relative influence) than in correctly labeled conversations on the influence index and the partner-reported survey measure. We also see significant effects when women are mislabeled: when subsetting the treated conversations to those where both partners believed the gender of the avatars corresponded to their partner's gender, mislabeling women increases the influence gap (the woman has less relative influence) on select individual measures, but not on the aggregate index. This is again consistent with the results that we report in the main text: a woman's influence is not improved by labeling her as a man.

Thermometer Rating Gap Results
This section applies the same robustness checks to Figure 4 of the main paper.

Individual-Level Influence Results
This section reports results in Tables S18-S21, focusing on the cross-gender-conversation individual-level influence presented in Figure 5 of the main paper. Table S18 presents the results used in Figure 5 of the main text. Note that the effects of partner mislabeling correspond to the opposite-gender panel in Figure 5; e.g., the effect of mislabeling the man on the woman's influence is measured in the regression using only women, comparing women with mislabeled male partners to women with correctly labeled male partners. After adding the same demographic controls as in the robustness check above, the same treatment effect remains significant: mislabeling men significantly affects their partners' influence. Table S20 reproduces the main results after excluding conversations with rematches, and we observe the same significant effects as reported in the main text, along with an additional significant effect that mislabeling men decreases their influence. Table S21 includes only individuals who believed the avatar represented their partner's gender. The significance of the results is the same as those reported in the main text.
We perform analyses inclusive of the same-gender conversations in the following section, Section 11.4, because of additional interpretational and statistical issues highlighted in that section. Due to those issues, we report the individual-level results in regressions both without (this section) and with (next section) those additional observations. Our conclusions are consistent across the analyses.

Same Gender Conversations
This section reports, in Tables S22-S25, the analysis of the same-gender conversations reported in Figure 5. These regressions hold constant the gender of the target of influence, preserving the propensity to be influenced across comparisons. Thus, the conversations between two men are added to the regressions analyzing women's influence on men, and the conversations between two women are added to the regressions analyzing men's influence on women. Contrasts between same-gender and any opposite-gender conversations should not be interpreted as identifying causal effects of gender, because only conversation pairs were randomly assigned, while a participant's actual gender is clearly not randomized. As such, the contrast between women's influence on men and men's influence on men does not identify the causal effect of one's own gender on influence over men. Men and women differ on many observed and unobserved characteristics. Thus, we cannot say that observed differences in influence are causally attributable to gender. We can say that influence on someone was higher or lower because that someone's partner was a man as opposed to a woman, but we cannot attribute that difference to the gender difference. The difference in influence could be caused by some unobserved characteristic that varies with gender. In addition, by including both participants from the conversation in the same regression, the observations are no longer independent. Consequently, we report cluster-robust standard errors with clustering at the conversation level. We observe that influence over men and influence over women does not differ by the gender of the person doing the influencing, regardless of whether demographic controls are added. Note that the results for self mislabeling and partner mislabeling in Table S22 have identical point estimates to the results in Table S18. However, adding the demographic control variables produces different point estimates because the "Self" control variables are not balanced across genders: those control variables have different distributions across same-gender and opposite-gender conversations.
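A minimal sketch of the cluster-robust standard errors follows, using the sandwich and lmtest packages; the data frame and column names are illustrative.

```r
# Cluster-robust standard errors at the conversation level.
library(sandwich)
library(lmtest)

set.seed(1)
d <- data.frame(
  influence   = rnorm(120),
  self_mis    = rbinom(120, 1, 0.3),
  partner_mis = rbinom(120, 1, 0.3),
  conv_id     = rep(1:60, each = 2)  # both partners in a conversation share an id
)

fit <- lm(influence ~ self_mis + partner_mis, data = d)
coeftest(fit, vcov = vcovCL(fit, cluster = ~ conv_id))  # cluster by conversation
```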
The only differences from the main-text results are increases in statistical significance for the influence of mislabeled men on women. After either adjusting for demographic controls (Table S23) or excluding rematches (Table S24), mislabeled men have significantly less influence over correctly labeled women than correctly labeled men do.

Figure S1: Avatars used to cue gender.

Figure S2: Most Frequent Words in Conversations. Stopwords from the SMART lexicon were removed.

Figure S3: Percentage of words from the Roberts & Utych (2020) dictionary that are used within each chat. The words in this dictionary represent (on average) approximately 4% of all words used in each chat.

Table S1: Comparison of Consenters and Non-Consenters. Characteristics of participants who consented and those who did not. Means and standard deviations are reported for continuous attributes; numbers and percentages (in-group) are reported for categorical attributes. P-values are calculated with a t-test for continuous variables and a chi-square test for categorical variables. The same methodology is applied in Tables S2-S5.

Table S2: Comparison of those who completed and did not complete the Pre-Survey.

Table S3: Comparison of those who sent and did not send at least one message.

Table S4: Comparison of those who completed and did not complete the Post-Survey (continued).

Table S8: Pre-Conversation Candidate Preferences for Men.

Table S9: Pre-Conversation Candidate Preferences by Gender.

Table S10: Conversation-Level Completion Rates by Treatment Condition.


Table S13: Treatment Effects on Influence Gap (Main Paper Results).

Table S14: Treatment Effects on Influence Gap (with Demographic Controls). Age is the individual's age in years. White is a dummy variable where 1 indicates the individual is white and 0 otherwise. College Grad. is a dummy variable where 1 indicates the individual graduated from college and 0 otherwise. News Interest is a 5-point scale indicating how much interest the individual has in following the news. Political Ideology is a 5-point scale indicating how liberal or conservative the individual is; positive values indicate a more liberal ideology (this is the same variable as Strength of Liberal Identification in Section 5). Northeast, South, and West are dummy variables indicating the region of the country in which the individual lives (Midwest is the omitted category). Income is income in thousands of dollars. Income Not Reported is an indicator for whether the participant elected not to disclose their income to YouGov. N. Cand. Recognized is the number of 2020 Democratic Primary candidates the individual recognized in the in-app presurvey. Political Knowledge is the number of correctly answered political knowledge questions in the in-app survey described in Section 7. The gender of the relevant participant is indicated in parentheses: M for men and W for women.

Table S16: Treatment Effects on Influence Gap (Believed Avatar Subpopulation). Regression coefficients for the results presented in Figure 3 of the main text, for only conversations where the individuals believed the displayed avatar corresponded to their partner's gender.

Table S17 presents all robustness checks. Our results are unchanged under all alternative specifications except for the Believed Avatar subpopulation, where the effects are only significant at the 10% level.

Table S19: Treatment Effects on Individual Influence (with Demographic Controls). Regression coefficients for the results presented in Figure 5 of the main text with demographic controls added. Parentheticals in demographics indicate which participant they correspond to. *p < 0.05; **p < 0.01; ***p < 0.001.

Table S21: Treatment Effects on Individual Influence (Believed Avatar Subpopulation). Regression coefficients for the results presented in Figure 5 of the main text, for only conversations where the individuals believed the displayed avatar corresponded to their partner's gender.

Table S24: Treatment Effects on Individual Influence (No Rematching Subpopulation). Regression coefficients for the results presented in Figure 5 of the main text, excluding all conversations where either participant was rematched. Cluster-robust standard errors are reported with clustering at the conversation level to account for dependence between participants who conversed with each other.

Table S25: Treatment Effects on Individual Influence (Believed Avatar Subpopulation). Regression coefficients for the results presented in Figure 5 of the main text, for only conversations where the individuals believed the displayed avatar corresponded to their partner's gender. Cluster-robust standard errors are reported with clustering at the conversation level to account for dependence between participants who conversed with each other.