First-mover advantage explains gender disparities in physics citations

Mounting evidence suggests that publications and citations of scholars in the STEM fields (Science, Technology, Engineering and Mathematics) suffer from gender biases. In this paper, we study the physics community, a core STEM field in which women are still largely underrepresented and where these gender disparities persist. To reveal such inequalities, we compare the citations received by papers led by men and women that cover the same topics in a comparable way. To do that, we devise a robust statistical measure of similarity between publications that enables us to detect pairs of similar papers. Our findings indicate that although papers written by women tend to have lower visibility in the citation network, pairs of similar papers written by men and women receive comparable attention when corrected for the time of publication. These analyses suggest that gender disparity is closely related to the first-mover and cumulative advantage that men have in physics, and is not an intentional act of discrimination towards women.


Introduction
Mounting evidence suggests gender bias in publications and citations of scholars in STEM 1,2 . Such biases can result in situations where women (or other under-represented minorities) may feel invisible and ignored in male-dominated environments. The feeling of not being part of the community can result in a higher dropout rate among women, a phenomenon known as leaky pipeline 3 . Leakage in the academic pipeline consequently affects the academic community for generations to come due to a lack of diversity, inclusion, innovation and role models. Thus, it is of utmost societal importance to accurately identify those biases and devise bottom-up approaches to tackle them.
Gender inequality in academia manifests itself in the production of science and performance outcomes. Some inequalities are inevitable: parenthood, career breaks, limited access to role models and resources can create situations in which women and other minorities show less productivity and performance compared with their white male peers. Frequently, these unavoidable inequalities are exacerbated through formal and informal social relationships, which in turn affect the citation network structure and reinforce existing inequalities.
Academic productivity is often associated with number of publications throughout a researcher's career. Previous studies have found that women publish fewer peer-reviewed articles than men 4,5 , while a more recent study found that the disparity in the productivity of men and women disappears if we compare the productivity with regard to the scholar's career length 6,7 . Women display higher publication rates later in their academic careers, but take up fewer leadership roles 5,8 . Mueller et al. suggest that publication productivity may be a factor that hinders women from advancing within surgery 9 , while Reed and colleagues point out that mid-career assessment of productivity may not be an appropriate measure of leadership skills 5 .
Beyond disparities in publication and productivity, analysing citation patterns can help to identify whether gender differences exist in the way scholars award and recognize each others' works. In other words, while productivity is associated with individual or collaborative efforts, citation is an indication of how these efforts are perceived by the community 10 . In this sense, one can argue that while the former operates among a small number of collaborators, the latter is related to the social processes that govern the community of scholars at large.
Previous studies have shown that patterns of citation can be different for men and women. This could be explained by intentional decision, quality difference, or paradigmatic research topics [11][12][13] . It has been argued that in the most productive countries, all articles with women in key author positions receive fewer citations than those with men in the same positions 14,15 . Moreover, some research concluded that the differences in citation rates between men and women increase with the number of authors per article 16 . This indicates that women are not only relatively less represented as high-impact key authors, but also that they attract significantly fewer citations for those key positions compared to men. One plausible assumption is that the lack of women in leadership positions causes this accentuated female under-representation (structural reasons) since the distribution of key authorships follows, by convention, a hierarchical order. In a recent paper, Dworkin and colleagues present a case study of citation patterns in top neuroscience journals, finding that papers for which first and last authors are men are over-represented in reference lists, and that the discrepancy is most prominent in the citation behaviours of men and is getting worse over time 2 .
A major methodological obstacle is that simply comparing number of publications and citations of men and women is misleading. Men and women have different rates of entry in the scientific community for historical reasons, and when combined with other non-academic responsibilities, they may not show a similar behaviour at the aggregate level. Indeed, recent findings show that when differences in career length are controlled for, male and female scientists have similar rates of publication and citation on average 7 . However, beyond these insights on a population level, do men and women really receive different recognition for a similar work published around the same time? To truly examine the gender differences in citation, one should compare pairs of papers that cover the same topics in a comparable way. Relying on analysing only the average performance may hide variations that exist in data, and drive the community to inaccurate conclusions or inappropriate policies.
In this paper, we focus on analysing publication and citation patterns in the physics community as one of the core STEM areas where women are exceedingly under-represented, often facing belittling remarks and harassment 17,18 . More importantly, we examine gender differences not only at the population level, but also at the microscale by comparing pairs of statistically validated similar papers.

Results
We start by describing the dataset we have analysed and briefly explaining the methodology we have used to build the citation network and the pairs of similar papers. Then, we proceed to study gender disparities, first at the aggregate level and then by comparing pairs of similar papers.

Data Description
We study an American Physical Society (APS) dataset from 1893 to 2010 which contains articles' metadata, the authors' basic information, and the citations within the papers. The metadata consists of authors' full names and a unique digital object identifier (DOI) of the publication in a string format. For those names that are repeated in the dataset, we used name disambiguation methods proposed by Sinatra et al. to detect unique authors and correctly match authors to publications 19 . To infer gender from names, we implemented a gender-detection procedure that combines author names with an image-based gender inference technique applied to search results from Google Images 20 . This combined method results in high accuracy in the gender identification of scholars from different nationalities (see Supplementary Sections 1 and 2). The final dataset consists of 541,448 scholarly articles published over the course of 117 years. We have identified 70,833 gendered names, 9,947 women and 60,886 men. The evolution of the number of authors per year is shown in Figure 1A.
Here, the notion of "gender" refers neither to the sex of the authors nor to the gender that the author self-identifies as. By the words "woman" or "female author", we mean an author whose name has a high probability of belonging to a person assigned female at birth, or who is identified as a woman on the basis of facial characteristics. Given this limitation, we can safely argue that these methodologies are in accordance with social constructs and what people perceive as gender in society.

Constructing citation networks and assessing similar pairs
We build the citation networks by considering each paper as a node and making a link from paper i to paper j if i includes a citation to j. We measure the similarity between two papers using the bibliographic coupling strength 21,22 ; that is, the number of publications that both papers cite. Two papers that cover similar topics in a comparable way are assumed to include a similar set of outgoing citations. However, within subfields there is usually a handful of classic publications that are cited in most works, so their inclusion in two different papers may not indicate actual similarity, but a citation convention. To avoid such shortcomings of naive bibliographic coupling, and guarantee the significance of the overlapping set of citations, we apply a statistical test based on the hypergeometric distribution. This test controls for the incoming citations of the commonly cited papers and checks whether the size of the common set of citations is so large that it cannot be explained by randomness.
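As an illustration of the naive bibliographic coupling strength described above (before any statistical validation), the following minimal sketch counts the references shared by each pair of papers. The paper identifiers and the `references` dictionary are hypothetical toy data, not part of the APS dataset.

```python
from itertools import combinations


def bibliographic_coupling(references):
    """Return {(i, j): number of shared references} for every pair of
    papers that cite at least one publication in common."""
    coupling = {}
    for i, j in combinations(sorted(references), 2):
        shared = len(references[i] & references[j])
        if shared:
            coupling[(i, j)] = shared
    return coupling


# Toy citation network: paper -> set of papers it cites (hypothetical IDs).
references = {
    "A": {"X", "Y", "Z"},
    "B": {"Y", "Z", "W"},
    "C": {"W"},
}

coupling = bibliographic_coupling(references)
print(coupling)
```

Enumerating all pairs is quadratic in the number of papers, so a real analysis would presumably restrict the comparison to papers within the same subfield or to papers that share at least one reference.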
To explore gender disparities we select pairs of similar papers respectively written by male and female primary authors. Then, we compare the incoming citations to each element of the pair, such that, since the two publications are respectively led by a man and a woman, this comparison allows us to detect potential inequalities in the citation patterns. We have summarized this methodology in the diagram of Figure 2 and provided all the technical details in Methods.

Aggregate gender disparity trends
To characterize the gender disparities at the aggregate level, we first analyse the aspects of scientific production that depend primarily on individual choices and ability: in particular, productivity, dropout rate, and self-citations. Then, we discuss authorship order, which depends on the internal organization of research groups. Finally, we study the behaviour of the scientific community as a whole by comparing the citations received by men and women.
Productivity. We define productivity as the number of publications that scholars produce during their career. In physics, we observe that women have a lower average number of publications compared to men across all their career ages (Figure 1B). While the publication gap narrows in the first two years of an author's career, we observe a sudden increase in the gap from the second to the eleventh year. After this point, the publication gap starts decreasing again. These fluctuations in publication productivity can be associated with the disproportionate family responsibilities that women have to take on compared with men 23 . Although a researcher's productivity can be considered to be determined mainly by individual skills, the collaborative nature of scientific work makes it dependent on external factors such as other team members or departmental organization. Likewise, these factors, together with other aspects like social perception or family responsibilities, affect women's motivation to keep working in academia, potentially leading to the leaky pipeline phenomenon. To quantify this phenomenon, we next explore the difference in dropout rates between men and women.
Dropout rate. We compute dropout as a lack of publication activity for at least five years to distinguish the authors who are active in publishing from those who have dropped out. We investigate the ratio of dropout scholars at each career age compared to the number of active scholars by gender. Figure 1C shows that female authors have a higher dropout ratio throughout their whole career. Most notably, the largest gaps appear in the early career years, with a 3.63% difference between men and women in the fifth year. The dropout fluctuations after career age of 20 for women are caused by the low number of senior female scientists in the data (see Supplementary Figure 6). The dropout rates of authors who leave academia after their first year (career age 0) are not shown in Figure 1C. This career age presents the highest dropout rates, with 28% for male authors and 38% for female authors.

Figure 2. Assessing similar pairs. We use bibliographic coupling and hypergeometric statistical tests to select couples of similar papers based on their outgoing citation activities. Then we compare their respective popularity (incoming citations). Each node and each arrow represent a paper and a citation respectively, whereas each dashed arrow represents a potential citation that is missing. (Diagram labels: Detecting pairs of similar papers; Citation; Expected citation that is missing; Publication date.)
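One possible way to operationalize the dropout ratio per career age is sketched below. It assumes that career age is counted from an author's first publication, that an author drops out at the career age of their last publication, and that the five-year inactivity window is measured against the last year of the dataset; these conventions, the function name, and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def dropout_ratio_by_career_age(pub_years, last_year, gap=5):
    """pub_years: {author: list of publication years}.
    Returns {career_age: dropped_out / active} for every observed age."""
    active = Counter()
    dropped = Counter()
    for years in pub_years.values():
        first, last = min(years), max(years)
        final_age = last - first
        # The author counts as active at every career age up to the last paper.
        for age in range(final_age + 1):
            active[age] += 1
        # Counted as a dropout only after at least `gap` years of silence.
        if last_year - last >= gap:
            dropped[final_age] += 1
    return {age: dropped[age] / active[age] for age in sorted(active)}


# Hypothetical authors and publication years.
pub_years = {
    "author_a": [2000],
    "author_b": [2000, 2003],
    "author_c": [2000, 2008],
}
ratios = dropout_ratio_by_career_age(pub_years, last_year=2010)
print(ratios)
```

Running the function separately on the male and female author populations would give the two curves to compare.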
Self-citation. Self-citation refers to cases where authors cite their own previous works. Self-citations increase the total citation count and the visibility of scholars [24][25][26] , potentially enhancing academic promotion and attention. We have measured the relative number of self-citations by all male and female authors with the following metric (r) to study the difference in self-citation ratios between the two genders over time 24 :

$$r = \frac{\%\ \text{male self-citations} \,/\, \%\ \text{male citations}}{\%\ \text{female self-citations} \,/\, \%\ \text{female citations}} \qquad (1)$$

Figure 1D shows the temporal evolution of the ratio r. This result shows that women tend to cite themselves less than men and that this trend is consistent over the years (see Supplementary Table 3 for more details). Consequently, women's visibility in the citation network is partly penalized by the higher ratio of men citing their own previous works.
Another fundamental factor that affects an author's visibility is the position in which her name appears in the list of authors. This position depends on how the whole research group is organized and, crucially, in most cases it depends on the perceived level of contribution of each collaborator.
Authorship order analysis. In the majority of the scientific fields, including physics, the authorship order indicates relative contribution and seniority by putting emphasis on the first, the last, and the second positions 27,28 . In order to compare the positions of authors, we first discard those papers for which authorship order is alphabetical. For this purpose, we perform a string comparison of the last names of the contributing authors and consider them to be in alphabetical order if the paper has at least four authors and all of them follow this order. Around 4% of the papers can be considered as alphabetically ordered; in Supplementary Table 4 we detail their fraction by PACS subfield. After discarding those papers from the analysis, we study the authorship order in each publication and compare the proportion of female and male primary authors with what we would expect from the size of the population by conducting a two-proportion z-test (see Methods).
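The two steps of this analysis can be sketched as follows. The alphabetical filter follows the rule stated above (at least four authors, all in alphabetical order); the z-test is the textbook pooled two-proportion test, and which two proportions are compared in the actual analysis is described in Methods, so the example counts here are purely hypothetical.

```python
from math import erf, sqrt


def is_alphabetical(last_names, min_authors=4):
    """True if the paper has at least `min_authors` authors and their
    last names appear in (case-insensitive) alphabetical order."""
    names = [name.lower() for name in last_names]
    return len(names) >= min_authors and names == sorted(names)


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Hypothetical author lists.
keep = not is_alphabetical(["Baker", "Adams", "Clark", "Davis"])
drop = is_alphabetical(["Adams", "Baker", "Clark", "Davis"])

# Hypothetical counts: 60 of 100 in one group versus 40 of 100 in another.
z, p_value = two_proportion_z_test(60, 100, 40, 100)
print(keep, drop, round(z, 3), round(p_value, 4))
```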
The results show that male authors are listed in the first position of physics publications more frequently than expected (See Supplementary Table 5). We verified the robustness of this result by performing the same computation for last authors, obtaining analogous results. This is in line with previous findings that women feature only rarely as first or last authors in leading journals 29 .
While the authorship order reflects how a researcher's coworkers perceive her contribution, the collective perception of the scientific community regarding the relevance of a paper is manifested in the citations of papers. In the following sections we will thoroughly compare the relative popularity of publications led by women and men.
Citation centrality analysis. The flow of citations determines the visibility and recognition of papers both locally and globally. To measure the local influence of papers we use the in-degree metric, and to measure the global influence, we use the PageRank centrality. Our aim is to verify if the visibility of papers written by women is proportionate to what we expect from their overall population size. To do that, we focus on the ranking of the nodes according to their respective centrality.
Understanding ranking centrality is important for three reasons. First, the authors of papers in top ranks gain more visibility for themselves and those central papers influence future citation patterns [30][31][32] . Second, the visibility of papers in top ranks is being exacerbated by algorithmic tools such as Google Scholar. Third, since citation networks follow a heavy-tail distribution, those in top ranks stabilize their ranking position and give few opportunities for other papers to catch up 33 . Because of these network effects, it is important to study how minorities are represented in top ranks.
We assigned each paper a gender by labeling it based on its first author. Then, we analysed the papers in the top h% of the in-degree and PageRank centrality rankings. Figure 3A suggests that papers written by women have notably lower in-degree and PageRank centrality than expected from their overall proportion. Female-led publications are substantially under-represented in the top 20%, 30%, and 40% of the rankings, and the deviation between the observed and the expected proportions increases towards the highest rank positions. As expected, in-degree and PageRank follow a similar trend; however, the proportion of female-led papers with high PageRank centrality is even lower than for in-degree centrality. This suggests not only that papers written by women receive less attention but also that they are disadvantaged in terms of their position within the citation network.
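The comparison between observed and expected proportions in the top h% can be sketched as below. The sketch assumes that a centrality score (in-degree or PageRank) has already been computed for every paper; the function name, the tie handling (ties are broken by input order), and the toy data are assumptions.

```python
def female_share_in_top(papers, h):
    """papers: list of (centrality, gender) tuples with gender 'F' or 'M'.
    Returns (observed share of 'F' in the top h%, expected overall share)."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    top_n = max(1, round(len(ranked) * h / 100))
    top = ranked[:top_n]
    observed = sum(1 for _, g in top if g == "F") / top_n
    expected = sum(1 for _, g in papers if g == "F") / len(papers)
    return observed, expected


# Hypothetical papers: centrality scores 10..1 with first-author gender labels.
genders = ["M", "M", "M", "F", "M", "M", "F", "M", "F", "F"]
papers = [(10 - idx, g) for idx, g in enumerate(genders)]

top20 = female_share_in_top(papers, 20)
top50 = female_share_in_top(papers, 50)
print(top20, top50)
```

An observed share below the expected share at a given h indicates under-representation in that part of the ranking.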
So far, the global gender analysis points towards a notable disparity in productivity and citation of men and women. This could be partly due to historical reasons, to the cumulative advantage that early arrival confers to men, as well as to the high dropout rate of women 7 . The slower rate of arrival of women (see Figure 1A) may also play a relevant role. Together, these factors affect women's global visibility. The question that arises from these global results is, are scholars intentionally ignoring (and therefore, under-citing) research works led by women? To explore this possibility, in the following section we study pairs of papers written by men and women that are statistically validated twins, and measure the citations that each receives.

Pair-wise citation analysis
We identified statistically validated male-female pairs of similar papers using the methodology described in Methods and summarized in Supplementary Figure 5. Then, we computed the difference in the number of citations each member of the pair receives. The overall expectation is that similar pairs of papers should have a similar number of incoming citations on average. We use a z-test to assess if that is the case (see Methods). This computation is performed in each PACS subfield separately to control for potential differences in the citation biases per subfield.
Supplementary Table 7 shows that in the majority of subfields, the average number of citations received by publications with male primary authors is higher than for female primary authors. In fact, we are able to observe a statistically significant difference in five out of ten subfields.
To check whether the temporal difference between two papers is responsible for the citation disparity for women (an older paper has had more time to accumulate citations), we add a maximum three-year difference restriction between two similar papers and redo the citation difference analyses. Table 1 shows that when the time constraint is applied, the citation difference between two similar publications is no longer significant.
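A minimal sketch of this comparison, with and without the three-year restriction, is given below. It assumes a one-sample z-test on the mean citation difference of the pairs (the exact test is specified in Methods), and the pair data are hypothetical.

```python
from math import erf, sqrt
from statistics import mean, stdev


def paired_citation_z_test(pairs, max_year_gap=None):
    """pairs: list of (year_m, cites_m, year_f, cites_f).
    Tests whether the mean of cites_m - cites_f differs from zero.
    Returns (number of pairs used, z, two-sided p-value)."""
    diffs = [
        cm - cf
        for ym, cm, yf, cf in pairs
        if max_year_gap is None or abs(ym - yf) <= max_year_gap
    ]
    z = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    p_value = 1 - erf(abs(z) / sqrt(2))
    return len(diffs), z, p_value


# Hypothetical male-female pairs of similar papers.
pairs = [
    (1990, 30, 2000, 5),
    (1985, 40, 1999, 10),
    (2001, 12, 2002, 10),
    (2003, 8, 2001, 9),
    (2005, 15, 2004, 16),
    (2000, 7, 2003, 6),
]

n_all, z_all, p_all = paired_citation_z_test(pairs)
n_close, z_close, p_close = paired_citation_z_test(pairs, max_year_gap=3)
print(n_all, round(z_all, 3), n_close, round(z_close, 3))
```

In this toy example the large differences come from the pairs published many years apart, so the statistic shrinks once those pairs are excluded.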

First-mover advantage drives the citation inequality
Given the above results, we now seek to confirm whether the time of publication is a main driver for the citation disparity and whether the first-mover advantage in publication affects male-led papers and female-led papers similarly. We define $\Delta_t = Y_m - Y_f$ as the year difference between the publication dates of male-female pairs of similar papers and $\Delta_C = c_m - c_f$ as their citation difference. We plotted the year difference $\Delta_t$ against the citation difference $\Delta_C$ in Figure 3B. We likewise produced ten analogous plots after categorizing the data into subfields by PACS number (shown in Supplementary Figure 8) to control for variations between subfields. Note that for this analysis we impose no time restriction between the publication times of the two papers of each pair.
To verify that the disparity in citations is caused by the first-mover advantage, we first need to test whether a first-mover advantage in fact exists. If that is the case, when a man publishes first ($\Delta_t < 0$) he should get more citations ($\Delta_C > 0$) on average, but when a woman publishes first ($\Delta_t > 0$) she is the one who should get more citations ($\Delta_C < 0$) on average; that is, in Figure 3B, quadrants Q2 and Q4 should be more populated than expected if we treated $\Delta_t$ and $\Delta_C$ as independent random variables. Equivalently, we should observe a negative correlation between $\Delta_t$ and $\Delta_C$. To test this hypothesis, we compared the empirical joint probability distribution of $\Delta_t$ and $\Delta_C$, $P_{\mathrm{emp}}(\Delta_t, \Delta_C)$, with the one that we would obtain if they were independent variables, $P_{\mathrm{null}}(\Delta_t, \Delta_C) = p(\Delta_t)\,p(\Delta_C)$, by computing the probability anomaly as:

$$P_{\mathrm{diff}}(\Delta_t, \Delta_C) = P_{\mathrm{emp}}(\Delta_t, \Delta_C) - P_{\mathrm{null}}(\Delta_t, \Delta_C)$$

The resulting values of $P_{\mathrm{diff}}(\Delta_t, \Delta_C)$ are shown in the right panel of Figure 3B and, as can be observed, they support the hypothesis of the first-mover advantage, since Q2 and Q4 present positive anomalies while Q1 and Q3 present negative ones. It is worth emphasizing that a positive (resp. negative) anomaly indicates higher (resp. lower) density of points with respect to a situation of no correlation between $\Delta_t$ and $\Delta_C$. To quantify this trend we computed the Pearson and Spearman correlations between $\Delta_t$ and $\Delta_C$, obtaining −0.19 and −0.41 respectively.
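The idea behind the probability anomaly can be illustrated with a coarse-grained sketch that works at the level of quadrants rather than on the full joint distribution. It assumes the anomaly is the difference between the empirical joint probability and the product of the marginals, and it discards pairs with a zero year or citation difference; both the coarse-graining and the toy data are simplifications introduced here.

```python
from collections import Counter


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def quadrant_anomaly(pairs):
    """pairs: list of (delta_t, delta_c).
    Returns {(sign_t, sign_c): P_emp - P_null} for the four quadrants."""
    pts = [(sign(dt), sign(dc)) for dt, dc in pairs if dt != 0 and dc != 0]
    n = len(pts)
    joint = Counter(pts)
    marg_t = Counter(st for st, _ in pts)
    marg_c = Counter(sc for _, sc in pts)
    return {
        (st, sc): joint[(st, sc)] / n - (marg_t[st] / n) * (marg_c[sc] / n)
        for st in (-1, 1)
        for sc in (-1, 1)
    }


# Hypothetical (delta_t, delta_c) values for male-female pairs.
pairs = [(-3, 10), (-2, 5), (-1, 2), (2, -4), (3, -6), (1, 3), (-1, -1), (4, -2)]
anomaly = quadrant_anomaly(pairs)

# Q2 is (delta_t < 0, delta_c > 0); Q4 is (delta_t > 0, delta_c < 0).
print(anomaly[(-1, 1)], anomaly[(1, -1)], anomaly[(1, 1)], anomaly[(-1, -1)])
```

Positive values in Q2 and Q4 together with negative values in Q1 and Q3 correspond to the pattern expected under a first-mover advantage.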
Once the existence of the first-mover advantage has been confirmed, we need to test whether there exists an asymmetry in the relative advantage that men and women obtain when they publish first. If there is no asymmetry, the average number of citations that a woman obtains by publishing a certain number of years ahead of a man should be comparable to the number of citations that a man obtains in the equivalent situation.
To verify this, we compared the citation differences of Q2 with Q4 (pairs where the earlier paper received more citations) and Q1 with Q3 (pairs where the earlier paper received fewer citations) for each temporal difference; in other words, we compared the average absolute value $|\Delta_C|$ of points from Q2 with the average $|\Delta_C|$ of points from Q4 for each $|\Delta_t| = 1, 2, \ldots$ separately (analogously for Q1 and Q3). To perform these comparisons, we used z-tests for difference of means for each year difference (see Methods). The results of the tests for every subfield, shown in Supplementary Table 8, indicate that there is no significant disparity in the advantage obtained by women and men when they publish a paper a given number of years earlier than their corresponding statistical twin.
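The Q2 versus Q4 comparison can be sketched as a two-sample z-test per year difference, as below. The use of sample variances, the requirement of at least two pairs per group, and the toy data are assumptions; the exact test is the one described in Methods.

```python
from collections import defaultdict
from math import erf, sqrt
from statistics import mean, variance


def asymmetry_z_tests(pairs):
    """pairs: list of (delta_t, delta_c). Only pairs where the earlier paper
    has more citations are used (Q2: men first, Q4: women first).
    Returns {|delta_t|: (z, two-sided p-value)}."""
    men_first, women_first = defaultdict(list), defaultdict(list)
    for dt, dc in pairs:
        if dt < 0 < dc:        # Q2: man published first and leads in citations
            men_first[-dt].append(dc)
        elif dc < 0 < dt:      # Q4: woman published first and leads in citations
            women_first[dt].append(-dc)
    results = {}
    for gap in sorted(set(men_first) & set(women_first)):
        a, b = men_first[gap], women_first[gap]
        if len(a) < 2 or len(b) < 2:
            continue  # not enough pairs to estimate a variance
        se = sqrt(variance(a) / len(a) + variance(b) / len(b))
        if se == 0:
            continue
        z = (mean(a) - mean(b)) / se
        results[gap] = (z, 1 - erf(abs(z) / sqrt(2)))
    return results


# Hypothetical pairs of (delta_t, delta_c).
pairs = [(-1, 4), (-1, 6), (1, -3), (1, -7), (-2, 10), (2, -1)]
results = asymmetry_z_tests(pairs)
print(results)
```

The same function applied to Q1 and Q3 only requires flipping the sign conditions.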
This thorough analysis indicates that, when we control for the similarity of the papers and time of publication, there is no significant evidence for any disparity between two statistical twins. Therefore, despite the common assumption that papers written by women generally receive fewer citations, this difference is mainly driven by the historical first-mover advantages that men have and not by deliberate discriminatory actions against women.

Historical trend in citation
Finally, we hypothesize that the physics community might have been less receptive to the contribution of women in the past compared to the present. To test this hypothesis, we measure the temporal evolution of the citation differences ($\Delta_C$) between male-female pairs by year and limit the publication time difference between the two papers to a trailing window of three years. Then, we compute the mean and standard error of $\Delta_C$ for all the pairs within each window. For comparison, we perform an analogous computation for random samples of similar male-male pairs. In each time window, we matched the number of sampled male-male pairs with the number of similar male-female pairs. We repeated the male-male computation 100 times independently and computed the average $\Delta_C$ and the standard error, which we use as a baseline. Figure 3C shows the citation differences for male-female pairs of similar papers over the years compared with the baseline given by male-male pairs of papers. The earlier male-female pairs seem to present a higher disparity favouring men than later pairs, whereas the $\Delta_C$ values for male-male pairs throughout the years are, as expected, consistently located around the equilibrium. After all, the similar male-male pairs were chosen randomly and there is no reason for one paper of the pair to have a higher or lower citation count than the other. The number of sampled pairs per year is shown in Supplementary Figure 9. To measure the apparent change of trend in the male-female pairs, we ran a Mann-Whitney U test comparing the $\Delta_C$ of male-female pairs published before and after 1995, obtaining a p-value of $6.9 \times 10^{-9}$. Hence, as hypothesized, the male-female pairs published before 1995 show a significant disparity favouring men when compared to those published after 1995.

Discussion
The primary objective of this research was to identify gender disparities in physics focusing on several topics of interest: productivity, author order analysis, self-citation analysis, and the comparison of citations for pairs of similar papers. In doing so, our study makes a substantial contribution to the current body of literature by comprehensively analysing the citation patterns of men and women in physics. We assembled information about all papers published by the American Physical Society from 1894 to 2009. Using a technique that combines name and image recognition, we inferred the gender of the primary authors of papers and, to study potential gender biases, we looked for statistically significant differences in the citation patterns of papers written by male and female primary authors.
Despite all the efforts to avoid any biases in our analysis, some caveats should be considered. We have combined name and image inference to identify the gender of the scholars. Even with this careful examination, we cannot infer the gender of authors who have only initials as their first names. Another caveat is related to ethnicity, as we cannot identify the majority of Asian names originating from Korea and Japan 20 . However, we can safely argue that this lack of gender identification likely affects both genders similarly. Another sensitive step of our data processing pipeline is name disambiguation, used to identify all the papers published by a given author. Although we have used various criteria to disambiguate names, there still might be errors in identifying unique authors and these errors may affect minorities, which have lower numbers of instances in the data. There are other factors that can affect citation and may not be determined by assessing similar papers. For example, papers that are novel and ground-breaking or interdisciplinary in their nature may contain citations from outside physics that make them less similar to other established papers, and those are likely not being assessed in our analysis. In this case, we acknowledge that the focus of our analysis is on those scholars who work predominantly on mainstream physics.

Broader impact
Academic evaluation metrics. The academic community tends to evaluate scientists based on the behaviour of the majority, which in physics is predominantly the behaviour of white, Western men. This evaluation, at its core, is problematic and can cause discrimination against other groups that are historically, socially, or politically discriminated against. In such cases, more attention and care should be given to women and other minorities who are more likely to suffer from such historical disadvantages. Once the system moves towards a more diverse representation, its core values will no longer be determined by only one type of majority.
Structural inequalities and cumulative effects. The structure of the citation network can influence the future citations and recognition that papers receive. Through reading papers, scholars often follow cited papers to read and cite previous works. If papers written by women are under-represented in influential positions of the citation network, this will affect their future visibility even if they are cited adequately compared to their statistical twins. This phenomenon, also known as success-breeds-success 34 , in addition to cumulative advantages and the first-mover advantage 35 , can be consequential for the success and recognition of scholars, their visibility 32 , future success, and the scientific community's perception of their work 36 .
Collaboration barriers. Science, at its core, is a collaborative process. Through collaboration and research visits, scientists meet, ideas spread, and the foundations are laid for future collaborations. There are implicit factors that can indirectly affect the participation of women in scientific collaborations. For example, geographical distance is more likely to affect women due to their family responsibilities, restrictions on travel during pregnancy, and breastfeeding, to name a few reasons. Women might not be welcomed in certain social events that are predominantly preferred by men or for those with no family responsibilities. Lack of chemistry or shyness in interacting with another gender might also make women less likely to be invited for research visits and collaborations. We note that women are not the only group who suffer from geographical restrictions, as other forms of discrimination or simply high traveling costs can affect the collaboration of scholars from Muslim and developing countries.
The importance of diversity. Diversity has a crucial role in shaping and spreading new ideas. For example, one can safely argue that many recent publications that aim to understand the inequality and biases in academia and other social domains are directly related to the boost in participation of women and minorities. However, it is also known that despite their contributions to innovative research, minorities do not reap the benefits of their innovation when compared with majorities 37 . In future work, intersectional inequalities should be studied at large scale by considering the intersection of gender, ethnicity, and race.

Conclusion
In sum, we found that despite the rise of female participation in physics in recent years, the rate of entry of new women into the field is still much slower than for men. Women tend to be less productive than men in their mid-career, and they tend to have a higher dropout rate over their academic careers. Moreover, in agreement with previous works, we found that men tend to cite their own previous works with more frequency than women, penalizing the visibility of women and their potential for academic promotion. This disparity in visibility is also manifested in the under-representation of women at the top ranks of both degree and PageRank centrality of the citation network, which implies a disadvantage on both a local scale (lower number of citations) and a global scale (peripheral location within the network).
When assessing pairs of similar papers, we found that while earlier papers tend to have an advantage, there are no statistically significant differences in citations of men and women. These results combined suggest that the overall disparity in the citation network is a result of cumulative advantages and the first-mover effect that men have in physics, and not an intentional discriminatory act against women. This cumulative advantage, however, could create implicit biases that should be tackled by appropriate policies that foster the participation of women and other minorities.

Assessing similar pairs of papers
The main objective of this paper is to compare pairs of similar papers in an unbiased fashion. The similarity analysis is based on the concept of bibliographic coupling strength $N_{ij}$ of pairs of articles $(i, j)$, which is defined as the number of common articles cited by both $i$ and $j$ 21,22. To overcome the shortcomings of the most commonly used normalized versions of $N_{ij}$ (the Jaccard index and fractional counting, described in Supplementary Section 3), we identify couples of similar papers by looking both at the outgoing references of the pair and the incoming citations of the articles they cite. In particular, we perform a statistical test using the hypergeometric distribution as a null model and detect pairs of papers whose set of common outgoing citations has a very low probability of having been generated by chance 38. In Supplementary Figure 5 we present a diagram of this methodology, which is explained below in detail.
First, the citation network is built for each physics subfield (the first two digits of PACS), and then each paper in the citation network is labeled by the gender of its primary author. After establishing the citation network, two sets $S^k_A$ and $S^k_B$ are defined: $S^k_B$ includes all articles that are cited $k$ times, and $S^k_A$ includes all articles that cite any element in $S^k_B$. Notice that each publication may belong to one set, to the other or to both. Then, we build all possible pairs $i, j \in S^k_A$. In order to quantify the similarity between $i$ and $j$, we compute the probability of $i$ and $j$ both referencing a certain number of publications using the hypergeometric distribution: Now, if $i$ and $j$ have actually cited $N^k_{ij}$ common papers of in-degree $k$, the cumulative probability of $X \le N^k_{ij}$ provides a measure of how probable it is that the size of their set of overlapping citations can be explained by randomness: The higher $p_{ij}(k)$ is, the less probable it is that the size of $N^k_{ij}$ is due to chance. Therefore, we devise a measure of similarity as follows: Notice that $q_{ij}(k)$ is the probability of a particular bibliographic coupling strength of randomly selected papers $i$ and $j$ towards articles in $S^k_B$ being greater than or equal to $N^k_{ij}$. This computation is repeated for all $k$ and the different values of $q_{ij}(k)$ are stored. The similarity of the couple $(i, j)$ is measured by the minimum over all possible values of $k$: Publications $i$ and $j$ are considered similar if $q_{ij}(k)_{\min} < p^*$, where $p^*$ is a threshold value. By studying $q_{ij}(k)_{\min}$, we are now able to assess and compare pairs of similar papers.

In addition, we manually inspected several pairs of papers with validated similarity measurements to verify the accuracy of our approach. We set a low threshold value, $p^* = 10^{-6}$, and applied a constraint of maximum publication year difference of three years. We validated the similarity between the two papers through the inspection of keywords, titles, and citation activities. For instance, papers 39 and 40, with $q_{ij}(k)_{\min} = 6.1969 \times 10^{-7}$, present some connection between their main ideas and share a common author. Additionally, a large proportion of their citation activities align. Another similar pair is formed by articles 41 and 42, with $q_{ij}(k)_{\min} = 5.0855 \times 10^{-8}$, which show extremely similar citation activities and deal with similar topics. As a final example, 43 and 44, with $q_{ij}(k)_{\min} = 2.8198 \times 10^{-8}$, share topic, citation activities, and a collaborating author. It is worth emphasizing that, due to the highly restrictive $p^*$, some of these statistically validated pairs of similar papers share a common author, which is a strong verification of our algorithm.
In a nutshell, the hypergeometric probability testing compares how significant the overlapping outflow of citations is for two papers compared to what we expect from the in-degree and out-degree of the citation network. Using this technique, we are able to compare papers that are inherently similar in their subject field by not only comparing their overlapping references, but also accounting for variations in the citations received by each reference. Since we control both for the outgoing citations of the pair and the incoming citations of the commonly cited papers, the comparison is robust and unbiased.
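The pairwise test described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: for a fixed in-degree k, we assume the null model draws the references of paper j at random from the |S^k_B| articles cited exactly k times, and that d_i and d_j count how many of those articles papers i and j actually cite. The function names and the argument layout are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, population, d_i, d_j):
    """P(X = x): chance that two random reference lists of sizes d_i and d_j,
    drawn from `population` candidate articles, share exactly x articles."""
    if x < max(0, d_i + d_j - population) or x > min(d_i, d_j):
        return 0.0
    return comb(d_i, x) * comb(population - d_i, d_j - x) / comb(population, d_j)


def q_value(n_common, population, d_i, d_j):
    """P(X >= n_common) under the hypergeometric null model (assumed form)."""
    upper = min(d_i, d_j)
    return sum(hypergeom_pmf(x, population, d_i, d_j)
               for x in range(n_common, upper + 1))


def min_q_over_k(per_k):
    """per_k maps each in-degree k to (n_common, population, d_i, d_j).
    Returns the smallest q-value found across all k."""
    return min(q_value(*args) for args in per_k.values())


def is_similar(per_k, threshold=1e-6):
    """A pair is flagged as similar when the minimum q-value is below p*."""
    return min_q_over_k(per_k) < threshold


# Hypothetical pair: 8 shared references among 1000 articles of in-degree 4,
# and no shared references among 500 articles of in-degree 7.
example = {4: (8, 1000, 10, 10), 7: (0, 500, 5, 5)}
print(is_similar(example))
```

Exact integer binomials from `math.comb` avoid floating-point underflow for the very small tail probabilities that a threshold such as 10^-6 requires.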

Authorship order two-proportion z-test
We denote the total male and female populations as $N_m$ and $N_f$, and the total numbers of male and female first authors as $n_m$ and $n_f$, respectively. We further define $p_m = n_m / N_m$, $p_f = n_f / N_f$, and $p = (n_m + n_f) / (N_m + N_f)$, and the two-proportion z-test is performed as below:
$$z = \frac{p_m - p_f}{\sqrt{p\,(1 - p)\left(\dfrac{1}{N_m} + \dfrac{1}{N_f}\right)}}$$
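As an illustration, the two-proportion z-test can be sketched with the Python standard library. The sketch assumes the standard pooled form of the statistic built from $p_m$, $p_f$ and $p$ as defined above, and a two-sided p-value; the function name and the counts in the example are hypothetical and are not taken from the paper's data.

```python
from math import erfc, sqrt


def two_proportion_z(n_m, N_m, n_f, N_f):
    """Pooled two-proportion z-test.

    n_m, n_f: number of male / female first authors.
    N_m, N_f: total male / female populations.
    Returns (z, two-sided p-value)."""
    p_m = n_m / N_m
    p_f = n_f / N_f
    p = (n_m + n_f) / (N_m + N_f)  # pooled proportion
    standard_error = sqrt(p * (1 - p) * (1 / N_m + 1 / N_f))
    z = (p_m - p_f) / standard_error
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z(60, 100, 40, 100)
print(round(z, 4), round(p_value, 4))
```

A positive z here means that the proportion of first authors is higher among men than among women.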

Calculating differences in received citations
Let where $x$ denotes the index of pairs $(m, f) \in M(p^*)$. We define an average centrality difference per subfield as We perform a difference of means z-test with $H_0: c_m = c_f$, with the z-statistic defined as Hence, a positive z-score indicates that the data displays higher degree centrality for male authors than expected.

Computing temporal citation differences
We compared the citation differences of Q2 with Q4 (pairs where the earlier paper received more citations) and Q1 with Q3 (pairs where the earlier paper received fewer citations) for each temporal difference; in other words, we compared the average absolute value $|\Delta_C|$ of points from Q2 with the average $|\Delta_C|$ of points from Q4 for each $|\Delta_t| = 1, 2, \ldots$ separately (analogously for Q1 and Q3). To perform these comparisons, we used z-tests for difference of means for each year difference:
$$z = \frac{\overline{|\Delta_C^{Q_i}|} - \overline{|\Delta_C^{Q_j}|}}{\sqrt{\dfrac{\sigma_{Q_i}^2}{N(Q_i)} + \dfrac{\sigma_{Q_j}^2}{N(Q_j)}}}$$
In this test we evaluate the mean ($\overline{|\Delta_C^{Q_i}|}$) and the standard deviation ($\sigma_{Q_i}$) of $|\Delta_C|$ for two subsets of quadrants $Q_i$ and $Q_j$. $N(Q_i)$ is the number of data points in quadrant $i$ (number of similar pairs). We run the z-test for $(i, j) = (1, 3)$ and $(i, j) = (2, 4)$.

Figure 4. Author name disambiguation algorithm. This flow chart schematizes the author name disambiguation algorithm that Sinatra et al. used 19. The algorithm first decides whether a publication is considered in the analysis. Then, for any two author names 1 and 2, it decides whether they are the same individual or two different authors.

Gender detection
In order to detect the gender from authors' names, the first step is to remove those authors whose first name is missing or given only as an initial, since no existing name-based gender inference technique can handle those cases. For authors whose first names are available, we first use the application Genderize 45. Then, for the names whose gender this application is unable to infer, we use the picture-based gender inference technique Face++ 46. In this second step, we perform a Google image search with the author's first name and family name, and feed the resulting images to Face++. This methodology was developed by Karimi et al. 20, who compared it with commonly used dictionary-based gender detection techniques and showed that it consistently achieves high accuracy for names of different nationalities. The results they obtained for a random sample of researchers whose names and genders are known are shown in Table 2.
Before applying the gender detection technique, we performed a thorough standardization of names to avoid issues with the use of special characters. We followed the rules from the Program for Cooperative Cataloguing of the Library of Congress (NACO) 47. Supplementing the NACO normalization by translating accented characters and other special characters accordingly improves the overall query matching by 63%. Using this methodology we were able to detect the gender of 124,000 authors.
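To make the idea of name standardization concrete, the following sketch strips accents and punctuation with the Python standard library. It is only an illustration of this kind of preprocessing and does not implement the NACO rules themselves; the function name `normalize_name` and the specific choices (lowercasing, replacing punctuation with spaces) are assumptions.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Return a simplified, accent-free, lowercase version of a name."""
    # Decompose accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Replace punctuation (hyphens, apostrophes, periods) with spaces.
    without_punctuation = re.sub(r"[^\w\s]", " ", without_accents)
    # Collapse repeated whitespace and lowercase the result.
    return re.sub(r"\s+", " ", without_punctuation).strip().lower()


print(normalize_name("José  Martín-Gutiérrez"))
```

Characters that have no Unicode decomposition (for example "Ø") pass through unchanged, so a fuller treatment would need an explicit translation table for them.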

Similarity measures for publications
The main objective of this paper is to compare pairs of similar papers in an unbiased fashion. The similarity analysis is based on the concept of bibliographic coupling strength $N_{ij}$ of pairs of articles $(i, j)$, which is defined as the number of common articles cited by both $i$ and $j$ 21,22. But using $N_{ij}$ without further considerations can lead to misleading results. For example, the similarity between two papers that cite 20 and 25 publications, respectively, and share $N_{ij} = 5$ of them should not be the same as the similarity between two papers that also share 5 references but cite 65 and 82 publications, respectively. On the other hand, within subfields there are usually a handful of very popular publications that are cited in most works (such as review papers), so their inclusion in two different papers may not indicate actual similarity. In order to obtain meaningful measures of similarity, several normalization approaches have been explored. A widely used measure that addresses the first kind of issue described above is the Jaccard index. The Jaccard index is computed as the quotient of the cardinality of the intersection and the cardinality of the union of the sets of publications cited by the two papers under consideration. One of the problems of this method is that it considers the weight of all citations to be identical and therefore does not take the significance of each paper into account 48,49. In addition, narrowing our analysis to counting the common articles may not lead to an accurate interpretation due to the massive differences between male and female sample sizes. The reason is that, if the sizes of the sets of citations of the two papers are very different, their similarity is primarily determined by the size of the smallest one, as their intersection is bounded by the size of the smallest set 50.
Fractional counting is another common normalization technique for bibliographic coupling. In this case, instead of normalizing by the outgoing citations of the two papers of interest, each commonly cited reference contributes to the similarity score with a weight inversely proportional to its number of incoming citations 51,52. Therefore, fractional counting addresses the second kind of situation discussed above by compensating for the disproportionate influence of very popular publications in the similarity score. However, unlike the Jaccard index, it does not take into account the relative size of the sets of outgoing citations.
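The two normalizations discussed above can be illustrated with a short Python sketch. The Jaccard function follows the definition given in the text. For fractional counting, the weight 1/(incoming citations) is an assumed choice of the proportionality constant, and the function names and toy data are hypothetical.

```python
def jaccard(refs_i: set, refs_j: set) -> float:
    """Size of the intersection divided by size of the union of two
    reference lists."""
    union = refs_i | refs_j
    return len(refs_i & refs_j) / len(union) if union else 0.0


def fractional_coupling(refs_i: set, refs_j: set, in_degree: dict) -> float:
    """Each shared reference contributes 1 / (its number of incoming
    citations), so heavily cited works weigh less (assumed weighting)."""
    return sum(1.0 / in_degree[ref] for ref in refs_i & refs_j)


# The example from the text: 5 shared references in both cases.
short_a, short_b = set(range(20)), set(range(15, 40))    # 20 and 25 references
long_a, long_b = set(range(65)), set(range(60, 142))     # 65 and 82 references
print(jaccard(short_a, short_b))   # 5 / 40
print(jaccard(long_a, long_b))     # 5 / 142

# Toy in-degrees: reference 1 is rarely cited, reference 2 is a popular review.
in_degree = {1: 2, 2: 100}
print(fractional_coupling({1, 2}, {1, 2, 3}, in_degree))
```

With the same coupling strength of 5, the Jaccard index drops from 0.125 to about 0.035 as the reference lists grow, while fractional counting ignores list length and instead discounts the popular reference.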
To overcome the issues of the Jaccard index and fractional counting, we identify couples of similar papers by looking both at the outgoing references of the pair and the incoming citations of the articles they cite. In particular, we perform a statistical test using the hypergeometric distribution as a null model and detect pairs of papers whose set of common outgoing citations has a very low probability of having been generated by chance 38 . In Figure 5 we present a diagram of this methodology, which is explained in Methods in detail.

[Diagram: set of papers that cite at least one publication with exactly k citations; set of papers with exactly k citations (in this example, k = 4).]

Figure 3C of the main document. Percentiles 10% to 90% are shown in different shades of red (male-female) and blue (male-male) in steps of 10%. The two papers within each pair are published no more than 3 years from each other, and their citation difference is assigned to the year when the later paper is published. We performed a robustness check by assigning different $p^*$ values and time intervals, and the resulting plots returned similar distributions.

Table 5. Statistical tests for author order analysis. In this table every pair (publication, author) is a unique data point, so each author appears once for every time he or she has published in a given position. As a result, $n_f$ (resp. $n_m$) is the number of times a female (resp. male) author appears in a paper in the corresponding position. z-scores and p-values are calculated accordingly (see Methods) and are rounded to four decimal places, except for extreme values. n and p respectively denote sample size and proportion.

Table 6. Statistical tests comparing degree centrality by gender in the top ranks. Comparison of the proportion of papers respectively led by male and female primary authors in the top ranks of degree centrality. For reference, the overall proportion of female-led papers is 0.08. The high z-scores and low p-values corroborate the gender disparities found in Figure 3A of the main document.

Table 8. Statistical tests of gender asymmetry in the first-mover advantage. Comparison of the citation differences between quadrants Q1/Q3 and Q2/Q4 of Figure 8 for each year difference. The values shown in this table are z-scores computed according to equation (11) of the main document. Most of them lie in the range (−2, 2) (not significant), indicating that there is no gender asymmetry in the advantage gained by an author by publishing earlier. Entries with fewer than three data points do not yield meaningful statistics and therefore are excluded from our analysis (they are marked as '-'). PACS 50 has a very small sample size and hence is not analyzed.