Letter

Quantitative evaluation of gender bias in astronomical publications from citation counts

• Nature Astronomy 1, Article number: 0141 (2017)
• doi:10.1038/s41550-017-0141

Abstract

Numerous studies across different research fields have shown that both male and female referees consistently give higher scores to work done by men than to identical work done by women1,2,3. In addition, women are under-represented in prestigious publications and authorship positions4,5, and women receive ~10% fewer citations6,7. In astronomy, similar biases have been measured in conference participation8,9 and in success rates for telescope proposals10,11. Even though the number of doctorate degrees awarded to women is constantly increasing, women still tend to be under-represented in faculty positions12. Spurred by these findings, we measure the role of gender in the number of citations that papers receive in astronomy. To account for the fact that the properties of papers written by men and women differ intrinsically, we use a random forest algorithm to control for the non-gender-specific properties of these papers. Here we show that papers authored by women receive 10.4 ± 0.9% fewer citations than would be expected if papers with the same non-gender-specific properties were written by men.

We consider a complete sample of >200,000 publications from 1950 to 2015 from five major astronomy journals: Astronomy & Astrophysics, Astrophysical Journal, Monthly Notices of the Royal Astronomical Society, Nature and Science. We used the Smithsonian Astrophysical Observatory (SAO)/National Aeronautics and Space Administration (NASA) Astrophysics Data System (ADS) and the arXiv database to gather the following information about these papers: the names and number of authors; the number of references; the year of publication; the journal of publication; the abstract; and the name of the first author’s institution. We determine the gender of the first author by matching their name to a number of different publicly available databases. We clean the sample by removing entries for which we are not able to determine the gender of the first author. We also remove entries without references or citations. Our final dataset contains 149,741 papers. Further details about this procedure are available in the Methods.

Throughout the study we assume that men and women should receive the same number of citations for papers that have the same non-gender-specific properties. Any difference in the citation counts between papers led by men and papers led by women with matched non-gender-specific properties is labelled as ‘gender bias’. For all practical purposes, the terms ‘women’ and ‘men’ are to be understood as ‘first authors that we deduced to be women in this analysis’ and ‘first authors that we deduced to be men in this analysis’, respectively. Gender identities outside the male/female binary are not considered in the analysis.

We first examine whether there is a difference between men and women in terms of the number of citations. Papers written by men and women have different properties in the sample (see Methods). As the citation count is expected to correlate with certain non-gender-specific properties of the papers (such as seniority or number of references), we have to be careful when interpreting the quoted difference in the number of citations. We attempt to separate the gender bias effect from the effect caused by non-gender-specific properties of the papers.

Figure 1 shows the mean number of citations received by men divided by the mean number of citations received by women in a given year. We see a large difference between men and women in the early years of this study, with men receiving between 50 and 100% more citations than women. In this early period the errors are large, owing to the small total number of papers and the even smaller number of papers authored by women. Overall, the difference has been decreasing over time. We also show the results of fitting the data with the functional form $a_2 e^{a_1 (y_t - y)} + a_3$, where $y$ is the year. The best-fit parameters are $a_1 = 0.06 \pm 0.02$, $a_2 = 0.38 \pm 0.24$, $a_3 = 1.00 \pm 0.04$ and $y_t = 1974 \pm 12$. When written in this form, $a_3$ can be interpreted as the value of the gender difference in the far future, when the first term of the equation becomes negligible.
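As an illustration, a fit of this functional form can be performed with a standard weighted least-squares routine. The sketch below uses synthetic yearly ratios in place of the real data of Fig. 1; the noise level and starting values are our own assumptions, not the authors' choices.

```python
import numpy as np
from scipy.optimize import curve_fit

# The functional form fitted in the text: a2 * exp(a1 * (yt - y)) + a3.
def ratio_model(y, a1, a2, a3, yt):
    return a2 * np.exp(a1 * (yt - y)) + a3

# Synthetic yearly citation ratios standing in for the Fig. 1 data.
years = np.arange(1950, 2016)
truth = ratio_model(years, 0.06, 0.38, 1.00, 1974)  # best-fit values from the text
errors = np.full(years.shape, 0.05)
rng = np.random.default_rng(0)
observed = truth + rng.normal(0.0, errors)

# Inverse-variance weighting enters through the sigma argument.
popt, pcov = curve_fit(ratio_model, years, observed, sigma=errors,
                       p0=[0.05, 0.5, 1.0, 1970], maxfev=10000)
```

Note that $a_2$ and $y_t$ are degenerate in this parametrization (only the product $a_2 e^{a_1 y_t}$ is constrained), so their individual uncertainties are broad, consistent with the large quoted error on $y_t$.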

To quantify the difference in a single number, we introduce the variable $b_y$, defined as a constant fit to the data presented in Fig. 1 after a certain year. In this work, we use 1985 as the cutoff year; that is, $b_y$ is obtained by fitting the data with a constant from 1985 to 2015. Thus we search for the value of $b_{y_{\min}}$ that minimizes

$$\sum_{y > y_{\min} = 1985} \frac{(d_y - b_{y_{\min}})^2}{\sigma_{d_y}^2} \qquad (1)$$

where $d_y$ is the gender difference measured in a given year $y$ and $\sigma_{d_y}$ is the estimated error of the measured gender difference. Using this definition, we find $b_{1985} = 1.056 \pm 0.010$; that is, men received on average around 6% more citations than women. Changing the cutoff year does not significantly change our results, because the fit is always dominated by the data points in later years as a result of their smaller errors. For example, when taking the cutoff year to be 2000, we find $b_{2000} = 1.046 \pm 0.009$.
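Minimizing equation (1) with respect to a constant has the standard closed-form solution of an inverse-variance weighted mean. A minimal sketch, with toy numbers standing in for $d_y$ and $\sigma_{d_y}$:

```python
import numpy as np

def constant_fit(d, sigma):
    """Inverse-variance weighted constant minimizing equation (1):
    the sum over years of (d_y - b)^2 / sigma_y^2."""
    w = 1.0 / sigma**2
    b = np.sum(w * d) / np.sum(w)
    b_err = np.sqrt(1.0 / np.sum(w))
    return b, b_err

# Toy yearly differences: the later, better-measured points dominate,
# which is why the result is insensitive to the cutoff year.
d = np.array([1.20, 1.10, 1.06, 1.05])
sigma = np.array([0.20, 0.10, 0.02, 0.02])
b, b_err = constant_fit(d, sigma)
```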

Estimating the amount of gender bias is complicated by the difference in the properties of papers written by men and women. Any difference that we see could simply be a consequence of the fact that papers authored by men and women in the sample differ intrinsically in their properties, and hence may receive fewer citations not because of the authors' gender but because of some other parameter. Given the many possible variables influencing the citation count of a paper, isolating and studying any single variable (such as seniority or number of references) cannot capture the full span of possibilities influencing our estimate of gender bias. Therefore, we resort to machine learning techniques to correct for these properties and estimate the amount of gender bias more accurately.

The main idea is to train the random forest algorithm13 on the sample of papers authored by men using all the non-gender-specific parameters available for the dataset. These non-gender-specific parameters include the seniority of the first author, the number of references, the total number of authors, the year of publication, the journal of publication, the field of study and the geographical region of the first author’s institution. With the predictor trained on the sample of papers written by men, we then estimate the number of expected citations for the papers written by women given the properties of their papers. By comparing the predicted number of citations with the measured number of citations, we are able to constrain the intrinsic gender bias, which is corrected for the non-gender-specific properties of papers.
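The correction scheme can be sketched as follows. The column names, the integer encoding of the categorical variables and the mock citation model are illustrative assumptions rather than the authors' exact pipeline; only the train-on-men, predict-for-women logic follows the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Mock dataset of papers with the seven non-gender-specific properties.
rng = np.random.default_rng(1)
n = 2000
papers = pd.DataFrame({
    "seniority": rng.integers(0, 30, n),
    "n_refs": rng.integers(1, 100, n),
    "n_authors": rng.integers(1, 20, n),
    "year": rng.integers(1950, 2016, n),
    "journal": rng.integers(0, 5, n),   # categorical, integer-coded
    "field": rng.integers(0, 6, n),
    "region": rng.integers(0, 3, n),
    "gender": rng.choice(["m", "f"], n),
})
# Mock citation counts driven purely by non-gender-specific properties.
papers["citations"] = 0.5 * papers["n_refs"] + rng.poisson(5, n)

features = ["seniority", "n_refs", "n_authors", "year", "journal", "field", "region"]
men = papers[papers["gender"] == "m"]
women = papers[papers["gender"] == "f"]

# Train on papers led by men only (hyperparameters as quoted in the Methods)...
model = RandomForestRegressor(n_estimators=50, min_samples_leaf=20, random_state=0)
model.fit(men[features], men["citations"])

# ...then predict the citations women's papers would receive under that model;
# the ratio of measured to predicted citations is the bias estimate.
predicted = model.predict(women[features])
ratio = women["citations"].mean() / predicted.mean()
```

Because this mock dataset contains no gender effect by construction, the resulting ratio is close to 1; on the real data the paper finds a ratio of ~0.9.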

Figure 2 shows the ratio of the measured number of citations that women have received to the number of citations that would be expected from our analysis. We find that papers written by women systematically receive fewer citations than would be expected given the other, non-gender-specific, properties of their papers. We also show the results of fitting the data with the functional form $-b_2 e^{b_1 (y_t - y)} + b_3$, where $y$ is the year. The best-fit parameters are $b_1 = 0.06 \pm 0.01$, $b_2 = 3 \pm 2$, $b_3 = 0.94 \pm 0.03$ and $y_t = 1939 \pm 25$. We define the quantity $b_{ww'}$, characterizing the difference between the simulated sample of papers written by women (w′) and the actual sample of papers written by women (w). We measure $b_{ww'}$ by fitting the data presented in Fig. 2 from 1985 onwards with the same procedure as described earlier. We find $b_{ww'} = 0.896 \pm 0.009$; that is, women systematically receive 10.4 ± 0.9% fewer citations than would be expected given the properties of their papers.

To check the consistency of the bias presented here (which amounts to ~10%) with the uncorrected gender difference (which amounts to ~6%), we replace the measured number of citations for papers authored by women with the predicted number of citations. With this experiment, we measure what the difference in the number of citations would be if there were no gender bias between papers written by men (m) and women (w′). We measure this value to be $b_{mw'} = 0.958 \pm 0.008$. In other words, if there were no gender bias, we would expect men in our sample to receive ~4% fewer citations than women, purely from the differences in the properties of their papers. However, we detect that men actually receive ~6% more citations. These two effects together add up to the ~10% difference that we see between the expected and measured number of citations received by women.

Gender identification is of crucial importance in our analysis. We gather the first names of authors and run each name through multiple publicly available databases to determine the gender. This is not possible if an author uses only initials throughout their publication history. Even if there is no difference between men and women in the tendency to use only initials, we will preferentially miss women because they are likely to be younger (see Methods). We therefore expect to recognize the gender of only the more established women in the field. This potentially also contributes to the expectation that women should receive ~4% more citations than men in our sample. We note that if we are indeed less effective in recognizing papers written by junior female astronomers, fully accounting for this effect would probably increase the observed difference in citation counts between women and men in astronomy.

Diversity is essential to delivering excellence in science. A pool of researchers from a wide range of backgrounds, experiences and perspectives maximizes creativity and innovation14,15,16. However, we find clear indications of the existence of gender bias in astronomy. The result that women receive 10% fewer citations than expected, given the non-gender-specific properties of their papers, is based on all the available data that we could acquire. Additional analysis of other properties of the papers and authors (such as self-citation tendency, collaboration network, funding situation and conference attendance) is necessary to quantify gender bias further. We therefore encourage the community to work on and/or enhance our dataset, which we make publicly available (https://github.com/nevencaplar/Gender_Bias) for future investigation.

Methods

Data sources

To obtain a list of all published papers in the field of astronomy, we downloaded from the SAO/NASA ADS (http://adswww.harvard.edu/) all the entries available in the database ‘astronomy’ published between 1950 and 2015 in one of the following five journals: Astronomy & Astrophysics, Astrophysical Journal, Monthly Notices of the Royal Astronomical Society, Nature and Science. We choose these five journals because they encompass the vast majority of astronomical research today and are well-established journals with long historical records. The SAO/NASA astronomy API service provides many types of metrics for each paper. Specifically, we chose to download the names of the authors and their institutions, the number of citations, the number of references, the name of the publishing journal, the abstract of the paper and the year of publication. All information was downloaded in a single effort in June 2016 and therefore the number of citations for every paper reflects the state of the metric at that point in time.

We augmented the data with information available from the arXiv database (https://arXiv.org/) for papers where such data exist. For each paper found in the arXiv database, we recorded the designated field (‘astrophysics of galaxies’, ‘cosmology and nongalactic astrophysics’, ‘Earth and planetary astrophysics’, ‘high energy astrophysical phenomena’, ‘instrumentation and methods for astrophysics’, ‘solar and stellar astrophysics’) and downloaded the *.tex source file when possible from the Amazon S3 server (http://arxiv.org/help/bulk_data_s3) to determine the length of the paper and the number of equations and floats in the paper.

We use the following procedure to determine the length and subfield of each paper. When the *.tex files are available, we run the tool TeXcount (http://app.uio.no/ifi/texcount/) with the default settings to obtain the number of words, floats, equations and mathematical expressions embedded in the text of each paper. For some papers with multiple *.tex source files, the tool fails or measures an implausibly small number of words (<500); we ignore the measurements for these papers in further analysis.

To estimate the subfield of the papers for which an arXiv classification is not available, we train a random forest algorithm on the sample of papers for which both the field classification and an abstract are available. We achieve a classification accuracy of about 80%. Reassuringly, the misclassification is often between similar categories, such as between ‘cosmology and nongalactic astrophysics’ and ‘astrophysics of galaxies’, or between ‘Earth and planetary astrophysics’ and ‘solar and stellar astrophysics’. If we exclude these similar misclassifications, the accuracy increases to ~90%. We then use this algorithm to assign all remaining papers to their field of research.
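A minimal sketch of such an abstract-based classifier. The TF-IDF bag-of-words representation is our assumption (the text does not specify how abstracts were vectorized), and the abstracts and labels below are toy stand-ins.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy abstracts and subfield labels, repeated to give the forest
# something to learn from; the real training set is ~100,000 papers.
abstracts = [
    "dark energy constraints from the cosmic microwave background",
    "galaxy stellar mass function at high redshift",
    "accretion disks around stellar mass black holes",
    "spectroscopy of solar flares and stellar activity",
] * 25
labels = ["cosmology", "galaxies", "high-energy", "solar-stellar"] * 25

# Vectorize the abstracts, then classify with a random forest.
clf = make_pipeline(TfidfVectorizer(),
                    RandomForestClassifier(n_estimators=50, random_state=0))
clf.fit(abstracts, labels)

# Assign a subfield to a paper without an arXiv classification.
pred = clf.predict(["measurement of the cosmic microwave background power spectrum"])
```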

To simplify and categorize the institutional information for each paper, we determine the country of the first author’s institution. In total, 85% of the papers include institutional information. We develop a list of about 100 keywords, each of whose appearance in the affiliation string uniquely determines the country of origin. This list includes different spellings of country names, country codes, US state names and abbreviations, and university and research institution names. Matching the affiliation strings against this list enables us to assign 97% of papers with affiliations uniquely to a country. To simplify this information further, we assign the institutions to three categories: North America, Europe and other. We experiment with different classifications and find that they have a minimal effect on our conclusions.
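A toy version of this keyword matching is sketched below: each keyword uniquely implies a country, and countries are then folded into the three regions. The keywords and country/region tables here are illustrative, not the actual ~100-entry list.

```python
# Illustrative keyword -> country table (the real list has ~100 entries,
# including country codes, US state names and institution names).
COUNTRY_KEYWORDS = {
    "caltech": "USA", "arizona": "USA", "cambridge, ma": "USA",
    "eth zurich": "Switzerland", "heidelberg": "Germany",
    "tokyo": "Japan", "cape town": "South Africa",
}
# Countries folded into the three categories used in the analysis.
REGION = {
    "USA": "North America",
    "Switzerland": "Europe", "Germany": "Europe",
    "Japan": "other", "South Africa": "other",
}

def region_from_affiliation(affiliation):
    """Return the region for an affiliation string, or None if unmatched."""
    text = affiliation.lower()
    for keyword, country in COUNTRY_KEYWORDS.items():
        if keyword in text:
            return REGION[country]
    return None  # unresolved affiliations are dropped from the analysis
```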

Determining the first author’s gender is complex because many authors publish using their initials instead of their full first names. We partially mitigate this problem by matching first and last names with the initials of all authors in the dataset. In this way, we are able to determine the first name of an author even if they provide only initials in a particular paper, as long as they use their full first name at least once during their publishing career. We took special care to ensure a one-to-one correspondence between each initials-only record and the corresponding full-name record. In many cases, the second and third given names (middle names) help to identify the unique full name corresponding to the initials. As a result of this methodology, we are able to uniquely identify distinct authors across the entire dataset and track their reappearance.
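The core of the initials-resolution step can be sketched as follows: an initials-only record is resolved only when it matches exactly one full first name in the corpus. This simplified sketch omits the middle-name disambiguation described above, and the names are toy data.

```python
from collections import defaultdict

# Toy corpus of (last name, first name or initials) author records.
author_records = [
    ("Doe", "Jane"), ("Doe", "J."), ("Doe", "John"),  # "J. Doe" is ambiguous
    ("Roe", "Richard"), ("Roe", "R."),                # "R. Roe" is resolvable
]

def initials_of(first_name):
    """Reduce a full first name to dotted initials, e.g. 'Jane' -> 'J.'."""
    parts = first_name.replace(".", " ").split()
    return ".".join(p[0] for p in parts) + "."

# Map (last name, initials) -> set of full first names seen in the corpus.
full_names = defaultdict(set)
for last, first in author_records:
    if not first.endswith("."):  # heuristic: a full name, not initials
        full_names[(last, initials_of(first))].add(first)

def resolve(last, first):
    """Return the unique full first name for a record, or None if ambiguous."""
    if not first.endswith("."):
        return first  # already a full name
    candidates = full_names.get((last, first), set())
    return next(iter(candidates)) if len(candidates) == 1 else None
```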

We define the seniority of an author as the number of years that have passed since their first first-author publication. In cases where the exact first paper of an author cannot be identified, owing to possible confusion between authors with the same initials, we do not assign a seniority to that author. In addition, we searched for authors who may have changed their last name by identifying last names that are contained within other last names paired with the same first name. All candidate cases were individually checked to determine whether a change of last name had occurred. With this procedure, we are able to recover records for authors who have added another name to their surname during their publishing career (perhaps due to marriage).
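The seniority definition above amounts to a small lookup over each author's publication record; the author names and years below are toy data.

```python
# Toy record of first-author papers: (author, publication year).
first_author_papers = [
    ("Roe, Richard", 1992),
    ("Roe, Richard", 1999),
    ("Doe, Jane", 2005),
]

# Year of each author's earliest first-author paper.
first_year = {}
for author, year in first_author_papers:
    first_year[author] = min(year, first_year.get(author, year))

def seniority(author, year):
    """Years elapsed since the author's first first-author paper."""
    return year - first_year[author]
```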

After determining the full first name, we match the name against three different databases to determine the gender. First, we look the name up with SexMachine (https://pypi.python.org/pypi/SexMachine/), a Python module. Its database consists of 40,000 names from a wide geographical range that have been classified by native speakers. Second, we search the data available from the United States Social Security Administration and the UK Office for National Statistics, which track the gender of all children born in these countries (https://github.com/OpenGenderTracking/globalnamedata). This dataset consists of about 100,000 names, but does not have the geographical breadth of the first database. If the name is not found in these lists, we look it up in Gender API (https://gender-api.com/), which includes nearly 2,000,000 names. If a given first name consists of several names, we check the gender of each and weight the final gender assignment accordingly.
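The cascading lookup can be sketched as below. The three databases are mocked as plain dictionaries queried in the order described above, and for compound first names we take a simple majority vote; the exact weighting scheme used in the paper is not specified here, so that choice is illustrative.

```python
# Mock stand-ins for the three gender databases, queried in order.
DB_SEXMACHINE = {"maria": "f", "ivan": "m"}
DB_SSA_ONS = {"ashley": "f", "james": "m"}
DB_GENDER_API = {"neven": "m"}

def lookup(name):
    """Return 'm'/'f' from the first database that knows the name."""
    key = name.lower()
    for db in (DB_SEXMACHINE, DB_SSA_ONS, DB_GENDER_API):
        if key in db:
            return db[key]
    return None

def gender_of(first_names):
    """Majority vote over the parts of a (possibly compound) first name."""
    votes = [g for g in (lookup(part) for part in first_names.split()) if g]
    if not votes:
        return None  # gender unrecognized; paper is dropped from the sample
    return max(set(votes), key=votes.count)
```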

Cleaning and finalizing the dataset

Supplementary Fig. 1 shows the number of papers in our sample. The upper panel shows, as a function of time, the number of published papers per year, the number of papers for which we were able to recognize the gender of the first author (the main sample we discuss in this work) and the numbers of papers published by men and by women. The lower panel shows the same information as fractions: of papers with recognized gender, of papers authored by men and of papers authored by women. We are able to recognize gender for a large fraction of papers, from ~60% in the 1960s and 1970s, increasing to 75–80% from the 1980s to the 2010s. The fraction of recognized papers has decreased slightly in the last few years as the fraction of authors who have published only one or a few papers has increased; for these authors, it is less likely that a full first name is available from any of their papers. We also note the slow but steady increase in the fraction of papers written by women, from <5% in the 1960s to ~25% in 2015. This trend is consistent with the overall increase in women faculty members in astronomy departments17.

Completeness of seniority

We define seniority as the number of years since the author’s first first-author publication. As only papers published after 1950 are included in our analysis, we may not determine seniority accurately in the early time period. Supplementary Fig. 2 shows the fraction of papers whose first authors published their first paper before 1960 and before 1965. We find that by 1978, 90% of all papers have a first author whose first publication appeared after 1965; that is, from 1978 onwards our seniority estimates are complete at least at the 90% level.

Constructing samples and training the random forest algorithm

We characterize the papers by using the following non-gender-specific properties: the seniority of the first author; the number of references; the number of authors; the year of publication; the journal of publication; the field of study; and the region of the first author’s institution. We do not use paper properties that do not span the whole dataset (such as the number of words in a paper), as we aim to characterize the evolution of gender bias through time. We remove papers that lack any of this information. In particular, we remove 22,685 papers for which no institutional region is available. Our results do not change significantly if we instead retain these papers and remove the geographical information from the set of parameters in the analysis.

We create a training and a testing subsample by randomly drawing papers from the total dataset of papers written by men. We create the testing subsample so that it contains the same number of papers as the sample of papers written by women in each year. This assures that the estimates of the error in each year are comparable between the testing subsample and the sample of papers written by women.

We then search for the optimal parameters of the random forest algorithm (the number of trees, the minimum leaf size and the number of parameters considered at each split). We use the trained random forest model on the testing subsample to generate a mock number of citations, and then use these citations in the procedure described in the main text to evaluate the difference between the training and testing sets. We choose the values that show no difference between the training and testing sets (number of trees = 50, maximum fraction of features considered at each split = 80%, minimum leaf size = 20). The code is openly available. We checked our results by running the code 40 times with different randomly selected training and testing subsamples and found them to be robust. We use the scikit-learn Python package18 for this analysis, but we also confirm that the results are unchanged when using the Wolfram Mathematica implementation of the random forest algorithm19. The most important predictive features in the dataset, measured with the Gini importance estimator20, are the number of references, the year of publication and the journal of publication.
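A sketch of such a hyperparameter search and importance measurement on mock data is given below. The grid values bracket the quoted optimum (50 trees, 80% of features per split, minimum leaf size of 20); the feature matrix is synthetic, and we use scikit-learn's impurity-based `feature_importances_` as the 'Gini importance'.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the seven paper properties; by construction,
# column 1 drives the mock citation count most strongly.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 7))
y = 3.0 * X[:, 1] + X[:, 3] + rng.normal(size=500)

# Cross-validated search over the forest hyperparameters.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={
        "n_estimators": [50],
        "max_features": [0.8],          # fraction of features per split
        "min_samples_leaf": [10, 20, 40],
    },
    cv=3,
)
grid.fit(X, y)

# Impurity-based importances of the best model; with this mock target,
# column 1 should rank first.
importances = grid.best_estimator_.feature_importances_
```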

Global trends of the sample

Supplementary Fig. 3 shows a few global average trends of publications in astronomy and highlights the differences between papers written by men and women. Supplementary Fig. 3a plots the average number of references in papers as a function of the year of publication. We note the strong increase in the number of references per paper over time: the average increased from about 10 per paper in the 1960s to ~60 today, a ~500% increase. The striking feature of Supplementary Fig. 3a is the difference in the average number of references between papers authored by men and papers authored by women. From 1980 onwards, we find a clear trend that papers written by women contain 7 ± 3% more references than papers written by men.

Supplementary Fig. 3b shows the average seniority of first authors in a given year. As seniority is determined from the first publication found in our database, we are not able to determine it accurately before 1978, when we reach 90% completeness; trends before that date must therefore be interpreted with care. We find an average seniority of ~7 years in 1980 for both genders. After that point, the average seniority of papers written by men increases steadily, whereas the average seniority of papers written by women remains roughly constant.

Supplementary Fig. 3c compares the average number of citations received by papers written by men and by women as a function of publication year. Up to 2000, the average number of citations slightly increased for papers written by both men and women. The recent downturn can be explained by the trivial effect that not enough time has passed for recent papers to be cited. Overall, we find an indication that papers written by men have, on average, a higher citation count than papers written by women. Investigation of this effect forms the main part of our analysis.

Supplementary Fig. 3d investigates gender representation in the selected journals. We find that women tend to be under-represented in the most prestigious journals, which also tend to yield the most citations. Around 1980, the fraction of papers written by women was similar in all journals, at ~10%. By 2015, the fraction of papers written by women had increased in all journals, but more significantly in Astronomy & Astrophysics, Astrophysical Journal and Monthly Notices of the Royal Astronomical Society (to ~25%) than in Nature and Science (to ~17%). As an example of the difference between the journals, papers published in Science and Nature in 2000 received twice as many citations as papers published in Astronomy & Astrophysics and Monthly Notices of the Royal Astronomical Society, whereas Astrophysical Journal papers received ~40% more citations than those in Astronomy & Astrophysics and Monthly Notices of the Royal Astronomical Society.

Data availability

The final dataset and the random forest algorithm code are available at https://github.com/nevencaplar/Gender_Bias. Additional data or code are available from the corresponding author upon reasonable request.

How to cite this article: Caplar, N., Tacchella, S. and Birrer, S. Quantitative evaluation of gender bias in astronomical publications from citation counts. Nat. Astron. 1, 0141 (2017).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Budden, A. E. et al. Double-blind review favours increased representation of female authors. Trends Ecol. Evol. 23, 4–6 (2008).

2. Moss-Racusin, C. A., Dovidio, J. F., Brescoll, V. L., Graham, M. J. & Handelsman, J. Science faculty’s subtle gender biases favor male students. Proc. Natl Acad. Sci. USA 109, 16474–16479 (2012).

3. Wennerås, C. & Wold, A. Nepotism and sexism in peer-review. Nature 387, 341–343 (1997).

4. Conley, D. & Stadmark, J. Gender matters: a call to commission more women writers. Nature 488, 590 (2012).

5. West, J. D., Jacquet, J., King, M. M., Correll, S. J. & Bergstrom, C. T. The role of gender in scholarly authorship. PLoS ONE 8, e66212 (2013).

6. Ghiasi, G., Larivière, V. & Sugimoto, C. R. On the compliance of women engineers with a gendered scientific system. PLoS ONE 10, e0145931 (2015).

7. Larivière, V., Ni, C., Gingras, Y., Cronin, B. & Sugimoto, C. R. Bibliometrics: global gender disparities in science. Nature 504, 211–213 (2013).

8. Davenport, J. R. A. et al. Studying gender in conference talks — data from the 223rd Meeting of the American Astronomical Society. Preprint at https://arxiv.org/abs/1403.3091 (2014).

9. Pritchard, J. et al. Asking gender questions: results from a survey of gender and question asking among UK astronomers at NAM2014. Astron. Geophys. 55, 8–12 (2014).

10. Patat, F. Gender systematics in telescope time allocation at ESO. The Messenger 165, 2–9 (2016).

11. Reid, I. N. Gender-correlated systematics in HST proposal selection. Publ. Astron. Soc. Pacif. 126, 923–934 (2014).

12. Women, Minorities, and Persons with Disabilities in Science and Engineering (National Science Foundation, 2015).

13. Breiman, L. Random forests. Machine Learning 45, 5–32 (2001).

14. Hong, L. & Page, S. E. Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proc. Natl Acad. Sci. USA 101, 16385–16389 (2004).

15. Diversity makes better science. APS Observer (27 April 2012).

16. Page, S. E. The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies (Princeton Univ. Press, 2007).

17. Women Among Physics and Astronomy Faculty (American Institute of Physics Statistical Research Center, 2013).

18. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Machine Learning Res. 12, 2825–2830 (2011).

19. Mathematica, Version 11.1 (Wolfram Research, 2017).

20. Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification and Regression Trees (Chapman & Hall/CRC, 1984).

21. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

Acknowledgements

We thank J. Woo for giving detailed comments on the manuscript. We acknowledge the stimulating comments given to us by M. Urry, R. Schubert, R. Marino, B. Trakhtenbrot, I. Moise and E. Pournaras. We thank A. Bluck for proofreading the manuscript. We acknowledge support from the Swiss National Science Foundation. This research made use of the National Aeronautics and Space Administration’s Astrophysics Data System, the arXiv.org preprint server and the Python plotting library Matplotlib21.

Affiliations

1. Institute for Astronomy, Department of Physics, ETH Zurich, CH-8093 Zurich, Switzerland

• Neven Caplar
• , Sandro Tacchella
•  & Simon Birrer


Contributions

N.C. initiated the project and carried out the data analysis. S.T. created the name-matching algorithm and prepared the sample. S.B. created the algorithm that matched the authors with their geographical location. N.C. and S.T. wrote the paper. All authors discussed the results and commented on the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Neven Caplar.

Supplementary Information

Supplementary Figures 1–3 and Supplementary Table 1.