Gender diversity of research consortia contributes to funding decisions in a multi-stage grant peer-review process

This study seeks to draw connections between the grant proposal peer-review and the gender representation in research consortia. We examined the implementation of a multi-disciplinary, pan-European funding scheme—EUROpean COllaborative RESearch Scheme (2003–2015)—and the reviewers’ materials that this generated. EUROCORES promoted investigator-driven, multinational collaborative research in multiple scientific areas and brought together 9158 Principal Investigators (PI) who teamed up in 1347 international consortia that were sequentially evaluated by 467 expert panel members and 1862 external reviewers. We found systematically unfavourable evaluations for consortia with a higher proportion of female PIs. This gender effect was evident in the evaluation outcomes of both panel members and reviewers: applications from consortia with a higher share of female scientists were less successful in panel selection and received lower scores from external reviewers. Interestingly, we found a systematic discrepancy between the evaluative language of written review reports and the scores assigned by reviewers that works against consortia with a higher share of female participants. Reviewers did not perceive female scientists as being less competent in their comments, but they were negatively sensitive to a high female ratio within a consortium when scoring the proposed research project.


Introduction
Gender is one of the strongest social constructs employed for stereotyping others (Wood and Eagly, 2012; Haines et al., 2016). When men succeed in a domain that is culturally defined as masculine, their success is attributed to innate ability and talent, while women's success is typically attributed to external factors, including luck (Swim and Sanna, 1996). Words such as intelligent and competent are found within the cluster of positive traits of men (Abele and Wojciszke, 2007) but not of women (Eagly and Karau, 2002; Carli et al., 2016). As a result, women are generally accorded less status and power than men (Harris, 1991; Diekman and Eagly, 2000; Fiske et al., 2002).
Academia is no exception to this behaviour. A number of studies point out that women are more often profiled as being less capable and skilled in the sciences (Foschi et al., 1994; Moss-Racusin et al., 2012; Chubb and Derrick, 2020), and are, thereby, devalued as scientific leaders (Ellemers et al., 2012). This, in turn, can negatively affect peer evaluations of women's scientific merit and success, systematically downgrading their qualifications and amplifying disparities in access to the resources (primarily research funding) needed to conduct science (Bedi et al., 2012; Bornmann et al., 2007; Alvarez et al., 2019; Huang et al., 2020). Such obstacles to research funding can reduce the presence of women in positions of scientific leadership and have important secondary effects on their mobility and career advancement (Bloch et al., 2014; Geuna, 2015).
Scholars refer to this phenomenon as gender bias in grant allocation. The term 'bias' is generally used to describe the representations that are produced by sub-optimal decision-making based on systematic simplifications and deviations from the tenets of rationality (Cosmides and Tooby, 1994; Haselton et al., 2016; Korteling et al., 2017). As such, an individual may attribute certain attitudes and stereotypes to another person based on observable characteristics, in this case gender. Previous research on gender bias in the peer review of grant applications provides mixed and inconclusive evidence. Some studies found a gender bias (Wenneras and Wold, 1997; Viner et al., 2004; Husu and De Cheveigné, 2010); others did not (Bazeley, 1998; Ley and Hamilton, 2008; Ceci and Williams, 2011; Lawson et al., 2021). The existing studies generally focus on a single stage in the evaluation (often those involving panel experts or external reviewers), on applications made at the individual level, and on a single country (see Table 1 for details).
Here we examined the implementation of a multi-disciplinary, pan-European funding scheme, the EUROpean COllaborative RESearch (EUROCORES) Scheme, and the reviewers' materials that this generated. Our research contributes to the debate on "gender in science" by investigating the associations between the proportion of female PIs within a consortium and project evaluation, through all stages of peer review, and for all scientific domains between 2003 and 2015.
Our study has three main objectives. First, we examined whether a potential gender bias is directed also against groups of individuals (i.e., research consortia), and not only against the individual PI, as revealed in previous studies. Second, EUROCORES data allowed us to examine whether this gender effect is evident in the decision-making process of groups of individuals (expert panels) and single evaluators (external reviewers), and how the effect propagates through the subsequent stages of the evaluation process. Third, we provided a lexicon-based sentiment analysis of the written reports of external reviewers to examine whether the sentiment polarity and the rate of emotion in the review texts are consistent with the review scores.
Our study helps to explain the persistent underrepresentation of women in top academic roles, with important implications for institutions and policy makers.

Data and methods
The EUROCORES scheme. The data for our analyses are drawn from the multi-stage peer-review evaluation process of the EUROCORES scheme (2003–2015). The aim of this scheme was to promote cooperation between national funding agencies in Europe by providing a mechanism for the collaborative funding of research on selected priority topics in and across all scientific domains. EUROCORES was based on a number of Research Programmes, the topics of which were selected through an annual call for themes, and which, in turn, comprised a number of Collaborative Research Projects (CRPs). Each CRP included at least three Individual Projects (IPs), each led by a Principal Investigator (PI) affiliated with a European university. CRP consortia worked together on a common work-plan and towards the common goals set out in their Outline Proposal (OP) and, subsequently, in their Final Proposal (FP). EUROCORES, which was terminated in 2015, supported 47 Research Programmes in different areas of research with an overall budget of some 150M euros.
The evaluation process of the EUROCORES scheme consisted of three consecutive stages (Fig. 1). At the first stage, the Expert Panels were responsible for evaluating Outline Proposals prior to inviting a consortium to submit a Final Proposal. Evaluations took place in separate face-to-face meetings over 1-2 days. Panels were made up, on average, of 12 members and each member was a spokesperson for three proposals (i.e., for at least nine Individual Projects). Panel decisions were made through consensus building, a process involving the interaction of peers within a scientific group. At the second stage, Final Proposals were received from selected applications and were sent for written reports to (at least) three anonymous referees. Reviewers were asked to complete a standardized evaluation form comprising 8-10 sections, each focusing on different criteria such as scientific quality, project feasibility, team interdisciplinarity, and others. Typically, the referee had 4 weeks to provide a written assessment of the proposal. For each section, a score was assigned on a 5-point Likert scale and comments were made in a dedicated space (at least 100 words for each question). One reviewer was responsible for evaluating just one project on each occasion. Thus, the overall evaluation was the result of an individual decision-making process with no (particularly strong) time constraints. At the third stage, once all the review reports had been received, expert panel members (the same as in the first stage of the evaluation) met a second time to make a decision based on the application, the referees' comments, the replies from applicants and open discussion. At this stage, panel members selected applications to be recommended for funding. The overall success rate for EUROCORES grant applications was approximately 13%.
In this study, we examined the effect of the gender composition of the research consortia on grant evaluation decisions at three stages of the review process: first expert panel evaluation (1st stage); external reviewer evaluation (2nd stage); and second expert panel evaluation (3rd stage).
The sample. Our raw data contained information on 10,533 applicants, who teamed up to submit 1642 Outline Proposals [886 accepted and 756 rejected; success rate: 53%] and 886 Final Proposals [223 accepted and 663 rejected; success rate: 25%], and on 2182 external reviewers and 491 panel members throughout the three stages of the evaluation process. For each CRP application, which is our unit of observation, the data contain the name, year of birth, gender and institutional affiliation of each PI and of the evaluators (panel members and external reviewers), the application dates, the amount of funding requested, and the review reports and scores given by the evaluators at the corresponding stages.
The original data required some pre-processing. First, about 20% of the original observations were incomplete (e.g., they were missing the applicant's gender, age, or affiliation). Thus, where possible, we manually retrieved the missing information from each researcher's personal and/or institutional web page. Second, the sample for the analysis was limited to proposals with complete information for all consecutive stages. Hence, the final sample consists of 1347 CRPs from 9158 unique applicants [sampling fraction: 82% and 87%, respectively]; 467 individual panel members [95%]; and 1862 written reports from external reviewers [85%]. Project selection occurred in the first and third phases. The success rate in the first phase (OP to FP) was 38% (n = 511 projects; n = 3579 applicants), and in the third phase about 60% (n = 306 projects; n = 2200 applicants). A cursory glance at gender statistics suggests that female participation declined in each consecutive stage: 19.8% in the outline proposals, 17.5% in the final proposals and 16.7% after the second expert panel evaluation. The share of female evaluators is roughly 20% (Fig. 2).
Statistical analysis. We explored the association between the proportion of female PIs in EUROCORES research consortia and the decisions made by different evaluators throughout the three stages of the evaluation process.
The decisions of the panel experts were binary (i.e., selected vs. not selected) while the external reviewers assigned scores on a 5-point Likert scale and justified these scores in a short written submission (reviewer's report). Thus, we used probit (for the first and third stages) and ordinary least-squares (for the second stage) models with standard errors clustered at the research programme level. The dependent variable in the second stage of the peer-review process is only observable for a portion of the data, i.e., those proposals that passed the first stage. The sample selection bias that might result from the sequentiality of the evaluation process was addressed with Heckman's two-step estimation procedure (Heckman, 1979; Puhani, 2000).
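The building blocks of these models can be illustrated with a minimal sketch. It is not the estimation code used for this study: the function names and the value of the linear index are hypothetical, and the sketch only evaluates the standard probit selection probability, the marginal effect of one covariate, and the inverse Mills ratio that Heckman's two-step procedure adds as a regressor in the second-step least-squares equation.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def probit_probability(linear_index: float) -> float:
    """Probit model: P(selected = 1 | x) = Phi(x'b)."""
    return STD_NORMAL.cdf(linear_index)


def probit_marginal_effect(linear_index: float, coefficient: float) -> float:
    """Change in P(selected) per unit change in one covariate: phi(x'b) * b."""
    return STD_NORMAL.pdf(linear_index) * coefficient


def inverse_mills_ratio(linear_index: float) -> float:
    """Heckman correction term phi(x'b) / Phi(x'b) for a selected observation."""
    return STD_NORMAL.pdf(linear_index) / STD_NORMAL.cdf(linear_index)


# Hypothetical first-stage linear index and coefficient (illustrative values only).
index, coefficient = 0.0, -0.2
print(probit_probability(index))                    # 0.5
print(probit_marginal_effect(index, coefficient))   # negative: lowers P(selected)
print(inverse_mills_ratio(index))                   # added as a regressor in step two
```

In a full two-step estimation, the linear index would come from the fitted first-stage probit, and the inverse Mills ratio would be computed for each proposal that passed the first stage.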
For the first and third stages, we are interested in the factors, in primis the gender composition of a consortium, that influence the likelihood of being selected by panel experts. For the second stage of selection (external reviewers), we modelled both the reviewers' scores and the sentiments associated with their reviews, and tested for inconsistencies between scores and sentiments.
More specifically, we determined sentiment polarity in the reviewers' reports using the VADER (Valence Aware Dictionary and sEntiment Reasoner) algorithm (Hutto and Gilbert, 2014) and developed a list of evaluative terms for both project and applicants using the Word2vec model (Mikolov et al., 2013). The general sentiment analysis tool VADER captured the emotional polarity (positive/negative) and intensity (strength) of reviews in relation to the applicants and their research proposals. Alternative algorithms were also tested (i.e., Syuzhet and sentimentR).
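To make the idea of a lexicon-based polarity score concrete, the sketch below shows a toy scorer. It is not VADER and not the analysis code of this study: the four-word lexicon is invented for illustration, and VADER's rules for negation, intensifiers and punctuation are omitted. The only element borrowed from VADER is the normalization x / sqrt(x^2 + alpha) with alpha = 15, which is our understanding of how its reference implementation maps a summed valence to a compound score in (-1, 1).

```python
import math
import re

# Toy valence lexicon, invented for illustration (VADER ships a far larger one).
TOY_LEXICON = {"excellent": 3.0, "strong": 2.0, "weak": -2.0, "poor": -2.5}


def normalize(raw_score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


def toy_compound(text: str) -> float:
    """Sum the valences of known words, then normalize the total."""
    tokens = re.findall(r"[a-z]+", text.lower())
    raw = sum(TOY_LEXICON.get(token, 0.0) for token in tokens)
    return normalize(raw)


# Valences 3.0 and -2.0 sum to 1.0, and 1 / sqrt(1 + 15) = 0.25.
print(toy_compound("An excellent team but a weak work plan"))  # 0.25
```

A score near zero indicates neutral or mixed language, while values approaching 1 or -1 indicate strongly positive or strongly negative wording.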
The evaluative terms were obtained through a Word2vec Skip-Gram model (Mikolov et al., 2013), trained on the corpus of reviewers' reports. Technically, the words of a vocabulary of size V were positioned in a D-dimensional space. Hence, each word was represented by a D-dimensional continuous vector, i.e., the word representation. Words that tend to appear close to each other in the corpus have similar vector representations. This property allowed us to identify a list of lexical features (bi-grams) that co-occurred with high probability in a context window surrounding three terms: 'principal investigator', 'consortium' and 'team' (see Note 1). The identified bi-grams refer to the adjectives and adverbs used by reviewers to judge PIs and consortia, and include evaluative terms such as 'internationally recognized', 'highly qualified', or 'world-renowned'. These terms are similar to those identified in previous studies on textual analysis of reviewers' reports (see, e.g., Magua et al., 2017). We arbitrarily selected the most frequent 30 attributes and created a binary indicator taking a value of 1 if a review included at least one from the list, 0 otherwise (see Note 2).
All models included fixed effects for the year, scientific domains, and research programmes, and a comprehensive set of covariates. We retrieved individual data from the Web of Science (WoS) and constructed several consortium-wide bibliometric indicators, including scientific productivity (total number of publications of the consortium normalized by the consortium size, in logs), participation of highly cited scientists, measures of research diversity and interdisciplinarity (Blau index), cognitive proximity (share of overlapping WoS subject categories between consortium and evaluators) and network proximity (if the consortium and evaluators have at least one common co-author prior to the grant application).
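For readers unfamiliar with the Blau index, it equals one minus the sum of squared category shares. The sketch below is a minimal illustration under the assumption that each consortium member is assigned a single category label (for example, a WoS subject category); the exact construction used in this study is described in the SI material, and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def blau_index(categories: Iterable[Hashable]) -> float:
    """Blau diversity index: 1 - sum of squared shares of each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one category label")
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical consortium of four members: shares 0.5, 0.25, 0.25.
print(blau_index(["Physics", "Physics", "Biology", "Chemistry"]))  # 0.625
```

The index is 0 when all members share one category and approaches 1 as members spread evenly over many categories.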
We also considered other factors that might influence evaluation decisions: the seniority of the consortium (average age of consortium members, in logs), its size and institutional reputation (whether the consortium includes at least one member affiliated with a university belonging to the Top 100 Shanghai Ranking), partnerships with the private sector (whether the consortium includes at least one member with a private sector affiliation), the size of the budget requested, experience with previous EUROCORES grants (whether the consortium includes at least one member having a EUROCORES-granted project prior to application), and the number of participating countries within a consortium. Finally, we included some characteristics of the evaluators, including gender, scientific productivity, age, and institutional reputation. We also took into account panel workload and the number of panel members.
Full details on variable construction, definitions, and descriptive statistics can be found in the SI material (Tables S1 and S2).

Results
First stage: expert panels select OPs. At this stage, we modelled the probability for an outline proposal to pass the first panel selection. Our data provide compelling evidence that consortia with a higher proportion of female PIs were at a comparative disadvantage in this first evaluation [Estimate: −0.213; std. error: 0.081; p-value < 0.01]. The most complete model specification (Table 2, Column 3) suggests that a 1% increase in the share of female scientists significantly reduced the likelihood of advancing to the second stage by approximately 0.2%.
Other factors played a significant role in selection. For example, consortium size and the number of different participating countries were important determinants of success. Cognitive proximity between applicants and panel experts, along with prior success with EUROCORES applications, also had a positive influence on panel decisions. In contrast, consortia with a diverse research background and in close proximity to the panel members' collaboration network were more likely to find their application rejected. Apart from panel size and workload, the other panel characteristics had no significant impact on their evaluations.
Second stage: external reviewers assess FPs. At this stage, full proposals were sent for external review to at least three anonymous referees. We considered the average of the scores assigned to the scientific quality of the proposed research and the qualifications of the PIs as a proxy for scientific merit. This approach is justified as the format of the two questions related to the scientific quality of the proposal and the qualifications of the PIs remained unchanged throughout the period, while the format of the other questions underwent some minor changes, making them unsuitable for statistical analysis. We constructed reviewer sentiment measures and drew up a list of evaluative terms for the corpus of reviewers' reports, and explored the extent to which scores and language patterns in the reports differed for consortia with a high proportion of female scientists.
First, we found a negative relationship between gender and the reviewers' scores. Our data (Table 3, Column 6) show that a 1% increase in the proportion of female PIs within a consortium resulted in a 0.356% fall in the scores received [Estimate: −0.356; std. error: 0.175; p-value < 0.05]. The data also confirm that teams with greater scientific productivity scored higher, and evaluators penalized consortia closer to their areas of expertise, in line with previous research (Boudreau et al., 2016).
Second, we found a discrepancy between the scores and the written assessments contained in the reviewers' reports. Indeed, the analysis showed that the valence scores of the review corpus were neither positively nor negatively affected by the gender composition of the consortia (Table 4) [VADER estimate: 0.024; std. error: 0.071; p-value > 0.10; Presence of evaluative terms: 0.042; std. error: 0.091; p-value > 0.10]. Hence, sentiment scores as well as the presence of evaluative terms were largely unrelated to review scores. Reviewers did not perceive female PIs as being less competent in their comments; however, they were negatively sensitive to a high female ratio within a consortium when scoring the proposed research project. These discrepancies between the quantitative (scores) and qualitative (sentiments and evaluative terms) aspects of the reviewers' reports imply that even though evaluation language seems to be similar, consortia with a higher share of female PIs had significantly lower scores.
Third stage: expert panels make funding recommendation. At the third stage, we modelled the probability for a final proposal to receive a recommendation for funding.
As shown in Table 5, we found no direct evidence of an effect of the gender composition of research consortia on panel decisions [Estimate: −0.089; std. error: 0.191; p-value > 0.10]. However, the decisions were strongly associated with the reviewers' scores, thus complying with EUROCORES guidelines that state the reviewers' reports should be considered as constituting the main basis for evaluations. Yet, we have already seen that the reviewers' scores seemed to be biased against consortia with a higher proportion of female members, implying that expert panel decisions may also be indirectly biased against these consortia. Besides the gender dimension, the data also show an important positive role of network proximity consistent with 'old boy' network patterns (Rose, 1989; Travis and Collins, 1991), as identified in previous studies (Wenneras and Wold, 1997; Sandström and Hällsten, 2008).
Table note: All other explanatory variables are defined in Table S1. The three models include different configurations of explanatory variables to test the sensitivity of the estimation. Model (3) is used as the first step in Heckman's selection model. Robust standard errors in parentheses, clustered at programme level; ***, **, * indicate significance at the 1%, 5% and 10% level, respectively.

Discussion
In this article, we examined an original dataset from the EUROCORES grant scheme and explored the factors affecting evaluation outcomes at each stage of the peer-review process. Our analysis reveals a noteworthy interaction between the gender of applicants and peer review outcomes, and provides compelling evidence of a strong negative impact on consortia with a higher representation of female scientists. This gender effect is present in the evaluation outcomes of both panels and external reviewers.
Our results also show that there is a mismatch between text and scores in external review reports that runs against consortia with a higher proportion of female PIs.
There is a growing body of theoretical and empirical research on gender bias. Biases in judgements are part of human nature and are often the results of some heuristics of thinking under uncertainty (Tversky and Kahneman, 1974). For example, the heuristic in evaluating a grant proposal may be to rely too heavily on easily perceived characteristics of applicants, which may be effective in part (think of scientific excellence), but prone to producing biases (Shafir and LeBoeuf, 2002). This does not necessarily mean that the bias is intentional; on the contrary, it may well occur outside of the decision maker's awareness (Kahneman, 2011). Our findings on the external reviewers' evaluations suggest that this might be the case. External reviewers may have had stereotypes that they did not report verbally, and that most likely occurred outside of their conscious awareness and control, but they were, nevertheless, clearly manifest in their scores. In the literature, the term implicit bias is often used to refer to such attitudes and stereotypes in general (Mandelbaum, 2015; Frankish, 2016).
Overall, the findings reported here add to our understanding of gender bias in science by showing that such bias is not solely directed against the individual, as revealed in previous studies, but that it has a more pervasive effect and can involve groups of individuals too. The sequential nature of the grant evaluation process is designed to help screen proposals and applicants based on scientific merit, with subsequent steps functioning as 'filters' to keep the best proposals alive. However, our results show that precisely because of the sequential nature of the process, gender bias at one stage can indirectly influence decisions at subsequent stages.
Although our study emphasized a gender bias in the peer review process, we acknowledge that the effect found could be caused by the concomitance of other factors. Four aspects seem particularly relevant to us. First, the text characteristics of the outline and full proposals. Indeed, we could not measure the quality of the written proposal nor the editing style that might be dependent on the gender composition of a consortium. Although gender bias is a more complex problem than just the differences between men and women in using language, writing style and word choice may have an important impact on grant evaluation and selection processes (Tse and Hyland, 2008; Kolev et al., 2020). Second, we did not have information on the applicants' time allocation in work (e.g., teaching and administrative duties) and family (e.g., childcare and housework). Third, most EUROCORES projects involved interdisciplinary collaborations (e.g., physics and engineering, life and environmental sciences, biomedicine, social sciences and humanities), which made it impractical to investigate how gender bias varied by scientific macro-area. Finally, we cannot exclude the existence of a self-selection bias in the decision of a female scientist to apply for a EUROCORES grant. Future research should address these limitations to achieve a better understanding of the gendered nature of evaluation in peer-review processes.
Table note: The dependent variables are different sentiment scores obtained through different algorithms, the names of which appear in the column header. All other explanatory variables are defined in Table S1. Robust standard errors in parentheses, clustered at programme level; ***, **, * indicate significance at the 1%, 5% and 10% level, respectively.
The EUROCORES Scheme ended in December 2015 after almost 12 years of activity. However, the lessons learned from this scheme are still relevant today, as there are many similar national and international funding schemes managed by different Research Funding Organizations (RFOs) such as the European Research Council (ERC), the National Institutes of Health (NIH), the National Science Foundation (NSF), the French National Research Agency (ANR), and the German Research Foundation (DFG), among others. These RFOs develop and implement assessment procedures similar to those of EUROCORES, relying mostly on external peer review and expert panels to determine successful applicants. Our results are, therefore, relevant to policy makers and RFOs.
Clearly, we endorse calls for a more equitable grant peer review system in order to avoid all forms of conflicts of interest, including cognitive, social and other forms of proximity. Anonymizing the applicant's profile can only be a partial solution to the problem, and one that may not be particularly effective as the identity of the applicants must be known to the evaluator in order to access their bibliometric indicators, for instance their publication and citation records. A more drastic and, perhaps, more effective approach would be to actually inform panel members and reviewers ex-post about any biases observed in their decisions. If the bias exists, then it is there; there is little that can be done to rectify it. Yet, the taking of such an approach might go some way to increasing an evaluator's awareness and to changing their future behaviour.

Data availability
We obtained EUROCORES grant records from the European Science Foundation (ESF). EUROCORES data are not publicly available. Code for data preparation, variable construction and statistical analysis, as well as details on data access, are available from the corresponding author upon request and/or on GitHub.

Notes
1 Some technical details about the estimation of word representations are in order. First, we removed all reviewers' reports with <15 words, a pre-defined set of stop words, and all words occurring <5 times in the corpus. Second, we merged uni-grams into bi-grams depending on their co-occurrence (threshold equal to 50). Third, we used negative sampling. Hence, we estimated a logit model where the binary dependent variable indicates whether or not two terms are close in the corpus, at distance c. For each observed neighbouring term pair (success), one should add k 'negative samples' (failures). The results presented throughout this article were obtained with the following parameter settings. We set the dimensionality of the dense word representation to 512 dimensions (we tried several dimensionalities: 256, 300, 512 and 1024). We defined a context window (distance c) of 7 words on both sides of the target. For each observed neighbouring term pair, we drew k = 15 negative examples.
2 Most frequent 30 evaluative terms in reviewers' reports: connected internationally; considerable experience; excellent track; highly qualified; highly respected; international connection; international reputation; international standing; internationally competitive; internationally connected; internationally recognized; internationally renowned; leading expert; leading scientists; numerous cooperation; significant contribution; track record; world leader; world leading; world renowned.
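The binary indicator built from the evaluative terms in note 2 could be computed along the following lines. This is a minimal sketch rather than the code used for this study: the function name is hypothetical, only a subset of the listed terms is included for brevity, and the text normalization (lower-casing and treating hyphens and punctuation as spaces) is an assumption.

```python
import re

# Subset of the evaluative terms listed in note 2 (shortened for brevity).
EVALUATIVE_TERMS = (
    "highly qualified",
    "internationally recognized",
    "track record",
    "world leading",
)


def has_evaluative_term(review: str, terms=EVALUATIVE_TERMS) -> int:
    """Return 1 if the review contains at least one listed term, 0 otherwise."""
    # Assumed normalization: lower-case, non-letters collapsed to single spaces.
    normalized = re.sub(r"[^a-z]+", " ", review.lower()).strip()
    padded = f" {normalized} "  # padding enforces whole-word matches
    return int(any(f" {term} " in padded for term in terms))


print(has_evaluative_term("The PI is highly-qualified, with a strong team."))  # 1
print(has_evaluative_term("The work plan is unclear."))                        # 0
```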