# Distributed peer review enhanced with natural language processing and machine learning

## Abstract

While ancient scientists often had patrons to fund their work, peer review of proposals for the allocation of resources is a foundation of modern science. A very common method is that proposals are evaluated by a small panel of experts (due to logistics and funding limitations) nominated by the grant-giving institutions. The expert panel process introduces several issues, most notably the following: (1) biases may be introduced in the selection of the panel and (2) experts have to read a very large number of proposals. Distributed peer review promises to alleviate several of the described problems by distributing the task of reviewing among the proposers. Each proposer is given a limited number of proposals to review and rank. We present the result of an experiment running a machine-learning-enhanced distributed peer-review process for allocation of telescope time at the European Southern Observatory. In this work, we show that the distributed peer review is statistically the same as a ‘traditional’ panel, that our machine-learning algorithm can predict expertise of reviewers with a high success rate, and that seniority and reviewer expertise have an influence on review quality. The general experience has been overwhelmingly praised by the participating community (using an anonymous feedback mechanism).

## Main

All large, ground- and space-based astronomical facilities serving wide communities, such as the European Southern Observatory (ESO), the Atacama Large Millimeter Array, the Hubble Space Telescope and the Gemini Observatory, face a similar problem. In many cases the number of applications they receive at each call exceeds 1,000, posing a serious challenge to running an effective selection process through the classic peer-review paradigm, which assigns proposals to preallocated panels with fixed compositions. Although, in principle, one could increase the size of the time allocation committee, this creates logistic and financial problems, which practically limit the maximum size, making this solution not viable.

Since the referees only have a limited amount of time to perform their task, the heavy load (which at ESO typically exceeds 70 proposals per referee; personal communication, ESO OPC office) has severe consequences for the quality of the review and the feedback that is provided to the applicants. This contributes to increasing levels of frustration in the community and to a loss of credibility for the whole selection process. In addition, although difficult to quantify, it will have consequences for the scientific output of the facilities. Various facilities have considered different measures to alleviate the load on the reviewers. These include quite extreme solutions, like the one deployed by the National Science Foundation (NSF) to limit the number of applications (ref. 1).

In this context, one of the most innovative propositions was put forward in ref. 2 (see also ref. 3, and ref. 4 for further developments). The concept is simple: by submitting a proposal the principal investigator (PI) undertakes to review n proposals submitted by peers, and to have their proposal reviewed by n peers. Also, by submitting m proposals, they undertake to review n × m proposals, hence virtually limiting the number of submissions. We will refer to this concept as distributed peer review (DPR).

The Gemini Observatory deployed DPR for its Fast Turnaround channel (ref. 5), which is capped to 10% of the total time. The NSF also explored this possibility with a pilot study in 2013, in which each PI was asked to review seven proposals submitted by peers (refs. 6 and 7). The NSF pilot was based on 131 applications submitted by volunteers within the Civil, Mechanical and Manufacturing Innovation Division. The NSF did not publish a report following the study. A similar pilot experiment was carried out in 2016 by the National Institute of Food and Agriculture (https://nifa.usda.gov/resource/distributed-peer-review-pilot-foundational-program), but the results were not published in this case either.

We report on an experiment that employed DPR at ESO during period 103 (call for proposals issued on 30 August 2018) in parallel with the regular Observing Programmes Committee (OPC). We took the DPR implementation at Gemini as a model, and enhanced the process using natural language processing and machine learning for referee selection (a different method of using natural language processing for proposal reviews can be found in ref. 8). This experiment also added feedback for individual reviews.

The experiment was designed to test if there is a measurable difference between the DPR and OPC, if the algorithm for referee selection performed well and what referee attributes influenced the quality of the referee report (as judged by the feedback to a review).

In Experiment overview, we describe the general setup of the experiment. Analysis and results is devoted to the statistical analysis, and is divided into Comparison of the DPR with the OPC, Domain knowledge inference for the description of the literature matching and Rating the helpfulness of review comments. The final section is Summary and conclusions.

The Supplementary Information gives a more detailed description of the time allocation process at ESO (Supplementary Section 1), a discussion of the demographics of the experiment (Supplementary Section 2), an extended analysis (Supplementary Section 3) and several datasets related to the experiment (Supplementary appendices). We will refer the reader to the supplementary material where appropriate.

## Experiment overview

We followed a general outline of DPR as described in ref. 2, but differences were introduced in several key areas. Specifically, we tested two different referee selection methods: (1) the first selection method emulates the way in which reviews are currently assigned to members of the OPC to make a comparison with the OPC evaluation of the same proposals; (2) the second method uses automated machine learning to assign referees to proposals (based on the DeepThought knowledge discovery method; see ref. 9). We believe that the advantages of this method are that it scales easily with the number of proposals and that, due to the automated construction, it might circumvent biases in self-efficacy (for example based on gender; see ref. 10). Finally, we also asked the participants to assess the reviews of their proposal to understand what influences a constructive review and potentially reward helpful referees in the future.

We have tested a DPR scheme on a voluntary basis for the ESO period 103. The outcome of this experiment had no influence on the telescope allocation. The DPR programme for ESO period 103 recruited 172 volunteers, with each submitting one participating proposal (this is 23% of the distinct PIs in period 103). These proposals were evaluated in the DPR process as well as with the general OPC methods.

For our experiment, the groups of proposers and referees were the same. We selected eight referees for each proposal using several rules (see ‘Reviewer selection methodology’ in the Methods for an overview of the two selection processes used).

The reviewers accessed the assigned proposals through a web application and were given two weeks to assess them. A detailed description of the review options is given in the Methods.

The proposal quartile and the eight unmodified comments were displayed to the proposer. We finally asked the proposer to evaluate the helpfulness of the comments (with details in ‘Review evaluation’ in the Methods).

Figure 1 summarizes the process as a flow-chart, and a detailed description of the process can be found in Methods.

## Analysis and results

We received complete reviews from 167 of the 172 reviewers (97.1%) by the deadline (with the other five proposals excluded from further analysis). The individual grades from each review were converted into a global rank for each proposal (with details found in Supplementary Information) and translated into quartiles (A, B, C, D).

Many other statistics could usefully be applied to the dataset (some of them are explored in Supplementary Information). We therefore encourage the community to make use of the anonymized dataset (available at https://zenodo.org/record/2634598) to run independent analyses. Our main focus in this study is to address three questions. (1) How different is the DPR from a more traditional OPC review? (2) How well can our algorithm predict expertise for a proposal? (3) What reviewer properties influence the helpfulness of their referee report?

### Comparison of the DPR with the OPC

The proposals used in the DPR experiment were also reviewed through the regular ESO OPC channel. This allows a comparison between the outcomes of the two processes. However, these differ by construction in many aspects, and so a one-to-one comparison is not possible. This is because in the DPR experiment:

1. There is no a priori scientific-seniority selection.

2. The proposals are typically reviewed by Nr > 6 referees (while Nr = 3 in the premeeting OPC process).

3. The number of proposals per referee is much smaller.

4. The set of proposals common to different reviewers is much smaller.

5. There is no triage (an early removal of some proposals before the OPC meeting).

6. There is no face-to-face discussion.

A robust way of quantifying the consistency between two different panels reviewing the same set of proposals is that of the quartile agreement fraction introduced in section 9.2 of ref. 11. Following this concept, we can compute what we will refer to as the panel–panel quartile agreement matrix (QAM). The generic QAM element Mi,j is the fraction of proposals ranked by the first panel in the ith quartile of the grade distribution that were ranked in the jth quartile by the second panel. If we indicate with Ai and Bj the events ‘a proposal is ranked by panel A in quartile i’ and ‘a proposal is ranked by panel B in quartile j’, the QAM elements represent the conditional probability:

$${M}_{i,j}=P({A}_{i}| {B}_{j})=\frac{P({A}_{i}\cap {B}_{j})}{P({B}_{j})}.$$

For a completely aleatory process P(Ai ∩ Bj) = P(Ai)P(Bj), and therefore all the terms of the QAM would be equal to 0.25, while for a full correlation all terms would be null, with the exception of the diagonal terms, which would be equal to 1. We note that the matrix elements are not independent from each other as, by definition,

$$\sum _{i}{M}_{i,j}\equiv \sum _{j}{M}_{i,j}\equiv 1.$$
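
As a quick check of the two limiting cases quoted above, the following short derivation assumes that each panel places exactly one quarter of the proposals in each quartile, so that P(Ai) = P(Bj) = 1/4 (δij is the Kronecker delta, a symbol introduced here for illustration only):

$$\begin{array}{ll}{M}_{i,j}=\displaystyle\frac{P({A}_{i})P({B}_{j})}{P({B}_{j})}=P({A}_{i})=\frac{1}{4} & \text{(independent panels)}\\ {M}_{i,j}=\displaystyle\frac{P({A}_{i}\cap {B}_{j})}{P({B}_{j})}=\frac{{\delta }_{ij}/4}{1/4}={\delta }_{ij} & \text{(identical rankings)}\end{array}$$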

For the purposes of the main part of this paper, we will compare the internal agreement of the DPR panels with that of the OPC (premeeting). Supplementary Section 3.5 gives further comparison between the OPC and DPR statistics.

To do this, we will bootstrap the DPR data, extracting a number of subsets of three randomly chosen referees.

This choice for the DPR set is particularly interesting, as it is directly comparable to the results presented in ref. 11. The procedure is as follows. We first make a selection of the proposals with at least six reviews (164). For each of them we randomly select two distinct (that is non-intersecting) subsets of Nr = 3 grades each, from which two average grades are derived. These are used to compute the agreement fractions between the two subpanels. The process is repeated a large number of times and the average (panel–panel) QAM is finally obtained.
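
The bootstrap described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiment: the function names, the synthetic grades and the tie handling are all hypothetical, lower grades are taken as better, and the matrix is normalized per column so that it matches the conditional probability P(Ai | Bj).

```python
import random
from statistics import mean


def quartile_labels(scores):
    """Quartile index per proposal (0 = best); lower grades are better."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    labels = [0] * len(scores)
    for rank, k in enumerate(order):
        labels[k] = rank * 4 // len(scores)
    return labels


def agreement_matrix(labels_a, labels_b):
    """M[i][j]: share of panel B's quartile-j proposals that panel A put in quartile i."""
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(labels_a, labels_b):
        counts[a][b] += 1
    col_totals = [sum(counts[i][j] for i in range(4)) for j in range(4)]
    return [[counts[i][j] / col_totals[j] if col_totals[j] else 0.0
             for j in range(4)] for i in range(4)]


def bootstrap_qam(grades, n_ref=3, n_iter=200, seed=0):
    """Average QAM over random pairs of disjoint n_ref-referee subpanels."""
    rng = random.Random(seed)
    # Keep only proposals with enough reviews for two disjoint subpanels.
    usable = [g for g in grades if len(g) >= 2 * n_ref]
    total = [[0.0] * 4 for _ in range(4)]
    for _ in range(n_iter):
        avg_a, avg_b = [], []
        for g in usable:
            picked = rng.sample(g, 2 * n_ref)  # distinct reviews, no overlap
            avg_a.append(mean(picked[:n_ref]))
            avg_b.append(mean(picked[n_ref:]))
        m = agreement_matrix(quartile_labels(avg_a), quartile_labels(avg_b))
        for i in range(4):
            for j in range(4):
                total[i][j] += m[i][j]
    return [[total[i][j] / n_iter for j in range(4)] for i in range(4)]


# Synthetic data (made up): 40 proposals, 8 grades each on the 1-5 scale.
_rng = random.Random(1)
synthetic = [[min(5.0, max(1.0, _rng.gauss(q, 0.7))) for _ in range(8)]
             for q in [_rng.uniform(1.5, 4.0) for _ in range(40)]]
qam = bootstrap_qam(synthetic)
```

Each column of `qam` sums to 1 by construction, which mirrors the normalization condition given earlier.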

In Fig. 2, we compare the QAM of our subsets with that of the OPC premeeting panels. The latter was derived for the OPC process for Nr = 3 subpanels (table 3 of ref. 11). In both cases the first-quartile agreement is about 40% (Cohen’s kappa κ = 0.21), while for the second and third quartiles it is ~30%. The top–bottom quartile agreement is 10% (κ = −0.60). The conclusion is that, in terms of self-consistency, the DPR review behaves in the same way as the premeeting OPC process. The two review processes are characterized by the same level of subjectivity.

### Domain knowledge inference

Another aim of the DPR experiment is to infer a referee’s domain knowledge for a given proposal using machine learning. ‘Expertise’ is, unfortunately, not an objective quantity. However, it is reasonable to assume that the self-judgement of expertise (self-efficacy) is a good measure that might approximate such a quantity.

Given

• ‘self-reported’ as the self-reported domain knowledge

• ‘DeepThought’ as the DeepThought-inferred domain knowledge for our experiment

we calculate the conditional probability P(self-reported | DeepThought) using Bayes’s theorem:

$$P(\text{self-reported}\mid \text{DeepThought})=\frac{P(\text{DeepThought}\mid \text{self-reported})\,P(\text{self-reported})}{P(\text{DeepThought})}$$
(1)

$$P(\text{self-reported}\mid \text{DeepThought})=\frac{P(\text{DeepThought}\cap \text{self-reported})}{P(\text{DeepThought})}$$
(2)
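
As an illustration of equation (2), the conditional probabilities can be estimated from paired labels by dividing the joint frequency by the marginal frequency of the inferred label. The sketch below is hypothetical: the function name, the label strings and the counts are invented for the example and are not taken from the experiment's data.

```python
from collections import Counter


def conditional_table(pairs):
    """Estimate P(self-reported | inferred) from (self_reported, inferred) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    inferred_totals = Counter(inferred for _, inferred in pairs)
    table = {}
    for (self_rep, inferred), count in joint.items():
        p_joint = count / n                          # P(inferred and self-reported)
        p_inferred = inferred_totals[inferred] / n   # P(inferred)
        table[(self_rep, inferred)] = p_joint / p_inferred
    return table


# Made-up labels for nine referee-proposal pairs.
pairs = ([("expert", "expert")] * 3 + [("none", "expert")]
         + [("none", "none")] * 4 + [("expert", "none")])
table = conditional_table(pairs)
```

For each inferred label, the entries of `table` sum to 1 over the self-reported labels.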

Figure 3 shows the correlation between self-reported knowledge (see the detailed description of the experiment in Supplementary Information) and our predicted DeepThought-inferred knowledge (see Supplementary Information for a detailed description of the method).

We reiterate that we are not comparing with the true domain knowledge but with the self-reported knowledge. We find that DeepThought will predict the opposite of the self-reported knowledge in only ~10% of the cases (predicting expert with self-reported ‘no knowledge’ and vice versa). We emphasize the ~80% success rate of predicting ‘no knowledge’. These numbers show a high success rate in removing those whose expertise does not overlap with the proposal.

### Rating the helpfulness of review comments

After the review process, we asked the proposers to evaluate the ‘helpfulness’ of the review comments. A total of 136 reviewers provided feedback.

The review usefulness distribution shows a steady rise, with a sudden drop-off at the ‘very helpful’ bin, as shown in all panels of Fig. 4. About 55% of the users rated the comments in the ‘helpful’ and ‘very helpful’ bins.

To check what factors might influence the ability to write helpful comments, we use the conditional-probability method of equations (1) and (2).

The reviewer’s expertise is expected to have an influence on the helpfulness of comments. Figure 4a,b shows the influence of both self-reported knowledge and DeepThought-inferred knowledge on the helpfulness of the comment. The probabilities are very similar between the self-reported and inferred knowledge. We highlight that experts seemingly very rarely give unhelpful comments and that non-experts rarely give very helpful comments.

The last test is to see how the comment’s helpfulness is being evaluated given the ranking of the proposal within the quartiles, P(helpful comment | proposal quartile). This shows a similar distribution to the other panels in Fig. 4. There are some small differences. Comments for proposals from the second to the top quartile often were perceived as relatively helpful. Comments on proposals in the last quartile were rarely ranked as very helpful (ref. 12 finds a similar effect).

We checked whether seniority has an influence on the ability to create helpful comments. Figure 4c shows some correlation between the seniority and the ability for the referee to give helpful comments. Most interesting is the apparent inability of graduate students to give very helpful comments. This might be a training issue and can be resolved by exposing the students to schemes such as DPR.

We have also asked about the helpfulness of the comment in our general feedback (Supplementary Information). The distribution of comment usefulness follows the distribution of helpfulness for individual comments relatively closely (see statistics in Supplementary Information).

The comments given in the DPR compare very favourably with the OPC (see details in Supplementary Information).

## Summary and conclusions

The main advantages of the DPR paradigm (coupled with the DeepThought approach) over the classic panel concept can be listed as follows.

• It allows a much larger statistical basis (each proposal can be easily reviewed by eight to ten scientists), enabling robust outlier rejection.

• It removes possible biases generated by panel member nominations.

• The larger pool of scientists allows a much better coverage in terms of proposal–expertise matching.

• The smaller number of proposals per reviewer allows for more careful work and more useful feedback.

• Coupled to the DeepThought approach for proposal–referee matching, it is suitable to be semi-automated; it also gives an objective criterion for ‘expertise’, removing biases in self-reporting.

• It removes the concept of the panel, which adds rigidity to the process.

• It addresses the problem of maximizing the proposal–referee match while maximizing the overlap in the evaluations, which is a typical issue in preallocated panels (see ref. 13, and references therein).

• The lack of a face-to-face meeting greatly simplifies the logistics and the costs, making it attractive for small, budget-limited facilities.

• The absence of the meeting prevents strong personal opinions from having a pivotal influence on the process.

• It involves a larger part of the community, increasing its democratic breadth.

• All applicants are exposed to the typical quality of the proposals. This allows them to better understand if their request is not allocated time by placing it in a much wider context, and helps improve their proposal-writing skills (comment by Arash Takshi: “The ability to see what my competitors were doing filled a blind spot for me. Now I know that if I don’t get funded, it’s because of the quality of the other proposals, not something I did wrong.”; ref. 7).

• It trains the members of the community without additional effort.

The two major disadvantages are as follows.

• The lack of a meeting does not allow the exchange of opinions and the possibility of asking and answering questions to/from peers.

• Exposure of proposal content to a larger number of individuals (167 versus 78 in the real case of the DPR experiment) increases the risk of confidentiality issues.

The two major disadvantages can, however, be easily addressed: barring the fact that its effectiveness remains to be demonstrated and quantified (see above), the social, educational and networking aspects of the face-to-face meeting should not be undervalued. In this respect, we notice that the resources freed by the DPR approach can be used by the organizations for education and community networking (training on proposal writing, fostering collaborations and so on). Another possibility to enable interaction between the reviewers is to allow them to up-vote or down-vote the comments by other reviewers (the Science journal employs such an approach), which could be used to exclude comments and grades that were down-voted by a very large fraction of the other referees.

We conclude that the participating community has reacted extremely positively to this experiment (see Supplementary Section B). The presented approach to infer expertise works very well (see Fig. 3). On an individual level, the behaviour of the DPR referees conforms to the statistical description of the regular OPC referees (ref. 11), and there is no statistically significant evidence that junior reviewers systematically deviate from this (Comparison of the DPR with the OPC). The introduction of the possibility to rate the helpfulness of comments provides a new avenue to potentially reward helpful referees and train referees in general on giving useful feedback.

We encourage other organizations to run similar studies, to progress from a situation in which the classic peer review is adopted notwithstanding its limitations in the absence of better alternatives. As scientists, we firmly believe in experiments, even when these concern the way we select the experiments themselves.

## Methods

### Description of the DeepThought DPR experiment

For an overview of the process see Experiment overview. In the next sections, we will give a detailed description of the process of the DeepThought DPR experiment.

#### Reviewer exclusion and selection

Reviewer selection is a core part of the experiment. We separated this step into reviewer exclusion and reviewer selection. We tested two different strategies for reviewer selection: one based on a standard methodology (called ‘OPC emulate’), the other based on a machine estimation of the domain-specific knowledge (called ‘DeepThought’), which are described in detail in ‘DeepThought group’ in the Methods.

We describe the reviewer exclusion and selection using the abstract concept of a matrix. The matrix has a row for each referee and a column for each proposer (see equation (3)). In our case, the referees and proposers are the same set, and thus we have a square matrix.

Our exclusion matrix was constructed in a way that would mark a referee ineligible to review a proposal where any of the investigators were from the same institution as the referee:

$${A}_{\mathrm{exclusion}} = \begin{array}{ll} & \begin{array}{cccc} {\mathrm{p}}_{1} & {\mathrm{p}}_{2} & \cdots & {\mathrm{p}}_{n} \end{array} \\ \begin{array}{l} {\mathrm{referee}}_{1} \\ {\mathrm{referee}}_{2} \\ \vdots \\ {\mathrm{referee}}_{n} \end{array} & \left( \begin{array}{cccc} 1 & 1 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{array} \right) \end{array}$$
(3)

where p_n stands for proposer n; 1 indicates conflict and 0 no conflict.

For the OPC-emulate group, we also constructed an additional matrix (which was combined with the previous exclusion matrix using a logical OR) that marks a referee ineligible to review a proposal that was submitted to the same unit telescope as the referee’s submitted proposal.

We constructed a reviewing matrix that marks a reviewer–proposal combination with 1 (and the rest of the matrix with 0). We then used a round-robin selection process to iterate through the referees. For each referee, we use the exclusion matrix to determine eligible proposals and then the specific selection criterion for each of the groups to assign one of the remaining available proposals (taking into account the exclusion matrix). This process was repeated until all referees had been assigned eight proposals. If the process failed before completion, it was restarted with a different random number seed until we found a solution. A solution matrix needs to have all row sums and column sums equal to eight. For future projects, we strongly suggest researching algorithms from operations research (or combinatorial optimization) that have been optimized for the given process.
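
The round-robin procedure with restarts can be sketched as below. This is a simplified illustration, not the experiment's implementation: the function name, the `max_restarts` limit and the toy 12-participant exclusion matrix are assumptions, and the selection criterion shown is the plain random pick used for the OPC-emulate group.

```python
import random


def assign_reviews(exclusion, per_referee=8, seed=0, max_restarts=1000):
    """Build a 0/1 reviewing matrix; exclusion[r][p] == 1 forbids referee r on proposal p."""
    n = len(exclusion)
    for attempt in range(max_restarts):
        rng = random.Random(seed + attempt)      # new seed on every restart
        review = [[0] * n for _ in range(n)]
        load = [0] * n                           # reviews assigned to each proposal
        failed = False
        for _ in range(per_referee):             # one proposal per referee per round
            for r in range(n):
                eligible = [p for p in range(n)
                            if not exclusion[r][p]
                            and not review[r][p]
                            and load[p] < per_referee]
                if not eligible:
                    failed = True                # dead end: restart from scratch
                    break
                p = rng.choice(eligible)         # random selection criterion
                review[r][p] = 1
                load[p] += 1
            if failed:
                break
        if not failed:
            return review
    raise RuntimeError("no valid assignment found within max_restarts")


# Toy example: 12 participants, self-review as the only conflict, 3 reviews each.
exclusion = [[1 if r == p else 0 for p in range(12)] for r in range(12)]
review = assign_reviews(exclusion, per_referee=3)
```

A returned matrix has every row sum and every column sum equal to `per_referee`, which is the solution condition stated above. A greedy random pass can run into dead ends, hence the restart loop.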

#### Reviewer selection methodology

We separated the volunteer base into two groups for our two experiments. The first group was 60 randomly chosen volunteers out of the 172. This group was assigned proposals that would closely emulate the current way ESO assigns proposals in the OPC (which is a variant of the common time allocation strategy present in the astronomy community).

The second group of the remaining 112 volunteers was assigned by predicting their expertise on a proposal based on their publication history using machine learning.

##### OPC-emulate group

We aim with this selection process to emulate the OPC process. The members of the OPC assign themselves to expert groups in four categories:

• A—cosmology and intergalactic medium

• B—galaxies

• C—interstellar medium, star formation and planetary systems

• D—stellar evolution.

Multiple panels are then constructed for each subgroup (depending on the number of proposals for each subgroup).

We attempt to emulate this process by constructing four groups (A, B, C and D) of 15 referees. Each of these groups only reviews the proposals in their group. Thus each referee will only see proposals within the category in which they proposed.

We construct an exclusion matrix for each of the subgroups (see equation (3)) and then proceed with the review selection where at each selection step we simply randomly select any eligible proposal.

The total number of reviews from this process was 480.

##### DeepThought group

The general idea behind this selection process is to use the published papers of each participant to predict how knowledgeable they were for each proposal. This made extensive use of the dataset and techniques presented in ref. 9. This required identifying their publications, constructing knowledge vectors for each referee, constructing proposal vectors from the submitted LaTeX document, constructing a knowledge matrix for each combination of referee and proposal and using this matrix in the selection process.

###### Name disambiguation

The first part of this process was to uniquely identify participants in the NASA Astrophysics Data System (ADS) to infer their publications. Reference 14 has shown that, by using the last name and the first initials, only 6.1% of authors’ identities are contaminated (due to either splitting or merging), which is sufficient for the statistical requirements of our experiment.

We then used the Python package ads (available at https://ads.readthedocs.io) to access the ADS application program interface to search for the participant’s papers (and their arXiv identifiers) without regard for position of authorship. We excluded (moved to the OPC-emulate group) participants who had fewer than 3 papers (nine participants) or more than 500 papers (four participants).

###### Knowledge vectors

Section 4 of ref. 9 describes the construction of vectors from publications (using a technique called term frequency–inverse document frequency, TF-IDF). We used the document vectors from ref. 9 for all publications identified for each participant in the previous step. These document vectors were summed and then normalized. We call such a vector sum for each referee a ‘knowledge vector’.
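
The sum-then-normalize step can be sketched as follows. The function names and the three-dimensional toy vectors are hypothetical; the real document vectors come from the machinery of ref. 9 and are not reproduced here.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def knowledge_vector(document_vectors):
    """Sum a referee's document vectors component-wise, then normalize the sum."""
    summed = [sum(component) for component in zip(*document_vectors)]
    return normalize(summed)


# Two made-up document vectors for one referee.
kv = knowledge_vector([[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]])
```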

###### Proposal vectors

We use the machinery described in ref. 9 to process several sections of the LaTeX representation (‘Title’, ‘Abstract’, ‘ScientificRationale’, ‘ImmediateObjective’) of the submitted proposal. These were then converted to normalized document vectors, to which we refer as ‘proposal vectors’.

###### Knowledge matrix

We then construct a knowledge matrix similar to the exclusion matrix and fill each of its elements with the dot product between the proposal vector and the referee knowledge vector (the cosine similarity, since both vectors are normalized; see equation (4)).

$${A_{{\rm{knowledge}}}} = \begin{array}{*{20}{l}}{}&{\begin{array}{*{20}{l}}{{{\ \ \ \rm{p}}_1}}&{{{\rm{p}}_2}}&{ \cdots \;}&{{{\rm{p}}_n}}\end{array}}\\{\begin{array}{*{20}{l}}{{\rm{refere}}{{\rm{e}}_{\rm{1}}}}\\{{\rm{refere}}{{\rm{e}}_{\rm{2}}}}\\ \vdots \\{{\rm{refere}}{{\rm{e}}_n}}\end{array}}&{\left( {\begin{array}{*{20}{l}}{{\rm{0}}.{\rm{8}}}&{{\rm{0}}.{\rm{4}}}&{ \cdots \;}&{{\rm{0}}.{\rm{1}}}\\{{\rm{0}}.{\rm{5}}}&{{\rm{0}}.{\rm{9}}}&{ \cdots \;}&{{\rm{0}}.{\rm{5}}}\\ \vdots & \vdots & \ddots & \vdots \\{{\rm{0}}.{\rm{6}}}&{{\rm{0}}.{\rm{2}}}&{ \cdots \;}&{{\rm{0}}.{\rm{7}}}\end{array}} \right)}\end{array}$$
(4)

where p_n stands for proposer n.
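
A minimal sketch of filling such a matrix, assuming the referee and proposal vectors are already normalized so that the dot product is their cosine score. The function name and the two-dimensional toy vectors are invented for illustration.

```python
def knowledge_matrix(referee_vectors, proposal_vectors):
    """Rows are referees, columns are proposals; entries are dot products."""
    return [[sum(a * b for a, b in zip(ref, prop)) for prop in proposal_vectors]
            for ref in referee_vectors]


# Made-up unit vectors: two referees and two proposals.
referees = [[1.0, 0.0], [0.6, 0.8]]
proposals = [[1.0, 0.0], [0.0, 1.0]]
a_knowledge = knowledge_matrix(referees, proposals)
```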

As opposed to the OPC-emulate case, we do not assign proposals randomly to the referees during the selection step in the referee selection process. The proposals are picked according to the following algorithm, with different steps for the first four, subsequent two and last two proposals picked for each referee.

1. For the first four proposals assigned to each referee, choose from the available proposals the one with the highest cosine similarity.

2. For the next two proposals, choose from the available proposals the one closest to the median of all cosine scores for that particular referee.

3. For the last two proposals assigned to each referee, choose from the available proposals the one with the lowest cosine similarity.

The process was repeated with a different random seed if there were not eight suitable proposals available for each referee. Depending on the number of constraints and participants it took of the order of three repetitions to find a suitable solution.
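
The three-stage pick for a single referee can be sketched as below. It is a simplified illustration: the function name and the scores are hypothetical, and in the actual process the picks are interleaved across referees in a round-robin that also respects the exclusion matrix and the per-proposal review limit.

```python
from statistics import median


def pick_for_referee(scores, available, already_picked):
    """Pick the next proposal index for one referee, or None if nothing is left.

    scores[p] is the referee's cosine score for proposal p.
    """
    candidates = [p for p in available if p not in already_picked]
    if not candidates:
        return None
    k = len(already_picked)
    if k < 4:   # first four: best match
        return max(candidates, key=lambda p: scores[p])
    if k < 6:   # next two: closest to the median of all of this referee's scores
        target = median(scores)
        return min(candidates, key=lambda p: abs(scores[p] - target))
    return min(candidates, key=lambda p: scores[p])   # last two: weakest match


# Made-up cosine scores of one referee for 11 proposals.
scores = [0.90, 0.10, 0.52, 0.80, 0.30, 0.70, 0.20, 0.65, 0.40, 0.95, 0.05]
picks = []
for _ in range(8):
    picks.append(pick_for_referee(scores, range(len(scores)), picks))
```

Mixing high, median and low scores gives each referee proposals at several levels of inferred expertise, which is what allows the comparison between inferred and self-reported knowledge.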

#### Review process

The participants were given a login to evaluate the proposals. After signing a non-disclosure agreement (which is identical to the one signed by the OPC members), the participants could view the proposals assigned to them. They were first given the option to indicate a conflict of interest (removing them from making an eligible vote on the proposal). They were then asked for their expertise on the proposal’s topic. They were tasked to review the proposals by giving them a score (1–5; the same as in the OPC), their assessment of their knowledge of the proposal, and a comment.

We outline the steps in more detail in the following. The following options were given to indicate a conflict.

• No, I do not have a conflict.

• Yes, I have a close personal or professional relationship with the PI and/or team.

• Yes, I am a direct competitor to this proposal.

The referees were instructed to consider the following questions when evaluating a proposal.

• Is there sufficient background/context for the non-expert (that is, someone not specialized in this particular subfield)?

• Are previous results (either by proposers themselves or in the published literature) clearly presented?

• Are the proposed observations and the immediate objectives pertinent to the background description?

• Is the sample selection clearly described, or, if a single target, is its choice justified?

• Are the instrument modes and target location(s) (for example, cosmology fields) specified clearly?

• Will the proposed observations add significantly to the knowledge of this particular field?

They were then asked to assess their expertise on the proposal.

• This is my field of expertise.

• I have some general knowledge of this field.

• I have little or no knowledge of this field.

They were instructed to use the following general grading rules.

• 1.0—outstanding: breakthrough science

• 1.5—excellent: definitely above average

• 2.0—very good: no significant weaknesses

• 2.5—good: minor deficiencies do not detract from strong scientific case

• 3.0—fair: good scientific case, but with definite weaknesses

• 3.5—rather weak: limited science return prospects

• 4.0—weak: little scientific value and/or questionable scientific strategy

• 4.5—very weak: deficiencies outweigh strengths

• 5.0—rejected.

The referees then had to write a comment with a minimum of ten characters.

In the experiment design, each proposal was assigned to Nr = 8 peers, and each PI was assigned Np = 8 proposals. In practice, because of the declared conflicts, proposals were evaluated by four to eight referees, with Nr ≥ 6 in 95% of the cases. For the same reason, each referee reviewed between five and eight proposals, with Np ≥ 6 in 98% of the cases. This guarantees statistical robustness in the grade aggregation.

In the OPC process, the grades given by the distinct referees are combined using a simple average, after applying the referee calibration. This operation is described in ref. 11 (section 2.4 and appendix A), and aims to minimize the systematic differences in the grading scales used by the reviewers. In the current implementation, the calibration consists of a shift-and-stretch linear transformation, by which the grade distributions of the single referees are brought to have the same average and standard deviation (grades ≥3 are excluded from the calculations). This operation is justified by the relatively large number of proposals reviewed by each referee (>60), which makes the estimate of the central value and dispersion reasonably robust.
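
A shift-and-stretch calibration of this kind can be sketched as follows. The details are assumptions made for the example: the function name, the target mean and standard deviation, the use of the population standard deviation, and the choice to transform all grades (including those at or above the cutoff, which are only left out of the statistics). The operational procedure is the one described in ref. 11.

```python
from statistics import mean, pstdev


def calibrate(grades, target_mean, target_std, cutoff=3.0):
    """Linearly rescale one referee's grades to a common mean and dispersion."""
    core = [g for g in grades if g < cutoff]   # grades >= cutoff excluded from the statistics
    if len(core) < 2:
        return list(grades)                    # too few grades to estimate a dispersion
    mu, sigma = mean(core), pstdev(core)
    if sigma == 0:
        return [g - mu + target_mean for g in grades]   # shift only
    return [target_mean + (g - mu) * target_std / sigma for g in grades]


# Made-up grades from one referee; the 4.0 is left out of the statistics.
calibrated = calibrate([1.0, 1.5, 2.0, 2.5, 4.0], target_mean=2.0, target_std=0.5)
```

With only a handful of grades per referee, the estimate of `sigma` becomes noisy, which is the limitation discussed next for the DPR case.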

The case of the DPR is different in this respect, as a given person would have reviewed at most Np = 8 proposals. This limitation weakens the statistical significance of the per-referee estimates, especially that of the dispersion. For this reason, following the example of the Gemini Fast Turnaround channel (ref. 5), and for the purposes of providing feedback to the users, the raw grades were combined without applying any referee calibration (the effects of calibration in the DPR experiment are presented and discussed in the main text).
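A minimal sketch of combining raw grades per proposal and turning the result into the quartile letters (A–D) reported to applicants is given below. The text does not specify how the raw DPR grades were combined or how the quartile cuts were made, so the simple average, the convention that A is the best quartile, and the function names are assumptions.

```python
from statistics import mean


def aggregate_raw(reviews):
    """Average the uncalibrated grades of each proposal.

    reviews is an iterable of (proposal_id, grade) pairs.
    """
    by_proposal = {}
    for proposal_id, grade in reviews:
        by_proposal.setdefault(proposal_id, []).append(grade)
    return {pid: mean(grades) for pid, grades in by_proposal.items()}


def quartile_letters(mean_grades):
    """Assign A (best quartile) to D (worst quartile) by rank.

    Lower grades are better on the scale used here. Ties keep their
    input order because Python's sort is stable.
    """
    ranked = sorted(mean_grades, key=mean_grades.get)
    n = len(ranked)
    return {pid: "ABCD"[min(3, 4 * i // n)] for i, pid in enumerate(ranked)}
```

Rank-based letters depend only on the ordering of the mean grades, so they are insensitive to any global shift or stretch of the grading scale, although per-referee offsets can still change the ordering.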

#### Review evaluation

After the review deadline, the participants were given access to a page with the peer reviews (most of the time seven or eight) of their proposal. The applicants were given the quartile rank as a letter (A–D, as calculated in Grade aggregation). For each comment, they were asked to rate its helpfulness:

We would be grateful if you could rate each review on a scale of 1 (not helpful) to 4 (very helpful) how much this comment helps improve your proposal (positive comments like “best proposal I ever read” can be ranked as not helpful as it does not improve the proposal further). These ratings will not be distributed further but help us for statistical purposes.

#### Questionnaire

Each participant in the DPR experiment was asked to fill in a questionnaire after performing the reviews and receiving the feedback on their own proposal:

Also, after reading the reviews, please take 10–15 minutes to fill the final questionnaire (see the link at the bottom of this page). Although it is optional, it is your chance to give us feedback on any aspect that you liked/disliked and to shape a future DPR process. It will also greatly assist us in understanding how the experiment went, learn about what works and not, which biases are still present etc. It will allow us to build better tools for you in the future.

Out of 167 participants, 140 returned a completed on-line questionnaire (83.8%). Most of the questions were multiple choice (Supplementary Section A), with many of the answers used in the following sections in the evaluation of the DPR experiment.

The questionnaire also included five free-format questions: (1) what suggestions do you have to improve the software, (2) what suggestions do you have to improve the assessment criteria and/or review process, (3) do you have concerns about your proposals being evaluated through distributed peer review, (4) do you have any further feedback or suggestions regarding distributed peer review and (5) would you like to give any further feedback and/or suggestions regarding earlier raised points on securing confidentiality, external expertise, and robustness versus bias?

On average, each of the five free-format questions was answered by 50 people with a sentence or more. The answers provided specific suggestions for improvement as well as overall feedback, as summarized in the main text.

#### Data collection

The proposals were distributed to the participants on 8 October 2018. They were given until 25 October (17 days) to submit the reviews. At the deadline, 2 of the 112 participants in the DeepThought group and 3 of the 60 participants in the OPC-emulate group had not completed their reviews and were excluded from the rest of the process (completion rate 97.1%). We received a total of 1,336 reviews (from 167 reviewers for 172 proposals).
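The figures quoted above can be checked with a few lines of arithmetic. The sketch only restates numbers given in the text; the variable names are illustrative.

```python
# Group sizes and non-completions as quoted in the text.
assigned = {"DeepThought": 112, "OPC-emulate": 60}
dropped = {"DeepThought": 2, "OPC-emulate": 3}

total = sum(assigned.values())             # all participants who received proposals
completed = total - sum(dropped.values())  # participants who finished their reviews
completion_rate = 100 * completed / total  # in percent

n_reviews = 1336
reviews_per_proposal = n_reviews / total   # one proposal per participant

print(f"{completed}/{total} completed ({completion_rate:.1f}%)")
print(f"{reviews_per_proposal:.1f} reviews per proposal on average")
```

The average number of reviews per proposal comes out slightly below eight, in line with the statement that most proposals received seven or eight reviews.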

On 30 October the 167 remaining participants were given access to their evaluated proposals and were given two weeks to provide feedback. Of these, 136 (81.4%) completed the questionnaire.

We aim to allow further study of this dataset by the community. We also want to ensure the privacy of our participants and have therefore anonymized and redacted parts of the dataset. In particular, we have given the participants randomized IDs and provide only some derived products from the DeepThought machinery (we are not sharing knowledge vectors, as they might allow the identification of individuals). In addition, we have removed any free-text data the participants entered (such as the comments on the proposals). We have also removed all participants who did not reveal their gender: this group is very small, so its members could otherwise be deanonymized.
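The anonymization steps listed above can be sketched as follows. This is an illustration only and not the authors' pipeline: the record layout, the field names (`participant`, `gender`, `comment`) and the function name are hypothetical.

```python
import random

# Hypothetical names of fields holding free text typed by participants.
FREE_TEXT_FIELDS = ("comment",)


def anonymize(records, rng=None):
    """Return redacted copies of the records.

    Steps, following the text: drop participants with undisclosed gender,
    replace identities with randomized IDs, and remove free-text fields.
    Pass a seeded random.Random only for reproducible tests; by default
    an OS-backed generator is used so the mapping cannot be replayed.
    """
    rng = rng or random.SystemRandom()
    disclosed = [r for r in records if r.get("gender") is not None]

    # Shuffle the ID pool so that IDs carry no information about input order.
    names = sorted({r["participant"] for r in disclosed})
    ids = list(range(1, len(names) + 1))
    rng.shuffle(ids)
    id_of = dict(zip(names, ids))

    redacted = []
    for record in disclosed:
        clean = {
            key: value for key, value in record.items()
            if key not in FREE_TEXT_FIELDS and key != "participant"
        }
        clean["participant_id"] = id_of[record["participant"]]
        redacted.append(clean)
    return redacted
```

The mapping from names to IDs is deliberately not returned or stored, so the redacted records cannot be linked back to individuals through this function.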

## Data availability

The anonymized data are available at https://zenodo.org/record/2634598.

## References

1. Mervis, J. Just one proposal per year, please, NSF tells astronomers. Science 344, 1328–1328 (2014).

2. Merrifield, M. R. & Saari, D. G. Telescope time without tears: a distributed approach to peer review. Astron. Geophys. 50, 16–20 (2009).

3. Kurokawa, D., Lev, O., Morgenstern, J. & Procaccia, A. D. Impartial peer review. In IJCAI’15: Proc. 24th International Conference on Artificial Intelligence (eds Yang, Q. & Wooldridge, M.) 582–588 (AAAI Press, 2015); http://dl.acm.org/citation.cfm?id=2832249.2832330

4. Steppi, A. et al. Simulation study on a new peer review approach. Preprint at https://arxiv.org/abs/1806.08663 (2018).

5. Andersen, M. et al. The Gemini Fast Turnaround program. Am. Astron. Soc. Meet. Abstr. 233, 761 (2019).

6. Ardabili, P. N. & Liu, M. Incentives, quality, and risks: a look into the NSF proposal review pilot. Preprint at https://arxiv.org/abs/1307.6528 (2013).

7. Mervis, J. A radical change in peer review. Science 345, 248–249 (2014).

8. Strolger, L.-G. et al. The Proposal Auto-Categorizer and Manager for time allocation review at the Space Telescope Science Institute. Astron. J. 153, 181 (2017).

9. Kerzendorf, W. E. Knowledge discovery through text-based similarity searches for astronomy literature. J. Astrophys. Astron. 40, 23 (2019).

10. Ehrlinger, J. & Dunning, D. How chronic self-views influence (and potentially mislead) estimates of performance. J. Pers. Soc. Psychol. 84, 5–17 (2003).

11. Patat, F. Peer review under review - a statistical study on proposal ranking at ESO. Part I: the premeeting phase. Publ. Astron. Soc. Pac. 130, 084501 (2018).

12. Van Rooyen, S., Godlee, F., Evans, S., Black, N. & Smith, R. Effect of open peer review on quality of reviews and on reviewers’ recommendations: a randomised trial. BMJ 318, 23–27 (1999).

13. Cook, W. D., Golany, B., Kress, M., Penn, M. & Raviv, T. Optimal allocation of proposals to reviewers to facilitate effective ranking. Manag. Sci. 51, 655–661 (2005).

14. Milojević, S. Accuracy of simple, initials-based methods for author name disambiguation. J. Informetr. 7, 767–773 (2013).

## Acknowledgements

This paper is the result of independent research and is not to be considered as expressing the position of the ESO on proposal review and telescope time allocation procedures and policies.

We thank the 167 volunteers who participated in the DPR experiment for their work and enthusiasm. We also thank M. Kissler-Patig for promoting the DPR experiment following his experience at Gemini, ESO’s Director General X. Barçons and ESO’s Director for Science R. Ivison for their support and H. Schütze for several suggestions on the natural language processing. We thank J. Linnemann for help with some of the statistics tests.

W.E.K. is part of SNYU, and the SNYU group is supported by the NSF CAREER award AST-1352405 (PI Modjaz) and the NSF award AST-1413260 (PI Modjaz). W.E.K. was also supported by an ESO Fellowship and the Excellence Cluster Universe, Technische Universität München, for part of this work. W.E.K. thanks the Flatiron Institute. G.v.d.V. acknowledges funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme with grant agreement 724857 (consolidator grant ArcheoDyn).

## Author information

### Contributions

We use the CRediT (Contributor Roles Taxonomy) standard (see https://casrai.org/credit/) for reporting the author contributions. Conceptualization: W.E.K., F.P., G.v.d.V. Data curation: W.E.K., F.P. Formal analysis: W.E.K., F.P. Investigation: W.E.K., F.P. Methodology: W.E.K., G.v.d.V., T.A.P. Software: W.E.K., D.B. Supervision: W.E.K., F.P. Validation: W.E.K., F.P., G.v.d.V., T.A.P. Visualization: W.E.K., F.P., T.A.P. Writing – original draft: W.E.K., F.P., T.A.P. Writing – review & editing: W.E.K., F.P., G.v.d.V., T.A.P.

### Corresponding author

Correspondence to Wolfgang E. Kerzendorf.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Peer review information: Nature Astronomy thanks Morten Andersen, Anna Severin and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary information

### Supplementary Information

Supplementary Sections 1–3 and appendices 1,2.

## Rights and permissions

Kerzendorf, W.E., Patat, F., Bordelon, D. et al. Distributed peer review enhanced with natural language processing and machine learning. Nat Astron 4, 711–717 (2020). https://doi.org/10.1038/s41550-020-1038-y
