Artificial intelligence could eventually help to award scores to the tens of thousands of papers submitted to the Research Excellence Framework by UK universities. Credit: Yuichiro Chino/Getty

Researchers tasked with examining whether artificial intelligence (AI) technology could assist in the peer review of journal articles submitted to the United Kingdom’s Research Excellence Framework (REF) say the system is not yet accurate enough to aid human assessment, and recommend further testing in a large-scale pilot scheme.

The team’s findings, published on 12 December, show that the AI system awarded scores identical to those of human peer reviewers up to 72% of the time. When averaged out over the multiple submissions made by some institutions across a broad range of the 34 subject-based ‘units of assessment’ that make up the REF, “the correlation between the human score and the AI score was very high”, says data scientist Mike Thelwall at the University of Wolverhampton, UK, who is a co-author of the report.
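A toy numerical sketch (using invented scores, not the study’s data) can show how per-paper agreement of roughly this level is compatible with a much higher correlation once scores are averaged over each institutional submission:

```python
import numpy as np

# Toy data only: invented scores, not the REF study's data. Illustrates how
# modest per-paper agreement can coexist with a high correlation between
# averaged submission-level scores.
rng = np.random.default_rng(0)

n_papers = 1000
human = rng.integers(1, 5, size=n_papers)                      # star ratings 1-4
noise = rng.choice([-1, 0, 1], size=n_papers, p=[0.14, 0.72, 0.14])
ai = np.clip(human + noise, 1, 4)                              # AI mostly matches the human score

exact_agreement = np.mean(ai == human)                         # per-paper agreement

# Group papers into 50 hypothetical institutional submissions and average.
submission = rng.integers(0, 50, size=n_papers)
human_means = np.array([human[submission == s].mean() for s in range(50)])
ai_means = np.array([ai[submission == s].mean() for s in range(50)])
correlation = np.corrcoef(human_means, ai_means)[0, 1]

print(f"per-paper exact agreement:              {exact_agreement:.2f}")
print(f"correlation of per-submission averages: {correlation:.2f}")
```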

In its current form, however, the tool is most useful when assessing research output from institutions that submit a lot of articles to the REF, Thelwall says. It is less useful for smaller universities that submit only a handful of articles. “If there’s a submission with, say, just ten journal articles, then one or two mistakes can make a big difference to their total score.”
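As a back-of-the-envelope illustration (the numbers here are hypothetical, not taken from the report), mis-scoring two papers by one star shifts the average score of a ten-article submission far more than that of a hundred-article one:

```python
# Hypothetical arithmetic, not figures from the report: the shift in a
# submission's average score if two papers are each mis-scored by one star.
def shifted_average(n_papers, true_score=3.0, n_errors=2, error_size=1.0):
    """Average score when `n_errors` papers are mis-scored by `error_size` stars."""
    return (true_score * n_papers + error_size * n_errors) / n_papers

for n in (10, 100):
    avg = shifted_average(n)
    print(f"{n:>3} papers: average {avg:.2f} (shift {avg - 3.0:+.2f})")
```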

Thelwall says that the tool needs to reach 95% accuracy to be viable. He and his colleagues therefore recommend that the algorithms be tested on a wider scale, so that they can obtain feedback from the university sector.

They also think they can improve the accuracy of the AI system by giving it wider access to full-text versions of journal articles in machine-readable format. At the moment, the tool uses bibliometric information and article metadata to come up with its ratings. Thelwall speculates that they might be able to test the AI in the next REF by showing the algorithm’s results to peer reviewers after they’ve submitted their feedback and asking whether the tool would have affected their findings.

Training problems

One key limitation of the tool is that it is trained on a sample of articles that won’t get bigger over time. This means that the system won’t be able to improve its performance continuously, as AI systems usually do. That’s because the scores given by referees to research outputs submitted to the REF are subsequently deleted so that they cannot be used to challenge decisions later on, and Thelwall and his colleagues were given only temporary access.

And that limited access is not just a problem for the AI tool. “From a research-on-research perspective, it’s a tragedy that we put in all this effort and then we just delete [the data],” says James Wilsdon, a research-policy scholar and director of the Research on Research Institute in London. “The fear has always been that a university will raise a legal challenge, as there’s a lot of money at stake,” he adds.

With the current shortcomings in mind, Thelwall and his team say that the AI system shouldn’t be used to assist peer review in the next REF process, due to take place in 2027 or 2028, but could be used in a subsequent audit.

Focus-group concerns

As part of their study, Thelwall and his colleagues ran some focus groups with peer reviewers who have taken part in the REF process. According to Thelwall, some of those who attended the focus groups raised concerns that one of the 1,000 inputs used by the AI was a calculation similar to the journal impact factor, a metric that is sometimes controversially used to judge researchers and their work. “It creates a perverse incentive if universities know that their outputs will potentially be scored using information that would include the journal impact,” Thelwall says. Such an incentive might lead to researchers being pressured to publish in journals with a high impact factor, for instance.

Other inputs into the AI system include the productivity of the team generating the articles, how big the team is, how diverse it is in terms of the number of institutions and countries represented, and keywords in article abstracts and titles.
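The article does not describe the tool’s implementation; the sketch below, with invented field names, merely illustrates how bibliometric and metadata inputs of this kind might be turned into numeric features:

```python
# Minimal sketch with invented field names: assembling features of the kind
# described in the article (team size and productivity, institutional and
# country diversity, title/abstract keywords and a journal-impact-style
# metric). This is not the actual REF tool.
from dataclasses import dataclass

@dataclass
class ArticleRecord:
    authors: list[str]
    institutions: list[str]
    countries: list[str]
    team_publication_count: int      # proxy for the team's productivity
    journal_citation_rate: float     # impact-factor-like journal metric
    title: str
    abstract: str

KEYWORDS = {"novel", "randomised", "framework", "longitudinal"}  # illustrative only

def extract_features(rec: ArticleRecord) -> dict[str, float]:
    """Turn one article's metadata into numeric model inputs."""
    words = (rec.title + " " + rec.abstract).lower().split()
    return {
        "team_size": float(len(rec.authors)),
        "institution_diversity": float(len(set(rec.institutions))),
        "country_diversity": float(len(set(rec.countries))),
        "team_productivity": float(rec.team_publication_count),
        "journal_citation_rate": rec.journal_citation_rate,
        "keyword_hits": float(sum(w in KEYWORDS for w in words)),
    }
```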

Some argue that the REF process needs to give more weight to the research environment, so that institutions with a better research culture are rewarded with more funding. In another 12 December report, Wilsdon and his colleagues suggest that audits such as those in the REF need to shift the focus away from “excellence” and towards “qualities” that cover more aspects of research quality, impact, processes, culture and behaviour.

The report, a follow-up to a 2015 analysis co-authored by Wilsdon on the role of metrics in the assessment of UK research, also argues that the REF should avoid using an all-metrics approach in place of peer review. Furthermore, it says that the UK House of Commons Science and Technology Committee should launch an inquiry into the effects of university league tables on research culture.

That’s necessary, the report says, because “many league table providers continue to promote and intensify harmful incentives in research culture from outside the academic community, while resisting moves towards responsible metrics”.