Thirty years ago, when arXiv was launched, many felt optimistic about the potential of the internet to foster a better-informed citizenry and to level the playing field between the information haves and have-nots. With new platforms like arXiv, academia led the way. But now, those original ideals seem elusive, with political polarization so exacerbated by information echo chambers that there is no longer even agreement about what constitutes objective evidence. With stakes so high, perhaps we in academia can retake the lead we held 30 years ago and restore some of those expectations, by modelling how information can be responsibly and productively shared.

The emergence of a more minimalist quality control

In its early years, arXiv implemented both behind-the-scenes hygienic and content-related forms of quality control, the latter of which became increasingly important as arXiv’s visibility to the wider public increased (see Box 1 for more about the history of arXiv). ‘Hygienic’ in this context refers to superficial aspects — text should be extractable; references, authors and abstract should be included; there should be no distracting line numbers or watermarks, and so on — checks that can straightforwardly be automated. For content, arXiv early on implemented a form of minimal quality control by employing a group of active scientists to glance at incoming submissions — usually based just on title and abstract — and quickly judge only whether each was of plausible interest to the target research community. This oversight was to protect readers from off-topic content, and to maintain consistency with minimal academic standards. It also anticipated the ever-present risk that nefarious elements might not necessarily act in the best interests of society, a risk that in later years was perhaps not taken seriously enough by social media companies — witness the higher-stakes societal damage facilitated by freely flowing misinformation.
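The hygienic checks above lend themselves to simple automation. The sketch below is purely illustrative — arXiv’s actual checks are internal and more sophisticated — and the specific heuristics (keyword searches for an abstract and references, a count of leading line numbers, watermark words) are my own assumptions, not arXiv’s rules.

```python
import re

def hygiene_check(text: str) -> list[str]:
    """Flag superficial problems in a submission's extracted text.

    Illustrative heuristics only; they mirror the *kinds* of rules
    described in the text, not arXiv's real implementation.
    """
    problems = []
    if not text.strip():
        problems.append("no extractable text")
    if "abstract" not in text.lower():
        problems.append("missing abstract")
    if not re.search(r"\breferences\b|\bbibliography\b", text, re.I):
        problems.append("missing references")
    # Many lines beginning with a bare number suggest leftover draft line numbers.
    if len(re.findall(r"^\s*\d+\s", text, re.M)) > 20:
        problems.append("possible draft line numbers")
    if re.search(r"\bDRAFT\b|\bCONFIDENTIAL\b", text):
        problems.append("possible watermark text")
    return problems
```

Because each rule is a cheap, deterministic test, the whole battery can run on every submission at ingest time, leaving human moderators to handle only the flagged cases.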

But arXiv operates on an unforgiving daily turnaround, so in recent years the human moderation has been supplemented by an automated machine learning framework I created to flag and hold potentially problematic submissions for additional human scrutiny1. Automated processes do not take vacation, get sick or distracted or too busy, and can comprehensively assess full-text content, including checking each new incoming submission against the entire back database for duplication or excessive text overlaps, in milliseconds. Much of the internal human effort is now directed to mediating and adjudicating the various human and robotic oversights at scale.
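One standard way to perform the kind of full-text overlap screening described above is shingling with Jaccard similarity. The sketch below is a minimal illustration of that general technique, not arXiv’s actual framework; the function names, the window size `k` and the flagging `threshold` are all assumptions for the example.

```python
def shingles(text: str, k: int = 5) -> set:
    """Represent a document as its set of overlapping k-word windows."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def overlap_score(new_doc: str, old_doc: str, k: int = 5) -> float:
    """Jaccard similarity of the two shingle sets: 1.0 means identical wording."""
    a, b = shingles(new_doc, k), shingles(old_doc, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def flag_overlaps(new_doc: str, back_catalogue: dict, threshold: float = 0.3) -> list:
    """Return ids of earlier papers whose text overlap exceeds the threshold."""
    return [paper_id for paper_id, text in back_catalogue.items()
            if overlap_score(new_doc, text) >= threshold]
```

A naive pairwise scan like `flag_overlaps` would be far too slow against millions of back papers; in practice one would index the shingles with hashing schemes such as MinHash and locality-sensitive hashing, which is what makes millisecond-scale lookups plausible.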

From health hazards to lifesavers

Despite early doubts that preprint distribution would be relevant outside of high-energy physics, its history has been one of continuous growth into new fields, catalysed by occasional spikes. For example, focused interest in magnesium diboride superconductors in 2001, and later iron pnictide superconductors starting in 2008, led the associated experimental communities to use arXiv to report breaking results and stake precedence claims. More recently, the machine learning community adopted arXiv en masse around 2015. These researchers remain dedicated users; so far, no community that has adopted arXiv for rapid dissemination has since abandoned it.

But perhaps the spike in preprint use most relevant for questions about information sharing in wider society is the growth in bioRxiv and medRxiv triggered by the COVID-19 pandemic. These preprint servers hosted more than 10,000 articles in the pandemic’s first year2, and this growth may well emerge as a tipping point for other research domains. It is informative to look back at a 1995 editorial in the New England Journal of Medicine about preprints, expressing legitimate public health concerns given that “much information about health issues on the Internet, such as the risks of medications and the effects of various foods on health, is of uncertain parentage”3. Although recent experience might seem to reinforce those concerns, I would argue that evidence thus far suggests that open preprint distribution is not a source of current problems and in many cases can help mitigate them.

The COVID-19-related submissions to bioRxiv and medRxiv have not resulted in major public health hazards (although to be sure those resources are subject to more stringent review4 than arXiv). To the contrary, the worst offenders were instead published in conventional refereed venues. These include an article extolling the virtues of hydroxychloroquine (whose publisher posted a letter of concern, but not a retraction5), and other studies based on fabricated data that were quickly retracted by the Lancet and the New England Journal of Medicine6. Perhaps those and other journal editors would have benefited from seeing more expert open commentary prior to publication: to date, more than 120 peer-reviewed COVID-19 articles have been retracted or withdrawn. By contrast, a COVID-19 study posted in preprint form7, overestimating prior infection rates and quickly picked up by the press, had its statistical flaws promptly dissected by experts. A preprint reporting results of a rigorous clinical study of the drug dexamethasone led to the drug’s deployment during the half-year before the study appeared as a journal publication, potentially saving many lives8. And it was a preprint9 that pushed back against an actual health hazard, by correcting misconceptions behind the long-assumed 5 μm boundary between (falling) droplets and (airborne) aerosols, and signalling the need for more effective revised health precautions against COVID-19 spread.

Peering ahead

I do not claim that preprint distribution is a panacea for the delays and biases of peer-reviewed journal publication but, rather, I would suggest that with proper context the benefits can far outweigh the risks. Journalists frequently qualify mention of articles on preprint servers with a ‘not-yet-reviewed’ caveat, and ordinarily consult experts for a reality check to avoid misleading the public. Although necessary qualifications to COVID-19 preprints are not provided by all digital media outlets10, it is certainly possible to standardize the application of some formulation of ‘under review’ to convey uncertainty. If we are indeed inexorably headed towards increased public dissemination of preprints in more fields, it is worthwhile for all participants — researchers, peer-reviewed journals and mass media — to embrace the trend and engineer ways to keep research professionals better informed and the general public less misinformed.