Well before the Internet or the first mention of ‘big data’, physicists had been generating, analysing and preserving large data sets. One could trace the use of exploratory data science back to Kepler’s use of Tycho Brahe’s astronomical observations to derive the laws of planetary motion. But whereas 17th-century data science could be done by hand, the truly large data sets in physics came with the Manhattan Project. Since then, the physics community has been pioneering data analysis, curation and preservation.

In a Comment published earlier this year, Boris Pritychenko tells the story of how curated atomic and nuclear data tables evolved, from the first compilation of neutron cross sections in 1947 to the efforts led by Katherine Way to organise and preserve the vast quantities of results from the Manhattan Project. Way’s work led to the publication in 1965 of the first nuclear data journals, which have subsequently evolved into a central resource for the community.

Around the same time, particle physicists started to compile experimental data. In 1957, Murray Gell-Mann and Art Rosenfeld published a Particle Properties Table in the Annual Review of Nuclear Science. The table was then printed as a wallet card for quick reference. But by 1964, an update in Reviews of Modern Physics had already grown to 27 pages plus three wallet cards. Sixty years later, what is now known as the Review of Particle Physics comprises two volumes of about 1,000 pages each and is put together by some 240 scientists in 24 countries: the Particle Data Group (PDG) Collaboration.

In another Comment, Martin Green describes how he first compiled, and has maintained ever since, the Solar Cell Efficiency Tables, which have provided six-monthly updates of record solar cell performance in Progress in Photovoltaics since 1993. In 2007, Alan Kostelecky, who has been involved in organising a regular conference on charge, parity, time (CPT) and Lorentz symmetries since the late 1990s, decided together with co-chair Neil Russell to display posters during the conference coffee breaks that summarised tests of these symmetries across the various subfields. The posters were so popular that the two compiled them into Data Tables, which they posted on the arXiv the following year and have updated yearly ever since. The 2011 edition was published in Reviews of Modern Physics.

These examples have a few things in common. Curated data tables emerged from a specific need of the research community. This need was identified by a champion, who first collected the data and subsequently updated it. Data curation involves scanning the literature and fielding direct alerts from experimentalists; it is therefore time-consuming and can only be done by a human specialist with a broad network of connections in the field, or by a team of experts. Because the amount of literature keeps expanding, regular updates are needed, their frequency depending on the growth rate of the field and the effort involved. But “the volume of new information makes it unclear how long our current procedures will remain practicable”, as Kostelecky notes. Even the PDG, which has a complex infrastructure in place to update the Review of Particle Physics, is planning to explore automation using machine learning tools for the discovery and classification of papers containing results of interest.
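To give a flavour of what such automation could involve (this is not a description of the PDG’s actual tooling, which has not been made public in this form), here is a minimal sketch in Python of the first step, flagging papers that likely report a result worth curating, assuming scikit-learn and a hypothetical set of labelled abstracts:

```python
# Minimal sketch: flag abstracts that likely contain curatable results.
# Illustrative only; the abstracts and labels below are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training set: 1 = reports a measurement of interest, 0 = not.
abstracts = [
    "We report a new measurement of the tau lepton mass.",
    "We measure the branching fraction of the decay B -> K* gamma.",
    "We review the theoretical status of lattice QCD calculations.",
    "This note describes upgrades to the detector readout electronics.",
]
labels = [1, 1, 0, 0]

# TF-IDF features plus a linear classifier: a common first baseline
# for document classification.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(abstracts, labels)

# Score a new abstract: probability that it contains a result of interest.
new_abstract = ["We present a precise measurement of the W boson mass."]
print(classifier.predict_proba(new_abstract))
```

In practice, such a system would need far richer training data, full-text access and human review of every flagged paper; the sketch merely illustrates that the discovery-and-classification step is, at its core, a standard text-classification task.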

An unexpected trend is that, despite the increasing importance of the online versions of the data tables (hosted on the arXiv, a dedicated site, an app or the publisher’s digital version), there is still a demand for print, as noted by Pritychenko and by Juerg Beringer, Head of PDG. The printed PDG Booklet remains “important for many especially in the area of teaching and outreach, even though the contents of the Booklet are available online and in the form of an app”, says Beringer. The journal version of record also remains important, because for many the formal publication in a journal constitutes the universally accepted final step for reporting scientific results.

To fulfil this trust, journals have a responsibility not only to ensure the quality, discoverability and preservation of this version of record, but also to innovate and respond to the evolving needs of the communities they serve. Since 2007, the information providers in astronomy, astrophysics and high-energy physics (AAHEP), such as INSPIRE, ADS, arXiv and journal publishers, have held regular meetings to discuss these needs and look for solutions. Nature Reviews Physics is committed to taking an active part in these conversations and is currently exploring how we can best answer the growing need for high-quality curated data across physics. We welcome your ideas.