Physics keeps changing and so do classification and subject indexing. Arthur Smith recalls the final updates to the Physics and Astronomy Classification Scheme (PACS) and the development of the Physics Subject Headings (PhySH), and ponders future directions.
The rise and fall of PACS
The Physics and Astronomy Classification Scheme (PACS) was not the first tool for organizing subject indexes in physics. A rough timeline starting 120 years ago, near the start of Physical Review, is outlined in Fig. 1. By 1960, Physical Review had grown too large as a single journal for scientists to browse through it to find the things they were interested in. Improved subject indexing and grouping became necessary, and an ad hoc hierarchical index was added. Since the problem of classification in physics was not unique to the American Physical Society (APS) journals, several organizations got together in the early 1970s to develop a more uniform system: PACS (ref. 1).
But technology changed. In 2008 I joined the committee overseeing PACS and from my perspective there were a number of serious concerns:
Subject-based indexes had become almost irrelevant as title, abstract and full-text search on web browsers replaced the old system of printed indexes.
The indexing database, Inspec, had gone its own way for subject indexing and was no longer using PACS.
Frequent changes to PACS broke backwards-compatibility. Some codes had even been completely repurposed: 29.30.Dn was ‘Heavy charged-particle spectroscopy’ in 1977, was absent in 1990 and had become ‘Electron spectroscopy’ by 2010. PACS codes could not reliably find related articles across the years, and PACS-based searches were only a tiny fraction of online journal searches.
Alphanumeric codes and long labels made PACS difficult for online use. The committee considered adding a ‘short descriptor’ label as the main user-visible form for PACS.
The PACS hierarchy was too rigid. Polyhierarchy — having multiple parents — was impossible, so cross-cutting research areas could only be under one top-level code. Fixed string size precluded any more than four (later five) levels, which had been the cause of some restructurings to limit depth.
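The backwards-compatibility problem in the list above can be made concrete with a small Python sketch. The table holds only the 29.30.Dn example quoted in the text; the table layout and the function names `label_in_edition` and `is_stable` are illustrative assumptions, not part of any PACS tooling.

```python
from typing import Optional

# Label carried by a code in each edition mentioned in the text.
# None marks an edition in which the code was absent.
PACS_LABELS = {
    "29.30.Dn": {
        1977: "Heavy charged-particle spectroscopy",
        1990: None,
        2010: "Electron spectroscopy",
    },
}

def label_in_edition(code: str, year: int) -> Optional[str]:
    """Return the label a code carried in a given edition, if any."""
    return PACS_LABELS.get(code, {}).get(year)

def is_stable(code: str) -> bool:
    """True only if every known edition gives the code the same label."""
    labels = set(PACS_LABELS.get(code, {}).values())
    return len(labels) == 1 and None not in labels

print(label_in_edition("29.30.Dn", 1977))
print(label_in_edition("29.30.Dn", 2010))
print(is_stable("29.30.Dn"))
```

A search keyed on the code alone would mix articles from both meanings, which is why a code-based index could not be trusted across editions.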
From the very beginning, PACS was owned by the American Institute of Physics (AIP), which had a dedicated group of several full-time staff to classify journal articles and to manage the updates to PACS with input from journal editors and other physicists. I believe there was a background concern about the expense of this classification effort. AIP decided to bring in a consultant to consider restructuring or replacing PACS with a ‘faceted taxonomy’ or ‘thesaurus’ — a hierarchical controlled vocabulary with relatively short labels. But AIP and the APS Physical Review journals no longer had the same purposes in mind. AIP developed their new proprietary physics thesaurus with the primary goal of automated classification — a rules-based system to look at abstracts or full text of articles and pick out terms or concepts that were likely to apply. APS was reliant on PACS for a lot more than just providing final classification terms to readers and people searching on the web. At the APS journals, the editors expected authors to supply suitable PACS codes with their articles and then were vigilant about checking that they were the right codes – particularly because the primary selected code often determined which editor was assigned, helped in selecting referees and also determined where an accepted article would be placed in the published journal.
The rise of PhySH
In 2011 I helped organize a session at the Materials Research Society meeting in Boston (ref. 2) to look at best practices in classification in science, with particular emphasis on fields close to physics and on the vocabularies developed at the National Library of Medicine. Two basic models seemed successful: centrally organized and comprehensive classification systems, such as that of the National Library of Medicine in the life sciences; or more independent and interoperable vocabularies, often relying on the Simple Knowledge Organization System (SKOS; ref. 3) design in which each concept has a unique Uniform Resource Identifier (URI). The workshop also highlighted that the individual terms are more important than the hierarchy they are placed in. The purpose of the hierarchy is to group related terms together, and there are always many different ways to do that. Terms should come first and the relations and hierarchy should be a secondary component, perhaps adjustable depending on context or application.
At APS, we seriously considered using AIP’s new thesaurus, and also looked into the Inspec terms. But we wanted to be able to publish the full vocabulary for anybody to use, in particular, to allow authors to select classifications immediately upon submitting their articles. A second concern was scale and maintainability: we felt PACS (and AIP’s new thesaurus derived from it) had become too detailed to manage with a realistic allocation of resources. In January 2013, an internal proposal for what later became PhySH stated the goal: a scheme flexible enough to evolve quickly as new topics become important in physics research, while ensuring that older resources do not become lost over time. It was not intended to be a comprehensive vocabulary of the field.
In late 2013, we started working with an outside consultant to develop PhySH and find or create the software needed to integrate it with our peer-review systems. We quickly decided on a faceted structure (such as ‘Physical Systems’ or ‘Techniques’) grouped by the main research areas of physics (such as ‘Nuclear Physics’ or ‘Condensed Matter’), although it took some time to settle on the detailed names and organization. Between our consultant and groups of journal editors, we worked through the various proposed disciplinary groupings during 2014 and the first half of 2015, at which point serious internal testing began. After some more feedback, APS started using the new system in late 2015 (with initially only the editors tagging papers). There was then an ‘unveiling’ in January 2016 (ref. 4) and the Physical Review journals started requiring authors to supply PhySH terms in the first half of that year. Shortly after that point, PhySH had completely replaced PACS in our handling of submitted manuscripts.
As a keyword-based system, PhySH should be easier for regular physicists to understand and remember than the alphanumeric codes used by PACS. The disciplinary groupings and facets are intended to make it easier for users to find what they are looking for while browsing among the terms, but most often people find terms just by typing in a concept they want to use. I believe the hierarchy captures important features of the organization of physics research but it isn’t central to most uses. In particular, PhySH concept labels are intended to be clear, unambiguous and independent of their parent or sibling terms, which was not true of PACS.
Under the hood, PhySH uses the SKOS model (ref. 3) for controlled vocabularies. In particular, each discipline is a ‘concept scheme’, and the facets are their ‘top concepts’. Each concept is identified by a unique URI (registered as a digital object identifier, DOI) so that the label may change but the underlying identifier remains the same. Concepts may have alternate labels and related concept links to improve the findability of terms, and the SKOS model allows arbitrary depth of hierarchy and multiple parents for any concept, something that wasn’t possible in PACS. We use a few customizations to align the facets among the disciplines and to handle deprecation, but the underlying use of SKOS makes PhySH interoperable with other SKOS-based vocabularies and with software designed to work with them.
The current plan is for quarterly updates to PhySH, a significantly shorter turnaround time for introducing new concepts than was true for PACS. APS also recently made PhySH publicly available for others to use (ref. 5), and outside recommendations for changes will be considered alongside those from APS journal editors and other users.
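The properties described here (stable identifiers, revisable labels, alternate labels and multiple parents) can be sketched in a few lines of Python. This is a minimal illustration using plain data classes rather than an RDF library. The `Concept` and `Vocabulary` classes, the `https://example.org/vocab/` URIs and the ‘Stellar spectroscopy’ concept are hypothetical placeholders, not actual PhySH identifiers or structure.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Concept:
    uri: str                                           # persistent identifier
    pref_label: str                                    # preferred label, may be revised
    alt_labels: set[str] = field(default_factory=set)  # extra search terms
    broader: set[str] = field(default_factory=set)     # URIs of parent concepts

class Vocabulary:
    def __init__(self) -> None:
        self.concepts: dict[str, Concept] = {}

    def add(self, concept: Concept) -> None:
        self.concepts[concept.uri] = concept

    def find(self, text: str) -> list[str]:
        """Return URIs whose preferred or alternate label equals text."""
        wanted = text.casefold()
        return sorted(
            c.uri for c in self.concepts.values()
            if wanted in {c.pref_label.casefold(),
                          *(a.casefold() for a in c.alt_labels)}
        )

    def ancestors(self, uri: str) -> set[str]:
        """Collect every broader concept, following all parent links."""
        seen: set[str] = set()
        stack = list(self.concepts[uri].broader)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(self.concepts[parent].broader)
        return seen

BASE = "https://example.org/vocab/"  # placeholder namespace (assumption)
vocab = Vocabulary()
vocab.add(Concept(BASE + "c1", "Physical Systems"))
vocab.add(Concept(BASE + "c2", "Techniques"))
vocab.add(Concept(BASE + "c3", "Stars",
                  alt_labels={"Stellar objects"}, broader={BASE + "c1"}))
# Hypothetical concept with two parents, which PACS could not express.
vocab.add(Concept(BASE + "c4", "Stellar spectroscopy",
                  broader={BASE + "c2", BASE + "c3"}))

# Revising a label leaves the identifier, and anything tagged with it, intact.
vocab.concepts[BASE + "c3"].pref_label = "Stars (astronomy)"
print(vocab.find("stellar objects"))
print(sorted(vocab.ancestors(BASE + "c4")))
```

Because articles would be tagged with the URI rather than the label, the relabelling in the last lines does not orphan anything, in contrast to the repurposed PACS codes.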
PACS and PhySH embody different approaches to classification, and it has taken our users some time to become comfortable with the change. PACS was very much a pigeon-hole system, in which the goal was to place each article into the best ‘box’ for it along all the various dimensions of interest. The thermoelectric effect classifications shown in Fig. 1 are a typical case in which the PACS code attempted to define both the physical system and the effect being studied. The PhySH approach allows for more precision on what effect is being studied, what techniques are being used or what systems are being considered — but this comes at the cost of requiring multiple terms to characterize the research.
The hierarchical difference was also something our users took time to become accustomed to. With PACS, one almost always assigned a code from the lowest applicable level, but with PhySH any concept anywhere in the hierarchy (other than the facets and disciplines themselves) is allowed. For example, instead of having a PACS code ‘97.90.+j Other topics on stars’, PhySH just uses ‘Stars’, which encompasses all the various topics about stars. Once one is accustomed to it, the PhySH approach is simpler, but it requires a slightly different way of thinking about the classification system.
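One way to picture the difference is a search that accepts several terms per article and treats a broad term as covering everything beneath it. The Python sketch below is only an illustration under stated assumptions: apart from ‘Stars’, the term names, the small hierarchy in `NARROWER` and the article identifiers are all hypothetical and do not reflect the real PhySH structure.

```python
# Hypothetical fragment of a hierarchy: term -> directly narrower terms.
NARROWER = {
    "Stars": {"Neutron stars", "White dwarfs"},
    "Neutron stars": {"Pulsars"},
}

# Hypothetical articles, each tagged with one or more terms at any level.
ARTICLE_TAGS = {
    "article-1": {"Stars"},                    # tagged at a broad level
    "article-2": {"Pulsars", "Spectroscopy"},  # system plus technique
    "article-3": {"Spectroscopy"},
}

def with_descendants(term):
    """Return the term together with every narrower term beneath it."""
    found = {term}
    stack = [term]
    while stack:
        for child in NARROWER.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def search(*terms):
    """Articles matching every given term, each expanded to narrower terms."""
    hits = None
    for term in terms:
        expanded = with_descendants(term)
        matching = {a for a, tags in ARTICLE_TAGS.items() if tags & expanded}
        hits = matching if hits is None else hits & matching
    return sorted(hits or [])

print(search("Stars"))
print(search("Stars", "Spectroscopy"))
```

Here no ‘other topics’ catch-all is needed: an article tagged only with the broad term is still found, and combining terms narrows the result along independent dimensions.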
Aside from the conceptual difficulty, there were certainly technical hurdles in setting up our systems to replace PACS codes with PhySH identifiers. Most notably, we had to make some significant changes in the approach to automatic editor assignment that some of the journals used, taking into account the journal table-of-contents section that authors thought their article belonged in. After a few years’ experience, the new system seems to work but there is definitely still room for improvement. There has also been a continued hope that we could make PhySH interoperable with other vocabularies that could be used to enhance the tagging of our content (for example, by selecting elements or isotopes, or more detailed terms from astronomy or geophysics), but technically we are still not sure which is the best way to approach this. That’s not something PACS did either, but it was one of the stated goals for APS when we started working on PhySH.
The structure and other aspects of the Medical Subject Headings (MeSH; ref. 6) provided some of the inspiration for the design of PhySH, but we also considered features in disciplinary vocabularies somewhat closer to physics. In particular, the open nature and SKOS design of the Unified Astronomy Thesaurus (ref. 7), which also had some background heritage from PACS and AIP’s newer thesaurus, gave us examples to consider for how we wanted to interoperate. The Astrophysics Data System (ref. 8) has a faceted search system with different controlled vocabularies operating along different conceptual dimensions. Geophysics maintains standard vocabularies similarly along different dimensions (such as time periods, locations or mineral types).
The closest field to physics in many respects is chemistry. Chemists have established a number of different taxonomies of molecules and compounds and other chemical systems, including the International Union of Pure and Applied Chemistry International Chemical Identifier (InChI) and the Chemical Entities of Biological Interest (ChEBI). I think this approach fits well with the interoperability promise of SKOS, although whether InChI or ChEBI or some other scheme (or several of them) is the best one to apply to the APS journals is not something we have settled on yet. The general idea would be that we would supplement PhySH concepts, which identify the physical research areas or techniques, with identifiers for the specific elements, isotopes, molecules or other systems from one of these other authorities. The same approach should work for articles in fields such as astrophysics, geophysics or biophysics, pulling in vocabularies that work well in those fields.
In the life sciences there has been a proliferation of persistent identifiers for concepts, and they seem to have had some success in using them together. One central collection of these is the Identifiers.org database (ref. 9) from the European Molecular Biology Laboratory. This allows an author to identify the particular protein, gene or other item of interest in their article through a simple ‘prefix:identifier’ notation, readily allowing indexing and searching. There are still some hurdles here in terms of correlating common concepts, but it seems promising and already useful.
Do controlled vocabularies and human classification efforts like this still make sense in the age of Google, Watson, big data and deep learning? The recent progress in artificial intelligence is promising, but so far these systems generally rely on learning from large collections of data previously curated by human beings rather than starting from scratch. As recently pointed out (ref. 10), significant further progress in artificial intelligence probably requires combining these approaches with symbolic modelling, and there is still a long way to go. A classification system like PhySH provides both a large collection of human-curated data and also elements for a symbolic model of at least the part of the world relevant to physics research. If we hope to have machines read and ‘understand’ scientific articles then PhySH and friends should be able to contribute to that effort.
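As a rough illustration of how such a notation lends itself to indexing, the Python sketch below splits a ‘prefix:identifier’ string into its two parts and builds a lookup address. The function names, the lower-casing of the prefix, the example accession and the resolver URL pattern are assumptions made for the example; the registry's own documentation should be checked before relying on any of them.

```python
from typing import NamedTuple

class CompactId(NamedTuple):
    prefix: str      # names the collection, for example a protein database
    accession: str   # the identifier within that collection

def parse_compact_id(text: str) -> CompactId:
    """Split 'prefix:identifier' at the first colon and validate both parts."""
    prefix, sep, accession = text.strip().partition(":")
    prefix, accession = prefix.strip(), accession.strip()
    if not sep or not prefix or not accession:
        raise ValueError(f"not a prefix:identifier string: {text!r}")
    # Lower-casing the prefix is an assumption made for consistent indexing.
    return CompactId(prefix.lower(), accession)

def resolver_url(cid: CompactId, base: str = "https://identifiers.org/") -> str:
    """Build a lookup URL; the pattern is assumed, not taken from the article."""
    return f"{base}{cid.prefix}:{cid.accession}"

# Illustrative input only; the accession is a placeholder for the example.
cid = parse_compact_id("UniProt:P12345")
print(cid)
print(resolver_url(cid))
```

Splitting at the first colon only means an accession that itself contains a colon is kept whole, which matters for collections whose local identifiers repeat the collection name.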
What we’re doing with classification is mapping content to a condensed representation of meaning — identifiers and symbols that represent the larger context of research work — and through that mapping then allowing those pieces of content to be related to one another, for other systems to see those relationships and present them to researchers to do useful things with. If additional automated intermediaries appear in that process, I don’t think that fundamentally changes much. To do that first part fully automatically though — to condense content into meaning — requires real understanding: a genuine general artificial intelligence. So I believe it’s going to be a lot harder and take a lot longer than some of the recent hype would have you think.
PACS worked for the APS journals for almost 40 years. I’m comfortable that we’ll see at least a decade’s use for PhySH. If we’re able to integrate it with other domain-specific vocabularies or perhaps with more general symbolic representations of the world, maybe it will still be relevant four decades from now. Whatever the future brings, classification is central to human approaches to organizing the world around us, and I am sure some form of classification for physics research will continue to be practiced far into the future.