Nature | Toolbox

The trouble with reference rot

Computer scientists are trying to shore up broken links in the scholarly literature.


Illustration by the Project Twins

The scholarly literature is meant to be a permanent record of science. So it is an embarrassing state of affairs that many of the web references in research papers are broken: click on one, and there is a fair chance that it will point nowhere, or to a site whose content has changed since the paper cited it.

Herbert Van de Sompel, an information scientist at the Los Alamos National Laboratory Research Library in New Mexico, quantified the alarming extent of this 'link rot' and 'content drift' (together, 'reference rot') in a paper published last December (M. Klein et al. PLoS ONE 9, e115253; 2014). With a group of researchers under the auspices of the Hiberlink project (http://hiberlink.org), he analysed more than 1 million 'web-at-large' links (defined as those beginning with 'http://' that point to sites other than research articles) in some 3.5 million articles published between 1997 and 2012. The Hiberlink team found that in articles from 2012, 13% of hyperlinks in arXiv papers and 22% of hyperlinks in papers from Elsevier journals were rotten (the proportion rises in older articles), and overall some 75% of links were not cached on any Internet archiving site within two weeks of the article's publication date, meaning their content might no longer reflect the citing author's original intent — although the reader may not know this.

Hyperlinks to web-at-large content were present in only one-quarter of the 2012 scholarly articles, but some four-fifths of those papers that did contain a link suffered from reference rot, the team found — that is, at least one reference to web-at-large content was either dead or not archived. Van de Sompel terms the situation “rather dramatic”. Because the content of servers can change, or they can 'go dark' or change hands, researchers following up links to online data sets, software or other resources might have nowhere to turn. “You've lost a trace to the evidence that was used in the research,” he says.

Snapshots of the web

Fortunately, online archiving services, such as the Internet Archive, make it possible for researchers to store permanent copies of a web page as they see it when preparing their manuscripts — a practice Van de Sompel recommends. He urges researchers to include their cached link and its creation date in their manuscripts (or for publishers to take a snapshot of referenced material when articles are submitted). The Harvard Law School Library in Cambridge, Massachusetts, has developed a web-archiving service called Perma.cc (https://perma.cc): enter a hyperlink here and the site spits back a new hyperlink for a page that contains links to both the original web source and an archived version.
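Whether a given page has already been captured can be checked against the Internet Archive's availability API, which reports the stored snapshot closest to a requested date. The sketch below parses the JSON shape that API returns; the sample response is illustrative, not the result of a live query.

```python
import json

def closest_snapshot(api_response):
    """Return the URL of the closest archived snapshot reported by the
    Wayback Machine's availability API, or None if nothing is stored."""
    data = json.loads(api_response)
    snap = data.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

# Shape of a response from http://archive.org/wayback/available?url=...
sample = """{"archived_snapshots": {"closest": {
    "available": true,
    "timestamp": "20150101000000",
    "url": "http://web.archive.org/web/20150101000000/http://hiberlink.org/"}}}"""

print(closest_snapshot(sample))
```

A paper whose links all return None from such a check is exactly the case the Hiberlink team worries about: nothing to fall back on once the live site drifts or dies.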

Van de Sompel and others have in the past few weeks rolled out a complementary approach. It relies on a service that Van de Sompel has co-developed called Memento, which he dubs “time travel for the web”. The Memento infrastructure provides a single interface for myriad online archives, allowing users access to all of the saved versions of a given web page. This infrastructure could potentially allow access to web-at-large links in any scholarly article, even if the linked sites go down. Publishers would have to incorporate a small piece of extra computer code in their articles, and the standard single weblinks would have to be replaced with three pieces of information — the live link, a cached link and its creation date — all wrapped in Van de Sompel's proposed machine-readable tags.
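What such a three-part reference might look like can be sketched as follows. The `data-` attribute names echo the Memento team's 'robust links' proposal, but they are shown here purely as an illustration of the idea, not as the published specification.

```python
def robust_link(live_url, cached_url, cached_date, text):
    """Bundle a live link, an archived copy and the snapshot's creation date
    into one HTML anchor, so that a reader (or a machine) can fall back to
    the archive when the live link rots."""
    return ('<a href="{0}" data-versionurl="{1}" '
            'data-versiondate="{2}">{3}</a>').format(
                live_url, cached_url, cached_date, text)

print(robust_link(
    "http://hiberlink.org",
    "http://web.archive.org/web/20150101000000/http://hiberlink.org/",
    "2015-01-01",
    "the Hiberlink project"))
```

The point of the wrapping is that the ordinary hyperlink keeps working as before, while Memento-aware software can read the extra attributes and route a reader to the archived version when the live one fails.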

Storage block

Van de Sompel says that he is “unbelievably enthusiastic” about the team's approach. But the solution depends on the cooperation of authors and publishers — who may be disinclined to help. Another issue is that web-page owners who hold copyright over content can demand that archives remove copies of it. They can also disallow archiving of their sites by including a file or line of code that prevents computer programs from 'crawling' over or capturing content — and many do. If Perma.cc, for instance, encounters such an exclusion code, it preserves the content in a 'dark archive'; to access a web page in a dark archive, the reader must contact a library participating in the Perma.cc project and request to see the site.
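The 'file or line of code' in question is typically a robots.txt rule. Python's standard library can illustrate how such an exclusion works; in this sketch the crawler name `ia_archiver` is the Internet Archive's, and the example URL is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A site owner can opt out of archiving with a robots.txt like this one,
# which turns away the Internet Archive's crawler but no other program.
rules = RobotFileParser()
rules.modified()                  # mark the rules as freshly fetched,
                                  # otherwise can_fetch() refuses everything
rules.parse([
    "User-agent: ia_archiver",    # the Internet Archive's crawler
    "Disallow: /",
])

print(rules.can_fetch("ia_archiver", "http://example.com/data.html"))   # False
print(rules.can_fetch("SomeOtherBot", "http://example.com/data.html"))  # True
```

A well-behaved archiving crawler that sees the first answer simply moves on, which is why services such as Perma.cc need the 'dark archive' workaround described above.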

Scholarly articles that are behind a paywall routinely exclude such crawling, too — although publishers have introduced the DOI system to ensure that scientists can confidently cite a persistent hyperlink to the right version of an online research article, even if the publisher changes its local web addresses. (In January, however, the system that redirects DOI links went down, showing that it is not immune to failure.) Publishing companies also guard against link rot by automatically preserving articles in archives; the articles can be released if the company folds.
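The persistence that the DOI system buys comes from indirection: a citation records only the DOI, and a central resolver redirects to wherever the publisher currently hosts the article. A minimal sketch, using this article's own DOI as the example:

```python
def doi_to_url(doi):
    """Turn a DOI, with or without the 'doi:' prefix, into a link through
    the central resolver, which redirects to the publisher's current copy."""
    if doi.startswith("doi:"):
        doi = doi[len("doi:"):]
    return "https://doi.org/" + doi

print(doi_to_url("doi:10.1038/521111a"))  # https://doi.org/10.1038/521111a
```

Because the publisher updates the resolver rather than the citation, the link survives site redesigns and changes of ownership; but as the January outage showed, the resolver itself is a single point of failure.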

But not all companies are archiving, says David Rosenthal, a staff member at the library of Stanford University in California; analysis of data from a monitoring service called The Keepers Registry shows that “at most 50% of articles are preserved”, Rosenthal writes on his blog (go.nature.com/jrwqo4). So for both web-at-large hyperlinks and scholarly articles, the Memento team's mission to solve reference rot may be “excessively optimistic”, he says.

Journal name: Nature
Volume: 521
Pages: 111–112
DOI: doi:10.1038/521111a

Author information

Jeffrey M. Perkel is a writer based in Pocatello, Idaho.


Comments


    Carlos Polanco

    To the editor:

    Considerations about the computer scientists and the scholarly literature
    Perkel's editorial [The trouble with reference rot, Nature] (1) clearly describes the growing problem of the missing 'virtual address' of the manuscripts referred to in scientific material. Although there are key elements that make it possible to track a reference (author, journal, number, volume and pages, the nomenclature used particularly in Nature's journals), this is an inconvenience to the reader, who must turn to a search engine to locate the referenced text. It is important to note that in most cases the information remains available on the internet. Sites such as the web-archiving service Perma.cc (https://perma.cc) (2) aim to protect references virtually. In my opinion, this is an important tool that will minimize the inconvenience to the reader, and it should therefore be encouraged by search engines such as Google, Yahoo, Bing, Boodigo and DuckDuckGo. In this way, it will spare those who reference the work of their peers in their publications the additional effort.
    Sincerely yours, Carlos Polanco, Ph.D. Universidad Nacional Autonoma de Mexico, Mexico City, Mexico.

    Carlos Polanco is an Associate Professor at the Department of Mathematics in the Universidad Nacional Autónoma de México, México city, México. (polanco@unam.mx)

    References
    1. Perkel, J. M. Nature 521, 111–112 (2015).
    2. Perma.cc, Harvard Law School Library, Langdell 263, 1545 Massachusetts Ave, Cambridge, MA 02138; https://perma.cc/ (accessed 6 May 2015).
