London

The Digital Libraries Initiative is a major US research project into electronic archiving. But when you click to its homepage from the website of one of its sponsors, the US National Library of Medicine, you get a message that's increasingly familiar to researchers: “URL not found”.

Now a study by Jonathan Wren, a bioinformatician at the University of Oklahoma in Norman, has revealed the scale of the problem (J. D. Wren Bioinformatics 20, 668–672; 2004). He found that nearly one-fifth of the websites mentioned over the past decade in abstracts on Medline, the clearing-house of papers used by biomedical researchers, have disappeared.

“The web is becoming a more prevalent source of support for research,” says Wren. “If that support is lost, it can make the work impossible to replicate.”

Wren was prompted to study the problem after noticing a misspelt web address in the Medline abstract for one of his papers. In addition to the missing websites, he found that a fifth of the URLs in abstracts published between 1994 and April 2003 are available only intermittently. He also checked 33 FTP sites, of which only 12 still work. Wren hasn't quantified how important the dead links are, but he feels that they are probably significant because they were mentioned in the abstracts.

And Robert Dellavalle, a dermatologist at the University of Colorado in Denver who is campaigning for better electronic archives, says that publishers' responses to the problem are inadequate. “Journals aren't doing anything to address the potential for electronic resources to disappear,” he says. “It's amazing what doesn't exist — one of my own articles on digital preservation isn't there any more!”

Last year, Dellavalle's own survey found that about 12% of the Internet addresses cited in The New England Journal of Medicine, The Journal of the American Medical Association and Science were extinct two years after publication (R. P. Dellavalle et al. Science 302, 787–788; 2003).

[Figure: Dead or alive? The availability of websites listed in Medline abstracts over the past decade. Credit: J. D. Wren Bioinformatics 20, 668–672; 2004]

Electronic resources in the physical sciences can be just as unstable, says Paul Ginsparg, a physicist at Cornell University in New York state, who runs the arXiv preprint service. “We've always insisted that arXiv submissions be complete and self-contained, and that external links be only to non-essential materials,” Ginsparg says.

Dellavalle believes that journals should compel authors to archive the electronic resources they cite. As a minimum, he says, authors should be required to print the resources out, and to submit and keep copies. He also thinks that journals should require authors to submit online references to the Internet Archive (http://www.archive.org), a non-profit digital library project that has links to the world's largest library, the US Library of Congress.

One journal, PLoS Biology, is asking authors to use the Internet Archive. Others are unsure how archives will be kept in the long term. “I don't think publishers have the resources to put everything somewhere it's going to last for ever,” says Tony Delamothe, web editor of the BMJ. “If links die or get lost over time, that's just tough.”

Maxine Clarke, publishing executive editor of Nature, says: “We do what we can to ensure that formal citations will stand the test of time, and we wouldn't let a significant part of a paper depend on a website.”

Many commercial publishers, working through a collaboration called CrossRef, want to give electronic documents permanent code numbers, known as digital object identifiers (DOIs), that will prevent them from getting lost, even when URLs change.