The future of the electronic scientific literature

Opinion, Nature

The Internet's transformation of scientific communication has only begun, but already much of its promise is within reach. The vision below may change in its detail, but experimentation and lack of dogmatism are undoubtedly the way forward.

"The Internet is easier to invent than to predict" is a maxim that time has proven to be a truism. Much the same might be said of scientific publishing on the Internet, the history of which is littered with failed predictions. Technological advance itself will, of course, bring dramatic changes — and it is a safe bet that bright software minds will punctually overturn any vision. But it is becoming clear that developing common standards will be critical in determining both the speed and extent of progress towards a scientific web.

'Standards' for managing electronic content are hardly a riveting topic for researchers. But they are key to a host of issues that affect scientists, such as searching, data mining, functionality and the creation of stable, long-term archives of research results (see Richard Rowe, Digital archives: how we can provide access to ‘old’ biomedical information). Moreover, just as the Internet and web owe their success to agreed network protocols on which others were able to build, common standards in science will provide a foundation for a diversity of publishing models and experiments and be a better alternative to 'one-size-fits-all' solutions (see Michael Keller, Innovation and service in scientific publishing requires more, not less, competition).

This explains why the Open Archives Initiative (OAI), one of many alternatives being offered to scientists to disseminate their work, has now broadened its focus from e-prints (see Stevan Harnad, The self-archiving initiative) to promoting common web standards for digital content.

The reason is that some of the most promising emerging technologies will realize their full potential only if they are adopted in a consensual fashion by entire communities. At the level of the online scientific 'paper', one major change, for example, is a shift in format to make papers more computer-readable (see Jon Bosak, Text markup and the cost of access). Searches will become much more powerful; tables and figures will cease to be flat, lifeless objects, and users will instead be able to query and manipulate them with suites of online visualization and data-analysis tools.

This is being made possible by Extensible Markup Language (XML), which allows a document to be tagged with machine-readable 'metadata', in effect converting it into a sort of mini-database. Most web pages today are coded in HTML, which carries information only about a page's appearance. HTML marks up title and author information simply as headings, for example:
<H1>The future of the electronic scientific literature</H1>
<H3>by John Smith</H3>
XML, by contrast, labels them in a way that computers can understand:
<articletitle>The future of the electronic scientific literature</articletitle>
<author><firstname>John</firstname> <lastname>Smith</lastname></author>

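To make the contrast concrete, here is a minimal Python sketch that reads the XML fragment above with the standard library and retrieves the title and author by tag name. The enclosing <article> element is an assumption, added only because a parser needs a single root element; the tag names are those of the example, not an agreed standard.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, wrapped in a hypothetical <article> root
# (an assumption: XML parsers require exactly one top-level element).
XML_PAPER = """
<article>
  <articletitle>The future of the electronic scientific literature</articletitle>
  <author><firstname>John</firstname> <lastname>Smith</lastname></author>
</article>
"""


def describe(xml_text):
    """Return the title and the author names found in a tagged paper."""
    root = ET.fromstring(xml_text.strip())
    title = root.findtext("articletitle")
    authors = [
        (author.findtext("firstname"), author.findtext("lastname"))
        for author in root.findall("author")
    ]
    return {"title": title, "authors": authors}


paper = describe(XML_PAPER)
print(paper["title"])
print(paper["authors"])
```

The same lookup against the HTML version could only ask for "the first heading", which says nothing about whether that heading is a title, an author or something else.
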
The possibilities for tagging are endless. But a major need now is for stakeholders to agree on common metadata standards for the basic structure of scientific papers. This would allow more specific queries to be made across large swathes of the literature. Indeed, what is above all hampering the usefulness of today's online journals, e-print archives and scientific digital libraries is the lack of means to federate these resources through unified interfaces.

The PrePRINT Network search engine, operated by the Office of Scientific and Technical Information (OSTI) of the US Department of Energy, already searches more than 4,000 preprint sites and intends to include the remaining 3,000 by the end of the year. This engine gives access to almost 400,000 preprints. But the sophistication of searches is limited because most e-print servers lack agreed metadata.

The OAI has now agreed metadata standards to facilitate improved searching across participating archives, which can therefore be queried by users as if they were one seamless site. The OAI is attractive compared with centralized archives in that it allows any group to create an archive while, by agreeing common standards, they become part of a greater whole. The idea is catching on: it is supported by the Digital Library Federation (DLF), a consortium of US libraries and agencies, including the Online Computer Library Center. CrossRef, a collaboration of 78 learned society and commercial publishers, in which Nature's publishers are taking a leading role, is also actively developing common metadata standards that would allow better cross-searching of the 3 million articles they hold.
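The idea of querying separate archives "as if they were one seamless site" can be sketched in a few lines of Python. This is an illustration of the principle only, not the OAI's protocol: the archive names, the records and the field names (title, creator, year) are all invented for the example.

```python
# Each archive keeps its own records but exposes the same minimal fields.
# Archive names, records and field names are illustrative assumptions.
ARCHIVES = {
    "physics-archive": [
        {"title": "Lattice models of protein folding", "creator": "A. Author", "year": 2000},
    ],
    "biology-archive": [
        {"title": "Protein folding in the cell", "creator": "B. Author", "year": 2001},
        {"title": "Signalling pathways in yeast", "creator": "C. Author", "year": 1999},
    ],
}


def federated_search(archives, field, term):
    """Search every archive on one shared field, as if they were one site."""
    term = term.lower()
    hits = []
    for name, records in archives.items():
        for record in records:
            # A record that lacks the field simply never matches.
            if term in str(record.get(field, "")).lower():
                hits.append((name, record["title"]))
    return sorted(hits)


results = federated_search(ARCHIVES, "title", "protein folding")
for archive, title in results:
    print(archive, "|", title)
```

The search function needs to know nothing about how each archive is run; it relies only on the shared field names, which is why agreement on even a small common set matters.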

Minimal options
As metadata are expensive to create — it is estimated that tagging papers with even minimal metadata can add as much as 40% to costs — OAI is developing its core metadata as a lowest common denominator to avoid putting an excessive burden on those who wish to take part. But even these skimpy metadata already allow one to improve retrieval. This strategy is sensible as it acknowledges the fact that the value and nature of scientific information are heterogeneous.

Minimal metadata will suffice for much of the literature. But there will increasingly be sophisticated and novel forms of publications built around highly organized communities working off large, shared data sets. These hubs will stand out by their large investment in rich metadata and sophisticated databases. The future electronic landscape should see such high added-value hubs evolving as overlays to vast but largely automated literature archives and databases.

In such an early stage of development, it is essential to avoid dogmatic solutions (see Robert Campbell, Information access: what is to be done?). Not all papers will warrant the costs of marking up with metadata, nor will much of the grey literature, such as conference proceedings or the large internal documentation of government agencies (see Walter Warnick, Tailoring access to the source: preprints, grey literature and journal articles). Many high-cost, low-circulation print journals could be replaced by digital libraries. Overheads would be kept low, and the economics argues that the cheapest means of handling the bulk of the literature may be automated digital libraries. Tags automatically generated from machine analysis of the text, for example, might minimize the quantity of manual metadata needed.

Or take ResearchIndex, software produced by the computer company NEC, which builds digital libraries with little human intervention (see Steve Lawrence, Free online availability substantially increases a paper's impact). It gathers scientific papers from around the web and, using simple rules based on document formatting, can extract the title, abstract, author and references. It interprets the latter, and can conduct automatic citation analyses for all the papers indexed. Such digital libraries will also provide new tools, for example to generate new metrics based on user behaviour, which will complement and even surpass citation rankings and impact factors (see Rick Luce, Evolution and scientific literature: towards a decentralized adaptive web).
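A toy version of rule-based extraction might look like the Python below. It is a sketch under stated assumptions, not ResearchIndex's actual method: the sample paper is invented, and the three rules (first line is the title, a section runs from its heading to the next blank line, reference entries begin with a bracketed number) are hypothetical stand-ins for the formatting rules such software might apply.

```python
import re
from collections import Counter

# A made-up plain-text paper; its layout conventions are assumptions.
PAPER = """A hypothetical study of citation behaviour

Abstract
We describe a toy example.

References
[1] A. Author. First cited work. 1998.
[2] B. Author. Second cited work. 1999.
"""


def section(lines, heading):
    """Return the lines that follow `heading`, up to the next blank line."""
    if heading not in lines:
        return []
    body = []
    for line in lines[lines.index(heading) + 1:]:
        if not line:
            break
        body.append(line)
    return body


def extract(text):
    """Pull title, abstract and references out of a plain-text paper."""
    lines = [line.strip() for line in text.strip().splitlines()]
    references = [
        re.sub(r"^\[\d+\]\s*", "", line)  # drop the leading "[n]" marker
        for line in section(lines, "References")
        if re.match(r"^\[\d+\]", line)
    ]
    return {
        "title": lines[0],  # rule: the first line is the title
        "abstract": " ".join(section(lines, "Abstract")),
        "references": references,
    }


def citation_counts(papers):
    """Count how often each reference string is cited across papers."""
    counts = Counter()
    for paper in papers:
        counts.update(extract(paper)["references"])
    return counts


info = extract(PAPER)
print(info["title"])
print(info["references"])
```

Real documents are far messier than this sample, which is why such automated libraries trade some accuracy for very low cost per paper.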

At the other end of the spectrum, specialized communities organized around shared data sets will produce highly sophisticated electronic 'publications', making it much more arduous for authors to submit information because of the amount and detail they will be required to enter in machine-readable form. Take the Alliance for Cellular Signaling (AfCS), a 10-year, multimillion-dollar, multidisciplinary project run by a consortium of 20 US institutions. It is taking a systems view of proteins involved in signalling, and integrating large amounts of data into models that will piece together how cellular signalling functions as a whole in the cell. Here, authors would be required to input information, for example, on the protocols, tissues, cell types, specific concentration factors used and the experimental outcomes. Inputs would be chosen from menus of strictly defined terms and ranges, corresponding to predefined knowledge representations and vocabularies for cell signalling.
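What "menus of strictly defined terms and ranges" implies for an author can be sketched as a simple validation step. The field names, terms and numerical range below are hypothetical and are not taken from the AfCS; the point is only that a submission is rejected unless every entry comes from the agreed vocabulary.

```python
# Hypothetical menus and ranges, invented for illustration;
# they are not the AfCS vocabulary.
ALLOWED_TERMS = {
    "cell_type": {"fibroblast", "neuron"},
    "protocol": {"western blot", "calcium imaging"},
}
ALLOWED_RANGES = {"ligand_concentration_nM": (0.0, 1000.0)}


def validate(submission):
    """Return a list of problems; an empty list means the entry is acceptable."""
    errors = []
    for field, menu in ALLOWED_TERMS.items():
        value = submission.get(field)
        if value not in menu:
            errors.append(f"{field}: {value!r} is not a defined term")
    for field, (low, high) in ALLOWED_RANGES.items():
        value = submission.get(field)
        if not isinstance(value, (int, float)) or not low <= value <= high:
            errors.append(f"{field}: {value!r} is outside {low}-{high}")
    return errors


good = {"cell_type": "neuron", "protocol": "calcium imaging", "ligand_concentration_nM": 50}
bad = {"cell_type": "nerve cell", "protocol": "calcium imaging", "ligand_concentration_nM": 5000}

print(validate(good))
print(validate(bad))
```

The free-text entry "nerve cell" fails even though a human reader would accept it, which illustrates both the extra burden on authors and the consistency gained for later queries.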

The idea is that, rather than simply producing their own data, communities instead create a vast, shared pool of well-structured information, and benefit by being able to make much more powerful queries, simulations and data mining. A series of 'molecule pages' would also pull together virtually all published data and literature about individual molecules in relation to signalling.

Indeed, the high-throughput nature of much of modern research means that, increasingly, important results can be fully expressed only in electronic rather than print format. Systems biology in particular is driving research that seeks to describe the function of whole pathways and networks of genes and proteins, and to cover scales ranging from atoms and molecules to organisms. Increasingly, the literature and biological databases will converge to create new forms of publications. Other disciplines stand to benefit, too.

Helping machines make sense of science on the web
Many communities, including the AfCS, are building ontologies to underpin such schemes. Ontologies mean different things to different people, but they are in effect representations that attempt to hard-code human knowledge about a topic and the intrinsic relationships in ways that computers can use (see Tim Berners-Lee & James Hendler, Scientific publishing on the 'semantic web'). The microarray community has been very active in this area. The Microarray Gene Expression Database group has coordinated global standards; as a result, users will be able to query vast shared data sets to find all experiments that use a specified type of biological material, test the effects of a specified treatment or measure the expression of a specified gene, and much more.

One major problem is that genes and proteins often have different names in different organisms, and these often say little about what they do. To get round this problem, the Gene Ontology (GO) Consortium is creating tree-like ontologies of the 'molecular function', 'biological process' and 'cellular component' of gene products. All genes involved in 'DNA repair', for example, would be mapped to the corresponding GO term, irrespective of their name or source organism. A microarray gene-expression analysis that previously yielded only names of expressed genes would in addition carry mapped GO terms that might reveal, say, that half the genes are involved in 'protein folding'. GO terms can also help to federate disparate databases.
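The mapping described above can be sketched in Python as follows. The gene names and their annotations are made up for illustration and are not real GO data; only the two term labels come from the text.

```python
from collections import Counter

# Made-up gene names and annotations, for illustration only.
GENE_TO_GO = {
    "geneA_yeast": {"protein folding"},
    "geneB_mouse": {"protein folding", "DNA repair"},
    "geneC_fly": {"DNA repair"},
    "geneD_human": {"protein folding"},
}


def summarize(expressed_genes, gene_to_go):
    """Count how many expressed genes carry each GO term."""
    counts = Counter()
    for gene in expressed_genes:
        # Genes without any annotation contribute nothing.
        counts.update(gene_to_go.get(gene, ()))
    return counts


def genes_for_term(term, gene_to_go):
    """Find every gene mapped to a term, whatever its name or organism."""
    return sorted(gene for gene, terms in gene_to_go.items() if term in terms)


summary = summarize(["geneA_yeast", "geneB_mouse", "geneC_fly"], GENE_TO_GO)
print(summary)
print(genes_for_term("DNA repair", GENE_TO_GO))
```

Because the query is made on the shared term rather than on gene names, it works across organisms and across databases that adopt the same vocabulary.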

Ontologies can also be used to tag literature automatically, and will be particularly useful for grey literature and archival material for which manual tagging was not justified. Papers tagged automatically with concepts can be matched, grouped into topic maps and mined. By breaking down terminological barriers between disciplines, this should also enhance interdisciplinary understanding and even serendipity. Nature is actively investigating such possibilities.

The GO ontologies are still very incomplete, however, and the internal relationships need to be enriched. Moreover, caution is required against prematurely pigeon-holing gene functions, given the uncertainty of most annotations. Ontologies are also the focus of intensive research in computing science, and biology is not yet up to speed on this. Efforts such as GO and the Bio-Ontologies Consortium deserve support. Indeed, given the shortcomings of existing ontologies and controlled vocabularies, there may be a case for creating a more organized international effort to ensure economy of effort, interoperability and sharing of expertise.

The advent of structured papers that are increasingly held in literature databases further blurs the distinction between the scientific paper and entries in biological databases. Already, entries in the biological databases are often hyperlinked to relevant articles in the literature and vice versa, and CrossRef is developing standards for such linking. As text becomes more structured, it will be possible to increase the sophistication of linking, data manipulation and retrieval alike.

Biological databases and journals have evolved relatively independently of one another. Database annotations lack the prestige of published papers; indeed, their value is largely ignored by citation metrics, and their upkeep is often regarded as a thankless task. Database curation has consequently lacked the quality control typical of good journals. The convergence between databases and the literature means that database annotators and curators will increasingly perform the functions of journal editors and reviewers, while publishers will develop sophisticated database platforms and tools.

New ways in
Database- and metadata-driven systems will drive interfaces to publications from simple keyword search models to ones that reflect the structure of biological information. Visualization tools of chromosomal location, biochemical pathways and structural interactions may become the obvious portals to the wider literature, given that there are far fewer protein structures or gene sequences than there are articles about them (see Mark Gerstein & Jochen Junker, Blurring the boundaries between the scientific 'papers' and biological databases). As Gerstein, a bioinformaticist at Yale University, points out: "One might 'fly through' a large three-dimensional molecular structure, such as the ribosome, where various surface patches would be linked to publications describing associated chemical binding studies."

Future electronic literature will therefore be much more heterogeneous than the current journal system, and dogmatic solutions should be resisted (see Ann Okerson, What price 'free'?). It is significant and sensible that both CrossRef and OAI have made key strategic choices favouring openness and adaptability. They seek to federate distributed actors rather than to create centralized structures. They also make their work independent of the type of content, which makes it flexible enough to incorporate and link seamlessly not just papers but news, books and other media (see Ed Pentz, Evolution and revolution: pragmatism versus dogmatism).

Future electronic journals will also increasingly be dynamic, accompanied by annotations, updates, and links to more recently published papers. This suggests that a dynamic, continually changing distributed archive, managed by those closest to the content, may also be the best solution for long-term preservation. Technical and organizational best practices for long-term repositories are currently the focus of a plethora of discussions worldwide. But the very concept of a fixed repository may be obsolete, a hangover from the days of print.

Crucially, both OAI and CrossRef have also decided to build systems independent of the economic mechanisms surrounding that content. Many publishers, in particular some learned societies, may be willing to make their content free, perhaps after a certain delay (see Michael Eisen & Pat Brown, Should the scientific literature be privately owned and controlled?). Others are exploring business models where authors or sponsors pay, which would allow free access to articles on publication (see Thomas Walker, Authors willing to pay for instant web access). The open technological frameworks also mean that particular communities, such as scientists with specific metadata needs for their discipline, are free to build in more complex data structures; the higher overheads incurred may require charging for added-value services.

The OAI and CrossRef strategies therefore differ fundamentally from more centralized systems proposed by PubMed Central (PMC) (see Jo McEntyre & David Lipman, GenBank - a model community resource?), operated by the US National Library of Medicine, and E-Biosci, being developed by the European Molecular Biology Organization (see Frank Gannon, Boycott! & Les Grivell, E-Biosci: a European approach to handling biological information).

The question of whether to merge datasets or content in a single centralized repository ('data warehousing') or to link them in a federation of distributed resources is a common one facing scientific databases and international research collaborations. The former works well for large amounts of data that are in standard formats, and originate from a handful of large centres. This is the case for gene sequence data, which explains the success of GenBank. In many other areas, such as structural and functional genomics, data are more heterogeneous, and their sources more dispersed, and these communities tend to put the emphasis on federated databases curated by those closest to the data.

But PMC and E-Biosci highlight the urgent need to index the full text of papers and their metadata and not just abstracts, as is the practice of PubMed and other aggregators. Services that require publishers to deposit full text only for indexing and improving search are useful, and they can simplify some of the technical issues associated with federating distributed resources (see Matt Cockerill, Distributed and centralized technologies: complementary tools to build a permanent digital archive).

Unfortunately, PMC, unlike E-Biosci, confounds this primarily technological issue with an economic one, by requiring that all text be made available free after, at most, one year. It is regrettable that PMC did not first pursue full-text indexing as a goal in its own right, as this alone would be an immediate boon to researchers. It would also probably have been more successful in attracting publishers.

The reality is that all of those involved in scientific publishing are in a period of intense experimentation, the outcome of which is difficult to predict (see Tim O'Reilly, Information wants to be valuable). Getting there will require novel forms of collaboration between publishers, databases, digital libraries and other stakeholders. It would be unwise to put all of one's eggs in the basket of any one economic or technological 'solution'. Diversity is the best bet.

This article is a slightly expanded version of that published in the print edition of Nature, and also includes links to some of the contributions to this forum which inspired it. Feedback is welcome, and we also encourage you to let us know of any specific services, functionality, or needs which you consider would most benefit your own work, or that of your research community; use this forum to let us know, or e-mail comments directly to Declan Butler, Nature's European correspondent (