Sir

Your Editorial “The database revolution” (Nature 445, 229–230; 2007; doi:10.1038/445229b) highlighted the difficulty of maintaining a stable information architecture for biology, in terms of both funding it consistently and evolving a common format.

In addition to the suggestions you made, we urge journals to take the lead in making articles suitable for digital parsing and text mining by providing a structured digital abstract (M. R. Seringhaus & M. B. Gerstein BMC Bioinformatics 8, 17; 2007).

The distinction between journals and databases is blurring. New structures, genome sequences and microarray results reported in journal articles are automatically deposited in large databases, while the articles themselves in these disciplines are largely accessed in electronic form via PubMed queries. In the future, the text of articles will be systematically mined by computer programs, allowing journal text to be interrelated with the vast repository of knowledge stored in databases. But making these interconnections now is challenging. With few exceptions, the facts published in journals are not in a format easily parsed by computer: in particular, text-mining programs have difficulty linking names to database objects and identifying key findings from the language of a paper.

The structured abstract would act as a gateway for text-mining engines to access an article, much as the traditional abstract now does for readers. It would consist of three main elements. First is a translation table or 'cast of characters', which lists all named genes, proteins, metabolites or other objects in the article and relates their human-readable names to precise database identifiers. Second is a list of the main results, described in simple ontologies using controlled vocabulary: for example, interactions ('protein A binds to protein B'), phenotypes ('mutation C suppresses deletion D') and protein modifications ('protein E is phosphorylated at residue F by protein kinase G'). Third is a set of standard evidence codes for how the results were obtained, such as 'affinity purification' or 'mass spectrometry'. Thus the structured abstract is not only a synopsis of the results but is also readily computer-readable.
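
To make the three elements concrete, one possible in-memory representation is sketched below in Python. It is an illustration under stated assumptions, not a proposed standard: the class names, the `PREDICATES` and `EVIDENCE_CODES` vocabularies, the `EXAMPLE_DB` label and the `EX:0001`-style identifiers are all hypothetical placeholders, and a real scheme would take its terms and accessions from community-agreed ontologies and databases.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical controlled vocabularies; a real scheme would be community-defined.
PREDICATES = {"binds", "suppresses", "phosphorylates"}
EVIDENCE_CODES = {"affinity purification", "mass spectrometry"}


@dataclass(frozen=True)
class Entity:
    """One row of the 'cast of characters' translation table."""
    name: str        # human-readable name used in the article text
    database: str    # placeholder database label, not a real resource
    identifier: str  # placeholder accession, not a real record


@dataclass(frozen=True)
class Result:
    """One main finding, stated with controlled terms plus an evidence code."""
    subject: str
    predicate: str
    obj: str
    evidence: str


@dataclass
class StructuredAbstract:
    cast: Dict[str, Entity] = field(default_factory=dict)
    results: List[Result] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.cast[entity.name] = entity

    def add_result(self, result: Result) -> None:
        # Reject anything outside the vocabularies or the translation table.
        if result.predicate not in PREDICATES:
            raise ValueError(f"unknown predicate: {result.predicate}")
        if result.evidence not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {result.evidence}")
        for name in (result.subject, result.obj):
            if name not in self.cast:
                raise KeyError(f"{name} is not in the cast of characters")
        self.results.append(result)

    def resolve(self, result: Result) -> Tuple[str, str, str]:
        """Swap human-readable names for database identifiers."""
        return (self.cast[result.subject].identifier,
                result.predicate,
                self.cast[result.obj].identifier)


abstract = StructuredAbstract()
abstract.add_entity(Entity("protein A", "EXAMPLE_DB", "EX:0001"))
abstract.add_entity(Entity("protein B", "EXAMPLE_DB", "EX:0002"))
finding = Result("protein A", "binds", "protein B", "affinity purification")
abstract.add_result(finding)
print(abstract.resolve(finding))
```

Checking every result against the translation table and the vocabularies at entry time is the point of the design: a text-mining engine reading such a record would never have to guess which database object a name refers to.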

Such digital summaries could be produced by authors and editors as part of the editorial process, subject to peer review and copy editing. They could be published on journals' websites, using semantic web standards such as XML and OWL, and indexed by central repositories for fast look-up.
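
As a rough sketch of what publication in XML might look like, the Python fragment below serializes a small digital summary using the standard library's `xml.etree.ElementTree` module. The element and attribute names (`structuredAbstract`, `cast`, `entity`, `result` and so on), the `EXAMPLE_DB` label and the `EX:` identifiers are assumptions made for illustration only; no agreed schema exists in the text above, and an OWL encoding is not attempted here.

```python
import xml.etree.ElementTree as ET


def to_xml(cast, results):
    """Serialize a digital summary to an XML string.

    cast:    maps a human-readable name to a (database, identifier) pair
    results: (subject, predicate, object, evidence) tuples that use cast names
    """
    root = ET.Element("structuredAbstract")  # hypothetical element name

    # Translation table: human-readable names to database identifiers.
    cast_el = ET.SubElement(root, "cast")
    for name, (database, identifier) in cast.items():
        ET.SubElement(cast_el, "entity",
                      {"name": name, "database": database, "id": identifier})

    # Main results, written with identifiers so that no name needs guessing.
    results_el = ET.SubElement(root, "results")
    for subject, predicate, obj, evidence in results:
        ET.SubElement(results_el, "result",
                      {"subject": cast[subject][1],
                       "predicate": predicate,
                       "object": cast[obj][1],
                       "evidence": evidence})

    return ET.tostring(root, encoding="unicode")


# Placeholder identifiers, not real database records.
example_cast = {
    "protein A": ("EXAMPLE_DB", "EX:0001"),
    "protein B": ("EXAMPLE_DB", "EX:0002"),
}
example_results = [("protein A", "binds", "protein B", "affinity purification")]

xml_text = to_xml(example_cast, example_results)
print(xml_text)
```

A central repository could index such records by the identifiers in the `result` elements, which is what would make fast look-up across journals feasible.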

Adoption of the structured abstract would require action by scientists and editors to establish formats and vocabularies, as was done for Gene Ontology (Nature Genet. 25, 25–29; 2000). Early incorporation by a few journals or a single community — for example, yeast researchers — could provide a prototype before it enters widespread use.