Box 1. Data warehouses

From the following article

Linking publication, gene and protein data

Paul Kersey & Rolf Apweiler

Nature Cell Biology 8, 1183 - 1189 (2006)

doi:10.1038/ncb1495


A data warehouse is a database constructed to support efficient querying of the data it contains (in contrast to normalized databases designed to support data integrity, which are widely used to maintain primary resources). A feature of many data warehouses used in bioinformatics is that they provide generic query interfaces (for example, computer languages and graphical user interfaces) applicable to all the data they contain, thus enabling the addition of new data without the need for interface redesign. A single warehouse may be built from several different resources, but to allow the construction of queries that filter and/or extract information derived from more than one of these resources, the data must be fitted into a single model that captures the relationships between the different sources. This is often done by exploiting the cross-references that many of these sources contain. The centralization of data in a data warehouse is an alternative to the use of technologies designed to support distributed queries. Distributed approaches can avoid certain problems associated with data warehousing, such as the synchronization of updates, but queries that access large quantities of data spread over many locations can be difficult to optimize.
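As a rough illustration of the idea of fitting several sources into a single model through their cross-references, the following minimal sketch loads two small sources into one in-memory SQLite database and runs a query that spans both. Everything in it is an assumption made for illustration: the table and function names, the identifiers (ACC1, PUB1 and so on) and the records themselves are invented and do not correspond to any real resource or to the design of any particular warehouse.

```python
import sqlite3

# Hypothetical records from two independent sources (all values are invented).
PROTEIN_SOURCE = [("ACC1", "protein alpha"), ("ACC2", "protein beta")]
PUBLICATION_SOURCE = [("PUB1", "Example paper one"), ("PUB2", "Example paper two")]

# Hypothetical cross-references carried by the protein source:
# each pair links a protein accession to a publication identifier.
CROSS_REFERENCES = [("ACC1", "PUB1"), ("ACC1", "PUB2"), ("ACC2", "PUB2")]


def build_warehouse():
    """Load both sources into one model, linked by their cross-references."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE publication (pub_id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE protein_publication (accession TEXT, pub_id TEXT)")
    conn.executemany("INSERT INTO protein VALUES (?, ?)", PROTEIN_SOURCE)
    conn.executemany("INSERT INTO publication VALUES (?, ?)", PUBLICATION_SOURCE)
    conn.executemany("INSERT INTO protein_publication VALUES (?, ?)", CROSS_REFERENCES)
    conn.commit()
    return conn


def proteins_by_title_keyword(conn, keyword):
    """Filter on publication data, extract protein data: a cross-source query."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.accession, p.name
        FROM protein AS p
        JOIN protein_publication AS x ON x.accession = p.accession
        JOIN publication AS pub ON pub.pub_id = x.pub_id
        WHERE pub.title LIKE '%' || ? || '%'
        ORDER BY p.accession
        """,
        (keyword,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    warehouse = build_warehouse()
    print(proteins_by_title_keyword(warehouse, "two"))
```

Because the query is written in a generic language (here SQL) against the shared model, a further source could in principle be added as new tables and cross-reference rows without changing the query interface. The cost, as noted above, is that the local copy must be kept synchronized with updates to the original sources.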