To say that DNA arrays have become standard tools of molecular geneticists would be an exaggeration. They are well on their way, however, and it is now time to tackle issues such as standardization of array data analysis and presentation as well as reproducibility and validation. At a recent workshop in Tartu, Estonia*, participants not only exchanged results on array technology and its applications, but also engaged in a round-table discussion on the need for standardization. This coincides with the commitment by the European Bioinformatics Institute (EBI) to establish a public repository for array-based gene expression data. While much thought has gone into array data warehousing and analysis (ref. 1), realizing such a database will not be an easy task. But being able to exchange and compare array results sooner rather than later will not only facilitate scientific discoveries, but also be crucial for advancing the technology. In addition, more array-based expression data in the public domain should catalyse the development of bioinformatics tools for their analysis, thereby widening the most restrictive bottleneck of comprehensive expression analysis (as also discussed by Bittner et al. on page 213).

The analogy between array data and sequence data (and their respective databases) is an obvious one, and many of the lessons learned from DNA sequence repositories will also be relevant to array data. The interpretation of array-based expression data is, however, tied more closely to the details of the experiment; small differences in probe sequences or target preparation can cause large differences in superficially very similar experiments ('probe' refers to the immobilized nucleic acid tethered on the surface and 'target' to the free nucleic acid that is being interrogated). Differential splicing can, for example, cause probes from one exon to behave quite differently from sequences taken from other parts of the same gene. Similarly, targets prepared by different methods (such as 3′ end labelling or complete labelling), or even from mRNA batches of slightly different quality, are likely to give different results. Compared with sequencing errors, it will be considerably more difficult to detect artefacts in array expression experiments, as there are no obvious 'reference' patterns (in contrast to a 'reference' sequence).

In exploring the possibility of a centralized array expression database, Alan Robinson and colleagues from EBI have been in discussion with developers and users of the technology around the world. Robinson was pleasantly surprised by the level of interest and support, as well as general agreement on at least some of the basic parameters for such a database. The central part of an entry is likely to consist of an XML file describing the data, that is, the intensity levels of every spot on the array. (XML stands for 'extensible markup language', a flexible way to create information formats and to share both format and data over the World-Wide Web.) Whether there would also be an attached graphic file with the 'raw' data (the visual pattern on the array) is still unresolved. Given the variations in the technology and the current lack of standard image analysis software, providing the array images in addition to the descriptive XML files seems important. A standard format for these images (such as 16-bit TIFF) would be desirable.
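As a minimal sketch of what an XML entry holding the intensity of every spot could look like, the following Python example writes and re-reads such a file using the standard library. The element and attribute names ('array_entry', 'spot', 'row', 'column', 'intensity') and the example values are assumptions made purely for illustration; no format has been agreed, and this is not the EBI's design.

```python
import xml.etree.ElementTree as ET


def build_entry(array_id, spots):
    """Build a hypothetical XML entry from (row, column, intensity) tuples."""
    root = ET.Element("array_entry", id=array_id)
    for row, column, intensity in spots:
        # One element per spot; all names here are illustrative assumptions.
        ET.SubElement(
            root,
            "spot",
            row=str(row),
            column=str(column),
            intensity=str(intensity),
        )
    return root


def read_intensities(root):
    """Return a {(row, column): intensity} mapping from a parsed entry."""
    return {
        (int(spot.get("row")), int(spot.get("column"))): float(spot.get("intensity"))
        for spot in root.iter("spot")
    }


# Made-up intensities for two spots, serialized and then parsed back.
entry = build_entry("example-001", [(1, 1, 1520.0), (1, 2, 87.5)])
xml_text = ET.tostring(entry, encoding="unicode")
restored = read_intensities(ET.fromstring(xml_text))
```

Because the entry is plain text, any user of the database could re-read the intensities with generic tools, independently of the image analysis software that produced them.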

Whatever the agreed format, the way the data are displayed and described must permit evaluation, comparison and re-analysis by any user of the database. To this end, a considerable amount of additional information needs to be made available in the form of annotation. Such annotation would include an exact definition of the immobilized probes (the sequence of an oligonucleotide or PCR product if it is known, and clone identities and primer sequences for unknown sequences), a thorough description of the source and preparation of the target, and details of the hybridization reaction. Whether annotation should be in the form of 'free text' or make use of a limited number of standardized keywords is currently under discussion.
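To illustrate how a list of minimal annotations might be checked before an entry is accepted, here is a hedged Python sketch. The class names, field names and example values are hypothetical; only the categories themselves (probe definition, target source and preparation, hybridization details) come from the discussion above, and the rule that a probe needs either a sequence or a clone identity plus primer sequences is one possible reading of it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ProbeAnnotation:
    """One immobilized probe; all field names are hypothetical."""
    spot_id: str
    sequence: Optional[str] = None
    clone_id: Optional[str] = None
    primer_sequences: Optional[Tuple[str, str]] = None

    def is_defined(self) -> bool:
        # A known sequence is sufficient; otherwise require both the
        # clone identity and the primer sequences.
        if self.sequence:
            return True
        return bool(self.clone_id and self.primer_sequences)


@dataclass
class EntryAnnotation:
    """Free-text annotation fields for one database entry (illustrative)."""
    target_source: str = ""
    target_preparation: str = ""
    hybridization_details: str = ""
    probes: List[ProbeAnnotation] = field(default_factory=list)

    def missing_items(self) -> List[str]:
        """List the annotation items that are absent or incomplete."""
        missing = []
        for name in ("target_source", "target_preparation", "hybridization_details"):
            if not getattr(self, name).strip():
                missing.append(name)
        for probe in self.probes:
            if not probe.is_defined():
                missing.append(f"probe definition for spot {probe.spot_id}")
        return missing


example = EntryAnnotation(
    target_source="mouse liver mRNA",
    target_preparation="",  # deliberately left empty
    hybridization_details="placeholder for temperature, buffer and duration",
    probes=[
        ProbeAnnotation(spot_id="A1", sequence="ACGTACGTACGT"),
        ProbeAnnotation(spot_id="A2", clone_id="clone-0001"),  # primers missing
    ],
)
gaps = example.missing_items()
```

The fields here hold free text; a keyword-based scheme would replace the strings with values drawn from a controlled list, which is exactly the choice still under discussion.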

Other crucial issues are data quality and reproducibility. Most groups making and using arrays have developed sets of internal standards to validate their results and established guidelines on the need for replication of individual experiments. As pointed out in Tartu by Thomas Gingeras (Affymetrix), "in the field of microarray technology, natural selection has not yet had a chance to act". Thus, while life would be easier if everyone used standard materials provided by commercial suppliers or resource centres, it seems likely that, at least in the short term, scientists will apply different array technologies to similar biological questions. Under these circumstances, it is imperative that suppliers and users of the different technologies come up with normalization methods that will allow cross-referencing and cross-validation.

Hans Lehrach and colleagues (Max-Planck-Institute for Molecular Genetics, Berlin) use two Arabidopsis clones in dilution series for every experiment interrogating expression in mouse and man. Both controls are deposited on the array with each individual spotting device, to take into account variation between individual devices. During target preparation, one of the clones is spiked into the RNA labelling mix and the second is labelled separately and mixed into the hybridization solution. The dilution series allows normalization of varying intensities arising from differences in the labelling and/or hybridization reactions. In order to assess reproducibility, four independent hybridization reactions per target sample are performed; two each on arrays with different spotting patterns, to control for differences between individual hybridization reactions and for 'overshining' artefacts caused by spot distribution. "If those two clones were present on every array manufactured worldwide, variation between different platforms and different laboratories could be measured—something that would be very useful but is not possible at the moment," Lehrach argues. He adds that "it should be relatively easy to provide these controls to everybody in the community". To address uniformity across analyses, more basic standards from a central source, such as a pair of DNA samples (one to be immobilized on the array, and a second pre-labelled one to be spiked into the hybridization mix) would also be useful.
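How a dilution series of control clones could be used to put intensities from different hybridizations on a common scale can be sketched as follows. This is an illustration under stated assumptions, not the procedure used by Lehrach and colleagues, which is not described in detail here: it assumes that control intensity is roughly linear in the amount of control on a log-log scale, fits that line per array by ordinary least squares, and then expresses each spot as the equivalent relative amount of control. All function names and numbers are made up.

```python
import math


def fit_log_line(amounts, intensities):
    """Least-squares fit of log10(intensity) against log10(amount).

    Returns (slope, intercept) for one array's control dilution series.
    """
    xs = [math.log10(a) for a in amounts]
    ys = [math.log10(i) for i in intensities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def to_relative_amount(intensity, slope, intercept):
    """Convert a raw spot intensity to control-equivalent relative amount."""
    return 10 ** ((math.log10(intensity) - intercept) / slope)


# Hypothetical control series: the same relative amounts on two arrays,
# where array B is uniformly twice as bright (e.g. more efficient labelling).
control_amounts = [1.0, 10.0, 100.0]
fit_a = fit_log_line(control_amounts, [50.0, 500.0, 5000.0])
fit_b = fit_log_line(control_amounts, [100.0, 1000.0, 10000.0])

# A spot reading 500 on array A and 1000 on array B maps to the same
# relative amount once each array is calibrated against its own controls.
spot_a = to_relative_amount(500.0, *fit_a)
spot_b = to_relative_amount(1000.0, *fit_b)
```

The same calibration would only be comparable across platforms and laboratories if identical controls were present on every array, which is the point of the proposal above.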

Establishing a common format to exchange array data will facilitate peer-review of such data. Once a suitable database is established, this journal will make it a condition of publication that data are deposited at the time of publication. In the meantime, we encourage discussion of the issues outlined above, especially the question of common standards and a list of minimal annotations that will ensure the utility of an entry. The topic will also be on the agenda at Nature Genetics' upcoming Microarray Meeting**. We welcome comments and suggestions prior to the meeting; these should be addressed to natgen@natureny.com.

*HUGO/EU Workshop on DNA arrays—Methods and Applications, May 23–26, 1999, Tartu, Estonia (http://www.hugochip.ebc.ee).

**The Microarray Meeting—Technology, Applications & Analysis, September 22–25, 1999, Scottsdale, Arizona (http://genetics.nature.com/Micro2.html).