[Image: Data, data everywhere. Credit: Jeremy Hasseman]

The massive amount of microarray data collected so far has been generated on multiple platforms and is stored in a host of different formats, levels of detail and locations. This makes it difficult for any group to re-analyse or verify the data, or compare the results with their own. “It's apples to oranges,” says Steven Gullans of the department of medicine at Brigham and Women's Hospital/Harvard Institutes of Medicine in Boston, Massachusetts.

Moreover, there are no uniform standards for reporting microarray data in journal articles, and there is no requirement for authors to deposit their data — and any supporting information — in the public domain. “I think the journals have to force it,” says Gullans, “just like they forced us to put sequence data in the public databases, and they are a little at a loss how to do that.”

Although most researchers agree that public databases for microarray data are a good idea, many are hesitant about depositing their own data in the public repositories now being developed. These include the Gene Expression Omnibus (GEO), operated by the US National Center for Biotechnology Information (NCBI); ArrayExpress, run by the European Bioinformatics Institute (EBI) in the UK; and CIBEX, the gene-expression database being developed by the DNA Data Bank of Japan.

[Image: Quackenbush supports data standards. Credit: Jeremy Hasseman]

“I think everyone realizes that the value of [microarray] data is not in looking at them in isolation but really trying to look at them in a broader context,” says John Quackenbush, head of the whole-genome functional analysis group at The Institute for Genomic Research in Rockville, Maryland.

The problem is that expression data are much richer than sequence data, and many factors can affect how genes are expressed. You need to capture more information, says Quackenbush, including details of the experimental design, array design, samples, controls and experimental conditions, and the data manipulation and analysis methods used.
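As a rough illustration of what capturing this extra information could involve, the sketch below models one experiment as a record with a field for each category Quackenbush lists, and reports which fields are still empty. The class name, field names and sample values are hypothetical, chosen for this example only; they are not taken from MIAME or from any repository's submission format.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ExperimentRecord:
    """Hypothetical annotation record; field names are illustrative only."""
    experimental_design: str = ""
    array_design: str = ""
    samples: str = ""
    controls: str = ""
    experimental_conditions: str = ""
    data_processing: str = ""  # data manipulation and analysis methods


def missing_fields(record: ExperimentRecord) -> List[str]:
    """Return the names of fields left empty, in declaration order."""
    return [
        f.name
        for f in fields(record)
        if not getattr(record, f.name).strip()
    ]


# Made-up example values: only three of the six categories are filled in.
record = ExperimentRecord(
    experimental_design="time course, three replicates",
    array_design="spotted cDNA array",
    samples="cultured cells",
)

print(missing_fields(record))
```

A check of this kind only confirms that something has been entered for each category. Whether the description is detailed enough for another group to re-analyse the data remains a judgement for authors and reviewers.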

The Microarray Gene Expression Data (MGED) group was established in 1999 to develop a framework for describing information about a DNA microarray experiment, as well as a standard format for data exchange. The first version of its MIAME (minimum information about a microarray experiment) standard was proposed last year (see Nature Genet. 29, 365–371; 2001 and Nature 415, 946; 2002). The MGED is also developing the MAGE-ML (Microarray Gene Expression Markup Language) data-exchange format along with the Life Sciences Research Task Force of the Object Management Group (OMG), a software standards organization. MAGE-ML moved a step closer to implementation after a recent vote within the OMG.

“It all boils down to whether we want to continue in the life sciences with a tradition that the supporting data should be available, or not,” says Alvis Brazma, team leader for microarray informatics at the EBI. Brazma is responsible for spearheading efforts to adopt minimum standards for microarray data and a standard data-exchange format.

The MGED has sought the input of the microarray community, including software and hardware companies. Rosetta Inpharmatics, for example, was working on its own standard, but has since joined forces with the MGED. “Our goal was to have a standard that everyone would use and that was at risk if we had a lot of smart folks working on two different applications,” says Doug Bassett, vice president and general manager of Rosetta Biosoftware, the recently formed software arm of the company. Bassett expects the company's software products, which include the Rosetta Resolver gene-expression data analysis system, to be among the first to offer full support for MAGE-ML.

EBI's ArrayExpress currently houses only three data sets, but it now accepts data in the MAGE-ML format. The EBI is beta-testing the web-based data submission capabilities for ArrayExpress, and Brazma expects this phase to last another 2–3 months.

The GEO, launched by the NCBI last July, has been operational for longer, contains more data, and both accepts data submissions and supports data queries. But some researchers find it difficult to work with. “GEO has the disadvantage that all of the data are stored basically as a big tab-delimited file inside the database. That makes it very difficult to query,” says Quackenbush. The NCBI is developing a set of tools on top of the GEO to try to extract the information and make it more accessible. Yoshio Tateno, of the Center for Information Biology, part of the National Institute of Genetics in Mishima, Japan, expects CIBEX to be publicly accessible and support MAGE-ML some time this summer.

Some private databases are also working towards supporting MAGE-ML and being MIAME-compliant. Gavin Sherlock, director of Microarray Informatics at the Stanford Microarray Database, hopes the database will be MIAME-compliant by the end of this year. “One of the things that makes it hard for us is the quantity of data we already have,” he says. The database already holds information from some 22,000 arrays.

The MGED is also about to release a checklist for authors, editors and reviewers, setting out what information should be given in microarray-based papers and what supporting information should be made available electronically. Details will be posted on its website. Brazma hopes it will serve as a useful guide that “will put everything on a more level playing field”.