The use of microarrays to monitor the transcription of thousands of genes under multiple conditions or in multiple cell lines is generating a massive and growing amount of valuable data. But there is a pressing need for more and better analysis tools. Two recent papers report new approaches and show how different methods of data mining can yield new information. Both papers use gene expression data related to cancer biology.

One way of analysing microarray data is to look for groups of genes whose expression patterns are similar across many experiments. The co-regulated genes within such clusters are often found to have related functions. Getz et al. started with the idea that some gene clusters might be masked by transcriptional 'noise' from genes outside the cluster, or might be missed because the genes are co-regulated in only a subset of the experiments. So the authors developed an algorithm, called coupled two-way clustering, that breaks down the total dataset into subsets of genes and samples that can reveal significant clusters.
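The iterative idea behind coupled two-way clustering can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the clustering step here is a simple stand-in (connected components of a correlation graph), and the function names, the threshold of 0.8 and the round limit are all assumptions chosen for the example.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def cluster(vectors, threshold):
    """Stand-in clustering (an assumption, not the published method):
    connected components of the graph that links vectors whose correlation
    is at least `threshold`. Singleton groups are dropped."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        if pearson(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]


def coupled_two_way(matrix, threshold=0.8, max_rounds=3):
    """matrix[g][s] is the expression of gene g in sample s.
    Returns the sets of gene clusters and sample clusters found."""
    gene_sets = {frozenset(range(len(matrix)))}
    sample_sets = {frozenset(range(len(matrix[0])))}
    tried = set()
    for _ in range(max_rounds):
        found_new = False
        for gs in list(gene_sets):
            for ss in list(sample_sets):
                # Skip pairs already analysed, and submatrices too small
                # for a correlation to be informative.
                if (gs, ss) in tried or len(gs) < 3 or len(ss) < 3:
                    continue
                tried.add((gs, ss))
                g_idx, s_idx = sorted(gs), sorted(ss)
                rows = [[matrix[g][s] for s in s_idx] for g in g_idx]
                cols = [list(col) for col in zip(*rows)]
                # Cluster the genes using only these samples ...
                for members in cluster(rows, threshold):
                    new = frozenset(g_idx[i] for i in members)
                    if new not in gene_sets:
                        gene_sets.add(new)
                        found_new = True
                # ... and the samples using only these genes.
                for members in cluster(cols, threshold):
                    new = frozenset(s_idx[i] for i in members)
                    if new not in sample_sets:
                        sample_sets.add(new)
                        found_new = True
        if not found_new:
            break
    return gene_sets, sample_sets
```

Each newly found gene cluster is fed back in as a feature set for clustering the samples, and each new sample cluster as a feature set for clustering the genes. This feedback is how a grouping that is hidden in the full matrix could, in principle, surface in a submatrix.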

Two previously published datasets were used by Getz et al. The first comprised 72 samples of two types of acute leukaemia — acute lymphoblastic leukaemia (ALL) and acute myeloid leukaemia (AML). After applying their analysis, they identified 84 clusters. One of the clusters (comprising 60 genes) separated the samples into AML and ALL. Another cluster (of 28 genes) split the AML patients into those who had received treatment and those who had not. The second dataset used by Getz et al. comprised 40 colon cancer samples and 22 controls. Their analysis was able to split the group into the normal and diseased samples using one of the clusters of genes, and another cluster partitioned the samples according to a difference in the methodology used for RNA preparation. Overall, the method does generate meaningful clusters that are not detected when the whole dataset is analysed. The task is now to examine the unexplained clusters to look for biological significance.

Butte et al. used an entirely different approach to mine data from two different datasets. The data concerned 60 cell lines established by the National Cancer Institute and used since 1989 to screen anticancer agents. The first dataset comprised the transcript levels for several thousand genes in each cell line. The second dataset comprised the GI50 (the level of anticancer agent required to achieve 50% growth inhibition) for several thousand agents on each cell line. The aim was to look for significant correlation between every possible pair of agents and genes. Correlations were summarized diagrammatically in networks and 202 such networks were found. Many expected associations were found between structurally related anticancer agents, and networks were also identified that linked genes of related function. Only one association was found between an agent and a gene: the GI50 for a thiazolidine carboxylic acid derivative increased with the expression of the gene LCP1, which encodes an actin-binding protein. Once again, the networks need to be analysed further to uncover the biological meaning. Among the advantages of this method are that individual genes or agents can be linked more than once, and that negative correlations can be found just as easily as positive correlations.
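A minimal sketch of the pairwise-correlation idea is given below. It is an illustration under stated assumptions rather than the published procedure: a fixed cutoff on the absolute Pearson correlation stands in for whatever significance test the authors applied, and the function names and the feature names in the usage example (`gene_A`, `agent_X` and so on) are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def correlation_networks(features, threshold=0.8):
    """features maps a gene or agent name to its values across the cell lines.
    Returns a list of (nodes, edges) pairs, one per network; each edge is
    (name_a, name_b, r). The fixed threshold is an assumption of this sketch."""
    names = sorted(features)
    edges = []
    for a, b in combinations(names, 2):
        r = pearson(features[a], features[b])
        # The absolute value keeps negative correlations as well as positive ones.
        if abs(r) >= threshold:
            edges.append((a, b, r))

    # A feature may appear in several edges, so build an adjacency map.
    neighbours = {}
    for a, b, _ in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Each connected component of the graph is one network.
    networks, seen = [], set()
    for start in names:
        if start in seen or start not in neighbours:
            continue
        nodes, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in nodes:
                nodes.add(node)
                stack.extend(neighbours[node] - nodes)
        seen |= nodes
        networks.append((nodes, [e for e in edges if e[0] in nodes]))
    return networks
```

For example, with made-up values for five cell lines:

```python
features = {
    "gene_A": [1, 2, 3, 4, 5],      # hypothetical transcript levels
    "agent_X": [2, 4, 6, 8, 10],    # hypothetical GI50 values, rising with gene_A
    "agent_Y": [10, 8, 6, 4, 2],    # hypothetical GI50 values, falling with gene_A
    "gene_B": [1, -1, 0, -1, 1],    # uncorrelated with the others
}
networks = correlation_networks(features)
```

Here `gene_A`, `agent_X` and `agent_Y` fall into a single network, with `agent_Y` joined by negative correlations, while `gene_B` is left unlinked.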

These two papers expand the range of tools for analysis of transcript profile data, and expose further seams for would-be data miners.