Microarrays are a powerful tool for making pairwise comparisons between tumour types, allowing us to categorize tumours that cannot be distinguished histologically. But imagine being able to categorize any tumour by plugging its gene-expression data into a universal database. This is the vision of Todd Golub and colleagues, and a paper in Proceedings of the National Academy of Sciences takes the first steps towards realizing that vision.

Using commercially available oligonucleotide arrays, the authors set about classifying tumours from 14 different classes using two approaches. The first method, unsupervised learning or clustering, organizes samples on the basis of similar gene-expression patterns without any knowledge of the tumour type. This could distinguish haematopoietic or central nervous system tumours, but couldn't tell epithelial tumours apart. The second approach, supervised learning, 'trains' an algorithm to distinguish between different tumour types so that it can recognize blinded samples. This method is excellent at making pairwise distinctions, but how would it cope with 14 possibilities at once? The trick was to break the problem down into numerous pseudo-pairwise comparisons by running the data for each sample through 14 classifier algorithms, each of which compares one specific tumour type (for example, breast cancer) with all of the other types. Each classifier can then either accept or reject the sample, depending on whether the tumour's expression pattern resembles the signature on which that classifier was trained. Each classifier also generates a 'confidence value' that quantifies how similar the sample's data set is to its trained signature (for the breast cancer classifier, the breast cancer signature).
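
The one-versus-all scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which algorithm sits inside each classifier, so a simple nearest-centroid scorer stands in for it here, and the function names, the two-gene expression values and the three tumour labels are all invented for the example.

```python
from math import sqrt

def centroid(rows):
    """Mean expression profile of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train_one_vs_rest(samples, labels):
    """Build one binary scorer per class: that class versus all the others."""
    classifiers = {}
    for cls in sorted(set(labels)):
        inside = [s for s, l in zip(samples, labels) if l == cls]
        outside = [s for s, l in zip(samples, labels) if l != cls]
        classifiers[cls] = (centroid(inside), centroid(outside))
    return classifiers

def confidences(classifiers, sample):
    """Score a sample with every classifier.

    Assumption: confidence is (distance to the 'rest' centroid) minus
    (distance to the class centroid), so a positive value means 'accept'
    and a negative value means 'reject'.
    """
    return {
        cls: distance(sample, rest) - distance(sample, own)
        for cls, (own, rest) in classifiers.items()
    }

def predict(classifiers, sample):
    """Assign the class whose classifier is most confident."""
    scores = confidences(classifiers, sample)
    return max(scores, key=scores.get)

# Toy data: six samples, two genes, three hypothetical tumour classes.
train = [[9.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 8.0], [5.0, 5.0], [6.0, 6.0]]
labels = ["breast", "breast", "lung", "lung", "colon", "colon"]

classifiers = train_one_vs_rest(train, labels)
blinded = [8.5, 1.5]
scores = confidences(classifiers, blinded)
call = predict(classifiers, blinded)
```

With 14 tumour classes the same loop would simply produce 14 classifiers, and ranking the confidence values gives the first, second and third most confident predictions for each sample.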

Training this classifier using 144 primary tumour samples of known class allowed it to classify 78% of the samples correctly, and for half of the mistakes, the second or third most confident prediction was correct. Increasing the number of tumours in the training set might improve this score. The classifier was then let loose on 54 test samples, with similar results. Interestingly, six of eight metastatic samples were correctly classified, indicating that metastatic tumours retain a gene-expression pattern similar to that of the tumour of origin. But the classifier fared less well on poorly differentiated (high-grade) carcinomas, indicating that their gene-expression patterns are fundamentally different from those of well-differentiated tumours from the same tissue. Might this reflect a different cellular origin for these tumours? Perhaps it's time to refine our tumour classification systems to take these differences into account.
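
The idea of counting a prediction as useful when the true class is among the top few guesses can be made concrete with a small top-k accuracy calculation. Taking the rounded figures in the text at face value, 78% correct plus half of the remaining 22% gives roughly 0.78 + 0.11 = 0.89 of samples with the right class among the three most confident predictions. The sketch below shows how such a score might be computed; the function name and the toy rankings are assumptions made for illustration, not data from the study.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true class is among the k most confident calls."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Hypothetical rankings, most confident class first, for four samples.
ranked = [
    ["breast", "lung", "colon"],
    ["lung", "breast", "colon"],
    ["colon", "breast", "lung"],
    ["lung", "colon", "breast"],
]
truth = ["breast", "lung", "breast", "breast"]

top1 = top_k_accuracy(ranked, truth, 1)
top2 = top_k_accuracy(ranked, truth, 2)
top3 = top_k_accuracy(ranked, truth, 3)
```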