To the editor:

Mootha et al.1 propose a statistical method (Gene Set Enrichment Analysis; GSEA) to discern changes in expression levels of sets of genes selected a priori in transcriptional profiling experiments. Although consideration of groups of genes is an interesting strategy, the proposed test statistic may not necessarily determine “...if the members of a given gene set are enriched among the most differentially expressed genes between two classes”1.

Situations will probably arise when using GSEA in which genes with the highest values of the difference metric are ignored solely because of the size of the selected gene sets, irrespective of any biological context of the genes comprising the sets. By way of illustration, consider the following hypothetical example. Assume that a given data set consists of three potentially interesting, non-overlapping sets of genes S1, S2 and S3, of respective sizes n, 5n and 4n genes (10n genes in total), where n is any positive integer. Assume also that all of the genes in S1 are ranked higher (i.e., they have greater differences in expression) than the genes in S2, which in turn are ranked higher than the genes in S3. The GSEA procedure yields enrichment scores (ES)1 of 3n, 4n and 0 for S1, S2 and S3, respectively. The maximum ES1 is 4n and is attributed to S2. S2 will therefore be singled out as the candidate for further investigation over S1, even though S1 comprises the highest ranked genes. This does not seem reasonable, because S2 has been chosen only by virtue of containing a larger number of genes. In other words, GSEA can be at odds with the picture suggested by the gene ranking.
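
The scores quoted above can be checked numerically. The sketch below is illustrative only: it assumes that the ES is the maximum of a running sum over the ranked list, stepping up by sqrt((N - G)/G) at each member of a set of size G and down by sqrt(G/(N - G)) at each non-member, where N is the total number of genes. That form is our reading of the statistic in ref. 1, and the function and gene names are hypothetical.

```python
from math import sqrt


def enrichment_score(ranked_genes, gene_set):
    """Maximum of the running sum over the ranked list.

    Assumed form of the ES (our reading of ref. 1): +sqrt((N-G)/G) for a
    gene in the set, -sqrt(G/(N-G)) for a gene outside it.
    """
    N = len(ranked_genes)
    G = len(gene_set)
    hit = sqrt((N - G) / G)
    miss = -sqrt(G / (N - G))
    running = 0.0
    best = float("-inf")
    for gene in ranked_genes:
        running += hit if gene in gene_set else miss
        best = max(best, running)
    return best


def example_scores(n):
    """ES for S1, S2, S3 (sizes n, 5n, 4n) ranked strictly in that order."""
    s1 = [f"S1_{i}" for i in range(n)]
    s2 = [f"S2_{i}" for i in range(5 * n)]
    s3 = [f"S3_{i}" for i in range(4 * n)]
    ranked = s1 + s2 + s3  # highest-ranked genes first
    return {
        "S1": enrichment_score(ranked, set(s1)),
        "S2": enrichment_score(ranked, set(s2)),
        "S3": enrichment_score(ranked, set(s3)),
    }


if __name__ == "__main__":
    for name, es in example_scores(2).items():
        print(name, round(es, 6))
```

Under these assumptions the scores for S1 and S2 scale as 3n and 4n, and the score for S3 is zero up to floating-point rounding, so S2 attains the maximum for every n.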

A second observation, using the same illustrative example as above, gives another counterintuitive result. In the absence of a defined third gene set (S3), the ES for S2 is 0 and the ES for S1 remains positive. Therefore, S1, and not S2, is chosen by GSEA, a result opposite to that of the previous scenario. An unusual situation has arisen in which the preference between two highly ranked sets is determined simply by the presence or absence of a lower-ranking set.
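
The arithmetic behind this second scenario can be written out as follows. This is a sketch under the same assumption about the form of the statistic in ref. 1: the ES is the maximum over j of the running sum of steps X_i, with X_i = +sqrt((N - G)/G) for a gene in the set and X_i = -sqrt(G/(N - G)) otherwise, where G is the set size and N the total number of genes. The symbols N, G and X_i are our notation, not taken from the text above.

```latex
% Only S1 (n genes) and S2 (5n genes) are defined, so N = 6n.
\begin{align*}
\mathrm{ES}(S_1) &= n\sqrt{\frac{6n-n}{n}} = n\sqrt{5} > 0,\\[4pt]
\sum_{i=1}^{n} X_i &= -n\sqrt{\frac{5n}{6n-5n}} = -n\sqrt{5}
  \quad\text{(running sum for } S_2 \text{ after the } n \text{ genes of } S_1\text{)},\\[4pt]
\sum_{i=1}^{6n} X_i &= -n\sqrt{5} + 5n\sqrt{\frac{6n-5n}{5n}}
  = -n\sqrt{5} + \frac{5n}{\sqrt{5}} = 0,\\[4pt]
\mathrm{ES}(S_2) &= \max_{1\le j\le 6n}\ \sum_{i=1}^{j} X_i = 0 .
\end{align*}
```

The running sum for S2 is negative at every position except the last, where it returns to zero, so its maximum is zero, whereas S1 attains its maximum of n√5 after its own n genes.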

The behavior of GSEA cannot be dismissed as one of the usual power issues encountered due to noise in data, small sample size or lack of robustness to model assumptions. The simple example outlined here indicates that the power of the test statistic is sensitive to the a priori definition of the hypotheses of interest. These limitations should be clearly understood in applying and interpreting the results of the approach.