In reply:

Kim et al. provide an interesting, albeit somewhat indirect, method to estimate the rate of alternative splicing per gene in different organisms based on expressed-sequence tag (EST) and mRNA data. Their results disagree with our earlier conclusions1 and fit better with the general expectation that the rate of alternative splicing increases with organismal complexity. Furthermore, the authors argue that their method is superior to direct EST matching approaches as it is independent of the number of ESTs used. The problem with both methods, as with many other studies based on 'omics' data, is that hidden biases and flaws in data and methods can heavily affect the outcome of an estimation. The results of Kim et al. agree with our results before normalizing for EST redundancy1 (i.e., more ESTs record more splice variants; e.g. refs. 2,3 and references therein) and correlate with the length distribution of EST contigs in the species analyzed (Fig. 1a). This prompted us to study the method of Kim et al. in detail and attempt to reproduce their results (see Supplementary Note online for a discussion of the difficulties of reproducing the method of Kim et al.). To test Kim et al.'s claim that their method is independent of EST coverage, we included the rat in our analysis, as it has an EST coverage similar to that of invertebrates. Notably, the estimated rate of alternative splicing in the rat was closer to that of invertebrates than to that of the mouse (Fig. 1b), suggesting that EST coverage does bias the results of Kim et al. This idea is further supported by the correlation between EST and RefSeq sequence coverage per organism and the rate of alternative splicing estimated by Kim et al. (Fig. 1b).

Figure 1: Dependence of the method of Kim et al. on EST coverage.

(a) Cumulative length distribution of TIGR EST contigs for each organism used in this analysis. The similarity between this graph and the graph by Kim et al. showing their estimate of G implies that the method indirectly measures the length of contigs (Supplementary Note online). (b) Relationship between EST coverage and the estimated rates of alternative splicing (AS) per gene in each organism. To adjust EST coverage to the context of this analysis, it is given as the ratio of the total number of ESTs to the number of RefSeqs available for that organism. Limited reproducibility and hidden parameter choices are inherent in large-scale analyses, and we had difficulty reproducing the method of Kim et al. We therefore used simulations to reproduce their figures approximately (simulated Kim et al.; Supplementary Note online).

Why do subsets of the EST contigs show the same rate of alternative splicing as the full data set? We believe that the use of The Institute for Genomic Research (TIGR) EST contigs4 rather than ESTs themselves leads to misleading statistics. TIGR contigs are derived data that merge redundant ESTs, and, in theory, each splice form of a gene should be represented by a single contig, including the forms represented by RefSeq sequences. The rate of alternative splicing per gene measured by Kim et al. ultimately depends on the ratio of the total number of contigs that support alternative splicing forms to the number of contigs that support RefSeq forms (for details see Supplementary Note online). When Kim et al. used subsets of TIGR contigs to test the dependency of their method on EST coverage, they affected the number of RefSeq and alternative splicing forms equally. Therefore, the ratio (and the rate) remains the same with any subset of the data. Thus, this test is not valid to prove independence from EST redundancy, which we argue is biasing the results of Kim et al.
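The subsampling argument can be illustrated with a toy simulation. The sketch below is not the method of Kim et al. or our own pipeline; the function names, the contig counts and the assumption that each gene has one well-covered RefSeq form and one sparsely covered alternative form are hypothetical choices made only for illustration. It contrasts drawing random subsets of contigs (one contig per splice form) with drawing random subsets of the underlying redundant ESTs.

```python
import random


def contig_subset_ratio(n_refseq, n_alt, fraction, seed=0):
    """Ratio of alternative-form contigs to RefSeq-form contigs
    in a random subset of contigs (one contig per splice form)."""
    rng = random.Random(seed)
    contigs = ["refseq"] * n_refseq + ["alt"] * n_alt
    sample = rng.sample(contigs, int(len(contigs) * fraction))
    return sample.count("alt") / sample.count("refseq")


def est_subset_ratio(n_genes, ests_major, ests_minor, fraction, seed=0):
    """Ratio of detected alternative forms to detected RefSeq forms
    in a random subset of redundant ESTs.

    Hypothetical model: every gene has one RefSeq form covered by
    `ests_major` ESTs and one alternative form covered by `ests_minor` ESTs.
    A form counts as detected if at least one of its ESTs is sampled.
    """
    rng = random.Random(seed)
    ests = []
    for gene in range(n_genes):
        ests += [(gene, "refseq")] * ests_major
        ests += [(gene, "alt")] * ests_minor
    sample = rng.sample(ests, int(len(ests) * fraction))
    detected = set(sample)  # collapse redundant ESTs into distinct forms
    n_ref = sum(1 for _, form in detected if form == "refseq")
    n_alt = sum(1 for _, form in detected if form == "alt")
    return n_alt / n_ref


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.25):
        by_contig = contig_subset_ratio(20000, 10000, fraction)
        by_est = est_subset_ratio(5000, 20, 2, fraction)
        print(f"fraction={fraction}: contig subset {by_contig:.3f}, "
              f"EST subset {by_est:.3f}")
```

Under these assumptions the contig-level ratio stays close to its full-data value at every subset size, because both classes of contig are thinned in the same proportion, whereas the EST-level ratio falls as sparsely covered alternative forms drop out of the sample first.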

Independent of the hidden biases that affect the method of Kim et al., the use of EST data might not be optimal for comparative analysis: they might not reflect global gene expression patterns adequately5 and they have various other flaws6. The limitations of EST coverage are best illustrated by the Drosophila melanogaster gene Dscam, which is represented by only 20 ESTs, even though it seems to express >38,000 alternative splice variants7. The estimates of the rate of alternative splicing also strongly depend on the treatment of the data; just one parameter choice, the number of base pairs at the end of the ESTs to be ignored to account for sequencing errors, can substantially influence the estimation of the rate of alternative splicing (Supplementary Note online). Finally, the use of EST data to compare distant organisms is not trivial, as the protocols and sources used to produce ESTs vary greatly between organisms. In mammals, ESTs are heavily sampled in a subset of tissues, which can lead to biases (e.g., human brain ESTs are over-represented, and brain is the tissue with the highest rate of alternative splicing8), whereas ESTs from invertebrates are often taken from whole organisms. It is a conceptual and practical challenge to normalize for these and other heterogeneities when comparing the extent of alternative splicing between different species.

The study by Kim et al. is important as it indicates that no estimate should be considered absolute when extrapolating from data with many hidden biases. Nevertheless, we doubt that the method in general and the implementation in particular are superior to direct EST matching (Supplementary Note online). The opportunity now exists to consider and carefully evaluate a variety of data and methods to approach an understanding of the impact of alternative splicing in each organism. The art is to avoid hidden biases in derived data and their processing as much as possible.

Note: Supplementary information is available on the Nature Genetics website.