To the editor:

The sequencing of the human genome showed that humans have 30,000 genes. This finding raised the possibility that alternative splicing, rather than an increased number of expressed genomic loci, was responsible for the functional complexity of vertebrates relative to invertebrates1. It has been estimated that 40–60% of all human genes1,2,3,4 and 74% of multiexon human genes5 are alternatively spliced. These estimates do not take into account how many different alternative splice forms exist for a given gene. Brett et al. examined alternative splicing in seven species, including human, using large-scale expressed-sequence tag (EST) analysis6. They concluded that vertebrates and invertebrates had similar rates of alternative splicing, not only with respect to the proportion of genes affected but also with respect to the number of alternative splice forms per gene. The method they used, however, depends on the extent of EST coverage in the underlying data sets.

To avoid this shortcoming and to provide an alternative estimate for the number of splice variants per gene, we modified the method that Ewing and Green used to estimate the total human gene count7. This method requires two independent sets of incomplete gene sequence data from the organism, the first of which should be unbiased. For counting the genes in the human genome, Ewing and Green used mRNA sequences for the first data set and EST contig sequences for the second data set7. They considered a pair of sequences, one from each data set, to be a match only if at least 100 aligned bases agreed. They used this low threshold because some of the mRNAs were incomplete or represented different alternatively spliced forms of the same gene. To count alternative splice forms using this method, we reasoned that alternative splice forms from the same gene could be counted as different genes by increasing the minimum number of aligned bases required for a match. As the minimum number of aligned bases increases, the gene count G will asymptotically approach the total number of transcriptional products. Using this approach, we estimated the extent of alternative splicing in Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens. We constructed the first set of data (n1) for each genome by selecting UniGene clusters8 that are represented by a member of the National Center for Biotechnology Information Reference Sequences (RefSeq) database9. We derived the set of EST contigs (n2) from the appropriate gene indices from The Institute for Genomic Research10. Further details are available in Supplementary Methods online.
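The counting idea above can be illustrated with a minimal sketch. It assumes the estimator takes the standard capture-recapture form, G = n1 × n2 / m, where m is the number of reference sequences that match at least one EST contig at the chosen threshold; the letter does not spell out the formula, so this form is an assumption, not a description of the authors' code. The function names and the toy numbers are hypothetical.

```python
from typing import Sequence


def count_matches(best_aligned_bases: Sequence[int], min_aligned_bases: int) -> int:
    """Count reference sequences whose best alignment to any EST contig
    reaches the minimum number of aligned bases."""
    return sum(1 for bases in best_aligned_bases if bases >= min_aligned_bases)


def estimate_g(n1: int, n2: int, matches: int) -> float:
    """Capture-recapture estimate (assumed form): G = n1 * n2 / matches."""
    if matches <= 0:
        raise ValueError("at least one match is needed to estimate G")
    return n1 * n2 / matches


# Toy, made-up input: best aligned-base count for each of 8 reference sequences.
toy_best_aligned = [3200, 150, 0, 900, 2500, 80, 3100, 400]
toy_n2 = 20  # hypothetical number of EST contigs

# Raising the threshold lowers the match count, which raises the estimate of G.
for threshold in (100, 1000, 3000):
    m = count_matches(toy_best_aligned, threshold)
    print(threshold, m, estimate_g(len(toy_best_aligned), toy_n2, m))
```

In this sketch a stricter threshold can only reduce the number of matches, so the estimate of G never decreases as the threshold rises, which mirrors the behaviour described in the text.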

From these comparisons, we predicted the value for G as the minimum number of aligned bases varied. As expected, G increased asymptotically as the minimum number of aligned bases increased (Fig. 1a). The rate of alternative splicing per gene can be estimated as the ratio of the values of G at the two extreme ends of the graph (Fig. 1b). These data indicate that mice and humans have a higher rate of alternative splicing than do fruit flies and nematodes.

Figure 1: Extent of alternative splicing in four organisms.

(a) Estimate of the gene count G with varying minimum number of aligned bases for H. sapiens (diamonds), M. musculus (squares), D. melanogaster (triangles) and C. elegans (circles). (b) Average number of alternative splice forms per gene for each organism. This was calculated by dividing the estimate for G with a minimum of 3,000 aligned bases by the estimate for G at a minimum of 100 aligned bases. Error bars show the 95% confidence intervals, assuming there are no errors in the underlying data, calculated as described in Supplementary Methods online.
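The calculation in panel b can be sketched as a simple ratio of two estimates of G. The function name and the input values below are hypothetical placeholders, not results from this study, and the confidence-interval calculation described in Supplementary Methods is not reproduced here.

```python
from typing import Mapping


def splice_forms_per_gene(g_by_threshold: Mapping[int, float],
                          low: int = 100, high: int = 3000) -> float:
    """Ratio of G at the strictest threshold to G at the most permissive one."""
    g_low = g_by_threshold[low]
    g_high = g_by_threshold[high]
    if g_low <= 0:
        raise ValueError("G at the low threshold must be positive")
    return g_high / g_low


# Placeholder values keyed by minimum aligned bases; not data from the letter.
example_g = {100: 10.0, 1000: 18.0, 3000: 25.0}
print(splice_forms_per_gene(example_g))
```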

To confirm that these observations were not the result of differing amounts of EST data, we tested the four species with the same number of reference sequences (n1). Four different subsets of EST contigs (all, one-half, one-quarter and one-eighth subsets of the data) showed nearly indistinguishable results (Supplementary Fig. 1 online). Therefore, our predictions are independent of the extent of EST contig coverage. To test whether our results could be due to the presence of pseudogenes, we analyzed a subset of the EST contigs that hit only one region of the human genome. The rate of alternative splicing in the subset that excluded transcriptional variants from pseudogenes was similar to that observed using the full EST contig set (Supplementary Fig. 2 online). This indicates that the confounding effect of the pseudogenes is not substantial. Finally, we determined whether the higher proportion of tumor libraries in human EST data (relative to the other species) confounded our results. Notably, tumor-specific transcripts seem to have a higher rate of alternative splicing in humans (Supplementary Fig. 3 online). But the rate of alternative splicing estimated from the non-tumor-specific transcripts was similar to that estimated using a random sample of the data set including all EST contigs (Supplementary Fig. 3 online). This is probably because only 10% of the contigs in the Gene Indices data set from The Institute for Genomic Research are considered tumor-specific by our criteria. Further details are available in Supplementary Methods online.

Our results disagree with those of Brett et al.6. We believe that our method provides a more accurate answer, as it does not depend on the extent of EST coverage. Our results indicate that, in accordance with expectations given that there are only 30,000 genes in humans, there is a greater amount of alternative splicing in mammals than in invertebrates. Enumerating these different forms and understanding their roles in contributing to biological complexity is a vast area for future research.

Note: Supplementary information is available on the Nature Genetics website.