
Computational identification of promoters and first exons in the human genome

A Corrigendum to this article was published on 01 November 2002


The identification of promoters and first exons has been one of the most difficult problems in gene-finding. We present a set of discriminant functions that can recognize structural and compositional features such as CpG islands, promoter regions and first splice-donor sites. We describe how these discriminant functions are combined into a decision tree that forms the core of a new program called FirstEF. Using separate models to predict CpG-related and non-CpG-related first exons, we showed by cross-validation that the program could predict 86% of the first exons with 17% false positives. We also demonstrated the prediction accuracy of FirstEF at the genome level by applying it to the finished sequences of human chromosomes 21 and 22 and by comparing the predictions with the locations of experimentally verified first exons. Finally, we present an analysis of the predicted first exons for all 24 chromosomes of the human genome.
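As a rough illustration of the kind of compositional feature mentioned above, the sketch below computes the GC fraction and the observed/expected CpG ratio for a single sequence window and applies the widely used Gardiner-Garden and Frommer thresholds (length of at least 200 bp, GC content above 50%, observed/expected CpG above 0.6). This is not FirstEF's discriminant function or its CpG score; the function names and the way the thresholds are applied are assumptions made purely for illustration.

```python
def cpg_window_stats(seq):
    """Return (gc_fraction, obs_exp_cpg) for one DNA window.

    Hypothetical helper for illustration only; not part of FirstEF.
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        raise ValueError("empty sequence")
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    # Observed/expected CpG = (#CpG * N) / (#C * #G); defined as 0 if C or G is absent
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply Gardiner-Garden and Frommer style thresholds to one window.

    The default thresholds are the classical ones; treating a single fixed
    window this way is a simplifying assumption.
    """
    if len(seq) < min_len:
        return False
    gc_fraction, obs_exp = cpg_window_stats(seq)
    return gc_fraction > min_gc and obs_exp > min_obs_exp


if __name__ == "__main__":
    window = "CG" * 100  # toy 200-bp window, not real genomic sequence
    print(cpg_window_stats(window))
    print(looks_like_cpg_island(window))
```

A real predictor would slide such a window along the genomic sequence and combine the resulting features with promoter and splice-donor signals, which is the role the abstract assigns to the discriminant functions and decision tree.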


Figure 1: Histogram of CpG scores for all of the 2,139 first exons of our first-exon database.
Figure 2: Histogram of the relative distance between the 5′ end of the CpG window and the splice-donor site for CpG-related first exons.
Figure 3: Histograms of the relative distance between the 5′ end of the predicted first exon (TSS) and the annotated translation start codon (ATG) for human chromosome 21 (Fig. 3a) and human chromosome 22 (Fig. 3b).







This work was supported by grants to M.Q.Z. from the National Institutes of Health, and I.G. is also supported by a CSHL Association fellowship. We thank G. Chen for setting up the web interface to FirstEF, as well as N. Banerjee, K. Hermann, H. Herzel, M. Hoffman, D. Holste, W. Li, F. Lillo, M. Ronemus, R. Sachidanandam, K. Rateitschak, A. Schmitt and Z. Xuan for valuable discussions and comments on the manuscript.

Author information


Corresponding author

Correspondence to Michael Q. Zhang.


About this article

Cite this article

Davuluri, R., Grosse, I. & Zhang, M. Computational identification of promoters and first exons in the human genome. Nat Genet 29, 412–417 (2001).
