Abstract
The identification of promoters and first exons has been one of the most difficult problems in gene-finding. We present a set of discriminant functions that can recognize structural and compositional features such as CpG islands, promoter regions and first splice-donor sites. We describe how these discriminant functions are combined into a decision tree that forms a new program, FirstEF. Using separate models for CpG-related and non-CpG-related first exons, we show by cross-validation that the program predicts 86% of first exons with 17% false positives. We also demonstrate the prediction accuracy of FirstEF at the genome level by applying it to the finished sequences of human chromosomes 21 and 22 and by comparing its predictions with the locations of experimentally verified first exons. Finally, we present an analysis of the predicted first exons for all 24 chromosomes of the human genome.
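To make the notion of a CpG-related region concrete, the following minimal sketch computes two compositional features commonly used to flag CpG islands: GC fraction and the observed/expected CpG ratio. It is an illustration only and is not FirstEF's discriminant function. The function names are hypothetical, and the default thresholds (length of at least 200 bp, GC fraction above 0.5, ratio above 0.6) are assumed from the widely used Gardiner-Garden and Frommer criteria rather than taken from the abstract.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    if n == 0:
        return 0.0, 0.0
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # number of CpG dinucleotides
    gc_fraction = (c + g) / n
    # Expected CpG count if C and G occurred independently: C * G / N
    expected = c * g / n
    ratio = cpg / expected if expected > 0 else 0.0
    return gc_fraction, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed length, GC-fraction and CpG-ratio thresholds."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and ratio > min_ratio
```

A real first-exon predictor would combine such features with others (for example promoter and splice-donor scores) in a trained classifier. This sketch only shows how the two compositional quantities are computed for a single window.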
Acknowledgements
This work was supported by grants to M.Q.Z. from the National Institutes of Health, and I.G. is also supported by a CSHL Association fellowship. We thank G. Chen for setting up the web interface to FirstEF, as well as N. Banerjee, K. Hermann, H. Herzel, M. Hoffman, D. Holste, W. Li, F. Lillo, M. Ronemus, R. Sachidanandam, K. Rateitschak, A. Schmitt and Z. Xuan for valuable discussions and comments on the manuscript.
Cite this article
Davuluri, R., Grosse, I. & Zhang, M. Computational identification of promoters and first exons in the human genome. Nat Genet 29, 412–417 (2001). https://doi.org/10.1038/ng780
DOI: https://doi.org/10.1038/ng780