Identification of protein coding regions by database similarity search

Abstract

Sequence similarity between a translated nucleotide sequence and a known biological protein can provide strong evidence for the presence of a homologous coding region, even between distantly related genes. The computer program BLASTX performs conceptual translation of a nucleotide query sequence followed by a protein database search in one programmatic step. We characterized the sensitivity of BLASTX recognition to the presence of substitution, insertion and deletion errors in the query sequence and to sequence divergence. Reading frames were reliably identified in the presence of 1% query errors, a rate that is typical for primary sequence data. BLASTX is appropriate for use in moderate- and large-scale sequencing projects at the earliest opportunity, when the data are most prone to containing errors.
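
To illustrate the conceptual translation step described in the abstract, the following Python sketch translates a nucleotide query in all six reading frames (three per strand), which is how BLASTX is generally described as preparing a query. It is not BLASTX code: the function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical, the standard genetic code is assumed, and the protein database search itself is omitted.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
# Assumption: the standard code applies to the query organism.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate successive codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> dict:
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    query = "ATGGCCAAATAG"
    for label, protein in six_frame_translations(query).items():
        print(label, protein)

    # A single inserted base shifts the downstream residues into another frame.
    shifted = query[:3] + "C" + query[3:]
    print(six_frame_translations(shifted)["+2"])
```

For the toy query `ATGGCCAAATAG`, frame +1 reads `MAK*`. Inserting one base after the first codon moves the downstream residues `AK*` into frame +2, which shows why insertion and deletion errors in a query matter for reading-frame identification and why every frame is worth comparing against the protein database.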

Cite this article

Gish, W., States, D. Identification of protein coding regions by database similarity search. Nat Genet 3, 266–272 (1993). https://doi.org/10.1038/ng0393-266
