Abstract
Sequence similarity between a translated nucleotide sequence and a known biological protein can provide strong evidence for the presence of a homologous coding region, even between distantly related genes. The computer program BLASTX performs conceptual translation of a nucleotide query sequence followed by a protein database search in one programmatic step. We characterized the sensitivity of BLASTX recognition to substitution, insertion and deletion errors in the query sequence and to sequence divergence. Reading frames were reliably identified in the presence of 1% query errors, a rate that is typical for primary sequence data. BLASTX is therefore appropriate for use in moderate- and large-scale sequencing projects at the earliest opportunity, when the data are most prone to containing errors.
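The conceptual translation step can be illustrated with a minimal Python sketch. This is not the BLASTX implementation. It assumes the standard genetic code and an unambiguous A/C/G/T query, and the names `reverse_complement`, `translate` and `six_frame_translations` are hypothetical helpers introduced only for illustration. It shows how a nucleotide query yields six candidate protein sequences, three per strand, each of which could then be compared against a protein database.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the query uses the standard code; '*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels ('+1'..'+3', '-1'..'-3') to translated sequences."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    query = "ATGGCCAAATAG"
    for frame, protein in six_frame_translations(query).items():
        print(frame, protein)

    # A single inserted base shifts the downstream reading frame:
    # the residues after the insertion now appear in frame +2.
    shifted = "ATGCGCCAAATAG"
    print(six_frame_translations(shifted)["+2"])
```

In this toy example the intact query translates to `MAK*` in frame +1. After one base is inserted, the downstream residues `AK*` are recovered in frame +2 instead, which illustrates why searching all reading frames helps tolerate insertion and deletion errors in the query.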
Cite this article
Gish, W., States, D. Identification of protein coding regions by database similarity search. Nat Genet 3, 266–272 (1993). https://doi.org/10.1038/ng0393-266