Abstract
Sequence similarity search programs are versatile tools for the molecular biologist, frequently able to identify possible DNA coding regions and to provide clues to gene and protein structure and function. While much attention has been paid to the precise algorithms these programs employ and to their relative speeds, there is a constellation of associated issues that are equally important for realizing the full potential of these methods. Here, we consider a number of these issues, including the choice of scoring systems, the statistical significance of alignments, the masking of uninformative or potentially confounding sequence regions, the nature and extent of sequence redundancy in the databases, and network access to similarity search services.
Cite this article
Altschul, S., Boguski, M., Gish, W. et al. Issues in searching molecular sequence databases. Nat Genet 6, 119–129 (1994). https://doi.org/10.1038/ng0294-119
DOI: https://doi.org/10.1038/ng0294-119