Issues in searching molecular sequence databases

Abstract

Sequence similarity search programs are versatile tools for the molecular biologist, frequently able to identify possible DNA coding regions and to provide clues to gene and protein structure and function. While much attention has been paid to the precise algorithms these programs employ and to their relative speeds, there is a constellation of associated issues that are equally important for realizing the full potential of these methods. Here, we consider a number of these issues, including the choice of scoring systems, the statistical significance of alignments, the masking of uninformative or potentially confounding sequence regions, the nature and extent of sequence redundancy in the databases, and network access to similarity search services.


About this article

Cite this article

Altschul, S., Boguski, M., Gish, W. et al. Issues in searching molecular sequence databases. Nat Genet 6, 119–129 (1994). https://doi.org/10.1038/ng0294-119
