Bioinformatics in the post-sequence era

Abstract

In the past decade, bioinformatics has become an integral part of research and development in the biomedical sciences. Bioinformatics now has an essential role both in deciphering genomic, transcriptomic and proteomic data generated by high-throughput experimental technologies and in organizing information gathered from traditional biology. Sequence-based methods of analyzing individual genes or proteins have been elaborated and expanded, and methods have been developed for analyzing large numbers of genes or proteins simultaneously, such as in the identification of clusters of related genes and networks of interacting proteins. With the complete genome sequences for an increasing number of organisms at hand, bioinformatics is beginning to provide both conceptual bases and practical methods for detecting systemic functional behaviors of the cell and the organism.

Figure 1
Figure 2: Bioinformatics developments of the past decade.
Figure 3: Bioinformatics now and in the future.

Author information

Correspondence to Minoru Kanehisa.

About this article

Cite this article

Kanehisa, M., Bork, P. Bioinformatics in the post-sequence era. Nat Genet 33, 305–310 (2003) doi:10.1038/ng1109
