  • Review Article

The language of genes

Abstract

Linguistic metaphors have been woven into the fabric of molecular biology since its inception. The determination of the human genome sequence has brought these metaphors to the forefront of the popular imagination, with the natural extension of the notion of DNA as language to that of the genome as the 'book of life'. But do these analogies go deeper and, if so, can the methods developed for analysing languages be applied to molecular biology? In fact, many techniques used in bioinformatics, even if developed independently, may be seen to be grounded in linguistics. Further interweaving of these fields will be instrumental in extending our understanding of the language of life.

Figure 1: Grammar-style derivations of idealized versions of RNA structures.
Figure 2: Protein domain arrangements and the Chomsky hierarchy.
Figure 3: Distributions of the number of occurrences of Pfam protein domains (blue squares) in the genome of the yeast Saccharomyces cerevisiae, and of words (red diamonds) in Shakespeare's Romeo and Juliet, in both cases sorted in rank order from left to right.
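
The rank-ordered occurrence counts described for Figure 3 can be illustrated with a minimal sketch. It is not the author's method or data: the helper names `tokenize` and `rank_frequency` are hypothetical, the domain labels are toy values rather than Pfam counts from yeast, and the text sample is a single line standing in for the full play.

```python
from collections import Counter
import re


def tokenize(text):
    """Lower-case a text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def rank_frequency(items):
    """Return (rank, item, count) triples, most frequent first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    counts = Counter(items)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, item, n) for rank, (item, n) in enumerate(ranked, start=1)]


# Toy inputs, for illustration only (not data from the article).
words = tokenize("O Romeo, Romeo! wherefore art thou Romeo?")
toy_domains = ["kinase", "SH3", "kinase", "WD40", "kinase", "SH3"]

word_ranks = rank_frequency(words)
domain_ranks = rank_frequency(toy_domains)

for rank, item, n in domain_ranks:
    print(rank, item, n)
```

Plotting count against rank on logarithmic axes for both lists would give the kind of side-by-side comparison the caption describes; the same function serves words and domains because it only assumes a sequence of hashable labels.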

Acknowledgements

I thank P. Agarwal, A. Lupas, N. Odendahl and K. Rice for helpful comments on the manuscript.

Author information

Corresponding author

Correspondence to David B. Searls.

About this article

Cite this article

Searls, D. The language of genes. Nature 420, 211–217 (2002). https://doi.org/10.1038/nature01255

  • DOI: https://doi.org/10.1038/nature01255
