Abstract
Linguistic metaphors have been woven into the fabric of molecular biology since its inception. The determination of the human genome sequence has brought these metaphors to the forefront of the popular imagination, with the natural extension of the notion of DNA as language to that of the genome as the 'book of life'. But do these analogies go deeper and, if so, can the methods developed for analysing languages be applied to molecular biology? In fact, many techniques used in bioinformatics, even if developed independently, may be seen to be grounded in linguistics. Further interweaving of these fields will be instrumental in extending our understanding of the language of life.
This is a preview of subscription content, access via your institution
Relevant articles
Open Access articles citing this article.
-
Protein intrinsically disordered region prediction by combining neural architecture search and multi-objective genetic algorithm
BMC Biology Open Access 07 September 2023
Access options
Subscribe to this journal
Receive 51 print issues and online access
$199.00 per year
only $3.90 per issue
Rent or buy this article
Prices vary by article type
from$1.95
to$39.95
Prices may be subject to local taxes which are calculated during checkout



References
Aitchison, J. Linguistics (NTC/Contemporary Publishing, Chicago, 1999).
Chomsky, N. Syntactic Structures (Mouton, The Hague, 1957).
Jurafsky, D. & Martin, J. H. Speech and Language Processing (Prentice Hall, Upper Saddle River, NJ, 2000).
Brendel, V. & Busse, H. G. Genome structure described by formal languages. Nucleic Acids Res. 12, 2561–2568 (1984).
Head, T. Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviors. Bull. Math. Biol. 49, 737–759 (1987).
Searls, D. B. in Proc. 7th Natl Conf. Artif. Intell. 386–391 (AAAI Press, Menlo Park, CA, 1988).
Searls, D. B. The linguistics of DNA. Am. Sci. 80, 579–591 (1992).
Searls, D. B. in Logic Programming: Proc. North Am. Conf. (eds Lusk, E. & Overbeek, R.) 189–208 (MIT Press, Cambridge, MA, 1989).
Searls, D. B. in Artificial Intelligence and Molecular Biology Ch. 2 (ed. Hunter, L.) 47–120 (AAAI Press, Menlo Park, CA, 1993).
Searls, D. B. in Mathematical Support for Molecular Biology (eds Farach-Colton, M., Roberts, F. S., Vingron, M. & Waterman, M.) 117–140 (American Mathematical Society, Providence, RI, 1999).
Durbin, R., Krogh, A., Mitchison, G. & Eddy, S. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, 1998).
Baldi, P. & Brunak, S. Bioinformatics: The Machine Learning Approach (MIT Press, Cambridge, MA, 2001).
Lyngso, R. B. & Pedersen, C. N. RNA pseudoknot prediction in energy-based models. J. Comput. Biol. 7, 409–427 (2000).
Joshi, A. in Natural Language Processing: Psycholinguistic, Computational and Theoretical Perspectives (eds Dowty, D., Karttunen, L. & Zwicky, A.) 206–250 (Chicago Univ. Press, New York, 1985).
Uemura, Y., Hasegawa, A., Kobayashi, S. & Yokomori, T. Tree-adjoining grammars for RNA structure prediction. Theor. Comput. Sci. 10, 277–303 (1999).
Searls, D. B. String Variable Grammar: a logic grammar formalism for DNA sequences. J. Logic Program. 24, 73–102 (1995).
Rivas, E. & Eddy, S. R. The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics 16, 334–340 (2000).
Shieber, S. Evidence against the context-freeness of natural language. Linguist. Phil. 8, 333–343 (1985).
Schultz, J., Milpetz, F., Bork, P. & Ponting, C. P. SMART, a simple modular architecture research tool: identification of signalling domains. Proc. Natl Acad. Sci. USA 95, 5857–5864 (1998).
Westhead, D. R., Slidel, T. W., Flores, T. P. & Thornton, J. M. Protein structural topology: automated analysis and diagrammatic representation. Protein Sci. 8, 897–904 (1999).
Abe, N. & Mamitsuka, H. Predicting protein secondary structure using stochastic tree grammars. Machine Learn. 29, 275–301 (1997).
Przytycka, T., Srinivasan, R., & Rose, G. D. Recursive domains in proteins. Protein Sci. 11, 409–417 (2002).
Jung, J. & Lee, B. Circularly permuted proteins in the protein structure database. Protein Sci. 10, 1881–1886 (2001).
Hopcroft, J. E. & Ullman, J. D. Introduction to Automata Theory, Languages, and Computation (Addison-Wesley, Reading, MA, 1979).
Searls, D. B. Reading the book of life. Bioinformatics 17, 579–580 (2001).
Dong, S. & Searls, D. B. Gene structure prediction by linguistic methods. Genomics 23, 540–551 (1994).
Searls, D. B. Linguistic approaches to biological sequences. Comput. Appl. Biosci. 13, 333–344 (1997).
Collado-Vides, J. A transformational-grammar approach to the study of the regulation of gene expression. J. Theor. Biol. 136, 403–425 (1989).
Rosenblueth, D. A. et al. Syntactic recognition of regulatory regions in Escherichia coli. Comput. Appl. Biosci. 12, 15–22 (1996).
Leung, S., Mellish, C. & Robertson, D. Basic Gene Grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequences. Bioinformatics 17, 226–236 (2001).
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
Reese, M. G., Kulp, D., Tammana, H. & Haussler, D. Genie—gene finding in Drosophila melanogaster. Genome Res. 10, 529–538 (2000).
Yandell, M. D. & Majoros, W. H. Genomics and natural language processing. Nature Rev. Genet. 3, 601–610 (2002).
Sakakibara, Y. et al. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res. 22, 5112–5120 (1994).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
Rivas, E. & Eddy, S. R. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2, 8 (2001).
Knudsen, B. & Hein, J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15, 446–454 (1999).
Brown, M. P. Small subunit ribosomal RNA modeling using stochastic context-free grammars. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 57–66 (2000).
Holmes, I. & Rubin, G. M. Pairwise RNA structure comparison with stochastic context-free grammars. Pac. Symp. Biocomput. 163–174 (2002).
Brown M. & Wilson C. RNA pseudoknot modeling using intersections of stochastic context free grammars with applications to database search. Pac. Symp. Biocomput. 109–125 (1996).
Campbell, L. Historical Linguistics: An Introduction (MIT Press, Cambridge, MA, 1999).
Darwin, C. The Descent of Man (John Murray, London, 1871).
Dawkins, R. The Selfish Gene (Oxford Univ. Press, Oxford, 1976).
Nowak, M. A., Komarova, N. L. & Niyogi, P. Computational and evolutionary aspects of language. Nature 417, 611–617 (2002).
Pennock, R. T. Tower of Babel: The Evidence against the New Creationism (Bradford/MIT Press, Cambridge, MA, 1999).
Cavalli-Sforza, L. L. Genes, Peoples, and Languages (North Point Press, New York, 2000).
Warnow T. Mathematical approaches to comparative linguistics. Proc. Natl Acad. Sci. USA 94, 6585–6590 (1997).
Swadesh, M. Lexicostatistical dating of prehistoric ethnic contacts: with special reference to North American Indians and Eskimos. Proc. Am. Phil. Soc. 96, 452–463 (1952).
Kruskal, J. B., Dyen, I. & Black, P. in Lexicostatistics in Genetic Linguistics (ed. Dyen, I.) 30–55 (Mouton, The Hague, 1973).
Mushegian, A. The minimal genome concept. Curr. Opin. Genet. Dev. 9, 709–714 (1999).
Tatusov, R. L., Galperin, M. Y., Natale, D. A. & Koonin, E. V. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33–36 (2000).
Snel, B., Bork P, & Huynen, M. A. Genome phylogeny based on gene content. Nature Genet. 21, 108–110 (1999).
Tekaia, F., Lazcano, A., & Dujon, B. The genomic tree as revealed from whole proteome comparisons. Genome Res. 9, 550–557 (1999).
Lin, J. & Gerstein, M. Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res. 10, 808–818 (2000).
Pellegrini, M. et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 96, 4285–4288 (1999).
McWhorter, J. H. The Power of Babel: A Natural History of Language 128–129 (Freeman, New York, 2001).
Searls, D. B. From Jabberwocky to genome: Lewis Carroll and computational biology. J. Comp. Biol. 8, 339–348 (2001).
Lupas, A. N., Ponting, C. P. & Russell, R. B. On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol. 134, 191–203 (2001).
McKeown, K. R. & Radev, D. R. in A Handbook of Natural Language Processing (eds Dale, R., Moisl, H. & Somers, H.) 507–523 (Dekker, New York, 2000).
Marcotte, E. M. et al. Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751–753 (1999).
Smadja, F. Retrieving collocations from text: XTRACT. Comput. Linguist. 19, 143–177 (1993).
Rudman, J. The state of authorship attribution studies: some problems and solutions. Comput. Humanities 31, 351–365 (1998).
Barnbrook, G. Language and Computers (Edinburgh Univ. Press, Edinburgh, 1996).
Zipf, G. K. Human Behavior and the Principle of Least Effort (Addison-Wesley, Boston, MA, 1949).
Mandelbrot, B. The Fractal Geometry of Nature (Freeman, San Francisco, 1983).
Mantegna, R. N. et al. Linguistic features of noncoding DNA sequences. Phys. Rev. Lett. 73, 3169–3172 (1994).
Huynen, M. A. & van Nimwegen, E. The frequency distribution of gene family sizes in complete genomes. Mol. Biol. Evol. 15, 583–589 (1998).
Harrison, P. M. & Gerstein, M. Studying genomes through the aeons: protein families, pseudogenes and proteome evolution. J. Mol. Biol. 318, 1155–1174 (2002).
Qian, J., Luscombe, N. M. & Gerstein, M. Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J. Mol. Biol. 313, 673–681 (2001).
Schuster, P., Fontana, W., Stadler, P. F. & Hofacker, I. L. From sequences to shapes and back: a case study in RNA secondary structures. Proc. R. Soc. Lond. B 255, 279–284 (1994).
Hoyle, D. C., Rattray, M., Jupp, R. & Brass, A. Making sense of microarray data distributions. Bioinformatics 18, 576–584 (2002).
Rzhetsky, A. & Gomez, S. M. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17, 988–996 (2001).
Jeong, H. et al. The large-scale organization of metabolic networks. Nature 407, 651–654 (2000).
Park, J., Lappe, M. & Teichmann, S. A. Mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J. Mol. Biol. 307, 929–938 (2001).
Garcia-Vallve, S., Romeu, A. & Palau, J. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res. 10, 1719–1725 (2000).
White, O. et al. A quality control algorithm for DNA sequencing projects. Nucleic Acids Res. 21, 3829–3838 (1993).
Hoover, D. I. Statistical stylistics and authorship attribution: an empirical investigation. Lit. Linguist. Comput. 16, 421–444 (2001).
Binongo, J. N. G. & Smith, M. W. A. The application of principal component analysis to stylometry. Lit. Linguist. Comput. 14, 445–466 (1999).
Hoorn, J. F., Frank, S. L., Kowalczyk, W. & van der Ham, F. Neural network identification of poets using letter sequences. Lit. Linguist. Comput. 14, 311–338 (1999).
Leopold, E. & Kindermann, J. Text categorization with support vector machines. How to represent texts in input space? Machine Learn. 46, 423–444 (2002).
Holmes, D. I. & Forsyth, R. S. The Federalist revisited: new directions in authorship attribution. Lit. Linguist. Comput. 10, 111–127 (1995).
Altman, R. B. & Raychaudhuri, S. Whole-genome expression analysis: challenges beyond clustering. Curr. Opin. Struct. Biol. 11, 340–347 (2001).
Searls D. B. Mining the bibliome. Pharmacogenomics J. 1, 88–89 (2001).
Popov, O., Segal, D. M. & Trifonov, E. N. Linguistic complexity of protein sequences as compared to texts of human languages. Biosystems 38, 65–74 (1996).
Trifonov, E. N. Interfering contexts of regulatory sequence elements. Comput. Appl. Biosci. 12, 423–429 (1996).
Spenser, M. & Howe, C. Estimating distances between manuscripts based on copying errors. Lit. Linguist. Comput. 16, 467–484 (2001).
Barbrook, A. C., Howe, C. J., Blake, N. & Robinson, P. The phylogeny of the Canterbury Tales. Nature 394, 839 (1998).
Platnick, N. I. & Cameron, H. D. Cladistic methods in textual, linguistic, and phylogenetic analysis. Syst. Zool. 26, 380–385 (1977).
Tanselle, G. T. Literature and Artifacts (Bibliographical Society of the University of Virginia, Charlottesville, VA, 1998).
Ferrer, D. Hypertextual representation of literary working papers. Lit. Linguist. Comput. 10, 143–145 (1995).
Acknowledgements
I thank P. Agarwal, A. Lupas, N. Odendahl and K. Rice for helpful comments on the manuscript.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Searls, D. The language of genes. Nature 420, 211–217 (2002). https://doi.org/10.1038/nature01255
Issue Date:
DOI: https://doi.org/10.1038/nature01255
This article is cited by
-
Protein intrinsically disordered region prediction by combining neural architecture search and multi-objective genetic algorithm
BMC Biology (2023)
-
Grammar-aware sentence classification on quantum computers
Quantum Machine Intelligence (2023)
-
Knowledge transfer, templates, and the spillovers
European Journal for Philosophy of Science (2022)
-
Promiscuous Domains in Eukaryotes and HAT Proteins in FUNGI Have Followed Different Evolutionary Paths
Journal of Molecular Evolution (2022)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.