Replying to T. Yonezawa & M. Hasegawa Nature 468, 10.1038/nature09482 (2010)
Yonezawa and Hasegawa1 provide an example from two apparently unrelated families of nucleic acid coding sequences for which an Akaike information criterion (AIC) model selection test, similar to mine2, chooses a common origin hypothesis. Although this may seem surprising, the coding sequences in this example were aligned in the same reading frame. The constraints of the genetic code are expected to induce correlations between these sequences (and among all coding sequences) that are not due to common ancestry. For instance, owing to codon bias and the structure of the genetic code, in these sequences the second codon position is biased towards T (about twofold over average), whereas the third position is usually an A (∼50%) and rarely a G (∼4%).
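The positional bias described above can be checked directly on any set of in-frame coding sequences. The following is a minimal sketch, not the procedure used in this reply: the function name `codon_position_frequencies` and the toy input are hypothetical, and the code assumes that each sequence starts at codon position 1 and that only unambiguous bases (A, C, G, T) are counted.

```python
from collections import Counter


def codon_position_frequencies(seqs):
    """Return base frequencies at codon positions 1, 2 and 3.

    Assumes every sequence is in frame (index 0 is codon position 1).
    Trailing bases that do not complete a codon are ignored, as are
    ambiguous characters such as N or gaps.
    """
    counts = [Counter(), Counter(), Counter()]
    for seq in seqs:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3  # drop any incomplete final codon
        for i in range(usable):
            base = seq[i]
            if base in "ACGT":
                counts[i % 3][base] += 1

    freqs = []
    for position_counts in counts:
        total = sum(position_counts.values())
        freqs.append(
            {b: (position_counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return freqs


# Toy input for illustration only; not data from the example discussed above.
toy_freqs = codon_position_frequencies(["ATGGTA", "CTA"])
```

A strong skew at one position that is shared by unrelated gene families, such as the T enrichment at the second position, is the kind of signal a frame-unaware nucleotide model could mistake for shared ancestry.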
One can account for these correlations explicitly by using codon models (as implemented in PAML3, codonFreq = 2 or 3) or standard amino acid models (as in PhyML4). With these more realistic models, independent ancestry is the strongly preferred hypothesis. Furthermore, the raw likelihoods and AIC scores increase significantly (by hundreds to thousands of logs), indicating that codon and amino acid models are greatly superior to the naive nucleotide models.
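The comparison between hypotheses can be sketched as follows. This is an illustration under stated assumptions rather than the analysis itself: it assumes the AIC is expressed in log-likelihood units as ln L − k (which ranks models identically to the conventional 2k − 2 ln L), and that an independent-ancestry hypothesis is modelled as separate trees whose log-likelihoods and parameter counts add. The function names and all numbers are hypothetical.

```python
def aic_log_units(log_likelihood, n_params):
    """AIC score in log-likelihood units; higher is better.

    Equal to the conventional AIC (2k - 2 lnL) divided by -2.
    """
    return log_likelihood - n_params


def combine_independent(parts):
    """Combine separately fitted groups into one independent-ancestry model.

    Assumption: the groups are statistically independent, so their
    log-likelihoods add and their free parameters add.
    """
    total_lnl = sum(lnl for lnl, _ in parts)
    total_k = sum(k for _, k in parts)
    return total_lnl, total_k


def common_minus_independent(common, independent_parts):
    """Score difference; positive values favour the common-origin model."""
    common_score = aic_log_units(*common)
    independent_score = aic_log_units(*combine_independent(independent_parts))
    return common_score - independent_score


# Hypothetical (lnL, k) values chosen only to exercise the functions.
delta = common_minus_independent(
    common=(-1000.0, 10),
    independent_parts=[(-480.0, 6), (-490.0, 6)],
)
```

With these made-up inputs the difference is negative, so the independent-ancestry model would be preferred; the sign and magnitude for real data depend entirely on the fitted likelihoods.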
Yonezawa and Hasegawa1 point out that I2 did not explicitly test models in which selection or biophysical constraints generate sequence correlations among proteins with independent origins. Formal phylogenetic models accounting for such factors are currently unavailable; their development would be a welcome advance. Although these are important considerations for proteins with low sequence similarity, neither selection nor physical constraints alone can plausibly generate the high levels of sequence similarity (>55% average sequence identity) observed in the universal protein data set that I used2,5. The amount of adaptive convergence necessary to produce thousands of identical amino acids among 23 different proteins from completely independent beginnings is not comparable to the limited molecular convergence seen with, for example, homologous digestive lysozymes6, in which already highly similar proteins (in function, structure and sequence) later acquired a handful of identical substitutions in parallel.
How could selection or biophysical constraints induce correlations among unrelated sequences? If certain similar amino acid sequences are necessary for performing specific functions (or for adopting a specific tertiary conformation that is necessary for function), then selection for function may ‘lead’ proteins with independent origins to neighbouring regions of sequence space. However, no particular protein sequence or fold is necessary for any given function. There are abundant examples of proteins with undetectable sequence similarity and different folds that perform the same biochemical and cellular functions7. For example, the proteases subtilisin, trypsin and carboxypeptidase have the same active site and mechanism, whereas papain, renin and thermolysin have different active sites and different mechanisms. All six proteases have radically different folds and sequences. Because different folds in general have different sequence requirements, proteins with the same function need not have similar sequences.
Even assuming that a certain protein fold is necessary for a given function, current molecular evidence indicates that the sequence requirements for a fold are extremely low, nearly indistinguishable from random. These data come from many independent sources throughout biology.
Many large classes of proteins with identical folds have no detectable sequence similarity (for example, families of TIM barrels, carbonic anhydrases, OB-folds, SH3 domains, Rossmann folds and immunoglobulin domains). These proteins provide prima facie evidence that sequence requirements for any particular fold and function are nearly indistinguishable from random. Protein domains in the SCOP database8 from different superfamilies yet with the same fold share ∼9% sequence identity9.
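For orientation, the identity figures quoted here can be set against a simple random baseline. The sketch below is illustrative only: it assumes a pre-computed pairwise alignment with `-` as the gap character, it divides by the number of columns in which both sequences have a residue (one of several conventions in use), and it takes the chance identity for two unrelated sequences to be the sum of squared residue frequencies. The function names are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def expected_random_identity(freqs):
    """Chance percent identity for independent residues drawn from freqs.

    freqs maps each residue to its frequency; frequencies should sum to 1.
    """
    return 100.0 * sum(p * p for p in freqs.values())


# Uniform usage of the 20 amino acids gives a 5% baseline.
uniform = {aa: 1.0 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
baseline = expected_random_identity(uniform)
```

Real amino acid frequencies are uneven, which raises the chance baseline somewhat above 5%, so values of 8 to 9% sit close to what unrelated sequences would show.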
Identical folds with known independent origins have nearly random sequence similarity9,10. For example, unrelated proteins with the same fold from the MALISAM database share 8.5 ± 0.4% sequence identity9,10. These data can be used to estimate the correlations among independently evolved and created proteins with the same fold, and the correlations are nearly random. In the universal protein data set that I used2, the average sequence correlation induced by common ancestry is roughly one log-likelihood per site for the most divergent proteins. In contrast, the correlations among independent proteins with the same fold are ∼100 times weaker. From this we can estimate that model selection scores for common ancestry hypotheses will be many thousands of logs greater than those for competing selection hypotheses.
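The arithmetic behind this estimate can be made explicit. The sketch below simply scales the per-site figures quoted above by an alignment length; the function name `score_gap` and the value of `n_sites` are hypothetical placeholders, not the size of the actual data set, and the calculation assumes that sites contribute additively.

```python
def score_gap(n_sites, ancestry_per_site=1.0, weakening=100.0):
    """Rough gap, in log-likelihood units, between the two kinds of signal.

    ancestry_per_site: correlation attributed to common ancestry (~1 log
        per site in the text above).
    weakening: factor by which same-fold correlations among independent
        proteins are weaker (~100 in the text above).
    """
    independent_per_site = ancestry_per_site / weakening
    return n_sites * (ancestry_per_site - independent_per_site)


# n_sites = 5000 is an arbitrary illustrative length (an assumption).
gap = score_gap(5000)
```

Under these assumptions the gap is about 0.99 logs per site, so any alignment of several thousand sites yields a difference of thousands of logs.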
Even the most conserved proteins have not yet reached the limits of sequence space, which has been estimated to be near the random expectation for any given fold and function11.
These arguments are largely circumstantial and informal. I have not tested all possible competing hypotheses, and my analysis will not be the “last word on common ancestry”12. I emphasize that I have in no sense provided an absolute ‘proof’ of universal common ancestry. One of the great advantages of the model selection framework that I presented is that if a novel model is proposed with a well-defined likelihood function, then we can easily compare it to the common ancestry models and see how it fares.
Yonezawa, T. & Hasegawa, M. Was the universal common ancestry proved? Nature 468, 10.1038/nature09482 (2010)
Theobald, D. L. A formal test of the theory of universal common ancestry. Nature 465, 219–222 (2010)
Yang, Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556 (1997)
Guindon, S. & Gascuel, O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696–704 (2003)
Brown, J. R., Douady, C. J., Italia, M. J., Marshall, W. E. & Stanhope, M. J. Universal trees based on large combined protein sequence data sets. Nature Genet. 28, 281–285 (2001)
Stewart, C. B., Schilling, J. W. & Wilson, A. C. Adaptive evolution in the stomach lysozymes of foregut fermenters. Nature 330, 401–404 (1987)
Omelchenko, M. V., Galperin, M. Y., Wolf, Y. I. & Koonin, E. V. Non-homologous isofunctional enzymes: a systematic analysis of alternative solutions in enzyme evolution. Biol. Direct 5, 31 (2010)
Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36, D419–D425 (2008)
Cheng, H., Kim, B. H. & Grishin, N. V. Discrimination between distant homologs and structural analogs: lessons from manually constructed, reliable data sets. J. Mol. Biol. 377, 1265–1278 (2008)
Cheng, H., Kim, B. H. & Grishin, N. V. MALISAM: a database of structurally analogous motifs in proteins. Nucleic Acids Res. 36, D211–D217 (2008)
Povolotskaya, I. S. & Kondrashov, F. A. Sequence space and the ongoing expansion of the protein universe. Nature 465, 922–926 (2010)
Steel, M. & Penny, D. Origins of life: Common ancestry put to the test. Nature 465, 168–169 (2010)
Cite this article
Theobald, D. Theobald reply. Nature 468, E10 (2010). https://doi.org/10.1038/nature09483