To understand why molecular evolution turned out as it did, we must characterize not only the path that evolution followed across the space of possible molecular sequences but also the many alternative trajectories that could have been taken but were not. A large-scale comparison of real and possible histories would establish whether the outcome of evolution represents an optimal state driven by natural selection or the contingent product of historical chance events [1]; it would also reveal how the underlying distribution of functions across sequence space shaped historical evolution [2,3]. Here we combine ancestral protein reconstruction [4] with deep mutational scanning [5–10] to characterize alternative histories in the sequence space around an ancient transcription factor, which evolved a novel biological function through well-characterized mechanisms [11,12]. We find hundreds of alternative protein sequences that use diverse biochemical mechanisms to perform the derived function at least as well as the historical outcome. These alternatives all require prior permissive substitutions that do not enhance the derived function, but not all require the same permissive changes that occurred during history. We find that if evolution had begun from a different starting point within the network of sequences encoding the ancestral function, outcomes with different genetic and biochemical forms would probably have resulted; this contingency arises from the distribution of functional variants in sequence space and epistasis between residues. Our results illuminate the topology of the vast space of possibilities from which history sampled one path, highlighting how the outcome of evolution depends on a serial chain of compounding chance events.
Monod, J. Chance and Necessity: An Essay on the Natural Philosophy of Biology (Vintage Books, 1972)
Maynard Smith, J. Natural selection and the concept of a protein space. Nature 225, 563–564 (1970)
Wagner, A. Neutralism and selectionism: a network-based reconciliation. Nat. Rev. Genet. 9, 965–974 (2008)
Hochberg, G. K. A. & Thornton, J. W. Reconstructing ancient proteins to understand the causes of structure and function. Annu. Rev. Biophys. 46, 247–269 (2017)
Fowler, D. M. et al. High-resolution mapping of protein sequence-function relationships. Nat. Methods 7, 741–746 (2010)
Hietpas, R. T., Jensen, J. D. & Bolon, D. N. A. Experimental illumination of a fitness landscape. Proc. Natl Acad. Sci. USA 108, 7896–7901 (2011)
Podgornaia, A. I. & Laub, M. T. Pervasive degeneracy and epistasis in a protein-protein interface. Science 347, 673–677 (2015)
Wu, N. C., Dai, L., Olson, C. A., Lloyd-Smith, J. O. & Sun, R. Adaptation in protein fitness landscapes is facilitated by indirect paths. eLife 5, e16965 (2016)
Aakre, C. D. et al. Evolving new protein-protein interaction specificity through promiscuous intermediates. Cell 163, 594–606 (2015)
Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016)
McKeown, A. N. et al. Evolution of DNA specificity in a transcription factor family produced a new gene regulatory module. Cell 159, 58–68 (2014)
Anderson, D. W., McKeown, A. N. & Thornton, J. W. Intermolecular epistasis shaped the function and evolution of an ancient transcription factor and its DNA binding sites. eLife 4, e07864 (2015)
Carroll, J. S. et al. Genome-wide analysis of estrogen receptor binding sites. Nat. Genet. 38, 1289–1297 (2006)
Watson, L. C. et al. The glucocorticoid receptor dimer interface allosterically transmits sequence-specific DNA signals. Nat. Struct. Mol. Biol. 20, 876–883 (2013)
Luisi, B. F. et al. Crystallographic analysis of the interaction of the glucocorticoid receptor with DNA. Nature 352, 497–505 (1991)
Schwabe, J. W., Chapman, L., Finch, J. T. & Rhodes, D. The crystal structure of the estrogen receptor DNA-binding domain bound to DNA: how receptors discriminate between their response elements. Cell 75, 567–578 (1993)
Zilliacus, J., Carlstedt-Duke, J., Gustafsson, J. A. & Wright, A. P. Evolution of distinct DNA-binding specificities within the nuclear receptor family of transcription factors. Proc. Natl Acad. Sci. USA 91, 4175–4179 (1994)
Bain, D. L. et al. Glucocorticoid receptor-DNA interactions: binding energetics are the primary determinant of sequence-specific transcriptional activity. J. Mol. Biol. 422, 18–32 (2012)
Eick, G. N., Bridgham, J. T., Anderson, D. P., Harms, M. J. & Thornton, J. W. Robustness of reconstructed ancestral protein functions to statistical uncertainty. Mol. Biol. Evol. 34, 247–261 (2017)
Bloom, J. D., Gong, L. I. & Baltimore, D. Permissive secondary mutations enable the evolution of influenza oseltamivir resistance. Science 328, 1272–1275 (2010)
Gong, L. I., Suchard, M. A. & Bloom, J. D. Stability-mediated epistasis constrains the evolution of an influenza protein. eLife 2, e00631 (2013)
Harms, M. J. & Thornton, J. W. Evolutionary biochemistry: revealing the historical and physical causes of protein properties. Nat. Rev. Genet. 14, 559–571 (2013)
Starr, T. N. & Thornton, J. W. Epistasis in protein evolution. Protein Sci. 25, 1204–1218 (2016)
Dickinson, B. C., Leconte, A. M., Allen, B., Esvelt, K. M. & Liu, D. R. Experimental interrogation of the path dependence and stochasticity of protein evolution using phage-assisted continuous evolution. Proc. Natl Acad. Sci. USA 110, 9007–9012 (2013)
Ortlund, E. A., Bridgham, J. T., Redinbo, M. R. & Thornton, J. W. Crystal structure of an ancient protein: evolution by conformational epistasis. Science 317, 1544–1548 (2007)
Harms, M. J. & Thornton, J. W. Historical contingency and its biophysical basis in glucocorticoid receptor evolution. Nature 512, 203–207 (2014)
Natarajan, C. et al. Predictable convergence in hemoglobin function has unpredictable molecular underpinnings. Science 354, 336–339 (2016)
Shah, P., McCandlish, D. M. & Plotkin, J. B. Contingency and entrenchment in protein evolution under purifying selection. Proc. Natl Acad. Sci. USA 112, E3226–E3235 (2015)
Bridgham, J. T., Ortlund, E. A. & Thornton, J. W. An epistatic ratchet constrains the direction of glucocorticoid receptor evolution. Nature 461, 515–519 (2009)
Lynch, M. & Hagner, K. Evolutionary meandering of intermolecular interactions along the drift barrier. Proc. Natl Acad. Sci. USA 112, E30–E38 (2015)
Fox, J. E., Bridgham, J. T., Bovee, T. F. H. & Thornton, J. W. An evolvable oestrogen receptor activity sensor: development of a modular system for integrating multiple genes into the yeast genome. Yeast 24, 379–390 (2007)
Mumberg, D., Müller, R. & Funk, M. Yeast vectors for the controlled expression of heterologous proteins in different genetic backgrounds. Gene 156, 119–122 (1995)
Gietz, R. D. & Woods, R. A. Transformation of yeast by lithium acetate/single-stranded carrier DNA/polyethylene glycol method. Methods Enzymol. 350, 87–96 (2002)
R Core Team. R: A language and environment for statistical computing (R Foundation for Statistical Computing, 2016)
Muggeo, V. M. R. segmented: an R package to fit regression models with broken-line relationships. R News 8, 20–25 (2008)
Sluder, A. E., Mathews, S. W., Hough, D., Yin, V. P. & Maina, C. V. The nuclear receptor superfamily has undergone extensive proliferation and diversification in nematodes. Genome Res. 9, 103–120 (1999)
Benatuil, L., Perez, J. M., Belk, J. & Hsieh, C. M. An improved yeast transformation method for the generation of very large human antibody libraries. Protein Eng. Des. Sel. 23, 155–159 (2010)
Scanlon, T. C., Gray, E. C. & Griswold, K. E. Quantifying and resolving multiple vector transformants in S. cerevisiae plasmid libraries. BMC Biotechnol. 9, 95 (2009)
Fowler, D. M., Stephany, J. J. & Fields, S. Measuring the activity of protein variants on a large scale using deep mutational scanning. Nat. Protocols 9, 2267–2284 (2014)
Mir, K., Neuhaus, K., Bossert, M. & Schober, S. Short barcodes for next generation sequencing. PLoS ONE 8, e82933 (2013)
Peterman, N. & Levine, E. Sort-seq under the hood: implications of design choices on large-scale characterization of sequence-function relations. BMC Genomics 17, 206 (2016)
Delignette-Muller, M. L. & Dutang, C. fitdistrplus: an R package for fitting distributions. J. Stat. Softw. 64, http://dx.doi.org/10.18637/jss.v064.i04 (2015)
Archer, K. J. & Williams, A. A. A. L1 penalized continuation ratio models for ordinal response prediction using high-dimensional datasets. Stat. Med. 31, 1464–1474 (2012)
Vega Yon, J., Fábrega Lacoa, J. & Kunst, J. B. rgexf: build, import and export GEXF graph files. R package version 0.15.3. https://CRAN.R-project.org/package=rgexf (2015)
Bastian, M., Heymann, S. & Jacomy, M. Gephi: an open source software for exploring and manipulating networks. In Int. AAAI Conference on Weblogs and Social Media, vol. 8, 361–362 (Association for the Advancement of Artificial Intelligence, 2009)
Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJ. Complex Syst. 1695, 1–9 (2006)
Sailer, Z. R. & Harms, M. J. Detecting high-order epistasis in nonlinear genotype-phenotype maps. Genetics 205, 1079–1088 (2017)
Knol, M. J., Pestman, W. R. & Grobbee, D. E. The (mis)use of overlap of confidence intervals to assess effect modification. Eur. J. Epidemiol. 26, 253–254 (2011)
Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005)
Luscombe, N. M., Laskowski, R. A. & Thornton, J. M. NUCPLOT: a program to generate schematic diagrams of protein-nucleic acid interactions. Nucleic Acids Res. 25, 4940–4945 (1997)
Schymkowitz, J. W. H. et al. Prediction of water and metal binding sites and their affinities by using the Fold-X force field. Proc. Natl Acad. Sci. USA 102, 10147–10152 (2005)
Crooks, G. E., Hon, G., Chandonia, J.-M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004)
Abriata, L. A., Palzkill, T. & Dal Peraro, M. How structural and physicochemical determinants shape sequence constraints in a functional enzyme. PLoS ONE 10, e0118684 (2015)
Paternoster, R., Brame, R., Mazerolle, P. & Piquero, A. Using the correct statistical test for the equality of regression coefficients. Criminology 36, 859–866 (1998)
We thank J. Bridgham and B. Metzger for technical advice, members of the Thornton laboratory past and present for comments, the University of Chicago’s Flow Cytometry and Genomics Cores, and E. Thomas for poetic inspiration. This work was supported by National Institutes of Health R01GM104397 and R01GM121931 (J.W.T.), T32-GM007183 (T.N.S.), UL1-TR000430, and a National Science Foundation Graduate Research Fellowship (T.N.S.).
The authors declare no competing financial interests.
Reviewer Information Nature thanks J. Bloom, A. de Visser, and D. Weinreich for their contribution to the peer review of this work.
Extended data figures and tables
Extended Data Figure 1 Design and validation of a yeast FACS-seq assay for steroid receptor DNA-binding function.
a, GFP activation in ERE (purple) and SRE (green) yeast reporters correlates with previously measured protein–DNA binding affinity [11,12]. Asterisk, stop-codon-containing variant. Dashed line, best-fit segmented-linear relationship between GFP activation and log10(Ka,mac). b, Histogram of the per-cell green fluorescence for AncSR1 on ERE measured via flow cytometry, fitted to a logistic distribution (dashed line). c, Distributions providing the best fit to flow cytometry data for isogenic cultures of 101 DBD variants, using Akaike information criterion. d, Comparisons of mean fluorescence estimates between FACS-seq replicates of each protein/response element combination. Black points, coding RH variants; light grey, stop-codon-containing variants. R²pos, squared Pearson correlation coefficient for variants with mean fluorescence significantly higher than stop-codon-containing variants in either or both replicates. e, Comparisons between mean fluorescence as determined in FACS-seq and via flow cytometry analysis of isogenic cultures for a random selection of clones from each library. Dashed line, best-fit linear regression. f, Robustness of classification to sampling depth. Variants were binned according to the minimum number of cells with which they were sampled in either replicate. Below 15 cells sampled (dashed line), the probability that a variant called active in one replicate was also called active in the other is dependent on sampling depth; to minimize errors due to sampling depth, we eliminated as ‘undetermined’ all variants with fewer than 15 cells sampled after pooling replicates. g, Standard error of mean fluorescence estimates (s.e.m.) in each library as a function of sampling depth. Top: for each background, the relationship between s.e.m. and sampling depth for ERE (purple) and SRE (green) libraries, as estimated from the sampling distribution of stop-codon-containing variants (dotted lines) or variability in mean fluorescence estimates between replicates (solid lines). Bottom: the cumulative fraction of coding variants in each library having a certain number of cells sampled in the pooled data.
Extended Data Figure 2
a, A scatterplot of side-angle scattering (SSC-A) and forward-angle scattering (FSC-A) selects for a homogeneous cell population (P1). b, A scatterplot of the height of the per-cell forward scatter peak (FSC-H) and the integrated area of this peak (FSC-A) excludes events where multiple cells pass through the detector simultaneously (P2). c, Final sort bins (P3–P6) are drawn on the distribution of green fluorescence (FITC-A). d, Table showing the hierarchical parentage of sort gates and the percentage of events that fall in each bin.
Extended Data Figure 3
For each protein/response element combination, a continuation ratio ordinal logistic regression model was constructed to predict the functional class of a variant as a function of its four RH amino-acid states, including possible first-order main effects and second-order pairwise epistatic effects. Tenfold cross-validation was used to select the penalization parameter λ and evaluate performance. a, b, True positive rate (left, TPR, the proportion of experimental positives that are predicted positive) and positive predictive value (right, PPV, the proportion of predicted positives that are experimentally positive) are shown as a function of λ for AncSR1+11P on ERE. Classifications were evaluated for (a) all active (weak and strong) versus inactive variants and (b) strong active versus weak active and inactive variants. Grey dotted lines, cross-validation replicates; solid line, mean. Dashed line shows the chosen value of λ = 10⁻⁵; as λ continues to decrease beyond λ = 10⁻⁵, the true positive rate plateaus but positive predictive value continues to decline. c, The number of non-zero parameters included in each model as a function of λ. Dashed line, λ = 10⁻⁵. d, Summary of performance metrics from tenfold cross-validation for each model with λ = 10⁻⁵. Accuracy is the proportion of predicted classifications (strong, weak, and inactive) that match their experimentally determined classes.
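The performance metrics defined in this caption (TPR, PPV and accuracy) can be computed from paired experimental and predicted classes. The following is a minimal Python sketch, not the authors' code; the function names and the six example labels are hypothetical and serve only to illustrate the definitions.

```python
def tpr_ppv(observed, predicted, positive):
    """Return (TPR, PPV) for a binary split of the three classes.

    `positive` is the set of class labels counted as positive,
    e.g. {"strong", "weak"} for active-versus-inactive (panel a)
    or {"strong"} for strong-versus-rest (panel b).
    """
    pairs = list(zip(observed, predicted))
    tp = sum(o in positive and p in positive for o, p in pairs)
    fn = sum(o in positive and p not in positive for o, p in pairs)
    fp = sum(o not in positive and p in positive for o, p in pairs)
    # TPR: proportion of experimental positives that are predicted positive.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    # PPV: proportion of predicted positives that are experimentally positive.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return tpr, ppv


def accuracy(observed, predicted):
    """Proportion of predicted classes that match the experimental classes."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical labels for six variants (illustration only, not data from the study).
observed = ["strong", "weak", "inactive", "inactive", "strong", "weak"]
predicted = ["strong", "inactive", "inactive", "weak", "weak", "weak"]

print(tpr_ppv(observed, predicted, {"strong", "weak"}))  # active vs inactive
print(tpr_ppv(observed, predicted, {"strong"}))          # strong vs the rest
print(accuracy(observed, predicted))
```

Evaluating both splits mirrors panels a and b: lowering λ admits more parameters, which can raise TPR while lowering PPV, so the two metrics are best read together.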
Extended Data Figure 4
a, b, Diverse mechanisms for recognition of SRE (a) or ERE (b) by the historical RH genotypes (GSKV and EGKA) and alternative SRE-specific variants. Contacts from FoldX-generated structural models are shown between RH residues (circles) and DNA bases (letters), backbone phosphates (small circles) and sugars (pentagons, numbered by position in the DNA motif; dashed numbers refer to the complementary strand). Hydrogen bonds are shown as dashed arrows from donor to acceptor; dotted lines, non-bonded contacts. Red squares, bases that form hydrogen bonds in the EGKA-ERE structure that are unsatisfied in complex with an SRE-specific RH; red circles, side chains with polar groups that are not satisfied in complex with ERE. Only DNA contacts that vary among the analysed structures are shown. c, Large side chains at position 29 correlate with the loss of a conserved R33 hydrogen bond to ERE. For ERE-bound structural models, the distance of the Arg33 guanidinium hydrogen to the ERE T4 carbonyl oxygen was measured and compared with the atomic volume of the residue at position 29 in that variant.
Extended Data Figure 5 The ancestral RH (EGKA) and derived RH (GSKV) can access many SRE-specific outcomes by short paths in AncSR1+11P.
a, Concentric rings contain RH genotypes of minimum path length one, two, or three steps from AncSR1+11P:EGKA (centre). The historical outcome (GSKV, boxed, bottom) is accessible through a three-step path (EGKA–GGKA–GGKV–GSKV). Alternative SRE-specific outcomes accessible in three or fewer steps are in green. Lines connect genotypes separated by a single non-synonymous nucleotide mutation; lines among genotypes in the outer ring are not shown for clarity. Orange arrows indicate paths of significantly increasing SRE mean fluorescence. b, For trajectories indicated by orange arrows in a, SRE mean fluorescence is shown versus mutational distance from AncSR1+11P:EGKA (with x-axis jitter to avoid overplotting). Grey lines connect variants separated by single-nucleotide mutations. Error bars, 90% confidence intervals. Green dashed line, activity of AncSR1+11P:GSKV on SRE. c, For the SRE-specific outcomes accessed in orange paths in a, the probability of each outcome under models where the probability of taking a step depends on the relative increase in SRE mean fluorescence (correlated fixation model), or where any SRE-enhancing step is equally likely (equal fixation model) [8]. d, The historical outcome (GSKV) has SRE-specific single-mutant neighbours. Concentric rings contain SRE-specific RH genotypes of path length one or two steps from AncSR1+11P:GSKV (centre). Lines connect genotypes separated by a single non-synonymous nucleotide mutation; lines among genotypes in the outer ring are not shown for clarity. e, The distribution of SRE mean fluorescence of SRE-specific neighbours of AncSR1+11P:GSKV illustrated in d. Error bars, 90% confidence intervals.
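The two fixation models in panel c can be sketched as a walk over SRE-enhancing steps. This is a rough Python illustration under stated assumptions, not the authors' implementation: the caption does not give the exact weighting, so the correlated model here assumes step probability proportional to the gain in mean fluorescence, and the genotype labels, network and fluorescence values are hypothetical.

```python
def outcome_probabilities(start, neighbours, fluor, outcomes, model="equal"):
    """Probability of reaching each outcome by following only SRE-enhancing steps.

    neighbours: dict mapping a genotype to its single-mutation neighbours
    fluor:      dict mapping a genotype to its SRE mean fluorescence
    outcomes:   set of genotypes at which a trajectory stops
    model:      "equal"      -> every enhancing step is equally likely
                "correlated" -> ASSUMPTION: probability proportional to the gain
    """
    probs = {}

    def walk(node, p):
        if node in outcomes:
            probs[node] = probs.get(node, 0.0) + p
            return
        # Keep only steps that strictly increase fluorescence, so the walk terminates.
        steps = [(n, fluor[n] - fluor[node])
                 for n in neighbours.get(node, [])
                 if fluor[n] > fluor[node]]
        if not steps:
            return  # dead end: no enhancing step is available
        if model == "equal":
            weights = [1.0] * len(steps)
        else:
            weights = [gain for _, gain in steps]
        total = sum(weights)
        for (n, _), w in zip(steps, weights):
            walk(n, p * w / total)

    walk(start, 1.0)
    return probs


# Hypothetical toy network (labels and values are illustrative only).
neighbours = {"start": ["i1", "i2"], "i1": ["outA"], "i2": ["outA", "outB"]}
fluor = {"start": 1.0, "i1": 2.0, "i2": 4.0, "outA": 5.0, "outB": 6.0}
outcomes = {"outA", "outB"}

print(outcome_probabilities("start", neighbours, fluor, outcomes, "equal"))
print(outcome_probabilities("start", neighbours, fluor, outcomes, "correlated"))
```

In this toy network the equal model favours outA because two routes lead to it, whereas weighting by gain shifts probability towards outB; the comparison shows how the choice of fixation model alone can change which outcome looks most likely.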
Extended Data Figure 6
a, Alternative ERE-specific starting points reach SRE-specific outcomes with very different amino-acid states. For each starting point accessing at least 15 outcomes (the median of all starting points), the frequency profile of amino-acid states at each RH site was determined for the set of SRE-specific outcomes reached in three or fewer steps; for each pair of starting points, the Jensen–Shannon (J–S) distance between profiles was calculated. Blue curve, distribution of pairs of starting points by J–S distances of the outcomes they reach; grey, distribution of J–S distances between profiles for randomly sampled sets of SRE-specific variants. In each modal peak, the amino-acid frequency profiles for outcomes reached by a representative pair of ERE-specific starting points are shown. b–d, Contingency in the accessibility of individual SRE-specific outcomes remains when path lengths longer than the historical trajectory are considered. Plots are equivalent to Fig. 2b–d but for trajectories of increasing length.
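The J–S distance between amino-acid frequency profiles in panel a can be illustrated with a short Python sketch. It assumes the standard definition (square root of the Jensen–Shannon divergence, base-2 logarithm, so values lie between 0 and 1) and computes one distance per RH site; how the per-site values were combined in the study is not stated in this caption, so that step is left out. Function names and example sets are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_profile(variants, site):
    """Frequency of each amino-acid state at one RH site across a set of variants."""
    counts = Counter(v[site] for v in variants)
    n = len(variants)
    return [counts.get(a, 0) / n for a in AMINO_ACIDS]


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions (base-2 logs)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero probability contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Hypothetical outcome sets reached from two starting points (illustration only).
outcomes_a = ["GSKV", "GGKV", "GSKV"]
outcomes_b = ["ASKV", "ASRV", "GSKV"]

for site in range(4):
    d = js_distance(site_profile(outcomes_a, site), site_profile(outcomes_b, site))
    print(site, round(d, 3))
```

A distance of 0 means two starting points reach outcomes with identical state frequencies at that site, and 1 means they share no states at all.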
Extended Data Figure 7 The historical starting point cannot access the derived function without permissive mutations.
a, AncSR1 RH functional network layout as in Fig. 3c, with the shortest paths from AncSR1:EGKA to SRE specificity highlighted. The ancestral RH (EGKA) can access SRE specificity. However, all trajectories are at least five steps long, require permissive RH changes that confer no SRE activity (for example, K28R and G26A), and proceed through promiscuous intermediates. b, For paths highlighted in a, SRE mean fluorescence is shown versus mutational distance from AncSR1:EGKA; grey lines connect variants separated by single-nucleotide mutations. Error bars, 90% confidence intervals. Green dashed line, activity of AncSR1+11P:GSKV on SRE. AncSR1:EGKA was represented by only seven cells in the SRE library, so its FACS-seq SRE mean fluorescence estimate is unreliable (and its classification was thus inferred by the predictive model). In isolated flow cytometry experiments, its SRE mean fluorescence was indistinguishable from null alleles; the decrease in SRE mean fluorescence from step 0 to step 1 suggested by this figure is therefore more probably a flat line (no change in SRE activity). c, Stochasticity and contingency in trajectories of functional change. Diagrams illustrate paths from a purple starting point (left) to possible green outcomes (right). In a deterministic trajectory (i), a particular genotype encoding the green function will evolve deterministically if selection favours acquisition of the green function and only that genotype is accessible. The outcome of evolution is stochastic (ii) if multiple outcomes are accessible, so which one occurs is random. An outcome is contingent (iii) if its accessibility depends on the prior occurrence of some step that cannot be driven by selection for that outcome. Contingency and stochasticity can occur independently (ii and iii), or they can co-occur in serial (iv).
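Paths in these networks are built from steps between RH genotypes separated by a single nucleotide mutation. The sketch below shows one way to test that relationship in Python using the standard genetic code; it is an illustration with hypothetical function names, not the study's pipeline, and it ignores which specific codons were present historically.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA_STRING)}


def aa_single_nt_neighbours(a, b):
    """True if some codon for `a` differs from some codon for `b` at exactly one nucleotide."""
    return any(
        sum(x != y for x, y in zip(c1, c2)) == 1
        for c1, aa1 in CODONS.items() if aa1 == a
        for c2, aa2 in CODONS.items() if aa2 == b
    )


def genotypes_adjacent(g1, g2):
    """True if two RH genotypes differ at one site by a change reachable with one nucleotide mutation."""
    diffs = [(a, b) for a, b in zip(g1, g2) if a != b]
    return len(diffs) == 1 and aa_single_nt_neighbours(*diffs[0])


# The three-step path EGKA-GGKA-GGKV-GSKV described for Extended Data Figure 5a.
path = ["EGKA", "GGKA", "GGKV", "GSKV"]
print(all(genotypes_adjacent(a, b) for a, b in zip(path, path[1:])))
```

With an adjacency test of this kind, minimum path lengths such as the five-step trajectories in panel a could be found by a breadth-first search restricted to functional genotypes.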
Extended Data Figure 8 The effect of historical permissive substitutions is mediated by non-specific increases in affinity.
a–d, The 11P substitutions non-specifically increase transcriptional activity as measured by FACS-seq, consistent with FoldX predictions of effects on binding affinity. a, Classification of SRE-specific variants as 11P-dependent (orange) and 11P-independent (yellow) on the basis of their functions in AncSR1 and AncSR1+11P backgrounds. Icons for individual variants specifically assessed in b and c are shown. b, FACS-seq mean fluorescence estimates for 11P-dependent (orange) and 11P-independent (yellow) RH variants in the AncSR1 (left) and AncSR1+11P (right) backgrounds, shown as box-and-whisker plots as in Fig. 4a. Icons represent variants validated in c. P values, Wilcoxon rank-sum test with continuity correction. The mean fluorescence of 11P-independent genotypes is significantly higher in the AncSR1 background but not in AncSR1+11P. c, Validation of apparently restrictive effect of 11P on some genotypes. For three variants non-functional in AncSR1+11P but SRE-specific in AncSR1 FACS-seq assays (×), we measured mean fluorescence of isogenic cultures by flow cytometry. We also assayed variants SRE-specific in AncSR1+11P and SRE-specific (square) or non-functional (open circle) in AncSR1, as validation controls. Isogenic mean fluorescence is represented as mean ± s.e.m. from three replicate transformations and inductions analysed via flow cytometry. All FACS-seq classifications were validated except for the three apparently restricted variants in AncSR1+11P (highlighted in red), which are in fact strong SRE-activators in this background. Each of these variants was predicted to be a strong SRE-binder on the basis of its genotype, but had an artificially low FACS-seq mean fluorescence estimate, perhaps because of a strong growth defect in inducing conditions.
d, After removing the three genotypes with inaccurate FACS-seq fluorescence measurements (×), 11P-independent genotypes have significantly higher mean fluorescence than 11P-dependent genotypes in the AncSR1+11P background, consistent with a non-specific permissive mechanism via affinity. P values, Wilcoxon rank-sum test with continuity correction. e, The 11P substitutions do not alter the genetic determinants of SRE specificity. Each plot shows, for a variable site in the library, the frequency of every amino-acid state in two functionally defined sets of variants. Spearman’s ρ for each correlation is shown. The top row shows that the determinants of SRE specificity are similar in AncSR1 and AncSR1+11P libraries; the bottom row shows a much weaker relationship between the determinants of SRE and ERE specificity within the AncSR1+11P library. f, Biochemical determinants of ERE and SRE specificity in the AncSR1 (top) and AncSR1+11P (bottom) backgrounds. A multiple logistic regression model predicts the probability that a variant is response-element-specific from the biochemical properties of its amino-acid state at each of the four variable RH sites. The coefficients of this model represent the change in log-odds of being ERE-specific or SRE-specific per unit change in each property. Asterisks indicate site-specific determinants that differ significantly between ERE and SRE specificity in each background (Z test, P < 0.05).
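The Z test in panel f compares regression coefficients between the ERE and SRE models. A minimal Python sketch follows, assuming the test takes the form given by Paternoster et al. in the reference list, Z = (b1 − b2) / sqrt(SE1² + SE2²), with a two-sided P value from the standard normal distribution. The function name and the numerical inputs are hypothetical.

```python
import math


def coefficient_z_test(b1, se1, b2, se2):
    """Z statistic and two-sided P value for the difference of two regression coefficients.

    ASSUMPTION: Z = (b1 - b2) / sqrt(se1^2 + se2^2), where b1 and b2 are
    coefficients (log-odds per unit of a biochemical property) from two
    independently fitted models and se1, se2 are their standard errors.
    """
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # Two-sided P value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical coefficients for one property at one RH site (illustration only).
z, p = coefficient_z_test(b1=1.0, se1=0.3, b2=0.0, se2=0.4)
print(round(z, 3), round(p, 4), p < 0.05)
```

Pooling the two standard errors in the denominator reflects that both coefficients are estimates with their own uncertainty, so the difference is judged against their combined variance.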
This file contains a list of oligonucleotide sequences used in this study. (XLSX 23 kb)
This file contains a list of all RH sequences and their specificity classifications in each protein background as used in this study. (XLSX 9310 kb)
About this article
Cite this article
Starr, T., Picton, L. & Thornton, J. Alternative evolutionary histories in the sequence space of an ancient protein. Nature 549, 409–413 (2017). https://doi.org/10.1038/nature23902