High-resolution mapping of protein sequence-function relationships

Journal name:
Nature Methods


We present a large-scale approach to investigate the functional consequences of sequence variation in a protein. The approach entails the display of hundreds of thousands of protein variants, moderate selection for activity and high-throughput DNA sequencing to quantify the performance of each variant. Using this strategy, we tracked the performance of >600,000 variants of a human WW domain after three and six rounds of selection by phage display for binding to its peptide ligand. Binding properties of these variants defined a high-resolution map of mutational preference across the WW domain; each position had unique features that could not be captured by a few representative mutations. Our approach could be applied to many in vitro or in vivo protein assays, providing a general means for understanding how protein function relates to sequence.

At a glance


  1. Figure 1: A highly parallel assay for exploring protein sequence-function relationships.

    (a) Cartoon, generated with the PyMOL software, of the NMR spectroscopy structure of the hYAP65 WW domain (Protein Data Bank identifier 1jmq) in complex with its peptide ligand (ref. 16). The portion of the mutagenized region of the hYAP65 WW domain whose encoding DNA was sequenced is shown in blue. Arrows indicate locations of sequencing primers. (b) A library of sequences encoding variant WW domains was generated using chemical DNA synthesis with doped nucleotide pools, PCR-amplified and displayed as a fusion to T7 capsid protein. The input phage library was subjected to successive rounds of selection. Each round consisted of phage binding to peptide ligand immobilized on beads, washing to remove unbound phage, and elution and amplification of bound phage. Sequencing libraries were created by PCR from input phage and from phage after three and six rounds of selection, and were sequenced using overlapping paired-end reads on the Illumina platform. Four example variants of differing affinity are shown in different colors. Green arrows indicate locations of sequencing primers.

  2. Figure 2: Comparison of mutational tolerance and evolutionary conservation in the WW domain.

    (a) The mutational diversity of the input, round-three and round-six WW domain libraries is shown. Mutations are enumerated by position in the domain and by amino acid substitution. (b) Ratio of mutational frequencies in the round-six/input libraries observed at each position in the WW domain. Data for residues intolerant to mutation are shown as blue lines; data for beneficial mutations are shown as red lines. Residues making substantial contact with the peptide are underlined. (c) A comparison of the mutational preference at each position of the hYAP65 WW domain to a consensus WW domain sequence. Residues in the hYAP65 sequence that are identical to the consensus sequence amino acids and that were mutationally intolerant in our assay are highlighted in blue. Using mutational frequencies for enriched variants, we generated a logo plot that indicates mutational preference at each position (the plot shows only mutations that are advantageous and uses the standard coloring scheme of WebLogo). Residues for which mutations to the consensus sequence were beneficial are boxed in green in the consensus sequence. The gray arginine residue is present in the hYAP65 WW domain but not in the consensus (dash). A plot of conservation is shown as the percentage of sequences in the alignment that are identical to the consensus at each position. (d) The mutational frequency ratio data from b were log2-transformed and projected onto the space-filling model of the hYAP65 WW domain NMR spectroscopy structure (Protein Data Bank identifier 1jmq) using the PyMOL software (ref. 16). Positions at which the frequency of mutations increased are shown in red, and positions at which the frequency of mutations decreased are shown in blue.

  3. Figure 3: Comprehensive sequence-function map of the WW domain.

    We calculated enrichment ratios (round-six/input libraries) for each amino acid at each position in the WW domain. Each plot corresponds to a different amino acid substitution profile as indicated. The x axes indicate position along the WW domain and the y axes indicate log2(enrichment ratio). Blue dots indicate a measured enrichment ratio and red dots indicate the wild-type sequence, which was enriched 1.7-fold. Gray dots indicate mutations that were not observed; these were arbitrarily placed at zero. The alanine plot corresponds to a traditional alanine scan of the WW domain.

  4. Figure 4: Prediction of WW domain folding energies and double-mutant enrichment ratios.

    (a) The Rosetta framework was used to calculate folding energies for 16,363 full-length hYAP65 WW domain variants. Predicted folding energies relative to the wild-type WW domain energy are plotted against the observed fitness (variant/wild type) for the variants containing one, two or more mutations. (b) We predicted double-mutant enrichment ratios from a basis set of single-mutant enrichment ratios using a product model.
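
The enrichment ratios of Figure 3 and the product model of Figure 4b can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the variant names and read counts are invented, the function names are hypothetical, and the sketch assumes that an enrichment ratio is the frequency of a variant in the selected library divided by its frequency in the input library, and that the product model multiplies the two single-mutant ratios without further normalization.

```python
import math


def frequencies(counts):
    """Convert raw read counts per variant into library frequencies."""
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}


def enrichment_ratios(input_counts, selected_counts):
    """Frequency after selection divided by frequency in the input library.

    Variants absent from either library are skipped (assumption of this sketch).
    """
    f_in = frequencies(input_counts)
    f_sel = frequencies(selected_counts)
    return {v: f_sel[v] / f_in[v] for v in f_in if v in f_sel}


def predict_double(ratio_a, ratio_b):
    """Product model: predicted double-mutant ratio from two single-mutant ratios."""
    return ratio_a * ratio_b


# Invented toy counts; "mutA" and "mutB" are placeholder variant names.
input_counts = {"WT": 400, "mutA": 200, "mutB": 200, "mutA+mutB": 200}
round6_counts = {"WT": 400, "mutA": 400, "mutB": 100, "mutA+mutB": 100}

ratios = enrichment_ratios(input_counts, round6_counts)
log2_ratios = {v: math.log2(r) for v, r in ratios.items()}

predicted = predict_double(ratios["mutA"], ratios["mutB"])
observed = ratios["mutA+mutB"]
print(f"predicted double-mutant ratio: {predicted:.2f}, observed: {observed:.2f}")
```

In this toy data the predicted and observed double-mutant ratios differ, which is the kind of deviation from the product model that a comparison such as Figure 4b would expose.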


  1. Sidhu, S.S. & Koide, S. Phage display for engineering and analyzing protein interaction interfaces. Curr. Opin. Struct. Biol. 17, 481–487 (2007).
  2. Matouschek, A., Kellis, J.T. Jr., Serrano, L. & Fersht, A.R. Mapping the transition state and pathway of protein folding by protein engineering. Nature 340, 122–126 (1989).
  3. Cunningham, B.C. & Wells, J.A. High-resolution epitope mapping of hGH-receptor interactions by alanine-scanning mutagenesis. Science 244, 1081–1085 (1989).
  4. Levin, A.M. & Weiss, G.A. Optimizing the affinity and specificity of proteins with molecular display. Mol. Biosyst. 2, 49–57 (2006).
  5. Pal, G., Kouadio, J.L., Artis, D.R., Kossiakoff, A.A. & Sidhu, S.S. Comprehensive and quantitative mapping of energy landscapes for protein-protein interactions by rapid combinatorial scanning. J. Biol. Chem. 281, 22378–22385 (2006).
  6. Dias-Neto, E. et al. Next-generation phage display: integrating and comparing available molecular tools to enable cost-effective high-throughput analysis. PLoS ONE 4, e8338 (2009).
  7. Ge, X., Mazor, Y., Hunicke-Smith, S.P., Ellington, A.D. & Georgiou, G. Rapid construction and characterization of synthetic antibody libraries without DNA amplification. Biotechnol. Bioeng. 106, 347–357 (2010).
  8. Di Niro, R. et al. Rapid interactome profiling by massive sequencing. Nucleic Acids Res. 38, e110 (2010).
  9. Macias, M.J., Wiesner, S. & Sudol, M. WW and SH3 domains, two different scaffolds to recognize proline-rich ligands. FEBS Lett. 513, 30–37 (2002).
  10. Espanel, X. et al. Probing WW domains to uncover and refine determinants of specificity in ligand recognition. Cytotechnology 43, 105–111 (2003).
  11. Jager, M., Nguyen, H., Crane, J.C., Kelly, J.W. & Gruebele, M. The folding mechanism of a beta-sheet: the WW domain. J. Mol. Biol. 311, 373–393 (2001).
  12. Jiang, X., Kowalski, J. & Kelly, J.W. Increasing protein stability using a rational approach combining sequence homology and structural alignment: stabilizing the WW domain. Protein Sci. 10, 1454–1465 (2001).
  13. Kasanov, J., Pirozzi, G., Uveges, A.J. & Kay, B.K. Characterizing class I WW domains defines key specificity determinants and generates mutant domains with novel specificities. Chem. Biol. 8, 231–241 (2001).
  14. Koepf, E.K. et al. Characterization of the structure and function of W→F WW domain variants: identification of a natively unfolded protein that folds upon ligand binding. Biochemistry 38, 14338–14351 (1999).
  15. Nguyen, H., Jager, M., Moretto, A., Gruebele, M. & Kelly, J.W. Tuning the free-energy landscape of a WW domain by temperature, mutation, and truncation. Proc. Natl. Acad. Sci. USA 100, 3948–3953 (2003).
  16. Pires, J.R. et al. Solution structures of the YAP65 WW domain and the variant L30 K in complex with the peptides GTPPPPYTVG, N-(n-octyl)-GPPPY and PLPPY and the application of peptide libraries reveal a minimal binding epitope. J. Mol. Biol. 314, 1147–1156 (2001).
  17. Toepert, F., Pires, J.R., Landgraf, C., Oschkinat, H. & Schneider-Mergener, J. Synthesis of an array comprising 837 variants of the hYAP WW protein domain. Angew. Chem. Int. Edn Engl. 40, 897–900 (2001).
  18. Yanagida, H., Matsuura, T. & Yomo, T. Compensatory evolution of a WW domain variant lacking the strictly conserved Trp residue. J. Mol. Evol. 66, 61–71 (2008).
  19. Dalby, P.A., Hoess, R.H. & DeGrado, W.F. Evolution of binding affinity in a WW domain probed by phage display. Protein Sci. 9, 2366–2376 (2000).
  20. Dai, M. et al. Using T7 phage display to select GFP-based binders. Protein Eng. Des. Sel. 21, 413–424 (2008).
  21. Quail, M.A. et al. A large genome center's improvements to the Illumina sequencing system. Nat. Methods 5, 1005–1010 (2008).
  22. Knight, R. & Yarus, M. Analyzing partially randomized nucleic acid pools: straight dope on doping. Nucleic Acids Res. 31, e30 (2003).
  23. Weiss, G.A., Watanabe, C.K., Zhong, A., Goddard, A. & Sidhu, S.S. Rapid mapping of protein functional epitopes by combinatorial alanine scanning. Proc. Natl. Acad. Sci. USA 97, 8950–8954 (2000).
  24. Guo, H.H., Choe, J. & Loeb, L.A. Protein tolerance to random amino acid change. Proc. Natl. Acad. Sci. USA 101, 9205–9210 (2004).
  25. Das, R. & Baker, D. Macromolecular modeling with Rosetta. Annu. Rev. Biochem. 77, 363–382 (2008).
  26. Kortemme, T. & Baker, D. A simple physical model for binding energy hot spots in protein-protein complexes. Proc. Natl. Acad. Sci. USA 99, 14116–14121 (2002).
  27. Bershtein, S., Segal, M., Bekerman, R., Tokuriki, N. & Tawfik, D.S. Robustness-epistasis link shapes the fitness landscape of a randomly drifting protein. Nature 444, 929–932 (2006).
  28. Weinreich, D.M., Delaney, N.F., Depristo, M.A. & Hartl, D.L. Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312, 111–114 (2006).
  29. Ge, B. et al. Survey of allelic expression using EST mining. Genome Res. 15, 1584–1591 (2005).
  30. Storey, J.D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. USA 100, 9440–9445 (2003).

Author information


  1. Department of Genome Sciences, University of Washington, Seattle, Washington, USA.

    • Douglas M Fowler,
    • Carlos L Araya,
    • Jason J Stephany &
    • Stanley Fields
  2. Department of Biochemistry, University of Washington, Seattle, Washington, USA.

    • Sarel J Fleishman,
    • Elizabeth H Kellogg &
    • David Baker
  3. Howard Hughes Medical Institute, Seattle, Washington, USA.

    • Jason J Stephany,
    • David Baker &
    • Stanley Fields
  4. Department of Medicine, University of Washington, Seattle, Washington, USA.

    • Stanley Fields


D.M.F. conceived of the method, carried out the experiments, analyzed the data and wrote the paper; C.L.A. conceived of the method, analyzed the data and wrote the paper; J.J.S. carried out the experiments; E.H.K., S.J.F. and D.B. carried out the protein folding and binding energy calculations; and S.F. conceived of the method and wrote the paper.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (9M)

    Supplementary Figures 1–11, Supplementary Tables 1–2, Supplementary Note 1
