  • Letter

De novo protein design by citizen scientists

Abstract

Online citizen science projects such as GalaxyZoo (ref. 1), Eyewire (ref. 2) and Phylo (ref. 3) have proven very successful for data collection, annotation and processing, but for the most part have harnessed human pattern-recognition skills rather than human creativity. An exception is the game EteRNA (ref. 4), in which game players learn to build new RNA structures by exploring the discrete two-dimensional space of Watson–Crick base pairing possibilities. Building new proteins, however, is a more challenging task to present in a game, as both the representation and evaluation of a protein structure are intrinsically three-dimensional. We posed the challenge of de novo protein design in the online protein-folding game Foldit (ref. 5). Players were presented with a fully extended peptide chain and challenged to craft a folded protein structure and an amino acid sequence encoding that structure. After many iterations of player design, analysis of the top-scoring solutions and subsequent game improvement, Foldit players can now, starting from an extended polypeptide chain, generate a diversity of protein structures and sequences that encode them in silico. One hundred forty-six Foldit player designs with sequences unrelated to naturally occurring proteins were encoded in synthetic genes; 56 were found to be expressed and soluble in Escherichia coli, and to adopt stable monomeric folded structures in solution. The diversity of these structures is unprecedented in de novo protein design, representing 20 different folds, including a new fold not observed in natural proteins. High-resolution structures were determined for four of the designs, and are nearly identical to the player models. This work makes explicit the considerable implicit knowledge that contributes to success in de novo protein design, and shows that citizen scientists can discover creative new solutions to outstanding scientific challenges such as the protein design problem.

Fig. 1: The Foldit user interface.
Fig. 2: Comparison of Foldit player and automated design-sampling strategies.
Fig. 3: Structural characterization of Foldit player-designed proteins.
Fig. 4: High-resolution structures of Foldit player-designed proteins.

Data availability

The atomic coordinates of Foldit1, Peak6 and Ferredog-Diesel crystal structures and the Foldit3 NMR structure have been deposited in the RCSB Protein Data Bank (PDB) with accession numbers 6MRR, 6MRS, 6NUK and 6MSP, respectively. Chemical shift and NOESY peak list data for Foldit3 were deposited in the Biological Magnetic Resonance Data Bank with accession number 30527.

Code availability

Because Foldit crowdsourcing relies on regulated, fair competition between participants, the source code of the Foldit user interface is not open. The underlying Rosetta macromolecular modelling suite (https://www.rosettacommons.org) is freely available to academic and non-commercial users, and commercial licenses are available via the University of Washington CoMotion Express License Program. Analysis scripts used in this paper are available in the Supplementary Information.

References

  1. Lintott, C. J. et al. Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Mon. Not. R. Astron. Soc. 389, 1179–1189 (2008).

  2. Kim, J. S. et al. Space-time wiring specificity supports direction selectivity in the retina. Nature 509, 331–336 (2014).

  3. Kawrykow, A. et al. Phylo: a citizen science approach for improving multiple sequence alignment. PLoS ONE 7, e31362 (2012).

  4. Lee, J. et al. RNA design rules from a massive open laboratory. Proc. Natl Acad. Sci. USA 111, 2122–2127 (2014).

  5. Cooper, S. et al. Predicting protein structures with a multiplayer online game. Nature 466, 756–760 (2010).

  6. Epstein, C. J., Goldberger, R. F. & Anfinsen, C. B. The genetic control of tertiary protein structure: studies with model systems. Cold Spring Harb. Symp. Quant. Biol. 28, 439–449 (1963).

  7. Lin, Y.-R. et al. Control over overall shape and size in de novo designed proteins. Proc. Natl Acad. Sci. USA 112, E5478–E5485 (2015).

  8. Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).

  9. Marcos, E. et al. Principles for designing proteins with cavities formed by curved β sheets. Science 355, 201–206 (2017).

  10. Dou, J. et al. De novo design of a fluorescence-activating β-barrel. Nature 561, 485–491 (2018).

  11. Alford, R. F. et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017).

  12. Khatib, F. et al. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nat. Struct. Mol. Biol. 18, 1175–1177 (2011).

  13. Eiben, C. B. et al. Increased Diels–Alderase activity through backbone remodeling guided by Foldit players. Nat. Biotechnol. 30, 190–192 (2012).

  14. Blout, E. R. & Idelson, M. Compositional effects on the configuration of water-soluble polypeptide copolymers of l-glutamic acid and l-lysine. J. Am. Chem. Soc. 80, 4909–4913 (1958).

  15. Doty, P., Imahori, K. & Klemperer, E. The solution properties and configurations of a polyampholytic polypeptide: copoly-l-lysine-l-glutamic acid. Proc. Natl Acad. Sci. USA 44, 424–431 (1958).

  16. Ghosh, K. & Dill, K. A. Theory for protein folding cooperativity: helix bundles. J. Am. Chem. Soc. 131, 2306–2312 (2009).

  17. Rohl, C. A., Strauss, C. E. M., Misura, K. M. S. & Baker, D. Protein structure prediction using Rosetta. Methods Enzymol. 383, 66–93 (2004).

  18. Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).

  19. Regan, L. & DeGrado, W. F. Characterization of a helical protein designed from first principles. Science 241, 976–978 (1988).

  20. Harbury, P. B., Plecs, J. J., Tidor, B., Alber, T. & Kim, P. S. High-resolution protein design with backbone freedom. Science 282, 1462–1467 (1998).

  21. Thomson, A. R. et al. Computational design of water-soluble α-helical barrels. Science 346, 485–488 (2014).

  22. Jacobs, T. M. et al. Design of structurally distinct proteins using strategies inspired by evolution. Science 352, 687–690 (2016).

  23. Ramachandran, G. N. & Sasisekharan, V. Conformation of polypeptides and proteins. Adv. Protein Chem. 23, 283–438 (1968).

  24. Chen, V. B. et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. D 66, 12–21 (2010).

  25. Montelione, G. T. et al. Recommendations of the wwPDB NMR Validation Task Force. Structure 21, 1563–1570 (2013).

  26. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).

  27. Santoro, M. M. & Bolen, D. W. Unfolding free energy changes determined by the linear extrapolation method. 1. Unfolding of phenylmethanesulfonyl α-chymotrypsin using different denaturants. Biochemistry 27, 8063–8068 (1988).

  28. Otwinowski, Z. & Minor, W. Processing of X-ray diffraction data collected in oscillation mode. Methods Enzymol. 276, 307–326 (1997).

  29. McCoy, A. J. et al. Phaser crystallographic software. J. Appl. Crystallogr. 40, 658–674 (2007).

  30. Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Crystallogr. D 66, 486–501 (2010).

  31. Afonine, P. V. et al. Towards automated crystallographic structure refinement with phenix.refine. Acta Crystallogr. D 68, 352–367 (2012).

  32. Jansson, M. et al. High-level production of uniformly 15N- and 13C-enriched fusion proteins in Escherichia coli. J. Biomol. NMR 7, 131–141 (1996).

  33. Delaglio, F. et al. NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR 6, 277–293 (1995).

  34. Bartels, C., Xia, T. H., Billeter, M., Güntert, P. & Wüthrich, K. The program XEASY for computer-supported NMR spectral analysis of biological macromolecules. J. Biomol. NMR 6, 1–10 (1995).

  35. Liu, G. et al. NMR data collection and analysis protocol for high-throughput protein structure determination. Proc. Natl Acad. Sci. USA 102, 10487–10492 (2005).

  36. Shen, Y., Delaglio, F., Cornilescu, G. & Bax, A. TALOS+: a hybrid method for predicting protein backbone torsion angles from NMR chemical shifts. J. Biomol. NMR 44, 213–223 (2009).

  37. Huang, Y. J., Tejero, R., Powers, R. & Montelione, G. T. A topology-constrained distance network algorithm for protein structure determination from NOESY data. Proteins 62, 587–603 (2006).

  38. Güntert, P., Mumenthaler, C. & Wüthrich, K. Torsion angle dynamics for NMR structure calculation with the new program DYANA. J. Mol. Biol. 273, 283–298 (1997).

  39. Herrmann, T., Güntert, P. & Wüthrich, K. Protein NMR structure determination with automated NOE assignment using the new software CANDID and the torsion angle dynamics algorithm DYANA. J. Mol. Biol. 319, 209–227 (2002).

  40. Huang, Y. J., Powers, R. & Montelione, G. T. Protein NMR recall, precision, and F-measure scores (RPF scores): structure quality assessment measures based on information retrieval statistics. J. Am. Chem. Soc. 127, 1665–1674 (2005).

  41. Linge, J. P., Williams, M. A., Spronk, C. A., Bonvin, A. M. & Nilges, M. Refinement of protein structures in explicit solvent. Proteins 50, 496–506 (2003).

  42. Brünger, A. T. et al. Crystallography & NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr. D 54, 905–921 (1998).

  43. Lüthy, R., Bowie, J. U. & Eisenberg, D. Assessment of protein models with three-dimensional profiles. Nature 356, 83–85 (1992).

  44. Sippl, M. J. Recognition of errors in three-dimensional structures of proteins. Proteins 17, 355–362 (1993).

  45. Laskowski, R. A., Macarthur, M. W., Moss, D. S. & Thornton, J. M. Procheck—a program to check the stereochemical quality of protein structures. J. Appl. Crystallogr. 26, 283–291 (1993).

  46. Word, J. M., Bateman, R. C., Jr, Presley, B. K., Lovell, S. C. & Richardson, D. C. Exploring steric constraints on protein mutations using MAGE/PROBE. Protein Sci. 9, 2251–2259 (2000).

  47. Bhattacharya, A., Tejero, R. & Montelione, G. T. Evaluating protein structures determined by structural genomics consortia. Proteins 66, 778–795 (2007).

  48. Tejero, R., Snyder, D., Mao, B., Aramini, J. M. & Montelione, G. T. PDBStat: a universal restraint converter and restraint analysis software package for protein NMR. J. Biomol. NMR 56, 337–351 (2013).

  49. Trifonov, E. N. in Structure and Methods, Vol. 1: The Proceedings of the Sixth Conversation held at The University–SUNY (Adenine, 1990).

  50. Holm, L. & Laakso, L. M. Dali server update. Nucleic Acids Res. 44 (W1), W351–W355 (2016).

Acknowledgements

We thank all Foldit players for their gameplay contributions, and for feedback offered on the Foldit website (https://fold.it). We thank A. Kang, S. A. Rettie, C. Chow and L. Carter for help with experiments; D. Alonso, L. Goldschmidt, P. Vecchiato and D. Kim for computer support; and Rosetta@home (https://boinc.bakerlab.org) volunteers for computing resources. We thank G. Rocklin, V. Mulligan and other members of the Baker laboratory for discussions. This material is based on work supported by the National Science Foundation (NSF) Graduate Research Fellowship under grant no. DGE-1256082, NSF grant no. 1629879, National Institutes of Health (NIH) grant 1UH2CA203780, and NIH grants 1S10 OD018207 and 5R01 GM120574 (to G.T.M.) and HHMI (D.B.). The ALS-ENABLE beamlines are supported in part by the NIH, National Institute of General Medical Sciences, grant P30 GM124169-01. The Advanced Light Source is a DOE User Facility under contract no. DE-AC02-05CH11231. Foldit3 was a nominated target of the CASP COMMONS Community Outreach program.

Reviewer information

Nature thanks Jérôme Waldispuhl and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Author information

Contributions

B.K., Z.P., F.K., S.C. and D.B. designed the study. B.K., J.F., T.H., A.F., D.-A.S. and S.C. developed Foldit software tools. A. Boykov, R.D.E., S.K., T.N.-S. and L.W., along with the other Foldit players, designed all proteins. B.K., F.K., A.F. and A. Bauer analysed Foldit player designs. B.K. performed biophysical characterization. B.K., M.J.B. and F.D. determined crystal structures. G.L., Y.I. and G.T.M. determined the NMR structure. B.K. and D.B. wrote the manuscript with input from all authors. Foldit players contributed extensively through their feedback and gameplay, which generated the data for this paper.

Corresponding author

Correspondence to David Baker.

Ethics declarations

Competing interests

G.T.M. is a co-founder of Nexomics Biosciences.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Initial top-ranking Foldit player designs.

When challenged to design a protein with only the talaris2013 score function (and no additional rules), Foldit players discovered low-energy models that are unlikely to fold as designed. a, An extended α-helix, composed entirely of lysine and glutamate, has very favourable energies for hydrogen-bonding, electrostatic and backbone torsions, but is unlikely to fold cooperatively into a single stable structure. This type of design is discouraged with the ‘core exists’ rule. b, Owing to their greater surface area, large aromatic sidechains can make more interactions than smaller aliphatic sidechains, even when underpacked or solvent-exposed. This type of design is discouraged with the ‘residue interaction energy’ rule. c, A design with an alanine- and glycine-saturated core can make favourable van der Waals interactions between closely packed backbone atoms; however, the burial of these small sidechains is associated with a weaker hydrophobic effect, and the lack of interdigitation allows exchange between multiple conformations with similar core packing energies (that is, molten globule behaviour). These designs are discouraged with the ‘secondary structure design’ rule.

Extended Data Fig. 2 Rosetta energy of top Foldit player designs.

Rosetta energy of top-ranking designs was calculated with the talaris2013 score function and normalized by residue count. a, Energy of top-ten-ranked designs from: initial Foldit puzzles (round 0; n = 30 designs), round 1 puzzles (n = 170), round 2 puzzles (n = 510) and round 3 puzzles (n = 250). The introduction of supplementary rules in round 1 and round 2 resulted in higher-energy designs (P < 10^−6 and P < 0.01, respectively; Wilcoxon rank-sum test). The backbone modelling improvements in round 3 resulted in lower-energy designs (P < 10^−15; Wilcoxon rank-sum test). b, Energy of top-ten-ranked designs from round 3 all-α puzzles (n = 30) or α/β-puzzles using the ‘secondary structure’ rule (n = 220). All-α designs tend to have lower energy than α/β-designs (P < 10^−10; Wilcoxon rank-sum test). Box plots show: centre line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers.
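
The normalization and box-plot conventions described in this legend can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis script used in the paper: the function names are hypothetical, the quartile interpolation method ("inclusive") is an assumption because the legend does not state one, and the whiskers are taken to end at the most extreme data points that lie within 1.5× the interquartile range of the box.

```python
from statistics import median, quantiles


def per_residue_energy(total_energy, n_residues):
    """Normalize a total energy by residue count (hypothetical helper)."""
    if n_residues <= 0:
        raise ValueError("n_residues must be positive")
    return total_energy / n_residues


def box_plot_summary(values):
    """Return the quantities drawn in a box plot with 1.5x IQR whiskers.

    Assumption: quartiles use the 'inclusive' interpolation method;
    other plotting tools may use slightly different definitions.
    """
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration; not data from the paper.
    print(per_residue_energy(-210.0, 70))
    print(box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Dividing by residue count makes designs of different lengths comparable, which matters here because the puzzles did not fix a single chain length.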

Extended Data Fig. 3 New backbone-modelling tools in Foldit.

a, The ‘remix’ tool allows players to select a region of the model and search a library of backbone fragments for a conformation that can be substituted. b, An interactive Ramachandran map allows players to easily identify residues with outlier backbone conformations. Players can also click and drag points on the Ramachandran map to set the backbone torsions of individual residues. c, A ‘blueprint’ panel shows the primary sequence and secondary structure content of the model. Residues are coloured according to the ABEGO quadrants of the Ramachandran plot (ref. 7). d, Players can drag-and-drop modular building blocks onto the blueprint panel to insert common turn conformations into their model.
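
As a rough illustration of the ABEGO colouring mentioned in panel c, the sketch below assigns a one-letter bin to each residue from its backbone torsion angles. The bin boundaries are the ones commonly used with Rosetta and are an assumption here, since the legend does not list them; the function names are hypothetical and this is not Foldit code.

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def abego_bin(phi, psi, omega=180.0):
    """Assign an ABEGO letter from torsions in degrees.

    Assumed boundaries (commonly used, not taken from the paper):
      O: cis peptide bond, |omega| < 90
      A: phi < 0 and -75 <= psi < 50   (helical region)
      B: phi < 0 and psi outside that  (extended region)
      G: phi >= 0 and -100 <= psi < 100
      E: phi >= 0 and psi outside that
    """
    phi, psi, omega = wrap_degrees(phi), wrap_degrees(psi), wrap_degrees(omega)
    if abs(omega) < 90.0:
        return "O"
    if phi >= 0.0:
        return "G" if -100.0 <= psi < 100.0 else "E"
    return "A" if -75.0 <= psi < 50.0 else "B"


def abego_string(torsions):
    """Build an ABEGO string from (phi, psi) or (phi, psi, omega) tuples."""
    return "".join(abego_bin(*t) for t in torsions)


if __name__ == "__main__":
    # Made-up torsions: two helical residues, a left-handed turn residue,
    # then two extended residues.
    example = [(-60, -45), (-65, -40), (60, 30), (-120, 130), (-130, 150)]
    print(abego_string(example))
```

A string of this kind gives a compact, per-residue description of backbone conformation, which is what makes a blueprint-style view convenient for spotting unusual turns.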

Extended Data Fig. 4 Improvement of backbone quality in round 3 Foldit designs.

MolProbity (ref. 24) was used to calculate the proportion of residues with unfavoured or outlier backbone torsions in: high-resolution crystal structures of native proteins (n = 6,342), de novo design models from a previous study (ref. 7; n = 72), and top-ranking Foldit player designs from before (n = 680) and after (n = 250) improvements to Foldit backbone-modelling tools. Initial Foldit player designs contained significantly more unfavoured torsions than native proteins or other de novo designs from a previous study (ref. 7) (P < 10^−15, two-tailed t-test). Improvements to Foldit’s backbone-modelling tools led Foldit players to produce designs with fewer unfavoured torsions (P < 10^−15, two-tailed t-test). Box plots show: centre line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers.

Extended Data Fig. 5 Protein folds represented by successful Foldit player designs.

Each fold has a unique arrangement and connectivity of secondary structure elements, depicted in cartoon diagrams. Diagrams are labelled with Roman numerals as in Fig. 3. Fold XX is a new fold, previously unobserved in natural proteins; TM-align (ref. 26) and DALI (ref. 50) alignments of design 2003594_S028 against the entire PDB found no structural homologues with this fold.

Extended Data Fig. 6 Foldit player demographics.

All players who participated in Foldit protein design puzzles and who had not opted out of Foldit-related email were solicited for survey questions. Data are shown for n = 324 responding Foldit players.

Extended Data Fig. 7 Category rankings of Foldit players.

Foldit player rankings are strongly correlated in the design and prediction categories (Spearman’s rank correlation coefficient of 0.84). This suggests that skills developed playing Foldit structure prediction puzzles carry over to design puzzles and vice versa.
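
For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation computed on ranks rather than raw values. The sketch below is a minimal standard-library implementation with average ranks for ties; the function names and the example rankings are hypothetical and are not the player data behind the reported coefficient of 0.84.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


if __name__ == "__main__":
    # Made-up category rankings for five hypothetical players.
    design_rank = [1, 2, 3, 4, 5]
    prediction_rank = [2, 1, 4, 3, 5]
    print(spearman(design_rank, prediction_rank))
```

Because it depends only on rank order, the coefficient is insensitive to how points are scaled within each category, which suits a comparison of leaderboard positions.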

Extended Data Table 1 Success rates of Foldit player-designed proteins
Extended Data Table 2 X-ray crystallography data and refinement statistics
Extended Data Table 3 NMR and refinement statistics for protein structures

Supplementary information

Supplementary Information

This file contains Supplementary Methods, Supplementary Table 1, biophysical characterization of all 56 successful protein designs, testimonials from Foldit players describing their protein design strategies, and the names of participating Foldit players.

Reporting Summary

Supplementary Data

This file contains the design models and the protein and DNA sequences for all tested protein designs. Also included are the Foldit configuration files that were used to set up design puzzles for Foldit players and the custom code used to analyse circular dichroism data.

About this article

Cite this article

Koepnick, B., Flatten, J., Husain, T. et al. De novo protein design by citizen scientists. Nature 570, 390–394 (2019). https://doi.org/10.1038/s41586-019-1274-4

  • DOI: https://doi.org/10.1038/s41586-019-1274-4
