Letter | Published:

De novo protein design by citizen scientists

Abstract

Online citizen science projects such as GalaxyZoo¹, Eyewire² and Phylo³ have proven very successful for data collection, annotation and processing, but for the most part have harnessed human pattern-recognition skills rather than human creativity. An exception is the game EteRNA⁴, in which game players learn to build new RNA structures by exploring the discrete two-dimensional space of Watson–Crick base pairing possibilities. Building new proteins, however, is a more challenging task to present in a game, as both the representation and evaluation of a protein structure are intrinsically three-dimensional. We posed the challenge of de novo protein design in the online protein-folding game Foldit⁵. Players were presented with a fully extended peptide chain and challenged to craft a folded protein structure and an amino acid sequence encoding that structure. After many iterations of player design, analysis of the top-scoring solutions and subsequent game improvement, Foldit players can now, starting from an extended polypeptide chain, generate a diversity of protein structures and sequences that encode them in silico. One hundred forty-six Foldit player designs with sequences unrelated to naturally occurring proteins were encoded in synthetic genes; 56 were found to be expressed and soluble in Escherichia coli, and to adopt stable monomeric folded structures in solution. The diversity of these structures is unprecedented in de novo protein design, representing 20 different folds, including a new fold not observed in natural proteins. High-resolution structures were determined for four of the designs, and are nearly identical to the player models. This work makes explicit the considerable implicit knowledge that contributes to success in de novo protein design, and shows that citizen scientists can discover creative new solutions to outstanding scientific challenges such as the protein design problem.

Data availability

The atomic coordinates of Foldit1, Peak6 and Ferredog-Diesel crystal structures and the Foldit3 NMR structure have been deposited in the RCSB Protein Data Bank (PDB) with accession numbers 6MRR, 6MRS, 6NUK and 6MSP, respectively. Chemical shift and NOESY peak list data for Foldit3 were deposited in the Biological Magnetic Resonance Data Bank with accession number 30527.
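
The "respectively" mapping in the statement above can be made explicit with a small lookup table. The following Python sketch is illustrative only and is not part of the paper: the helper name `rcsb_download_url` is hypothetical, and the `files.rcsb.org/download` URL pattern is an assumption about the RCSB file service rather than something stated in the article.

```python
# Design name -> PDB accession, in the order listed in the Data availability statement.
PDB_ACCESSIONS = {
    "Foldit1": "6MRR",          # crystal structure
    "Peak6": "6MRS",            # crystal structure
    "Ferredog-Diesel": "6NUK",  # crystal structure
    "Foldit3": "6MSP",          # NMR structure
}

# BMRB entry holding the Foldit3 chemical shift and NOESY peak list data.
BMRB_ACCESSION = "30527"


def rcsb_download_url(accession: str, fmt: str = "pdb") -> str:
    """Build a coordinate-file URL for one accession.

    Hypothetical helper: the URL pattern is an assumption, not taken from the paper.
    """
    return f"https://files.rcsb.org/download/{accession.upper()}.{fmt}"


if __name__ == "__main__":
    for design, accession in PDB_ACCESSIONS.items():
        print(design, accession, rcsb_download_url(accession))
```

The script only prints the URLs; fetching the files is left to the reader's tool of choice.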

Code availability

Because Foldit crowdsourcing relies on regulated, fair competition between participants, the source code of the Foldit user interface is not open. The underlying Rosetta macromolecular modelling suite (https://www.rosettacommons.org) is freely available to academic and non-commercial users, and commercial licenses are available via the University of Washington CoMotion Express License Program. Analysis scripts used in this paper are available in the Supplementary Information.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Lintott, C. J. et al. Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Mon. Not. R. Astron. Soc. 389, 1179–1189 (2008).
  2. Kim, J. S. et al. Space-time wiring specificity supports direction selectivity in the retina. Nature 509, 331–336 (2014).
  3. Kawrykow, A. et al. Phylo: a citizen science approach for improving multiple sequence alignment. PLoS ONE 7, e31362 (2012).
  4. Lee, J. et al. RNA design rules from a massive open laboratory. Proc. Natl Acad. Sci. USA 111, 2122–2127 (2014).
  5. Cooper, S. et al. Predicting protein structures with a multiplayer online game. Nature 466, 756–760 (2010).
  6. Epstein, C. J., Goldberger, R. F. & Anfinsen, C. B. The genetic control of tertiary protein structure: studies with model systems. Cold Spring Harb. Symp. Quant. Biol. 28, 439–449 (1963).
  7. Lin, Y.-R. et al. Control over overall shape and size in de novo designed proteins. Proc. Natl Acad. Sci. USA 112, E5478–E5485 (2015).
  8. Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).
  9. Marcos, E. et al. Principles for designing proteins with cavities formed by curved β sheets. Science 355, 201–206 (2017).
  10. Dou, J. et al. De novo design of a fluorescence-activating β-barrel. Nature 561, 485–491 (2018).
  11. Alford, R. F. et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017).
  12. Khatib, F. et al. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nat. Struct. Mol. Biol. 18, 1175–1177 (2011).
  13. Eiben, C. B. et al. Increased Diels–Alderase activity through backbone remodeling guided by Foldit players. Nat. Biotechnol. 30, 190–192 (2012).
  14. Blout, E. R. & Idelson, M. Compositional effects on the configuration of water-soluble polypeptide copolymers of l-glutamic acid and l-lysine. J. Am. Chem. Soc. 80, 4909–4913 (1958).
  15. Doty, P., Imahori, K. & Klemperer, E. The solution properties and configurations of a polyampholytic polypeptide: copoly-l-lysine-l-glutamic acid. Proc. Natl Acad. Sci. USA 44, 424–431 (1958).
  16. Ghosh, K. & Dill, K. A. Theory for protein folding cooperativity: helix bundles. J. Am. Chem. Soc. 131, 2306–2312 (2009).
  17. Rohl, C. A., Strauss, C. E. M., Misura, K. M. S. & Baker, D. Protein structure prediction using Rosetta. Methods Enzymol. 383, 66–93 (2004).
  18. Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).
  19. Regan, L. & DeGrado, W. F. Characterization of a helical protein designed from first principles. Science 241, 976–978 (1988).
  20. Harbury, P. B., Plecs, J. J., Tidor, B., Alber, T. & Kim, P. S. High-resolution protein design with backbone freedom. Science 282, 1462–1467 (1998).
  21. Thomson, A. R. et al. Computational design of water-soluble α-helical barrels. Science 346, 485–488 (2014).
  22. Jacobs, T. M. et al. Design of structurally distinct proteins using strategies inspired by evolution. Science 352, 687–690 (2016).
  23. Ramachandran, G. N. & Sasisekharan, V. Conformation of polypeptides and proteins. Adv. Protein Chem. 23, 283–438 (1968).
  24. Chen, V. B. et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. D 66, 12–21 (2010).
  25. Montelione, G. T. et al. Recommendations of the wwPDB NMR Validation Task Force. Structure 21, 1563–1570 (2013).
  26. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
  27. Santoro, M. M. & Bolen, D. W. Unfolding free energy changes determined by the linear extrapolation method. 1. Unfolding of phenylmethanesulfonyl α-chymotrypsin using different denaturants. Biochemistry 27, 8063–8068 (1988).
  28. Otwinowski, Z. & Minor, W. Processing of X-ray diffraction data collected in oscillation mode. Methods Enzymol. 276, 307–326 (1997).
  29. McCoy, A. J. et al. Phaser crystallographic software. J. Appl. Crystallogr. 40, 658–674 (2007).
  30. Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Crystallogr. D 66, 486–501 (2010).
  31. Afonine, P. V. et al. Towards automated crystallographic structure refinement with phenix.refine. Acta Crystallogr. D 68, 352–367 (2012).
  32. Jansson, M. et al. High-level production of uniformly ¹⁵N- and ¹³C-enriched fusion proteins in Escherichia coli. J. Biomol. NMR 7, 131–141 (1996).
  33. Delaglio, F. et al. NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR 6, 277–293 (1995).
  34. Bartels, C., Xia, T. H., Billeter, M., Güntert, P. & Wüthrich, K. The program XEASY for computer-supported NMR spectral analysis of biological macromolecules. J. Biomol. NMR 6, 1–10 (1995).
  35. Liu, G. et al. NMR data collection and analysis protocol for high-throughput protein structure determination. Proc. Natl Acad. Sci. USA 102, 10487–10492 (2005).
  36. Shen, Y., Delaglio, F., Cornilescu, G. & Bax, A. TALOS+: a hybrid method for predicting protein backbone torsion angles from NMR chemical shifts. J. Biomol. NMR 44, 213–223 (2009).
  37. Huang, Y. J., Tejero, R., Powers, R. & Montelione, G. T. A topology-constrained distance network algorithm for protein structure determination from NOESY data. Proteins 62, 587–603 (2006).
  38. Güntert, P., Mumenthaler, C. & Wüthrich, K. Torsion angle dynamics for NMR structure calculation with the new program DYANA. J. Mol. Biol. 273, 283–298 (1997).
  39. Herrmann, T., Güntert, P. & Wüthrich, K. Protein NMR structure determination with automated NOE assignment using the new software CANDID and the torsion angle dynamics algorithm DYANA. J. Mol. Biol. 319, 209–227 (2002).
  40. Huang, Y. J., Powers, R. & Montelione, G. T. Protein NMR recall, precision, and F-measure scores (RPF scores): structure quality assessment measures based on information retrieval statistics. J. Am. Chem. Soc. 127, 1665–1674 (2005).
  41. Linge, J. P., Williams, M. A., Spronk, C. A., Bonvin, A. M. & Nilges, M. Refinement of protein structures in explicit solvent. Proteins 50, 496–506 (2003).
  42. Brünger, A. T. et al. Crystallography & NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr. D 54, 905–921 (1998).
  43. Lüthy, R., Bowie, J. U. & Eisenberg, D. Assessment of protein models with three-dimensional profiles. Nature 356, 83–85 (1992).
  44. Sippl, M. J. Recognition of errors in three-dimensional structures of proteins. Proteins 17, 355–362 (1993).
  45. Laskowski, R. A., Macarthur, M. W., Moss, D. S. & Thornton, J. M. Procheck—a program to check the stereochemical quality of protein structures. J. Appl. Crystallogr. 26, 283–291 (1993).
  46. Word, J. M., Bateman, R. C., Jr, Presley, B. K., Lovell, S. C. & Richardson, D. C. Exploring steric constraints on protein mutations using MAGE/PROBE. Protein Sci. 9, 2251–2259 (2000).
  47. Bhattacharya, A., Tejero, R. & Montelione, G. T. Evaluating protein structures determined by structural genomics consortia. Proteins 66, 778–795 (2007).
  48. Tejero, R., Snyder, D., Mao, B., Aramini, J. M. & Montelione, G. T. PDBStat: a universal restraint converter and restraint analysis software package for protein NMR. J. Biomol. NMR 56, 337–351 (2013).
  49. Trifonov, E. N. in Structure and Methods, Vol. 1: The Proceedings of the Sixth Conversation held at The University–SUNY (Adenine, 1990).
  50. Holm, L. & Laakso, L. M. Dali server update. Nucleic Acids Res. 44 (W1), W351–W355 (2016).

Acknowledgements

We thank all Foldit players for their gameplay contributions, and for feedback offered on the Foldit website (https://fold.it). We thank A. Kang, S. A. Rettie, C. Chow and L. Carter for help with experiments; D. Alonso, L. Goldschmidt, P. Vecchiato and D. Kim for computer support; and Rosetta@home (https://boinc.bakerlab.org) volunteers for computing resources. We thank G. Rocklin, V. Mulligan and other members of the Baker laboratory for discussions. This material is based on work supported by the National Science Foundation (NSF) Graduate Research Fellowship under grant no. DGE-1256082, NSF grant no. 1629879, National Institutes of Health (NIH) grant 1UH2CA203780, and NIH grants 1S10 OD018207 and 5R01 GM120574 (to G.T.M.) and HHMI (D.B.). The ALS-ENABLE beamlines are supported in part by the NIH, National Institute of General Medical Sciences, grant P30 GM124169-01. The Advanced Light Source is a DOE User Facility under contract no. DE-AC02-05CH11231. Foldit3 was a nominated target of the CASP COMMONS Community Outreach program.

Reviewer information

Nature thanks Jérôme Waldispuhl and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Author information

B.K., Z.P., F.K., S.C. and D.B. designed the study. B.K., J.F., T.H., A.F., D.-A.S. and S.C. developed Foldit software tools. A. Boykov, R.D.E., S.K., T.N.-S. and L.W., along with the other Foldit players, designed all proteins. B.K., F.K., A.F. and A. Bauer analysed Foldit player designs. B.K. performed biophysical characterization. B.K., M.J.B. and F.D. determined crystal structures. G.L., Y.I. and G.T.M. determined the NMR structure. B.K. and D.B. wrote the manuscript with input from all authors. Foldit players contributed extensively through their feedback and gameplay, which generated the data for this paper.

Competing interests

G.T.M. is a co-founder of Nexomics Biosciences.

Correspondence to David Baker.

Extended data figures and tables

  1. Extended Data Fig. 1 Initial top-ranking Foldit player designs.

    When challenged to design a protein with only the talaris2013 score function (and no additional rules), Foldit players discovered low-energy models that are unlikely to fold as designed. a, An extended α-helix, composed entirely of lysine and glutamate, has very favourable energies for hydrogen-bonding, electrostatic and backbone torsions, but is unlikely to fold cooperatively into a single stable structure. This type of design is discouraged with the ‘core exists’ rule. b, Owing to their greater surface area, large aromatic sidechains can make more interactions than smaller aliphatic sidechains, even when underpacked or solvent-exposed. This type of design is discouraged with the ‘residue interaction energy’ rule. c, A design with an alanine- and glycine-saturated core can make favourable van der Waals interactions between closely packed backbone atoms; however, the burial of these small sidechains is associated with a weaker hydrophobic effect, and the lack of interdigitation allows exchange between multiple conformations with similar core packing energies (that is, molten globule behaviour). These designs are discouraged with the ‘secondary structure design’ rule.

  2. Extended Data Fig. 2 Rosetta energy of top Foldit player designs.

    Rosetta energy of top-ranking designs was calculated with the talaris2013 score function and normalized by residue count. a, Energy of top-ten-ranked designs from: initial Foldit puzzles (round 0; n = 30 designs), round 1 puzzles (n = 170), round 2 puzzles (n = 510) and round 3 puzzles (n = 250). The introduction of supplementary rules in round 1 and round 2 resulted in higher-energy designs (P < 10⁻⁶ and P < 0.01, respectively; Wilcoxon rank-sum test). The backbone modelling improvements in round 3 resulted in lower-energy designs (P < 10⁻¹⁵; Wilcoxon rank-sum test). b, Energy of top-ten-ranked designs from round 3 all-α puzzles (n = 30) or α/β-puzzles using the ‘secondary structure’ rule (n = 220). All-α designs tend to have lower energy than α/β-designs (P < 10⁻¹⁰; Wilcoxon rank-sum test). Box plots show: centre line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers.

  3. Extended Data Fig. 3 New backbone-modelling tools in Foldit.

    a, The ‘remix’ tool allows players to select a region of the model and search a library of backbone fragments for a conformation that can be substituted. b, An interactive Ramachandran map allows players to easily identify residues with outlier backbone conformations. Players can also click and drag points on the Ramachandran map to set the backbone torsions of individual residues. c, A ‘blueprint’ panel shows the primary sequence and secondary structure content of the model. Residues are coloured according to the ABEGO quadrants of the Ramachandran plot⁷. d, Players can drag-and-drop modular building blocks onto the blueprint panel to insert common turn conformations into their model.

  4. Extended Data Fig. 4 Improvement of backbone quality in round 3 Foldit designs.

    MolProbity²⁴ was used to calculate the proportion of residues with unfavoured or outlier backbone torsions in: high-resolution crystal structures of native proteins (n = 6,342), de novo design models from a previous study⁷ (n = 72), and top-ranking Foldit player designs from before (n = 680) and after (n = 250) improvements to Foldit backbone-modelling tools. Initial Foldit player designs contained significantly more unfavoured torsions than native proteins or other de novo designs from a previous study⁷ (P < 10⁻¹⁵, two-tailed t-test). Improvements to Foldit’s backbone-modelling tools led Foldit players to produce designs with fewer unfavoured torsions (P < 10⁻¹⁵, two-tailed t-test). Box plots show: centre line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers.

  5. Extended Data Fig. 5 Protein folds represented by successful Foldit player designs.

    Each fold has a unique arrangement and connectivity of secondary structure elements, depicted in cartoon diagrams. Diagrams are labelled with Roman numerals as in Fig. 3. Fold XX is a new fold, previously unobserved in natural proteins; TM-align²⁶ and DALI⁵⁰ alignments of design 2003594_S028 against the entire PDB found no structural homologues with this fold.

  6. Extended Data Fig. 6 Foldit player demographics.

    All players who participated in Foldit protein design puzzles and who had not opted out of Foldit-related email were solicited for survey questions. Data are shown for n = 324 responding Foldit players.

  7. Extended Data Fig. 7 Category rankings of Foldit players.

    Foldit player rankings are strongly correlated in the design and prediction categories (Spearman’s rank correlation coefficient of 0.84). This suggests that skills developed playing Foldit structure prediction puzzles carry over to design puzzles and vice versa.

  8. Extended Data Table 1 Success rates of Foldit player-designed proteins
  9. Extended Data Table 2 X-ray crystallography data and refinement statistics
  10. Extended Data Table 3 NMR and refinement statistics for protein structures
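
The legends of Extended Data Figs. 2, 4 and 7 rely on a few standard summary statistics: energy normalized by residue count, box plots drawn with 1.5× interquartile-range whiskers, and Spearman's rank correlation. The Python sketch below shows one common way to compute them using only the standard library. It is an illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the quartile interpolation method is a choice made here, and the whiskers follow the usual Tukey convention (most extreme data points within 1.5× IQR of the box), which the legends do not spell out.

```python
from statistics import quantiles


def energy_per_residue(total_energy: float, n_residues: int) -> float:
    """Normalize a total energy by residue count (hypothetical helper)."""
    return total_energy / n_residues


def box_plot_summary(values):
    """Median, quartiles, whisker ends and outliers for one box plot.

    Assumes Tukey-style whiskers and 'inclusive' quartile interpolation.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


def _average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

For example, `spearman_rho(design_ranks, prediction_ranks)` would be called with two equal-length lists of per-player rankings; both argument names are placeholders, since the underlying ranking data are not reproduced here.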

Supplementary information

  1. Supplementary Information

    This file contains Supplementary Methods, Supplementary Table 1, biophysical characterization of all 56 successful protein designs, testimonials from Foldit players describing their protein design strategies, and the names of participating Foldit players.

  2. Reporting Summary

  3. Supplementary Data

    This file contains the design models and the protein and DNA sequences for all tested protein designs. Also included are the Foldit configuration files that were used to set up design puzzles for Foldit players and the custom code used to analyse circular dichroism data.

Figures

Fig. 1: The Foldit user interface.
Fig. 2: Comparison of Foldit player and automated design-sampling strategies.
Fig. 3: Structural characterization of Foldit player-designed proteins.
Fig. 4: High-resolution structures of Foldit player-designed proteins.