Peptide materials have a wide array of functions, from tissue engineering and surface coatings to catalysis and sensing. Tuning the sequence of amino acids that make up the peptide modulates peptide functionality, but a small increase in sequence length leads to a dramatic increase in the number of peptide candidates. Traditionally, peptide design is guided by human expertise and intuition and typically yields fewer than ten peptides per study; such approaches are not easily scalable and are susceptible to human bias. Here we introduce a machine learning workflow, AI-expert, that combines Monte Carlo tree search and random forest with molecular dynamics simulations into a fully autonomous computational search engine for discovering peptide sequences with high potential for self-assembly. We demonstrate that AI-expert efficiently searches large spaces of tripeptides and pentapeptides. Its predictions are on par with or better than those of our human experts, and it suggests several non-intuitive sequences with high self-assembly propensity, outlining its potential to overcome human bias and accelerate peptide discovery.
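To make the growth of the candidate space concrete, the short Python sketch below counts the possible sequences at each length. It assumes an alphabet of the 20 canonical amino acids and unmodified linear peptides; the function name is illustrative and is not taken from the AI-expert code.

```python
# Size of the peptide sequence space as a function of length.
# Assumption: 20 canonical amino acids, linear unmodified peptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_space_size(length: int, alphabet: str = AMINO_ACIDS) -> int:
    """Return the number of distinct sequences of the given length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return len(alphabet) ** length


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"length {n}: {sequence_space_size(n):,} candidates")
```

Under this assumption the tripeptide space holds 8,000 sequences, which matches the candidate count of the brute-force tripeptide search, while the pentapeptide space holds 3.2 million.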
The data that support the findings of this study are available in the Extended Data figures (for synthesized pentapeptides), the Supplementary Information (for AI-expert-proposed pentapeptides) and the accompanying code repository at https://doi.org/10.5281/zenodo.6564202 (for tripeptides). Source data are provided with this paper.
The codes underlying the AI-expert framework are freely available for general use under a Creative Commons Attribution 4.0 International license and are deposited at https://doi.org/10.5281/zenodo.6564202.
Work performed at the Center for Nanoscale Materials, a US Department of Energy (DOE) Office of Science User Facility, was supported by the US DOE, Office of Basic Energy Sciences, under contract no. DE-AC02-06CH11357, and additionally supported by the University of Chicago and the DOE under DOE contract no. DE-AC02-06CH11357 awarded to UChicago Argonne, LLC, operator of the Argonne National Laboratory. This material is based on work supported by the DOE, Office of Science, BES Data, Artificial Intelligence and Machine Learning at DOE Scientific User Facilities programme (Digital Twins). We gratefully acknowledge the computing resources provided on Bebop, the high-performance computing clusters operated by the Laboratory Computing Resource Center (LCRC) at Argonne National Laboratory. S.K.R.S.S. acknowledges support from the UIC faculty start-up fund. We acknowledge T. Tuttle for sharing computational data on tripeptides.
The authors declare no competing interests.
Peer review information
Nature Chemistry thanks Shuguang Zhang, Jin Kim Montclare, Fabien Plisson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Top-ranked tripeptides identified using the brute-force computational search on 8000 candidates. The score is based on the reward function rtri. Abbreviations: AP, aggregation propensity; logP, hydrophobicity.
Computational (AP, logP) and experimental (LC(RT), OD800nm) measurements, along with the associated reward scores (rpenta, rtri) and the experimental score (ExpScore), are provided. The β-sheet-scale-corrected rpenta and rtri scores, titled rpentawB and rtriwB respectively, are also included. Cases where aggregation (Agg.) was observed are marked 1 in bold font.
Frequency of occurrence of amino acids in the 29 sequences proposed by human experts (left panel) and the overall charge distribution of those sequences (right panel). The human experts evidently preferred to include the amino acids V, K and F and favoured overall charge-neutral pentapeptide sequences. The complete list of the pentapeptides proposed by the human experts, and the rationale for choosing or rejecting a sequence for synthesis, is provided in Supplementary Information Table S2.
Effect of the exploration constant c in Eq. 1 on the search efficiency of AI-expert for tripeptides with (a) the MCTS scheme alone and (b) the MCTS+RF scheme. The boxplots show the number of runs needed to find the top-scoring tripeptide. The lower and upper bounds of each box represent the 25th and 75th percentiles and the middle line the median; the upper whisker extends to the last datum less than the 75th percentile + 1.5(IQR), the lower whisker extends to the first datum greater than the 25th percentile - 1.5(IQR), and data beyond the whiskers are plotted as individual points. Here, IQR denotes the interquartile range, given by the 75th minus the 25th percentile. The results are based on n=10 statistically independent runs. The number of trials needed on average by a brute-force or random search is also shown using dotted lines. The MCTS+RF scheme performs best: it is less sensitive to the choice of c and also finds the top-scoring tripeptide more efficiently. The MCTS+RF scheme with c = 10 was found to be most efficient and was therefore selected for the pentapeptide search.
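The role of the exploration constant can be illustrated with a small Python sketch. Eq. 1 is not reproduced in this section, so the sketch assumes the textbook UCB1 selection rule (mean reward plus c times an exploration bonus); the authors' actual formula may differ. The branch labels, rewards and visit counts are hypothetical.

```python
import math
from typing import Dict, Tuple


def ucb_score(mean_reward: float, parent_visits: int,
              child_visits: int, c: float) -> float:
    """Textbook UCB1 score: exploitation term plus c-weighted exploration bonus.

    Assumption: this is a stand-in for Eq. 1, which is not shown here.
    """
    if child_visits == 0:
        return math.inf  # unvisited branches are always tried first
    return mean_reward + c * math.sqrt(math.log(parent_visits) / child_visits)


def select_child(children: Dict[str, Tuple[float, int]], c: float) -> str:
    """Return the label of the child with the highest UCB1 score.

    children maps a branch label (e.g. the next amino acid) to
    (mean_reward, visit_count).
    """
    parent_visits = sum(visits for _, visits in children.values())
    return max(
        children,
        key=lambda k: ucb_score(children[k][0], parent_visits,
                                children[k][1], c),
    )


# Hypothetical statistics for three branches of a search tree.
stats = {"F": (0.80, 50), "K": (0.60, 5), "V": (0.70, 20)}

if __name__ == "__main__":
    for c in (0.0, 0.1, 10.0):
        print(f"c = {c}: select {select_child(stats, c)}")
```

With these hypothetical numbers, a small c keeps selecting the well-explored, high-reward branch F, whereas a large c shifts selection to the rarely visited branch K. This is the trade-off that the choice of c controls.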
Performance of the random forest (RF) model in predicting the computed aggregation propensity (AP) of (a) tripeptides and (b) pentapeptides. In both cases, the improvement in RF model performance with increasing training set size (left panel) is shown, along with an example parity plot of the test data when it constitutes 20% of the total dataset. In (a), n=10 statistically independent runs with a random train-test split (from 8000 total cases) were performed; data are presented as mean values ± 1.5 SD. In (b), the train-test split (from ~6600 total cases using rpenta) was performed in a time-ordered manner to capture the progressive improvement of the RF model during the MCTS run: because the training data were generated online within the MCTS+RF scheme, the RF training set consists of AP values evaluated in the early stages of the MCTS run, whereas the test set contains AP values evaluated in the later stages. Abbreviations: MAE, mean absolute error; SD, standard deviation.
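The time-ordered evaluation described for the pentapeptide case can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names, the peptide sequences and the AP values are hypothetical, and a mean-of-training-set baseline stands in for the RF model so that the example needs only the standard library.

```python
from typing import List, Sequence, Tuple

Record = Tuple[str, float]  # (peptide sequence, computed AP), in evaluation order


def chronological_split(records: Sequence[Record],
                        test_fraction: float = 0.2
                        ) -> Tuple[List[Record], List[Record]]:
    """Split so that early evaluations train and the latest ones test."""
    if len(records) < 2:
        raise ValueError("need at least two records to split")
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    # Keep at least one record on each side of the cut.
    n_test = min(len(records) - 1, max(1, round(len(records) * test_fraction)))
    cut = len(records) - n_test
    return list(records[:cut]), list(records[cut:])


def mean_absolute_error(y_true: Sequence[float],
                        y_pred: Sequence[float]) -> float:
    """MAE = average of |true - predicted| over all test points."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical AP values, listed in the order a search evaluated them.
history = [("AAAAA", 1.0), ("KFVFK", 1.8), ("FFKVD", 2.1),
           ("VKVFF", 2.0), ("DFFKK", 1.6)]

train, test = chronological_split(history, test_fraction=0.2)

# Placeholder model: predict the training-set mean for every test peptide.
baseline = sum(ap for _, ap in train) / len(train)
mae = mean_absolute_error([ap for _, ap in test], [baseline] * len(test))
```

Replacing the baseline with a trained regressor leaves the split and the MAE calculation unchanged, so the same scaffold could be used to trace how test error falls as the training set grows.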
Source data for all AI-proposed pentapeptides, top AI, top human and synthesized pentapeptides.
Source data for pentapeptide characterization.
Source data for pentapeptide characterization with beta-sheet factor.
Source data for top-scoring tripeptides.
Source data for overall results for the synthesized pentapeptides.
Source data for diversity analysis of human expert proposed candidates.
Source data for RF surrogate models of aggregation propensity.
About this article
Cite this article
Batra, R., Loeffler, T.D., Chan, H. et al. Machine learning overcomes human bias in the discovery of self-assembling peptides. Nat. Chem. 14, 1427–1435 (2022). https://doi.org/10.1038/s41557-022-01055-3