Introduction

Hydrogel, an important class of soft materials, is formed from a self-assembled matrix that immobilizes water. Hydrogels have attracted increasing attention in various research fields because they mimic properties in natural systems such as the bodies of jellyfish, the cornea in the eye, and even the condensed chromatins in the cell nucleus1,2. Inspired by natural self-assembled functional materials (high-order assemblies of proteins), considerable attention has been focused on hydrogels formed by peptides because of their high biocompatibility3,4,5,6,7, low immunogenicity8,9,10, and similarity to the extracellular matrix11,12,13,14. To date, peptidic hydrogels have been widely used in materials science15,16,17,18, biomedicine19,20,21,22, and semiconductors23,24,25. However, the current design capability fails to meet the growing demand for neoteric peptidic hydrogels since the existing inefficient methods still rely on amino acid sequences that derive from natural proteins, professional experience in the peptide field, or laboratory discoveries by serendipity26,27,28. Therefore, accurate prediction of hydrogel formation and de novo design of peptidic hydrogels emerge as of great significance to broaden the available hydrogel-forming peptide library.

To better understand the self-assembly behaviors of peptides in forming hydrogels and the resulting morphologies, coarse-grained molecular dynamics (CGMD) has been employed to model peptide self-assembly29,30,31,32. Ulijn and Tuttle’s groups recently developed a useful approach to provide valuable design rules for overcoming the limitation of serendipity in discovering aggregation or self-assembly in dipeptide and tripeptide systems33,34. However, molecular dynamics (MD) simulations of selected peptides could only give the information (e.g., aggregation propensity, acronymized as AP) for predicting new peptides that derive from the original ones. Importantly, due to the enormous sequence quantities of peptides, brute-force MD is becoming increasingly intractable for investigating the hydrogel formation ability of longer-chain peptides33,35,36. To the best of our knowledge, systematic studies on peptidic hydrogel prediction and de novo design are less explored and remain challenging26,37.

This work provides an integrated computational, experimental, and machine learning (ML) approach to build a score function for discovering tetrapeptides for hydrogelation with an improved hit rate. Tetrapeptides have sufficient structural and sequence diversity for developing a peptide hydrogel library with ample candidates, while requiring a moderate workload of simulation for generating training data. This approach proceeds as follows, firstly, the computation adopts CGMD and ML-trained regression model to provide an estimation of AP (Fig. 1a). Based on the original score function APH33, 55 peptides are selected and chemically synthesized (Fig. 1b) for verification of gelation. With the resulting gelation feasibility (i.e., yes or no), a classification model is trained to produce the gelation corrector Cg fed to the original score function. An updated score function is then devised as APHC (Fig. 1b). The process above is looped three times with mutual information exchange between ML and experimental results (Fig. 1b) to enhance the performance of Cg from experimental results of 165 peptides, 100 of which could form hydrogels after gelation tests. Finally, tetrapeptide hydrogels obtained by de novo design from our computational model are selected as immune adjuvants to boost humoral immune recognition towards the receptor binding domain (RBD) of SARS-CoV-2 virus (Fig. 1c). The results show that the selected tetrapeptide hydrogel boosts the immune response of a model protein RBD from the spike protein of coronavirus. Overall, an 8,000-peptide library for gelation is built based on APHC with a gelation rate reaching 87.1% (Supplementary Data 2), providing great potential for further innovations in peptide-based soft materials.

Fig. 1: Workflow of coupled experimental and machine learning approach for discovering tetrapeptide hydrogels and their potential biological applications.
figure 1

a 104 uniformly distributed tetrapeptide sequences are obtained by hypercubic sampling first. CGMD simulations are then performed to generate the training data of aggregation propensity (AP), and regression models are trained to predict the AP of the entire sequence space of tetrapeptides (204). b Based on the available score function APH = AP2 × logP0.5, 55 peptides are selected and chemically synthesized to verify the gelation ability (gel marked with 1 and non-gel marked with 0). The sequence (feature) and the 1/0 (label) data are then passed to the ML algorithm to train a classification model, producing a gelation corrector Cg for each tetrapeptide. The APH score is then updated to APHC,1 = AP2 × logP0.5 × Cg,1, and another batch containing 55 peptides is selected based on APHC and is synthesized and verified. Then, the whole 110 sequences (feature) and 1/0 (label) data are employed to update the classification model to generate Cg,2, and the APHC,1 score are updated to APHC,2 = AP2× logP0.5 × Cg,2. Based on APHC,2, the third batch of 55 peptides are selected and chemically synthesized, and Cg,2 and APHC,2 are updated to Cg,3 and APHC,3. c The de novo designed peptide hydrogel is applied to serve as an efficient adjuvant for enhancing antibody production of RBD protein.

Results

Performance of corrected score function APHC

We employed cost-effective ML prediction instead of performing brute-force CGMD for generating the AP values of the entire space of tetrapeptides containing 160,000 sequences. Therefore, accurate prediction of AP values relying on ML regression models should be a prerequisite for obtaining potential hydrogels. We tested various training conditions, including training algorithms, feature representation approaches, and the size of training datasets to obtain an optimal AI model (Supplementary Figs. 24, Supplementary Tables 24, and Supplementary Data 1). Using the algorithm of support vector machine (SVM)38 with 10,000 training data represented by 80-bit one-hot approach with amino acid sequence (Supplementary Table 3), we obtained a reliable SVM model with training/testing performance of 0.095/0.092 in mean absolute difference (MAEtr/MAEte) and 0.928/0.933 in coefficient of determination (R2tr/R2te)39 (Fig. 2a, b). Further analysis of the prediction performance of SVM model revealed that the error between the predicted AP (APprd) and simulated AP (APsim) was less than 2.5% as APsim was greater than 1.5 (Fig. 2c), proving the reliability and capability of the selected model on predicting peptide aggregates and further formation of hydrogels.

Fig. 2: “Human-in-the-loop” for obtaining corrected score APHC.
figure 2

a Performance of different algorithms (i.e., LR: linear regression; NN: nearest neighbor; RF: random forest; SVM: support vector machine) with different numbers of training datasets (i.e., 1,000, 5,000, and 10,000). b Training and testing performance of ML model trained with SVM and 10,000 data using one-hot representation. The color scale indicates the density of the data points. c Error distribution with respect to simulated AP value (APsim). d First batch: ranking of experimentally selected peptides with respect to APprd and APH, and accuracy of resulted Cg,1. The Chi-square statistic test (single-sided test) has been performed, with the null hypothesis that the proportion of hydrogel-forming peptides in the population of top 8000 APH score is larger or equal to 61.5%, with a degree of freedom of 1 and significance level of 0.05. e Second batch: ranking of experimentally selected peptides with respect to APH and APHC,1, and accuracy of results in Cg,2. The Chi-square statistic test (single-sided test) has been performed with a degree of freedom of 1 and a significance level of 0.05. f Third batch: ranking of experimentally selected peptides with respect to APH and APHC,2, and accuracy of resulted Cg,3. The Chi-square statistic test (single-sided test) has been performed with a degree of freedom of 1 and a significance level of 0.05. g Gelation hit rate of experimentally synthesized tetrapeptides within the top 8000 ranking with respect to APH (first batch), APHC,1 (second batch), APHC,2 (third batch), and APHC,3 (final). h The comparison between final gelation hit rates evaluated by APprd, APH, and APHC. i (The Chi-square statistic test (single-sided test) has been performed with a degree of freedom of 1 and significance level of 0.05.) and j Distribution of APHC and APH with respect to logP’ of experimentally synthesized 165 tetrapeptides (blue indicates gelation while red indicates solution) and the complete sequence space of tetrapeptides (gray). k Comparison between the ranking of APHC (r-APHC) and APH (r-APH), where color indicates the absolute difference between r-APHC and r-APH of a single tetrapeptide. l TEM images of WPYY, WWCP, WVII, and IMVV (Inserts: optical images of the corresponding peptide). Source data are provided as a Source Data file.

Distinctive from all available score functions focusing on the prediction of peptide self-assembly33, we constructed a corrected score function APHC within three loops (Fig. 2d–f) for improving the gelation hit rate. Since the final goal was to develop a hydrogel-forming peptide library with the minimum candidate numbers and the highest gelation possibility, we constrained our gelation hit rate assessment within the top 8000 assessing scores (APH and APHC). We calculated APH (Fig. 1b) in the first loop and randomly selected 55 peptides (26 peptides that were among the top 8000 in the APH ranking), which were possibly to form hydrogel according to human expertise. It was found that 16 among the 26 peptides (within the top 8000) could form a hydrogel, and a corresponding gelation hit rate of 61.5% could be achieved with the APH score, while a similar hit rate of 63% can be achieved with the APprd score alone (Fig. 2d, left panel). With the total 55 gelation results, we trained a classification model to generate the gelation corrector Cg,1 with an averaged accuracy of 0.735 (averaged over ten parallel ML experiments, Fig. 2d, right panel). During the second loop, we calculated APHC,1 (APHC,1 = APprd2 × logP0.5 × Cg,1) and selected another 55 peptides, 30 of which were in the top 8000 APHC,1, and 23 peptides (of the 30 peptides) formed hydrogels, resulting in a hit rate of 76.7% within the top 8000 APHC,1 pool, while the APH score yielded only a gelation hit rate of 64.7% (Fig. 2e, left panel). Augmenting the gelation results from the second batch to the first batch (total 110 data), we retrained a classification model to update the gelation corrector from Cg,1 to Cg,2 with an average accuracy of 0.746 (Fig. 2e, right panel). Proceeding to the third loop, we updated the APHC,1 to APHC,2 (APHC2 = APprd2 × logP0.5 × Cg,2). Similar to the previous two loops, we selected 55 peptides, and a gelation hit rate of 81.6% (31 out of 38) was generated within the top 8000 APHC,2 and a rate of 75.0% was achieved with APHC,1 alone (Fig. 2f, left panel). With a total of 165 experimental gelation results, a final classification model was trained and produced gelation corrector Cg,3 with an averaged accuracy of 0.767 (Fig. 2f, right panel) and score APHC,3 (APHC,3 = APHC = APprd2 × logP0.5 × Cg,3) for each peptide, and a gelation hit rate of 87.1% was finally achieved with the top 8,000 APHC,3 (Fig. 2g), while the APprd and APH could only produce a gelation hit rate around 66% (Fig. 2h) based on the 165 gelation results. We listed the top 8,000 Cg and APHC (Supplementary Data 2 and 3) peptides for the convenience of selection and comparison.

To further differentiate between APHC and APH in predicting peptide hydrogels, we next compared the relationship between APHC and logP’ (Fig. 2i) as well as APH and logP’ (Fig. 2j) of experimentally synthesized 165 peptides that were marked with blue (gelation: yes) or red (gelation: no) dots, and those of total tetrapeptides (gray dots). Here, logP’ indicated normalized hydrophilicity between 0 and 1. In addition to the relationship between APHC and logP’, the relationship between APHC-AP and APHC-Cg was also investigated (Supplementary Fig. 6a, b). No linear correlation for APHC and logP’ (also APHC and AP) can be observed, demonstrating that hydrophobicity and aggregation propensity were not the only two contributors to gelation, for instance, lower isoelectric points (i.e., 4.5 ~ 6 on pH scale) could improve the gelation performance (Supplementary Fig. 6c) due to the Columbic interaction and hydrogen bonds, inducing the formation of water-containing networks between deprotonated peptides and water solvent. These results indicated the significance of cooperating experimental input (i.e., Cg) into a prediction of hydrogel-forming sequences. Furthermore, it was conducive for hydrogelation when logP’ was in the range of 0.05 to 0.4, as evidenced that the logP’ of all gelating peptides were in this range (Fig. 2i). Peptides with too weak hydrophilicity (<0.05) possibly form precipitates while ones with too strong hydrophilicity (>0.4) maintain in solution. The APH also assigned high scores to peptides with logP’ in the range of 0.05 to 0.4. However, APH cannot efficiently pinpoint peptides with high gelation potential and low APprd. As a result, more gelation peptides fall out of top 8000 compared to APHC (Fig. 2j). We have also compared the ranks of APHC and APH of the complete sequence space of tetrapeptides (Fig. 2k). APHC can significantly increase the rank of peptides which could potentially form hydrogels (maximum absolute difference in rank between APHC and APH is 8.5 × 104), such as WVII (by 14311) and IMVV (by 57146), while decreasing the rank of peptides that hardly form the hydrogel, such as WPYY (by 33033) and WWCP (by 50677). These four peptides were synthesized, validating that WVII and IMVV can form hydrogel while WPYY and WWCP cannot (Fig. 2l, Supplementary Data 6 and 7).

Discovery and characterization of peptide hydrogels

After validating the efficiency of APHC in predicting tetrapeptide hydrogels, we detailed the phase state of 165 synthesized tetrapeptides and the observed assembly behavior in an aqueous solution. Having demonstrated the identity of each synthetic tetrapeptides by mass spectrometry (MS, Supplementary Data 4) and nuclear magnetic resonance spectroscopy (NMR, Supplementary Data 5), we defined the hydrogel as the formation of a self-supporting, non-flowing mixture of water and hydrogelator through the vial-inverting method. Figure 3a (Insert optical images) showed the 6 representative tetrapeptides (FVIY, WEFF, WKFF, WTIF, WVFY, and IFYT) hydrogels in the glass vial, probably due to the π-π interaction of more than two aromatic amino acids in the tetrapeptide. Transmission electron microscope (TEM) studies (Fig. 3a and Supplementary Data 6) showed that the hydrogel formed by FVIY, WEFF, WKFF, or WTIF contained entangled nanofibers, while the hydrogel formed by WVFY or IFYT contained interlaced nanosheets. MD simulations (1250 ns) confirmed the observation of TEM results, and the front ranking of APHC demonstrated the formation of these hydrogels. Mechanical properties of tetrapeptide hydrogels (Fig. 3b and Supplementary Fig. 7) indicated that both the elasticity (G’) and the viscosity (G”) exhibited weak frequency dependence between 0.01 and 100 Hz. The G’ values were higher than G” values, suggesting the formation of a hydrogel. Fourier transforms infrared (FTIR) spectroscopy (Fig. 3c) in the amide I region (1620–1648 cm−1, C = O stretching vibration) revealed the presence of β-sheet conformation in all these six hydrogels, further indicating the presence of highly ordered peptide nanostructures.

Fig. 3: Experimental investigations on the self-assembly behavior of 165 synthetic tetrapeptides.
figure 3

a TEM images of 6 representative hydrogels of synthetic tetrapeptides, respectively. Inserts: optical images of the corresponding hydrogel (pH between 7.0 to 7.5). MD simulation results (1250 ns) and APHC ranking were shown in the right column. b Dynamic frequency sweep of tetrapeptide hydrogels at the strain value of 0.5%. c FTIR spectra in the amide I region of tetrapeptide hydrogels. d TEM images of 6 representative non-hydrogels of tetrapeptide Insets: optical images of corresponding solution/suspension (pH = 7.5). MD simulation results (1250 ns) and APHC ranking were shown in the right column. e Statistics and classification of morphologies obtained by TEM for hydrogel-forming tetrapeptides (100 peptides). f Statistics and classification of morphologies obtained by TEM for non-hydrogel-forming tetrapeptides (65 peptides). Source data are provided as a Source Data file.

We also paid attention to those non-hydrogel-forming tetrapeptides to obtain rules of sequences of non-gelating peptides. Figure 3d (Insert optical images) showed six representative tetrapeptides (TRFS, PGWW, FGWW, RRRR, LRFH, and RWVF) with low APHC ranking, some of which were highly soluble while others were barely soluble in water. TEM images (Fig. 3d and Supplementary Data 6) showed that these six peptides formed aggregates with different sizes in an aqueous solution, which were qualitatively consistent with the morphologies obtained in MD simulations except for RRRR (Fig. 3d, right column), showing different levels of aggregation. Taking the TEM result of RRRR together, we attributed this to the thermodynamic factor of concentration. Finally, we presented a summary of the assembled morphologies of all synthesized tetrapeptides (Fig. 3e, f and Supplementary Data 6), indicating that hydrogel-forming tetrapeptides tended to form fibers, sheets, or hybrid morphology (70%) in an aqueous solution. Non-hydrogel-forming tetrapeptides self-assembled into aggregates, spheres, or particulate supramolecular structures (86%). The results above confirmed the self-assembled nanostructures and hydrogelation results of these synthetic tetrapeptides, as predicted by the corrected score function APHC.

Hydrogelation laws from experiment and simulation results

One hundred and sixty-five synthesized peptides were presented with different colors indicating the capabilities of hydrogel formation (Fig. 4a). The average rank of APHC (Fig. 4b) for peptides gelation at four certain concentrations was 2664, 2801, 3646, and 4899, respectively, which was consistent with the experimental results (Supplementary Data 7) of the gelation capability, demonstrating the reliability of APHC in screening tetrapeptides for the hydrogel formation.

Fig. 4: Hydrogelation laws from experiments and simulations.
figure 4

a Sequences of 165 synthesized tetrapeptides. The number in each fill represents the rank of APHC. Different color represents the hydrogelation capability at different concentrations. b Averaged rank of APHC of hydrogel-forming peptides at 30, 60, 90, and 120 mM, as well as non-gelating peptides at 120 mM (after the dashed line), n = 46 for 30 mM, n = 30 for 60 mM, n = 24 for 90 and 120 mM (with gelation), and n = 65 for 120 mM (without gelation, n represent the number of synthetic tetrapeptides in each category. c Contribution of each amino acid at different positions to hydrogel formation, compared between 100 experimental data of hydrogel-forming peptides and the top 8,000 APHC simulation data. d Distribution of 8,000 APHC with amino acid F fixed at the C terminus (P4). The x-axis is P1 (N-terminus), the y-axis is P2, and the third position is illustrated in the rectangular box. Source data are provided as a Source Data file.

Hydrogelation laws (i.e., the effect of position and type of amino acids on gelation) deduced from the experimentally synthesized peptides gelators (100 data) and computationally selected candidates (top 8000 data based on APHC) exhibited reasonable consistency (Fig. 4c). Aromatic amino acids (F and Y) had the largest contribution to gelation, especially when located at positions 3 and 4 near the C-terminus. The W had a much lower contribution due to the strong hydrophobicity, which may lead to suspension with precipitation instead of forming a hydrogel. The H amino acid with a five-membered ring structure was favored in position 3 in gelation peptides. Second to F, W, and Y, the amino acids I, L, V, and M can also contribute to gelation due to hydrophobicity carried by side chains. The simulation results only slightly increased the percentage of I, L, V, and M at positions 2 and 3, while the experimental results raised the percentage of I, L at positions 1 and 3 and V at position 2. The contribution of the polar amino acids N, Q, S, T, and C to gelation was identical in both the experiment and simulation. N and Q with strong polarity were rarely found in gelation peptides with scarce occupancy at position 1 or 2. S and T with moderate polarity were beneficial for gelation when S was located at positions 1, 2, and 4 and T at 1, 2. Apolar amino acid C contains the -SH group, which may induce the formation of disulfide bonds and stable nanostructures, especially when located at position 1. Amino acid P contributed to the hydrogel formation when located at position 1 because of the potential formation of the “kink” structure33,40, promoting self-assembly. Meanwhile, G without functional side chains cannot significantly contribute to gelation. Charged amino acids D, E, R, and K had a minimal contribution to gelation. However, peptides with K near the N-terminus were found to form hydrogel due to the attraction of opposite charges driving self-assembly.

To gain an overview of the effects of position-type on gelation, we analyzed the APHC scores of the complete space sequence of tetrapeptides with fixed position 4 (fixed F, Fig. 4d; fixed remaining 19 amino acids, Supplementary Figs. 826). Different from Fig. 4c, we can discern the effect of doublets and triplets of amino acids on gelation, other than a single amino acid. It can be confirmed that aromatic-aromatic (F, W, Y – F, W, Y) and aromatic-hydrophobic (F, W, Y – I, L, M, V) doublets had positive effects on gelation synergistically. In addition, aromatic amino acids bonded with P and K exhibited similar positive performance. These rules can also be applied to the triplets. In addition, we analyzed the position-type percentage with respect to adjacent amino acids, based on the 100 hydrogel-forming peptides in the experiment and 8000 peptides with the highest APHC score in the simulation (Supplementary Fig. 27). It can also be deduced that aromatic-aromatic and aromatic-hydrophobic doublets have the most significant contribution to hydrogelation, and position-specific rules regarding other amino acids are also congruent with those deduced from Fig. 4c, d. For example, amino acid A is barely found in the fourth position, except when F or Y is located in the third position. In summary, we have presented a complete picture of the relationship between the gelation ability and position & type of 20 natural amino acids, providing schematic guidance for experimentalists to design tetrapeptide hydrogels and possibly functional applications associated.

Boosting antibody production of RBD vaccine

The advantage of self-assembling peptide materials is their remarkable multivalency, which contributes to improved immunogenicity. It is generally known that multivalency can repeatedly display ligands or epitopes to increase affinity for specific receptors while enhancing antibody responses8,9,41,42,43. The RBD of the spike protein covering the surface of SARS-CoV-2 attracted our interest as a promising target antigen for COVID-19 vaccines44,45,46,47. We hypothesized that tetrapeptide hydrogel could provide a biodegradable platform to encapsulate RBD protein and enhance humoral immune responses against RBD protein. Since YAWF has a APHC rank within the top 8000 (1661, Fig. 4a) and can form hydrogels containing nanofibrous network (Supplementary Fig. 28), we selected this tetrapeptide as a vaccine adjuvant candidate. We quantified the production of antigen-specific antibodies in C57BL/6 mice, which was a crucial indicator for evaluating the performance of the SARS-CoV-2 vaccine (Fig. 5a). Compared with the RBD group, the results (Fig. 5b) showed that the FDA (U.S. Food and Drug Administration) approved adjuvant aluminum could enhance the generation of IgG by 20.7-fold. The hydrogel formed by YAWF remarkably increased the generation of IgG by 41.6-fold (the endpoint titres of RBD, aluminum, and YAWF were shown in Supplementary Fig. 29), suggesting that the tetrapeptide hydrogel could boost the immune response in vivo. The results also indicated that the hydrogel group significantly enhanced the production of IgG1, which was similar to the aluminum group. The RBD-specific IgG2b response in the hydrogel group increased around 9.7-fold, compared with the commercial aluminum adjuvant group. As for IgG2c, the hydrogel formed by YAWF maintained high IgG2c titers, surpassing the ones in the aluminum group or control group (Fig. 5b and Supplementary Fig. 29).

Fig. 5: Response to tetrapeptide-based hydrogel nano vaccination.
figure 5

a 6–8 weeks C57BL/6 mice were immunized thrice at day 0, 7, and 14 with 15 μg RBD (RBD group), 12.5 μL aluminum adjuvant, and 15 μg RBD (Alum + RBD group), 60 mM tetrapeptide hydrogel and 15 μg RBD (YAWF + RBD group). Serum and splenocytes were collected on day 21. b Enzyme-linked immunosorbent assay (ELISA) responses to serum samples (RBD-specific) at different dilutions. SARS-CoV-2 RBD-specific IgG antibodies (IgG, IgG1, IgG2b, and IgG2c) were analyzed by endpoint dilution ELISA and measured as absorbance at 450 nm. The data were shown as the mean ± SEM (n = 6 biologically independent mice), and the p values were calculated by comparing RBD with YAWF + RBD using a one-way ANOVA test. c 7 days after the last immunization, splenocytes were collected and re-stimulated with RBD protein. The bars shown were mean ± SD (n = 6 biologically independent samples), and differences between RBD and other treatments were determined using a one-way ANOVA test. The secretion of IL-5 and IFN-γ in the splenocytes supernatants was detected using ELISA. d Optical images of bone marrow-derived dendritic cells (BMDCs) treated with RBD-loaded tetrapeptide hydrogel (scale bar = 100 μm). e Flow cytometry analysis of BMDCs expressing CD83, CD80, and CD86. f The level of IL-6 and TNF-α in BMDCs culture supernatants were analyzed using ELISA. The data were shown as the mean ± SD (n = 3 biologically independent samples), and differences between RBD and other treatments were determined using one-way ANOVA test. Source data are provided as a Source Data file.

During the infection, the regular pathway to produce IgG antibody is highly related to the proliferation of SARS-CoV-2-specific CD8+ or CD4+ T cells, which is reflected by the elevated secretion of several typical cytokines, including interleukin-5 (IL-5) and interferon-γ (IFN-γ). Compared with the aluminum adjuvant group, the mice that received YAWF based vaccine showed a higher IL-5 level in their splenocytes culture, and IFN-γ secretion was also obviously evoked (Fig. 5c). Thus, the YAWF stimulated an obvious cell-dependent adaptive immune response. To further confirm the capability of the tetrapeptide vaccine to regulate related cell immunity, the upstream dendritic cells (DCs) activation enhanced by tetrapeptide hydrogel was evaluated. The DCs treated with YAWF vaccine showed promising activation as the percentage of CD83, CD80, and CD86 expressing cells augmented to 72.0%, 71.1%, and 50.5% (Fig. 5e and Supplementary Fig. 30). Such intense activation could also be proved by the clustering of DCs (Fig. 5d) producing raised levels of Th-1 cytokines (Fig. 5f). To sum up, de novo designed tetrapeptide hydrogels as immune adjuvant successfully enhanced the immune response to RBD protein in vivo, providing great inspiration for us to explore natural tetrapeptide hydrogel library for biomedical applications.

Discussion

This work demonstrated an efficient “human-in-the-loop” framework that integrated coarse-grained molecular dynamics, machine learning, and experiments for the prediction and discovery of peptide hydrogels. The framework evolved into an updated score function APHC to evaluate the hydrogelation feasibility of 160,000 natural tetrapeptides, and a gelation hit rate of 87.1% with the top 8,000 APHC rank was achieved. The simulation and experiment revealed similar hydrogelation laws for short peptide design. Subsequently, a de novo-designed tetrapeptide hydrogel based on our hydrogelation laws was successfully applied in SARS-CoV-2 vaccine adjuvant, proving the potentials of the peptide libraries within the top 8,000 APHC rank for developing versatile biological and medical applications. Moving forward, the “human-in-the-loop” framework can be further automated by employing a robotic platform for synthesizing new peptides and performing machine learning for training classification models. The framework described here can also be extended to the efficient design of other functional materials/devices, including the terminal-covered peptide hydrogels, peptide batteries, peptide fluorescence probes, and peptide semiconductors, contributing to modern organic nanotechnology employing short peptide building blocks as key structural and functional elements.

Methods

Ethical approval

All mice were handled in accordance with institutional guidelines, and all animal experiments were approved by the Institutional Animal Care and Use Committee (IACUC) of Westlake University (IACUC Protocol #21-046-WHM).

Material sources

Fmoc-amino acids were obtained from GL Biochem (Shanghai, China). 2-Cl-trityl chloride resin was obtained from Nankai Resin Co. Ltd. (Tianjin, China). Commercially available reagents were used without further purification unless noted otherwise. Deionized water was used for all experiments. All other chemicals were reagent grade or better. Horseradish peroxidase-conjugated goat anti-mouse IgG, IgG1, IgG2b, and IgG2c (1030-05, 1071-05, 1091-05, and 1078-05, 1:5000 dilution) were obtained from Southern Biotech (USA). Mouse IFN-γ, IL-5, TNF-α and IL-6 ELISA kits (430807, 431204, 430907, and 431307) were obtained from Biolegend (USA). Recombinant murine GM-CSF and IL-4 (315-03 and 214-14) were purchased from Peprotech (USA). FITC anti-mouse CD83 Antibody, FITC anti-mouse CD80 Antibody, and PE anti-mouse CD86 Antibody (121505, 104705, and 159204, 0.5 µg per million cells in 100 µL volume for usage) were purchased from Biolegend (USA). Imject Alum Adjuvant was obtained from Thermofisher (USA). RBD-Fc (Z03513, SARS-CoV-2 Spike protein, RBD, mFc Tag, CHO-expressed) and RBD-H (Z03479, SARS-CoV-2 Spike protein, RBD, His Tag,) were purchased from Genscript Biotech (Nanjing, China). This research followed institutional guidelines, and all animal experiments were approved by the Institutional Animal Care and Use Committee of Westlake University.

Coarse-grained molecular dynamics (CGMD)

To speed up the simulation/screening, coarse-grained molecular dynamics (CGMD) simulations were adopted to generate machine learning (ML) training data, which were performed with the open-source GROMACS package48,49 and Martini2 force field50,51,52. The all-atom tetrapeptide structures (prepared based on CHARMM3653) were coarse-grained using the python script martinize.py51. In simulations for screening purposes, total of 300 coarse-grained tetrapeptides (as zwitterions) were solvated randomly in a 13 nm × 13 nm × 13 nm box with water whose density was set as approximately 1 g/cm3 (~18700 water beads). The charge of the tetrapeptide/water system was maintained neutral by adding the proper amount of Na+ or Cl-, and the system was also maintained at neutral pH. The whole system was then energy-minimized using the steepest descent algorithm54, until the maximum force on each atom was less than 20 kJ mol-1 nm-1. Subsequently, the system was passed to an equilibration run for 5 × 106 steps with a time step of 25 fs, resulting in a total simulation time of 125 ns. The temperature and pressure during the equilibration were controlled through Berendsen algorithm at 300 K and 1 bar, respectively. A total of 15,000 such simulations were performed, and the selection of the initial 15,000 tetrapeptides was based on Latin hypercubic sampling55. To obtain more accurate and stable morphology of self-assembled structure, a 1,250 ns duration was employed, and the final morphology results were averaged over 8 identical simulations.

To quantitatively characterize the degree of self-assembly, we adopted the aggregation propensity (AP) value33, which was calculated by:

$${{{{{\rm{AP}}}}}}=\frac{{{{{{{\rm{SASA}}}}}}}_{{{{{{\rm{initial}}}}}}}}{{{{{{{\rm{SASA}}}}}}}_{{{{{{\rm{final}}}}}}}}$$
(1)

Where the SASAinitial and SASAfinal are the solvent (i.e., tetrapeptides) accessible surface area at the beginning and end of a CGMD equilibration run.

Self-assembled peptides cannot guarantee the formation of hydrogels, which was also affected by the hydrophobicity of peptides. Therefore, a hydrogel formation score function APHC considering hydrophobicity was utilized to screen out the peptides with the highest possibility of hydrogel formation under current experimental/computational conditions, as shown below:

$${{{{{{\rm{A}}}}}}{{{{{\rm{P}}}}}}}_{{{{{{\rm{HC}}}}}}}={{{{{\rm{AP}}}}}}{{\hbox{'}}}{{\times }}{{\log }}{{{{{\rm{P}}}}}}{{\hbox{'}}}{{\times }}{C}_{g}$$
(2)
$${{\log }}{{{{{\rm{P}}}}}}=\mathop{\sum }\limits_{i=1}^{4}\triangle {G}_{{{{{{\rm{wat}}}}}}-{{{{{\rm{oct}}}}}},i}$$
(3)

Where the AP’ and the logP’ are normalized AP and logP value (normalized to 1), and α (=2) and β (=0.5) are two coefficients determining the significance of AP’ and the logP’. ∆Gwat-oct, i (kcal mol-1) is the Wimley–White whole-residue hydrophobicities for each amino acid56. Cg is the gelation corrector output by the ML classification model trained with experimental gelation results.

Machine learning

Four different ML algorithms were deployed: Random Forest (RF)57, Linear Regression (LR)58, Nearest Neighbor (NN)59, and Support Vector Machine (SVM)60. Mean absolute error (MAE) and coefficient of determination (R2)39 were calculated to assess the performance of each ML model. Different numbers of training data sets prepared by Excel 2019 (i.e., 1000, 5000, and 10,000) were used to train the ML models. In each training, 80% of the training data was used for training, while the remaining 20% was used for validation (Fig. 1b). After obtaining each model, another 5000 data were employed for independent testing.

Before training the ML model, we converted the amino acid sequence into numerical data with 4-integer and 80 bit one-hot representation approaches (shown in Supplementary Table 3, taking Glu-His-Asn-Thr, i.e., EHNT, as an example), aiming for enhanced model performance with optimal data presentation approach. Moreover, a tetrapeptide can be considered as a “tripeptide” with each “position” represented by one of the 400 possible dipeptide sequences, namely, a tetrapeptide can have 1200 possible bits with 3 of them to be 1. Therefore, we also trained models with a 1200-bit one-hot representation converted from the dipeptide sequence composition.

All the model training and prediction were conducted via ASCENDS code61, which employs a Python-based open-source data analytic toolkit, scikit-learn62. The training was initially performed based on the default hyperparameter settings in ASCENDS (Supplementary Table 5). To investigate the effect of hyperparameters on training performance, we selected three kernels and different parameters and ranges for tuning (Supplementary Table 6)63. The performance of the SVM model with varying hyperparameters were illustrated in Supplementary Fig. 5. The highest training performance of MAEtr was 0.090 and R2tr was 0.934, as kernel = rbf, C = 100, and gamma = 0.001. However, it was only slightly increased compared to the generated training performance (MAEtr = 0.095, R2tr = 0.928, as shown in Fig. 2b) with the default hyperparameters (kernel = rbf, C = 1, and gamma = auto, equaling to 1/n_features = 1/80 = 0.0125). The slightly increased training performance would have minimal impact on the prediction, and we thus concluded that the default hyperparameters were good enough for achieving reliable prediction results.

Synthesis, purification, and characterization of tetrapeptides

The selected tetrapeptides were synthesized by solid-phase peptide synthesis (SPPS) using 2-chlorotrityl chloride resin, and the side chains of the corresponding N-Fmoc protected amino acids were properly protected by different chemical groups (Supplementary Fig. 1). First, the C-terminal of the first amino acid was conjugated to the resin. Anhydrous N, N’-dimethyl formamide (DMF) containing 20% piperidine was used to remove Fmoc group. To couple the next amino acid to the free amino group, HBTU (O-(Benzotriazol-1-yl)-N, N, N’, N’-tetramethyluronium hexafluorophosphate) was used as coupling reagent and the organic base N, N-diisopropylethylamine (DIPEA) was added. The growth of the peptide chain was performed according to the established Fmoc SPPS protocol. After the final coupling step, the excess reagent was rinsed with DMF, followed by five washing steps using dichloromethane (DCM) for 1 min (5 mL per gram of resin). The peptide was cleaved using cleavage reagent (trifluoroacetic acid (TFA): triisopropylsilane (TIS): H2O = 95%: 2.5%: 2.5) for 45 minutes. 20 mL per gram of resin of ice-cold diethyl ether was then added to the concentrated cleavage reagent. The resulting precipitate was centrifuged at 1500 g for 10 minutes at 4 °C. The supernatant was then decanted, and the resulting solid was dissolved in H2O/CH3CN (1:1) for HPLC separation. HPLC was conducted at Agilent 1260 Infinity II Manual Preparative Liquid Chromatography system using a C18 RP column with CH3CN (0.1% of trifluoroacetic acid) and water (0.1% of trifluoroacetic acid) as the eluents (Supplementary Table 1). The purity of each tetrapeptide was verified by HPLC, and the purified tetrapeptide was dissolved in 200 μL of 0.1 mg/mL methanol to prepare Mass spectrometry (MS) samples. MS was conducted at the Agilent InfinityLab LC/MSD system with MSD signal set as positive ion mode. NMR samples were prepared by dissolving purified tetrapeptides in 600 μL of 8 mg/mL DMSO-d6. 1H NMR spectra were obtained on a Bruker BioSpin AVANCE NEO spectrometer (500 MHz, Switzerland), using tetramethyl silane as an internal standard. The structure of tetrapeptide was verified by MS and 1H NMR.

Transmission electron microscope

We used the negative staining technique to observe the morphologies formed by peptides. A micropipette was used to load 10 μL of sample solution to a carbon-coated copper grid, and we used a piece of filter paper to remove the excess solution. After rinsing the grid with the deionized water, we used uranyl acetate to stain the sample for 1 minute and then rinsed the grid with deionized water again. The excess liquid was drained with filter paper and conducted on a Talos L120C system, operating at 120 kV.

Peptide hydrogel formation

This work defines hydrogel formation as a self-supporting, non-flowing mixture of water and hydrogelator by the vial-inverting method. All purified tetrapeptides were dissolved in ultrapure water (to form a 30 mM solution initially), followed by the stepwise addition of 1 N NaOH solution to adjust the overall aqueous pH to 6.5–8.5. Meanwhile, a short-term ultrasonication treatment was applied after each pH adjustment to facilitate peptide solubilization and speed up peptide self-assembly. These operations could be repeated several times until a viscous, translucent colloid was formed, suggesting the initial stage of the gelation process. The mixture was allowed to stand in for another 48 h for complete hydrogelation. Upon the absence of gelation phenomenon during the aforementioned loops, we increased the peptide concentration (60 mM, 90 mM, and 120 mM were selected) and re-started such gelation operation loop to explore their gelation feasibility.

Rheology

During the experiment, a rheology test was carried out on an ARES-G2 (TA instrument) system with a 25 mm parallel plate at a 500 μm gap during the experiment. In the process of dynamic frequency (strain = 0.5%) scanning, the obtained hydrogel was transferred to the test platform with a pipette, and the changes of elastic modulus (G’) and viscous modulus (G”) of the hydrogel during scanning (frequency from 100–0.01 Hz) was tested. The hydrogels were then characterized by dynamic strain sweep at a fixed frequency of 1 Hz, and the changes in elastic modulus (G’) and viscous modulus (G”) of the hydrogel were recorded (strain: 0.01–100%).

Fourier transform infrared spectroscopy

Samples of tetrapeptide hydrogel were first lyophilized to powder, then we placed the sample on a diamond single reflection attenuated total reflectance (ATR) module. Spectra were recorded on a FTIR micro spectrometer (ThermoFisher Nicolet iS50) by averaging 32 scans at a spectral resolution of 1 cm-1.

Preparation of tetrapeptide hydrogel vaccines

YAWF was dissolved in 600 μL of endotoxin-free PBS buffer (≤0.5 EU/mL, Cellcook Biotech Co. Ltd., Guangzhou, China). A homogeneous hydrogel was formed by adjusting the final pH to 7.5 by 1 N NaOH. Then RBD protein (RBD-Fc) was added to the hydrogel, followed by vortexing and standing at room temperature for one hour to obtain a tetrapeptide hydrogel protein vaccine. For in vivo immune evaluation, C57BL/6 J mice were randomly divided into three groups (n = 6): (1) 15 μg RBD protein (RBD group); (2)12.5 μL aluminum adjuvant and 15 μg RBD protein (Alum + RBD group); (3) 60 mM tetrapeptide hydrogel and 15 μg RBD protein (YAWF + RBD group). Each mouse was injected subcutaneously in the groin with 100 μL of the prepared vaccine.

Mice

C57BL/6 J mice (female, 6-8 weeks old) were obtained from the laboratory animal resources center (LARC) at Westlake University. They were housed in specific pathogen-free (SPF) conditions, and a 12-h light/12-h dark cycle was used. The housing temperature for mice is between 20–26 °C with 40–70% humidity.

Tetrapeptide hydrogel in promoting dendritic cells maturation

Bone marrow cells were isolated from the femur and tibia of C57BL/6 J mice and then cultured in 1640 medium containing GM-CSF (5 ng/mL) and IL-4 (5 ng/mL) at 37 °C for 6 days64. The collected immature DCs were plated in a 24-well plate at a density of 1 × 106 cells per well. After 24 h, 50 μL of the blank medium, vaccine, and LPS were added to each well, respectively, then the medium volume was supplemented to 1 mL. The cells were cultured for another 24 h and centrifuged to collect the cells and supernatant. The acquired cells were labeled with FITC-tagged anti-CD80, FITC-tagged anti-CD83, and PE-tagged anti-CD86 for flow cytometry (CytExpert Software for CytoFLEX 2.4.0.28). The production of IL-6 and TNF-α in the cell culture supernatants was also analyzed by ELISA kit.

ELISA for antibody titer

The production of anti-RBD IgG, IgG1, IgG2b, and IgG2c antibodies in mice serum was analyzed by ELISA. RBD proteins (RBD-H) were plated on 96-well uncoated ELISA plates (Biolegend) at 3 μg/mL in PBS buffer overnight at 4 °C. The plate was blocked with 1% BSA for 2 h at 37 °C. By washing the plate three times with PBST (Phosphate buffer saline (PBS) with 5‰ Tween 20), 100 μL diluted mice serum was added into per well and incubated at 37 °C for 2 h. The plate was washed with PBST four times. Each well of 96-well plate was added with 100 μL HRP-labeled goat anti-mouse IgG, IgG1, IgG2b, and IgG2c binding antibody (1: 5000 diluted in blocking buffer) at 37 °C for 1 h. After washing the plate four times with PBST, 50 μL 3,3′,5,5′-tetramethylbenzidine (TMB) was added into per well. The reaction was stopped with 50 μL 2 M H2SO4. The absorbance value at 450 nm and 570 nm wavelength was determined by a Microplate reader (Thermo Fisher Scientific, Varioskan LUX). Titers were analyzed with log 10 serum dilution plotted against absorbance at 450 nm minus absorbance at 570 nm. Antibody titer values were defined as the highest serum dilution that gave an optical density above 0.1.

Cytokine production

On the 7th day after the last immunization, fresh splenocytes were collected by grinding the mouse spleen65. The splenocytes (5 × 106 cells/mL) from each group of mice were plated in 24-well plates, and stimulated with soluble RBD protein (50 μg/mL) for 96 h. The production of IFN-γ and IL-5 in cell culture supernatants was analyzed by ELISA kit.

Statistics and reproducibility

Statistical significance was determined using a one-way ANOVA test. Statistical analyses were performed using GraphPad Prism 8. For all representative TEM or optical images, experiments were performed three times independently with similar results.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.