Main

Proteins participate in essentially all biological processes and play critical roles in an organism. The structures of proteins are highly correlated with their functions in biological processes. Determining protein structures to understand their functions can contribute considerably to the life sciences.

In recent years, protein structure prediction technologies based on artificial intelligence have made substantial progress in prediction accuracy, demonstrating great prospects for the drug and vaccine industry. In particular, AlphaFold2 (ref. 1) has pushed the performance to a new frontier in the challenging 14th Critical Assessment of Protein Structure Prediction (CASP14) (ref. 2), approaching the accuracy of experimental determination methods. Mainstream protein structure prediction pipelines rely heavily on co-evolution information extracted from multiple sequence alignments (MSAs). An MSA can simply be regarded as a set of protein chains similar in sequence to the target protein chain. An MSA carries the co-evolution information of a protein sequence, which is crucial to predicting its structure. However, over-reliance on MSAs has become a bottleneck for various protein-related tasks. Compared with the time (usually a few seconds) required for model inference in the structure prediction pipeline, searching MSAs is time consuming, costing tens of minutes per protein. This time-consuming search is prohibitive in tasks demanding high-throughput requests, such as protein design. In the design of therapeutic proteins, such as peptides and antibodies, large-scale virtual screening is typically used to sift through candidate protein datasets to identify potential drugs that can be further validated for a specific target protein. A precise and efficient protein structure prediction method could potentially accelerate the development of new drugs for treating a variety of diseases.

Consequently, designing an accurate and efficient MSA-free protein structure prediction method is likely to benefit and accelerate the development of protein studies. We argue that a large-scale protein language model (PLM) can serve as an alternative to MSAs, learning the co-evolution knowledge needed for MSA-free prediction. An MSA-based method uses the information retrieval technique to explicitly capture co-evolutionary information of a target protein from the protein sequence databases, while a PLM-based method embeds co-evolutionary information into the large-scale model parameters during training and performs an implicit retrieval through model inference, where the PLM can be regarded as a protein knowledge base3. An MSA-based method is less efficient in retrieving information and depends on a manually designed retrieval scheme. A PLM-based method, on the other hand, is more efficient in information retrieval, and the quality of retrieval depends primarily on the model’s capacity or parameter size. The past few years have seen tremendous success of large-scale language models4,5,6 in natural language processing, a field that shares many characteristics with protein study. With an increasing number of model parameters, the capacity for learning language knowledge grows substantially. Using self-supervised learning on large-scale unlabelled proteins, PLMs can reveal the long-range interactions along protein sequences and improve downstream protein-related tasks. Recent works have attempted to adopt PLMs to enhance the performance of multiple downstream tasks, such as estimating the secondary structures and the functions7,8,9,10. In particular, several studies11,12,13 have attempted to apply PLMs to protein structure prediction. Most of these works first predict the inter-residue two-dimensional geometry using neural networks and then reconstruct the three-dimensional (3D) structure on the basis of energy minimization, which cannot provide end-to-end 3D structure prediction. Moreover, compared with the geometric learning capability of the Evoformer and Structure modules proposed by AlphaFold2, the capacities of the geometric models used by these methods, such as recursive models and residual neural networks, are also unsatisfactory in understanding the co-evolution and spatial relations between the residues in a single sequence.

Inspired by the progress of PLMs and AlphaFold2, we propose an end-to-end MSA-free protein structure prediction pipeline, HelixFold-Single. The model used in HelixFold-Single consists of two major components: a large-scale PLM as the foundation and the essential components from AlphaFold2 for folding. The PLM encodes the primary structure into single and pair representations to learn the domain knowledge. The Evoformer and Structure modules from AlphaFold2 are then integrated to process these representations, learn the geometric knowledge and predict the coordinates of the atoms. The two components are connected to give an end-to-end differentiable model. HelixFold-Single contains two training stages. In the first stage, the large-scale PLM is trained with hundreds of millions of unlabelled single sequences by the task of masked language prediction. In the second stage, we train the whole model with protein structures composed of experimental ground truth and augmentation structures generated by AlphaFold2.

We compare HelixFold-Single with AlphaFold2 and RoseTTAFold on datasets CASP14 and CAMEO (Continuous Automated Model Evaluation). HelixFold-Single achieves accuracy competitive with that of the other methods on proteins with sufficient numbers of homologous sequences. We also analyse the performance of HelixFold-Single on targets with various numbers of homologous sequences: HelixFold-Single is capable of providing accurate structure predictions on most targets, especially targets with large homologous families. An ablation study comparing PLMs of different sizes demonstrates the importance of the size of the PLM for structure prediction. Furthermore, HelixFold-Single shows great superiority in prediction efficiency when compared with the MSA-based methods and could be applied to protein-related tasks demanding a great number of predictions. Specifically, we investigate HelixFold-Single’s precision on various types of representative protein, including peptides, antibodies and nanobodies, with the aim of assessing its potential for application in therapeutic protein design. Our results suggest that HelixFold-Single performs well in predicting flexible regions of these proteins, highlighting its strengths for such applications.

HelixFold-Single

HelixFold-Single aims to take advantage of both the PLM and the main modules used in AlphaFold2 for single-sequence-based protein structure prediction. As exhibited in Fig. 1, HelixFold-Single consists of three components: PLM Base, Adaptor and Geometric Modelling. The large-scale PLM Base is employed to encode the co-evolution information in its parameters, serving as an alternative to MSAs. Then, in Geometric Modelling, following AlphaFold2, we use modified Evoformer (named EvoformerS) and Structure modules to sufficiently exchange the information between the single representations and pair representations, capture the geometric information and recover the 3D coordinates of the atoms. We adopt an Adaptor layer to extract the co-evolution information from the PLM and effectively generate the single and pair representations required as inputs to Geometric Modelling. The whole differentiable pipeline is trained by both self-supervised pre-training with large quantities of unlabelled single sequences and supervised learning with geometric labels.
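To make the data flow concrete, the following schematic sketch outlines the forward pass. The component interfaces (plm_base, adaptor, evoformer_s, structure_module) and the number of recycles are hypothetical placeholders for illustration, not the actual implementation.

```python
def helixfold_single_forward(sequence, plm_base, adaptor, evoformer_s,
                             structure_module, n_recycles=3):
    # PLM Base: per-residue representations and per-block attention maps
    x, attn_maps = plm_base(sequence)
    # Adaptor: initial single and pair representations for Geometric Modelling
    single, pair = adaptor(x, attn_maps)
    # Geometric Modelling: EvoformerS + Structure module, refined by recycling
    coords = None
    for _ in range(n_recycles):
        single, pair = evoformer_s(single, pair, prev_coords=coords)
        coords = structure_module(single, pair)
    return coords
```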

Fig. 1: The framework of HelixFold-Single.

It consists of a protein language model as PLM Base, the composite of the EvoformerS (revised from Evoformer) and Structure Module of AlphaFold2 as Geometric Modelling, and Adaptor to connect PLM Base and Geometric Modelling. M, million; K, thousand.

Results

Overall comparison

To compare the overall accuracy of HelixFold-Single with several baseline structure prediction pipelines, including MSA-based and MSA-free methods, we used CASP14 (refs. 1,14,15) with 87 domain targets and CAMEO16 with 371 targets collected from 4 September 2021 to 19 February 2022. AlphaFold2 (ref. 1) and RoseTTAFold17, which rely on MSAs to provide predictions, are currently the most advanced methods for protein structure prediction. We evaluated the prediction performance of AlphaFold2 and RoseTTAFold with and without homologous sequences (denoted by AlphaFold2 (input: MSA), RoseTTAFold (input: MSA), AlphaFold2 (input: single) and RoseTTAFold (input: single)). We also trained an MSA-free version of AlphaFold2, denoted by AlphaFold2-Single, by only using the single sequences as input. To evaluate the accuracy of HelixFold-Single and other methods, we utilized a commonly used metric, the template modelling score (TM-score)18.

Figure 2 exhibits the test results of our proposed HelixFold-Single and the compared methods on CASP14 and CAMEO. On the basis of the results, we make the following observations.

(1)

    In general, HelixFold-Single significantly surpasses all the MSA-free methods on CASP14 and CAMEO and is competitive with the MSA-based methods in certain scenarios. Notably, the accuracy of HelixFold-Single on CAMEO is comparable to that of AlphaFold2 (input: MSA) and outshines another baseline, RoseTTAFold (input: MSA). HelixFold-Single demonstrates the great potential of incorporating PLM into geometric modelling for protein structure prediction.

(2)

    HelixFold-Single can be on a par with the MSA-based methods on targets with large homologous families, for example, on CASP14 template-based modelling (TBM)-easy domain targets with a median of 7,000 homologous sequences (MSA depth = 7,000) and on CAMEO targets with more than 1,000 homologous sequences (MSA depth > 1,000). These results indicate that the accuracy of HelixFold-Single is correlated to the richness of homologous sequences, revealing that the large-scale PLM adopted by HelixFold-Single is capable of embedding the information, for example, co-evolution knowledge, of MSAs used by the MSA-based methods.

(3)

    Comparing HelixFold-Single with other MSA-free methods, HelixFold-Single exhibits its great superiority in all the categories of CASP14 and CAMEO. Since AlphaFold2 and RoseTTAFold rely on MSAs as input during the training process, it is challenging for these methods to provide accurate predictions when taking only single sequences as input. Even for AlphaFold2-Single, which uses only single protein sequences as input for training, its precision is unsatisfactory without the assistance of the PLM.

Fig. 2: Overall comparison of HelixFold-Single and other methods on CASP14 and CAMEO.

a,b, AlphaFold2 (input: MSA) and RoseTTAFold (input: MSA) are MSA-based methods, while the others use the primary structures as input. Data are divided into quartiles, and a box is drawn between the first and third quartiles, with an additional line drawn along the second quartile to mark the median and a cross to mark the mean. The whiskers extend from the edges of the box to represent the minimum and maximum values within a certain range, excluding outliers. This system is used for all box plots of this paper. a, CASP14 (87 targets classified into free-modelling (FM) and TBM categories on the basis of their relatedness to existing structures). b, CAMEO (371 targets classified into four categories depending on MSA depth).

Effect of number of homologous sequences

The results on CASP14 and CAMEO indicate that the accuracy of HelixFold-Single is related to the number of homologous sequences. We further compare the performance of HelixFold-Single and other methods on targets with varying MSA depths. We collected a fresh test dataset, MSA-Depth-Test, comprising targets released between May 2020 and October 2021 from the Research Collaboratory for Structural Bioinformatics Protein Data Bank (PDB). Specifically, we selected targets that exhibit relatively sparse homologous sequences. We blended these targets with the data of CASP14 and CAMEO as a new evaluation set. Figure 3a compares the TM-scores of HelixFold-Single and the baseline methods on the evaluation set, grouped by the number of homologous sequences (MSA depths). Figure 3b shows the distribution of the proteins in different groups in this evaluation set. We can see that, as the available homologous sequences grow, the average TM-scores of both HelixFold-Single and the MSA-based methods increase, while the scores of the other MSA-free methods decrease. For the proteins with sparse homologous sequences, the TM-scores of all the compared methods are unsatisfactory. For the proteins with larger homologous families, especially those with thousands of homologous sequences or more, HelixFold-Single can compete with the MSA-based methods. In general, it appears that HelixFold-Single is more sensitive to the presence of evolutionary information than MSA-based methods such as AlphaFold2 (input: MSA) or RoseTTAFold (input: MSA). Given that 90% of the targets in PDB have more than 1,024 homologous sequences, we can reasonably extrapolate that HelixFold-Single can achieve satisfactory accuracy on the most frequently investigated proteins.

Fig. 3: Analysis of the impact of homologous sequences (MSA depths), and investigation of the relations between MSA depths, TM-scores and perplexity of the PLM.

a, Comparison between HelixFold-Single and the baseline methods on 1,251 protein targets with various numbers of homologous sequences (MSA depths). b, Distribution of proteins with different homologous sequences in PDB. c, Relations between MSA depths and TM-scores of HelixFold-Single. d, Relations between MSA depths and perplexity of PLM. e, Relation between perplexity of PLM and TM-scores of HelixFold-Single.

To further investigate the relationship between the capacity of the PLM, the accuracy of protein structure prediction and the size of the homologous family, we utilized the targets in the CASP14 and CAMEO datasets to exhibit their relations, as shown in Fig. 3c–e. As expected, Fig. 3c shows that a protein’s structure accuracy (TM-score) is correlated with the size of its homologous family (MSA depth), and the results are consistent with those in Fig. 3a. Moreover, we use a probability metric, perplexity19, to indicate the capacity of the PLM. Perplexity is widely used in natural language processing to quantify the level of uncertainty a language model has in predicting text (which corresponds to the protein sequences for a PLM). A lower perplexity score indicates a higher degree of accuracy for the language model. The results in Fig. 3d show that the perplexity of the PLM and the MSA depths are negatively correlated. We reasonably infer that a PLM prioritizes learning the patterns of high-frequency proteins (which typically have more homologous sequences) rather than long-tail proteins (which usually have only a few homologous sequences) from the large-scale unlabelled protein sequences. These results also explain why the PLM-based HelixFold-Single is more sensitive to MSA depth when predicting protein structures. Moreover, the perplexity of the PLM and the TM-scores of HelixFold-Single are also negatively correlated (Fig. 3e). These results indicate that if the PLM Base module can predict (model) a protein sequence well, then there is a high probability that the PLM module has learned the co-evolution information of this protein and can serve as an alternative to MSAs. Thus, the Geometric Modelling module can leverage the co-evolution information embedded in the PLM to provide a more accurate structure for that protein.
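For concreteness, perplexity can be computed from the model’s per-residue log-probabilities. The minimal sketch below assumes natural-log probabilities assigned to the true residue at each evaluated position; it is illustrative rather than the exact evaluation code.

```python
import numpy as np

def sequence_perplexity(true_residue_log_probs):
    """Perplexity = exp of the average negative log-likelihood per residue."""
    log_probs = np.asarray(true_residue_log_probs, dtype=np.float64)
    return float(np.exp(-log_probs.mean()))

# Example: a model assigning probability ~0.5 to each true residue has perplexity ~2
print(sequence_perplexity(np.log([0.5, 0.4, 0.6, 0.5])))
```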

Effect of sizes of PLMs

To comprehensively study the ability of PLMs of different sizes to learn the co-evolution information, we compare a pre-trained PLM with one billion parameters (denoted by PLM-1B) and another pre-trained PLM with 100 million parameters (denoted by PLM-100M). Figure 4a exhibits the perplexity of PLM-1B and PLM-100M on the targets from the CASP14 and CAMEO datasets. In general, the smaller the perplexity, the stronger the capacity of the PLM. Thus, PLM-1B, with more model parameters, performs better than PLM-100M on both CASP14 and CAMEO. In addition, we apply PLM-1B and PLM-100M to the task of protein residue contact prediction to compare their performance on downstream tasks. We simply fit a logistic regression that takes the attention weights, that is, \([{{{{\boldsymbol{z}}}}}^{(1)},{{{{\boldsymbol{z}}}}}^{(2)},\ldots ,{{{{\boldsymbol{z}}}}}^{({n}_{\mathrm{PLM}})}]\), from the PLMs as input and predicts the contacts between residues on the targets in the CASP14 and CAMEO datasets. Following refs. 7,20, we use the top L/5 long-range contact precision, denoted by P@L/5, where L is the protein length, as the evaluation metric, and the results are shown in Fig. 4b. As we can see, PLM-1B is significantly superior to PLM-100M on the contact prediction task. The results in Fig. 4a,b both support the hypothesis that the larger the size of the PLM, the stronger its capacity. Therefore, it can be reasonably inferred that the performance of the PLM will continue to improve as the size of the PLM increases further.
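The sketch below shows one way to compute P@L/5 and to fit the logistic-regression probe. The long-range threshold of a sequence separation of at least 24 residues, the use of scikit-learn, and the random toy features standing in for the concatenated attention weights are all illustrative assumptions, not the actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def precision_at_l5(pred_scores, true_contacts, min_sep=24):
    """Top-L/5 long-range contact precision (P@L/5).

    pred_scores: [L, L] predicted contact scores; true_contacts: [L, L] binary labels.
    Only residue pairs with sequence separation >= min_sep are considered.
    """
    L = pred_scores.shape[0]
    i, j = np.triu_indices(L, k=min_sep)                  # long-range residue pairs
    top = np.argsort(pred_scores[i, j])[::-1][: max(1, L // 5)]
    return float(true_contacts[i[top], j[top]].mean())

# Toy illustration of the probe: one feature vector per residue pair
# (random stand-ins for the attention weights concatenated across blocks and heads).
rng = np.random.default_rng(0)
L, n_feat = 80, 64
feats = rng.normal(size=(L, L, n_feat))
labels = (rng.random((L, L)) < 0.05).astype(int)          # toy contact labels
i, j = np.triu_indices(L, k=1)
probe = LogisticRegression(max_iter=1000).fit(feats[i, j], labels[i, j])
scores = np.zeros((L, L))
scores[i, j] = probe.predict_proba(feats[i, j])[:, 1]
print(precision_at_l5(scores, labels))
```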

Fig. 4: Comparison of PLMs of different sizes on CAMEO (371 targets) and CASP14 (87 targets).

a, Perplexity of PLM-1B and PLM-100M. b, Contact prediction of PLM-1B and PLM-100M.

Prediction speed comparison

Massive time consumption for searching MSAs is one of the bottlenecks of MSA-based folding, and accelerating protein structure prediction can considerably broaden its applications. The MSA-free HelixFold-Single has a tremendous advantage in inference efficiency by avoiding MSA searching. Figure 5 exhibits the computation time cost of (1) MSA searching, (2) the whole inference pipeline of AlphaFold2 and (3) the inference of HelixFold-Single. All the tests are executed on a single NVIDIA A100 (40 GB) graphics processing unit. In general, HelixFold-Single consumes much less time than AlphaFold2, while the AlphaFold2 pipeline spends most of its time on MSA searching. For proteins less than 100 amino acids in length, HelixFold-Single’s prediction time is only about one-thousandth of that of AlphaFold2. Even for proteins with more than 800 amino acids, HelixFold-Single still has a large efficiency advantage. The efficiency of HelixFold-Single demonstrates the potential of its application in tasks with a high demand for structural prediction.

Fig. 5: Comparison of the median run times of MSA search, AlphaFold2 and HelixFold-Single.

We compare the median times of MSA search, AlphaFold2 and HelixFold-Single on proteins with various lengths.

Study on multiple types of representative protein

One of the strengths of HelixFold-Single is its efficiency compared with MSA-based methods, which makes it well suited for high-throughput protein structure prediction tasks such as protein design. To investigate the performance of HelixFold-Single on therapeutic proteins, three representative types of protein were chosen: peptides, antibodies and nanobodies. Peptides are smaller protein molecules that can be used as drugs to target a variety of biological processes, while antibodies and nanobodies are used in immunotherapy to target specific cells or molecules in the body. An antibody contains two types of chain, a heavy chain and a light chain, whereas a nanobody includes only the heavy chain. We evaluate the MSA-free HelixFold-Single and MSA-based AlphaFold2 on multiple datasets—Recent-PDB, Peptide, Antibody and Nanobody—to gain insights into the applicability of these methods to different types of protein and their potential use in protein design. Recent-PDB can be seen as the control group containing recently released proteins from PDB, while the remaining datasets represent experimental groups that are more relevant to therapeutic applications. Antibody-VH and Antibody-VL represent the sets of heavy chains and light chains, respectively, of the collected antibodies.

The results presented in Fig. 6a are intriguing, as they demonstrate that HelixFold-Single can perform as well as, or even outperform, AlphaFold2 in certain scenarios. While HelixFold-Single’s performance slightly lags behind that of AlphaFold2 on the Peptide dataset, the precision gap between the two methods is considerably narrower than that on the Recent-PDB dataset. This indicates that HelixFold-Single is comparatively better suited to predicting the structures of short and highly flexible peptides. For the antibody-related datasets, HelixFold-Single performs competitively with AlphaFold2 on the Antibody-VL and Nanobody datasets, and surpasses AlphaFold2 on Antibody-VH. We surmise that HelixFold-Single is better equipped to capture the intricate patterns of the complementarity-determining regions (CDRs) from the large-scale protein sequence data; the CDRs of antibodies are crucial for an antibody’s specificity and are known to be highly variable and difficult to predict. Therefore, we conducted a detailed analysis of HelixFold-Single’s performance on the CDRs, as illustrated in Fig. 6b,c. HelixFold-Single performs comparably to AlphaFold2 in terms of the whole chains (VH, VL and VHH) and all the CDRs, with a slight advantage in predicting the CDR-H3 (widely recognized as the most diverse and critical CDR) of the antibodies and nanobodies. Given the high variability of short peptides and the CDRs of antibodies, it is reasonable to assume that HelixFold-Single excels in predicting highly variable regions where MSAs may not be effective. To support this hypothesis, we performed additional analyses on the secondary structures of peptides and antibodies. Our results show that HelixFold-Single is capable of accurately predicting regions with the more flexible secondary structures of ‘turn’ or ‘coil’. For more information, please refer to Supplementary Section 5.

Fig. 6: Comparison between AlphaFold2 and HelixFold-Single on the representative types of protein.

a–c, Recent-PDB (7,595 targets) is the control group. Peptide (197 targets), Antibody (90 targets) and Nanobody (184 targets) are the sets of representative proteins. Note that a typical antibody has six CDRs, while a nanobody has three CDRs. a, Overall comparison. b, Antibody. c, Nanobody. RMSD, root-mean-square deviation.

Related works

Protein language models

Large-scale language models4 with the self-supervised learning paradigm, such as masked language modelling5 and autoregression21, have achieved extraordinary success in natural language processing tasks. Recent progress has revealed that their capabilities are strongly related to the scale of the model parameters: the larger the scale of the parameters, the better the performance6. The community has not yet seen any sign of this growth stopping as models move from billions to hundreds of billions of parameters. These language models are capable of memorizing and generalizing massive common-sense knowledge and professional expertise implicitly included in the large-scale unlabelled data. Inspired by these achievements, PLMs transfer language models and self-supervised learning tasks to protein modelling. A protein can be represented by an amino-acid sequence, similar to the sequences of words or tokens in natural language processing. Previous works7,8,9,10 have shown that, by pre-training with only single sequences without much supervision, PLMs can reveal protein classification, stability and lower-level structure information (including secondary and tertiary structures and two-dimensional contact maps). However, the accuracy of these models in structure prediction is still far from that of the mainstream folding models supervised by the ground-truth protein structures.

Protein structure prediction

Mainstream pipelines22,23,24,25 rely on extracting the co-evolution information from MSAs to predict protein structures. Earlier works manually designed features derived from MSAs, such as inverse covariance matrices, and then applied deep neural networks—for example, convolutional networks—to model the relations between the residues. Advanced studies1,24 directly take the MSAs as input and apply deep neural networks to predict the 3D coordinates of the proteins. In particular, the appearance of AlphaFold2 (ref. 1) has markedly narrowed the accuracy gap between the experimentally determined structures and model-estimated structures, employing the Evoformer module to enhance the interaction between MSA sequences and pairwise geometric information and the Structure module to directly predict the atoms’ coordinates. However, the reliance on MSAs inevitably impedes computational efficiency and the accurate prediction of orphan proteins and designed proteins, as well as downstream tasks such as protein design.

Although the structure of a protein is dependent on its primary structure, it is incredibly challenging to train an accurate model that can infer the protein structures with only the primary structures. Only a small number of samples, that is, experimentally determined structures recorded in the PDB database, are available for model training. Several works attempt to incorporate PLMs for MSA-free protein structure prediction. RGN2 (ref. 11) employs a PLM (AminoBERT) with a recurrent geometric network that utilizes Frenet–Serret frames to generate the backbone structure. Moreover, advanced studies12,13 combine pre-trained PLMs, such as ProT5 (ref. 8) and ESM-1b (ref. 26), with residual neural networks to predict two-dimensional structures (for example, a contact map of a protein), yielding superior performance in orphan proteins. Nonetheless, the overall accuracy of those works is still unsatisfactory due to the limited capacity of the model architectures used.

Conclusion and future work

On the one hand, mainstream protein structure prediction methods, such as AlphaFold2 and RoseTTAFold, rely on MSAs to extract homologous information. However, searching MSAs is time consuming, limiting the application of those methods to broader protein-related tasks. On the other hand, a large-scale PLM learns protein correlations from a great number of unlabelled proteins through self-supervised learning tasks. By utilizing large-scale parameters to embed the homologous information, we show that a PLM can be used as an alternative to MSAs to reduce the time required by protein structure prediction methods. HelixFold-Single takes advantage of both the PLM and geometric modelling, predicting protein structures end to end with only the primary structures. HelixFold-Single can be on a par with the MSA-based methods on targets with large homologous families and is much more efficient than the MSA-based methods, demonstrating its application prospects for protein study.

In the future, as the experimental results indicate that a larger PLM can achieve superior performance, we will continue investigating PLMs of larger sizes for protein structure prediction. In addition, the accuracy on targets with only a few homologous sequences is still unsatisfactory. Thus, we will try to introduce more diverse training data to alleviate this problem.

Methods

Large-scale PLM Base

Inspired by large-scale pre-trained language models, we follow previous works in pre-training a PLM. The PLM processes the primary protein sequences (that is, the amino-acid sequences) and extracts the knowledge needed for further geometric modelling. A protein of length L can be uniquely represented by a sequence of amino-acid types denoted by x = (x1, x2, …, xL). An embedding layer E(xl) maps each type identifier to a dPLM-dimensional embedding vector:

$${{{{\mathbf{x}}}}}^{(0)}=(E({x}_{1}),E({x}_{2}),\ldots ,E({x}_{L})).$$

Notice that \({{{{\mathbf{x}}}}}^{(k)}\in {{\mathbb{R}}}^{L\times {d}_{\mathrm{PLM}}}\) is the representation of the amino-acid sequence.
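A minimal sketch of this embedding step is shown below. The 20-letter amino-acid vocabulary plus mask/padding tokens and the embedding width are illustrative assumptions rather than the exact configuration of HelixFold-Single.

```python
import torch
import torch.nn as nn

AA_VOCAB = "ACDEFGHIKLMNPQRSTVWY"                       # 20 standard amino acids
aa_to_id = {aa: i for i, aa in enumerate(AA_VOCAB)}

d_plm = 1280                                            # illustrative embedding width
embedding = nn.Embedding(len(AA_VOCAB) + 2, d_plm)      # +2: mask and padding tokens

sequence = "MKTAYIAKQR"                                 # toy primary sequence
ids = torch.tensor([aa_to_id[aa] for aa in sequence])   # amino-acid type identifiers x_l
x0 = embedding(ids)                                     # x^(0): [L, d_plm]
print(x0.shape)                                         # torch.Size([10, 1280])
```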

We then apply the widely used Transformer-style blocks4 to process the embedding vectors, denoted by

$${{{{\mathbf{x}}}}}^{(k+1)}={\mathrm{DisentangledAttentionTransformer}}\Big({{{{\mathbf{x}}}}}^{(k)}\Big).$$
(1)

Accurately predicting the contacts between the residues, especially the long-range contacts, is critical for protein structure prediction. Taking into account that the contact between residues depends more on their relative positions than on their absolute positions (counted from the start of the sequence), we employ the DisentangledAttentionTransformer from DeBERTa27 to focus on modelling the interactions between the residue representations and the relative positions. The DisentangledAttentionTransformer adopts the attention mechanism to learn the interactions between the residues as well as the interactions of the residue–position pairs.
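The sketch below illustrates the disentangled attention idea for a single head, combining content-to-content, content-to-position and position-to-content terms. The dimensions, the clipped relative-position window, and the omission of multiple heads and feed-forward layers are simplifying assumptions; this is not the actual block used in HelixFold-Single.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledAttentionSketch(nn.Module):
    """Single-head sketch of attention over residue content and relative positions."""

    def __init__(self, d_model=64, max_rel=32):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.rel_k = nn.Embedding(2 * max_rel + 1, d_model)  # relative-position keys
        self.rel_q = nn.Embedding(2 * max_rel + 1, d_model)  # relative-position queries
        self.max_rel = max_rel
        self.scale = d_model ** -0.5

    def forward(self, x):                                    # x: [L, d_model]
        L = x.shape[0]
        q, k, v = self.q(x), self.k(x), self.v(x)
        rel = torch.arange(L)[None, :] - torch.arange(L)[:, None]
        rel = rel.clamp(-self.max_rel, self.max_rel) + self.max_rel   # [L, L] clipped offsets
        c2c = q @ k.T                                        # content-to-content
        c2p = (q[:, None, :] * self.rel_k(rel)).sum(-1)      # content-to-position
        p2c = (self.rel_q(rel) * k[None, :, :]).sum(-1)      # position-to-content
        attn = F.softmax((c2c + c2p + p2c) * self.scale, dim=-1)
        return attn @ v, attn                                # updated features, attention weights

x = torch.randn(10, 64)
out, weights = DisentangledAttentionSketch()(x)
print(out.shape, weights.shape)                              # [10, 64], [10, 10]
```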

Moreover, we take advantage of multihead self-attention weights in DisentangledAttentionTransformer to construct the initial pair representation. The attention weights of the kth block are denoted by \({{{{\mathbf{z}}}}}^{(k)}\in {{\mathbb{R}}}^{L\times L\times {h}_{\mathrm{PLM}}}\), where hPLM is the number of heads of self-attention.

We add an Adaptor to map the output of PLM Base to the input of the Geometric Modelling module:

$$\begin{array}{rcl}{\tilde{{{{\mathbf{x}}}}}}^{(0)}&=&{\mathrm{Linear}}\Big({{{{\mathbf{x}}}}}^{({n}_{\mathrm{PLM}})}\Big),\\ {\tilde{{{{\mathbf{z}}}}}}^{({{0}})}&=&{\mathrm{Linear}}\Big([{{{{\mathbf{z}}}}}^{(1)},{{{{\mathbf{z}}}}}^{(2)},\ldots \,,{{{{\mathbf{z}}}}}^{({n}_{\mathrm{PLM}})}]\Big),\end{array}$$
(2)

where nPLM is the number of blocks in PLM Base, and the operator [] refers to concatenation. \({\tilde{{{{\mathbf{x}}}}}}^{(0)}\in {{\mathbb{R}}}^{L\times {d}_{\mathrm{single}}}\) and \({\tilde{{{{\mathbf{z}}}}}}^{({{{\bf{0}}}})}\in {{\mathbb{R}}}^{L\times L\times {d}_{\mathrm{pair}}}\) are the initial single representations and pair representations of the Geometric Modelling module, respectively.
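A minimal sketch of equation (2) is given below; all dimensions are illustrative assumptions rather than the actual configuration of HelixFold-Single.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the real n_PLM, h_PLM, d_single and d_pair differ.
L, d_plm = 128, 1280
n_plm, h_plm = 12, 16
d_single, d_pair = 384, 128

x_last = torch.randn(L, d_plm)                 # x^(n_PLM): last-block residue features
attn = torch.randn(n_plm, L, L, h_plm)         # z^(1..n_PLM): per-block attention weights

to_single = nn.Linear(d_plm, d_single)
to_pair = nn.Linear(n_plm * h_plm, d_pair)

single0 = to_single(x_last)                                    # initial single representation: [L, d_single]
pair0 = to_pair(attn.permute(1, 2, 0, 3).reshape(L, L, -1))    # initial pair representation: [L, L, d_pair]
print(single0.shape, pair0.shape)
```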

Geometric modelling

We employ the Evoformer and Structure modules proposed in AlphaFold2 (ref. 1) to model the relations between the residues and then estimate the 3D coordinates of the atoms in the proteins. We slightly modify the original Evoformer to match our settings and name the revised module EvoformerS (Evoformer with single representations). First, the original Evoformer takes the MSA representation and pair representation, encoded from the searched MSAs, as input; as an alternative, EvoformerS takes the output of the Adaptor (the single representations (\({\tilde{{{{\mathbf{x}}}}}}^{(0)}\)) and pair representations (\({\tilde{{{{\mathbf{z}}}}}}^{(0)}\))). Second, EvoformerS adopts various attention mechanisms to exchange the information within the single and pair representations to learn the spatial relationships. Note that, in contrast to the original version of Evoformer proposed by AlphaFold2, we remove the column-wise gated self-attention because HelixFold-Single focuses on MSA-free protein structure prediction and there is no need to exchange messages within the MSAs. We follow the other geometric components of AlphaFold2, including the Structure module, which takes the single representation and pair representation yielded by EvoformerS and exploits invariant point attention and other geometric transformation operators to predict the 3D coordinates of the atoms end to end. Also, following AlphaFold2, we recycle the whole Geometric Modelling module to refine the predicted structures iteratively.

Model optimization

For the sake of leveraging the domain knowledge from the protein database, we operate two-stage parameter optimization on HelixFold-Single.

In the first stage, the PLM is pre-trained to capture the co-evolution information. The PLM is trained with about 300 million single sequences recorded in a protein database. To encourage the PLM to observe diverse single sequences as early as possible, we cluster the proteins by the similarity of their single sequences and sample the proteins to balance the distributions of different clusters in our training data. We apply a self-supervised technique, masked language modelling, to optimize the parameters of the PLM, by randomly masking 15% of the residues in the single sequences and then reconstructing these masked residues. More concretely, the masked language model attempts to predict \(P({x}_{l}\,|\, {x}_{1},\ldots ,{x}_{l-1},{x}_{\mathrm{M}},{x}_{l+1},\ldots ,{x}_{L})\), given that the residue xl in the lth position is masked by xM. A crucial proposal of this work is that the PLM can learn the dependence between the masked residue and the other residues, and thus represent the co-evolution information. Previous works7 have already verified that PLMs can reveal secondary structures of the proteins, but the relation between PLMs and co-evolution has been little discussed. Co-evolution is the phenomenon whereby two residues in contact tend to evolve at the same time to preserve the structure, and thus the function, of the protein. In a PLM, if a residue at another position s has a profound impact on the masked residue (that is, if the residue at position s is changed, the predicted masked residue will also change), then these two residues are likely to evolve at the same time.
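As a sketch, the masking step of this first training stage might look like the following. Replacing every selected position directly with the mask token is a simplification (BERT-style schemes often also mix in random or unchanged tokens); the 15% rate follows the text, and the helper names are hypothetical.

```python
import torch

def mask_residues(token_ids, mask_id, mask_prob=0.15):
    """Mask ~15% of residues and return (masked inputs, reconstruction labels)."""
    masked = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mask_prob
    labels = torch.where(selected, token_ids,
                         torch.full_like(token_ids, -100))   # -100: ignored by the loss
    masked[selected] = mask_id
    return masked, labels

# Usage: the PLM predicts the original residues at the masked positions.
# logits = plm(masked)                                        # [L, vocab_size]
# loss = torch.nn.functional.cross_entropy(logits, labels, ignore_index=-100)
```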

In the second stage, since relying on the PLM alone is inadequate to capture the geometric information, the PLM Base and Geometric Modelling modules in HelixFold-Single are jointly optimized. We utilize 100,000 experimentally determined protein structures. We also use an additional one million estimated protein structures for training in this stage (distilled from AlphaFold2). Following AlphaFold2, we train the network end to end with the main losses, including the frame aligned point error (FAPE) loss and other auxiliary losses. By combining the computationally efficient PLM Base module (compared with MSA search) and the Geometric Modelling module, HelixFold-Single is capable of providing efficient and precise protein structure prediction.
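For orientation, a simplified sketch of the FAPE computation is shown below. The per-frame/per-atom pairing, the clamping value and the omission of the original loss’s length-scale normalization are simplifying assumptions; this is not the exact loss used in training.

```python
import torch

def fape_sketch(pred_R, pred_t, true_R, true_t, pred_xyz, true_xyz,
                clamp=10.0, eps=1e-8):
    """Simplified frame aligned point error.

    pred_R/true_R: [N, 3, 3] frame rotations; pred_t/true_t: [N, 3] translations;
    pred_xyz/true_xyz: [M, 3] atom positions. Every atom is expressed in every
    local frame, and the per-pair errors are clamped and averaged.
    """
    def to_local(R, t, xyz):
        # local[i, j] = R_i^T (x_j - t_i)
        return torch.einsum('nij,nmi->nmj', R, xyz[None, :, :] - t[:, None, :])

    diff = to_local(pred_R, pred_t, pred_xyz) - to_local(true_R, true_t, true_xyz)
    err = torch.sqrt((diff ** 2).sum(-1) + eps)          # [N, M] aligned point errors
    return err.clamp(max=clamp).mean()
```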

Datasets

We used UniRef30 (2021-03) (ref. 28), which clusters UniRef100 seed sequences from the UniProt Knowledgebase and selected UniProt Archive records29,30 at a 30% pairwise sequence identity level, to pre-train the PLM. Then, three datasets are used to train the whole network, including the proteins in PDB (refs. 31,32) released before 14 May 2020 and two datasets constructed from Uniclust30 (v.2018-08) and AlphaFold Protein Structure Database (v.2022-01) (ref. 33), for knowledge distillation.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.