HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative

AI-based protein structure prediction pipelines, such as AlphaFold2, have achieved near-experimental accuracy. These advanced pipelines mainly rely on Multiple Sequence Alignments (MSAs) as inputs to learn co-evolution information from homologous sequences. Nonetheless, searching MSAs in protein databases is time-consuming, usually taking dozens of minutes. This motivates us to explore the limits of fast protein structure prediction using only the primary sequences of proteins. We propose HelixFold-Single, which combines a large-scale protein language model with the superior geometric learning capability of AlphaFold2. HelixFold-Single first pre-trains a large-scale protein language model (PLM) with thousands of millions of primary sequences using the self-supervised learning paradigm; this PLM is then used as an alternative to MSAs for learning the co-evolution information. By combining the pre-trained PLM with the essential components of AlphaFold2, we obtain an end-to-end differentiable model that predicts the 3D coordinates of atoms from the primary sequence alone. HelixFold-Single is validated on the CASP14 and CAMEO datasets, achieving accuracy competitive with the MSA-based methods on targets with large homologous families. Furthermore, HelixFold-Single consumes much less time than the mainstream pipelines for protein structure prediction, demonstrating its potential in tasks requiring many predictions. The code of HelixFold-Single is available at https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single, and we also provide stable web services at https://paddlehelix.baidu.com/app/drug/protein-single/forecast.


Introduction
Proteins participate in essentially all biological processes and play critical roles in an organism. The structures of proteins are highly correlated with their functions in these biological processes. Determining protein structures to understand their functions can therefore contribute considerably to the life sciences.
In recent years, AI-based protein structure prediction technologies have made significant progress in prediction accuracy, demonstrating great prospects for the drug and vaccine industries. In particular, AlphaFold2 [1] pushed performance to a new frontier in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14 [2]), approaching the accuracy of experimental determination methods. Mainstream protein structure prediction pipelines rely heavily on co-evolution information extracted from Multiple Sequence Alignments (MSAs). MSAs can be regarded as collections of protein chains whose sequences are similar to the target chain; they carry the co-evolution information of protein sequences, which is crucial for predicting structure. However, over-reliance on MSAs has become a bottleneck for various protein-related tasks. First, compared with the time (usually several seconds) required for model inference in the structure prediction pipeline, searching for MSAs is time-consuming, costing dozens of minutes per protein. Such time-consuming searching is prohibitive in tasks demanding high-throughput requests, such as protein design. Second, the primary structure (the single sequence), rather than the MSAs, drives the folding of a protein, and MSA extraction methods are not designed specifically for protein folding. Thus, MSA-based pipelines arguably memorize the determined structures of similar proteins for prediction rather than fully capturing the mechanism of protein folding.
Consequently, an accurate MSA-free protein structure prediction method addressing these issues would benefit and accelerate the development of protein studies. We argue that a large-scale protein language model (PLM) can serve as an alternative to MSAs for learning co-evolution knowledge, enabling MSA-free prediction: we speculate that a PLM with billions of parameters can effectively memorize the MSAs and infer the co-evolution information. The past few years have seen the tremendous success of large-scale language models [3,4,5] in Natural Language Processing, a field that shares many characteristics with protein research. As the number of model parameters increases, the capacity for learning language knowledge grows substantially. Using self-supervised learning on large-scale unlabeled proteins, PLMs can reveal long-range relations along protein sequences and improve downstream protein-related tasks. Prior works have adopted PLMs to enhance the performance of multiple downstream tasks, such as estimating secondary structures and functions [6,7,8,9]. In particular, several studies [10,11,12] have applied PLMs to protein structure prediction. Most of these works first predict the inter-residue 2D geometry with neural networks and then reconstruct the 3D structure via energy minimization, and thus cannot provide end-to-end 3D structure prediction. Besides, compared with the geometric learning capability of the EvoFormer and Structure Module proposed by AlphaFold2, the geometric models used by these methods, such as recurrent models and ResNets, are less capable of capturing the co-evolution and spatial relations between the residues of a single sequence.
Inspired by the progress of PLMs and AlphaFold2, we propose an end-to-end MSA-free protein structure prediction pipeline, HelixFold-Single. The model consists of two major components: a large-scale PLM as the foundation and the essential components of AlphaFold2 for folding. The PLM encodes the primary structure into a single representation and a pair representation to capture the domain knowledge. The EvoFormer and Structure Module from AlphaFold2 then process these representations, learn the geometric knowledge, and predict the coordinates of the atoms. The two components are wired up into an end-to-end differentiable model. HelixFold-Single is trained in two stages. In the first stage, the large-scale PLM is trained with thousands of millions of unlabeled single sequences via masked language prediction. In the second stage, we train the whole model with protein structures comprising experimental ground truth and augmentation structures generated by AlphaFold2.
We compare HelixFold-Single with AlphaFold2 and RoseTTAFold on the CASP14 and CAMEO datasets. HelixFold-Single achieves competitive accuracy with those methods on proteins with sufficient homologous sequences. We also analyze the performance of HelixFold-Single on targets with varying numbers of homologous sequences: HelixFold-Single provides accurate structure predictions for most targets, especially those with large homologous families. An ablation study comparing PLMs of different sizes demonstrates the importance of PLM size for structure prediction. Furthermore, HelixFold-Single shows great superiority in prediction efficiency compared with the MSA-based methods and could be applied to protein-related tasks demanding a great number of predictions. The code of HelixFold-Single is publicly released at https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single. A web service of HelixFold-Single is also available at https://paddlehelix.baidu.com/app/drug/protein-single/forecast to provide efficient protein structure predictions.

HelixFold-Single
HelixFold-Single aims to take advantage of both a protein language model (PLM) and the main modules of AlphaFold2 for single-sequence protein structure prediction. As exhibited in Figure 1, HelixFold-Single consists of three components: PLM Base, Adaptor, and Geometric Modeling. A large-scale PLM Base is employed to encode the co-evolution information in its parameters, serving as an alternative to MSAs. An Adaptor layer then extracts the co-evolution information from the PLM to generate the single and pair representations required as inputs to Geometric Modeling. Finally, in Geometric Modeling, following AlphaFold2, we use a modified EvoFormer and the Structure Module to sufficiently exchange information between the single and pair representations, capture the geometric information, and recover the 3D coordinates of the atoms. The whole differentiable pipeline is trained by both self-supervised pre-training with bulks of unlabeled single sequences and supervised learning with geometric labels.

Large-Scale PLM Base
Inspired by large-scale pre-trained language models, we follow previous works in pre-training a protein language model (PLM). The PLM processes the primary protein sequence (i.e., the amino acid sequence) and extracts the knowledge needed for further geometric modeling. A protein of length L can be uniquely represented by its sequence of amino acid types, denoted by x = (x_1, x_2, ..., x_L). An embedding layer E maps each amino acid type to a d_PLM-dimensional embedding vector: x^(0) = (E(x_1), E(x_2), ..., E(x_L)). Note that x^(k) ∈ R^{L×d_PLM} denotes the representation of the amino acid sequence after the k-th block.
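As a toy illustration (the vocabulary, embedding width, and random initialization here are assumptions for the sketch, not the production implementation), the embedding step amounts to a per-residue table lookup:

```python
import numpy as np

# Hypothetical vocabulary: one-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def embed_sequence(seq, embedding_table):
    """Map a primary sequence to x^(0): an (L, d_PLM) matrix whose l-th row
    is the embedding vector E(x_l) of the l-th residue."""
    ids = [AA_TO_ID[aa] for aa in seq]
    return embedding_table[ids]  # row lookup implements the embedding layer E

d_plm = 8                        # toy embedding width, far below a real PLM's
rng = np.random.default_rng(0)
E = rng.standard_normal((len(AMINO_ACIDS), d_plm))

x0 = embed_sequence("MKTAYIAK", E)
print(x0.shape)                  # (8, 8): one d_PLM-dimensional vector per residue
```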
We then apply widely used Transformer-style blocks [3] to process the embedding vectors: x^(k) = TransformerBlock(x^(k−1)), for k = 1, ..., n_PLM. Accurately predicting the contacts between residues, especially long-range contacts, is critical for protein structure prediction. Considering that contact between residues depends more on their relative positions than on their absolute positions (counted from the start of the sequence), we employ the DisentangledAttentionTransformer from DeBERTa [13], which focuses on modeling the interactions between residue representations and relative positions.
The DisentangledAttentionTransformer adopts the attention mechanism to learn the interactions between the residues as well as the interactions of residue-position pairs.
Besides, we take advantage of the multi-head self-attention weights in the DisentangledAttentionTransformer to construct the initial pair representation. The attention weights of the k-th block are denoted by z^(k) ∈ R^{L×L×h_PLM}, where h_PLM is the number of self-attention heads.
We add an additional Adaptor to map the output of the PLM Base to the input of the Geometric Modeling module.
x̂^(0) = Linear(x^(n_PLM)), ẑ^(0) = Linear([z^(1), z^(2), ..., z^(n_PLM)]), where n_PLM is the number of blocks in the PLM Base and the operator [·] refers to concatenation. x̂^(0) ∈ R^{L×d_Single} and ẑ^(0) ∈ R^{L×L×d_Pair} are the initial single and pair representations of the Geometric Modeling module, respectively.
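A shape-level sketch of the Adaptor, with all dimensions chosen arbitrarily for illustration (the real layer sizes are not stated here): the single representation is a linear projection of the final residue states, and the pair representation projects the concatenated attention maps of all blocks.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d_plm, h_plm, n_blocks = 6, 16, 4, 3   # toy sizes (assumed, not the paper's)
d_single, d_pair = 8, 5

# Stand-ins for PLM outputs: last-block residue states and per-block attention maps.
x_last = rng.standard_normal((L, d_plm))                  # x^(n_PLM)
z_blocks = [rng.standard_normal((L, L, h_plm)) for _ in range(n_blocks)]

W_single = rng.standard_normal((d_plm, d_single))
W_pair = rng.standard_normal((n_blocks * h_plm, d_pair))

# Single representation: linear projection of the final residue states.
single = x_last @ W_single                                # (L, d_single)

# Pair representation: concatenate attention maps of all blocks along the
# head axis, then project each (i, j) feature vector.
z_cat = np.concatenate(z_blocks, axis=-1)                 # (L, L, n_blocks * h_plm)
pair = z_cat @ W_pair                                     # (L, L, d_pair)

print(single.shape, pair.shape)
```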

Geometric Modeling
We employ the EvoFormer and Structure Module proposed by AlphaFold2 [1] to model the relations between the residues and then estimate the 3D coordinates of the atoms in the protein. We slightly modify the original EvoFormer and Structure Module architectures to match our setting. First, the original EvoFormer takes as input the MSA representation and pair representation encoded from the searched MSAs; instead, we take the output of the Adaptor (the single representation x̂^(0) and pair representation ẑ^(0)). Second, the EvoFormer adopts various attention mechanisms to exchange information between the single and pair representations to learn the spatial relationships. Note that, compared with the original EvoFormer of AlphaFold2, we remove the column-wise gated self-attention because HelixFold-Single focuses on MSA-free protein structure prediction and there is no need to exchange messages within the MSAs. We follow the other geometric components of AlphaFold2, including the Structure Module, which takes the single and pair representations yielded by the EvoFormer and exploits Invariant Point Attention and other geometric transformation operators to predict the 3D coordinates of the atoms end-to-end. Also, following AlphaFold2, we recycle the whole Geometric Modeling module to refine the predicted structures iteratively.
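The recycling idea can be sketched as a simple loop; `geometric_model` below is a placeholder callable standing in for the EvoFormer-plus-Structure-Module stack, not AlphaFold2's actual interface, and what gets fed back (representations and coordinates) is shown only in spirit.

```python
def fold_with_recycling(geometric_model, single0, pair0, n_cycles=3):
    """Run the Geometric Modeling stack n_cycles times, feeding each
    iteration's refined representations and predicted coordinates back in
    as extra input, so later cycles can refine earlier predictions."""
    recycled = None
    for _ in range(n_cycles):
        single, pair, coords = geometric_model(single0, pair0, recycled)
        recycled = (single, pair, coords)
    return coords

# Dummy model for demonstration: each cycle nudges a scalar "coordinate"
# halfway toward the value 1.0, mimicking iterative refinement.
def dummy_model(single, pair, recycled):
    prev = 0.0 if recycled is None else recycled[2]
    return single, pair, prev + 0.5 * (1.0 - prev)

print(fold_with_recycling(dummy_model, None, None, n_cycles=3))  # 0.875
```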

Model Optimization
To leverage the domain knowledge in protein databases, we optimize the parameters of HelixFold-Single in two stages.
In the first stage, the PLM is pre-trained to capture the co-evolution information. The PLM is trained with about 300 million single sequences from a protein database. To encourage the PLM to observe diverse single sequences as early as possible, we cluster the proteins by sequence similarity and sample proteins so as to balance the distribution of clusters in the training data. We apply the self-supervised masked language model (MLM) objective to optimize the parameters of the PLM, randomly masking 15% of the residues in each single sequence and then reconstructing the masked residues. More concretely, with the residue at the l-th position x_l masked by x_M, MLM attempts to predict p(x_l | x_1, ..., x_{l−1}, x_M, x_{l+1}, ..., x_L). A crucial premise of this work is that the PLM can learn the dependency between a masked residue and the other residues, and thus represent the co-evolution information.
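The masking step of the MLM objective can be sketched as follows (a minimal version: real pipelines typically also keep or randomly replace a fraction of the selected tokens, which is omitted here):

```python
import numpy as np

def mask_for_mlm(ids, mask_id, mask_rate=0.15, rng=None):
    """Randomly replace ~mask_rate of the residue ids with a [MASK] id and
    return (masked_ids, target_positions) for the reconstruction loss."""
    rng = rng or np.random.default_rng()
    ids = np.asarray(ids)
    n_mask = max(1, int(round(mask_rate * len(ids))))
    positions = rng.choice(len(ids), size=n_mask, replace=False)
    masked = ids.copy()
    masked[positions] = mask_id
    return masked, positions

seq_ids = list(range(20))           # toy sequence of 20 distinct residue ids
masked, pos = mask_for_mlm(seq_ids, mask_id=-1, rng=np.random.default_rng(0))
print(len(pos))                     # 3: 15% of 20 positions are masked
```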
Previous works [6] have verified that PLMs can reveal the secondary structures of proteins, but little has been discussed about the relation between PLMs and co-evolution. Co-evolution is the phenomenon whereby two residues in contact tend to evolve at the same time to preserve the structure, and thus the function, of the protein. In a PLM, if the residue at another position s has a profound impact on the masked residue (i.e., when the residue at position s changes, the masked residue also changes), then those two residues are likely to evolve at the same time.
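To make the notion concrete: in an MSA, co-evolving columns show statistical coupling. A minimal illustration (a toy alignment, not a method used by HelixFold-Single) measures that coupling with mutual information between two columns:

```python
import math
from collections import Counter

def column_mutual_information(col_a, col_b):
    """Mutual information between two MSA columns: high values suggest the
    two positions mutate together, i.e. a co-evolution signal."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in pab.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint / ((pa[a] / n) * (pb[b] / n)))
    return mi

# Toy alignment of six sequences: columns 1 and 2 always mutate together
# (perfectly coupled), while column 3 varies independently of column 1.
col1 = list("AAAEEE")
col2 = list("RRRKKK")   # flips exactly when col1 flips
col3 = list("GLGLGL")   # independent of col1

print(column_mutual_information(col1, col2) > column_mutual_information(col1, col3))  # True
```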
In the second stage, since relying on the PLM alone is inadequate to capture the geometric information, the PLM Base and Geometric Modeling modules of HelixFold-Single are jointly optimized. We utilize about 100 thousand experimentally determined protein structures, plus an additional one million estimated protein structures distilled from AlphaFold2. Following AlphaFold2, we train the network end-to-end with the Frame Aligned Point Error (FAPE) loss as the main loss, together with other auxiliary losses. By combining the computationally efficient PLM Base module (compared with MSA search) and the Geometric Modeling module, HelixFold-Single is capable of providing efficient and precise protein structure prediction.
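A heavily simplified, NumPy-only sketch of the FAPE idea (the real loss operates on predicted backbone frames and includes weighting and normalization details omitted here; the frame handling below is an assumption for illustration):

```python
import numpy as np

def fape(pred_xyz, true_xyz, pred_frames, true_frames, clamp=10.0, eps=1e-4):
    """Simplified Frame Aligned Point Error: express all atom positions in
    each residue's local frame (predicted frames for predicted points, true
    frames for true points), then average the clamped pointwise distances.
    A frame is an (R, t) pair: a 3x3 rotation matrix and a translation."""
    errs = []
    for (Rp, tp), (Rt, tt) in zip(pred_frames, true_frames):
        local_pred = (pred_xyz - tp) @ Rp   # (x - t) @ R applies R^T to (x - t)
        local_true = (true_xyz - tt) @ Rt
        d = np.sqrt(np.sum((local_pred - local_true) ** 2, axis=-1) + eps)
        errs.append(np.minimum(d, clamp))
    return float(np.mean(errs))

I3 = np.eye(3)
frames = [(I3, np.zeros(3))]                  # a single identity frame
xyz = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 2.0]])
print(fape(xyz, xyz, frames, frames))          # ~0.01: only the eps term remains
print(fape(xyz + 3.0, xyz, frames, frames))    # larger error after a rigid shift
```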

Overall Comparison
CASP14 [1,19,20], with 87 domain targets, and CAMEO [21], with 371 targets collected from 2021-09-04 to 2022-02-19, are used to compare the overall accuracy of HelixFold-Single with several baseline structure prediction pipelines, including MSA-based and MSA-free methods. AlphaFold2 [1] and RoseTTAFold [22] are currently the most advanced methods for protein structure prediction, relying on MSAs to provide predictions. We test the accuracy of AlphaFold2 and RoseTTAFold with and without homologous sequences, respectively. A commonly used metric, the TM-score [23], is used to evaluate the accuracy of HelixFold-Single and the other methods. (1) In general, HelixFold-Single significantly surpasses all the MSA-free methods on CASP14 and CAMEO and is competitive with the MSA-based methods in some cases. Notably, the accuracy of HelixFold-Single on CAMEO is comparable to that of AlphaFold2 (input: MSA) and outshines another strong baseline, RoseTTAFold (input: MSA).
HelixFold-Single demonstrates the great potential of incorporating PLM into geometric modeling for protein structure prediction.
(2) HelixFold-Single is on par with the MSA-based methods on targets with large homologous families, e.g., the TBM-easy domain targets in CASP14 with a median of seven homologous sequences, and the CAMEO targets with more than a thousand homologous sequences (MSA depth > 1000). These results indicate that the accuracy of HelixFold-Single is correlated with the richness of homologous sequences, revealing that the large-scale PLM adopted by HelixFold-Single is capable of embedding the information, e.g., co-evolution knowledge, of the MSAs used by the MSA-based methods.
(3) Compared with the other MSA-free methods, HelixFold-Single exhibits great superiority on all the categories of CASP14 and CAMEO. Since AlphaFold2 and RoseTTAFold rely on MSAs as input during training, it is challenging for those methods to provide accurate predictions when taking only single sequences as input.

Effect of Number of Homologous Sequences
The results on CASP14 and CAMEO indicate that the accuracy of HelixFold-Single is related to the number of homologous sequences. We further compare the performance of HelixFold-Single and the other methods on targets with various MSA depths. We collected the targets released between 2020-05 and 2021-10 from the PDB, from which we picked the targets with relatively sparse homologous sequences, and blended those targets with the CASP14 and CAMEO data to form a new evaluation set. Figure 3a compares the TM-scores of HelixFold-Single and the baseline methods on this evaluation set, grouped by the number of homologous sequences (MSA depth). Figure 3b shows the distribution of the proteins across these groups. As the number of available homologous sequences grows, the average TM-scores of both HelixFold-Single and the MSA-based methods increase, while the scores of the other MSA-free methods decrease. For proteins with sparse homologous sequences, the TM-scores of all the compared methods are unsatisfactory. For proteins with larger homologous families, especially those with more than a thousand sequences, HelixFold-Single can compete with the MSA-based methods. Given that 90% of the targets in the PDB have more than 1024 homologous sequences, we can reasonably extrapolate that HelixFold-Single can achieve satisfactory accuracy on the most frequently investigated proteins.
To further investigate the relationship between the capacity of the PLM, the accuracy of protein structure prediction, and the size of the homologous family, we utilize the targets in the CASP14 and CAMEO datasets, as shown in Figure 3c, Figure 3d, and Figure 3e. As expected, Figure 3c shows that a protein's structure accuracy (TM-score) is correlated with the size of its homologous family (MSA depth), and the results are consistent with those in Figure 3b. Besides, we use a probability-based metric, Perplexity [24], to indicate the capacity of the protein language model: if the PLM can predict or reconstruct a protein sequence well, the Perplexity for that target is low. From Figure 3d and Figure 3e, we observe that the Perplexity of the PLM is negatively correlated with both the MSA depth and the TM-score of HelixFold-Single. These results indicate that if the PLM Base module can model a protein sequence well, there is a high probability that it has learned the co-evolution information of this protein and can serve as an alternative to MSAs. The Geometric Modeling module can then leverage the co-evolution information embedded in the PLM to provide a more accurate structure for that protein.
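For reference, Perplexity can be computed from the probabilities the model assigns to the true residues (the standard definition; the exact masking protocol used for evaluation is not restated here):

```python
import numpy as np

def perplexity(probs_of_true_residue):
    """Perplexity of a language model over one sequence: the exponential of
    the mean negative log-probability assigned to the true residue at each
    position. Lower is better; a perfect model scores 1.0."""
    nll = -np.log(np.asarray(probs_of_true_residue))
    return float(np.exp(nll.mean()))

print(perplexity([1.0, 1.0, 1.0]))   # perfect model: perplexity 1.0
print(perplexity([0.05] * 20))       # uniform over 20 residues: perplexity 20.0
```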

Effect of the Sizes of the PLMs
To comprehensively study the ability of PLMs of different sizes to learn the co-evolution information, we compare a pre-trained PLM with 1B parameters (denoted PLM-1B) and another with 100M parameters (denoted PLM-100M). Figure 4a exhibits the Perplexity of PLM-1B and PLM-100M on the targets from the CASP14 and CAMEO datasets.
In general, the smaller the Perplexity, the stronger the capacity of the PLM. Thus, PLM-1B, with more model parameters, performs better than PLM-100M on both the CASP14 and CAMEO datasets. In addition, we apply PLM-1B and PLM-100M to the task of protein residue contact prediction to compare their performance on downstream tasks. We simply fit a logistic regression that takes the attention weights, i.e., [z^(1), z^(2), ..., z^(n_PLM)], from the PLMs as input and predicts residue contacts for the targets in the CASP14 and CAMEO datasets. Following [6,25], we use the top-L/5 long-range contact precision, denoted P@L/5, as the evaluation metric; the results are shown in Figure 4b. PLM-1B is significantly superior to PLM-100M on the contact prediction task. The results in Figure 4a and Figure 4b both support the hypothesis that the larger the PLM, the stronger its capacity. It can therefore reasonably be inferred that the performance of the PLM will continue to improve as its size increases further.

Computational Efficiency
The massive time consumption of MSA searching is one of the bottlenecks of MSA-based folding, and accelerating protein structure prediction can considerably broaden its applications. The MSA-free HelixFold-Single has a tremendous advantage in inference efficiency because it is exempt from MSA searching. Figure 5 exhibits the computation time costs of (1) MSA searching, (2) the whole AlphaFold2 inference pipeline, and (3) inference with HelixFold-Single. All tests are executed on a single NVIDIA A100 (40G) GPU. In general, HelixFold-Single consumes much less time than AlphaFold2, whose pipeline spends most of its time on MSA searching. For proteins of fewer than 100 residues, HelixFold-Single's prediction time is only about one-thousandth of AlphaFold2's. Even for proteins with more than 800 amino acids, HelixFold-Single retains a great efficiency advantage. The high efficiency of HelixFold-Single demonstrates its potential in tasks with a great demand for structure prediction.

Case Study
Most proteins exert their functions by interacting with other molecules. Changes in the structure of a protein, especially in key interacting residues, can significantly affect its biological function. As a result, a protein's function is closely associated with its structure, and accurately predicting the structure facilitates our understanding of its biological role. While AlphaFold2 achieves outstanding accuracy on most protein structure prediction tasks, its performance can still be poor in some situations. Here, we demonstrate that HelixFold-Single complements AlphaFold2 in several of these cases.
Endolysin enzymes from bacteriophages cause bacterial lysis by degrading the peptidoglycan cell wall. The streptococcal C1 phage endolysin PlyC is the most potent endolysin and can rapidly lyse group A, C, and E streptococci. A study of the PlyC structure revealed that key residues, including R66, E36, and R29, are important for the binding of PlyC to its target and hence critical to its function [26]. However, AlphaFold2 failed to produce a reliable structure for this protein (Figure 6(a)), probably due to insufficient co-evolution information extracted from MSAs. In contrast, the structure predicted by HelixFold-Single (Figure 6(b)) more closely resembles the experimentally measured one, likely owing to its minimal dependence on information from MSAs. A similar result is observed for another protein, RoxP. This protein is produced by Cutibacterium acnes, a predominant bacterium on human skin, and was shown to alleviate radical-induced cell damage. The key residues R56, R106, R121, and R123 of RoxP form a positively charged groove, which acts as the binding site for substrate and cofactors [27]. HelixFold-Single accurately predicts the formation of the positively charged groove (Figure 6(d)), which is not observed in the structure predicted by AlphaFold2 (Figure 6(c)). Furthermore, the TM-score of HelixFold-Single for RoxP is much higher than that of AlphaFold2, suggesting an overall better performance of HelixFold-Single in predicting the RoxP structure. Altogether, our case studies indicate that HelixFold-Single outperforms AlphaFold2 in some situations and can be used as a reliable tool to analyze the function of proteins without known X-ray structures.
Related Works

Protein Language Models
Large-scale language models [3] with the self-supervised learning (SSL) paradigm, such as the masked language model (MLM) [4] and auto-regression [26], have achieved extraordinary success in Natural Language Processing (NLP) tasks.
Recent progress has revealed that their capabilities are deeply related to the scale of their parameters: the larger the number of parameters, the better the performance [5]. The community has yet to see signs of this growth stopping as models move from billions to hundreds of billions of parameters. These language models are capable of memorizing and generalizing the massive common-sense knowledge and professional expertise implicitly contained in large-scale unlabeled data. Inspired by these achievements, Protein Language Models (PLMs) transfer language models and SSL tasks to protein modeling. A protein can be represented by an amino acid sequence, similar to a sequence of words or tokens in NLP. Previous works [6,7,8,9] have shown that, by pre-training on single sequences alone without much supervision, protein language models can reveal protein classification, stability, and lower-level structure information (including secondary and tertiary structures and 2D contact maps). However, the accuracy of these models in structure prediction is still far from that of the mainstream folding models supervised by ground-truth protein structures.

Protein Structure Prediction
Mainstream pipelines [27,28,29,30] rely on extracting co-evolution information from Multiple Sequence Alignments (MSAs) to predict protein structures. Earlier works manually designed features derived from MSAs, such as inverse covariance matrices, and then utilized deep neural networks (DNNs), e.g., convolutional networks, to model the relations between the residues. More advanced studies [1,29] directly take the MSAs as input and apply DNNs to predict the 3D coordinates of the proteins. In particular, the appearance of AlphaFold2 [1] has dramatically narrowed the accuracy gap between experimentally determined and model-estimated structures, employing the EvoFormer module to enhance the interaction between MSA sequences and pairwise geometric information, and the Structure Module to directly predict the atoms' coordinates. However, the reliance on MSAs inevitably impedes computational efficiency and the accurate prediction of orphan and designed proteins, as well as downstream tasks such as protein design.
Although the structure of a protein is determined by its primary structure, it is incredibly challenging to train an accurate model that infers protein structures from primary structures alone: only a small number of samples, i.e., the experimentally determined structures recorded in the PDB database, are available for model training. Several works attempt to incorporate protein language models (PLMs) for MSA-free protein structure prediction. RGN2 [10] employs a protein language model (AminoBERT) with a recurrent geometric network that utilizes Frenet–Serret frames to generate the backbone structure. Besides, advanced studies [11,12] combine pre-trained PLMs, such as ProtT5 [7] and ESM-1b [31], with ResNets to predict 2D structures, e.g., the contact map of a protein, yielding superior performance on orphan proteins. Nonetheless, the overall accuracy of these works is still unsatisfactory due to the limited capacity of the model architectures used.

Conclusion and Future Work
On the one hand, mainstream protein structure prediction methods, such as AlphaFold2 and RoseTTAFold, rely on MSAs to extract homologous information; however, searching MSAs is time-consuming, limiting the application of those methods to broader protein-related tasks. On the other hand, a large-scale protein language model learns protein correlations from a great number of unlabeled proteins through self-supervised learning tasks. By utilizing large-scale parameters to embed the homologous information, we show that such a model can serve as an alternative to MSAs and reduce the time consumption of protein structure prediction. HelixFold-Single takes advantage of both the protein language model and geometric modeling, predicting protein structures end-to-end from the primary structure alone. HelixFold-Single is on par with the MSA-based methods on targets with large homologous families and is much more efficient than the MSA-based methods, demonstrating its promise for protein research.
In the future, as the experimental results indicate that a larger PLM achieves superior performance, we will continue investigating larger PLMs for protein structure prediction. In addition, the accuracy on targets with only a few homologous sequences is still unsatisfactory; thus, we will try to introduce more diverse training data to alleviate this problem.

Code Availability
The source code, trained weights, and inference code of HelixFold-Single are freely available at https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single to ensure the reproduction of our experimental results.

DisentangledAttentionTransformer
The performance gain from adding a position-to-residue term is trivial, so we leave the position-to-residue term of DeBERTa out. As a result, we have the DisentangledAttention layer:

q = x_in W_q, k = x_in W_k, v = x_in W_v, p = e_p W_p,
A_{i,j} = q_i k_j^T (a: residue-to-residue) + q_i p_{δ(i,j)}^T (b: residue-to-position),

where the W_* are trainable parameters, e_p is the trainable relative position embedding, and δ(i, j) denotes the relative distance between positions i and j.
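A toy NumPy rendering of these two attention-score terms (the dimensions, clipping range, and absence of score scaling are illustrative assumptions; the softmax and value projection are omitted):

```python
import numpy as np

def disentangled_attention_scores(x, e_p, Wq, Wk, Wp, max_rel=4):
    """Attention logits with a residue-to-residue term plus a
    residue-to-(relative)position term, DeBERTa-style, with the
    position-to-residue term omitted as described above."""
    L = x.shape[0]
    q, k = x @ Wq, x @ Wk                 # residue queries and keys
    p = e_p @ Wp                          # projected relative-position embeddings
    # delta(i, j): relative distance j - i, clipped into [0, 2 * max_rel]
    idx = np.arange(L)
    delta = np.clip(idx[None, :] - idx[:, None], -max_rel, max_rel) + max_rel
    scores = q @ k.T                                  # (a) residue-to-residue
    scores += np.einsum("id,ijd->ij", q, p[delta])    # (b) residue-to-position
    return scores

rng = np.random.default_rng(0)
L, d = 5, 8
x = rng.standard_normal((L, d))
e_p = rng.standard_normal((9, d))         # 2 * max_rel + 1 relative positions
Wq, Wk, Wp = (rng.standard_normal((d, d)) for _ in range(3))
print(disentangled_attention_scores(x, e_p, Wq, Wk, Wp).shape)  # (5, 5)
```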

Geometric Modeling
Since HelixFold-Single takes only the single sequence as input, we slightly modify the architecture of the EvoFormer, removing the column-wise attention. The architecture of the revised EvoFormer is shown in Figure 8.

Figure 1 :
Figure 1: The framework of HelixFold-Single, with a protein language model as the PLM Base, the combination of the EvoFormer and Structure Module of AlphaFold2 as Geometric Modeling, and an Adaptor connecting PLM Base and Geometric Modeling.

Figure 2:
Figure 2 exhibits the test results of our proposed HelixFold-Single and the compared methods on CASP14 and CAMEO. From the results, we have the following observations:

Figure 3 :
Figure 3: Analysis of the impact of homologous sequences (MSA depths) and investigation of the relations between MSA depths, TM-scores, and perplexity of the PLM.

Figure 4 :
Figure 4: Comparison of PLMs of different sizes.

Figure 5 :
Figure 5: Median times of MSA search, AlphaFold2, and HelixFold-Single on proteins with various lengths.

Figure 6 :
Figure 6: HelixFold-Single predicts the PlyC and RoxP structures more accurately than AlphaFold2. PlyC structures predicted by (a) AlphaFold2 and (b) HelixFold-Single are aligned with the reference structure (PDB ID: 7KWT, chain B); RoxP structures predicted by (c) AlphaFold2 and (d) HelixFold-Single are aligned with the reference structure (PDB ID: 7BCJ, chain A). (A–D) Green: structure predicted by AlphaFold2. Magenta: structure predicted by HelixFold-Single. Cyan: reference crystal structure measured by X-ray diffraction (resolution < 1.8 Å). Key residues related to protein function are shown as sticks.