Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment

Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction tasks such as biophysical, structural, and functional properties. Here we show that a method called SPOT-1D-LM combines traditional one-hot encoding with the embeddings from two different language models (ProtTrans and ESM-1b) for the input and yields a leap in accuracy over single-sequence-based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility and contact numbers for all six test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM and CASP14-FM). More significantly, it has a performance comparable to profile-based methods for those proteins with homologous sequences. For example, the accuracy for three-state secondary structure (SS3) prediction for TEST2018 and TEST2020 proteins are 86.7% and 79.8% by SPOT-1D-LM, compared to 74.3% and 73.4% by the single-sequence-based method SPOT-1D-Single and 86.2% and 80.5% by the profile-based method SPOT-1D, respectively. For proteins without homologous sequences (Neff1-2020) SS3 is 80.41% by SPOT-1D-LM which is 3.8% and 8.3% higher than SPOT-1D-Single and SPOT-1D, respectively. SPOT-1D-LM is expected to be useful for genome-wide analysis given its fast performance. Moreover, high-accuracy prediction of both secondary and tertiary structural properties such as backbone angles and solvent accessibility without sequence alignment suggests that highly accurate prediction of protein structures may be made without homologous sequences, the remaining obstacle in the post AlphaFold2 era.

Recently, Alphafold2 has achieved what was thought impossible: predicted protein structures at experimental accuracy for the majority of target proteins in critical assessment of structure prediction techniques (CASP14) 1 . This revolution was built on accumulating improvement in predicting backbone secondary structure 2-6 and residue-residue contact maps [7][8][9] along with advances in deep learning techniques. This success, however, does not mean the protein structure prediction problem is solved, as AlphaFold2 requires a minimum of 30 effective homologous sequences to achieve an accurate structure prediction 1 and a large portion of proteins lacks homologous sequences 10 . Moreover, sequence-homology search requires increasingly intensive computing time. Thus, it is essential to develop accurate structure prediction methods without relying on homologous sequences. To do this, the first step is to develop accurate prediction of protein backbone and other 1D-structural properties for proteins with few homologous sequences.
To date, only a few single-sequence-based methods have been developed for protein secondary structure prediction. Examples are PSIPRED-Single 11 , SPIDER3-Single 12 , ProteinUnet 13 , and SPOT-1D-Single 14 . PSIPRED-Single predicts the secondary structure only while SPIDER3-Single, ProteinUnet, and SPOT-1D-Single predicts secondary structure, Accessible Surface Area (ASA) 15 , Half-Sphere Exposure (HSE) 16 and Backbone torsion angles ( ψ , φ , θ , and τ ). SPIDER3-Single employed iterative learning on a two-layer Bidirectional Long-Short-Term-Memory(LSTM) cells 17 on a training set of approximately 10000 proteins. ProteinUnet followed the same strategy except replacing the Bi-LSTM model with a convolution-based Unet architecture 18  www.nature.com/scientificreports/ similar performance but with a smaller computational requirement. More recently, SPOT-1D-Single improved over all previous predictors by taking advantage of a high sequence identity training set and an ensemble of Convolution and LSTM based hybrid to improve the performance on completely independent test sets. Although these single-sequence models do improve over profile-based methods for proteins with a low number of effective homologous sequences (Neff), there is above 10% gap for those sequences with higher Neff values: 74% for three-state secondary structure prediction, compared to 86% by profile-based techniques 14 .
Recently, unsupervised deep learning methods inspired by Natural Language Processing were introduced to extract features from protein sequences [19][20][21][22] . These methods were trained on extensive protein databases such as Uniref 23 , Uniclust 24 , Pfam 25 , and BFD 26,27 . One state-of-the-art protein language model (LM) is ProtTrans 22 trained on the Uniref50 dataset. It employs a transformer-based auto-encoder model T5 to generate the embedding. Another protein language model also trained on Uniref50 dataset is ESM-1b, which uses a 34 Transformer model 21 . Both ProtTrans and ESM-1b trained downstream models for predicting secondary structure with significant improved performance. However, they did not provide comprehensive benchmark testing. Moreover, they did not examine other one-dimensional properties such as ASA, Half-Sphere Exposure and Backbone torsion angles.
In this work, we explored the combined use of ProtTrans and ESM-1b generated embedding to train a downstream predictor of secondary structure and other 1D tertiary structural properties. We demonstrated that the new alignment-free model can achieve a performance comparable to or better than sequence-profile-based prediction of 1D structural properties for both high and low Neff proteins without searching for homologous sequences.

Results
Feature analysis. Our model was built on a combination of three main input features: single-sequence onehot, ESM-1b, and ProtTrans encodings. We trained three individual neural network models (Two-Layer LSTM, MS-ResNet, and MS-Res-LSTM) with different combinations of these three input features. Results are shown in Fig. 1 for three-state (SS3) secondary structure prediction on independent test sets of TEST2018, TEST2020, www.nature.com/scientificreports/ and Neff1-2020 datasets. The accuracy on the easy TEST2018 (86.5% by three features) is significantly higher than on the hard set TEST2020 (80%) as expected. At the single feature level, both ESM-1b and ProtTrans encodings are significantly better than one-hot encoding in predicting secondary structure. ProtTrans has comparable performance to ESM-1b in TEST2018 but a better performance in the more difficult TEST2020 and Neff1-2020 sets. Adding one-hot encoding to ProtTrans makes marginal improvement over ProtTrans alone on TEST2018 but a larger improvement in TEST2020 and Neff1-2020. On the other hand, adding one-hot encoding to ESM-1b makes a comparable performance on TEST2018 but worse performance on TEST2020. Similar trends are also visible for eight-state (SS8) prediction ( Supplementary Fig. S1). What is the most important is that combining three features makes a consistent improvement in all three datasets and three networks although adding one-hot-encoding only leads to minor improvement over ProtTrans+ESM-1b embedding but is statistically significant (P-value of 1.81 × 10 −2 on all test sets combined). The improvement is the largest for the difficult case (TEST2020 and Neff1-2020). Overall, the three-state (SS3) secondary structure accuracy for all test sets improves 7-10% from 73-76% for a single-sequence-based method (one-hot-encoding) to 80-86% after combining three features. Moreover, the performance on TEST2020 and Neff1-2020 is nearly identical, indicating that, unlike profile-based models, the performance of the current alignment-free method is independent of how many homologous sequences a protein has. Similar trends were observed for eight-state (SS8) structure prediction ( Supplementary Fig. S1), ASA, and HSE (Supplementary Table S1), and backbone torsion angle prediction (Supplementary Table S2).
Ensemble learning. The individual models were combined into an ensemble to further improve model performance. To demonstrate the advantage of ensemble learning over individual models, Table 1 presents the results of the selected three models and the results of the ensemble on TEST2018 and TEST2020. As we can see, for all properties tested, the trends we observed in this table is similar to what we observed in our previous work SPOT-1D-Single. The ensemble performance drops the error for ψ , φ , θ , and τ over the best individual model by 0.90%, 0.89%, 1.20%, and 0.87% on TEST2018, respectively. For secondary structure prediction three-state (SS3) and eight-state (SS8) the ensemble accuracy is 86.74% and 76.47%, which is 0.27% and 0.52% better than the best individual model. Similar improvement is also visible in Pearson's Correlation Coefficient (PCC) for ASA, HSE-U, and CN predictions. The same trend is observed for TEST2020.

Methods comparison.
The performance for three-state (SS3) secondary structure prediction given by our ensemble method named SPOT-1D-LM is compared with four single-sequence-based methods PSIPRED-Single, SPIDER3-Single, ProteinUnet and SPOT-1D-Single along with two profile-based methods SPOT-1D and NetSurfP-2.0 on six different test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM, and CASP14-FM) in Fig. 2. The result confirms a large leap from 72-74% by single-sequence-based methods to 80-86% by alignment-profile-based methods for the prediction accuracy for TEST2018, TEST2020, CASP12-FM, CASP13-FM, and CASP14-FM. The performance of profile-based methods is worse than the performance of single-sequence-based methods only for Neff1-2020, confirming the previous finding that profile-based methods lose their accuracy when lacking homologous sequences. Importantly, our language-model-based method achieves a comparable performance to the profile-based methods for all test sets. Furthermore, it improves over single-sequence-based methods even for Neff1-2020. For example, SPOT-1D-LM performs 0.66%, 1.6%, and 17% better than SPOT-1D, NetSurfP-2.0, and SPOT-1D-Single, respectively, for SS3 prediction for TEST2018. Its performance on TEST2020, CASP12-FM and CASP13-FM is comparable to that of the profile-based SPOT-1D and better than that of the profile-based NetSurfP-2.0. Similar trends are also observed for SS8 prediction, as shown in Supplementary Fig. S2. The improvement over profile-based methods on Neff1-2020 and singlesequence-based methods on all test sets is further confirmed by the statistical significance analysis in Supplementary Table S3. The matching performance of SPOT-1D-LM with profile-based models on backbone torsion  Tables 2, 3 for TEST2020, Supplementary Table S4 for CASP12-FM and Supplementary Table S5 for CASP13-FM. In order to compare to the downstream secondary structure prediction models from ProtTrans and ESM-1b, we compared our model SPOT-1D-LM on CASP12 and NEW364 test set used in ProtTrans, and SPOT-1D-LM outperforms both as shown in Supplementary Table S6. Secondary structure is dominated by local interactions. How about structural properties that are based on tertiary structures? Fig. 3 examines the performance of different predictors for ASA prediction on six different test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM, and CASP14-FM). Again, we observe that profile-based methods perform far better than single-sequence-based methods in ASA prediction except when Neff=1 (Neff1-2020). More importantly, SPOT-1D-LM performs the best for all test sets. It outperforms the profile-based method NetSurfP-2.0 by 4%, 4%, 10%, 0.9% and 9% on TEST2018, TEST2020, Neff1-2020, CASP12-FM, and CASP13-FM, respectively. Comparing to SPOT-1D, its improvement is 3%, 6%, 19%, 1%, and 9%, respectively. Better or comparable performance is also observed for other tertiary structural properties    To enlarge our test sets, TEST2020 contains low-resolution structures. To examine if these low-quality proteins affect our conclusion above, we also obtained TEST2020-HQ. As shown in Supplementary Table S7, we found that the performance on TEST2020 is essentially the same as the performance on TEST2020-HQ for all structural properties predicted. We also compared SPOT-1D-LM to other predictors on CASP14-FM proteins and find that SPOT-1D-LM outperforms other predictors, including profile-based methods for all properties as shown in Supplementary Table S8.

Discussions
In this paper, we have developed a new Language-model-based method for predicting one-dimensional structural properties of proteins, including secondary structure, solvent accessible surface area, and backbone torsion angles. We employed an ensemble of three network variants of ResNet and LSTM models, trained on approximately 40000 proteins with embedding generated from ESM-1b and ProtTrans. The model is then compared to other predictors on independent and non-redundant test sets created by removing any remote homologs (TEST2020, CASP12-FM, and CASP13-FM) or by 25% sequence identity cutoff (TEST2018). The large improvement of our method over any single-sequence-based methods for all structural properties is observed for all six test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM, and CASP14-FM). More importantly, we demonstrated that an alignment-free method can match or improve over an alignment-based method in predicting 1D structural properties, regardless if it is secondary-structure or tertiary-structure-based 1D properties.
The alignment-free method proposed here can skip the intensive computing time required to search for homologous sequences from an exponentially expanding sequence database. As shown in Supplementary  Table S9, generating PsiBlast sequence profiles and HMM models will require 9.3 hours and 6.9 hours, respectively, for 250 proteins in TEST2018 by utilizing 16 cores of Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz machine. After that, it takes additional 0.23 hours by NetSurfP-2.0 or 1.1 hours by SPOT-1D to complete the prediction. SPOT-1D, an ensemble of six different models, uses SPOT-Contact, SPIDER3, CCMpred and DCA as input. This makes the pipeline for SPOT-1D extensively time-consuming. By comparison, SPOT-1D-LM takes a total 0.29 hours on the same 16-core CPU for complete prediction with the same or better accuracy. The singlesequence method SPOT-1D-Single is quicker than SPOT-1D-LM (0.04 hours) but with poorer performance. Moreover, SPOT-1D-LM can complete the whole prediction on a Titan X GPU for 0.04 hours only. Thus, it is now feasible for making highly accurate genome-scale prediction on protein secondary and tertiary structural features.
This method is limited to a protein of ≤1024 amino-acid residues. This should not hamper the analysis of protein sequence features because the largest structural domain found so far contains 692 amino acid residues 28,29 with the majority < 200 residues. Large (long) proteins usually are made of multiple, mostly independent structural domains connected by intrinsically disorder regions. Thus, it is possible to divide a protein into shorter domains prior to make secondary structure or other structural property prediction by using protein domain prediction tools 30 .
The successful matching performance between alignment-free and alignment-based methods highlights the potential of using a similar combination of language models for other structural properties such as proteins intrinsic disorder 31 and distance-based contact maps 8,32 as well as for end-to-end tertiary structure prediction 1,33,34 . In particular, AlphaFold2 has successfully predicted protein structures at an experimental accuracy in CASP14 for those proteins with a minimal of 30 homologous sequences 1 . Our results indicated the possibility that the success of AlphaFold2 can expand to the proteins without homologous sequences by using a combination of language models as input, rather than the multiple aligned homologous sequences as an input 35,36 .

Methods
Datasets. The training and test datasets employed in this work are from our previous work for developing SPOT-1D-Single. Briefly, we started with the dataset prepared by ProteinNet at the highest sequence identity cutoff of 95% according to mmseqs2 tool 37 to maximize the training data. This leads to 50914 proteins submitted to PDB before the year 2016 with resolution < 2.5Å.
To avoid overfitting and achieve an effective validation, we randomly selected 100 proteins one by one from the training set and compared their Hidden Markov Model (HMM) against the HMM of all other proteins in the training set at an e-value cutoff of less than 0.1. Any proteins that were remotely similar to the 100 validation www.nature.com/scientificreports/ proteins were removed from the training set. In addition, we removed any proteins with length more than 1024. This led to 38913 proteins for training and 99 proteins for validation. The first test set employed is TEST2018 5 with 250 proteins released between January 01, 2018 and June 17, 2018 with resolution < 2.5Å and R-free < 0.25 , and have sequence similarity less than 25% to all pre-2018 proteins. We further obtained a hard test set TEST2020 that includes all proteins released between May 2018 and April 2020 with removal of close and remote homologs using HMM models to all proteins released before 2018 on PDB. Due to the limitation of the language model used, we further removed the proteins with lengths greater than 1024. The final TEST2020 contains 671 proteins. A further resolution constraint of < 2.5Å and R-free< 0.25 led to 124 proteins forming TEST2020-HQ. Separating the proteins without homologs (Neff=1) from TEST2020, Neff1-2020 dataset was curated with 46 proteins. Apart from the above-mentioned test sets, we also employed independent test sets CASP12-FM (released in year 2016) CASP13-FM (released in year 2018) and CASP14-FM (released in year 2020). These test sets include 22, 17, and 15 free-modelling proteins released during CASP12, CASP13, and CASP14, respectively. Free modeling targets are those proteins without known structural templates in the protein databank at the time of releases, which are after all proteins in the training and validation sets. Fig. 4, we employed the one-hot encoding from the protein sequence concatenated to the language model embeddings generated using ESM-1b and ProtTrans models. The one-hot encoding has a dimension of L × 20, where L is the length of the protein. The embedding from ESM-1b is generated from a model trained on the Uniref50 dataset and has a dimension of L × 1280. The ProtTrans model was also trained on the Uniref50 and employed the T5-XL model to generate an embedding of dimension of L × 1024. Concatenating all these features yielded the final input features of dimension L × 2324. This input was utilized for both classification and regression models.

Input features. As shown in
Outputs. The outputs of this method have been divided into two main categories: classification and regression. The classification output is extracted from the classification models with 11 output nodes dedicated for protein secondary structure. We use the Dictionary of Secondary Structure of Proteins (DSSP) for assigning three-state (SS3) and eight-state (SS8) secondary structures 38 . We also predicted 1D structural properties which fall under the regression category. They include the solvent accessible surface area (ASA), protein backbone angles ( ψ , φ , θ , and τ ), half-sphere exposures (HSE), and contact number (CN). These outputs are identical to the predictions in our previous methods SPOT-1D 5 and SPOT-1D-Single 14 .
Neural network architecture. The model utilized in SPOT-1D-LM follows the neural networks utilized in SPOT-1D-Single 14 . In brief, we employed an ensemble of three neural network architectures: 2-layer BiLSTM, multi-scale-ResNet (MS-ResNet), and multi-scale-Res-LSTM (MS-Res-LSTM). The ensemble of LSTM-BRNN and ResNet-based models help the models to identify short-and long-range context throughout the sequence 5 . In total, we trained three models to form an ensemble of three for the classification tasks and regression tasks, respectively. Similar to SPOT-1D-Single, both classification and regression models were trained on a batch size of ten with cross-entropy loss and L1-loss, respectively. The ensemble of classification models employed the mean of the classification probabilities from each model. The mean was also employed for the ensemble of the ASA, HSE-u, HSE-d and CN regression models. For the angle prediction, we utilized the median as in SPOT-1D 5 to avoid forbidden angle regions.
The first model we trained is a two-layered bidirectional-LSTM with hidden dimension of 1024 followed by two fully connected layers of size 1000 17 . A dropout rate of 0.5 after each LSTM layer was used to avoid overfitting. The second model we trained is a MS-Resnet, which is made of three parallel stacks of ResNet architectures with a great performance for similar tasks 5,8 . The three stacks differ from each other in terms of the kernel size. The first, second, and third stacks of the ResNets have the kernel sizes of three, five, and seven, respectively. Each parallel stack has 15 blocks of ResNet for which the sizes of convolutional layers vary after every five blocks from www.nature.com/scientificreports/ 64-256. At the end, the output from all three stacks is then concatenated and passed through the output layer.
In every ResNet block, we normalized and activated the output of each convolutional layer by applying batch normalization and ReLU activation function 39,40 . We also applied a dropout rate of 0.5 in each block. The third model we trained is MS-Res-LSTM. This model is a hybrid of the first two models. It includes the MS-ResNet in which one parallel stack of three is replaced by four bidirectional-LSTM layers of a hidden size of 128. The ResNet block stacks have the same configuration as the MS-ResNet stacks with kernel sizes of 5 and 7, respectively. A dropout rate of 0.5 was employed in the bidirectional-LSTM layer.
Performance evaluation. The three-state (SS3) and eight-state (SS8) secondary-structure predictions were evaluated based on the percentage accuracy by concatenating all the proteins together and making an overall assessment. Prediction of ASA, HSE-u, HSE-d, and CN was evaluated by calculating the Pearson's Correlation Coefficient (PCC) between true and predicted values for each protein and then averaged over the whole dataset 41 . To evaluate the model performance for the backbone angles ( ψ , φ , θ , and τ ), we calculate the Mean Absolute Error (MAE) between true angles and predicted angles for the whole dataset concatenated together.