Abstract
Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, substantially outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving overall the best performance among the top teams that participated in the assessment.
Introduction
Protein function prediction is one of the long-standing, fundamental topics of bioinformatics, which involves profiling the activities and interactions of proteins1. Although protein functions are eventually determined by experiments, the experimental effort and expense slow down the process of function discovery, in contrast to the ever-increasing volume of sequenced proteins2. At present, not even 1% of sequenced proteins have functional annotation3. Unlike relatively cheap sequencing technologies, there is a deficit of scalable, high-throughput experimental assays to functionally annotate proteins4. This has led to the demand for in-silico methods of automated protein function prediction5. Protein functions have traditionally been predicted from sequence similarity to known proteins6 and from other characteristics of proteins that can trace functional relevance. Such information includes structural configuration7,8,9, phylogenetic information10,11, domain distribution12,13,14, protein networks3,15, and combinations of multiple sources16,17. Recently, various deep learning-based methods were proposed to learn a functional representation of proteins8,16,18,19,20,21,22. Such methods demonstrated substantial improvement over traditional database search-based methods23,24.
Proteins consist of domains, which are functional and structural units responsible for specific functions and interactions25. Therefore, it is compelling to infer the functions of a protein based on the presence and distribution of the various domains in it. InterPro2GO is an ongoing project that assigns GO annotations to specific domains in the InterPro database, and this annotation is done manually by experts26,27. Although the domain-GO mapping by InterPro2GO provides curated information on protein function, the coverage is severely limited. For example, there are approximately 38k InterPro entries and 48k GO terms, but the current version of InterPro2GO (version-date: 2022/03/16) mapping only includes 16,443 unique InterPro entries and 6,482 GO terms. Despite the lack of annotations, several methods have tried to leverage protein domain information for function prediction. Messih et al. analyzed the recurrence and order of protein domains and their influence on protein functions13. Rojano et al. attempted to associate domains and functions through tripartite graphs14. Besides such domain-focused studies, protein domains have been consistently used as a source of complementary functional information in a number of ensemble methods3,16,17, and some analyses even revealed that domain information is the most crucial one16.
As in many other areas in bioinformatics, deep learning has been applied for function prediction from domain information. However, the effective use of domains is critically constrained by low coverage of functional assignments, high dimensionality, and acute data imbalance. For instance, in a recent competitive deep-learning-based model, DeepGOZero22, a 26,406-dimensional input of InterPro feature vectors was reduced to 1024 dimensions using a single multi-layer perceptron (MLP) layer, which results in considerable information loss. A similar situation is observed in DeepGraphGO21 as well.
Here, we introduce Domain-PFP, a protein function prediction method that uses functional representation of proteins through domain-GO association learned by a self-supervised method from protein databases. Self-supervised learning is based on the idea of leveraging the inherent co-occurrence relationship of complementary information in the data to learn new labels in a semi-automatic process28. We used self-supervised learning because it can directly learn domain and GO co-occurrence from abundant protein sequences and is able to alleviate the problem of current domain databases, where many domains do not have function annotation. Following the underlying concepts of self-supervised learning, we first learned pseudo-labels of GO prediction probability from individual domain terms. Then, we derived the dense representation of domains consistent with functional information to characterize protein sequences and used the representation to predict protein functions. The embeddings learned both at the domain and protein level have turned out to be functionally meaningful as the embedding distance showed substantial negative correlations with functional similarity29 of GO terms that are present in the domains and protein sequences. Moreover, a systematic comparison with large-scale Protein Language Model (PLM) representations30,31, which use variants of Transformers32 and BERT33 architectures, and have demonstrated success in function prediction34,35, revealed that our embeddings are more applicable for function prediction, despite being a fraction of the aforementioned PLM complexity. This improvement is further vividly observed in challenging cases of predicting rare and more specific functions. In addition, using a straightforward K-Nearest Neighbors (KNN) model with the learned embeddings along with sequence similarity and interaction information, Domain-PFP remarkably outperforms more complex state-of-the-art methods. Most notably, Domain-PFP achieved an increase in the area under precision-recall curve (AUPR) by 2.43%, 14.58%, and 9.57% over the state-of-the-art method for molecular function (MF), biological process (BP), and cellular components (CC), respectively. Domain-PFP has also demonstrated competitive performance when compared with top-scoring methods in the CAFA3 evaluation36.
Results
Dataset of domains and GO annotations
We collected 568,002 protein sequences from Swiss-Prot (release 2022_3)37 and assigned InterPro domains using InterProScan 5 REST API38. Despite InterPro maximizing domain coverage by combining entries from 13 databases, 36,403 proteins had no InterPro annotations, so we discarded them. Concurrently, we collected GO terms for protein sequences from UniProt. We considered both experimentally and computationally assigned functions since IEA (Inferred from Electronic Annotation) terms demonstrated increased accuracy in our previous works6. We also propagated the parent GO terms using the core ontology release 2021-01-01. In summary, our dataset contained 531,599 proteins with 32,471 unique domains and 33,199 unique GO terms (8,297, 21,805, and 3,097 MF, BP, and CC terms, respectively).
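As an illustration of the parent-term propagation step, the sketch below propagates annotations up the GO hierarchy, assuming the ontology is available as a child-to-parents mapping (e.g., parsed from the 'is_a' relations of the core ontology file); the function and variable names are illustrative placeholders, not part of the Domain-PFP code base.

```python
def propagate_go_terms(annotated_terms, parents):
    """Propagate GO annotations of a protein to all ancestor terms.

    annotated_terms: set of GO IDs directly annotated to the protein.
    parents: dict mapping a GO ID to the set of its direct parent GO IDs.
    """
    propagated = set(annotated_terms)
    stack = list(annotated_terms)
    while stack:
        term = stack.pop()
        for parent in parents.get(term, ()):  # walk up the DAG
            if parent not in propagated:
                propagated.add(parent)
                stack.append(parent)
    return propagated

# Toy example: GO:C is_a GO:B, GO:B is_a GO:A
parents = {"GO:C": {"GO:B"}, "GO:B": {"GO:A"}}
print(propagate_go_terms({"GO:C"}, parents))  # {'GO:C', 'GO:B', 'GO:A'}
```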
Self-supervised learning for domain-GO embeddings
Using the domain and GO term assignments to protein sequences, we computed the conditional probability that a protein containing \({domain}_{i}\) has the \({GO}_{j}\) function:

$$p\left({GO}_{j}\,|\,{domain}_{i}\right)=\frac{p\left({domain}_{i}\cap {GO}_{j}\right)}{p\left({domain}_{i}\right)} \qquad (1)$$

Here, \(p\left({domain}_{i}\right)\) represents the probability of a protein containing \({domain}_{i}\), while \(p({domain}_{i}\cap {GO}_{j})\) represents the joint probability of a protein with \({domain}_{i}\) performing \({GO}_{j}\). We can calculate both probabilities from the co-occurrence relationships of domains and GO terms in the dataset by counting the occurrences. These probabilities serve as the pseudo-labels, or target function, for our self-supervised learning method.
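The pseudo-labels of Eq. (1) can be obtained by simple counting, as in the following sketch; protein2domains and protein2gos are hypothetical dictionaries holding the Swiss-Prot-derived domain and GO assignments.

```python
from collections import Counter

def domain_go_conditional_probs(protein2domains, protein2gos):
    """Estimate p(GO_j | domain_i) from domain-GO co-occurrence counts (Eq. 1)."""
    domain_count = Counter()  # number of proteins containing each domain
    pair_count = Counter()    # number of proteins containing domain i AND annotated with GO j
    for protein, domains in protein2domains.items():
        gos = protein2gos.get(protein, set())
        for d in set(domains):
            domain_count[d] += 1
            for g in set(gos):
                pair_count[(d, g)] += 1
    return {(d, g): n / domain_count[d] for (d, g), n in pair_count.items()}

# Toy example with two proteins
p2d = {"P1": {"IPR000010"}, "P2": {"IPR000010", "IPR000690"}}
p2g = {"P1": {"GO:0004869"}, "P2": {"GO:0005634"}}
print(domain_go_conditional_probs(p2d, p2g))
```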
Our ultimate goal is to predict protein functions. To achieve this, we aim to develop a representation of domains that, in conjunction with a learned representation of GO terms, is consistent with the domain-GO co-occurrence conditional probability. In other words, we seek to design two representations or embeddings, \(\phi\) and \(\psi\), which separately represent domains and GO terms, respectively, and a bivariate function \(f\) that maps them to the conditional probability of the co-occurrence of any \({domain}_{i}\) and \({GO}_{j}\):

$$f\left(\phi \left({domain}_{i}\right),\psi \left({GO}_{j}\right)\right)=p\left({GO}_{j}\,|\,{domain}_{i}\right) \qquad (2)$$
In our case, we utilized two 256-dimensional embedding matrices, \(\phi\) and \(\psi\), as representations for domains and GO terms, respectively. The bivariate function \(f\) was modeled as a two-layer densely connected network that takes the Hadamard product of \(\phi ({domain}_{i})\) and \(\psi ({GO}_{j})\) as input, projects the values into a 128-dimensional space, and finally predicts the conditional probability \(p({GO}_{j}|{domain}_{i})\). The network architecture is presented in Fig. 1a, where the function \(f\) is represented as an array of circles in light blue. Concretely, \(f\) takes the following form:

$$f\left(\phi \left({domain}_{i}\right),\psi \left({GO}_{j}\right)\right)={W}_{2}\,{ReLU}\left({W}_{1}\left(\phi \left({domain}_{i}\right)\odot \psi \left({GO}_{j}\right)\right)+{b}_{1}\right)+{b}_{2} \qquad (3)$$

where (W1, b1) and (W2, b2) are the weights and biases of the first and the second layer of the network, respectively, and \(\odot\) denotes the Hadamard product of the two embeddings. The network is regularized by dropout, and the domain embedding matrix \(\phi\) is further regularized by the L1-norm to impose sparsity. The \(\phi\) embedding for each domain, as well as the \(\psi\) embedding for each GO term, were learned through backpropagation with the mean squared error (MSE) loss using the Adam optimizer with default settings39. We intended to keep the function \(f\) simple so that the domain embeddings could effectively learn functional relevance, rather than letting \(f\) learn the correlation between domain and GO term co-occurrence. This is inspired by a recent work, which demonstrated that a strong encoder in conjunction with a weak decoder results in a strong representation learner40. The function \(f\) provides the association probability between a domain and a GO term (Eq. 1), which we name DomainGO-prob. We trained three different versions of DomainGO-prob for the three sub-ontologies, MF, BP, and CC, respectively.
The overall pipeline for learning the domain embeddings is summarized in Fig. 1b. We started by collecting annotated protein sequences from Swiss-Prot, along with domain and GO term assignments. Domains were obtained from InterProScan, while GO terms were collected from Swiss-Prot. Next, we calculated the conditional probabilities of all the domain-GO associations by counting their co-occurrences in the dataset. Finally, the domain embeddings (\(\phi\)) and GO term embeddings (\(\psi\)) were computed using the network shown in Fig. 1a. The network was trained and validated on the aforementioned dataset of 32,471 unique domains and 33,199 unique GO terms. We randomly selected 80% of the domain-GO pairs for training and used the remaining 20% for validation. Three different models, i.e., three different sets of embeddings were developed for the three sub-ontologies. The details of the network training are described in the Methods.
Predicting GO terms for a query protein (Domain-PFP)
Using the computed domain embeddings, we represented a protein, which may be composed of several domains, as the average of the embeddings of all the domains in it. This is similar to how PLMs encode proteins by averaging the individual residue-level representations31. For a protein \({P}_{k}\) with domains \({d}_{{P}_{k}}\), the protein embedding is computed as

$$D\left({P}_{k}\right)=\frac{1}{\left|{d}_{{P}_{k}}\right|}\sum _{{domain}\in {d}_{{P}_{k}}}\phi \left({domain}\right) \qquad (4)$$
With the protein embedding, we can use supervised classifiers to infer protein functions. Here, we used a KNN classifier, following the convention of BLAST or PPI network scoring16,21. KNN models using protein language models have also been shown to be on par with top methods of Critical Assessment of Functional Annotation 3 (CAFA3)34. The confidence score of annotating a protein \({p}_{i}\) with the GO term \(G{O}_{j}\), \({S}_{D}({p}_{i},G{O}_{j})\) is computed as follows:
where \({K}_{{neigh}}\) is a neighborhood of \(K\) proteins, and \(I({p}_{k},G{O}_{j})\) is 1 if the protein \({p}_{k}\) is annotated with \(G{O}_{j}\), and 0 otherwise.
The steps of computing protein embeddings and predicting functions are outlined in Fig. 1c. For a given query protein sequence, domains are assigned using InterProScan38, and their individual domain embeddings are obtained. The embedding of the query protein is then computed by taking the average of the assigned domain embeddings (Eq. 4). Finally, the protein embedding is used to find known proteins that are close in the embedding space (Eq. 5) using a supervised classifier (KNN for our approach) to infer its functions.
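To make the prediction step concrete, below is a minimal sketch of Eqs. (4) and (5) under the simplifying assumption that each neighbor contributes equally; the actual Domain-PFP classifier is a weighted KNN, and all variable names (phi, train_embs, train_gos) are illustrative placeholders.

```python
import numpy as np

def protein_embedding(domains, phi):
    """Average the 256-dimensional embeddings of the domains assigned to a protein (Eq. 4)."""
    return np.mean([phi[d] for d in domains], axis=0)

def knn_go_score(query_emb, train_embs, train_gos, go_term, k=1000):
    """Fraction of the k nearest training proteins (Manhattan distance) annotated with go_term.
    This is a simplified, unweighted reading of the neighborhood score S_D (Eq. 5)."""
    dists = np.abs(train_embs - query_emb).sum(axis=1)   # Manhattan distances
    neighbors = np.argsort(dists)[:k]                    # indices of the K nearest proteins
    return np.mean([go_term in train_gos[i] for i in neighbors])
```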
Correlation of embedding distance and functional similarity of domains and proteins
To start with, we analyzed how the distance between domain embeddings correlates with the functional similarity of domains and proteins. Having functionally similar domains close in the embedding space is essential for the embeddings to be useful for function prediction. As the measure of embedding distance, we adopted the Manhattan distance, as it has been argued to be more meaningful in high-dimensional spaces than, for example, the Euclidean distance41. As for functional similarity, we computed the Jaccard Index following a previous work3. For a domain, we considered a GO term to be assigned to the domain if its conditional probability is no less than 0.5, i.e., \({GO\, Terms}=\{{GO}_{i}:p({GO}_{i}\big|{domain})\ge 0.5\}\). The sets of assigned GO terms for domains A and B are denoted as \({GO\,Terms}_{{domain}\,A}\) and \({GO\,Terms}_{{domain}\,B}\) in the following equation. The Jaccard Index for two domains, A and B, is defined as

$${Jaccard}\left(A,B\right)=\frac{\left|{GO\,Terms}_{{domain}\,A}\cap {GO\,Terms}_{{domain}\,B}\right|}{\left|{GO\,Terms}_{{domain}\,A}\cup {GO\,Terms}_{{domain}\,B}\right|} \qquad (6)$$
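For concreteness, the two quantities compared in this analysis can be computed as in the sketch below, where domain_go_prob is a hypothetical dictionary of predicted \(p({GO}|{domain})\) values and phi holds the learned domain embeddings.

```python
import numpy as np

def assigned_gos(domain, domain_go_prob, threshold=0.5):
    """GO terms associated with a domain at conditional probability >= 0.5."""
    return {go for (d, go), p in domain_go_prob.items() if d == domain and p >= threshold}

def jaccard(domain_a, domain_b, domain_go_prob):
    """Functional similarity of two domains (Eq. 6)."""
    a = assigned_gos(domain_a, domain_go_prob)
    b = assigned_gos(domain_b, domain_go_prob)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def manhattan(domain_a, domain_b, phi):
    """Embedding distance between two domains."""
    return np.abs(phi[domain_a] - phi[domain_b]).sum()
```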
We randomly selected 100,000 pairs of domains and computed their functional similarity relative to the embedding distance in Fig. 2a. Domain functional similarity was computed separately for each of the three GO categories. Overall, a negative correlation was observed between the embedding distance and functional similarity for all three GO categories. Substantial Jaccard Index values, such as those over 0.5, were observed mainly for domain embedding pairs that were close in distance, for example, <10. Almost all domain pairs with a large distance, for example, a distance of 20 or higher for MF and CC and over 10 for BP, had a small functional similarity value of <0.2. A perfect Jaccard Index of 1.0 was only observed for domain pairs with a relatively small embedding distance. Thus, it is evident that our model generates similar embeddings for functionally similar domains.
We have also examined protein-level functional similarity relative to the embedding similarity (Fig. 2b). As the measure of the functional similarity of proteins, which are annotated by multiple GO terms in the three categories, we used the \({funSim}\) score42. \({funSim}\) essentially computes the average of semantic similarity of best matching GO terms from two proteins for each GO category, and then averages the score over the three GO categories (for the concrete definition, see Methods). \({funSim}\) score ranges from 0 to 1 with 1 as the maximum score.
We took 1,000,000 random pairs of proteins, computed their embeddings for MF, BP, and CC separately, and concatenated them to obtain the overall embedding. In Fig. 2b, the mean \({funSim}\) score of protein pairs was plotted relative to the Manhattan distance of the protein embeddings. We can see the overall trend that the \({funSim}\) score drops as protein embeddings become more distant from each other. Large \({funSim}\) scores were observed only for close protein embeddings, e.g., a Manhattan distance of <5.
Overall, in this section, we confirmed that functionally similar domains and proteins are placed close to each other in the embedding space.
Learning InterPro2GO annotations
Next, we examined how well our domain embeddings align with expert-curated GO mappings of InterPro2GO. For this analysis, we used the InterPro2GO mapping of version-date 2022/03/1643, which comprises 35,046 mappings between 16,443 unique InterPro domain entries and 6,482 unique GO terms. We considered 34,832 InterPro2GO annotations, excluding 214 mappings with domains or GO terms that are not included in our dataset.
For all the domain-GO pairs in the InterPro2GO mappings, we predicted with DomainGO-prob (Fig. 1a) the conditional probability that the GO term exists in the domain. The results are shown in Fig. 2c (orange bars). For over 80% of the cases, existing GO term-domain associations received a high score of over 0.9 (the rightmost bar) for all three GO categories. Thus, DomainGO-prob was able to recover curated GO term-domain associations through the self-supervised learning protocol, which learns the associations from domain-GO co-occurrences in full protein sequences.
We further conducted an adversarial version of this analysis. Namely, to test the ability to generalize from the context of related, co-occurring domains and GO terms alone, we removed all the probability values of domain-GO pairs that exist in the InterPro2GO mapping and then re-trained the DomainGO-prob models. Formally, from the original dataset \({\mathcal{D}}=\{({domain}_{i},{GO}_{j})\}\) we constructed a new dataset \({\mathcal{D}}^{\prime}=\{({domain}_{i},{GO}_{j}):({domain}_{i},{GO}_{j})\,\notin \,{\rm{InterPro2GO}}\}\). With this dataset \({\mathcal{D}}^{\prime}\), we re-trained the DomainGO-prob models and examined the conditional probability of GO terms that exist in InterPro2GO. The results are represented by the blue bars in Fig. 2c. Under this setting, DomainGO-prob predicted a score >0.5 for 66.5%, 81.9%, and 86.5% of the MF, BP, and CC associations, respectively. Thus, even without explicit knowledge, DomainGO-prob was able to extract the meanings of domain-GO relationships solely from the contextual information of co-occurrences and hierarchies. Among the three GO categories, the count of domain-GO associations in the highest probability bin (0.9 to 1.0) decreased most for MF terms when compared with the results with the full training data \({\mathcal{D}}\) (orange bars). This is probably because MF terms (e.g., enzymatic functions) are associated with a domain at the residue level, unlike BP and CC terms, which are more contextual44.
Examples of domain-GO associations learned by the network
In this section, we discuss several examples that illustrate how DomainGO-prob learns domain-GO associations. We used the aforementioned adversarial version, i.e., the model trained with \({\mathcal{D}}^{\prime}\), and examined how the model likely learned the GO terms solely from the co-occurrence of different domains. The examples show that DomainGO-prob recovered the correct domain-GO relationships in InterPro2GO from other domain-GO associations in a way that is consistent with the hierarchical and associative relationships of domains and GO terms. This is analogous to the way grammatical structures and word relations aid masked language modeling33 in NLP.
The first example (Fig. 3a) is IPR000010, a domain in cysteine protease inhibitors45, which was annotated with GO:0004869 (cysteine-type endopeptidase inhibitor activity) and GO:0004866 (endopeptidase inhibitor activity), both with a high probability of 0.9 by DomainGO-prob. The GO term GO:0004869 in the MF category represents binding to and preventing the activity of a cysteine-type endopeptidase. Looking at the domain structure, IPR000010 has three subdomains, among which only one, IPR025764, a Fetuin-B-type cystatin domain, has a GO term annotated with experimental evidence, GO:0004866 (endopeptidase inhibitor activity)46. From this domain hierarchy, in addition to GO:0004866, its child term GO:0004869, which has an 'is a' relationship with GO:0004866, was correctly transferred to IPR000010 by DomainGO-prob.
The second example is a recovered annotation of a CC term, GO:0005634, which represents nuclear localization, with a probability of 0.99 assigned to IPR000690. In InterPro2GO, GO:0005634 is the only CC term associated with this domain. IPR000690 is the Matrin/U1-C, C2H2-type zinc finger, which co-occurs with the homologous superfamily IPR036236 in 86.6% of protein sequences (Fig. 3b). IPR036236 is the zinc finger C2H2-type superfamily, and 98.6% of its proteins are annotated with the CC term for nuclear localization. For instance, the protein A5PJN8 contains both InterPro entries and is also annotated with GO:0005634. Therefore, DomainGO-prob extracted the CC term from the co-occurrence of these domains in proteins and correctly annotated IPR000690.
The next example in Fig. 3c illustrates the transfer of a GO term from multiple co-occurring domains in proteins. DomainGO-prob annotated IPR000081 with the function GO:0016032 (viral process) with a probability of 1.0, which refers to a multi-organism process by a virus. All proteins with this domain (for example, P03303) also have domains IPR007094 (encoded in RNA-containing viruses), IPR001205 (found in RNA viruses), IPR000605 (found in DNA viruses), IPR029053 (forms icosahedral virus shell), IPR002527 (alters membrane permeability), IPR014838 (poliovirus replication) or 8 other domains related to various viral activities. Although not all such co-occurring domains have exactly GO:0016032, they all have related terms, such as GO:0039694 (viral RNA genome replication). Therefore, DomainGO-prob was able to learn the viral process function GO:0016032 by combining such supplementary information.
Some domains are responsible for multiple different functions. For example, the domain IPR000081, which was analyzed for viral activity in the previous example, was also correctly assigned proteolysis (GO:0006508) by DomainGO-prob with a predicted probability of 1.0 (Fig. 3d). However, this information was not learned from the aforementioned co-occurring domains, but rather from the homologous superfamily IPR009003 (Peptidase S1, PA clan), of which all proteins with IPR000081 are a part. For example, the protein P06209 not only contains the domain IPR000081 but is also a member of the IPR009003 homologous superfamily. Cysteine peptidases from IPR009003 hydrolyze a peptide bond using the thiol group47 and thus have the GO:0006508 function, which was transferred to IPR000081. It should be noted that despite IPR009003 completely overlapping with IPR000081, DomainGO-prob did not associate IPR009003 with viral activity (GO:0016032). For the domain IPR009003, DomainGO-prob predicted a small probability of 0.33 for GO:0016032 (viral process), which was likely induced by the several co-occurring domains involved in viral activities, for example, IPR007094, IPR002527, and IPR014838. On the contrary, the actual function of IPR000081, i.e., GO:0006508, was predicted with a probability of 1.0. Therefore, in this example, DomainGO-prob was capable of discriminating between complementary sources of information.
There are cases where DomainGO-prob failed to associate GO terms with domains. For instance, in Fig. 3e, IPR000174 represents two different chemokine receptors from the CXC family, namely CXCR1 and CXCR248. Therefore, proteins from this family are annotated with the function GO:0016494 (CXC chemokine receptor activity). Since chemokine receptors are part of the G protein-coupled receptor (GPCR) family, proteins from the IPR000174 family (for example, P21109) are also members of IPR000276 (G protein-coupled receptor, rhodopsin-like) and have the IPR017452 (GPCR, rhodopsin-like, 7TM) domain. Although this context provides information about the GPCR family, it is difficult to narrow it down to the CXC family without more specific individual information, which is absent in our adversarial mode of training. As a result, DomainGO-prob predicted a low score of 0.34 for GO:0016494 but managed to assign GO:0004930 (G protein-coupled receptor activity) to IPR000174 with a probability of 0.95 from the co-occurrence of IPR017452.
Comparison with large protein language models in GO function prediction
We evaluated the performance of the DomainGO-prob embedding in comparison with 12 large Protein Language Models (PLMs), following the benchmark study performed by Unsal et al.49. The 12 PLMs we compare against are ProtT5-XL30, ProtALBERT30, SeqVec50, ProtBERT-BFD30, ESM-1b31, ProtXLNet30, TAPE-BERT-PFAM51, CPCProt52, MSA-Transformer53, UniRep54, Learned-Vec55, and ProtVec56. These PLMs were trained on unsupervised tasks, such as predicting a segment of masked residues given the rest of the protein30 or predicting the next residue from all the residues before it50, on large protein sequence datasets, e.g., the entire UniProt. Supplementary Table 1 summarizes how these PLMs were trained.
To use a PLM for GO prediction, Unsal et al. converted the residue-level embeddings to protein-level embeddings by averaging them along the residues and used a linear Support Vector Machine model. The benchmark by Unsal et al. was performed on the PROBE benchmark dataset they constructed, which provides GO terms of different difficulties to predict. In the PROBE dataset, GO terms are divided into three categories based on their frequency in the benchmark dataset (low, middle, and high, having 2–30, 100–500, and >1000 annotated proteins, respectively) and their specificity (shallow, normal, and specific, corresponding to terms within the top 1/3, the top 2/3, and the remaining depth of the ontology, respectively). Therefore, based on frequency and specificity, 3 × 3 = 9 groups of GO terms can be constructed for each of the three GO categories, i.e., 3 × 9 = 27 groups in total. Among them, as there were no GO terms that fall under the high-specific group, the benchmark ended up with 25 groups. For each group, at most 5 GO terms were selected based on dissimilarity according to Lin's similarity measure42, which resulted in a total of 117 GO terms to predict. The PROBE dataset contains 19,995 human proteins clustered at a 50% identity cutoff with only experimental GO annotations. The human proteins falling under these criteria were used for benchmarking GO function prediction in a 5-fold cross-validation test. Unsal et al. provided a convenient CodeOcean distribution (https://PROBE.kansil.org, version November 3, 2022), where, given the embeddings of the test proteins, GO predictions are made and the performance is evaluated on the PROBE dataset. We used it to test our DomainGO-prob-based protein embedding (Eq. 4).
For this benchmark, we trained the domain and GO embeddings (Eq. 2) for the three GO categories separately on Swiss-Prot, after removing all the human proteins. We removed these proteins to avoid overlap between the test proteins and the proteins used for training. However, note that the PLMs we compared against almost certainly had these human proteins in their training sets, as they were trained on entire public protein sequence datasets. Since our embedding dimension is only 256, which is small compared to those of the PLMs, we concatenated the embeddings from the three GO categories and performed mean normalization to balance them. This resulted in a 768-dimensional protein embedding vector, as follows:
Here, \({D}_{{MF}}(p),{D}_{{BP}}(p),{D}_{{CC}}(p)\) are the computed embeddings of a protein \(p\) for the MF, BP, and CC sub-ontologies, respectively (using Eq. 4).
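As a concrete illustration of this concatenation, the sketch below builds the 768-dimensional vector. Here, "mean normalization" is interpreted as centering each sub-embedding by its mean, which is only an assumption for illustration; the exact operation is defined by Eq. (7).

```python
import numpy as np

def probe_embedding(d_mf, d_bp, d_cc):
    """Concatenate the MF, BP, and CC protein embeddings (256-d each) into a 768-d vector.
    Mean normalization is sketched here as subtracting the mean of each sub-embedding,
    which is an assumption made for illustration only."""
    parts = [np.asarray(v, dtype=float) for v in (d_mf, d_bp, d_cc)]
    return np.concatenate([v - v.mean() for v in parts])
```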
The results are presented in Fig. 4a, where we compared the GO prediction performance of our model (Eq. 7) with the 12 PLMs on the PROBE benchmark. The numerical values are provided in Supplementary Table 2. Our model based on DomainGO-prob outperformed all the PLMs in all three categories. For MF, BP, and CC, DomainGO-prob achieved 0.02, 0.06, and 0.06 higher weighted F1 scores, respectively, than ProtT5-XL, the previous top method, and a 0.04 higher score when averaged across the three GO categories. Notably, this improvement was obtained by adopting a functionally informed learning protocol in a much simpler model with 768-dimensional embeddings and merely a fraction of the parameters of the PLMs. As shown in Supplementary Table 1, ProtT5-XL has 1024-dimensional embeddings and was trained with a network with 3 billion parameters, while the three networks we used (Eq. 7) have only 31 million parameters in total.
An important consideration when training machine learning models for protein analysis is to remove redundancy, i.e., sequences in the training set that are similar to the test set. Therefore, although we had already omitted human protein sequences from our training dataset, we retrained our models after removing training proteins with more than 75%, 50%, and 25% sequence identity to the test set using MMseqs257. The results are shown in Fig. 4b, in comparison with ProtT5-XL. As expected, the F1 score decreased slightly as more sequences were removed. However, this is most likely because removing proteins also removes some domain-GO association information. Nevertheless, in all cases, our embedding performed better than that of ProtT5-XL, even at the identity cutoff of 25%.
In Fig. 4c, we examined how the performance changed when considering GO terms of varying levels of difficulty. Weighted F1 scores for the different GO groups in the PROBE benchmark, classified by GO depth and frequency, are shown separately. As we move from high- to low-frequency or from shallow to specific GO terms, the classification task becomes more difficult. We compared our model's performance with the best-performing PLM, ProtT5-XL. Even in this evaluation, it was evident that our model substantially outperformed ProtT5-XL. Interestingly, the margin of advantage of our model increased as we considered more difficult GO groups. In most of the easier cases, DomainGO-prob was similar to or slightly better than ProtT5-XL, whereas in difficult cases, a substantial improvement was observed. For example, for low-frequency and specific CC terms, DomainGO-prob was 20% better. It is apparent that although the PLM was able to comprehend frequent GO terms from unsupervised learning on a large volume of protein sequence data, it failed to account for rare GO terms and suffered from limited specificity. On the contrary, our self-supervised learning approach seemed to decipher the functional identity of proteins better, largely regardless of the rarity and specificity of the GO terms.
GO function prediction by Domain-PFP in comparison with existing methods
Subsequently, we benchmarked the GO prediction performance of Domain-PFP (Eq. 5) on the NetGO dataset16 to compare it with state-of-the-art protein function prediction methods from recent literature. We used the split of the NetGO dataset into training, validation, and test sets provided by the authors of DeepGOZero22, who followed the data split protocol of NetGO2.016. The NetGO dataset consists of 64,279, 91,443, and 83,004 proteins for the MF, BP, and CC categories, respectively, with a specified training, validation, and test set split (Supplementary Table 3). We trained DomainGO-prob on the NetGO training dataset and created a weighted K-Nearest Neighbor (KNN) model based on the learned embedding. The numbers of neighbors K used for MF, BP, and CC were 1000, 800, and 1200, respectively, which were tuned based on the performance on the validation split of the NetGO benchmark (Supplementary Fig. 1). The performance of the various models on the test dataset is presented in Table 1. The evaluation results of the existing methods, from BLAST to NetGO2.0 (Server) in Table 1, were taken from the DeepGOZero paper22.
DeepGOPlus18 infers protein functions through a combination of DiamondBLAST58 and DeepGOCNN, which employs a 1D convolutional neural network to predict GO terms from the amino acid sequence. TALE + 19 similarly fuses DiamondBLAST with a sequence representation learned by a Transformer. Other top-performing methods are either based on domain information or use it as a component. For instance, DeepGOZero22 leverages a model-theoretic approach to predict ontologies from InterPro domains, which can be further improved by incorporating DiamondBLAST. DeepGraphGO21 associates InterPro features with protein-protein interaction (PPI) networks employing a graph convolutional neural network. NetGO2.016 is an all-encompassing ensemble method that incorporates BLAST, domain, PPI, GO term frequency, PubMed publications, and sequence information both in the form of k-mers and embeddings. Among the existing methods, NetGO2.0 has shown the highest evaluation values for MF and the best Smin value for BP36 (note that the NetGO2.0 results are from the current server, run by the authors of DeepGOZero for their paper).
In the latter half of Table 1, we show the results of Domain-PFP and of Domain-PFP variants that incorporate BLAST and PPI information, to compare with the other state-of-the-art methods that combine diverse information sources. The scores of GO term \(j\) for protein \(i\) from BLAST and PPI information are defined as
Here, \(B({p}_{i},{p}_{k})\) and \(\omega ({p}_{i},{p}_{k})\) are the bit-score from DiamondBLAST with ‘more-sensitive’ setting58, and edge weight from STRING PPI network (ver. 11.0)59, respectively. We used the same STRING version as DeepGraphGO21.
The final score is a simple average of the terms from the three sources:
\({I}_{B}\) and \({I}_{N}\) are indicator functions, which return 1 if BLAST and STRING network matches are found for the protein \(i\), respectively, and 0 otherwise.
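A minimal sketch of how the three information sources can be combined is shown below. The weighted neighbor-voting form used here for the BLAST and PPI scores (a standard BLAST-KNN-style formulation) and the equal averaging over the available sources are our reading of the definitions above and should be treated as assumptions; all function and variable names are illustrative.

```python
def weighted_transfer_score(neighbors, weights, annotations, go_term):
    """Weighted fraction of neighbor proteins annotated with go_term.
    For BLAST, weights are Diamond bit-scores B(p_i, p_k); for PPI, STRING edge weights
    w(p_i, p_k). The exact weighting used in the paper may differ (assumption)."""
    total = sum(weights[p] for p in neighbors)
    if total == 0:
        return 0.0
    return sum(weights[p] for p in neighbors if go_term in annotations[p]) / total

def combined_score(s_domain, s_blast, s_ppi, has_blast_hit, has_ppi_hit):
    """Simple average over the available sources (S_D is always available; S_B and S_N
    are included only when BLAST / STRING matches exist), following I_B and I_N above."""
    scores = [s_domain]
    if has_blast_hit:
        scores.append(s_blast)
    if has_ppi_hit:
        scores.append(s_ppi)
    return sum(scores) / len(scores)
```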
We compared the performance of the methods using the three CAFA evaluation metrics, namely Fmax, AUPR, and Smin36 (see Methods). Fmax computes the maximum possible protein-centric F1 score over all prediction thresholds. AUPR, the area under the precision-recall curve, is a suitable metric for imbalanced data and penalizes false positive predictions, which is highly applicable to function prediction16. Finally, Smin is a measure of the semantic distance between the predicted and actual annotations based on the information content of the individual GO terms18, i.e., this metric indicates the capability of predicting rare GO terms.
Firstly, we compared Domain-PFP with sequence-only or domain-based methods, e.g., DeepGOCNN and DeepGOZero. This is a fair comparison as the base Domain-PFP uses only domain information, which is inferred from sequence information. It can be observed from the table that Domain-PFP outperforms these methods in terms of Fmax, AUPR, and Smin in all three sub-ontologies. Notably, Domain-PFP achieved an AUPR of 0.697 for CC, whereas DeepGOZero, a recent method based on domain information, scored 0.645, i.e., a large improvement of 0.052. In terms of Fmax, Domain-PFP outperformed DeepGOZero and DeepGOCNN by achieving 0.013–0.014 and 0.051–0.086 higher scores, respectively.
Adding different features generally improves the performance of function prediction. When BLAST information was combined, Domain-PFP improved the overall performance, except for a slight drop of 0.001 in Fmax for MF. Fmax for BP increased from 0.41 to 0.434, and AUPR for CC increased from 0.697 to 0.717. Furthermore, Domain-PFP with BLAST consistently outperformed DeepGOZero+BLAST, which also uses the same information, in all 9 metrics. For example, DeepGOZero+BLAST achieved AUPR scores of 0.665, 0.356, and 0.654 for MF, BP, and CC, respectively, whereas Domain-PFP + BLAST achieved 0.693, 0.367, and 0.717, representing improvements of 0.028, 0.011, and 0.063, respectively. When compared with DeepGOPlus or TALE + , both of which use BLAST, the improvements made by Domain-PFP + BLAST appeared consistent as well.
Next, we experimented with including PPI information with Domain-PFP. However, this only improved the performance in BP, as expected, since BP involves multiple related and interacting functions that can be captured by PPIs. On the other hand, the performance of MF and CC was negatively affected. This situation is similar to the findings of NetGO2.016, where the authors reported that PPI information performed better than domain information for predicting BP terms but not for MF and CC terms. For example, the Fmax of MF and CC dropped by 0.009 and 0.002, respectively. Despite this, Domain-PFP + PPI still outperformed DeepGraphGO, a method using domain and PPI information in a much more sophisticated graph neural network, in 5 out of 9 metrics.
Finally, we experimented with integrating both BLAST and PPI simultaneously. This brought improvements in all the metrics except for the AUPR of CC. Notably, the Fmax and AUPR of BP improved by 0.042 and 0.049, respectively. This integration of BLAST and PPI features enabled Domain-PFP to perform consistently better than all the existing methods. For example, the current state-of-the-art method, NetGO2.0, was surpassed by Domain-PFP in 8 out of 9 metrics (except for Fmax for MF). In terms of Fmax for BP and CC, Domain-PFP achieved 0.021 and 0.024 higher scores, respectively. For AUPR, the improvements were 0.017, 0.050, and 0.060 for MF, BP, and CC, respectively. Similarly, in terms of Smin, Domain-PFP + BLAST + PPI achieved 0.041, 0.784, and 0.389 smaller scores for MF, BP, and CC, respectively, implying that non-trivial GO terms were captured better.
This comparative evaluation with state-of-the-art function prediction methods further supports our self-supervised approach of learning functionally informed representations for protein domains. We observed that a simple KNN model with the DomainGO-prob embedding not only outperformed more sophisticated deep learning models (e.g., DeepGraphGO) but also methods with access to more information sources (e.g., NetGO2.0). The only case where we fell behind the previous state of the art, NetGO2.0, is Fmax for MF, which we hypothesize is due to its inclusion of PubMed publication information that is likely to contain precise information vital for MF prediction.
In our evaluation, we have utilized the benchmark compiled by the authors of DeepGOZero, which was derived from the NetGO benchmark dataset following protocols similar to CAFA. Using this benchmark allowed us to compare our performance against the other recent methods that were evaluated on this benchmark by the authors of DeepGOZero. We also evaluated Domain-PFP on the original NetGO benchmark. The results of those experiments are presented in Supplementary Table 4. Domain-PFP with BLAST and PPI showed the highest values for all the metrics except for the AUPR for CC, where Domain-PFP with BLAST showed the highest value. Domain-PFP alone showed a higher score than all the existing methods compared, except for the Fmax of MF, where NetGO had the highest score. Compared to Fmax, the improvement in AUPR is more prominent, which can also be observed in the results presented in Table 1.
To assess the performance of Domain-PFP against structure-based protein function predictors, we considered two recent methods, DeepFRI8 and GAT-GO60. These methods use 3D protein structure information in a graph neural network and protein sequence information with a language model. Both methods were evaluated on a common benchmark dataset composed of 29,902, 3,323, and 3,416 proteins for training, validation, and testing, respectively. The training and test proteins possess a total of 2,752 GO terms and are clustered at a 40% sequence identity cutoff.
We retrained Domain-PFP on this dataset and observed the performance across the three sub-ontologies. The results are presented in Supplementary Table 5. It can be observed that Domain-PFP outperforms the much more complex graph neural network-based function predictors, which have access to structural information, on all the metrics except for Fmax in CC and AUPR in MF. The performance of Domain-PFP was further improved by including BLAST predictions, which resulted in the best scores for all the metrics.
Evaluation on CAFA3 benchmark
We further evaluated Domain-PFP on the CAFA3 benchmark36. We trained the network model of Domain-PFP using the CAFA3 training dataset and evaluated the results using the official evaluation code. The training dataset comprised 66,841 protein sequences annotated before September 2016, with 677, 3992, and 551 MF, BP, and CC GO terms, respectively (Supplementary Table 6). The test set contained 3328 proteins annotated between September 2016 to February 2017. To include sequence similarity information using BLAST in our pipeline, we constructed a new BLAST database with the CAFA3 training sequences. However, we could not use PPI information from the STRING database for this benchmark because STRING v10.a (the version during the competition timeline) lacked sufficient interaction data of the CAFA3 test proteins. We did not perform any additional hyperparameter tuning and kept the same hyperparameters computed from the NetGO benchmark validation data.
The results of Domain-PFP on the CAFA3 benchmark are presented in Fig. 5 in comparison with the top 10 performing methods as published by the organizers of CAFA336. Domain-PFP + BLAST consistently showed a higher Fmax than Domain-PFP alone. For both BP and CC, Domain-PFP + BLAST outperformed the existing methods. Domain-PFP + BLAST achieved an Fmax score of 0.63 for CC, which is 0.02 higher than the CAFA3 top model, Zhu Lab. For BP, Domain-PFP + BLAST showed a slightly higher Fmax of 0.398 than the CAFA3 top model (Fmax: 0.397). For MF, our Fmax score, 0.59, was second to Zhu Lab (Fmax: 0.62), with a substantial margin over the next method, orengo-funfams (Fmax: 0.54). The top method by Zhu Lab combined more diverse information using an ensemble approach, including sequence, domain, homology, and biophysical information, which likely gave it a competitive edge, similar to NetGO2.0. We also note that both Domain-PFP and Domain-PFP + BLAST showed higher Fmax scores than DeepGOPlus, which reported Fmax scores of 0.557, 0.390, and 0.614 for MF, BP, and CC, respectively, in their paper18.
Discussion
Despite protein domains carrying the functional signatures of proteins, they have not been used to their full potential to date. Look-up-table-based domain-to-GO assignments tend to lack coverage, whereas deep learning-based approaches using domains as high-dimensional input suffer from limited training data and information loss in network bottlenecks. Therefore, our motivation has been twofold: improving coverage and reducing information loss. Recent advancements in self-supervised learning make it appealing to apply such concepts to protein domain learning to alleviate these issues. Our method follows one of the core concepts of self-supervised learning, where pseudo-labels are first learned to initialize model parameters, which are then used to perform the actual task using a supervised or unsupervised method61. Our approach is consistent with this definition: we first use the domain-GO association probabilities as pseudo-labels, which initialize our domain embedding parameters; then, we use this embedding in a supervised learning protocol to predict the functions of proteins. This strategy was also followed in the benchmarks on the NetGO and CAFA3 datasets we performed. To the best of our knowledge, this work is the first to apply self-supervised learning to protein function prediction. Based on the co-occurrence contextual information between domains and GO terms, we devised embeddings for domains so that functionally related domains have similar embeddings. Since the co-occurrences were learned from entire protein sequences, the domain embedding model, DomainGO-prob, encodes GO associations that are not explicitly described in the domain database. Remarkably, our rather simple model, Domain-PFP, along with BLAST and PPI information, demonstrated superior performance over state-of-the-art function predictors.
One likely limitation of this work is the case of unknown domains. All existing methods based on protein domains fail to predict anything if a domain seen during inference was absent from the training data, in which case they predict a default value. This limitation could possibly be resolved by generating the functionally aware domain representation and localization end-to-end from the protein sequence directly, using a larger deep learning model. Another limitation is that the current protein embedding weights all domains in a protein equally (Eq. 4), although each domain may contribute differently to protein function. Also, the order of appearance of domains in a protein is not considered, which is known to be relevant to function13. To address these issues, an attention mechanism may be applicable. These are improvements we wish to explore in subsequent works. In the current work, we practically alleviated these issues by augmenting the predictions with BLAST and PPI information.
Another possible future direction could be to combine with general protein language models31, which were shown to perform well in protein tertiary structure prediction and other tasks. Additionally, we wish to analyze the suitability of our model in a zero-shot learning scenario. Specifically, our goal is to train DomainGO-prob on pretrained GO embeddings based on GO tree hierarchy and observe if GO terms absent in the training data can be retrieved this way.
Methods
Neural network architecture
We designed a neural network to learn the domain-GO co-occurrence conditional probability distribution (Fig. 1a). The domain and GO terms are received as one-hot-encoded inputs, which are passed through two separate embedding layers to generate the 256-dimensional domain and GO embeddings, respectively. Then, from the computed domain and GO embeddings, we calculate the Hadamard product as a measure of correlation between the two types of embeddings and pass it through a densely connected layer of 128 neurons. The neurons are regularized through dropout (p = 0.05) and activated by ReLU. Finally, we use a linear layer to predict the \(p({GO|domain})\) score. The domain embedding matrix is extracted to generate the representations of domains. In order to increase the sparsity of the domain embeddings, we apply L1-regularization to that embedding layer (\(\lambda =0.1\)).
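A minimal PyTorch sketch of this architecture is given below, assuming the hyperparameters stated above (256-dimensional embeddings, a 128-unit hidden layer, dropout of 0.05, L1 penalty with λ = 0.1); the class and variable names, and the exact scaling of the L1 penalty, are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class DomainGOProb(nn.Module):
    """Predicts p(GO | domain) from learned domain and GO embeddings (cf. Fig. 1a)."""
    def __init__(self, n_domains, n_gos, emb_dim=256, hidden_dim=128, dropout=0.05):
        super().__init__()
        self.domain_emb = nn.Embedding(n_domains, emb_dim)   # phi, L1-regularized in the loss
        self.go_emb = nn.Embedding(n_gos, emb_dim)            # psi
        self.hidden = nn.Linear(emb_dim, hidden_dim)           # (W1, b1)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_dim, 1)                    # (W2, b2), linear output

    def forward(self, domain_idx, go_idx):
        h = self.domain_emb(domain_idx) * self.go_emb(go_idx)  # Hadamard product
        h = self.dropout(torch.relu(self.hidden(h)))
        return self.out(h).squeeze(-1)

def loss_fn(model, pred, target, l1_lambda=0.1):
    """MSE loss plus an L1 penalty on the domain embedding matrix to impose sparsity
    (the exact scaling of the penalty is an implementation detail assumed here)."""
    mse = nn.functional.mse_loss(pred, target)
    return mse + l1_lambda * model.domain_emb.weight.abs().mean()
```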
Network training
Similar to word2vec embedding training62, we have a comparatively much larger number of domain-GO pairs that do not co-occur, i.e., \(p({GO|domain})=0\). Thus, we employed negative sampling by randomly selecting 1000–2000 non-co-occurring GO terms for each domain. The network was trained by minimizing the MSE (mean squared error) loss with the Adam optimizer39 with a learning rate of 0.001 (the other parameters were kept at their defaults) and a batch size of 163,840 for 200 epochs. 20% of the domain-GO pairs, selected at random, were used as the validation set. The experiments were performed 10 times, and the best model based on validation performance was selected.
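The negative-sampling step can be sketched as follows, assuming domain_gos maps each domain to its co-occurring GO terms and their pseudo-labels; the function name and data layout are illustrative.

```python
import random

def sample_training_pairs(domain_gos, all_gos, n_neg=1000):
    """Build (domain, GO, target) triples: observed co-occurrences carry the conditional
    probability p(GO|domain) as target, plus randomly sampled non-co-occurring GO terms
    with target 0 (negative sampling)."""
    triples = []
    for domain, go_probs in domain_gos.items():
        for go, p in go_probs.items():                    # positive pairs with pseudo-labels
            triples.append((domain, go, p))
        candidates = sorted(all_gos - set(go_probs))      # GO terms never seen with this domain
        negatives = random.sample(candidates, k=min(n_neg, len(candidates)))
        triples.extend((domain, go, 0.0) for go in negatives)
    return triples
```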
funSim score
The \({funSim}\) score is popularly used for quantifying the similarity of the GO term annotations of two proteins29,63. The \({funSim}\) score uses the relevance semantic similarity score \({sim}_{REL}\) for the similarity of GO terms of the same category42:

$${sim}_{REL}\left({GO}_{1},{GO}_{2}\right)=\mathop{\max }\limits_{{GO}\in {Ancestors}\left({GO}_{1}\right)\cap {Ancestors}\left({GO}_{2}\right)}\left(\frac{2\log p\left({GO}\right)}{\log p\left({GO}_{1}\right)+\log p\left({GO}_{2}\right)}\cdot \left(1-p\left({GO}\right)\right)\right)$$
where the common ancestral GO terms of \({GO}_{1}\) and \({GO}_{2}\) are explored to maximize the score and \(p({GO})\) is the probability of the GO term in the entire Swiss-Prot database. Then, the GO score for a GO category between two proteins, a and b, is defined as

$${GOscore}\left({protein}_{a},{protein}_{b}\right)=\max \left\{\frac{1}{n}\sum _{i=1}^{n}\mathop{\max }\limits_{1\le j\le m}{s}_{ij},\,\frac{1}{m}\sum _{j=1}^{m}\mathop{\max }\limits_{1\le i\le n}{s}_{ij}\right\}$$

where \({s}_{ij}\) is the \({sim}_{REL}\) score of \({GO}_{i}\) of \({protein}_{a}\) and \({GO}_{j}\) of \({protein}_{b}\), computed in an all-vs-all fashion, and \(n\) and \(m\) are the numbers of GO terms annotated to \({protein}_{a}\) and \({protein}_{b}\), respectively.
Finally, the \({funSim}\) score is the average of the GOscore values from the three GO categories,

$${funSim}\left(a,b\right)=\frac{1}{3}\left({GOscore}_{MF}\left(a,b\right)+{GOscore}_{BP}\left(a,b\right)+{GOscore}_{CC}\left(a,b\right)\right)$$

and ranges from 0 to 1, with 1 as the maximum score.
Evaluation metrics
For the PROBE benchmark, similar to the original benchmark by Unsal et al.49, we used Weighted F1 Score as the evaluation metric. The values were computed using their official CodeOcean distribution.
To compare with state-of-the-art methods, we used the CAFA protein-centric evaluation metrics Fmax, Smin, and AUPR2. We used the same evaluation code as DeepGOZero22 to ensure consistency.
Fmax is the maximum possible protein-centric F1 score, computed over all prediction thresholds:

$${F}_{\max }=\mathop{\max }\limits_{\tau }\left\{\frac{2\cdot {pr}\left(\tau \right)\cdot {re}\left(\tau \right)}{{pr}\left(\tau \right)+{re}\left(\tau \right)}\right\}$$

Here, \({pr}\left(\tau \right)\) and \({re}\left(\tau \right)\) are the precision and recall scores, respectively, computed at the cut-off value of \(\tau\). The precision and recall values are computed as

$${pr}\left(\tau \right)=\frac{1}{h\left(\tau \right)}\sum _{j=1}^{h\left(\tau \right)}\frac{{\sum }_{i}I\left(S\left({G}_{i},{P}_{j}\right)\ge \tau \right)\cdot I\left({G}_{i},{P}_{j}\right)}{{\sum }_{i}I\left(S\left({G}_{i},{P}_{j}\right)\ge \tau \right)}$$

$${re}\left(\tau \right)=\frac{1}{{N}_{T}}\sum _{j=1}^{{N}_{T}}\frac{{\sum }_{i}I\left(S\left({G}_{i},{P}_{j}\right)\ge \tau \right)\cdot I\left({G}_{i},{P}_{j}\right)}{{\sum }_{i}I\left({G}_{i},{P}_{j}\right)}$$

Here, \({N}_{T}\) is the total number of proteins and \(h\left(\tau \right)\) is the number of proteins with a prediction score no smaller than \(\tau\) for at least one GO term. \(I\) is the indicator function, which returns 1 if the condition is true and 0 otherwise; \(I({G}_{i},{P}_{j})\) therefore indicates whether the protein \({P}_{j}\) is annotated with the GO term \({G}_{i}\). \(S({G}_{i},{P}_{j})\) denotes the prediction score of \({P}_{j}\) having the \({G}_{i}\) term.
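For reference, a compact NumPy sketch of the protein-centric Fmax described above is given below; scores and labels are assumed to be arrays of shape (n_proteins, n_GO_terms) holding prediction scores and 0/1 ground-truth annotations, and the threshold grid is illustrative.

```python
import numpy as np

def fmax(scores, labels, thresholds=np.linspace(0.0, 1.0, 101)):
    """Protein-centric Fmax over a grid of thresholds (sketch of the CAFA metric)."""
    best = 0.0
    for tau in thresholds:
        pred = scores >= tau
        covered = pred.any(axis=1)                         # proteins with >=1 prediction at tau
        if not covered.any():
            continue
        tp = (pred & (labels > 0)).sum(axis=1)             # true positives per protein
        prec = tp[covered] / pred[covered].sum(axis=1)     # per-protein precision
        rec = tp / np.maximum(labels.sum(axis=1), 1)       # per-protein recall
        pr, re = prec.mean(), rec.mean()
        if pr + re > 0:
            best = max(best, 2 * pr * re / (pr + re))
    return best
```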
The area under the precision-recall curve, i.e., the AUPR score, is computed from the precision and recall values using the trapezoidal rule:

$${AUPR}\approx \frac{\Delta x}{2}\left(f\left({x}_{0}\right)+2f\left({x}_{1}\right)+\cdots +2f\left({x}_{N-1}\right)+f\left({x}_{N}\right)\right)$$

Here, \({x}_{0},{x}_{1},\ldots ,{x}_{N}\) are the recall values, \(f({x}_{0}),f({x}_{1}),\ldots ,f({x}_{N})\) are the precision values at those recalls, and \(\Delta x\) is the step size.
Smin is a measure of the semantic distance between the ground truth and the predicted annotations, based on the information content of the GO classes. The information content \({\rm{IC}}(c)\) for a class \(c\) is computed based on the annotation probability of class \(c\) relative to its parent class \(P(c)\):

$${\rm{IC}}\left(c\right)=-\log P\left(c\,|\,P\left(c\right)\right)$$
The two terms, remaining uncertainty \(({ru})\) and average misinformation \(({mi})\), are defined as

$${ru}\left(\tau \right)=\frac{1}{{N}_{T}}\sum _{j=1}^{{N}_{T}}\sum _{i}{\rm{IC}}\left({G}_{i}\right)\cdot I\left({G}_{i},{P}_{j}\right)\cdot I\left(S\left({G}_{i},{P}_{j}\right) < \tau \right)$$

$${mi}\left(\tau \right)=\frac{1}{{N}_{T}}\sum _{j=1}^{{N}_{T}}\sum _{i}{\rm{IC}}\left({G}_{i}\right)\cdot \left(1-I\left({G}_{i},{P}_{j}\right)\right)\cdot I\left(S\left({G}_{i},{P}_{j}\right)\ge \tau \right)$$

i.e., \({ru}(\tau )\) accumulates the information content of true GO terms missed at threshold \(\tau\), while \({mi}(\tau )\) accumulates the information content of falsely predicted terms. The value of Smin is computed as

$${S}_{\min }=\mathop{\min }\limits_{\tau }\sqrt{{ru}{\left(\tau \right)}^{2}+{mi}{\left(\tau \right)}^{2}}$$
Data availability
The embeddings of the proteins from the PROBE benchmark dataset, computed by DomainGO-prob, GO term prediction by DomainPFP on the CAFA3 dataset, trained DomainGO-prob model weights and Domain-PFP KNN models are accessible at https://github.com/kiharalab/Domain-PFP and can also be downloaded from https://kiharalab.org/domainpfp/ and figshare64. All other data can be obtained from the corresponding author upon reasonable request.
Code availability
The Domain-PFP program is freely available for academic use from GitHub at (https://github.com/kiharalab/Domain-PFP). The snapshot of the code at the time of the publication is also made available at Zenodo65. Furthermore, the program is available to run on Google Colab Notebook (bit.ly/domain-pfp-colab).
References
Cruz, L. M., Trefflich, S., Weiss, V. A. & Castro, M. A. A. in Molecular Biology 1st edn, Vol. 1611 (eds. M. Kaufmann, C. Klinger & A. Savelsbergh) Ch. 55–75 (Humana Press, 2017).
Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221–227 (2013).
Torres, M., Yang, H., Romero, A. E. & Paccanaro, A. Protein function prediction for newly sequenced organisms. Nat. Mach. Intell. 3, 1050–1060 (2021).
Clark, W. T. & Radivojac, P. Analysis of protein function and its prediction from amino acid sequence. Proteins Struct. Funct. Bioinforma. 79, 2086–2096 (2011).
Hawkins, T. & Kihara, D. Function prediction of uncharacterized proteins. J. Bioinforma. Computat. Biol. 05, 1–30 (2007).
Hawkins, T., Chitale, M., Luban, S. & Kihara, D. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data. Proteins Struct. Funct. Bioinforma. 74, 566–582 (2009).
Dawson, N. L., Orengo, C. & Gáspári, Z. in Structural Bioinformatics. Methods in Molecular Biology 1st edn, Vol. 31 (ed. Gáspári, Z.) Ch. 43–57 (Humana Press, 2020).
Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021).
Kagaya, Y., Flannery, S. T., Jain, A. & Kihara, D. ContactPFP: protein function prediction using predicted contact information. Front. Bioinforma. 2, 896295 (2022).
Jain, A. & Kihara, D. Phylo-PFP: improved automated protein function prediction using phylogenetic distance of distantly related sequences. Bioinformatics 35, 753–759 (2019).
Sahraeian, S. M., Luo, K. R. & Brenner, S. E. SIFTER search: a web server for accurate phylogeny-based protein function prediction. Nucleic Acids Res. 43, W141–W147 (2015).
Forslund, K. & Sonnhammer, E. L. L. Predicting protein function from domain content. Bioinformatics 24, 1681–1687 (2008).
Messih, M. A., Chitale, M., Bajic, V. B., Kihara, D. & Gao, X. Protein domain recurrence and order can enhance prediction of protein functions. Bioinformatics 28, i444–i450 (2012).
Rojano, E. et al. Associating protein domains with biological functions: a tripartite network approach. IWBBIO 2019: Bioinform. Biomed. Eng. 8, 155–164 (2019).
Zhao, B. et al. NPF: network propagation for protein function prediction. BMC Bioinforma. 21, 355 (2020).
Yao, S. et al. NetGO 2.0: improving large-scale protein function prediction with massive sequence, text, domain, family and network information. Nucleic Acids Res. 49, W469–W475 (2021).
You, R. et al. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 34, 2465–2473 (2018).
Kulmanov, M. & Hoehndorf, R. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics 36, 422–429 (2019).
Cao, Y. & Shen, Y. TALE: Transformer-based protein function annotation with joint sequence–Label embedding. Bioinformatics 37, 2825–2833 (2021).
Wan, C. & Jones, D. T. Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks. Nat. Mach. Intell. 2, 540–550 (2020).
You, R., Yao, S., Mamitsuka, H. & Zhu, S. DeepGraphGO: graph neural network for large-scale, multispecies protein function prediction. Bioinformatics 37, i262–i271 (2021).
Kulmanov, M. & Hoehndorf, R. DeepGOZero: improving protein function prediction from sequence and zero-shot learning based on ontology axioms. Bioinformatics 38, i238–i245 (2022).
Bonetta, R. & Valentino, G. Machine learning techniques for protein function prediction. Proteins Struct. Funct. Bioinforma. 88, 397–413 (2020).
Ibtehaz, N. & Kihara, D. in Machine Learning in Bioinformatics of Protein Sequences 2nd edn, Vol. 1 (ed. Kurgan, L.) Ch. 31–55 (World Scientific, 2023).
Doerks, T., Copley, R. R., Schultz, J., Ponting, C. P. & Bork, P. Systematic identification of novel protein domain families associated with nuclear functions. Genome Res. 12, 47–56 (2002).
Burge, S. et al. Manual GO annotation of predictive protein signatures: the InterPro approach to GO curation. Database 2012, bar068 (2012).
Camon, E. B. et al. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinforma. 6, S17 (2005).
Liu, X. et al. Self-supervised learning: generative or contrastive. IEEE Trans. Knowl. Data Eng. 35, 857–876 (2023).
Wei, Q., Khan, I. K., Ding, Z., Yerneni, S. & Kihara, D. NaviGO: Interactive tool for visualization and functional similarity and coherence analysis with gene ontology. BMC Bioinforma. 18, 177 (2017).
Elnaggar, A. et al. ProtTrans: towards cracking the language of lifes code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. 118, e2016239118 (2021).
Vaswani, A. et al. in Attention is All you Need in Neural Information Processing Systems 1st edn, Vol. 1 (eds. Guyon, I., Luxburg, U. Von, Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S. & Garnett, R.) Ch. 30 (Curran Associates, Inc., 2017).
Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proc. 2019 Conf. North Am. Chapter Assoc. Computat. Linguist. Hum. Lang. Technol. 1, 4171–4186 (2019).
Littmann, M., Heinzinger, M., Dallago, C., Olenyi, T. & Rost, B. Embeddings from deep learning transfer GO annotations beyond homology. Sci. Rep. 11, 1160 (2021).
Yuan, Q., Xie, J., Xie, J., Zhao, H. & Yang, Y. Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion. Brief. Bioinforma. 24, bbad117 (2023).
Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 244 (2019).
Bateman, A. et al. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489 (2021).
Blum, M. et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 49, D344–D354 (2021).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv https://doi.org/10.48550/arXiv.1412.6980 (2015).
He, K. et al. Masked autoencoders are scalable vision learners. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. https://doi.org/10.48550/arXiv.2111.06377 (2022).
Aggarwal, C. C., Hinneburg, A. & Keim, D. A. On the surprising behavior of distance metrics in high dimensional space. In Database Theory (ICDT 2001), Lecture Notes in Computer Science Vol. 1973 (eds Van den Bussche, J. & Vianu, V.) 420–434 (Springer, 2001).
Schlicker, A., Domingues, F. S., Rahnenführer, J. & Lengauer, T. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinforma. 7, 302 (2006).
Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43, D213–D221 (2015).
Hill, D. P., Smith, B., McAndrews-Hill, M. S. & Blake, J. A. Gene ontology annotations: what they mean and where they come from. BMC Bioinforma. 9, S2 (2008).
Abrahamson, M., Alvarez-Fernandez, M. & Nathanson, C. M. Cystatins. Biochem. Soc. Symp. 70, 179–199 (2003).
Gaudet, P., Livstone, M. S., Lewis, S. E. & Thomas, P. D. Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Brief. Bioinforma. 12, 449–462 (2011).
Barrett, A. J. & Rawlings, N. D. Evolutionary lines of cysteine peptidases. Biol. Chem. 382, 727–733 (2001).
Bizzarri, C. et al. ELR+ CXC chemokines and their receptors (CXC chemokine receptor 1 and CXC chemokine receptor 2) as new therapeutic targets. Pharmacol. Ther. 112, 139–149 (2006).
Unsal, S. et al. Learning functional properties of proteins with language models. Nat. Mach. Intell. 4, 227–245 (2022).
Heinzinger, M. et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinforma. 20, 723 (2019).
Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).
Choy, C. T., Wong, C. H. & Chan, S. L. Embedding of genes using cancer gene expression data: biological relevance and potential application on biomarker discovery. Front. Genet. 9, 682 (2019).
Rao, R. M. et al. MSA transformer. In Proc. 38th International Conference on Machine Learning Vol. 139 (eds Meila, M. & Zhang, T.) 8844–8856 (PMLR, 2021).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Yang, K. K., Wu, Z., Bedbrook, C. N. & Arnold, F. H. Learned protein embeddings for machine learning. Bioinformatics 34, 2642–2648 (2018).
Asgari, E. & Mofrad, M. R. K. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One 10, e0141287 (2015).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
Szklarczyk, D. et al. The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 49, D605–D612 (2021).
Lai, B. & Xu, J. Accurate protein function prediction via graph attention networks with predicted structure information. Brief. Bioinforma. 23, bbab502 (2022).
Doersch, C. & Zisserman, A. Multi-task self-supervised visual learning. In IEEE International Conference on Computer Vision (ICCV) https://doi.org/10.48550/arXiv.1708.07860 (IEEE, 2017).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems Vol. 26 (eds Burges, C. J., Bottou, L., Welling, M., Ghahramani, Z. & Weinberger, K. Q.) (Curran Associates, Inc., 2013).
Chitale, M., Hawkins, T., Park, C. & Kihara, D. ESG: extended similarity group method for automated protein function prediction. Bioinformatics 25, 1739–1745 (2009).
Ibtehaz, N., Kagaya, Y. & Kihara, D. Data associated with Domain-PFP: protein function prediction using function-aware domain embedding representations. figshare https://doi.org/10.6084/m9.figshare.24302845 (2023).
Ibtehaz, N., Kagaya, Y. & Kihara, D. Domain-PFP: Protein function prediction using function-aware domain embedding representations. Zenodo https://doi.org/10.5281/zenodo.8436582 (2023).
Acknowledgements
This work was partly supported by the National Science Foundation (DBI2003635, DBI2146026, IIS2211598, DMS2151678, CMMI1825941, and MCB1925643) and by the National Institutes of Health (R01GM133840, 3R01 GM133840-02S1).
Author information
Contributions
D.K. conceived the study. N.I. designed and implemented the domain-based protein embedding and Domain-PFP. N.I. performed the computation. N.I. and D.K. analyzed the data. Y.K. participated in function evaluation on the CAFA3 dataset. N.I. drafted the manuscript and D.K. edited it. All authors read and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Biology thanks Shanfeng Zhu, Kentaro Tomii and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Yuedong Yang and Anam Akhtar. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ibtehaz, N., Kagaya, Y. & Kihara, D. Domain-PFP allows protein function prediction using function-aware domain embedding representations. Commun Biol 6, 1103 (2023). https://doi.org/10.1038/s42003-023-05476-9