Introduction

G protein-coupled receptors (GPCRs) form the largest transmembrane protein family and are among the most extensively investigated drug targets1,2. Recently, previously unexplored GPCR proteins have been investigated as novel drug targets for several diseases3. Furthermore, a recent study demonstrated that understanding genetic variations in GPCR proteins could enhance the effectiveness of drugs4. Thus, precise characterization of GPCR families can help accelerate drug discovery. GPCR proteins are organized in a hierarchical class structure consisting of family, subfamily, and sub-subfamily level classes. This structure was constructed according to the sequence similarity and evolutionary histories of the proteins5. Analyzing the characteristics of the proteins with respect to this structure lies at the heart of GPCR studies6, and inspecting the relations between GPCR sequences within this structure is an important research subject for two main reasons. First, the properties of a protein are often inferred from existing proteins that have already been experimentally validated7. Second, the evolutionary history of the proteins can be revealed from the relations among them.

Machine learning approaches to the GPCR class structure, such as hierarchical classification and clustering, have been widely explored. For example, classifiers designed specifically for each hierarchical family level were introduced to classify GPCR proteins; in this approach, classification proceeded in a top-down manner from family classes to sub-subfamily classes8,9. An unsupervised clustering algorithm was also proposed to investigate properties of GPCR proteins such as sequence similarity, classification of GPCR sequences, and phylogenetic tree reconstruction6. These methods have modeled GPCR fairly accurately. However, existing methods had to model GPCR at each level of the family hierarchy separately and inevitably employed a series of separate steps to handle the features in the class structure, since their representations can hardly reveal unified features across the class hierarchies. Otherwise, they had to compute distances between all pairs of protein sequences, which is computationally prohibitive6,9,10. As a result, processing complex features throughout the hierarchy levels as a whole is nearly impossible with previous representations11. Because the GPCR class hierarchy was constructed using complex features including phylogenetic traits, ligand types and their functions12, these approaches may not provide research opportunities to inspect the relations between these traits across GPCR proteins. In this regard, it is important to construct comprehensive representations of GPCR proteins that inclusively incorporate hierarchical features.

In this work, we present DeepHier, a novel method that simultaneously learns and represents complex protein features across the hierarchy levels. Our end-to-end deep learning network constructs a single embedding space of GPCR sequences in which hierarchical sequence information is preserved in terms of metric distances. DeepHier incorporates significant features from all hierarchical levels into a single vector, and distances in the constructed embedding space can be utilized for several downstream analyses of GPCR proteins.

In a series of experiments, we showed that the metric space generated by DeepHier successfully reflects the hierarchical class structure of GPCR proteins. First, we extensively investigated distances among sequences in the embedding space using cluster analysis techniques; in the distance-based cluster analysis, proteins were grouped well according to their hierarchical class labels. Second, we showed that phylogenetic trees constructed from the embedding vectors accurately reflected the phylogeny of the proteins. In addition, we showed that biologically significant motifs in the GPCR proteins are well represented in each cluster, and we further investigated how those motifs generalize or narrow along different hierarchical levels. Last but not least, query search experiments demonstrated that similarity search on proteins can be performed efficiently with the embedding vectors.

Related work

Analyzing biological sequences with deep learning

Alignment-based algorithms have achieved remarkable success in identifying characteristics of biological sequences. However, the performance of alignment-based algorithms may deteriorate significantly when there are large variations in sequences, such as duplications or deletions of subsequences13. Furthermore, multiple sequence alignment algorithms require a huge amount of computation time with respect to the number of sequences14. For these reasons, a number of alignment-free methods have been developed to achieve performance comparable to that of alignment-based algorithms.

Recently, deep learning technologies have brought unprecedented advances in analyzing biological sequences. For example, a previous deep learning study proposed deep convolutional layers to recognize protein folds from sequences with competitive accuracy and robustness15. Other works showed that biological sequences can be successfully modeled with a single convolutional layer equipped with a one-max pooling operator16,17. In these works, each convolutional filter was trained to acquire features that correspond to significant sequence patterns. In particular, DeepFam17 showed promising results in modeling protein families with motif features. A recent deep learning work extended the motif discovery components in DeepFam by adopting word2vec techniques to model distributional dependencies among nucleotides18. Although DeepHier is built upon the architecture proposed in DeepFam, it differs from the previous works in that the neural network architecture is significantly expanded to incorporate hierarchical class information into a single model. Furthermore, our work embeds hierarchical features, originating from the protein family hierarchy, onto a single metric space.

Metric learning on deep learning

With the success of deep learning models, more and more works have proposed to utilize distances in the deep feature space. The training objective in these studies is to learn an effective mapping function of data such that the similarity between inputs can be directly inferred from the embedding vectors. Neural networks are trained to place representations of data points with similar properties close together and representations of dissimilar data far apart. For example, Siamese networks19 and Triplet networks20 were introduced to make feature vectors of similar data points lie close to each other in the proposed deep embedding spaces. Similarly, center loss was introduced to construct compact intra-class representations and separable inter-class representations21. In center loss, the mean embedding vector of each data class is employed as a reference vector to guide training. DeepHier adopts center loss to learn a compact and separable embedding function according to protein family information.

In deep metric learning for sequence analyses, components of the training phase are designed to grant biologically significant meaning to distances in the embedding space. For instance, a Siamese neural network for biological sequences was introduced in which alignment distances between sequences can be directly estimated from the embedding vectors of the given sequences22. Moreover, an embedding function based on long short-term memory was suggested to incorporate structural similarities of proteins during training23. While previous studies learned metrics for single-level information of biological sequences, to the best of our knowledge, our work is the first to model hierarchical relations between protein sequences in a single metric space.

Methods

Neural network architecture

The overall structure of DeepHier is extended and modified from DeepFam17. DeepFam is a one-dimensional convolutional neural network with different convolutional filter sizes that can model sequence motifs of various lengths. Owing to the success of DeepFam in modeling protein families, this architecture has been employed as a baseline neural network in previous studies. For example, DeepNOG extended the encoding layer of DeepFam by representing amino acid characters as numerical vectors using a trainable model24. DeepPPF also adopted the basic architecture of DeepFam by using convolutional filters of various sizes and a one-max pooling layer to model motif patterns18. To fully utilize the modeling capabilities of DeepFam, our work also adopts this architecture.

Rather than merely adopting the previous work, however, we introduce novel components on top of the architecture, an Embedding layer and a Multi-branch classifier, to effectively build a metric space. The overall architecture of DeepHier is illustrated in Fig. 1.

Figure 1. Overall figure of DeepHier. (a) Neural network architecture. (b) Loss function.

Feature extractor with CNN

In DeepFam, variable length convolutional filters were used with a one-max pooling layer17. In this setting, the output activation value from each convolutional filter encodes whether a significant motif pattern is present in the input sequence. This architecture has been demonstrated to be successful in capturing motifs from biological sequences16,17; in particular, DeepFam was successful in detecting motifs of variable lengths. In this work, we also adopted this one-max pooling architecture since GPCR families are characterized by highly preserved sequence regions25. After one-max pooling, the resulting values in the convolutional layer are concatenated into a one-dimensional representation, which can be regarded as a list of existence values for learned motifs.
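
To make the feature extractor concrete, the following is a minimal PyTorch sketch of the multi-width convolution with one-max pooling described above; the filter widths and the number of filters are illustrative placeholders, not the exact hyperparameters used in this work.

```python
import torch
import torch.nn as nn

class MotifExtractor(nn.Module):
    """Multi-width 1-D convolutions followed by one-max pooling."""
    def __init__(self, alphabet_size=20, filter_widths=(8, 12, 16, 20), n_filters=250):
        super().__init__()
        # One Conv1d per filter width; input channels are the 20 amino acid symbols.
        self.convs = nn.ModuleList(
            nn.Conv1d(alphabet_size, n_filters, kernel_size=w) for w in filter_widths
        )

    def forward(self, x):                      # x: (batch, 20, sequence_length)
        pooled = []
        for conv in self.convs:
            h = torch.relu(conv(x))            # (batch, n_filters, L - w + 1)
            pooled.append(h.max(dim=2)[0])     # one-max pooling: strongest match per filter
        return torch.cat(pooled, dim=1)        # concatenated motif-existence vector
```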

Embedding layer

A linear layer that follows the convolutional layer projects the outputs of the convolutional layer onto a low-dimensional vector space. The projected vector is used as the embedding vector of the input protein sequence. To simplify the explanation, we assume embedding vectors lie in a d-dimensional space. The linear transformation is followed by an \(L_2\) normalization operator on the resulting vectors to keep the distances between embeddings from exploding. Thus, we restrict the vector representation of each sequence to lie on a d-dimensional hyper-sphere.
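
A minimal sketch of this layer follows, taking the concatenated convolutional features as input; the embedding dimension d shown here is an arbitrary placeholder.

```python
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingLayer(nn.Module):
    """Linear projection followed by L2 normalization onto the unit hyper-sphere."""
    def __init__(self, in_features, d=32):      # d is an assumed embedding dimension
        super().__init__()
        self.proj = nn.Linear(in_features, d)

    def forward(self, features):
        z = self.proj(features)
        # L2 normalization restricts each embedding to a d-dimensional hyper-sphere,
        # preventing pairwise distances from exploding.
        return F.normalize(z, p=2, dim=1)
```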

Multi-branch classifier

Three branches of multilayer perceptron (MLP) classifiers are attached to the embedding layer. The technique of using branches within a single neural network, with each branch dedicated to a domain-specific task, was proposed in Multi-Domain Network (MDNet)26 to construct shared features across multiple domains. In our architecture, the three branches correspond to the three hierarchical levels: family, subfamily and sub-subfamily. Under this multi-branch architecture, gradients from these classifiers guide the embedding vectors to incorporate hierarchical information on the GPCR protein family into a single representation.
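
The multi-branch classifier can be sketched as three small MLP heads sharing the same embedding input; the hidden layer size below is an assumption for illustration only.

```python
import torch.nn as nn

class MultiBranchClassifier(nn.Module):
    """One MLP head per hierarchy level, all fed by the shared embedding vector."""
    def __init__(self, d, n_family, n_subfamily, n_subsubfamily, hidden=128):
        super().__init__()
        def head(n_classes):                     # small two-layer MLP per level
            return nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))
        self.family = head(n_family)
        self.subfamily = head(n_subfamily)
        self.subsubfamily = head(n_subsubfamily)

    def forward(self, z):                        # z: (batch, d) embedding vectors
        return self.family(z), self.subfamily(z), self.subsubfamily(z)
```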

Loss function

In DeepHier, a loss function defined with respect to a metric space is introduced to construct meaningful distance relations between protein sequences. We first list the notations used in defining the loss terms in Table 1. In the table, \(\mu _{y_i}\) is calculated as the mean vector of the deep representations of the sequences belonging to class \(y_i\)21.

Cross-entropy loss with a softmax function

Cross-entropy loss with a softmax function is frequently used as a loss function for training classifiers. The softmax function generates a probability distribution over candidate labels from the output layer of the network. In supervised settings, the softmax function is commonly combined with cross-entropy loss to encourage classifiers to output higher probabilities for the desired labels, which is effective for learning representations that are separable between classes. Likewise, we employed this loss as a component of our objective so that the neural network learns separable features in the input sequences for each label.

Center loss

Center loss21 was proposed to complement the cross-entropy loss with a softmax function. Although cross-entropy loss with a softmax function is effective in separating features between classes, it lacks the ability to learn compact representations for data within the same class21,22. To address this, the additional loss term minimizes the distances among data points within the same class. Center loss can be stated in the following form:

$$\begin{aligned} {L}_{C} = \sum _{i=1}^{n}{{\left\| {d(x_{i}) - \mu _{y_i}}\right\| }^2_2} \end{aligned}$$
(1)

With this metric loss, parameters are updated so that the representation of each input datum moves closer to its class mean vector during training. The computer vision community has utilized this form of metric loss when a large number of classes must be modeled; in particular, domains such as face recognition have benefited from metric losses21,27. GPCR protein modeling is similar to these applications, since we need to model a relatively large number of classes (86 sub-subfamilies) with few data samples. Hence, we used center loss to train DeepHier.

When exploiting the above loss function, the mean vectors must be updated along with the network parameters. In the previous work21, mean vectors were calculated on a mini-batch basis, since computing vector representations of the whole dataset, which generally comprises 50K to 200M images in computer vision, is computationally exhaustive. However, the number of sequences in our training data is relatively small, a total of 8222 GPCR proteins. Therefore, we updated the class mean vectors based on the feature vectors of the whole dataset.
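
The following is a minimal sketch of the center loss of Eq. (1) and of the full-dataset center update described above; tensor and variable names are illustrative.

```python
import torch

def center_loss(embeddings, labels, centers):
    # Squared Euclidean distance of each embedding to the mean vector of its class (Eq. 1).
    return ((embeddings - centers[labels]) ** 2).sum()

@torch.no_grad()
def update_centers(all_embeddings, all_labels, n_classes):
    # The GPCR training set is small (8222 sequences in total), so class means
    # are recomputed from the whole dataset rather than per mini-batch.
    centers = torch.zeros(n_classes, all_embeddings.size(1))
    for c in range(n_classes):
        centers[c] = all_embeddings[all_labels == c].mean(dim=0)
    return centers
```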

Table 1 Notations related to loss terms.

Overall loss

Combining the cross-entropy loss \({L}_{S}\) from the classifiers and the center loss for each hierarchy level, the overall loss function can be stated as the following equation.

$$\begin{aligned} {L} = \sum _{i\in {S}}\omega _{S_i}{L}_{S_i} + \lambda _{C}\left( \sum _{i\in {S}}\omega _{C_i}{L}_{C_i}\right) \end{aligned}$$
(2)

where S denotes the set of hierarchy levels (S = {family, subfamily, sub-subfamily}) and the loss function is a weighted sum of the losses from each hierarchy level. To balance the center loss against the cross-entropy loss with a softmax function, \(\lambda _{C}\) was introduced as a weight for the center loss21. Here, \(\omega _{S_i}\) and \(\omega _{C_i}\) denote the weights for the cross-entropy loss with a softmax function and the center loss, respectively.
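
As a rough sketch of Eq. (2), the per-level cross-entropy and center losses can be combined as a weighted sum; the dictionaries of logits, labels, centers and weights keyed by hierarchy level are an assumed interface, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def overall_loss(logits, labels, embeddings, centers, w_s, w_c, lambda_c):
    levels = ("family", "subfamily", "subsubfamily")
    # Weighted softmax cross-entropy over the three hierarchy levels.
    ce = sum(w_s[lv] * F.cross_entropy(logits[lv], labels[lv]) for lv in levels)
    # Weighted center loss per level: squared distance of each embedding to its class center.
    cl = sum(w_c[lv] * ((embeddings - centers[lv][labels[lv]]) ** 2).sum() for lv in levels)
    return ce + lambda_c * cl
```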

Training procedure

To effectively incorporate hierarchical information into a single embedding space, we designed a training procedure with three phases. In each phase, DeepHier focuses on the loss values from one hierarchy level, and the phases follow the hierarchical order, from the family level to the sub-subfamily level. This order ensures that the network first learns general representations corresponding to family-level properties of GPCR families and then acquires fine-grained representations at the sub-subfamily level.

Analyzing distances in the embedding space

To assess the quality of distances in the proposed space, we evaluated how well the distances in the embedding space represent the hierarchical class structure of the GPCR protein family. Embedding vectors from DeepHier were compared to the following methods: (1) a model with the same architecture as DeepHier but without the center loss function (w/o Center Loss); (2) a model with the same loss function and feature extractor as DeepHier but without multiple output branches (w/o Multiple Branch); (3) DeepFam17, a one-layer convolutional neural network with a one-max pooling operator that captures sequence motifs; (4) DeepPPF18, a convolutional neural network with trained encoding layers for amino acid characters; (5) Autoencoder, a convolutional autoencoder trained with reconstruction loss; (6) an MLP classifier on a flattened one-hot encoding vector; and (7) k-mer frequency vectors (3-mer and 4-mer). For the neural network models, activations of the neurons at the last hidden layer were used as representations of the sequences.

Specifically, clusters were compared to the class labels at the family, subfamily and sub-subfamily levels using adjusted mutual information (AMI) and silhouette scores. In calculating the silhouette scores, real class labels from the three hierarchical levels were used as ground truth. Since the silhouette score is based on intra-cluster and inter-cluster distances, silhouette scores under our scheme indicate how consistent the distances in the embedding space are with the real class labels. The correspondence between hierarchical clustering results and GPCR family labels was evaluated using the AMI score, again treating the real labels as ground truth. Agglomerative clustering with Ward linkage, a hierarchical clustering algorithm in which the pair of clusters with the closest distance is merged at each step, was applied to the embedding vectors until the target number of clusters was obtained. In our experiments, we increased the target number of clusters from 2 to 100 with a step size of 3, and AMI scores were measured between the class labels and the clustering results for each target cluster number. Detailed explanations of the AMI and silhouette scores are provided in the supplementary material.
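
A minimal sketch of this evaluation with scikit-learn is given below, assuming `embeddings` is an (n, d) array and `labels` holds the true class labels of one hierarchy level.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score

def evaluate_embeddings(embeddings, labels):
    # Silhouette score of the embedding distances with respect to the true labels.
    sil = silhouette_score(embeddings, labels)
    # AMI between true labels and Ward agglomerative clusterings for target
    # cluster numbers from 2 to 100 in steps of 3.
    ami_scores = {}
    for k in range(2, 101, 3):
        clusters = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(embeddings)
        ami_scores[k] = adjusted_mutual_info_score(labels, clusters)
    return sil, ami_scores
```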

Analyzing the hierarchical structure in the embedding vectors

The relationship between proteins within the hierarchical class structure is an important aspect of GPCR protein studies. We analyzed the overall distance relations between proteins in the embedding space against the real class labels. We then constructed a phylogenetic tree from the embedding vectors to demonstrate that phylogenetic analysis is possible with the embedding vectors; the tree was constructed using the neighbor-joining clustering method28.
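
As a rough sketch, the tree can be reconstructed by computing pairwise Euclidean distances between embedding vectors and passing them to a neighbor-joining implementation; scikit-bio is used here as an assumed tool, not necessarily the one used in this work.

```python
from scipy.spatial.distance import pdist, squareform
from skbio import DistanceMatrix
from skbio.tree import nj

def build_tree(embeddings, ids):
    # Pairwise Euclidean distances between the (n, d) embedding vectors.
    dists = squareform(pdist(embeddings, metric="euclidean"))
    # Neighbor-joining tree reconstruction from the distance matrix.
    return nj(DistanceMatrix(dists, ids))
```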

Discovering motifs from the embedding vectors

To demonstrate the biological significance of the vectors, we performed motif discovery experiments on the deep feature vectors. In these experiments, MUSCLE, a multiple sequence alignment tool29, was used to detect conserved sequence patterns. Sets of reference proteins were selected from the hierarchical clustering results of the previous section, in which the target number of clusters was incremented. We applied MUSCLE to the sequences in the designated clusters to identify the highly preserved regions of each cluster. The preserved regions reported by the alignment algorithm were then compared to known motifs in the literature to check whether the results are consistent with biological knowledge.

Query search on embedding spaces

We conducted a sequence similarity query search against the protein family database as one of the downstream tasks on the embedding space from DeepHier. Distances in a low-dimensional embedding space have long been utilized for effective query search over high-dimensional data, where the similarity between data points is defined in terms of distances in the low-dimensional space30. In a similar manner, since the deep feature space was trained to express GPCR similarity in terms of metric distances, we directly utilized the Euclidean distances between embedding vectors and handled query search as a nearest-neighbor problem in the embedding space. Unlike conventional alignment-based algorithms for sequences, distance calculations in the embedding space reduce to simple numerical operations. We compared the results with the Basic Local Alignment Search Tool (BLAST), the de facto standard tool for sequence similarity search31. Since computation speed has been a major limitation of alignment-based tools, we compared the execution time of each algorithm.
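
A minimal sketch of the nearest-neighbor query search is shown below; the `model.embed` call, which maps a sequence to its embedding vector, is a hypothetical interface.

```python
import numpy as np

def build_database(model, database_sequences):
    # Embed every database sequence once; result is an (N, d) matrix.
    return np.stack([model.embed(s) for s in database_sequences])  # model.embed is hypothetical

def query(model, query_sequence, db_vectors, top_k=5):
    q = model.embed(query_sequence)                  # (d,) embedding of the query
    dists = np.linalg.norm(db_vectors - q, axis=1)   # Euclidean distances to all N database entries
    return np.argsort(dists)[:top_k]                 # indices of the closest database sequences
```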

Experimental setup

Dataset preparation

The GPCR sequences were acquired from the BIAS-PROFS GPCR dataset9. In this dataset, sequence labels are annotated in a three-level hierarchy, from the family level to the sub-subfamily level. In processing the dataset, classes with fewer than 10 sequences were removed, resulting in 86 sub-subfamilies and 8222 sequences. For the query search experiment, the sequences from the training dataset were used as the target database of approximately 6200 sequences, while query sequences were retrieved only from the test dataset.

Sequence representation

To enable computation in deep learning models, a one-hot encoding scheme was adopted for DeepHier, in which every amino acid position is represented as a one-hot vector32,33. In addition, each sequence was padded with zeros to a fixed length because the CNN requires input vectors of the same size. Thus, after converting a sequence into one-hot vectors, we padded each sequence to a length of 100017. As a result of encoding, each sequence was converted into a matrix in \(R^{1000 \times 20}\), where 20 is the alphabet size of the amino acid characters.
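
A minimal sketch of this encoding step follows; the ordering of the amino acid alphabet is an arbitrary choice.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"            # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def encode(sequence, max_len=1000):
    # One-hot encode the sequence and zero-pad it to a fixed length of 1000.
    x = np.zeros((max_len, 20), dtype=np.float32)
    for pos, aa in enumerate(sequence[:max_len]):
        if aa in AA_INDEX:                    # unknown characters remain all-zero
            x[pos, AA_INDEX[aa]] = 1.0
    return x                                  # positions beyond the sequence stay zero-padded
```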

Hyperparameter setting

The hyperparameters used in the experiments are listed in Table 2. Parameters for the convolutional layer were identical to those of DeepFam. However, we did not employ a dropout operator, since the embedding vectors need to be learned consistently during training. In addition, the dimension of the embedding vectors needs to be set in advance; we selected it using the validation dataset according to accuracies on sub-subfamily level classification. The weights for the softmax and center loss terms were also set separately for the family, subfamily and sub-subfamily phases.

Table 2 List of hyperparameters.

Implementation details

Neural network models were implemented with Python 3.6.8 and PyTorch 1.1.034, an open-source deep learning library. The dataset was split into training, validation and test sets with ratios of 0.8, 0.1 and 0.1, respectively, for each subfamily class. We selected the model with the best performance on the validation set, and performances were measured in a 10-fold cross-validation scheme. Only the embedding vectors from the test dataset were used for evaluation and visualization during the experiments.

Results

Evaluation on embedding distances

Inspection on silhouette scores

Figure 2 shows the silhouette score calculated for each method. In the figure, DeepHier and DeepHier w/o multiple branch are the only models equipped with a center loss that constructs metric distances in the deep feature space during training. The distinctive result is that these models show non-negative silhouette scores at all three class levels, whereas vectors from the other models present non-negative values only at one level and negative or near-zero scores at the others. This demonstrates that the center loss over the metric space helped the neural network build effective distance relations for GPCR families.

Figure 2. Evaluation on embedding distances based on clustering results. (a) Silhouette score of representation vectors regarding true class labels on three hierarchical levels. (b) Adjusted mutual information (AMI) score calculated for class labels at each level versus k clusters yielded from the hierarchical clustering algorithm.

Vectors from the Autoencoder and the k-mer frequencies, which do not use any class information during training, show low scores at all levels. This result shows that label information was a powerful supervisory signal for training. Although DeepFam and DeepPPF achieved remarkable classification performance in their original works17,18, they did not show competitive clustering quality. These results indicate that the center loss in DeepHier effectively constructed a meaningful embedding function compared to models that do not address metric distances during training.

We now compare the two models that adopted center loss during parameter updates. Although the model without multiple branches shows positive scores at all three levels, our model shows significantly higher scores. In summary, the two proposed components for metric learning, multiple branches and center loss, were successful in learning compact representations at all levels.

Inspection on AMI scores

Changes in the AMI score as a function of the number of clusters are illustrated in Fig. 2. The AMI score is maximized when the number of clusters produced by the algorithm approaches the real number of labels at that class level, 5 for family and 37 for subfamily. In other words, the hierarchical clustering results correspond best to the true labels when the target cluster number is close to the actual number of class labels, demonstrating that the distance relations between embedding vectors match the hierarchical class structure well. Similar to the silhouette score experiments, some algorithms achieved comparable AMI scores in the sub-subfamily level clustering evaluation. However, DeepHier shows notably higher scores at the family and subfamily levels. In fact, our model is the only method whose maximum AMI score exceeds 0.7 at all class levels. This again supports that DeepHier successfully incorporates information from all three class levels into a single unified embedding vector. In addition, the comparison between DeepHier and DeepHier without center loss demonstrates that the center loss in our approach makes the embedding more compact.

Analysis on the hierarchical structure

Overall distance structures in the embedding space

In Fig. 3, the distance matrix was generated using pairwise Euclidean distances between the embedding vectors. Protein sequence vectors were clustered using a hierarchical clustering algorithm, the unweighted pair group method with arithmetic mean (UPGMA)35. To visualize the hierarchical clustering result, columns were ordered according to the clustering results, whereas rows were sorted according to the family, subfamily and sub-subfamily class labels. The clustering results are shown with a dendrogram in the figure. Each of the three-line color bands on the columns and rows corresponds to the hierarchical class label of each sequence; thus, the correspondence between the clustering results and the true class information can be revealed by comparing the color bands on columns and rows.

Figure 3. Embedding vectors on hierarchical structure, distance matrix and phylogenetic tree. (a) Distance matrix of embedding vectors from DeepFam. (b) Distance matrix of embedding vectors from DeepHier. (c) Phylogenetic tree reconstructed from DeepHier.

Hierarchical relations between sequences are apparent in the embedding space from DeepHier. The overall pairwise distances in the heat map show that the embedding vectors clearly captured three distinctive distance ranges among GPCR sequences: the brightest (above 1.6), the middle range (1.2–1.6) and the darkest (up to 0.5). These distinctions in color level make the separations between data points clearly visible. Comparing the boundary regions with the true label information represented by the color bands on the columns and rows, the color distinctions correctly correspond to the label information of the sequences. For the representations from DeepFam, although local clusters of proteins exist, no hierarchical structure between distances is observed, and the hierarchical clustering results do not match the real class hierarchies. This demonstrates that even though DeepFam modeled sub-subfamily classes in GPCR protein families fairly well17, the hierarchical structure of the proteins was not properly incorporated. In contrast, by employing center loss and multiple branches, DeepHier successfully constructed distinctive hierarchical relations with distances in the embedding space. We further provide t-SNE visualizations36 of the embedding vectors from DeepHier in the supplementary material.

Phylogenetic tree reconstruction

The phylogenetic tree reconstructed from the embedding vectors is shown in Fig. 3. In the tree, each family-level label is represented with a distinct color, and a branch was assigned the color of a family when more than 70% of the leaves in the branch belong to that family. In addition, as in the distance matrix, we overlay three color rings whose colors correspond to the classes at each level of the hierarchy. In the phylogenetic tree, sequences with the same family label were clustered at close positions. In the second and third rings, which represent the subfamily and sub-subfamily levels respectively, sequences belonging to the same subfamily and sub-subfamily were also positioned at close locations in the tree. In sum, a phylogenetic tree constructed from the single embedding space groups GPCR sequences closely at the family, subfamily and sub-subfamily levels.

Motif discovery

To visualize the discovered motifs, WebLogo, a web-based visualization tool37, was used in Fig. 4. From the clustering results with a target cluster number of three, the conserved sequences DRY and NSY were found to be distinctive features of the first cluster. This cluster contained 519 sequences, of which 507 belonged to family A of the GPCR protein family. In fact, the discovered motifs in this cluster are the most characteristic sequence features of family A, the Rhodopsin GPCR family1,38,39. Unlike the first cluster, however, alignments of the other clusters do not reveal the conserved DRY sequence, which corresponds to our knowledge that DRY and NSY are distinctive features of Rhodopsin-like GPCR proteins. Likewise, other significant motifs were found in other clusters. In the fourth cluster among five clusters, the conserved sequences LIGWG, GPVLASLL and CFLxxEVQ were discovered. These sequences belong to the conserved regions in the transmembrane structures of family B, the Secretin receptor family of GPCRs40. Indeed, this cluster consists of 46 sequences, of which 44 belonged to family B GPCR proteins. At deeper hierarchical levels, RKAAKTLG and FKQLHXPTN were found to be conserved in the 48th cluster among 55 clusters. These features are known to be representative motifs of the Trace amine sub-subfamily, which belongs to family A of GPCR proteins, and indeed all the sequences in this cluster are from the Trace amine sub-subfamily.

Figure 4. Discovered motif logos for each cluster on the phylogenetic tree generated from hierarchical clustering results. Cluster information as well as information on the sequences included in the cluster is provided. Cluster 3-1 denotes the first cluster from the hierarchical clustering when the target number of clusters is 3.

Query search speed on embedding space

We measured the execution time of the similarity search algorithms. First, preparing the database with DeepHier requires embedding all sequences in the training dataset, nearly 6300 sequences, through the deep learning model. Second, the time spent searching the database must be measured. We therefore compared execution times for database construction and database search. For DeepHier, database construction took approximately 12.7 s; in contrast, constructing the BLAST database with the makeblastdb command took 368 ms. For similarity search on the embedding space, a single query took approximately 19 ms on average when the database comprised 6300 sequences, including the time spent embedding the query sequence with the neural network. In contrast, BLAST took 266 ms on average.

Figure 5. Execution time measured for query search algorithms. (a) Database construction time and query search time for the BIAS-PROFS GPCR dataset. (b) Time required for query search for different sizes of sequence database.

Constructing the database in our approach is slower than in the conventional alignment-based approach, since DeepHier requires loading the neural network parameters and embedding all of the training sequences. However, once the embeddings of the database sequences are constructed, this process does not need to be repeated; we therefore believe that the slower database construction is not a bottleneck for similarity search. For the execution time spent on the similarity search itself, our method requires only approximately 7.1% of the time taken by BLAST.

To assess the scalability of the similarity search as the database grows, we measured the execution time while varying the database size. BLAST shows linearly increasing execution time with the growth of the database (Fig. 5), whereas similarity search on the embedding vectors from DeepHier shows nearly constant execution time. Since the embedding vectors lie in a low-dimensional vector space, the distances can be computed with simple matrix operations, which yields this scalability. These results suggest that similarity search with our method can be a scalable alternative to BLAST for the ever-increasing volume of biological sequence data.

In addition to the execution time, we estimated the accuracy of the search algorithms. DeepHier achieved classification accuracies of 0.985, 0.875, and 0.797 at the family, subfamily, and sub-subfamily levels, respectively, whereas query search with BLAST achieved accuracies of 0.989, 0.862, and 0.777. Except at the family level, our method outperformed the alignment-based algorithm in the query search task. In sum, our method enables similarity search on biological sequences with comparable accuracy even without alignment-based computations.

Discussion

We propose a deep learning based embedding function, DeepHier, that simultaneously learns significant protein features at the three hierarchy levels of GPCR families. Sequence features in GPCR proteins are learned by the proposed model, which consists of three-branch classifiers and a novel loss term. The main advantage of simultaneously embedding the hierarchy levels is that distances in the embedding space are directly correlated with the hierarchical relations between proteins, which lie at the heart of GPCR studies. This is significant in that the phylogenetic relationships of GPCR protein families can be modeled with the embedding vectors learnt by our deep learning model. In sum, we believe the proposed approach presents a novel and significant strategy for utilizing deep learning in analyzing GPCR protein sequences.

Until recently, biological sequence analyses relied heavily on alignment-based algorithms. Recent deep learning works, however, demonstrated that neural networks can model biological sequences without loss of performance17. Nevertheless, a number of downstream tasks, such as sequence query search and phylogenetic analysis, were still mostly performed with multiple sequence alignments. In this regard, our work proposes an embedding function with which downstream analyses of biological sequences can be carried out without the expensive sequence alignment step. Throughout the experiments, we demonstrated that the embedding space effectively models the GPCR family hierarchy in a single metric space, that phylogenetic relations of the proteins can be inferred with the proposed method, and that the embedding space from DeepHier can be utilized for sequence similarity search over ever-growing GPCR databases. The experimental results indicate that several downstream analyses of protein sequences can be successfully accomplished with the embedding vectors generated by DeepHier.

In this study, we restricted the analysis of the embedding function to GPCR sequences. However, as investigated in previous works6,41, discriminating GPCR proteins from non-GPCR proteins is also of great research interest. An additional classification hierarchy that determines whether input sequences are GPCR or non-GPCR proteins could be added to the model. A previous machine learning work proposed a rejection option for classifiers based on metric distances in the feature space42. In a similar vein, negative datasets of non-GPCR proteins could be fed into the model during training so that the model learns to embed GPCR proteins close together in the embedding space.

As future work, information on the structural characteristics of GPCR receptors could be incorporated into the training process in a manner similar to the proposed work. Since structural properties are crucial for identifying the functions of a protein, such an investigation might provide an opportunity to obtain a more accurate model of GPCR proteins.