Introduction

In the field of structural biology, it is widely recognized that there is a strong relationship between the three-dimensional structure of a protein and its function 1. The recognition and analysis of structural similarities in proteins can represent a valuable strategy to gain insights into protein functions. In particular, the comparison of protein binding sites represents a challenging area of interest in the field of biology and biochemistry to improve the understanding of protein–ligand interactions, predict off-target effects, facilitate the development of more selective and effective therapeutic agents, investigate drug repurposing strategies, and explore polypharmacological treatments2. In particular, drug repurposing, also known as drug repositioning or drug reprofiling, refers to the process of identifying new therapeutic uses for existing drugs that were originally developed for a different indication3. This approach offers several advantages, including potentially shorter development timelines, reduced costs, and a higher likelihood of success compared to developing entirely new compounds4. On the other hand, polypharmacology refers to the ability of a drug or a pharmacological agent to interact with multiple biological targets within an organism5. In other words, a polypharmacological drug can affect multiple pathways, receptors, or proteins simultaneously. This is in contrast to traditional drug development, where the focus is often on designing drugs that specifically target a single molecular target associated with a particular disease. Since several diseases involve multiple molecular targets, polypharmacology represents an interesting modality in terms of efficacy and adaptability to complex biological environments6.

The present work is inserted in this context and aims to develop a novel computational pipeline to pinpoint protein binding sites sharing similarities with an investigated target of interest. The identification of binding pockets with similar amino acid patterns can indicate potential off-target proteins for a given ligand, which can help in the design of drugs with minimal side effects. Additionally, the characterization of amino acid arrangements in binding pockets can contribute to the development of structure-based drug design methods to engineer drugs targeting specific protein families or selective ligands for individual proteins. This level of selectivity can be essential in the treatment of diseases, particularly when multiple proteins share similar functions but are involved in distinct physiological processes. As an example, we recently employed an embryonic version of the workflow presented here to understand the druggability of a query-binding site searching for similar motifs in proteins able to bind ligands of interest7.

In the past years, several computational methods quantifying the global or local similarity of protein cavities have been developed2,8. In general, most of these methods share three main steps: (i) three-dimensional analysis of the structures of interest; (ii) structure comparison; and (iii) quantification of similarity through a metric (a scoring function). Many different representations of a given binding site are possible with varying degrees of retained information, e.g. the type of amino acid residues that interact with the ligand, or representing the binding site through a surface onto which the physical–chemical characteristics are projected, and even considering protein–ligand interactions. The first two methods can be regarded as structure-based, i.e. they stem from observing the structure of the protein. As far as the actual comparison strategies are concerned, these can be (a) graphical-theoretical approaches, where the maximum common subgraph is searched; (b) fingerprint approaches, where the shapes involved in the binding site are considered; (c) approaches based on labelled 3D points and geometric hashing, i.e. 3D transformations that align pairs of structures. Furthermore, comparison algorithms may or may not depend on the alignment of the structures of interest. Comparison methods that rely on residues can use graphs, fingerprints, or alternative approaches. In particular, the comparison reveals the similarity between the residues, the type of residues, and the atomic composition; also, such methods perform well where the sequence and atomic position of the structure of interest are well preserved. Those that rely on surfaces can instead use graphs or labelled 3D points for comparison. These methods are particularly used when dealing with binding sites in proteins that do not show significant conservation in residues, atomic composition, orientation, or folding, but show considerable selectivity towards common ligands. Indeed, in these cases, the distribution of the properties on the surface of the binding site and the shape of the binding site are determining factors for the selectivity of the ligands. And finally, methods that rely on interactions can use graphs or fingerprints for comparison2. A summary of the main similarity search methods available in the literature is reported in Table 1 and a detailed comparison of the existing methodologies is available in recent litereature2.

Table 1 Summary of the main methods for the similarity search of a protein binding site in previous literature with their relative type of representation of the binding site and strategy of comparison.

The possibility of screening a high number of proteins for similar binding pockets can be particularly helpful and fruitful for those complex mechanisms and processes which may involve similar receptors and ligands for very different functions. In this context, we decided to turn our attention and use the proposed pipeline to explore some of the main actors involved in taste perception, due to the strong relationship between food intake and homeostasis regulation, disease onset, immune response and metabolism. The sense of taste is a sensory modality that plays a fundamental role in discriminating ingestible substances and nutrients from potentially harmful substances that must be avoided, especially in omnivorous species given the range of their feeding strategies48. Humans, in particular, can perceive five primary taste qualities, i.e. sweet, umami, bitter, salty, and sour, through the interaction of molecules contained in food and specialized proteins, namely taste receptors, located on the papillae of the tongue. However, taste receptors are also expressed in other tissues besides the oral cavity, including the skin49,50, brain51, pancreas52,53, heart54,55, urethra56, airway57 and gastric58 smooth muscle cells. Moreover, taste receptors are not only involved in the gustatory function, but they participate in other regulatory activities, such as regulation of metabolic activity59,60, innate immune response and bronchodilatation59,61, diabetes and obesity, glucose level maintenance, appetite regulation, as well as hormone release62, and muscle contraction/relaxation63.

In the present work, we decided to focus our attention on the bitter taste perception and relative actors, given the high scientific output produced in recent years concerning this specific taste sensation. Bitter taste receptors are the proteins responsible for the recognition of bitter foods, normally associated with potentially harmful substances. From a structural point of view, bitter taste receptors are GPCRs belonging to the taste 2 receptor family (TAS2Rs)64 and are characterized by seven transmembrane helices (TMD) connected by three Intracellular Loops (ICLs) and three Extracellular Loops (ECLs)65. The structural core of the 7 TMD bundle is conserved across class-A GPCR and TAS2Rs. This core plays a fundamental role in the ligand binding in the extracellular (EC) region and information transduction in the intracellular (IC) region66. Bitter taste receptors can be activated by a multitude of different agonists through various interaction types in their unique orthosteric binding pocket67,68. Based on the chemical heterogeneity of their agonists, TAS2Rs have been distinguished into promiscuous, such as TAS2R10, TAS2R14, and TAS2R46, which are activated by a variety of chemically diverse compounds, and selective, activated instead by a limited number of similar compounds69.

In the present work, we herein propose a novel, general and automatic algorithm, named VirtuousPocketome, to screen databases of protein structures to identify amino acid patterns that are similar to the ones forming the ligand binding site of a query receptor. Compared to previous literature, the novelty of this work resides in three main aspects: (i) the proposed pipeline accounts for the dynamics of the protein–ligand interaction by considering multiple binding site configurations obtained from a molecular dynamics (MD) trajectory; (ii) the identification of the crucial protein–ligand binding interactions is completely automatic; (iii) the results of the similarity search are filtered using an ad-hoc multi-step filtering process to exclude patterns unlikely to bind the ligand of interest. The algorithm builds structural motifs of the target binding site of the query receptor-ligand complex after clustering a molecular dynamics trajectory and then searches for similar patterns inside a specific protein database using the ASSAM code31 followed by additional ad-hoc filtering steps. We applied our computational pipeline to screen the currently solved human proteome for proteins that exhibit a highly similar local amino acid pattern to the one lining the strychnine binding site in the human TAS2R46 bitter taste receptor. The rationale of the work is to explore the taste transduction pathway with a proteomic perspective to elucidate the possible role of tastants beyond the mere taste perception and to investigate whether other classes of proteins have a conserved ability to recognize such ligands, with possible implications in nutrition, homeostasis, and disease.

Results

Details concerning convergence of the MD simulations and structural equilibrium of investigated molecular models are reported in the Supplementary Information (see also Figs. S1 and S2). The last 50 ns of each simulation replica were considered as structural equilibrium and were concatenated to obtain a final 150 ns-long trajectory representing the ensemble of protein conformations. These concatenated trajectories were used for the subsequent analysis and to search for similar binding pockets within the human proteome through the pipeline described herein.

Motifs creation

In the first step of the proposed pipeline, the motifs composed of the most important protein residues interacting with strychnine were identified. In particular, starting from the above-mentioned ensemble trajectory, residues within 10 Å from the position of the ligand have been extracted and their conformations clustered using the K-Means algorithm from MDAnalysis70. Three clusters were identified as the optimum number through the silhouette method. After extracting the motif cluster centroids, we used PLIP to narrow down the most important non-covalent interactions which define the final three motifs. In all three of them, the ligand formed a salt-bridge interaction with residue GLU2657.39 and two hydrophobic interactions with residues TYR853.29 and TRP883.32, highlighting the stability of these contacts throughout the MD simulation. Furthermore, additional hydrophobic interactions were found with residues THR692.64, PHE2526.58, and PHE2617.35 for motif 0 and with VAL612.56, and VAL2496.55 for motif 2 (Fig. 1). The three defined motifs were used for the subsequent similarity search step with ASSAM.

Figure 1
figure 1

Main interactions defining the three motifs for the bitter taste receptor interacting with strychnine. (A) Representative snapshot of the bitter-ligand complex, (B) PLIP interaction analysis identifying Hydrophobic Interaction (HI) and Saltbridge Interaction (SB), (C, D, E) site views of the three motifs identified. The bitter taste receptor is represented in grey, the strychnine in blue and the interacting residues in green (hydrophobic interactions) and purple (salt bridges).

To check the importance of the identified residues, we also evaluated the interaction probability of protein residues with strychnine during the entire concatenated trajectory (500 ns × 3 replicas = 1500 ns). We calculated the interactions between the ligand and the receptor in each frame of the simulation (one frame every 200 ps) and we divided the total occurrences of a specific interaction by the total number of frames, obtaining an interaction probability. In line with the results on the cluster centroids, residues TYR85, TRP88, and GLU265 demonstrated the highest values with interaction probability higher than 80% (see also Table S1 and Fig. S3). Interestingly, some weak interactions with the ligand in the initial experimental structure, such as those with residues ASN65 and THR180, are lost during the molecular dynamics simulation.

Similarity search and multi-step filtering

The similarity search against the entire human proteome consisting of 58,972 structures (see the Database Curation section) resulted in a total of 6718 hits using the ASSAM code. The subsequent steps, i.e. the SASA and docking filtering steps, yielded a total of 1852 and 257 hits respectively (Fig. 2A). In detail, we adopted a SASA threshold of 0.75, thus preserving all protein hits having a SASA in the binding pocket of at least 75% of the SASA of the original query binding site; on the other hand, we chose a docking threshold equal to 0.1, thus keeping the hits whose docking score would not differ by more than the 10% from the docking score of the query protein/ligand complex. The best hit in terms of docking score after the multi-step filtering process, namely PDB 3A4S, is represented in Fig. 2B, highlighting the correspondence between the original motifs in the bitter taste receptor and the relative matching residues in the identified protein.

Figure 2
figure 2

(A) Number of total structures in the original human proteome database and the number of selected hits from the similarity search and the subsequent multi-step filtering. (B) Binding site view of the strychnine bound to the best hit (PDB: 3A4S) according to the docking score at the end of the multi-step filtering process. Protein is rendered in grey, strychnine in blue, residues in the original motif of the bitter taste receptor in red and matching residues in the 3A4S structure in violet.

The pipeline automatically generated a report file (see ‘Results.txt’ file in the Supplementary Material) which includes the list of all the hits at the end of the screening process with additional information, such as PDB IDs with doi of the relative publication, protein classes, shared residues between hits and query, SASA and Docking Scores. The complete list of the 257 retrieved protein hits is reported in Table S2.

Functional enrichment and signaling pathway analyses

VirtuousPocketome retrieved 145 unique Uniprot IDs relative to the previously identified hits, meaning that multiple PDBs at the end of the multistep filtering process corresponded to the same protein. The DAVID software was then employed to analyse the Gene Ontology terms and the signalling pathways data as described in the “Methods” section.

The functional enrichment analysis revealed that the input genes were significantly enriched for a total of 16 GO terms in the Cellular Components category, 10 terms in the Biological Processes category and 14 terms in the Molecular Functions category, based on the corrected p-value. In addition, 0 KEGG and 0 Reactome pathways were found to be significantly enriched at the same p-value threshold. The best 5 GO terms for each of the above-mentioned categories are represented in Fig. 3, whereas all the significantly retrieved GO terms are represented in the Supplementary Information (Figs. S4 and S5). Regarding the Biological Processes (BP), most of the retrieved genes are related to metabolic processes (40.9%), including organic substance, cellular and nitrogen compound metabolic processes. Besides, the most represented Molecular Functions (MF) are related to the binding of different species, such as proteins, small compounds or ions, and to enzyme activity, such as transferase, hydrolase, and oxidoreductase. Finally, regarding the last analysed GO term, the most represented Cellular Components (CC) are cytoplasm and membrane (39.0%).

Figure 3
figure 3

Bar plots representing the best 5 retrieved GO terms for each category in the third level of the GO hierarchy relative to Biological Processes (BP), Molecular Functions (MF) and Cellular Components (CC).

Discussion

Herein, we developed a novel computational pipeline, named VirtuousPocketome, to screen a desired database for proteins sharing similar binding sites with a specific protein of interest. VirtuousPocketome is based on four major steps: (i) the motifs creation step, which identifies the most important residues involved in the interaction of the protein–ligand complex under investigation; (ii) the similarity search step, in which the human solved proteome is screened for similar motifs compared to the ones retrieved in the previous step; (iii) the multi-step filtering step, which preserves only the protein hits with binding sites effectively accessible and with a certain docking affinity for the query ligand; (iv) functional enrichment and signalling pathway analyses, which identifies the most relevant cellular components, molecular functions, biological processes and signalling pathways related to the protein hits identified in the previous steps. VirtuousPocketome takes as input the molecular structure (and eventually also the molecular dynamics trajectory) of a protein–ligand complex and gives as output a list of PDB structures sharing similar binding sites. The developed protocol is automatic and unbiased and can be applied to any protein–ligand complex.

To evaluate the developed code and prove its potential, we here selected the human TAS2R46 bitter taste receptor bound to a bitter ligand, i.e. strychnine and we screened the entire human proteome to search for similar structural motifs. The choice for this molecular system was driven by the recent experimental determination of the TAS2R46-strychnine complex structure71, as well as the fact that strychnine is experimentally known to target not only TAS2R4667 but also other bitter taste receptors, such as TAS2R1072, and even other proteins73. Moreover, bitter taste receptors have been widely investigated through computational molecular modelling in past years, ensuring enough data for comparison and evaluation of some of the results of the platform68,72,74,75,76,77. Furthermore, the strong relationship between food intake and health status, including the regulation of homeostasis and metabolism, makes the chosen molecular machinery a particularly intriguing and relevant testbed for our computational screening pipeline to pinpoint possible secondary targets outside the gustatory system for food-related tastants.

The first step of the proposed pipeline, i.e. the motifs creation step, applied to the TAS2R46 bitter taste receptor bound to strychnine pinpointed three major motifs of residues mostly involved in the ligand binding. In particular, all three motifs shared a salt-bridge interaction with residue GLU2657.39 and two hydrophobic interactions with residues TYR853.29 and TRP883.32, whereas the first motif comprised also hydrophobic interactions with residues THR692.64, PHE2526.58, and PHE2617.35 and the third motif with VAL612.56, and VAL2496.55 (Fig. 1). Interestingly, some of these residues have already been suggested by previous literature to be important interactions for ligand binding of bitter taste receptors. In detail, GLU2657.39 and TRP883.32 were demonstrated to be pivotal in TAS2R46 activation by strychnine71. Moreover, mutagenesis studies have reported the importance of residue GLU2657.39 for for agonist responsiveness of TAS2R4667 and the same position has been also linked to ligand binding for similar receptors, including hydroxytryptamine (5-HT) receptors78,79, adrenergic receptors80,81, purinergic receptors82,83, and cholecystokinin-B (CCK-B)/gastrin receptor84. Moreover, residue TRP883.32 is widely conserved among TAS2Rs and was found to be crucial in the activation of TAS2R43, TAS2R30 and TAS2R4667,85. The importance of residue in position 3.29 (TYR85 for TAS2R46) was confirmed also for TAS2R10, which shows a similar binding site to TAS2R46 and is also activated by strychnine72. These pieces of the literature confirmed the reliability of the motifs creation step of the proposed pipeline to pinpoint the most important and relevant residues involved in the ligand binding. It is worth mentioning that the possibility of analysing a molecular dynamics trajectory allows the identification of multiple motifs, three in the present case, one for each identified cluster centroid, ensuring a more exhaustive sampling of the protein–ligand interactions.

Starting from the identified motifs, similar amino acid patterns were searched in the entire currently solved human proteome, consisting of 58,972 structures. The similarity search using the ASSAM code resulted in 6718 hits. Then, the multi-step filtering reduced the number of detected sites to a total of 257 PDB structures, which represent 0.44% of the original database and 3.83% of the structures in ASSAM code output (Fig. 2A). It is worth mentioning that in the final set of 257 hits notable entries include the potassium voltage-gated channel and acetylcholinesterase. The potassium voltage-gated channel is known to be modulated by strychnine (https://pubchem.ncbi.nlm.nih.gov/bioassay/588834). Although there is no experimental assay in the literature for the binding affinity between strychnine and acetylcholinesterase, strychnine is recognized to bind to the acetylcholine receptor, which binds acetylcholine similarly to the acetylcholinesterase86,87. The presented approach was revealed to be effective in discarding a high number of spurious proteins whose matching motifs were unlikely to bind strychnine, due to insufficient solvent exposure or low predicted affinity obtained from molecular docking. It is essential to highlight that while this pipeline can identify structures with amino acid patterns similar to a specified binding pocket, the actual ability of the ligand to interact with the identified proteins needs validation through experimental methods. This study is presented as a screening platform designed to pinpoint proteins that are more likely to be promising targets for the given ligand, with the understanding that experimental testing is crucial for confirmation.

The functional enrichment analysis allowed us to pinpoint the main biological processes, molecular functions, and cellular components related to the gene expressing the retrieved hit proteins at the end of the VirtuousPocketome pipeline (Fig. 3). Most of the biological processes highlighted are connected to metabolic processes, which seems an intriguing result considering the strong relationship between taste perception, food intake and metabolism. Indeed, these results might indicate that strychnine has not only the ability to activate the TAS2R46 bitter taste receptor and elicit the bitter taste sensation, but it should be also the potential trigger or modulator of proteins directly involved in the metabolic processes. Moreover, the protein hits are mainly involved in molecular functions related to the binding of proteins or other small compounds and are localized mainly in the cytoplasm and membrane. These results seem pretty reasonable considering that TAS2R46 is a transmembrane protein and a promiscuous bitter taste receptor, able to bind a wide spectrum of chemical compounds. In light of these results, it is worth mentioning that strychnine is known to bind glycine and acetylcholine receptors73, which are membrane proteins involved in transport and signalling functions. Therefore, the presented pipeline was able to detect similar binding sites in receptors with the same localization and biological function of known strychnine targets. In addition, strychnine is a rather promiscuous ligand with anti-plasmodial and anti-cancer activity and other yet-unresolved molecular targets88.

In conclusion, in the present work, we developed a novel, general and automatic pipeline for identifying proteins sharing similar binding pockets with a query receptor-ligand complex by screening the entire solved human proteome. The developed pipeline is also freely accessible at https://virtuous.isi.gr/#/virtuous-pocketome and will be easily expanded in the future to other protein databases. We used the TAS2R46-strychnine complex as a testbed for the proposed method to investigate if other proteins except the taste receptors might share similar binding pockets for the recognition of tastants. The proposed methodology allows for a deep investigation of the TAS2R46-specific residues needed for the strychnine binding to pinpoint other human proteins sharing similar binding pockets and investigate the potential roles that this tastant could play in contexts beyond the gustatory system. The retrieved proteins could be further analyzed to predict whether the interaction with the compound of interest results in their activation or modulation. This will increase our understanding of possible secondary effects of tastants beyond the mere taste perception and their impact on the biological processes and molecular functions in which the retrieved hit proteins are involved. This approach can also assist the design of specific foods and ingredients to develop personalised treatments able to target desired proteins or receptors involved in specific processes or diseases with the ultimate goal of accessing the potential of the diet as a supplement to traditional pharmacological treatments.

Methods

VirtuousPocketome workflow

Overall workflow

The proposed computational pipeline requires as mandatory inputs the coordinates (PDB file) of a protein–ligand complex and the chain label(s) uniquely identifying the receptor(s) and the ligand in the provided structure. In order to allow the ligand to accommodate in the binding site and to assess the dynamic nature of the protein–ligand binding, the user can specify a molecular dynamics trajectory (GROMACS xtc/trr/pdb file), thus considering several configurations of the investigated complex. Moreover, additional custom parameters can also be defined by experienced users if needed (details in the following).

At present, the code is limited to the screening of the entire human proteome, but it can be easily expanded to any PDB-like database.

The overall workflow of the algorithm designed in the present work is divided into four main steps, each of which will be described in detail in the following paragraphs: Motifs Creation, Similarity Search, Multi-step Filtering, and Enrichment and Signaling Pathway Analyses.

The main output consists of a Txt file collecting the PDBs of the identified proteins sharing similar accessible binding site(s) compared to the query receptor-ligand complex. Additionally, the code provides as output the list of unique UniProt IDs related to the retrieved proteins, plots summarising the main results (see the “Results” section), and visualisation states of the molecular system under investigation.

The overall workflow is represented in Fig. 4 in a flow chart representation.

Figure 4
figure 4

Flow Chart of the overall workflow of VirtuousPocketome.

For the implementation of this workflow, we mainly used GROMACS functionalities89, MDAnalysis modules70, and pdb-tools90. The ASSAM source code was kindly provided by Prof. M. Firdaus Raih from the Molecular Function Regulation Lab (http://mfrlab.org).

Step 1: motifs creation

Starting from the provided PDB input file and the MD simulation (if present) of the protein–ligand complex, a list of residues defining the binding site is retrieved based on the distance between the ligand and the receptor. The default distance threshold is set to 10 Å, but this can be overridden with a custom value provided by the user as an additional parameter. If the user only provides a single PDB file of the protein–ligand complex, a single binding site is defined according to the chosen distance threshold between the ligand and the protein. Conversely, if the user provides also the MD trajectory, the obtained subset of coordinates from each frame of the simulation identifying the ligand and the residues of the protein binding site are clusterized with the K-Means algorithms implemented in MDAnalysis, which is based on the distances between atoms positions70. The user can set the desired number of clusters (k), otherwise, the optimal number of clusters is retrieved using the silhouette method ranging from a minimum of 2 clusters up to a maximum of 12 clusters. The silhouette method, as provided by the built-in scikit-learn function (sklearn.metrics.silhouette_score), is an example of an evaluation metric to indicate if clusters are well-defined. This method computes the mean distance between a sample and all other points within the same class (a) and the mean distance between a sample and all other points in the nearest neighbouring cluster (b). The Silhouette Coefficient, s, for a single sample, is then calculated as reported in Eq. (1):

$$s=\frac{b-a}{{\text{max}}\left(a,b\right)}$$
(1)

Then, the total Silhouette Coefficient is calculated as the mean of the Silhouette Coefficient for each sample. A higher Silhouette Coefficient score is indicative of denser and well-separated clusters, aligning with the conventional understanding of a cluster. It suggests that the samples within each cluster are tightly packed and distinct from samples in other clusters. This reflects a higher degree of cohesion within clusters and a greater separation between them. Therefore, the algorithm calculates the silhouette coefficient for different numbers of clusters (from 2 to 12) and then chooses the optimal number of clusters according to the best-achieved silhouette coefficient score. The subsequent centroids of the clusters are saved.

Starting from the previously defined binding sites, the subset of residues forming the binding site is further refined, highlighting those residues involved in non-covalent interactions with the ligand. These interacting residues are retrieved using the Protein–Ligand Interaction Profiler (PLIP) software91, which detects hydrogen bonds, hydrophobic contacts, pi-stacking, pi-cation interactions, salt bridges, water bridges, metal complexes, and halogen bonds between ligands and targets. This two-step analysis allows the definition of the most relevant protein residues for the interaction with the ligand. This subset of residues will be defined as motifs in the following. If only the PDB file is passed by the user, a single motif will be created; if an MD trajectory is provided, a motif will be produced for each centroid of the binding site.

Step 2: similarity search

In the second step, the similarity search for each extracted motif is carried out using the ASSAM software. The fundamental principles underlying the search methodologies of ASSAM are described in previous literature31,92. In brief, the protein structure is represented as a graph, where nodes represent individual amino acid side chains and the geometric relationships between nodes form the edges of the graph. Each node is composed of two pseudo-atoms, which generate vectors that correspond to the nodes in the graph. The positions of these pseudo-atoms are strategically chosen to emphasize the functional portion of the corresponding side chain. The geometric relationships between pairs of residues are defined by distances calculated between their corresponding vectors, and these relationships are represented as the edges of the graph. If we let S, M and E denote the start, middle and end of a vector, the edges of the graph encompass five components: SS, SE, ES, EE, and MM distances, although only a subset of these distances is typically used to specify a query pattern. ASSAM employs a maximal common subgraph (MCS) approach using the Bron and Kerbosch MCS algorithm to enumerate all possible correspondences with similar protein patterns93.

The choice to utilize ASSAM was made after an extensive examination of tools documented in the literature for assessing protein binding pockets (see also Table 1). Specifically, we scrutinized various methods, emphasizing those deemed most suitable for our objectives and compatible with integration into our pipeline. We specifically looked for methods that could compare a particular binding site with a broad set of proteins in a database (customizable if necessary), be executable locally rather than solely as a web server, provide output data on similar residues between the query protein and the target protein, perform searches quickly, and not require the prior definition of binding sites for proteins in the search database. This analysis pointed ASSAM as the most appropriate for the present work. In particular, ASSAM was chosen for (i) the simplicity in defining the binding site to be searched, i.e. a coordinate file containing the residues of interest in PDB format, (ii) the possibility to screen against any desired database in PDB format, (iii) the comparably fast time to solution, and (iv) the possibility of considering and preserving both right-handed and left-handed orientations of the alpha-helices to retrieve the superpositions. A left-handed α-helical bundle superimposed onto a right-handed one is not equivalent, particularly concerning the handedness of the two groupings of amino acids. However, when it comes to chemical activity, two groupings of amino acids may exhibit the same behaviour despite having different handedness. The crucial factor, in this case, is the distance between the individual residues31. Thanks to the source code kindly provided by the ASSAM developers, the proposed pipeline does not rely on the ASSAM web-based analysis tools and runs entirely offline on-premises.

The output of the ASSAM search is formatted as a list, in which each row corresponds to a hit protein found in the screened protein database, along with its PDB accession ID, which presents a match with the residues of the input motif, also referred to as query. The matching residues between the query and the hit are also indicated, as well as the root-mean-square deviation (RMSD) value between query residues and hit residues after alignment. Finally, additional pieces of information are retrieved, namely the number of the initial conformation from which the query motif was created, and further information on the hit protein obtained from the RCSB Protein Data Bank site, such as the DOI of the corresponding publication and the EC classification.

Step 3: multi-step filtering

To further refine the output from the previous steps and select only the protein hits with accessible and high-affinity binding sites, two additional filtering steps, i.e. (i) the SASA and (ii) the docking filters, have been implemented.

The first step involves the calculation of the Solvent-Accessible Surface Area (SASA) using GROMACS89. In detail, for each hit protein, the SASA, evaluated in nm2, is calculated by measuring the value for all the residues matching the residues of the query protein’s motifs. Values equal to zero indicate that the cleft identified by the residues is not solvent-exposed, but is rather buried in the structure of the protein and therefore likely inaccessible for the solvent or the ligand. Values that are greater than zero, on the other hand, indicate that the cleft formed by the hit residues is on the surface of the corresponding protein and therefore might allow for ligand binding. Only hits with the binding site having SASA greater than a predefined threshold of the SASA value of the corresponding query motif are retained. The SASA criterion is summarised by Eq. (2):

$$\begin{array}{c}{SASA}_{HIT}>{SASA}_{query}*SAS{A}_{threshold}\end{array}$$
(2)

where SASAHIT corresponds to the calculated SASA value of the given hit motif, \({SASA}_{query}\) is the SASA value of the corresponding motif of the query protein–ligand complex, and \(SAS{A}_{threshold}\) is the threshold to select only hits with the desired SASA. \(SAS{A}_{threshold}\) is set to 75% by default, but the user can specify a custom value in the additional input parameters. The selection of this threshold was imposed as a compromise to obtain a reasonable number of protein hits that could be effectively and rationally considered while retaining targets with motifs that are solvent-exposed and prone to ligand docking. The decision to set as default a high SASA threshold is based on the importance of solvent-accessible surface area (SASA) in assessing the capability of a protein binding site to accommodate ligands. Considering only SASA values considerably lower than the original protein binding site could compromise the algorithm's ability to retain only those binding sites most similar to the reference complex and thus more prone to ligand binding. Setting the SASA threshold to higher values, on the other hand, may lead to the exclusion of potentially promising binding sites and protein hits. As a result, users have the flexibility to adjust this threshold based on their screening objectives.

The second filtering step relies on the molecular docking of the original ligand in the query complex onto the retrieved hits from the previous steps. The docking procedure was implemented using SPORES and PLANTS94,95. PLANTS software was chosen since it exhibited excellent performance in terms of pose prediction and time to solution compared to other molecular docking software96,97 and since it has been already successfully used for molecular docking and virtual screening campaigns on GPCRs98,99,100,101,102. These qualities make PLANTS particularly well-suited for the present work. Only hits with binding sites exhibiting a docking score (DSCORE) similar to or below the docking score of the original protein–ligand complex are retained. In particular, the docking score of the selected hit (\({DSCORE}_{HIT}\)) is compared with the docking score on the original query complex (\({\overline{DSCORE} }_{query}\)) adjusted by a defined percentage using the specified docking threshold (\({DSCORE}_{threshold}\)). The docking threshold is set by default as 10% of the original protein–ligand complex docking score, if only a PDB is provided as input, or the 10% of the average PLANTS docking scores between the centroids of the original protein–ligand complex, if the MD trajectory was specified. The docking filtering criterion is therefore defined by Eq. (3):

$$\begin{array}{c}{DSCORE}_{HIT}<{\overline{DSCORE} }_{query}+\left|{\overline{DSCORE} }_{query}\right|*{DSCORE}_{threshold}\end{array}$$
(3)

Step 4: enrichment and signalling pathway analyses

In this step, functional enrichment and signalling pathway analyses were performed to collect information regarding roles, functions, distributions, expressions and pathways in which the identified targets from the previous steps are involved. The pipeline automatically retrieves the unique UniProt IDs of the protein hits (since several PDB codes can correspond to the same protein) and searches for related genes using the DAVID functional annotation program103,104. We decided to focus our attention on the Gene Ontology (GO) terms105 related to the Cellular Components (CC), Biological Processes (BP) and Molecular Functions (MF). The resulting GO terms were used to identify significantly enriched functional categories, using a statistical approach that takes into account the size of the protein list, the number of genes associated with each GO term, and the background gene set. We also performed pathway analysis using the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Reactome databases to identify the most represented pathways106,107,108. GO terms and KEGG pathways with corrected for multiple testing q-value < 0.1 were considered to be significantly enriched. The correction of the p-values and the calculation of the q-values were performed using the Benjamini–Hochberg FDR adjustment method109. The pipeline automatically generates separate plots with the above-mentioned analysis.

Database curation

The developed tool can screen a database of target proteins in the PDB format for binding pocket similarities against a query protein–ligand complex. For the present work, we have collected and refined all the experimentally solved PDB structures belonging to the human proteome. We first retrieved all the solved PDB codes belonging to homo sapiens from the NCBI database110 using a dedicated Python API (Bio.Entrez package), and obtained a total of 60,159 entries (1st February 2023). We downloaded all the found queries from the RCSB database (http://www.rcsb.org/)111, reaching a total of 59,267 structures (892 PDBs were not available). Then, the database was cleaned by removing (i) the chains in the PDBs not belonging to the human organism (such as in the case of protein chimaeras), (ii) multiple models from each PDB, (iii) ANISOU and HETATM lines, (iv) alternative locations of the atoms in the PDB by preserving the ones with the highest occupancy. At the end of this cleaning protocol, we ended up with 58,972 PDB files.

Molecular modelling and dynamics

System setup

The previously described workflow was applied to search for proteins sharing similar binding pockets with a human bitter taste receptor. We employed the recently-solved TAS2R46 human bitter taste receptor bound with an agonist bitter compound, named strychnine (PDB ID: 7XP6)71. We first removed undesired molecules and structures from the PDB, preserving only the bitter taste receptor and strychnine. Since the receptor structure presents some missing residues (157–172), we downloaded the relative model (sequence P59540 of Homo Sapiens hTAS2R46) from the AlphaFold Protein Structure Database112. The AlphaFold model was then aligned to the receptor in the 7XP6 PDB. Missing residues in the original experimental structure were then built by homology modelling with MOE (Molecular Operating Environment) software113 using the relative portion of the AlphaFold model as a template, thus filling the gap in the original experimental structure. Then, the structure of strychnine was refined using MOE, assigning the correct protonation at neutral pH and salt concentration of 0.15 M (see also Fig. S6). As reported in previous literature71, the tertiary amine of the molecule is protonated and its total charge is + 1, as also reported in the DrugBank database (https://go.drugbank.com/drugs/DB15954).

The obtained complex was embedded into a POPC membrane using the CHARMM-GUI web server114. The receptor was aligned with the membrane bilayer using the OPM web server included in the CHARMM-GUI preparation protocol. The final number of POPC lipids (165) was set by a trial-and-error approach to creating a box large enough to meet the minimum image convention for the protein during the MD simulation: the final box size was set to 8.36 × 8.36 × 11.47 nm, with the z direction perpendicular to the cell membrane plane. The system box was filled with water and neutralized using NA+ and CL- ions at 0.15 M physiological concentration.

We used the AMBER forcefield to describe the molecular system: in detail, we used the Lipid-21 forcefield115 for lipids, the AMBER19SB116 for the protein, ions and water and the General Amber Force Field (GAFF2) forcefield117 to obtain the topology for strychnine, as implemented directly in the CHARMM-GUI suite.

Simulation protocol

The simulation workflow suggested by CHARMM-GUI was followed during minimisation, equilibration with position restraints and final simulation production. First, the system was energy-minimized using the steepest descent algorithm for 5000 steps. Then, six equilibration steps were performed gradually reducing the position restraints on the lipids and protein-heavy atoms (from 1000 to 0 kJmol−1 nm−1 for lipids, from 4000 to 50 kJmol−1 nm−1 for the protein backbone and from 2000 to 0 kJmol−1 nm−1 for protein side-chain heavy atoms). The system was equilibrated in the NVT ensemble for 250 ps with a conservative timestep of 1 fs, using the Berendsen thermostat118 with a coupling time constant of 1 ps and a reference temperature of 303.15 K, which is above the phase-transition temperature for POPC, and subsequently in the NPT ensemble for 125 ps with the same 1 fs timestep, followed by a further simulation of 375 ps with a 2 fs timestep, using the Berendsen thermostat with the same parameters as before and the Berendsen barostat118 with semi-isotropic pressure coupling at 1 atm with a coupling time constant of 5 ps. Overall, the systems underwent 750 ps of equilibration.

The TAS2R46 bitter taste receptor system was then simulated for 400 ns without position restraints with a 2 fs time step in the NPT ensemble. Temperature coupling was done with the Nose–Hoover algorithm119, whereas, pressure coupling with the Parrinello-Rahman algorithm120. The PME algorithm was used for electrostatic interactions with a cut-off of 0.9 nm. A reciprocal grid of 72 × 72 × 96 cells was used with 4th-order B-spline interpolation. A single cut-off of 0.9 nm was used for Van der Waals interactions. LINCS (LINear Constraint Solver) algorithm for h-bonds121 was applied in each simulation step.

Three simulation replicas were performed to ensure the reproducibility of the simulations and to enlarge the simulation statics. All Molecular Dynamics (MD) simulations were performed using GROMACS89 and the Visual Molecular Dynamics (VMD) package was employed for the visual inspection of the simulated systems122.