Abstract
G protein-coupled receptors (GPCRs) are a large superfamily of cell membrane proteins that play an important physiological role as transmitters of extracellular signals. Signal transmission through the cell membrane depends on conformational changes in the transmembrane region of the receptor, which makes the investigation of the dynamics of these regions particularly relevant. Molecular dynamics (MD) simulations provide a wealth of data about the structure, dynamics, and physiological function of biological macromolecules by modelling the interactions between their atomic constituents. In this study, a Recurrent Neural Network (RNN) model, namely Long Short-Term Memory (LSTM), is used to predict the dynamics of two GPCR states and three specific simulations of each one, along their activation path and focussing on specific receptor regions. Active and inactive states of the GPCRs are analysed in six scenarios involving the APO, Full Agonist (BI-167107) and Partial Inverse Agonist (carazolol) forms of the receptor. Four Machine Learning models of increasing complexity in terms of neural network architecture are evaluated, and their results discussed. The best method achieves an overall RMSD lower than 0.139 Å, and the transmembrane helices are the regions showing both the minimum prediction errors and the minimum relative movements of the protein.
Introduction
G protein-coupled receptors (GPCRs) are a large and diverse superfamily of eukaryotic cell membrane proteins. They are receptors for a large diversity of extracellular signals including light, pressure, chemical ligands, neurotransmitters and metabolites, among others1,2,3,4, and play an important physiological role as transmitters of extracellular signals to the cell5.
Due to their participation in a wide range of activation pathways and important biological processes and because of their high affinity binding to drugs, GPCRs have become a prime research concern in pharmacology and a major target for drug discovery6. In fact, approximately 34% of all the drugs approved by the US Food and Drug Administration5 target GPCRs with the aim of either activating (agonist) or deactivating (antagonist) the receptor7.
The functionality of proteins is determined by their 3D structural configuration, which varies according to the binding processes of orthosteric and allosteric ligands, the lipidic environment and post-translational modifications8. These sources of variability elicit dynamical changes in the GPCR that result in the generation of specific signals.
Understanding these signal transmission mechanisms in the receptor would provide a key to drug development and testing. The GPCR structure includes seven trans-membrane (TM) helices, linked by intra-cellular loops (ICL) and extracellular loops (ECL). All these regions play a role in the activation process9, but the TM regions are of particular importance10, as they have to undergo a conformational change to transmit the signal through the cell membrane.
Molecular dynamics (MD) simulations provide a wealth of data about the structure, dynamics, and physiological function of biological macromolecules by modelling the interactions between their atomic constituents. The computer-assisted analysis of MD simulation data should allow the study of the receptors dynamic behavior, particularly in their interaction with drugs. Machine Learning (ML) tools can be particularly efficient in such endeavours.
This study investigates the ability of a recurrent neural network (RNN) model, namely Long Short-Term Memory (LSTM)11, to predict the dynamics of two GPCR states and three specific simulations of each one, through their activation path. Most importantly, the relative relevance of different regions of the receptor (TM, ECL, and ICL) for this prediction is also estimated as part of the analysis. A unidirectional LSTM (ULSTM) and a bidirectional LSTM (BLSTM) are used to predict the path trajectories for three types of simulations in the active and inactive states of the \({\upbeta }\)-2 adrenergic receptor (\({\upbeta }\) 2AR) GPCR. More specifically, these simulations are analysed for the 2RH1 (inactive state) and 3P0G (active state) structures, both with APO, Full Agonist (BI-167107) and Partial Inverse Agonist (carazolol). In addition, the LSTM variants are compared to other architectures, such as Random Forest (RF) and chains of Artificial Neural Networks (ANN), namely Convolutional Neural Networks (CNN) with LSTM (CNN-LSTM).
Advances in biotechnology, X-ray crystallography, and cryoelectron microscopy (cryo-EM) in recent years have generated an exponential increase in available GPCR simulation data, easing GPCR analysis, visualisation, and data-driven experimental design5. Machine Learning can be used as a tool to extract knowledge from complex data, and different ML models have successfully been applied to many areas in proteomics, including the MD domain. They have been applied, for instance, to the study of protein pocket dynamics12, to enhance sampling13,14, and to generate new digital structures15,16. Some studies have used ML methods to identify different biological function states from MD conformations to explain the allosteric mechanism. For example, Fleetwood et al. (2020)17 used ML and statistical approaches (Principal Component Analysis, Random Forest, Autoencoder, Restricted Boltzmann Machine, and Multilayer Perceptron) to analyse conformational changes within soluble proteins and ligand binding to a GPCR. Zhou et al. (2018)18, in turn, used Decision Trees and ANNs to classify ligand unbound and bound states from MD trajectories of the PDZ2 protein. These models achieved 75% and 80% predictive accuracy, respectively.
Most recent ML-based approaches concern the use of different variants of Deep Learning (DL) methods. Jumper et al. (2022)19 and Baek et al. (2021)20 developed, respectively, the very successful AlphaFold and RoseTTAFold models for protein structure forecasting from sequence as input. Other authors have applied CNN methods to the prediction of interactions between proteins21, protein-ligand interactions22, protein folding, protein phosphorylation23, and protein structure classification24. Notice, though, that the input data for these CNNs are images described as spatial arrays, whereas the characteristics that describe a GPCR conformation are not spatially structured. Hayatshahi et al. (2019)25 distinguished otherwise similar allosteric states of proteins adopting conventional ML and DL approaches on extensive MD simulations. Plante et al. (2019)26 combined a densely connected ANN with a pixel representation to identify ligand bound and unbound states.
Recurrent Neural Networks are ANNs in which connections between nodes can create a cycle, enabling them to exhibit temporal dynamic behavior. They have been particularly successful in applications to human language modelling27. The LSTM models address a limitation of the RNN architecture, namely its inability to learn information from the distant past, by allowing the network to dynamically learn to forget old aspects of information. They have been used to mimic trajectories produced by simulations28,29, achieving accurate short-term predictions. They have also shown great potential for sequence processing30, resulting in a large body of literature studying the trajectories from simulation systems31.
In Tsai et al. (2020)29, LSTMs were used to predict the temporal evolution of chemical/biophysical trajectories. Mohamma et al. (2019)32 applied these models to find temporal correlations between atoms. Kadupitiya et al. (2020)33 used an LSTM as the numerical integrator that solves Newton’s equations in MD simulations. Other authors have applied LSTMs over low-dimensional molecular simulations to detect rare events in the sequential data34. Liang et al.35 applied them to molecular step-position forecasting for the S-protein in SARS-CoV-2 dynamics. Ludwig et al. (2022) evaluated the performance of BLSTMs in the task of increasing the 3D spatial resolution of MD trajectories as a data post-processing step.
We carried out a preliminary study using LSTM36, in which the best representation of amino-acids in 3D space to predict MD trajectories of a GPCR receptor as a whole was obtained, but, to the best of our knowledge, there is no reported work on MD forecasting by GPCR regions, which is the main concern of the current study.
Materials
GPCR MD simulations
The MD simulations used in this study were created on the Google Exacycle cloud computing platform37 as a way to improve understanding of drug efficacy at GPCR receptors. These simulations could be incorporated into a valid and functional structure-based drug discovery approach through pathway analysis. The simulations under study were created by Kohlhoff et al.38 by intensively computing short MD trajectories in parallel on the cloud platform, and are publicly available as source data at the SimTK (https://simtk.org/projects/natchemgpcrdata/) repository. The authors of these short simulations further analysed them by assembling larger trajectories using extensive sampling with Markov state modeling. We summarily describe these larger simulations next, following the description by their authors.
The crystal structures of the receptor embedded in the membrane for PDB id 2RH1 (inactive) and id 3P0G (active) were prepared from the OPM database39. Simulations were run for the inactive (2RH1) and active (3P0G) structures without ligand (APO), as well as with the receptor bound to the full agonist and the partial inverse agonist, including the ligand-swapped combinations (2RH1 with BI-167107, and 3P0G with carazolol)40,41.
The structures were embedded in a bilayer of POPC lipid molecules in an orthorhombic box of size 10.0 \(\times \) 10.0 \(\times \) 8.5 nm. The system was solvated with TIP3P water molecules, with Na+ and Cl- ions added for molecular stabilisation together with cholesterol, at a final ion concentration of 0.15 M.
Protein, water, and ions were parameterized with the AMBER03 force field42 and lipids with the Berger unified atom force field. Carazolol and BI-167107 ligands were extracted from the PDB entries 2RH1 and 3P0G, respectively, and parameterized for the general Amber force field (GAFF)43 with acpype44 and antechamber45. For simulations in which the agonist and the partial inverse agonist were switched, the ligand positions were changed after superimposing the two crystal structures using all protein residues with atoms within 6Å of either ligand. The sizes of the resulting molecular dynamics systems range from 58,406 to 59,044 atoms.
The N- and C-termini of the receptor structures were not fully resolved during crystallography. In 2RH1, the structure involves residues 30 to 342, and in 3P0G, residues 23 to 344. In intracellular loop 3 (ICL3), between helices 5 and 6, the missing residues are substituted in 2RH1 and 3P0G with a T4 lysozyme and a nanobody, respectively. These residues are 231–262 in 2RH1 and 228–264 in 3P0G. \({\upbeta }\)2AR remains functional even in the absence of ICL3.
Hydrogen bonds and hydrogen bond networks enable intramolecular water to act as a facilitator of biomolecule dynamics. During the equilibrium and production experiments, water molecules were able to move freely within the simulation system and enter and exit the receptor during the simulations.
Considering ionic lock formation, a salt bridge between intracellular residues E268 and R131 is a feature of the receptor’s inactive state and disruption of this ionic lock is involved in receptor activation46. It has been demonstrated that the inactive state shows a mixture of ionic locks formed and broken at equilibrium47.
In the extracellular region, the helical movements extend around the mean helical position. The crystal structures of the active and inactive conformations on the extracellular side are almost identical in the movements of helices 2 and 3 (with a difference of 1% or less), while the other five helices are shifted from 0.379 Å (helix 4) to 0.773 Å (helix 1) in relation to each other. The active structure has a more compact helical formation than the inactive structure. During the simulation, helices 6 and 7 were compressed in all systems, while helices 4 and 5 moved slightly outward. Helix 1 showed the greatest relative movement within the simulations, particularly in the inactive structure.
The central region of the transmembrane helix shows the greatest stability compared to the intra- and extracellular sides. This region is usually quite condensed. The most significant distinctions between the active and inactive structures are observed in helices 1, 6 and 7. During the simulation, helix 6 appears to be converging towards the inactive state when the simulation is initiated from the active conformation. Helix 1, on the other hand, moves out in all systems.
The most notable structural differences are seen in the movement of the transmembrane helices in the intracellular region. Helices 6 and 7 are particularly distinct between the active and inactive structures, with a displacement of 6.951 Å and 3.47 Å respectively. Helices 1, 2, and 4 are further away from the center in the active state, while helix 3 is much closer (with a range of offsets from 1.4 to 2.277 Å).
Residues by helix, and limits of the helices by residue id are defined as follows (with residue numbers in brackets): Helix 1 (29–60), Helix 2 (67–96), Helix 3 (103–136), Helix 4 (147–171), Helix 5 (197–229), Helix 6 (267–298), Helix 7 (305–328)48.
Although Kohlhoff et al.38 assembled larger simulations for the different crystal structures from the short simulations using extensive sampling with Markov state modelling, the simulations in this study are based on the short trajectories released at the source repository by the authors. More precisely, the study comprises 2,000 trajectories for each of six types of crystal structure of \({\upbeta }\)2AR: APO for the simulations of 2RH1-icl3 and 3P0G-a, Full Agonist (FA) for the simulations of 2RH1-b and 3P0G-b, and Partial Inverse Agonist (PIA) for 2RH1-c and 3P0G-c. The receptor consists of 282 amino acids for the inactive state and 344 for the active state. Each trajectory describes the 3D position of the receptor along 28 consecutive time-steps (trajectory length), hereafter referred to as frames. The time elapsed between frames is 500 picoseconds. Activation and deactivation proceed through multiple pathways and typically visit metastable intermediate states. The simulation data under study, as in Gutiérrez-Mondragón et al.49, primarily comprise intermediate states of each receptor.
Structural sequence domains
GPCRs have three main structural regions, namely a seven-helix TM domain, an extracellular domain built by the N-terminus and three ECLs, and the intracellular domain, including the C-terminus and two ICLs50.
Table 1 provides a detailed description of each region of the \({\upbeta }\)2AR-GPCR receptor under study. The 2RH1 and 3P0G structures contain residues 30–342 and 23–344, respectively. Both have gaps in the sequence, where ICL3, between TM5 and TM6, is replaced in 2RH1 and 3P0G with a T4 lysozyme and a nanobody, respectively. These residues are 231–262 for 2RH1, and 228–264 for 3P0G. \({\upbeta }\)2AR remains functional even in the absence of these regions.
Figure 1 represents the common structure of a \({\upbeta }\)2 adrenergic GPCR. In it, the 7 TM, 2 ICL and 3 ECL regions are shown. In addition, BI-167107 ligand binding with the protein is displayed in an image inset.
Methods
The long short-term memory model
LSTM11 is a neural network of the RNN family, designed for the analysis of temporal data. A schematic explanation of how LSTM works is shown in Fig. 2. Summarily, LSTM has an input gate (i), a forget gate (f), a memory gate (c) and an output gate (o). The input gate decides whether to let the incoming signal through to the memory gate, or block it. The output gate either allows a new signal output from the memory gate or suppresses it. The forget gate is responsible for remembering or forgetting the previous state of the memory gate. The memory gate states are updated by feeding the previous output gate back through recurrent connections between two consecutive time steps. The reading-and-writing memory cell is controlled by a group of sigmoid gates (x). At a given time, the LSTM receives inputs from different sources: the current amino-acid positions \(\textit{X}_\textit{xyz}\) as the input, the previous hidden state of all LSTM units (h), and the previous memory gate state \(\textit{c}_{(t -1)}\). The output gate then returns the estimated probability of the next 3D amino-acid positions for the sequences (Px, Py, Pz).
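The gating just described corresponds to the standard LSTM update equations (a generic formulation, not specific to this paper), where \(\sigma \) denotes the sigmoid function, \(\odot \) element-wise multiplication, and \(x_t\) the current input:

\[
\begin{aligned}
f_t &= \sigma \left( W_f [h_{t-1}, x_t] + b_f \right), \quad
i_t = \sigma \left( W_i [h_{t-1}, x_t] + b_i \right), \\
\tilde{c}_t &= \tanh \left( W_c [h_{t-1}, x_t] + b_c \right), \quad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
o_t &= \sigma \left( W_o [h_{t-1}, x_t] + b_o \right), \quad
h_t = o_t \odot \tanh (c_t).
\end{aligned}
\]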
In short, LSTM solves one of the limitations of the RNN architecture, namely the inability to learn information from the far past. Therefore, LSTMs are able to accumulate information for a long period of time by allowing the network to dynamically learn and forget old aspects of information. In this work, an LSTM model was trained with 3D position sequences to predict their next movement.
In this paper, ULSTM and BLSTM, as well as CNN-LSTM as a chain of ANNs, are investigated. ULSTM processes data in the forward direction, while BLSTM processes sequence data in both forward and backward directions with two separate hidden layers51. In addition, the LSTM variants are evaluated against other ML approaches, namely RF and CNN, in order to compare the results obtained with model architectures of different complexities.
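As an illustration (a minimal sketch, not the authors' exact code), the unidirectional and bidirectional variants can be assembled in Keras as follows. The 50 hidden units are a hypothetical choice, and each input sample is assumed to be a window of nSteps-in frames of flattened per-residue coordinates:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense

N_STEPS_IN = 5        # length of the input window (frames)
N_FEATURES = 3 * 282  # x, y, z per residue (282 residues, inactive state)

def build_ulstm(units=50):
    # Forward-only LSTM; the input shape (N_STEPS_IN, N_FEATURES)
    # is inferred from the first batch passed to the model.
    model = Sequential([LSTM(units, activation="relu"), Dense(N_FEATURES)])
    model.compile(optimizer="adam", loss="mse")
    return model

def build_blstm(units=50):
    # The Bidirectional wrapper runs two LSTMs, forward and backward,
    # with separate hidden states, and concatenates their outputs.
    model = Sequential([Bidirectional(LSTM(units, activation="relu")),
                        Dense(N_FEATURES)])
    model.compile(optimizer="adam", loss="mse")
    return model
```

In the bidirectional case, the two direction-specific hidden states are concatenated before the dense regression head, which roughly doubles the number of recurrent parameters.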
Convolutional neural network
The CNN is a feedforward Neural Network proposed by Lecun et al.52 that has been shown to perform exceedingly well in image and natural language processing tasks53. It can also be used effectively to predict time series. The local perception and weight sharing of the CNN model can dramatically reduce its number of parameters, increasing the effectiveness of its learning process54. The CNN architecture consists mostly of convolution layers followed by pooling layers. Each convolution layer contains several convolution kernels. After the convolution operation of the convolution layer, the data features are extracted, but their dimension is very high; so, to reduce the computational cost of training, a pooling layer is added after the convolution layer to reduce the feature dimension55. In our experiments, a combined CNN-LSTM shallow Neural Network-based forecasting model has also been applied. Figure 3 shows a schematic representation of such a model architecture, and Table 2 provides a brief quantitative summary of its elements.
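A CNN-LSTM chain of this kind can be sketched in Keras as below; the layer sizes are illustrative, not taken from the paper's Table 2:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

N_STEPS_IN = 5        # input window length (frames)
N_FEATURES = 3 * 282  # flattened x, y, z residue coordinates per frame

def build_cnn_lstm():
    # A 1D convolution extracts local temporal features; pooling halves
    # the feature-map length to cut training cost; the LSTM models the
    # remaining sequence; a dense head regresses next-frame coordinates.
    model = Sequential([
        Conv1D(64, kernel_size=2, activation="relu"),
        MaxPooling1D(pool_size=2),
        LSTM(50, activation="relu"),
        Dense(N_FEATURES),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```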
Experimental methodology
Data underwent linear min-max normalisation56, and predictions were mapped back to the original range of values in Angstrom (Å) units. The 3D positions of amino acids were extracted for each frame. Note, though, that the original database included the positions of atoms, instead of the positions of amino acids. Therefore, the amino acid mass centers were calculated and used as the 3D positions that represent them. Data preprocessing was performed on the 3D positions (x, y, z dimensions) of each residue. Five time frames were used to train the model, and the next frame was predicted from these. An overlap of 4 frames was used to select the following sequence on the training set; this means that, in the end, each frame was predicted given the 5 previous frames in the simulation. The average error of all predictions was then calculated and reported.
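The normalisation and windowing steps above can be sketched as follows, assuming each trajectory is an array of shape (n_frames, n_residues × 3); function names and shapes are illustrative:

```python
import numpy as np

def min_max_normalise(traj):
    # Linear min-max scaling to [0, 1]; the bounds are kept so that
    # predictions can be mapped back to Angstrom units.
    lo, hi = traj.min(), traj.max()
    return (traj - lo) / (hi - lo), (lo, hi)

def denormalise(traj, bounds):
    lo, hi = bounds
    return traj * (hi - lo) + lo

def make_windows(traj, n_steps_in=5):
    # Stride-1 windows: consecutive input sequences overlap in 4 frames,
    # so every frame after the first 5 becomes a prediction target once.
    X = np.stack([traj[i:i + n_steps_in]
                  for i in range(len(traj) - n_steps_in)])
    y = traj[n_steps_in:]
    return X, y
```

For a 28-frame trajectory this yields 23 (input window, target frame) pairs.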
Two thousand trajectories per simulation were used. Simulations are represented in two states: \({\upbeta }\)2AR-2RH1 (simulations started in the inactive state) and \({\upbeta }\)2AR-3P0G (simulations started in the active state), each with APO, Full Agonist (BI-167107) and Partial Inverse Agonist (carazolol), making up six types of simulation. We refer to trajectories as nClones.
The LSTM model training was carried out using the 3D amino acid positions (x, y, z) per frame. We refer to the length of a sequence as nSteps-in (which was 5 in our experiments). The centers of mass of the amino acids were employed as the representative data points for the residues’ positions; the center of mass has been shown to be the best representation of an amino acid in 3D space for LSTM forecasting36. All experiments were carried out in this way.
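Reducing atoms to residue centers of mass can be sketched as below; the array layout (atom coordinates, masses, and residue indices) is a hypothetical input format, not the repository's:

```python
import numpy as np

def residue_centers_of_mass(atom_xyz, atom_mass, residue_id):
    # atom_xyz: (n_atoms, 3); atom_mass: (n_atoms,); residue_id: (n_atoms,)
    residues = np.unique(residue_id)
    com = np.empty((len(residues), 3))
    for k, r in enumerate(residues):
        sel = residue_id == r
        # mass-weighted average of the atom positions of residue r
        com[k] = np.average(atom_xyz[sel], axis=0, weights=atom_mass[sel])
    return com
```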
The training configuration parameters of the LSTM were: epochs = 100, verbose = 0, activation = relu, input shape = (nSteps-in, length of the amino acid chain). All the remaining parameters of the Keras57 framework were kept at their default values.
The RF58 algorithm, used for comparison, is a more conventional ML approach that can behave as a regression model. The Sklearn library59 implementation was used with default parameters, with the exception of max_depth = number of residues × 3 (the x, y, z positions) and random_state = 0.
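A sketch of the Random Forest baseline with the stated non-default settings, reading the maximum depth as the flattened coordinate count (number of residues × 3); since RF expects 2D inputs, the 5-frame window is flattened per sample:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_rf(X, y, n_residues=282):
    # X: (n_samples, n_steps_in, n_residues * 3); y: (n_samples, n_residues * 3)
    rf = RandomForestRegressor(max_depth=n_residues * 3, random_state=0)
    rf.fit(X.reshape(len(X), -1), y)  # flatten the temporal window per sample
    return rf
```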
For each type of simulation, 10,000 trajectories are available from the database reported in Kohlhoff et al.38. Two thousand of these are randomly selected and split into 5 folds, each including 400 trajectories (four of them were used for cross-validation, and the remaining hold-out fold was used for testing). Results do not improve with a larger number of simulations. Ten iterations of this procedure were performed to support statistical significance testing.
The experiments were repeated by randomly choosing 2,000 trajectories from the original set of 10,000 trajectories. This procedure generates 10 models that were evaluated, yielding 10 RMSD values per experiment (ML approaches and the 6 types of simulations). Student’s t-tests (p-value \(< 0.001\) and p-value \(< 0.005\))60 were carried out to find statistically significant differences between experiments. Standard deviations and p-values are shown in the tables of results.
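The significance testing can be sketched with SciPy; the use of an independent two-sample test is an assumption here, as the paper does not state the exact test variant:

```python
import numpy as np
from scipy import stats

def compare_rmsd(rmsd_a, rmsd_b, alpha=0.001):
    # Student's t-test on the 10 RMSD values obtained per experiment
    t_stat, p_value = stats.ttest_ind(rmsd_a, rmsd_b)
    return p_value, p_value < alpha
```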
Given the potentially higher flexibility of the loops (ECL and ICL), a specific analysis was performed in which only the amino acids belonging to the TM regions were used for training and prediction. These results can then be compared with models obtained by training with the full protein sequences.
The quality of the test predictions was assessed through the Root Mean Square Deviation (RMSD), commonly used to assess the similarity between simulated and predicted atomic coordinates and therefore straightforwardly generalizable to the centers of mass of the amino acids, as we have done in this study.
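The RMSD over centers of mass can be computed as below, with predicted and simulated coordinates as (n_residues, 3) arrays in Angstrom:

```python
import numpy as np

def rmsd(pred, true):
    # Root mean square of the per-residue Euclidean distances
    return float(np.sqrt(np.mean(np.sum((pred - true) ** 2, axis=-1))))
```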
Experimental setup
Three experiments were performed for both GPCR states. The first experiment (E1) evaluates the capability of the RF, ULSTM, BLSTM and CNN-LSTM models to predict steps of the GPCR trajectories, discriminating by TM, ECL and ICL regions for the active and inactive states.
For that, the prediction error of each of the amino acid positions was calculated and compared between models, region by region. Similar evaluations were carried out in the second experiment (E2), this time focused on the seven TM regions, comparing the prediction error between 2RH1 and 3P0G for each TM (TM1 to TM7).
Focusing now on the large dynamics of the ICL and ECL of the GPCR, Experiment E3 evaluates the prediction capability of the models on the ICL (ICL1, ICL2) and ECL (ECL1, ECL2, ECL3) regions.
Results and discussion
Our study investigates the capability of LSTM models to predict GPCR MD trajectories for its different states and constituent regions. Different GPCR regions may play different roles in the MD associated to each state. The investigation of regions individually is therefore relevant.
As mentioned, prediction errors are reported using the RMSD61, with original range values in Angstrom (Å) units. The standard deviation (std)62 between experiment repetitions is also calculated for the RMSD metric. Table 3 shows the RMSD and the std in Å for the 2RH1 inactive and 3P0G active states, in which the prediction errors for RF, ULSTM, BLSTM and CNN-LSTM are compared. The results discriminated by APO, FA and PIA simulations are also shown in Table 3.
Regarding experiment E1, three questions can be answered:
-
Which model is the best one? The TM, ECL and ICL regions show similar prediction error values for the different models. Furthermore, these errors are not uniform when comparing APO, FA and PIA simulations (see italics in Table 3). Despite the increasing complexity of the models evaluated (RF, ULSTM, BLSTM and CNN-LSTM), no clear differences were found between them, with the exception of RF, which performed significantly worse than the rest. Using an Occam’s razor criterion, ULSTM, being the simplest model in terms of architecture complexity and computational resources required to achieve these results, should be selected.
-
Which GPCR region yields the best results, and in what analysis? The TM regions are shown to be the best predicted regions of the GPCRs (see bold values in Table 3), with significant differences with respect to ICL and ECL (\(p<0.005\) for \({\upbeta }\)2AR-2RH1, the inactive state in APO, and \(p<0.001\) for the remaining simulations). Comparing now the ICL and ECL regions, ICL achieves the minimum error for the APO, FA and PIA simulations, with the exception of the active state of the APO simulation, where the minimum error was obtained for the ECL region; however, no significant differences are observed between these regions.
-
Are there any substantial differences between active and inactive states? Regarding the inactive (2RH1) and active (3P0G) states, no strong differences were observed between their regions. Two exceptions are found in this analysis: the ECL region in the APO simulation and the TM region in the FA simulation show significant differences (p-value \(< {0.05 }\); see the RMSD values marked with a single asterisk in Table 3). Only in these cases does the active state show clear differences from the inactive one.
The experimental results show that the LSTM model performed best in predicting the dynamics of the TM regions and that, overall, the ICL regions yielded the highest prediction error. However, the dynamics of the GPCR differ by region, with some regions more prone to conformational changes than others. The flexibility of the molecule was assessed by region by comparing the prediction errors between regions, as shown in Tables 3 and 6.
The transmembrane regions are less flexible than the ICL and ECL regions, which are more likely to experience changes in their 3D structure. This may be one of the reasons why lower errors were obtained in the transmembrane regions.
Regarding now E2 and focusing on the seven TM regions, the prediction errors are shown in Tables 4 and 7. The former includes results obtained by training the models with the whole receptor, while the latter includes results obtained by training the models only with the TM regions. No statistically significant differences (\(p>0.05\)) are observed between the errors obtained for models trained with loops and those trained without loops. In general terms, TM2 was the best predicted region for the active state. These results coincide for all the APO, FA and PIA simulations. However, for the inactive state, the best values were obtained for TM2 or TM3, depending on the simulation, showing substantial differences between both regions. TM2 and TM3 do not show statistically significant differences between them (\(\textit{p}>0.05\)). However, TM2 shows significant differences with respect to TM1, TM4, TM5, TM6 and TM7 in almost all simulations and models (except in the active state for the PIA simulation with BLSTM, with respect to TM4); see the value \(^{*1}\) in Table 4.
A more detailed analysis of the experimental results provided further information on the MD of the specific receptor regions. While TM2 coincides for all the simulations as the best predicted region in active and inactive states, TM3, instead, is the best for the inactive state in the PIA simulations, although it does not show significant differences (p-value \(> 0.05\)) with respect to TM2.
These proteins share a highly conserved motif of seven transmembrane helices connected by three extracellular and three intracellular loops. Movements of transmembrane regions III and IV are responsible for the activation of G protein-coupled receptors63. The conformational changes of the receptor transmembrane regions are closely related to the \({\upbeta }\)2-adrenergic receptor (\({\upbeta }\)2AR) activation pathway64. It is known that an outward displacement of TM6 from the centre of the helices and displacements of TM5 and TM7 are part of the activation mechanism of a receptor10. However, the details of the mechanisms of interaction between residues that trigger the activation are still unclear. Helices 6 and 7 of the original simulation show a strong difference between inactive and active structures, with relative displacements of 6.951 and 3.47 Å, respectively. Helices 1, 2, and 4 are shifted away from the center in the active state, while helix 3 is noticeably nearer (the range of relative displacements is from 1.4 to 2.277 Å). During simulation, helix 6 moves inwards in the active state simulations, while helix 7 moves outwards. When Kohlhoff et al.38 compare simulations started from the active state with those started from the inactive state, the relative displacements of helices 1 through 4 between active and inactive remain almost constant, indicating the importance of their rearrangement as a distinguishing element of receptor activation. It is therefore essential to examine each domain of the transmembrane region of the receptor in detail. It could easily be claimed that the structurally incomplete region is the most inaccurately modelled.
The original paper describing the MD simulations used in this study did not attempt to model the missing sections, which becomes a limitation of our reported results. Regarding missing sections in ICL (231-262 in 2RH1, and 228-264 in 3P0G between helices 5 and 6), it is difficult to draw solid conclusions about the differences between ICL and ECL since the protein is not accurately modelled. Nevertheless, \({\upbeta }\)2AR remains functional even in the absence of ICL3.
The prediction errors for experiment E3 are shown in Table 5 for ICL and Table 6 for ECL. For the ICL regions, ICL1 was identified as the region with the lowest prediction error in both states and all simulations. Interestingly, the accurate prediction of the MD of the ICL1 region contrasts with the results for ICL2, which yields statistically significant differences (p-value \(< 0.01\)), with RMSD differences greater than 0.2 Å.
In the case of the ECL regions, both states showed that ECL1 had the lowest prediction error, which was significantly different from ECL2 (p-value \(< 0.05\)) and from ECL3 (p-value \(< 0.01\)). These results coincide for the APO and FA simulations (see the RMSD values in Table 6). Regarding the PIA simulations, significant differences (p-value \(< 0.05\)) were only found between ECL1 and ECL2 (Table 7).
Beyond that, our experiments were carried out with three DL models of increasing structural complexity in terms of their network architecture. The results show inconclusive differences between the DL models, with minor differences depending on the simulation and state. Therefore, the simplest of these models, namely ULSTM, would be the preferred choice for further investigation. In the comparison of the DL models (ULSTM, BLSTM, CNN-LSTM) with the conventional ML model (RF), the DL models have shown strong and significant (p-value \(< 0.01\)) differences with respect to RF.
The existing literature has reported the use of the RMSD error in the static 3D prediction for different proteins: for instance, Chen & Brooks65 used it as a metric to ascertain whether MD simulations provide high-resolution refinement of protein structure. Lee et al.66 predicted 3D structure using molecular mechanics based on the surface area free energies for two small proteins (HP-36 and S15). The RMSD error values obtained were 0.77 Å for HP-36 protein and 0.83 Å for S15 protein. More specifically for GPCRs, Kaczor et al.67 analysed different methods for protein-protein docking and evaluated the generation of new digital protein-protein complexes in the transmembrane environment. The best method achieved an overall RMSD lower than 0.7 Å in 8 out of 12 simulations. Even if not directly comparable to this study, we have reported errors lower than 0.13 Å as measured by RMSD in simulations of dynamics that are a major challenge for models.
The approach proposed in this paper allows predicting 3D residue positions from the MD time series. It could be used in prospective experiments by setting threshold error targets for the discrimination between states and exploring whether the method can achieve them and how well they compare with those obtained with alternative methods. Furthermore, such investigation could be assisted by visualising results by residue, as in Figure 4, which maps prediction errors using coloured ribbons. This would allow for visual interactive and intuitive evaluation, assessing which are the best or worst modelled residues, distinguishing those that are exposed to the solvent, or those exposed to the ligand, to name a few possibilities.
Conclusions
LSTM Neural Networks have in the past shown promise in problems of GPCR dynamics forecasting. The current study has provided evidence that LSTM models, in three different architectures, are capable of predicting the dynamic trajectories of GPCRs in six states with reasonable efficacy, and far better than more standard ML models such as RF.
The TM helices are a key GPCR region due to their physiological role in signal transmission. Our LSTM models have been able to predict the dynamics of TM2 and TM3 best. Nevertheless, the details of the mechanism of interaction between amino acids that triggers activation remain unclear.
Although ULSTM is the shallowest of the investigated DL architectures, it has yielded competitive performance when compared to more complex models such as BLSTM and the combination CNN-LSTM.
LSTM models, though, suffer from some limitations when used to process long MD trajectories. For this reason, as a next step, we plan to investigate the capabilities of generative models (which have successfully been used for the modelling of protein MDs16) such as Transformers68 or Autoencoders69, for the prediction of long trajectories. We also plan to evaluate alternative representations of GPCR data, including graph representations.
No significant differences were observed between models trained with loops and only with the TM regions. This could be due to the fact that the data representation used by the models is a frame-by-frame relationship. We suggest that further research should be conducted on the representation of molecular dynamics through graphs that explicitly consider the connections between neighbouring residues.
Change history
30 April 2024
A Correction to this paper has been published: https://doi.org/10.1038/s41598-024-60566-w
References
Liebmann, C. Regulation of map kinase activity by peptide receptor signalling pathway: Paradigms of multiplicity. Cell. Signal. 13, 777–785 (2001).
van Blesen, T. et al. Receptor-tyrosine-kinase- and G\(\beta \gamma \)-mediated map kinase activation by a common signalling pathway. Nature 376, 781–784 (1995).
Zhang, Q. et al. Regulating quantal size of neurotransmitter release through a GPCR voltage sensor. Proc. Natl. Acad. Sci. 117, 26985–26995 (2020).
Betke, K., Wells, C. & Hamm, H. GPCR mediated regulation of synaptic transmission. Prog. Neurobiol. 96, 304–321 (2012).
Rodríguez-Espigares, I. et al. GPCRmd uncovers the dynamics of the 3D-GPCRome. Nat. Methods 17, 777–787 (2020).
Rask-Andersen, M., Almén, M. S. & Schiöth, H. Trends in the exploitation of novel drug targets. Nat. Rev. Drug Discov. 10, 579–590 (2011).
Basith, S. et al. Exploring g protein-coupled receptors (GPCRs) ligand space via cheminformatics approaches: Impact on rational drug design. Front. Pharmacol. 9, 128 (2018).
Torrens-Fontanals, M. et al. How do molecular dynamics data complement static structural data of GPCRs. Int. J. Mol. Sci. 21, 5933 (2020).
Wheatley, M. et al. Lifting the lid on GPCRs: The role of extracellular loops. Br. J. Pharmacol. 165, 1688–1703 (2012).
Latorraca, N., Venkatakrishnan, A. & Dror, R. GPCR dynamics: Structures in motion. Chem. Rev. 117, 139–155 (2017).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Chen, Z. et al. D3pockets: A method and web server for systematic analysis of protein pocket dynamics. J. Chem. Inf. Model. 59, 3353–3358 (2019).
Ribeiro, J. M. L. & Tiwary, P. Achieving reversible ligand-protein unbinding with deep learning and molecular dynamics through rave. BioRxiv 400002 (2018).
Tsuchiya, Y., Taneishi, K. & Yonezawa, Y. Autoencoder-based detection of dynamic allostery triggered by ligand binding based on molecular dynamics. J. Chem. Inf. Model. 59, 4043–4051 (2019).
Wu, H., Mardt, A., Pasquali, L. & Noe, F. Deep generative markov state models. Adv. Neural Inf. Process. Syst. 31 (2018).
Degiacomi, M. Coupling molecular dynamics and deep learning to mine protein conformational space. Structure 27, 1034–1040 (2019).
Fleetwood, O., Kasimova, M., Westerlund, A. & Delemotte, L. Molecular insights from conformational ensembles via machine learning. Biophys. J. 118, 765–780 (2020).
Zhou, H., Dong, Z. & Tao, P. Recognition of protein allosteric states and residues: Machine learning approaches. J. Comput. Chem. 39, 1481–1490 (2018).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Townshend, R., Bedi, R., Suriana, P. & Dror, R. End-to-end learning on 3D protein structure for interface prediction. Adv. Neural Inf. Process. Syst. 32 (2019).
Jiménez, J., Doerr, S., Martínez-Rosell, G., Rose, A. & De Fabritiis, G. DeepSite: Protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics 33, 3036–3042 (2017).
Luo, F., Wang, M., Liu, Y., Zhao, X.-M. & Li, A. DeepPhos: Prediction of protein phosphorylation sites with deep learning. Bioinformatics 35, 2766–2773 (2019).
de Jesus, D. R., Cuevas, J., Rivera, W. & Crivelli, S. Capsule networks for protein structure classification and prediction. arXiv preprint arXiv:1808.07475 (2018).
Hayatshahi, H., Ahuactzin, E., Tao, P., Wang, S. & Liu, J. Probing protein allostery as a residue-specific concept via residue response maps. J. Chem. Inf. Model. 59, 4691–4705 (2019).
Plante, A., Shore, D. M., Morra, G., Khelashvili, G. & Weinstein, H. A machine learning approach for the discovery of ligand-specific functional mechanisms of GPCRs. Molecules 24, 2097 (2019).
Rico-Martines, R., Kevrekidis, I. G., Kube, M. C. & Hudson, J. L. Discrete- vs. continuous-time nonlinear signal processing: Attractors, transitions and parallel implementation issues. In 1993 American Control Conference 1475–1479 (IEEE, 1993).
Eslamibidgoli, M. J., Mokhtari, M. & Eikerling, M. Recurrent neural network-based model for accelerated trajectory analysis in AIMD simulations. arXiv preprint arXiv:1909.10124 (2019).
Tsai, S.-T., Kuo, E.-J. & Tiwary, P. Learning molecular dynamics with simple language model built upon long short-term memory neural network. Nat. Commun. 11, 1–11 (2020).
Lukoševičius, M. & Jaeger, H. Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev. 3, 127–149 (2009).
Pathak, J., Hunt, B., Girvan, M., Lu, Z. & Ott, E. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Phys. Rev. Lett. 120, 024102 (2018).
Eslamibidgoli, M. J., Mokhtari, M. & Eikerling, M. H. Recurrent neural network-based model for accelerated trajectory analysis in AIMD simulations. arXiv preprint arXiv:1909.10124 (2019).
Kadupitiya, J., Fox, G. & Jadhao, V. Deep learning based integrators for solving Newton’s equations with large timesteps. arXiv preprint arXiv:2004.06493 (2020).
Zeng, W., Cao, S., Huang, X. & Yao, Y. A note on learning rare events in molecular dynamics using LSTM and transformer. arXiv preprint arXiv:2107.06573 (2021).
Liang, D. et al. Supervised machine learning approach to molecular dynamics forecast of SARS-CoV-2 spike glycoproteins at varying temperatures. MRS Adv. 6, 362–367 (2021).
López-Correa, J. M., König, C. & Vellido, A. Long short-term memory to predict 3D amino acids positions in GPCR molecular dynamics (2022).
Hellerstein, J., Kohlhoff, K. & Konerding, D. Science in the cloud: Accelerating discovery in the 21st century. Internet Comput. 16, 64–68 (2012).
Kohlhoff, K. et al. Cloud-based simulations on google exacycle reveal ligand modulation of GPCR activation pathways. Nat. Chem. 6, 15–21 (2014).
Lomize, M. A., Lomize, A. L., Pogozheva, I. D. & Mosberg, H. I. OPM: Orientations of proteins in membranes database. Bioinformatics 22, 623–625 (2006).
Cherezov, V. et al. High-resolution crystal structure of an engineered human \(\beta \)2-adrenergic g protein-coupled receptor. Science 318, 1258–1265 (2007).
Rasmussen, S. G. et al. Structure of a nanobody-stabilized active state of the \(\beta \)2 adrenoceptor. Nature 469, 175–180 (2011).
Duan, Y. et al. A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J. Comput. Chem. 24, 1999–2012 (2003).
Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. Development and testing of a general amber force field. J. Comput. Chem. 25, 1157–1174 (2004).
Sousa da Silva, A. W. & Vranken, W. F. ACPYPE-antechamber python parser interface. BMC Res. Notes 5, 1–8 (2012).
Wang, J., Wang, W., Kollman, P. A. & Case, D. A. Automatic atom type and bond type perception in molecular mechanical calculations. J. Mol. Graph. Model. 25, 247–260 (2006).
Ballesteros, J. A. et al. Activation of the \(\beta \)2-adrenergic receptor involves disruption of an ionic lock between the cytoplasmic ends of transmembrane segments 3 and 6. J. Biol. Chem. 276, 29171–29177 (2001).
Vogel, R. et al. Functional role of the “ionic lock’’-an interhelical hydrogen-bond network in family a heptahelical receptors. J. Mol. Biol. 380, 648–655 (2008).
Rosenbaum, D. M. et al. Structure and function of an irreversible agonist-\(\beta \)2 adrenoceptor complex. Nature 469, 236–240 (2011).
Gutiérrez-Mondragón, M. A., König, C. & Vellido, A. Recognition of conformational states of a g protein-coupled receptor from molecular dynamic simulations using sampling techniques. In International Work-Conference on Bioinformatics and Biomedical Engineering (IWBBIO) 3–16 (Springer, 2023).
König, C., Alquézar, R., Vellido, A. & Giraldo, J. Systematic analysis of primary sequence domain segments for the discrimination between class C GPCR subtypes. Interdiscip. Sci. Comput. Life Sci. 10, 43–52 (2018).
Cui, Z., Ke, R., Pu, Z. & Wang, Y. Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143 (2018).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Kim, B. & Kim, T.-G. Cooperation of simulation and data model for performance analysis of complex systems. Int. J. Simul. Model. 18, 608–619 (2019).
Qin, L., Yu, N. & Zhao, D. Applying the convolutional neural network deep learning technology to behavioural recognition in intelligent video. Tehnički Vjesnik J. 25, 528–535 (2018).
Lu, W., Li, J., Li, Y., Sun, A. & Wang, J. A CNN-LSTM-based model to forecast stock prices. Complexity 2020, 1–10 (2020).
Jahan, A. & Edwards, K. A state-of-the-art survey on the influence of normalization techniques in ranking: Improving the materials selection process in engineering design. Mater. Des. 1980–2015(65), 335–342 (2015).
Chollet, F. Keras documentation. https://keras.io (2015). Accessed: 2023-02-23.
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Pölsterl, S. scikit-survival: A library for time-to-event analysis built on top of scikit-learn. J. Mach. Learn. Res. 21, 8747–8752 (2020).
Student. The probable error of a mean. Biometrika 6, 1–25 (1908).
Sargsyan, K., Grauffel, C. & Lim, C. How molecular size impacts RMSD applications in molecular dynamics simulations. J. Chem. Theory Comput. 13, 1518–1524 (2017).
Lee, D. K., In, J. & Lee, S. Standard deviation and standard error of the mean. Korean J. Anesthesiol. 68, 220 (2015).
Gether, U. et al. Agonists induce conformational changes in transmembrane domains III and VI of the \(\beta \)2 adrenoceptor. EMBO J. 16, 6737–6747 (1997).
Kofuku, Y. et al. Efficacy of the \(\beta \)2-adrenergic receptor is determined by conformational equilibrium in the transmembrane region. Nat. Commun. 3, 1045 (2012).
Chen, J. & Brooks, C. III. Can molecular dynamics simulations provide high-resolution refinement of protein structure?. Proteins: Struct. Funct. Bioinform. 67, 922–930 (2007).
Lee, M., Baker, D. & Kollman, P. 2.1 and 1.8 Å average C\(\alpha \) RMSD structure predictions on two small proteins, HP-36 and S15. J. Am. Chem. Soc. 123, 1040–1046 (2001).
Kaczor, A., Selent, J., Sanz, F. & Pastor, M. Modeling complexes of transmembrane proteins: Systematic analysis of protein protein docking tools. Mol. Inf. 32, 717–733 (2013).
Giuliari, F., Hasan, I., Cristani, M. & Galasso, F. Transformer networks for trajectory forecasting. In 25th International Conference on Pattern Recognition 10335–10342 (IEEE-ICPR, 2021).
Bank, D., Koenigstein, N. & Giryes, R. Autoencoders. arXiv preprint arXiv:2003.05991 (2020).
Acknowledgements
This work is funded by the Spanish PID2022-143299OB-I00 research project and by the PRE2020-092428 PhD training program, through the Ministry of Science and Innovation.
Author information
Authors and Affiliations
Contributions
All authors made significant contributions to this manuscript. J.M.L.-C.: conceptualisation, methodology, data-analysis, writing, scripts programming, provided suggestions on the experimental design and original draft. C.K.: methodology, writing and experimental design and original draft, supervision. A.V.: provided the methodological research design and data analysis, writing, supervision, and project funding acquisition. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this Article was revised: The original version of this Article contained an error in the Acknowledgements section. Full information regarding the correction made can be found in the correction for this Article.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
López-Correa, J.M., König, C. & Vellido, A. GPCR molecular dynamics forecasting using recurrent neural networks. Sci Rep 13, 20995 (2023). https://doi.org/10.1038/s41598-023-48346-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-023-48346-4