Introduction

Biological membranes form permeable fences between the interior of cells and the external environment. They are composed of phospholipid bilayers, which form a particular, fluid, medium that differs from the surrounding aqueous solution. A lot of proteins are embedded in, attached to, or cross the membranes. We focus here on integral membrane proteins, which cross the membrane and have thus a transmembrane, and one or two extramembrane domains.

Membrane proteins are a very important class of proteins. They play key roles in the localization and organization of the cell, as well as in the cellular function by transferring specific molecules, ions and other types of signals from the cell exterior to the interior and vice versa. They constitute about 30% of the entire human proteome1. They are the focus of a lot of pharmaceutical research, as they correspond to about 60% of the current drug targets2.

In spite of their importance, membrane proteins have been much less studied than globular proteins. They are indeed very difficult to analyze, as their folding, native structure, stability and activity is reached only within the lipid bilayer, which complicates getting their experimental X-ray structures. Generally, their large size makes also difficult to obtain them by nuclear magnetic resonance spectroscopy. These are the reasons why transmembrane protein structures only represent about 2% of the available structures deposited in the Protein Data Bank (PDB)3. The analysis and modeling of the 3-dimensional (3D) structure of membrane proteins are thus key objectives for rationally guiding protein design and engineering experiments.

Due to the difference between the aqueous and lipid environments, the structure and composition of transmembrane regions substantially differ from those of the extramembrane domains and from globular proteins4. This implies that interactions that are favorable in globular regions are not necessarily so in transmembrane regions, and vice versa. However, the relative strength of the different types of interactions in the two environments is not easy to evaluate.

To tackle this issue, semi-empirical, physics-based, energy functions adapted to membrane proteins have been designed and used for computational modeling and design purposes (see5,6 for reviews). Such potentials have also been used to orient proteins into membranes, using coarse-grained molecular dynamics simulations7,8, or simplified potentials including anisotropic solvent models of lipid bilayers9. Another approach consists in deriving statistical potentials from sets of known membrane protein structures. Such potentials have been applied to evaluate structural models of membrane proteins10,11,12,13 and to position proteins into lipid membranes10,14.

Some authors analyzed separately α-helical and β-barrel proteins15,16. Indeed, gram-negative bacteria have two membranes, an inner membrane composed of a phospholipid bilayer and an outer membrane which is an asymmetrical bilayer of phospholipids in the inner leaflet and lipopolysaccharides in the outer leaflet. This difference implies that the membrane proteins differ according to whether they are inserted in the inner or outer membrane. In particular, α-helical transmembrane proteins are mostly found in the cytoplasmic membranes of prokaryotic and eukaryotic cells and rarely in outer membranes, whereas β-barrel proteins have so far only been found in outer membranes of gram-negative bacteria, mitochondria and chloroplasts17,18.

In this paper, we chose to apply the statistical potential formalism to derive distance potentials from trans- and extramembrane protein regions, as this yields an objective way to compare residue-residue interactions that prevail in lipid and aqueous environments. We also derived potentials separately on α-helical and β-barrel transmembrane regions to investigate whether differences are visible between interaction strengths. We should in principle also distinguish between extramembrane residues that are in the cytoplasmic, periplasmic or extracellular regions. For example, it has been shown that positive charges in α-helical domains are more often situated in the cytoplasmic domain where they make interactions with the lipid molecules19,20, and that charged residues in β-proteins are more frequently located on the extracellular side21,22. However, we chose to group these regions into a single category called extramembrane, which we occasionally separate into two subcategories: intracellular regions that are situated at the cellular side and are either cytoplasmic or periplasmic, and extracellular regions that can be periplasmic or really extracellular. Indeed, the number of membrane proteins with an experimental structure is currently too limited to yield reliable statistics if we define too many subregions.

Methods

Membrane and globular protein datasets

To set up our membrane protein dataset, we used the OPM database9, which contains experimental structures of integral membrane proteins. From these, we selected the proteins of which the structure was obtained by X-ray crystallography with a resolution of 2.5 Å at most. In a second step, we imposed a threshold on the pairwise sequence identity of 30%, with the help of the protein culling server PISCES23. Our final dataset \({\mathscr{D}}\) contains 165 membrane protein structures, among which 108 α-helical and 52 β-barrel polytopic integral proteins, and 5 α-helical monotopic integral proteins that do not span the lipid bilayer completely. They are listed in Supplementary Material Table S1.

The proteins from this dataset were divided into their transmembrane and extramembrane regions, using the OPM annotations. Note that most of these annotations are predictions; the impact of this is discussed in the Conclusion section. We got in this way two datasets, the \({{\mathscr{D}}}^{{\rm{TM}}}\) set that contains all the transmembrane protein segments, and the \({{\mathscr{D}}}^{{\rm{EM}}}\) set that contains the extramembrane protein regions, and thus mix extracellular, periplasmic and cytoplasmic segments. We occasionally separated the \({{\mathscr{D}}}^{{\rm{TM}}}\) dataset into transmembrane regions with α-helical or β-barrel conformations. The protein segments that make up these datasets are specified in Table S2.

To the best of our knowledge, the dataset of protein membrane structures constructed in this paper is currently the largest non-redundant dataset used to derive effective potentials10,11,12.

For comparison, we also considered the \({{\mathscr{D}}}^{{\rm{GL}}}\) dataset set up in24, which contains 3,823 X-ray structures of globular proteins, with a resolution of maximum 2.5 Å and a pairwise sequence identity of 20% at most.

Statistical potentials

Statistical potentials are coarse-grained energy functions derived from frequencies of observation of associations between sequence and structure elements in a dataset of protein structures using the inverse Boltzmann law25,26. In particular, we considered here the potentials:

$$\begin{array}{lllll}{\rm{\Delta }}W(s,d) & = & -{k}_{B}T\,\mathrm{ln}\,\frac{F(s,d)}{F(s)F(d)} & = & -{k}_{B}T\,\mathrm{ln}\,\frac{n(s,d)\,n}{n(s)\,n(d)}\\ {\rm{\Delta }}W({s}_{1},{s}_{2},d) & = & -{k}_{B}T\,\mathrm{ln}\,\frac{F({s}_{1},{s}_{2},d)}{F({s}_{1},{s}_{2})F(d)} & = & -{k}_{B}T\,\mathrm{ln}\,\frac{n({s}_{1},{s}_{2},d)\,n}{n({s}_{1},{s}_{2})\,n(d)}\end{array}$$
(1)

where kB is the Boltzmann constant, T the absolute temperature conventionally taken to be room temperature, s, s1 and s2 amino acid types, n numbers of occurrences and F relative frequencies. d is the spatial distance between the side chain geometric centers of two residues separated by at least one residue along the chain; the type of one of these residues (s) or of both residues (s1 and s2) are specified. The distance values between 3 and 9.9 Å are divided into discrete bins of 0.3 Å width and the last bin contains all distances above 9.9 Å. Details about the computation of the potentials can be found in24,26,27.

The potentials depend on the protein structure dataset from which the relative frequencies F are computed. Taking advantage of this dependence, a careful analysis of the relative strength of the interactions as a function of the temperature28 and of the solubility29 has been previously performed. Here, we extended this approach to membrane proteins and considered for that purpose the three datasets \({{\mathscr{D}}}^{{\rm{TM}}}\), \({{\mathscr{D}}}^{{\rm{EM}}}\) and \({{\mathscr{D}}}^{{\rm{GL}}}\). From these, we derived the transmembrane potential ΔWTM, the extramembrane potential ΔWEM and the globular protein potential ΔWGL, which describe the interactions in these respective protein regions.

Amino acids that share similar properties can be considered together when computing the potentials. Such potentials are referred to as group potentials. In summing up the number of occurrences of different amino acid types belonging to the same group, their sizes have to be taken into account. In practice, we shifted the inter-residue distances d between larger amino acids towards smaller distances by subtracting the difference in radii between these amino acids and the smallest amino acid in the group. We analyzed here group potentials involving positively charged residues (Lys, Arg), negatively charged residues (Glu, Asp), aromatic residues (Phe, Tyr, Trp), aliphatic residues (Ile, Val, Leu), non-charged polar residues (Gln, Asn, Ser, Thr), small residues (Gly, Ala), and sulfur-containing residues (Cys, Met).

Coping with finite-size dataset effect

Using frequencies of observation in a protein structure dataset to estimate free energy contributions through Eq. (1) implicitly assumes that the number of structures in the set is large enough to provide statistically significant values. This is, in general, a reasonable hypothesis for standard statistical potentials derived from thousands of globular structures. However, in the case of membrane proteins, the number of experimental structures is rather small and they are moreover divided into their trans- and extramembrane parts.

To cope with the finite-size effect, and get smooth and statistically significant potentials, we introduced two additional layers of computation. The first layer consists in dropping the potentials computed from distance bins d that do not contain a sufficient number of occurrences. We chose the threshold value on n(s, d) and n(s1, s2, d) equal to 10. If this value is not reached, the potentials are set to zero. Equation (1) thus becomes for ΔW(s1, s2, d):

$$\begin{array}{llll}{\rm{\Delta }}W({s}_{1},{s}_{2},d) & = & -{k}_{B}T\,ln\,\frac{n({s}_{1},{s}_{2},d)\,n}{n({s}_{1},{s}_{2})\,n(d)} & {\rm{if}}\,n({s}_{1},{s}_{2},d)\ge 10\\ {\rm{\Delta }}W({s}_{1},{s}_{2},d) & = & 0\, & {\rm{otherwise}}\end{array}$$
(2)

and similarly for the potential ΔW(s, d).

The second layer consists in smoothing the potential curves by replacing the number of occurrences in each bin with the weighted sum of the occurrences of the β neighborhood bins as:

$$\hat{n}({s}_{1},{s}_{2},d)=\mathop{\sum }\limits_{i=1}^{\beta }\,\frac{1}{{\alpha }^{i}}n({s}_{1},{s}_{2},d-i)+n({s}_{1},{s}_{2},d)+\mathop{\sum }\limits_{i=1}^{\beta }\,\frac{1}{{\alpha }^{i}}n({s}_{1},{s}_{2},d+i)$$
(3)

where d represents here a discrete distance bin rather than a continuous distance value, and where we chose β = 4 and α = 4/3. The number of occurrences \(\hat{n}({s}_{1},{s}_{2})\), \(\hat{n}(d)\) and \(\hat{n}\) are obtained from \(\hat{n}({s}_{1},{s}_{2},d)\) by summing over all distance bins and/or all amino acid types. The smoothing of \({\rm{\Delta }}W(s,d)\) is done in the same way.

Trans- and extramembrane folding free energy

The folding free energy of a protein represented by its sequence S and 3D conformation C was computed using the potentials derived from the \({{\mathscr{D}}}^{TM}\), \({{\mathscr{D}}}^{EM}\) and \({{\mathscr{D}}}^{GL}\) datasets as:

$$\begin{array}{rcl}{\rm{\Delta }}{W}_{sd}^{\mu }(S,C) & = & \frac{1}{2}\mathop{\sum }\limits_{i,j=1,|i-j| > 1}^{N}\,{\rm{\Delta }}{W}^{\mu }({s}_{i},{d}_{ij})\\ {\rm{\Delta }}{W}_{sds}^{\mu }(S,C) & = & \frac{1}{2}\mathop{\sum }\limits_{i,j=1,|i-j| > 1}^{N}\,{\rm{\Delta }}{W}^{\mu }({s}_{i},{s}_{j},{d}_{ij})\end{array}$$
(4)

where i and j denote positions along the amino acid sequence, N is the sequence length, and μ equals TM, EM or GL. To avoid any overfitting, the folding free energies were computed using a leave-one-out cross validation strategy, consisting in removing the target protein \((\bar{S},\bar{C})\) from the dataset \({{\mathscr{D}}}^{\mu }\) when computing its folding free energy \({\rm{\Delta }}{W}^{\mu }\,(\bar{S},\bar{C})\). Note that this cross validation procedure is very strict, since the datasets contain, by construction, no proteins with more than 30% pairwise sequence identity.

Per-residue folding energies

To test the accuracy and applicability of our potentials, we employed them to determine whether residues are localized in the trans- or extramembrane regions. For that purpose, we estimated the per-residue contributions to the folding free energy30. For residue i, we have:

$$\begin{array}{ccc}{\rm{\Delta }}{G}_{sd}^{i,\mu } & = & \frac{1}{2}\mathop{\sum }\limits_{j=1,|i-j| > 1}^{N}\,{\rm{\Delta }}{W}^{\mu }({s}_{i},{d}_{ij})\\ {\rm{\Delta }}{G}_{sds}^{i,\mu } & = & \frac{1}{2}\mathop{\sum }\limits_{j=1,|i-j| > 1}^{N}\,{\rm{\Delta }}{W}^{\mu }({s}_{i},{s}_{j},{d}_{ij})\end{array}$$
(5)

It is easy to see that the sum over all residues yields the global folding free energies of Eq. (4).

Results and Discussion

Amino acid frequencies

The relative frequencies of the twenty amino acids differ among the trans- and extramembrane datasets \({{\mathscr{D}}}^{{\rm{TM}}}\) and \({{\mathscr{D}}}^{{\rm{EM}}}\), as seen in Figs 1a and S1. Notably, the \({{\mathscr{D}}}^{{\rm{EM}}}\) frequencies are quite similar to the \({{\mathscr{D}}}^{{\rm{GL}}}\) frequencies, which is not surprising as the environments of globular proteins and extramembrane regions are similar, except for the region interacting with the membrane and the transmembrane region. We also analyzed the frequency of different types of residues as a function of the distance to the intra- and extracellular water-membrane interfaces, as shown in Fig. 1b.

Figure 1
figure 1

Relative frequencies of amino acid groups in the datasets \({{\mathscr{D}}}^{EM}\) and \({{\mathscr{D}}}^{TM}\). The amino acid groups are defined in  Methods. (a) Frequencies as a function of the solvent accessibility of the residues: 0–20% (blue), 20–50% (orange), and 50–100% (grey). (b) Frequencies as a function of the distance with respect to the water-membrane interfaces. Darker blue indicates higher frequencies and lighter blue lower frequencies. The straight lines represent the water-membrane interfaces. The extracellular side is directed upward and the intracellular side downward.

The clearest difference between transmembrane and extramembrane regions is observed for aliphatic residues Val, Ile and Leu: they are much more numerous in the former than in latter. In extramembrane regions, they tend to be located in the protein interior to avoid contact with water molecules, whereas in transmembrane regions, they are almost uniformly distributed; only near the interface does their frequency start to decrease. Note that Leu is more frequent than Val and Ile in transmembrane regions, probably because the former are favored in α-helices and the latter in β-strands31 and our dataset contains more α- than β-transmembrane domains.

Aromatic amino acids were also found more frequently in the transmembrane than in the extramembrane regions. They are preferentially located near the water-membrane and the protein-membrane interfaces. This observation is consistent with the finding that aromatic residues are very important in anchoring the protein into the membrane where they tend to form cation-π interactions with some positively charged lipid head groups32,33,34.

In contrast, charged amino acids are much more frequently observed in extra- than in transmembrane regions. This results from the large energetic cost of transferring a charged amino acid from an aqueous environment with a high dielectric constant (εwater = 80) to the membrane that has a low dielectric constant (εmembrane = 2 to 4)35. Moreover, we found differences in the distribution of positively charged residues in proteins whose transmembrane domain is α-helical. Indeed, as seen in Fig. S2, their frequency is higher in the regions oriented towards the cell interior than towards the cell exterior. This is consistent with the “positive-inside rule”, stating that positive residues are more abundant in the cytoplasmic regions than in the periplasmic regions for α-helical transmembrane domains inserted in bacterial inner membranes, or than extracellular regions in the case of eukaryotic membranes36. In cytochrome P450, the insertion or deletion of positively charged residues in some loop regions have been shown to modify the protein orientation with respect to the membrane and the translocation of protein segments across it37,38. The general explanation of this rule is that the interaction of the positively charged residues of the intracellular domain with the negatively charged lipids of the cytosolic membrane surface through electrostatic interactions causes the retention of the positively charged residues on the cytoplasmic face of the membrane39,40,41. Note that the positive-inside-rule has been used to predict the transmembrane orientation of α-helical membrane proteins42.

In β-barrel membrane proteins inserted into outer bacterial membranes, no significant differences are visible in Fig. S2 between charged residue frequencies in the intra- and extracellular regions. Yet, a compositional asymmetry has been described before, with a larger frequency of both positively and negatively charged residues in the extracellular regions22,43, where lipopolysaccharides are generally attached to the membrane. This “charge-outside” rule is not observed in our dataset.

Like the charged residues, the uncharged polar residues are also preferentially located in the extramembrane regions rather than inside the membrane. Their frequency is almost identical at both sides of the membrane.

Preferred interactions in transmembrane regions

Statistical distance potentials were derived separately from the datasets \({{\mathscr{D}}}^{{\rm{TM}}}\), \({{\mathscr{D}}}^{{\rm{EM}}}\) and \({{\mathscr{D}}}^{{\rm{GL}}}\), as described in Methods. Their comparison yields an objective evaluation of the residue-residue interactions that are more favorable in the transmembrane than in the extramembrane regions, and than in globular proteins. The potentials so obtained are depicted in Fig. S3. The potentials computed separately for α-helical and β-barrel transmembrane regions are shown in Fig. S5.

As expected, potentials obtained from residues with similar properties have similar shapes, up to a global distance shift due to the differences in residue size. We thus defined group potentials that combine residue pairs that have similar distance potentials, according to the procedure detailed in Methods. A selected subset of these potentials – those that are discussed in what follows – are shown in Figs 28 and in Figs S4 and S6. Group potentials have the double advantage of limiting the small-dataset effect that affects the statistical significance of the potentials, and of simplifying the interpretations of the results by considering together the residue pairs that form the same types of interactions.

Figure 2
figure 2

Salt bridge interactions between Arg or Lys and Asp or Glu. (a) Energy profile (in kcal/mol) as a function of the interresidue distance d (in Å) in globular proteins (dashed black line), extramembrane regions (blue line) and transmembrane regions (red line). (b) Difference between the energy profiles of salt bridges involving Arg (small dashed red) and Lys residues (large dashed red). (c,d) Salt bridges occurring in the transmembrane region of the protein structures 5AYN (iron transporter ferroportin) and 2WJR (NanC porin), respectively. The residues Arg and Lys are drawn in red, and Glu and Asp in blue.

Figure 3
figure 3

Interactions between amino acids of equal charge. Energy profiles (in kcal/mol) as a function of the interresidue distance d (in Å) for interactions: (a) between two negatively charged residues (Asp or Glu), and (b) between two positively charged residues (Lys or Arg), in globular proteins (dashed black line), extramembrane regions (blue line) and in transmembrane regions (red line). (c) Energy profiles of +/+ interactions in the transmembrane regions of α-helical proteins (orange) and β-barrel proteins (magenta). (d) Example of a cluster of four arginines separated by less then 4 Å inside the hydrophilic channel of the transmembrane region of KdgM porin from the Dickeya dadantii (PDB code 4FQE).

Figure 4
figure 4

Polar-polar and polar-charge interactions. (a) Energy profile (in kcal/mol) as a function of the interresidue distance d (in Å) for globular proteins (dashed black), extramembrane regions (blue) and transmembrane regions (red). (b) Cluster of polar residue interactions at the interface between transmembrane helices in 4ZW9 (GLUT3 glucose transporter).

Figure 5
figure 5

Anion-π interactions between Asp or Glu and Phe, Tyr or Trp. (a) Energy profile (in kcal/mol) as a function of the interresidue distance d (in Å) in globular proteins (dashed black), extramembrane regions (blue) and transmembrane regions (red). (b) Example of an anion-π interactions in the PDB structure 1A0S (sucrose-specific porin). The negatively charged amino acids are in red, the positively charged ones in blue and the aromatic residues in magenta.

Figure 6
figure 6

Cation-π interactions between Lys or Arg and Phe, Tyr or Trp (a) Energy profile (in kcal/mol) as a function of the interresidue distance d (in Å) in globular proteins (dashed black), extramembrane regions (blue) and transmembrane regions (red). (b) Energy profile in α-helical (orange) and β-barrel (magenta) transmembrane regions. (c,d) Distribution of the θ angle between the aromatic and guanidinium planes, for Arg-involving cation-π interactions. 0° corresponds to stacked and 90° to parallel conformations. (c) Distributions from extramembrane regions are in blue and those from transmembrane regions in red. (d) Distributions from α-helical (orange) and β-barrel (magenta) transmembrane proteins.

Figure 7
figure 7

Sulfur-π interactions between Met or Cys and Phe, Tyr or Trp. (a) Energy profile (in kcal/mol) as a function of the interresidue distance d (in Å) in globular proteins (dashed black), extramembrane (blue) and in the transmembrane regions (red). (b) Energy profile for sulfur-π (small dashed red), π-π (large dashed red) and sulfur-sulfur (continuous red) interactions in the transmembrane regions. (c) Example of sulfur-π interaction in the transmembrane region of the PDB structure 3S8G (ba3 cytochrome c oxidase); Met and Cys are in yellow and aromatic residues in magenta.

Figure 8
figure 8

Aromatic-aromatic interaction between Phe, Tyr or Trp and aliphatic-aliphatic interactions between Ile, Leu or Val (a) Energy profile (in kcal/mol) of aromatic-aromatic interactions as a function of the interresidue distance d (in Å) in globular proteins (dashed black), extramembrane (blue) and transmembrane regions (red). (b) Energy profile for aromatic-aromatic interactions in α-helical proteins (orange) and in β-proteins (magenta). (c) Energy profile (in kcal/mol) for aliphatic-aliphatic interactions as a function of the distance d (in Å) in globular proteins (dashed black), extramembrane (blue) and transmembrane regions (red).

Salt bridge interactions

Salt bridges are electrostatic interactions between positively (Lys, Arg) and negatively charged (Glu, Asp) residues which play an important role in the stabilization - especially thermostabilization44 - of globular proteins. Here we studied the energetic contributions of this kind of interaction in the different regions of membrane proteins as a function of the distance between the residues’ side chain geometric centers. As shown in Fig. 2a, both the extra- and transmembrane potentials have a characteristic minimum at a distance of about 4 Å1Footnote 1, but the latter are shifted downwards, by about −0.6 kcal/mol, over the whole distance range. Salt bridges appear thus much more stabilizing in the transmembrane than in the extramembrane region.

Two energy contributions play a role in the formation of salt bridges in globular proteins: the desolvation penalty upon burying an ion inside the protein, which is usually counterbalanced by the electrostatic gain in approaching the two opposite charges. In transmembrane protein regions, the situation is substantially different because the protein interior is more hydrophilic than the surface that is in contact with lipid molecules: the dielectric constant of the lipid bilayer is εmembrane ≈ 1−2, whereas εinterior varies from 2−6, up to 80 in the case of the hydrophilic channel in β-barrel porins or α-helical aquaporins45. Thus, burying an ion constitutes here an energy gain, which is added to the stabilizing electrostatic interaction between the two charged residues. We also observe that, in the transmembrane regions, Lys-containing salt bridges tend to be less stabilizing than Arg-containing ones (Fig. 2b), in which the positive charge is delocalized on the guanidinium group.

The salt bridge geometries vary according to the type of proteins. For example, stabilizing salt bridges are recurrently found across transmembrane helices in “charge zipper” conformations, defined as extended salt bridge ladders along transmembrane helical segments46, as illustrated in Fig. 2c. In other membrane proteins such as porin-like β-barrel structures, a large network of salt bridge interactions is observed in the hydrophilic pore, as shown in Fig. 2d.

Note that salt bridges have sometimes also pivotal functional roles. For example, they are responsible for G protein-coupled receptor (GPCR) activation and trafficking47, and for ion channel gating48.

Interactions between amino acids of equal charge

Here we focused on electrostatic interactions between two positively or two negatively charged residues, which are commonly known to be unfavorable. As seen in Fig. 3a,b, this is indeed the case when these interactions are established between residues in globular proteins or extramembrane domains. In contrast, when two amino acids of equal charge are both in the transmembrane domain, the interaction becomes stabilizing. This can be explained by the solvation gain obtained by burying the charged residues in the more hydrophilic core or by locating them inside hydrophilic channels, which tends to dominate the repulsive electrostatic force between the two electric charges.

Surprisingly, these effective potentials become even more favorable at short distances, in spite of the electrostatic repulsion. As seen in Fig. 3c, this counterintuitive effect is actually driven by β-barrel proteins, while in α-helical proteins +/+ and −/− interactions are very rare. Usually, we found such interactions to be located in the hydrophilic channel interior of transmembrane β-barrel structures. This can be explained by the earlier observation49 of favorable clusters of positively or of negatively charged residues in interaction with water molecules. Note that this stabilizing effect is amplified for residues in which the charge is delocalized. In Arg, where the charge is delocalized on the guanidinium group, the dispersion forces between stacked guanidinium groups reduces the electrostatic repulsion. An example of an Arg cluster is given in Fig. 3d.

Other polar-polar interactions

Not only the interactions between two charged residues, but also those between two non-charged polar residues, or between one charged and one non-charged polar residue, were found to be much more favorable in the transmembrane than in the extramembrane regions, and even more so, than in globular proteins (Fig. 4). The shift between the potentials is, however, smaller than for charge-charge interactions: about 0.4 kcal/mol at small distances. Note that the stabilization effect is slightly larger in β-barrel transmembrane proteins than in α-helical proteins due to the fact that the former are often channel-like structures filled with water, with which the polar moieties make favorable interactions.

Buried polar residues have previously been described as contributing significantly to the stability of membrane protein structures50, and to be especially important in the helix-helix interactions and in homo-oligomerization processes51,52. An example of polar cluster is shown in Fig. 4b.

Anion - π interactions

Since aromatic rings have non-vanishing quadruple moments, they can establish edgewise interactions with Asp and Glu side-chain carboxylate ions. Only recently has this kind of interaction received special attention in the context of their contribution to protein stabilization53,54,55. Even though some analyses suggest that their contribution is slightly destabilizing, their high occurrence frequency in biomolecular structures can be taken to signal cooperative phenomena involving other charged or aromatic residues, in which stability compensations could occur through more complex geometries such as anion-π-cation or anion-π-π systems54,55.

Figure 5 confirms that the effective energy contributions of anion-π interactions are destabilizing in both extramembrane regions and globular proteins, whereas their minimum value becomes neutral in the transmembrane part. Note that in the center of β-barrel membrane proteins, the anion-π interactions occur prevalently in complex geometries such as the one depicted in Fig. 5b involving two anions, two cations and two aromatic residues interacting with the aqueous solvent. In helical transmembrane regions, aromatic residues sometimes establish anion-π interactions with phospholipid anions; this occurs prevalently at the lipid-water interface53.

Cation-π interactions

Cation-π interactions are established when the cationic side chain of Lys or Arg is localized above or below the aromatic ring of Phe, Trp or Tyr. They play an important role in the stabilization of protein structures of both membrane and globular proteins and in protein-protein, protein-DNA and protein-ligand complexes56,57,58,59,60.

The distance-dependent energy profile of this kind of interactions is depicted in Fig. 6a. The potentials extracted from transmembrane, extramembrane and globular regions are similar, with a slightly more negative curve at short distances (<4 Å) in the case of globular proteins, and a preference for transmembrane regions with respect to the extramembrane ones for <6 Å.

It has been suggested that cation-π interactions influence more strongly β-barrel than α-helical transmembrane proteins61. In order to objectively study this difference, we plotted cation-π energy profiles extracted from these two different protein classes (Fig. 6b). What we found differs from previous findings61: the energy profile at short distances (below 5 Å) is negative in α-helical and slightly positive in β proteins. This indicates that cation-π interactions contribute more to stability in α-helical transmembrane regions.

In cation-π interactions involving Arg, the planar guanidinium group and the aromatic moieties can make favorable stacking interactions, which add up to the electrostatic interactions. We analyzed the geometry of these interactions through the study the distribution of the angle between the aromatic and guanidinium planes. As shown in Fig. 6c,d, the angle is preferentially around 20° in β-barrel transmembrane regions and the two planes are thus almost in parallel, stacked, conformations. In extramembrane regions, a preference for stacked conformations is also visible, whereas in α-helical transmembrane regions, basically all angle values are observed.

Cation-π interactions are known to be important not only for stability but also for their functional roles such as for example in substrate and ligand binding60,62. When they are established between the aromatic residues of the protein and the positively charged portion of phospholipid head groups, they are fundamental to anchor the protein to the membrane32,33,34. The importance of the aromatic rings in membrane anchoring is not easy to show using the statistical potential formalism as the so-obtained effective potentials take only implicitly the impact of the environment into account; indications of this anchoring effect are observed from the aromatic amino acid frequencies in Fig. 1.

Preferred interactions in extramembrane regions

We now have a closer look at the residue-residue interactions that are more favorable in the extramembrane than in the transmembrane regions, as measured by the distance potentials.

Sulfur-aromatic interactions

Sulfur-containing amino acids (Cys and Met) are highly polarizable and can establish nonbonded interactions with aromatic moieties. It has been shown that they play important roles not only in the stabilization of protein structures63,64,65,66 but also in their function66,67, as for example in the protection of Met against oxidation leading to methionine sulfide.

The potentials in Fig. 7a show the stabilizing contribution of sulfur-aromatic interactions, which is much stronger for the extramembrane than for the transmembrane regions. Indeed, for the latter region, the entire energy profile is shifted by about +0.2 kcal/mol on the average over all distances. It is interesting to note that sulfur-π interactions in transmembrane regions occur almost exclusively in α-helical proteins where interhelical interactions frequently involve methionine surrounded by a cage of aromatic residues. In the extramembrane region, they frequently involve partially exposed residues and more sulfur than aromatic residues (Fig. 7c).

We compared the strength of sulfur-π interactions and of aromatic-aromatic and sulfur-sulfur interactions in the transmembrane regions, but did not find a clear difference between the minimum energy values (Fig. 7b). This contrasts with earlier results obtained by a combination of structural bioinformatics and ab initio quantum chemistry calculations, which suggested that sulfur-aromatic interactions in membrane proteins are more stabilizing than aromatic-aromatic or sulfur-sulfur interactions65.

Regarding the geometry of the sulfur-π interactions, we did not see any substantial difference between the trans- and extramembrane regions. In both regions, we observed a slight preference for conformations with an angle of about 40–45° between the sulfur and the normal vector defined by the plane of the aromatic ring, in agreement with earlier findings63.

Aromatic interactions

Due to their hydrophobic nature, especially marked for Phe, aromatic amino acids prefer to be located in transmembrane regions or in the core of extramembrane regions (Fig. 1). On the basis of their energy profiles (Fig. 8), we observed that the interactions between pairs of aromatic residues are more favorable in extramembrane than in transmembrane regions. Moreover, they have almost the same weight in α-helical and β-barrel proteins, with a slight preference for the former (Fig. 8b), in agreement with earlier studies68. Note that in β-barrel proteins, the aromatic residues are usually lipid-facing, whereas in α-helical proteins they are in the protein interior. This difference is due to the fact that β-barrel transmembrane regions have almost no core.

The geometries of the aromatic-aromatic interactions are similar in transmembrane and extramembrane regions (data not shown). They occur preferentially in a T-shaped conformation. Note that π-π stacking plays a role not only in the tertiary structure stabilization but also in the oligomerization of the membrane protein subunits68.

When aromatic amino acids are positioned close to the lipid interface, they are known to play important roles in anchoring and positioning the protein inside the lipid medium through lipid-aromatic interactions (see69,70). The interactions between amino acids and lipid molecules are, however, not captured by our statistical potentials, which consider both lipids and water as the protein environment.

Aliphatic interactions

While hydrophobic forces play a dominant role for folding and stability in globular proteins, they contribute less to the stability of the transmembrane proteins71. This is indeed exactly what we observe in the energy profiles of Fig. 8c. When the interactions are established in extramembrane regions, the potentials are clearly stabilizing with an energy minimum at about 6 Å like in globular proteins. In transmembrane regions, the minimum is still present but about 0.4 kcal/mol higher, which indicates that these interactions are only marginally stabilizing.

However, even though hydrophobic forces are less important for folding, they are one of the contributing factors for the positioning and anchoring of the protein to the lipid membrane, especially in peripheral membrane proteins71 but also in integral membrane proteins. Indeed, hydrophobic interactions can be established between exposed non-polar residues and hydrophobic lipid moieties of the membrane, which determine the insertion and position of the proteins72. There are indeed more and more indications of protein-membrane hydrophobic matching, in which the hydrophobic part of the transmembrane domain has to match the hydrophobic thickness of the membrane bilayer in which it is embedded; moreover, this matching condition appears to strongly influence protein function72. Since our statistical potentials take implicitly but not explicitly the membrane bilayer into account, the latter effects are only observed indirectly.

Application of the membrane potentials to predict residue localization

The newly developed membrane statistical potentials were used to perform a binary classification of the residues into those that belong to the transmembrane or extramembrane regions. We computed for that purpose the per-residue contributions to the folding free energy derived from the extra- and transmembrane datasets \({{\mathscr{D}}}^{EM}\) and \({{\mathscr{D}}}^{TM}\) defined in Eq. (5). In general, we expect that if the per-residue contribution computed with transmembrane potentials is lower than that computed with extramembrane potentials, the residue in situated inside the membrane, and vice versa. But there are sometimes deviations from this rule. Indeed, some residues correspond to stability weaknesses, which means that they contribute unfavorably to the overall folding free energy30,73.

To predict the localization of a residue, we considered linear combinations of the per-residue folding free energies computed with the potentials “sd” and “sds” from the two datasets \({{\mathscr{D}}}^{{\rm{TM}}}\) and \({{\mathscr{D}}}^{{\rm{EM}}}\):

$${I}^{i}={\alpha }_{1}{\rm{\Delta }}{G}_{sd}^{i,{\rm{TM}}}+{\alpha }_{2}{\rm{\Delta }}{G}_{sds}^{i,{\rm{TM}}}+{\alpha }_{3}{\rm{\Delta }}{G}_{sd}^{i,{\rm{EM}}}+{\alpha }_{4}{\rm{\Delta }}{G}_{sds}^{i,{\rm{EM}}}+{\alpha }_{5}\,\mathrm{log}\,N+{\alpha }_{6}$$
(6)

where the coefficients α are parameters. We added two terms in this localization index: a constant term and a term proportional to the logarithm of the protein length. The latter term is introduced to correct for the possible length dependence of amino acid and distance frequencies74. We also defined a smoothed version of this localization index, by averaging it over a window of five successive residues along the chain centered around the target residue:

$${I}_{{\rm{sm}}}^{i}=\frac{1}{2}({I}^{i-2}+{I}^{i-1}+{I}^{i}+{I}^{i+1}+{I}^{i+2})$$
(7)

This index was used to classify the residues into two groups: the residues i with \({I}_{{\rm{sm}}}^{i}\le {\alpha }_{0}\) were considered to belong to the transmembrane region and those with \({I}_{{\rm{sm}}}^{i} > {\alpha }_{0}\) to the extramembrane region. The seven parameters αj (with j = 0…6) were identified so as to optimize the values of the balanced accuracy (BACC); the area under the receiver operating characteristic curve (AUC) was also computed.

The tests of performance were done using a strict leave-one-out cross validation procedure, where the target protein, whose residues we want to classify, is removed in all the stages of the computations, from the derivation of the statistical potentials to the optimization of the parameters. As the pairwise sequence identity inside the datasets is low (<30%), the cross validation is strict and in principle free from biases.

As shown in Table 1 and Fig. 9a, we obtained a BACC of 0.75 and an AUC of 0.83 on the whole set of membrane proteins. These good results indicate that our potentials describe quite well the stability properties of the membrane proteins in the two completely different environments that are water and lipids, and thus that they can be used to localize residues inside or outside the membrane.

Table 1 BACC and AUC values of the prediction of residue localization (inside or outside the membrane) obtained from the index Ism.
Figure 9
figure 9

Residue localization inside or outside the membrane. (a) Distribution of the \({I}_{{\rm{sm}}}^{i}\) index for the transmembrane (red) and extramembrane (blue) regions. The binary classification is performed using the threshold value indicated by the vertical black line, which yields a BACC value of 0.75. (b) Representation of the \({I}_{{\rm{sm}}}^{i}\) values for a prenyltransferase integral transmembrane protein (PDB code 4TQ4). The color scale, from red to blue, represents the \({I}_{{\rm{sm}}}^{i}\) values; red indicates a strong prediction of transmembrane localization and blue, of extramembrane localization.

Our predictor works better for the α-helical proteins (AUC = 0.86) than for the β-barrels (AUC = 0.74). We can argue that this difference is due to the fact that our dataset is dominated for two thirds by α-helical proteins, and that it is thus normal that this type of proteins is better predicted than β-barrels. Moreover, the β-barrel subset consist of channel and porin structures, in which the transmembrane region has an internal hydrophilic region in contact with water, and this makes this set substantially more difficult to predict using distance potentials only.

An example of localization prediction is shown in Fig. (9b) for Archaeoglobus fulgidus prenyltransferase, an α-helical integral membrane protein. Its residues are colored according to the predicted values of the localization index \({I}_{{\rm{sm}}}^{i}\). Clearly, our potentials are able to discriminate between extra- and transmembrane regions. Note that some of the residues that are not localized correctly are close to the membrane-water interface, where our potentials are the least accurate (see Conclusion). Some others could correspond to stability weaknesses, which means that they would benefit from being mutated to improve the global protein stability.

Conclusion

In this paper, we developed new transmembrane and extramembrane residue-residue potentials in view of identifying the amino acid interactions that contribute more strongly to the stabilization of either the transmembrane or the extramembrane region, and we compared them with their interaction strength in globular proteins. First of all, we observed that the potentials derived from globular proteins are much more similar to those derived from extramembrane than from transmembrane regions.

Despite their low occurrence in transmembrane regions, it seems that interactions involving polar residues tend to contribute more to the stability of these regions than of the extramembrane regions. In particular, salt bridges are stabler by more than 0.5 kcal/mol, and interactions between residues of equal charge, which are usually destabilizing, become stabilizing when located inside the membrane. This effect can be explained by the fact that burying a charged residue inside the lipid environment is not associated with a desolvation penalty, as it is in an aqueous environment. Note that clusters of positively or negatively charged residues situated inside β-barrel porin channels may have, not only a structural, but also a functional role in the flux of targeted molecules through the membrane. Non-charged polar-polar and anion-π interactions appear also more favorable in the transmembrane region, and so do cation-π interactions but to a much smaller extent.

Opposite trends are observed in the extramembrane regions. Hydrophobic residues, despite their preferential location in transmembrane regions, establish stronger effective interactions in extramembrane regions due to their pronounced tendency to avoid contact with water molecules, but not with lipids. In particular, aromatic-aromatic, aliphatic-aliphatic and aromatic-sulfur interactions appear to contribute more to stability in extramemembrane regions.

Note that these results have to be understood in the context of statistical mean-force potentials in which the water and lipid molecules are not considered explicitly. The lack of interactions between polar residues in extramembrane regions is indeed counterbalanced by interactions between polar residues and water molecules. Similarly, the lack of interactions between hydrophobic residues in intramembrane regions is counterbalanced by interactions between hydrophobic residues and lipid molecules.

Moreover, the class of transmembrane proteins strongly influences the effective strength of some of the residue-residue interactions. Indeed, we observed marked differences between some potentials derived from α-helical and β-barrel transmembrane domains. This is related to the fact that the latter are all channel-like structures filled with water and that the residues pointing towards the channel interior are mostly hydrophilic, whereas only a small fraction of the α-helical transmembrane proteins have such a structure. In fact, β-barrel transmembrane regions have no real core. Another difference between these two protein classes is due to the fact that β-barrel membrane proteins tend to be located in the outer membrane whose characteristics differ from the internal membrane where the α-helical proteins are almost exclusively located. The effect of two different environments of course influences the shape of our membrane-protein statistical potentials.

In order to check the validity of our statistical potentials, we used them to predict whether a residue is localized in the transmembrane or in the extramembrane region. The high BACC and AUC values obtained in cross validation, in addition to the fact that their application is extremely fast, make these potentials an invaluable asset for various investigations in membrane protein design or in large-scale studies of membrane positioning.

Despite the good results obtained, our potentials can still be improved. First of all, we have to remember that we used the OPM annotations to identify transmembrane regions in membrane proteins, and that most of them are predictions. This is unavoidable as the number of proteins with experimentally characterized membrane positioning and thickness is extremely limited. This means that there could be inaccuracies in these annotations, which in turn could lead to some inaccuracies in the potentials even though their construction is robust against OPM misclassification errors.

Furthermore, when larger larger datasets of membrane proteins will be available, our statistical potentials will certainly yield a more accurate description of the stabilizing contributions, and will make it possible to divide the dataset into several subclasses of transmembrane proteins that have specific characteristics such as ion channels, (aqua)porins, α-helical or β-barrel topology, or their insertion into different membrane types, which are likely to influence the effective interactions. Moreover, potentials that involve other structural descriptors than the interresidue distance, such as backbone torsion angle domains or solvent accessibility could further improve the prediction of residue localization presented here. This will be the subject of a forthcoming paper.

Finally, the interactions that prevail at the water-lipid or protein-lipid interface are crucial for the anchoring of transmembrane proteins into the membrane and are not well described by our statistical potentials. These are by definition effective potentials and thus the interactions with the lipid or aqueous environment are only considered indirectly. Combining the present analysis with explicit solvent models could be a possibility to unravel this important aspect of membrane proteins.