GLN: a method to reveal unique properties of lasso type topology in proteins

Niemyska, Wanda; Millett, Kenneth C.; Sulkowska, Joanna I.

doi:10.1038/s41598-020-71874-2

Published: 16 September 2020

Wanda Niemyska^1,2,
Kenneth C. Millett⁴ &
Joanna I. Sulkowska^2,3

Scientific Reports volume 10, Article number: 15186 (2020)


Abstract

Geometry and topology are the main factors that determine the functional properties of proteins. In this work, we show how to use the Gauss linking integral (GLN) in the form of a matrix diagram, computed for a pair consisting of a loop and a tail, to study both the geometry and topology of proteins with closed loops, e.g. lassos. We show that the GLN method is a significantly faster technique to detect entanglement in lasso proteins than other methods. Based on the GLN technique, we conduct a comprehensive analysis of all proteins deposited in the PDB and compare it to the statistical properties of polymers. We show how high and low GLN values correlate with the internal flexibility of proteins, and how the GLN in the form of a matrix diagram can be used to study folding and unfolding routes. Finally, we discuss how the GLN method can be applied to study entanglement between two structures, neither of which is a closed loop. Since this approach is much faster than other linking invariants, the next step will be the evaluation of lassos in much longer molecules such as RNA, or of loops in a single chromosome.

Introduction

The protein backbone describes a collection of space curves, a type of spatial structure that mathematicians have been analysing and comparing for a long time. One well-known measure of how two such curves interact with one another is the Gauss linking integral, which is related to Ampère’s law in magnetostatics and has important applications in modern physics. For two oriented closed curves the Gauss linking integral is always an integer, called the linking number, which is an invariant describing the number of times one curve winds around the other. The linking number of two unlinked curves is 0, while the Hopf link is the simplest link, with linking number equal to $+1$ or $-1$ depending upon the relative orientation of the curves¹ (see Supplementary Information Fig. S1).

Protein chains are open curves, which is often challenging for mathematicians and induces high computational complexity in algorithms involving randomness and statistics^2,3, as in the case of identifying knots⁴, slipknots^5,6 and links in proteins⁷. Against such a backdrop, the fact that the Gauss linking integral may be defined generally for open curves and calculated precisely for polygonal chains makes this measure particularly attractive.

The first biological applications of the Gauss linking integral are found in studies of DNA structure⁸. Røgen and Fain applied this measure for comparing and effectively classifying protein structures⁹. More recently, the Gauss integral has been used for identifying linking in domain-swapped protein dimers¹⁰.

In this paper we show that the Gauss linking integral, which we denote by GLN, captures unique properties of lasso proteins (Fig. 1), another type of non-trivial topology identified recently in proteins containing a disulfide or other type of bridge^11,12. Complex lasso topology is found in at least 18% of all proteins with disulfide bridges in a non-redundant subset of the PDB, and thus represents the largest group of proteins with non-trivial topology. (It is important to remember that, when speaking of protein topology in general, we mean the use of mathematical concepts and topological strategies to study protein chain geometry.)

Lassos occur in structures with disulfide (or other) bridges that create a loop and a pair of termini. When at least one terminus of a protein backbone is entangled with the covalent loop (closed by such a bridge), a topologically complex structure is formed. The topology is identified by spanning a specific surface (i.e. a minimal surface) on the covalent loop (Fig. 1) and identifying the crossings of the tails with the surface¹¹. Several classes of lasso structures in proteins are currently known. In addition to the trivial lasso L$_0$, the principal structures are the single lasso L$_1$, the double lasso L$_2$, and the triple lasso L$_3$, depending upon whether the loop is pierced once, twice or three times, respectively, by the same tail, which goes through the loop and turns back several times. A structure with more than one piercing from the same direction is called a lasso supercoiling LS (one tail pierces the loop, then winds around the protein chain comprising the loop and pierces it again). Another case identified in proteins is the two-sided lasso LL (a loop pierced by both tails). It is important to note that, from a mathematical point of view, all classes of lassos are topologically equivalent to the trivial lasso L$_0$ because the free ends are not prevented from unwinding. Even if we connected the free ends without disturbing the windings, all classes except the lasso supercoil LS would still be topologically equivalent to the trivial lasso. From a biological point of view, however, they are still very interesting complex structures. For example, a correlation between the type of lasso topology and the specific function of a protein has been identified¹¹. All proteins that form any type of lasso are collected in the LassoProt database¹².

Proteins with lassos are found in all domains of life and possess diverse functions^11,12. Lasso topology can influence the thermodynamic properties and biological activity of proteins^13,14. Cysteine bridges provide stability to protein structures and a non-trivial topology can enhance this influence^7,15. However, it is also known that non-trivial topology hinders the folding pathway¹⁶, leading to possible misfolding¹⁷. How evolution solves this delicate balance is one of the open questions. There are many others at the interface of biology and mathematics. What is the role of the lasso? Is there a correlation between the lasso type and the biological function? How do these proteins fold in oxidative conditions? The last question, however, does not concern the lasso peptides, which are a class of ribosomally synthesized, post-translationally modified natural products found in bacteria. These peptides nevertheless have a diverse set of pharmacologically relevant activities, including inhibition of bacterial growth, receptor antagonism, and enzyme inhibition¹⁸. Thus, can lasso topology be useful in bioengineering or in pharmacological applications to design proteins with a desired fold, stability or other features? In polymer chemistry, lassos (known as tadpoles) are used to design materials with desired properties^19,20,21. Since lassos are defined using open curves, they are also inspiring mathematicians to construct topological tools capable of classifying them^22,23. However, up to now, the question of whether a loop and a tail can be entangled in a protein while the minimal surface spanned on the loop is not pierced has not been asked. How might this entanglement influence the biophysical properties of a protein? The Gauss linking integral approach could reveal more information about lasso proteins than the previous geometric method.

The aim of this research is to better understand the entanglement of lasso proteins and its influence on their thermodynamic properties. To do so, we first introduce a new technique based on the Gauss linking integral and then apply it to assess the topological complexity of proteins with disulfide bridges. We show that GLN provides new information about the entanglement of the loop and tails, related to geometric features of the minimal disc piercings, and in addition identifies entangled proteins with different complex lasso topologies. We introduce the GLN fingerprint to display the local winding of a protein backbone and as another method to quantify entanglement in proteins with non-trivial linking topology. Finally, we use GLN as a descriptor to study the free energy landscape of proteins and show the influence of non-trivial topology on protein stability and folding pathways.

Results

Our new approach relies on the definition of the Gauss linking integral. Let us first consider a protein chain with a disulfide bond connecting two amino acids that, in this way, creates an unknotted covalent loop. The complementary parts of the chain are the tails. When at least one tail pierces a minimal surface spanned on the loop, the entire structure is called a complex lasso (Fig. 1). In this study, we compute the Gauss linking integral, which we denote by GLN, quantifying the linking between each tail and the closed loop. The GLN is an algebraic measure of how many times (and in which direction) the tail winds around the loop, with cancellation. For example, a value of GLN close to 1 means that the tail winds around the loop more or less once, in total. In the simplest cases, the tail passes once through the surface spanned on the loop (in a positive direction, following the natural orientation of the protein from the N-terminus to the C-terminus). Such a structure resembles the single lasso called L$_1$. If the direction is reversed, the linking number is close to $-1$. Note that, in complex cases, the tail can pass around the loop twice in a positive direction and once in a negative direction for an algebraic total of about 1. Moreover, by definition, the linking number of two unlinked curves is 0, although one cannot infer with certainty that curves with linking number 0 can be separated. This is demonstrated by the “Whitehead” link, in which the algebraic linking of the two closed loops is zero but they are geometrically entangled, and one chain intersects a minimal surface spanned on the other chain at least twice in opposite and therefore cancelling directions. We will present conditions to identify and classify proteins with cysteine bridges.

GLN definition from protein perspective

The mathematical definition of linking number between two closed curves in 3 dimensions is given by the Gauss double integral. In the case of proteins, the molecular chains become collections of points, i.e., positions of C$\alpha$ atoms, and the integrals may be replaced by sums of exact quantities determined by pairs of segments connecting the points as determined by the molecular chain²⁴. We must relax the expectation of having an integer indicator of linking as we perform the double Gauss integral over open chains. See the “Materials and methods” section for the details. We propose the analysis of four main values for each pair consisting of a loop and a tail:

1. whGLN: the GLN value of a loop and a whole tail,
2. minGLN and
3. maxGLN: respectively, the minimum and maximum values of GLN between a loop and any fragment of a tail, and
4. $max|GLN|=\max \{maxGLN,-minGLN\}$.

Additionally, for each triple of a loop and two tails, we consider the max2|GLN| value, defined to be the maximum of the max|GLN| values for both tails. We determine the positive direction of windings according to the natural direction of a protein chain, oriented from the N-terminus to the C-terminus. A high maxGLN or a low minGLN indicates that the corresponding part of a tail winds significantly around a loop in a “positive” or “negative” direction, respectively. Usually the minimal surface spanned on the loop is pierced by this part of the tail.

We analyzed the entire set of 5106 non-redundant proteins in the Protein Data Bank with at least one disulfide bridge (13,320 covalent loops in total), taken from the LassoProt database¹². See the “Materials and methods” section for details about the dataset.

Application of GLN to this dataset reveals a Gaussian distribution with a long tail, as shown in Fig. 2. In the majority of cases, the GLN is near 0.2, indicating proteins in which the minimal surface spanned on the loop is probably not pierced. However, the long tail shows that in a high fraction of chains with cysteine bridges at least one tail winds significantly around the loop. For example, in 21% of chains, we have at least one loop with $max2|GLN|>0.6$ and, in 9.4% of loops, we have $max2|GLN|>0.6$. The value 0.6 seems to be a good threshold with which to distinguish between complex and trivial topologies, since over 93% of loops with $max2|GLN|>0.6$ have the minimal surface spanned on the loop pierced by a tail at least once, and only 4% of loops with $max2|GLN|\le 0.6$ have loop spanning surfaces pierced by either tail.

The GLN fingerprint as a method to classify lasso structures

To identify the correlation between topology and geometry of proteins, we adopt the idea of the topological fingerprint used to exhibit the internal knots in proteins called slipknots^6,25. Here, we present the linking complexity in the form of a matrix diagram, computed for a pair consisting of a loop and a tail, that shows the GLN between the loop and the entire tail and each of its subchains.
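A minimal sketch of how such a matrix could be assembled is given below, in Python with NumPy (our choice; the text does not specify an implementation). It assumes that the per-segment contributions `seg_gln[i]`, i.e. the GLN between the whole loop and the i-th unit segment of the tail, are already available. The function name `gln_fingerprint` and the zero-based indexing are illustrative assumptions.

```python
import numpy as np

def gln_fingerprint(seg_gln):
    """Build the fingerprint matrix from per-segment GLN contributions.

    seg_gln[i] is assumed to be the GLN between the whole loop and the
    i-th unit segment of the tail. Returns m with
    m[k, l] = GLN between the loop and the tail fragment made of
    segments k..l (inclusive); entries with l < k are NaN.
    """
    seg = np.asarray(seg_gln, dtype=float)
    n = seg.size
    # prefix[i] = sum of the first i segment contributions
    prefix = np.concatenate(([0.0], np.cumsum(seg)))
    m = np.full((n, n), np.nan)
    for k in range(n):
        # sum over segments k..l equals prefix[l + 1] - prefix[k]
        m[k, k:] = prefix[k + 1:] - prefix[k]
    return m
```

With this layout, `m[0, -1]` is the value for the whole tail, which is the corner of the diagram. Because GLN is additive over tail segments, the full matrix costs only one cumulative sum once the per-segment values are known.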

The analysis of our dataset reveals that covalent loops in proteins can be classified into a few distinct motifs, represented by particular patterns within the matrix diagrams. Four characteristic motifs are shown in Fig. 3. Each point of the matrix corresponds to a specific subchain of the tail, where the id of the first residue is on the x-axis and the id of the last residue is on the y-axis. As a consequence, the bottom left corner corresponds to the whole tail. The color intensity indicates the value of the GLN between the disulfide loop and the specific subchain of the tail. A red color indicates negative linking values, reflecting the negative direction, while blue indicates positive linking values. These GLN matrices are used to introduce the following classification of proteins with cysteine bridges:

$\mathbf{gL}_{0}$, there are no clear colorful patches in the matrix, indicating that the tail does not wind around the loop.
$\mathbf{gL}_{1}$, there is one colorful patch in the matrix (e.g. in the bottom left corner), indicating that the tail winds around the loop once. The color indicates the direction.
$\mathbf{gL}_{2}$, there are two patches in different colors in the matrix (e.g. one on the left edge and a second one on the bottom edge). This indicates that the tail winds around the loop in one direction and then in the opposite direction. [This spatial arrangement can be observed by following the left edge of the matrix in a descending direction: the beginning of the analyzed segment remains the same (the beginning of the tail) while the end of the analyzed segment moves towards the end of the tail. When we approach the patch, a color begins to appear, meaning that the tail begins to wind around the loop. Below the colorful patch we again see white, indicating that the tail winds around the loop again but in the opposite direction, thereby cancelling the initial winding contribution. Thus the windings “cancel” themselves and the corner of the matrix is again almost white (see Fig. 1)].
$\mathbf{gL}_{3}$, there are four colorful patches in the matrix, e.g. one in the middle in a different color than the three other patches; this indicates that the tail winds around the loop in one direction, then turns and winds around the loop in the opposite direction, and finally turns back one more time.
$\mathbf{gL}_{n}$, for any natural n, there is a specific number of colorful patches in the matrix, dependent on n (namely $\left\lfloor \frac{n+1}{2} \right\rfloor \cdot \left\lfloor \frac{n+2}{2} \right\rfloor$); this indicates that the tail winds around the loop n times, each time in the opposite direction to the previous one.
$\mathbf{gLS}$, there is usually one big patch in one color which at some point becomes very intense (claret or navy in the case of negative and positive windings, respectively); this means that the tail winds around the loop in one direction (making a full circle) and then winds around it one more time in the same direction.
$\mathbf{gLL}$, both matrices for the two tails have at least one colorful patch; this indicates that both tails wind around the loop.
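The patch count quoted for gL$_n$ can be checked with a short script. The sketch below rests on an interpretation that is ours and is not stated in the text: the n windings are idealized as alternating contributions $+1, -1, +1, \ldots$, and a patch is counted for every run of consecutive windings whose net sum is non-zero. The function names are hypothetical.

```python
def patches_formula(n):
    """Number of colorful patches for a gL_n motif, as given in the text."""
    return ((n + 1) // 2) * ((n + 2) // 2)

def patches_by_counting(n):
    """Count runs of consecutive windings with non-zero net winding.

    Assumption (ours): winding i contributes (-1)**i, so a run a..b
    has non-zero net winding exactly when its length is odd.
    """
    signs = [(-1) ** i for i in range(n)]
    return sum(
        1
        for a in range(n)
        for b in range(a, n)
        if sum(signs[a:b + 1]) != 0
    )

# Compare the closed formula with the brute-force count for small n.
for n in range(1, 8):
    assert patches_formula(n) == patches_by_counting(n)
```

Under this reading, the formula gives 1, 2 and 4 patches for n = 1, 2 and 3, in agreement with the descriptions of gL$_1$, gL$_2$ and gL$_3$ above.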

Similar GLN matrices indicate the same topological motifs even though the chains may have a different structure. Examples of the same GLN matrices for proteins with very low sequence similarity are shown in Supplementary Information (Figs. S4 and S5). The motifs ${\hbox {gL}}_n$, gLS and gLL usually correspond to the lasso types ${\hbox {L}}_n$, LS and LL, respectively. The GLN matrices reveal much more detail about the geometry of the chains with lassos. By analysing the location, size and color of a collection of patches one may deduce which parts of the tail wind around the loop and how fast and tightly they wind. For the most part, intense patches correspond to the tail piercing the minimal surface spanned on the loop. This is not always the case, since the tail may make an almost full circle around the loop but not pierce the minimal surface spanned on the loop (see Table 1). Such complex configurations had not been identified by methods that studied intersections with the minimal surface spanned on the loop¹¹.

Classification of lasso protein structures and entangled but unpierced loops

In this section we describe some methods to classify proteins with lassos based on the Gauss linking integral. We propose a precise classification of loop–tail pairs having distinct linking motifs presented by the GLN fingerprints (Fig. 3). This is based on three positive real numbers $t_L,t_{L+},t_{LS}$ (for instance $t_L, t_{L+} \approx 0.6, t_{LS} \approx 1.5$), as follows:

$\mathbf{gL}_{0}$: if $max|GLN|\le t_L$,
$\mathbf{gLS}$: if $max|GLN|>t_{LS}$.

In all of the next three cases we require that $max|GLN|\in (t_L,t_{LS}]$, and:
$\mathbf{gL}_{1}$: if exactly one of the values maxGLN and $-minGLN$ is greater than $t_L$,
$\mathbf{gL}_{2+}$: if both values maxGLN and $-minGLN$ are greater than $t_{L}$ and $|whGLN|\le t_{L+}$,
$\mathbf{gL}_{3+}$: if both values maxGLN and $-minGLN$ are greater than $t_{L}$ and $|whGLN|> t_{L+}$.

One can also consider the whole triple consisting of a loop and two tails: if one of the tails is classified as gL$_0$, then we say that the triple is of the type of the other tail; if both tails are classified as something other than gL$_0$, we say that the triple is of the type gLL.
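The rules above translate directly into a small decision function. The following Python sketch is illustrative only: the function names and string labels are our own, and the default thresholds are the values $t_L=0.69$, $t_{L+}=0.6$, $t_{LS}=1.55$ reported in the text, which may need adjusting for other datasets.

```python
def classify_pair(max_gln, min_gln, wh_gln, t_L=0.69, t_Lplus=0.6, t_LS=1.55):
    """Classify one loop-tail pair from its maxGLN, minGLN and whGLN."""
    max_abs = max(max_gln, -min_gln)      # max|GLN|
    if max_abs <= t_L:
        return "gL0"
    if max_abs > t_LS:
        return "gLS"
    # From here on, max|GLN| lies in (t_L, t_LS].
    positive = max_gln > t_L
    negative = -min_gln > t_L
    if positive != negative:              # exactly one value exceeds t_L
        return "gL1"
    return "gL2+" if abs(wh_gln) <= t_Lplus else "gL3+"

def classify_triple(type_tail_a, type_tail_b):
    """Combine the types of the two tails of one loop."""
    if type_tail_a == "gL0":
        return type_tail_b
    if type_tail_b == "gL0":
        return type_tail_a
    return "gLL"

# Example: one tail winds once in the negative direction, the other is trivial.
n_tail = classify_pair(max_gln=0.10, min_gln=-0.95, wh_gln=-0.90)
c_tail = classify_pair(max_gln=0.20, min_gln=-0.15, wh_gln=0.05)
print(classify_triple(n_tail, c_tail))
```

Note that the gLS test is applied before the gL$_1$, gL$_{2+}$ and gL$_{3+}$ tests, mirroring the requirement that those three cases only apply when $max|GLN|\le t_{LS}$.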

Let $\hbox {L}_{2+}$ denote the union of the types $\hbox {L}_{2n}$ for any natural $n\ge 1$ (in proteins we have so far found examples of $\hbox {L}_2$, $\hbox {L}_4$ and $\hbox {L}_6$, see Ref.¹²). Let $\hbox {L}_{3+}$ denote the union of the types $\hbox {L}_{2n+1}$ for any natural $n\ge 1$ (in proteins we only know examples of $\hbox {L}_3$). We found that it is possible to choose particular values of $t_{L}, t_{L+}, t_{LS}$ (namely $t_{L}=0.69, t_{L+}=0.6, t_{LS}=1.55$) such that as much as $98\%$ of loops are classified in an analogous way by both the minimal surface technique and the GLN, as shown in Fig. 4 (see Supplementary Information Fig. S5 for a detailed comparison). Most of the remaining 2% of loops are structures with intriguing properties that were not recognized before¹¹. We split them into three groups.

The first group consists of proteins in which the minimal surfaces spanned on the loops are not pierced but the tails wind strongly around the loop, or the surfaces spanned on the loops are twisted and wind around the tails. When the loop is twisted, it appears that there is not enough space to thread the tail through the loop even though the loop is composed of more than 100 amino acids. There are only 15 such proteins among the set of non-redundant chains shorter than 500 amino acids (see Table 1), with $max|GLN|>0.69$ and no piercings. One can ask how this type of entanglement influences the free energy landscape of the protein in oxidizing conditions. We speculate that, in this case, some part of the configurational space is excluded from protein backbone exploration during folding. Unwanted threading will have to backtrack, thereby slowing down folding or even leading to misfolding.

The second group contains proteins with high |GLN| values and closed loops that are pierced by the tails but, in the minimal surface technique, these piercings are interpreted as being too shallow and are reduced, i.e. they are not taken into account. (Generally, this is a reasonable approach since, for instance, helices that cross surfaces usually cross them at least three times over a short distance, and we wish to interpret this as simply one meaningful crossing. However, it is not an easy problem to distinguish shallow crossings from relevant ones (see Supplementary Information Fig. S6), and the parallel analysis of GLN matrices may be very helpful in recognizing which reductions are justified or spatially reasonable.)

The third group consists of structures with a low max|GLN| value but with tails piercing the minimal surface spanned on the loops. There are only 9 such loops (0.01% of the analyzed data set), see Supplementary Information, Table S1. These structures have $max|GLN|\le 0.6$, and none has $max|GLN|<0.5$. A detailed analysis shows that in some structures the GLN value is low because the piercing segment lies in the plane of the loop, i.e. the piercing is quite “shallow”.

Table 1 “Entangled” proteins without piercing through a covalent loop closed by a disulfide bridge. Based on loops from non-redundant chains of a length lower than 500 amino acids, which are not pierced, but have $max|GLN|>0.69$.


Unique biophysical features of lasso proteins

An analysis of the statistics concerning GLN reveals interesting features from the biological point of view. First of all, windings in the negative direction occur significantly more often than those in the positive direction. For example, among the loops of $\hbox {gL}_1$ type over 63% have a negative GLN value (see Fig. 5, panel B). However, a detailed analysis of basic physico-chemical properties (type of amino acids, type of disulfide bridge²⁶) does not yet provide an explanation of this difference.

The histogram of all whGLN values reveals a noticeable depression around the value $-0.5$ (see Fig. 5, panel C). This shows that there are only a few tails that come close to the loop without piercing it. In the case of random polymers with the same loop and tail sizes, such behaviour is not observed (see Fig. 5, panel D). This implies that the depression in the protein distribution arises from specific side-chain interactions: the tails either make contacts away from the loop or, if they are close enough to the loop, tend to pierce the minimal surface spanned on it.

Considering the lengths of loops and tails we find that the average value of maxGLN depends logarithmically on the length of a tail, up to a length of around 40 amino acids. Next, maxGLN saturates and remains stable around the value 0.25 (0.55 for polymers) (see Fig. 6).

Finally, the analysis of B-factors (temperature factors) shows that, in chains with short loops, the amino acids for which |GLN| between the loop and the tail fragment running from the beginning of the tail to that amino acid is highest have higher B-factors than average. Moreover, the amino acids for which |GLN| between the loop and the unit segment corresponding to the amino acid is highest (often those segments pierce the minimal surface spanned on the loop) have significantly lower B-factors, lower even than the amino acids forming cysteine bridges. For all loops the tendency is similar, although somewhat weaker (see Table 2). This suggests that the parts of tails piercing the loops’ spanning surfaces are more stable, while the parts of tails between bridges and crossings fluctuate more. This is in agreement with available experimental data for lasso-type polypeptides²⁷.

Table 2 Correlation between GLN values (of unit segments of tails and whole loop) and B-factors for corresponding amino acids in lasso proteins. Second column: proteins with loops consisting of less than 50 amino acids are taken into account. Third column: all loops.


The strong correlation between the GLN values of unit segments with the whole loop and the B-factors of the corresponding amino acids is clearly visible in Fig. 7. High B-factors correlate with low |GLN| values and, conversely, high |GLN| values correlate with low B-factors. This again suggests that pieces of the tail winding around the loop are more stable than the other segments of the tail.

Applications of the GLN fingerprint

Understanding the mechanism by which proteins fold to their native structure is a central problem in protein science²⁸. In the case of a majority of proteins, native contacts are sufficient to drive the folding of the protein^29,30,31 since their free energy landscape is minimally frustrated³². The fraction of native contacts, called Q, was shown to be a good reaction coordinate to study the folding mechanism for a majority of proteins²⁸. However, in the case of proteins with non-trivial topology (e.g. the smallest knotted protein MJ0366³³), Q merely represents the progress of folding³⁴.

Next, we show that the GLN values and the GLN fingerprint can reveal information about the topology that is hidden from Q, based on unfolding pathways simulated with a structure-based model³⁵. In fact, in the case of the ribonuclease U2 protein with the $gL_{3}$ motif (the loop is pierced three times), GLN values reveal an ensemble of transition states composed of at least two unfolding pathways: via the slipknot topology^16,36 or by direct unthreading (see Fig. 8). Moreover, superposition of the fingerprints over time shows how the protein backbone travels through the available conformation space. The same technique can be applied to reveal the untying of even more complex topologies such as the supercoiling motif gLS (one tail winding around the loop and piercing it two or more times from the same side). The unfolding pathway for a protein with $gLS_3$ is shown in Supplementary Information Fig. S7.

The application of the GLN is not limited to studying lasso proteins or proteins with links⁷. Since the GLN measures mutual entanglement, its fingerprint is different for “the same” protein with two topologies, unknotted and knotted (see Supplementary Information Fig. S8)³⁷. Furthermore, the pattern of the GLN fingerprint can be used to identify the type of secondary structures of the protein, which are usually visible via a contact map. Note that the shape of the contact map depends on the cutoff distance used to determine physical contacts, while GLN does not depend on additional parameters. Moreover, the sign of GLN (blue or red color on the matrix) indicates the “direction of contact”, i.e. from it one can deduce on which side the fragments of the protein chain that are in contact pass each other (for more details see Supplementary Information Figs. S8, S9). Thus, the GLN fingerprint of a native conformation can be used as a reference value for a reaction coordinate in studying the folding pathways of a protein.

Discussion and conclusions

We have shown that the GLN method is a significantly faster technique to detect entanglement in proteins with closed loops than the methods which rely on minimal surfaces spanning the covalent loops¹¹. The method also reveals much more information about the geometry of chains with lassos, which may lead to new biological and chemical discoveries. However, the algorithm based on the surfaces has the advantage of giving precise information about the exact residues that cross the spanning surface, which may lead to important insight from the biological point of view. We believe both approaches can complement each other and, together, help focus study on important features of the protein.

The GLN fingerprint of a native conformation can be used as a reference value for a reaction coordinate in studying the folding pathways of a protein. It can also be used to compare proteins, e.g. during the CASP or CAPRI competitions. Indeed, it can be pushed further, so that the GLN fingerprint provides a powerful tool to improve the already very successful deep learning algorithms used to predict the tertiary and quaternary structure of proteins via image recognition³⁷.

The present method can be applied to any structure in which a loop and a tail can be defined. Apart from the cysteine bridge loops investigated here, a loop can be formed, among others, by a salt bridge, by a hydrogen bond, or by ions. An example of the last case is the human transport protein (PDB code 1n84), with the loop closed by the Tyr95-Fe339-Asp63 interaction, whose spanning surface is pierced by the C-terminal tail (Thr250)³⁸, thus forming a lasso of gL$_1$ type.

Moreover, one can apply the GLN approach to study entanglement between two structures, neither of which is a closed loop. Recently a new algorithm, GISA, was proposed to study local entanglement in protein chains and other biopolymers³⁹. The algorithm computes Gauss integrals between many pairs of quite short fragments of a chain and finds rare invariant values. It can also be helpful in the search for knots, links and highly entangled configurations not previously described. Furthermore, since this approach is much faster than other linking invariants, it will provide a very useful technique to study loops in a single chromosome as well as chromosome entanglement in the cell^40,41. Current methods allow one to describe single chromosomes with high resolution (thousands of beads). This number is already an order of magnitude bigger than the typical length of a protein.

Materials and methods

Gaussian linking number

A definition of linking number between two closed curves $\gamma _1$ and $\gamma _2$ in 3 dimensions is given by the Gauss double integral,

$$\begin{aligned} GLN\equiv \frac{1}{4\pi }\oint _{\gamma _1}\oint _{\gamma _2} \frac{\vec {r}^{(1)}-\vec {r}^{(2)}}{|\vec {r}^{(1)}-\vec {r}^{(2)}|^3}\cdot \big (d\vec {r}^{(1)}\times d\vec {r}^{(2)}\big ), \end{aligned}$$

(1)

where $\vec {r}^{(1)}$ and $\vec {r}^{(2)}$ are positions on the two curves. Gauss proved that, for closed oriented curves, this integral is always an integer, is invariant under isotopy, and measures how many times one curve winds around the other. In the protein case, chains become collections of points, i.e., positions of C$\alpha$ atoms $\{ \vec {r}_1^{(k)}, \vec {r}_2^{(k)},\ldots , \vec {r}_{N_k}^{(k)} \}$, for chains of length $N_k$, $k=1,2$. The integrals may be replaced by sums over segments $d\vec {R}_i^{(k)}=\vec {r}_{i+1}^{(k)}-\vec {r}_i^{(k)}$, for which we use the midpoint approximation $\vec {R}_i^{(k)}=(\vec {r}_{i+1}^{(k)}+\vec {r}_i^{(k)})/2$. We can replace the requirement of having oriented closed loops by oriented open arcs, which gives a real value as a measure of linking rather than an integer. We can then perform the discrete double Gauss integral over the open chains,

$$\begin{aligned} GLN\equiv \frac{1}{4\pi }\sum _{i=1}^{N_1-1}\sum _{j=1}^{N_2-1}\frac{\vec {R}_{i}^{(1)} -\vec {R}_j^{(2)}}{|\vec {R}_i^{(1)}-\vec {R}_j^{(2)}|^3}\cdot \left( d\vec {R}_i^{(1)}\times d\vec {R}_j^{(2)}\right) . \end{aligned}$$

(2)

Note that one can simply employ the Banchoff method on the open chain to calculate this integral explicitly²⁴.
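As an illustration, the discrete sum of Eq. (2) with the midpoint approximation can be written in a few lines of Python with NumPy. This is a sketch under our own assumptions (function name `gln_midpoint`, chains passed as arrays of C$\alpha$ coordinates of shape (N, 3)); it is not the authors' code and it does not implement the exact Banchoff evaluation.

```python
import numpy as np

def gln_midpoint(chain1, chain2):
    """Discrete Gauss double sum of Eq. (2) for two polygonal chains.

    chain1, chain2: array-likes of shape (N1, 3) and (N2, 3) holding
    consecutive point positions (e.g. C-alpha atoms). To treat a chain
    as closed, repeat its first point at the end.
    """
    a = np.asarray(chain1, dtype=float)
    b = np.asarray(chain2, dtype=float)
    d1 = a[1:] - a[:-1]                  # segment vectors dR_i^(1)
    d2 = b[1:] - b[:-1]                  # segment vectors dR_j^(2)
    m1 = 0.5 * (a[1:] + a[:-1])          # segment midpoints R_i^(1)
    m2 = 0.5 * (b[1:] + b[:-1])          # segment midpoints R_j^(2)
    diff = m1[:, None, :] - m2[None, :, :]             # R_i - R_j
    dist3 = np.linalg.norm(diff, axis=2) ** 3          # |R_i - R_j|^3
    cross = np.cross(d1[:, None, :], d2[None, :, :])   # dR_i x dR_j
    g = np.einsum("ijk,ijk->ij", diff, cross) / dist3  # G(i, j)
    return g.sum() / (4.0 * np.pi)

# Sanity check on two closed rings forming a Hopf link.
t = np.linspace(0.0, 2.0 * np.pi, 201)   # first point repeated at the end
ring1 = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
ring2 = np.column_stack([1.0 + np.cos(t), np.zeros_like(t), np.sin(t)])
print(abs(gln_midpoint(ring1, ring2)))   # expected to be close to 1
```

For closed, well-separated curves the result should approach an integer as the discretization is refined; for open chains such as a loop and a tail it is a real number. The midpoint rule loses accuracy when two segments come very close to each other, which is one reason to prefer an exact segment-pair evaluation in practice.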

Let us denote

$$\begin{aligned} G(i,j):=\frac{\vec {R}_{i}^{(1)}-\vec {R}_j^{(2)}}{|\vec {R}_i^{(1)} -\vec {R}_j^{(2)}|^3}\cdot \left( d\vec {R}_i^{(1)}\times d\vec {R}_j^{(2)}\right) , \end{aligned}$$

(3)

$i\in \{1\ldots N_1-1\},j\in \{1\ldots N_2-1\}$, and consider a pair of a tail of a length $N_1$ and a loop of a length $N_2$. We calculate and then analyze four main values for each pair of a loop and a tail:

whGLN: the value of the Gauss double integral between a loop and the whole tail,
$$\begin{aligned} whGLN = \frac{1}{4\pi }\sum _{i=1}^{N_1-1}\sum _{j=1}^{N_2-1}G(i,j); \end{aligned}$$
(4)
minGLN (maxGLN): minimum (maximum) value of the Gauss double integral between a loop and any fragment of a tail,
$$\begin{aligned} minGLN = \min _{\begin{array}{c} k,l\in \{1\ldots N_1-1\}, \\ k< l \end{array}} \frac{1}{4\pi }\sum _{i=k}^{l}\sum _{j=1}^{N_2-1}G(i,j); \end{aligned}$$
(5)
max|GLN|: the larger of the magnitudes of maxGLN and minGLN, $max|GLN|=\max \{maxGLN,-minGLN\}$.

Additionally, for each triple of a loop and two tails, we consider max2|GLN|, which is the maximum of max|GLN| over both tails.
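The four values can be obtained from the per-segment contributions of the tail. The sketch below assumes a list `contrib` in which `contrib[i]` is $\frac{1}{4\pi }\sum _j G(i,j)$, i.e. the contribution of tail segment $i$ against the whole loop. The function names and the example numbers are hypothetical, and the quadratic scan over fragments is chosen for clarity, not speed.

```python
def gln_summary(contrib):
    """Return whGLN, minGLN, maxGLN and max|GLN| for one loop-tail pair."""
    n = len(contrib)
    if n < 2:
        raise ValueError("need at least two tail segments (k < l in Eq. 5)")
    # prefix[m] is the sum of the first m contributions.
    prefix = [0.0]
    for c in contrib:
        prefix.append(prefix[-1] + c)
    # Every fragment from segment k to segment l inclusive, with k < l.
    sums = [prefix[l + 1] - prefix[k]
            for k in range(n) for l in range(k + 1, n)]
    min_gln, max_gln = min(sums), max(sums)
    return {
        "whGLN": prefix[-1],
        "minGLN": min_gln,
        "maxGLN": max_gln,
        "max|GLN|": max(max_gln, -min_gln),
    }

def max2_gln(summary_tail_1, summary_tail_2):
    """max2|GLN| for a loop with two tails."""
    return max(summary_tail_1["max|GLN|"], summary_tail_2["max|GLN|"])

# Made-up contributions, for illustration only.
example = gln_summary([0.5, -0.2, 0.4, -1.0, 0.3])
print(example)
```

In this made-up example whGLN is approximately 0 although fragments of the tail reach 0.7 and -0.8, which shows why the extreme values over fragments carry information that the whole-tail value can miss.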

Protein dataset

We use a set of 5106 non-redundant proteins with at least one bridge from the LassoProt database¹² (March 2016). By non-redundant we mean that sequence similarity is lower than 35%; the set includes X-ray, NMR and CEM structures, as well as proteins with unresolved parts. We chose only one chain from each protein and identified 13,320 covalent loops in total. This dataset includes 1276 chains with unresolved parts, which were reconstructed with GapRepairer⁴² based on Modeller⁴³. For details see the Supplementary Information file.

The minimal surface method and molecular visualization

The surface is approximated by a discrete triangulation as described in refs.¹¹,¹². To distinguish structures with the same number of piercings but in which the minimal surface spanned on the loop is pierced in different ways, an orientation of the surface spanned on the disulfide loop was introduced. Two piercings may occur if the tail pierces the loop in one direction and then in the opposite direction (the $L_2$ structure), or pierces it twice in the same direction, winding around the loop (the $LS_2$ structure). Additionally, the Pylasso⁴⁴ and PyLink⁴⁵ plugins for PyMOL were used to facilitate the analysis and to produce molecular graphics.
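The role of the surface orientation can be sketched with a signed segment-triangle intersection test. The code below uses the generic Möller-Trumbore test as a stand-in; it is not the minimal surface code of refs.¹¹,¹², and the names `signed_crossing` and `piercings`, the single-triangle "surface" and the two paths are assumptions made for illustration. Triangles are assumed to be consistently oriented.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def signed_crossing(p, q, tri, eps=1e-12):
    """+1 or -1 if segment p->q crosses the triangle along or against its normal, else 0."""
    a, b, c = tri
    d = sub(q, p)
    e1, e2 = sub(b, a), sub(c, a)
    h = cross(d, e2)
    det = dot(e1, h)
    if abs(det) < eps:            # segment parallel to the triangle plane
        return 0
    s = sub(p, a)
    u = dot(s, h) / det
    if u < 0 or u > 1:
        return 0
    w = cross(s, e1)
    v = dot(d, w) / det
    if v < 0 or u + v > 1:
        return 0
    t = dot(e2, w) / det          # position of the hit along the segment
    if t < 0 or t > 1:
        return 0
    return 1 if dot(d, cross(e1, e2)) > 0 else -1

def piercings(tail, triangles):
    """Signs of all crossings, in order along the tail.

    Hits exactly on an edge shared by two triangles would be counted twice;
    a production implementation needs to handle that case.
    """
    signs = []
    for p, q in zip(tail, tail[1:]):
        for tri in triangles:
            sign = signed_crossing(p, q, tri)
            if sign:
                signs.append(sign)
    return signs

surface = [((0, 0, 0), (1, 0, 0), (0, 1, 0))]          # one triangle, normal +z
there_and_back = [(0.2, 0.2, -1), (0.2, 0.2, 1), (0.3, 0.3, 1), (0.3, 0.3, -1)]
winding = [(0.2, 0.2, -1), (0.2, 0.2, 1), (3, 3, 1), (3, 3, -1),
           (0.2, 0.2, -1), (0.2, 0.2, 1)]
print(piercings(there_and_back, surface), piercings(winding, surface))
```

Both paths pierce the surface twice, but the signs differ: opposite signs correspond to the $L_2$ pattern, equal signs to the $LS_2$ pattern.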

Molecular dynamics simulation

The kinetics data were obtained from simulations of a coarse-grained model, carried out using the Gromacs package with SMOG 2 software³⁵ and parameters from ref.⁴⁶. The code for SMOG can be downloaded at http://smog-server.org/smog2.

Random lassos sampling

Phantom lassos (polymers without any interactions or excluded volume) were created by connecting phantom loops and phantom tails. Phantom loops were created as equilateral polygons using a dedicated algorithm⁴⁷, which was tested earlier in ref.⁴⁸.
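A phantom tail can be sketched as a freely jointed chain of unit-length steps with uniformly random directions. This is an assumption made for illustration: the text above does not specify how the tails were generated, and the equilateral closed-polygon sampler of ref.⁴⁷ used for the loops is not reproduced here. The names `random_unit_vector` and `phantom_tail`, the start point and the seed are hypothetical.

```python
import math
import random

def random_unit_vector(rng):
    """Uniformly distributed direction on the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)

def phantom_tail(start, n_steps, rng):
    """Random walk of n_steps unit steps starting at `start` (no excluded volume)."""
    pts = [tuple(start)]
    for _ in range(n_steps):
        step = random_unit_vector(rng)
        last = pts[-1]
        pts.append((last[0] + step[0], last[1] + step[1], last[2] + step[2]))
    return pts

rng = random.Random(0)                           # fixed seed for reproducibility
tail = phantom_tail((1.0, 0.0, 0.0), 50, rng)    # starts at an assumed loop vertex
print(len(tail))
```

Attaching such a tail to a vertex of a sampled loop gives a phantom lasso in the sense used above, under the stated assumption about the tail.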

Data availability

The datasets analysed during the current study are available in the LassoProt database¹².

References

Glickman, M. H. & Ciechanover, A. The ubiquitin-proteasome proteolytic pathway: destruction for the sake of construction. Physiol. Rev. 82, 373–428 (2002).
Virnau, P., Mirny, L. A. & Kardar, M. Intricate knots in proteins: function and evolution. PLoS Comput. Biol. 2, e122 (2006).
Millett, K. C., Rawdon, E. J., Stasiak, A. & Sułkowska, J. I. Identifying knots in proteins. Biochem. Soc. Trans. 41, 533–537 (2013).
Jamroz, M. et al. Knotprot: a database of proteins with knots and slipknots. Nucleic Acids Res. 43, D306–D314 (2015).
King, N. P., Yeates, E. O. & Yeates, T. O. Identification of rare slipknots in proteins and their implications for stability and folding. J. Mol. Biol. 373, 153–166 (2007).
Sułkowska, J. I., Rawdon, E. J., Millett, K. C., Onuchic, J. N. & Stasiak, A. Conservation of complex knotting and slipknotting patterns in proteins. Proc. Natl. Acad. Sci. 109, E1715–E1723 (2012).
Dabrowski-Tumanski, P. & Sulkowska, J. I. Topological knots and links in proteins. Proc. Natl. Acad. Sci. 114, 3415–3420 (2017).
White, J. H. Self-linking and the gauss integral in higher dimensions. Am. J. Math. 91, 693–728 (1969).
Røgen, P. & Fain, B. Automatic classification of protein structure by using gauss integrals. Proc. Natl. Acad. Sci. 100, 119–124 (2003).
Baiesi, M., Orlandini, E., Seno, F. & Trovato, A. Exploring the correlation between the folding rates of proteins and the entanglement of their native states. J. Phys. A Math. Theor. 50, 504001 (2017).
Niemyska, W. et al. Complex lasso: new entangled motifs in proteins. Sci. Rep. 6, 36895 (2016).
Dabrowski-Tumanski, P., Niemyska, W., Pasznik, P. & Sulkowska, J. I. Lassoprot: server to analyze biopolymers with lassos. Nucleic Acids Res. 44, W383–W389 (2016).
Haglund, E. et al. The unique cysteine knot regulates the pleotropic hormone leptin. PLoS ONE 7, e45654 (2012).
Haglund, E. et al. Pierced lasso bundles are a new class of knot-like motifs. PLoS Comput. Biol. 10, e1003613 (2014).
Niewieczerzał, S. & Sulkowska, J. I. Supercoiling in a protein increases its stability. Phys. Rev. Lett. 123, 138102 (2019).
Sułkowska, J. I., Sułkowski, P. & Onuchic, J. Dodging the crisis of folding proteins with knots. Proc. Natl. Acad. Sci. 106, 3119–3124 (2009).
Qin, M., Wang, W. & Thirumalai, D. Protein folding guides disulfide bond formation. Proc. Natl. Acad. Sci. 112, 11241–11246 (2015).
Maksimov, M. O., Pelczer, I. & Link, A. J. Precursor-centric genome-mining approach for lasso peptide discovery. Proc. Natl. Acad. Sci. 109, 15223–15228 (2012).
Tezuka, Y. & Oike, H. Topological polymer chemistry: systematic classification of nonlinear polymer topologies. J. Am. Chem. Soc. 123, 11570–11576 (2001).
Kricheldorf, H. R. Cyclic polymers: synthetic strategies and physical properties. J. Polym. Sci. Part A Polym. Chem. 48, 251–284 (2010).
Tezuka, Y. Topological polymer chemistry designing complex macromolecular graph constructions. Acc. Chem. Res. 50, 2661–2672 (2017).
Tian, W., Lei, X., Kauffman, L. H. & Liang, J. A knot polynomial invariant for analysis of topology of RNA stems and protein disulfide bonds. Mol. Based Math. Biol. 5, 21–30 (2017).
Dabrowski-Tumanski, P. & Sulkowska, J. I. The APS-bracket—a topological tool to classify lasso proteins, RNAs and other tadpole-like structures. React. Funct. Polym. 132, 19–25 (2018).
Banchoff, T. Self linking numbers of space polygons. Indiana Univ. Math. J. 25, 1171–1188 (1976).
Yeates, T. O., Norcross, T. S. & King, N. P. Knotted and topologically complex proteins as models for studying folding and stability. Curr. Opin. Chem. Biol. 11, 595–603 (2007).
Bulaj, G. Formation of disulfide bonds in proteins and peptides. Biotechnol. Adv. 23, 87–92 (2005).
Zimmermann, M., Hegemann, J. D., Xie, X. & Marahiel, M. A. The astexin-1 lasso peptides: biosynthesis, stability, and structural studies. Chem. Biol. 20, 558–569 (2013).
Best, R. B., Hummer, G. & Eaton, W. A. Native contacts determine protein folding mechanisms in atomistic simulations. Proc. Natl. Acad. Sci. 110, 17874–17879 (2013).
Bryngelson, J. D., Onuchic, J. N., Socci, N. D. & Wolynes, P. G. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins Struct. Funct. Bioinform. 21, 167–195 (1995).
Wolynes, P. G., Onuchic, J. N. & Thirumalai, D. Navigating the folding routes. Science 267, 1619 (1995).
Thirumalai, D., O'Brien, E. P., Morrison, G. & Hyeon, C. Theoretical perspectives on protein folding. Annu. Rev. Biophys. 39, 159–183 (2010).
Wolynes, P. G. Recent successes of the energy landscape theory of protein folding and function. Q. Rev. Biophys. 38, 405–410 (2005).
Bölinger, D. et al. A stevedore's protein knot. PLoS Comput. Biol. 6, e1000731–e1000731 (2010).
Dabrowski-Tumanski, P., Jarmolinska, A. & Sulkowska, J. Prediction of the optimal set of contacts to fold the smallest knotted protein. J. Phys. Condens. Matter 27, 354109 (2015).
Noel, J. K., Whitford, P. C., Sanbonmatsu, K. Y. & Onuchic, J. N. Smog@ ctbp: simplified deployment of structure-based models in gromacs. Nucleic Acids Res. 38, W657–W661 (2010).
Noel, J. K., Sułkowska, J. I. & Onuchic, J. N. Slipknotting upon native-like loop formation in a trefoil knot protein. Proc. Natl. Acad. Sci. 107, 15403–15408 (2010).
Gao, M., Zhou, H. & Skolnick, J. Destini: a deep-learning approach to contact-driven protein structure prediction. Sci. Rep. 9, 1–13 (2019).
Dabrowski-Tumanski, P. Knots, lassos and links, topological manifolds in biological objects. Thesis 1–150, (2018).
Grønbæk, C., Hamelryck, T. & Røgen, P. Gisa: Using gauss integrals to identify rare conformations in protein structures. bioRxiv 758029 (2019).
Sulkowska, J. I. et al. Knotgenome: a server to analyze entanglements of chromosomes. Nucleic Acids Res. 46, W17–W24 (2018).
Niewieczerzal, S., Niemyska, W. & Sulkowska, J. I. Defining and detecting links in chromosomes. Sci. Rep. 9, 1–10 (2019).
Jarmolinska, A. I., Kadlof, M., Dabrowski-Tumanski, P. & Sulkowska, J. I. Gaprepairer: a server to model a structural gap and validate it using topological analysis. Bioinformatics 34, 3300–3307 (2018).
Webb, B. & Sali, A. Protein structure modeling with modeller. In Protein Structure Prediction 1–15 (2014).
Gierut, A. M., Niemyska, W., Dabrowski-Tumanski, P., Sułkowski, P. & Sulkowska, J. I. Pylasso: a pymol plugin to identify lassos. Bioinformatics 33, 3819–3821 (2017).
Gierut, A., Dabrowski-Tumanski, P., Niemyska, W., Millett, K. C. & Sulkowska, J. I. Pylink: a pymol plugin to identify links. under review (2018).
Sułkowska, J. I. & Cieplak, M. Selection of optimal variants of gō-like models of proteins through studies of stretching. Biophys. J. 95, 3174–3191 (2008).
Cantarella, J., Duplantier, B., Shonkwiler, C. & Uehara, E. A fast direct sampling algorithm for equilateral closed polygons. J. Phys. A Math. Theor. 49, 275202 (2016).
Dabrowski-Tumanski, P., Gren, B. & Sulkowska, J. I. Statistical properties of lasso-shape polymers and their implications for complex lasso proteins function. Polymers 11, 707 (2019).

Acknowledgements

The authors would like to thank Szymon Niewieczerzal and Bartosz Gren for help with running simulations, and Eleni Panagiotou and Pawel Dabrowski-Tumanski for useful discussions. This work was financed from the budget of the Polish Ministry for Science and Higher Education, Grant [#0003/ID3/2016/64 Ideas Plus] to JIS, and by the University of Warsaw [#501-D313-86-0117000-03] to WN.

Author information

Authors and Affiliations

Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097, Warsaw, Poland
Wanda Niemyska
Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097, Warsaw, Poland
Wanda Niemyska & Joanna I. Sulkowska
Faculty of Chemistry, University of Warsaw, Pasteura 1, 02-093, Warsaw, Poland
Joanna I. Sulkowska
Department of Mathematics, University of California Santa Barbara, Santa Barbara, CA, 93106, USA
Kenneth C. Millett

Authors

Wanda Niemyska
Kenneth C. Millett
Joanna I. Sulkowska

Contributions

J.I.S., K.C.M. and W.N. designed the work; W.N. and J.I.S. performed the work and wrote the paper.

Corresponding author

Correspondence to Joanna I. Sulkowska.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Niemyska, W., Millett, K.C. & Sulkowska, J.I. GLN: a method to reveal unique properties of lasso type topology in proteins. Sci Rep 10, 15186 (2020). https://doi.org/10.1038/s41598-020-71874-2

Received: 17 April 2020
Accepted: 17 August 2020
Published: 16 September 2020
DOI: https://doi.org/10.1038/s41598-020-71874-2

This article is cited by

Mathematical topology and geometry-based classification of tauopathies
- Masumi Sugiyama
- Kenneth S. Kosik
- Eleni Panagiotou
Scientific Reports (2024)
Topological links in predicted protein complex structures reveal limitations of AlphaFold
- Yingnan Hou
- Tengyu Xie
- Jing Huang
Communications Biology (2023)
