Low-resolution description of the conformational space for intrinsically disordered proteins

Intrinsically disordered proteins (IDPs) are central to numerous biological processes and consequently attract intense interest in structural biology. Numerous approaches have been developed for generating sets of IDP conformations that satisfy a given set of experimental measurements. We propose here a systematic enumeration of protein conformations, carried out using the TAiBP approach based on distance geometry. This enumeration was performed on two proteins, Sic1 and pSic1, corresponding to the unphosphorylated and phosphorylated states of an IDP. The relative populations of the obtained conformations were then determined by fitting SAXS curves as well as Ramachandran probability maps, the original finite mixture approach RamaMix being developed for the second task. The similarity between profiles of local gyration radii provides, to a certain extent, a converged view of the Sic1 and pSic1 conformational space. Profiles and populations are thus proposed for describing IDP conformations. The variation of the resulting gyration radius between the phosphorylated and unphosphorylated states depends on the set of enumerated conformations as well as on the method used for obtaining the populations.


Short title
Conformational space of IDPs Sic1 and pSic1

September 5, 2022

Extraction of boxes from Ramachandran likelihood
The likelihood Ramachandran maps calculated by TALOS-N [1] were first normalized so that their values sum to 1, producing probability maps. The φ, ψ boxes were determined from these maps in the following way. Starting from the current state of the Ramachandran map, in which the pixels belonging to previously extracted boxes have been removed, the pixel with the maximum probability value larger than the threshold is selected, and a box is iteratively grown around this position by systematically testing all pixels neighbouring the current box limits. All neighbouring pixels with values larger than the given threshold are included in the box. When no neighbouring value exceeds the threshold, the growth stops, the current box definition is kept for further analyses, and the pixels covered by the box are removed from the Ramachandran map. This procedure is iterated until all remaining map pixels display values smaller than the threshold. In order to probe the reproducibility of TAiBP results, two sets of boxes were determined with threshold values of 0.01 (Figures S1 and S3) and 0.011 (Figures S2 and S4).
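As an illustration, the greedy box-growing procedure described above can be sketched in Python (a hypothetical re-implementation; `prob_map` stands for a normalized Ramachandran probability map and `threshold` for the 0.01/0.011 values; function and variable names are ours):

```python
import numpy as np

def extract_boxes(prob_map, threshold=0.01):
    """Greedy extraction of rectangular boxes from a 2D probability map.

    Sketch: repeatedly seed a box at the highest remaining pixel above
    `threshold`, grow it while any pixel adjacent to the current box
    limits exceeds the threshold, then blank the covered pixels.
    """
    m = prob_map.copy()
    boxes = []
    while m.max() > threshold:
        i, j = np.unravel_index(np.argmax(m), m.shape)
        i0 = i1 = i
        j0 = j1 = j
        grew = True
        while grew:
            grew = False
            # test the rows/columns neighbouring the current box limits
            if i0 > 0 and m[i0 - 1, j0:j1 + 1].max() > threshold:
                i0 -= 1; grew = True
            if i1 < m.shape[0] - 1 and m[i1 + 1, j0:j1 + 1].max() > threshold:
                i1 += 1; grew = True
            if j0 > 0 and m[i0:i1 + 1, j0 - 1].max() > threshold:
                j0 -= 1; grew = True
            if j1 < m.shape[1] - 1 and m[i0:i1 + 1, j1 + 1].max() > threshold:
                j1 += 1; grew = True
        boxes.append((i0, i1, j0, j1))
        m[i0:i1 + 1, j0:j1 + 1] = 0.0  # remove the box pixels from the map
    return boxes
```

Each returned tuple gives the pixel-index limits of one box; the loop terminates because every iteration removes at least the seed pixel.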
In pSic1, the presence of phosphorylated threonines 7, 35 and 47 and of phosphorylated serines 71, 78 and 82 makes TALOS-N predictions impossible for residues 5-9, 33-37, 45-49, 69-73 and 76-84, due to the lack of phosphorylated proteins in the training set of the neural network. For these residues, generic boxes (Table S2) were therefore used as input to TAiBP, in order to cover the Ramachandran regions corresponding to α-helix, extended, β-strand and loop structures.

Enumeration of conformations
The enumeration of protein conformations was performed using boxes of backbone angles φ and ψ. These boxes (Figures S1-S4) were extracted from the likelihood Ramachandran maps obtained by TALOS-N [1], as described in the previous section. During the tree building, each atomic position is determined by trilateration from previously determined atomic positions, following a specific ordering (Table S4) [2]. More precisely, two of the three distances involved in the trilateration must be known exactly, while the third may be subject to uncertainty and represented by an interval [2,3]. The iBP algorithm used here is the one described by Worley et al. [4,3].
The backbone dihedral angles φ and ψ can be straightforwardly related, through bond lengths and bond angles, to the distances between the C atoms of residues i − 1 and i and between the N atoms of residues i and i + 1, respectively. This equivalence between backbone dihedral angles and inter-atomic distances makes it possible to use the angles φ and ψ for the so-called branching step, which is performed by discretizing the intervals in order to generate new branches in the tree.
The bond lengths, bond angles, improper angles and van der Waals radii were taken from the force field protein-allhdg5-4 PARALLHDG (version 5.3) [5,6]. The van der Waals radii were scaled by a factor of 0.7.
For each fragment, two dummy residues were added at the N- and C-terminal extremities, and the φ and ψ dihedral angles of the inner peptide residues were sampled according to the box limits (Table S1). In order to avoid pruning due to slight discrepancies between distances, a tolerance of 0.05 Å was added to the bounds of the distance intervals. The maximum number of branches per discretized interval was set to 4. The minimum discretization factor, i.e. the minimum ratio between the width of each distance interval and the number of tree branches generated within the interval, was set to 0.1 Å, in order to prevent the branching from oversampling small intervals. A maximum number of 10^9 saved conformations was permitted for each iBP run. The solutions were stored in multiframe DCD format [7].
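The interplay between the maximum branch number and the minimum discretization factor can be sketched as follows (a hypothetical helper; the name `branches_per_interval` and the exact rounding rule are assumptions):

```python
def branches_per_interval(width, max_branches=4, min_step=0.1):
    """Number of branches generated when discretizing a distance interval.

    Sketch under the stated parameters: at most `max_branches` branches,
    and the per-branch step (interval width / branch count) is not allowed
    to fall below `min_step` (0.1 Angstrom), so that small intervals are
    not oversampled.
    """
    return min(max_branches, max(1, int(width / min_step)))
```

A wide interval is thus discretized into the full 4 branches, while an interval narrower than 0.1 Å yields a single branch.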

Clustering of generated conformations
The Self-Organizing Map (SOM) approach [8,9,10,11], used to cluster conformations, is an artificial neural network (ANN) trained by unsupervised learning. Compared with k-means clustering, SOM displays the advantage of not requiring prior knowledge of the number of clusters. The SOM approach was applied after each iBP calculation or assembly step as soon as the number of saved conformations was larger than 100. The conformations are encoded from the distances d_{ij} calculated between the n Cα atoms by diagonalizing the covariance matrix C:

C_{st} = (1/n) Σ_{p=1}^{n} (d_{s,p} − d̄_s)(d_{t,p} − d̄_t),  where d̄_s = (1/n) Σ_{p=1}^{n} d_{s,p}.

The information contained in the matrix C is summarized by its four largest eigenvalues along with the corresponding eigenvectors, formatted as an input vector of length 4(n+1). These vectors are used to train a periodic Euclidean 2D self-organizing map, which corresponds to a three-dimensional matrix. The first two matrix dimensions were chosen as 100 × 100 and define the map size, the third dimension being equal to 4(n+1). Each vector along the third dimension defines a neuron of the map.
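The encoding step can be sketched as follows (an assumed helper; `np.cov` with `bias=True` implements the 1/n covariance, and the output length is 4(n+1)):

```python
import numpy as np

def encode_conformation(coords):
    """Encode one conformation from its C-alpha coordinates (n x 3 array).

    Sketch of the descriptor described in the text: build the n x n
    C-alpha/C-alpha distance matrix, diagonalize its covariance (1/n
    normalization, rows as variables), and concatenate the 4 largest
    eigenvalues with their eigenvectors, giving a vector of length 4(n+1).
    """
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    c = np.cov(d, bias=True)           # n x n covariance of distance rows
    w, v = np.linalg.eigh(c)           # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:4]      # indices of the 4 largest
    parts = [np.concatenate(([w[k]], v[:, k])) for k in idx]
    return np.concatenate(parts)       # length 4 * (n + 1)
```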
The neurons of the self-organizing map are initialized with a random uniform distribution covering the range of values of the input vectors. At each step, an input vector is presented to the map, and the neurons closest to this input are updated. The training parameters were those previously described [12,11].
Once the SOM has been determined, representative conformations are extracted from the conventional unified distance matrix (U-matrix) calculated from the final SOM neurons. For each neuron ν, the corresponding U-matrix element is calculated as the average Euclidean distance between the neuron ν and its eight immediate neighbours:

U(ν) = (1/|N(ν)|) Σ_{µ ∈ N(ν)} d(ν, µ),

where N(ν) is the set of neighbours of ν and d(ν, µ) is the Euclidean distance between the neurons µ and ν. The neurons corresponding to local minima of the U-matrix, and thus to local maxima of conformational homogeneity, are extracted and, for all performed runs except pSic1_3, the protein conformation displaying the closest distance to each neuron is saved.
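A minimal sketch of the U-matrix computation on a periodic map (the function name and array layout are assumptions; `som` is an (H, W, K) array of neuron vectors, with neighbour indices wrapping around as on the periodic 2D SOM described above):

```python
import numpy as np

def u_matrix(som):
    """Average Euclidean distance from each neuron to its 8 neighbours."""
    h, w, _ = som.shape
    u = np.zeros((h, w))
    shifts = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
              if (di, dj) != (0, 0)]
    for i in range(h):
        for j in range(w):
            # periodic boundaries: neighbour indices wrap modulo the map size
            dists = [np.linalg.norm(som[i, j] - som[(i + di) % h, (j + dj) % w])
                     for di, dj in shifts]
            u[i, j] = np.mean(dists)
    return u
```

Local minima of the returned array then mark the most homogeneous regions of the map.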
In the case of pSic1_3, among the conformations assigned to each neuron, the one displaying the largest distance between the Cα atoms of the first and last residues is saved, in order to obtain more extended conformations in agreement with the gyration radii measured in Ref. [13]. The conformations generated during the iBP or assembly steps are finally replaced by the sets of representative conformations extracted from the local minima of the U-matrix.

Molecular dynamics refinement in implicit solvent
Molecular dynamics (MD) trajectories were used to relax the Sic1 and pSic1 conformations obtained from the TAiBP approach. The trajectories were recorded using NAMD 2.13 [14]. Topology and parameters were taken from the force fields c36 [15] and c36m [16]. The simulations were performed at a temperature of 300 K. A Generalized Born implicit solvent (GBIS) [17] model was used with an ion concentration of 0.3 M and a cutoff of 12 Å for calculating the Born radii. A cutoff of 14 Å and a switching distance of 13 Å were defined for the non-bonded interactions. The RATTLE algorithm [18,19] was used to keep all covalent bonds involving hydrogens rigid, enabling a time step of 2 fs. Temperature was regulated with a Langevin thermostat [20]. At the beginning of each trajectory, the system was first minimized for 1,000 steps, then heated gradually from 0 K to 300 K over 30,000 integration steps. The system was then equilibrated for 5,000 steps. During all steps, from minimization to production, positional restraints were applied to the protein backbone atoms with a force constant of 1 kcal/mol. A production run of 100 ps was then performed and the final frame was saved as the relaxed conformation.

Determination of the populations from the Ramachandran maps
Using the neural network TALOS-N [1], it is possible, starting from the NMR chemical shifts measured on protein atoms, to determine for each residue n a 2D probability density p^n_TALOS-N(φ, ψ). As NMR analyzes a sample containing a mixture of conformations, we propose to decompose the probability map produced by TALOS-N as a mixture of probability densities p^n_q(φ, ψ) corresponding to a certain number Q of free-energy basins present in the experimental sample:

p^n_TALOS-N(φ, ψ) ≈ Σ_{q=1}^{Q} γ_q p^n_q(φ, ψ),   (3)

where γ_q ≥ 0 is the proportion of the local basin q in the NMR sample. Thus:

Σ_{q=1}^{Q} γ_q = 1.   (4)

In the following, the problem described by Eq. 3 will be referred to as a Q-class mixture problem, each class corresponding to a conformation of the studied protein. In addition, for each conformation q, the couple of angles corresponding to the bottom of the basin will be called its location parameters.
To fit the set of N available TALOS-N probability densities using the mixture model (3), we have to rely on a discrepancy measure between probability maps. The Kullback-Leibler divergence is a standard choice:

D_KL(p^n_TALOS-N ‖ p^n) = ∫∫ p^n_TALOS-N(φ, ψ) log [ p^n_TALOS-N(φ, ψ) / p^n(φ, ψ) ] dφ dψ.   (5)

Here, we consider the sum of the discrepancy measures over the N residues between the TALOS-N probability densities and the corresponding mixture model densities p^n = Σ_{q=1}^{Q} γ_q p^n_q:

D(γ) = Σ_{n=1}^{N} D_KL(p^n_TALOS-N ‖ p^n).   (6)

In practice, TALOS-N probability densities are available on a finite rectangular grid {φ_1, ..., φ_I} × {ψ_1, ..., ψ_J}. Let us modify (6) using a zeroth-order approximation of the integrals:

D(γ) ≈ Σ_{n=1}^{N} Σ_{i=1}^{I} Σ_{j=1}^{J} p̄^n_ij log ( p̄^n_ij / p^n_ij ),   (7)

where p̄^n_ij = p^n_TALOS-N(φ_i, ψ_j) and

p^n_ij = Σ_{q=1}^{Q} γ_q p^n_q(φ_i, ψ_j).   (8)

Finally, the minimization of Eq. 7 with respect to γ = (γ_1, ..., γ_Q) amounts to the maximization of

f(γ) = Σ_{n=1}^{N} Σ_{i=1}^{I} Σ_{j=1}^{J} p̄^n_ij log p^n_ij,   (9)

p^n_ij being given by (8), under the constraints γ_q ≥ 0 and Σ_{q=1}^{Q} γ_q = 1. Let us remark that f is a concave function of γ, so that its local maximization with respect to γ cannot be trapped in a spurious local maximum.
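Because the objective is concave in γ, even a simple fixed-point (EM-type) update on the weights reaches the global optimum. A sketch, assuming normalized maps `pbar` of shape (N, I, J), class maps `pq` of shape (Q, N, I, J), and names of our own choosing:

```python
import numpy as np

def fit_weights(pbar, pq, n_iter=200):
    """Estimate mixture weights gamma by minimizing the discretized
    Kullback-Leibler divergence between target maps and the mixture.

    The map values act as virtual observation counts, so the classical
    EM update for mixture weights applies; concavity in gamma guarantees
    convergence to the global optimum.
    """
    q = pq.shape[0]
    gamma = np.full(q, 1.0 / q)
    for _ in range(n_iter):
        mix = np.einsum('q,qnij->nij', gamma, pq)       # mixture maps
        mix = np.maximum(mix, 1e-300)                   # guard against 0/0
        resp = gamma[:, None, None, None] * pq / mix    # responsibilities
        gamma = np.einsum('nij,qnij->q', pbar, resp)
        gamma /= gamma.sum()
    return gamma
```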
If TALOS-N yielded observations in the form of angle couples y^n_m = (φ^n_m, ψ^n_m), m = 1, ..., M, instead of a probability map, we would naturally maximize the log-likelihood of the data [21]:

ℓ = Σ_{n=1}^{N} Σ_{m=1}^{M} log Σ_{q=1}^{Q} γ_q p^n_q(y^n_m),   (10)

a well-known local optimization scheme to reach this goal being the EM algorithm [22,23].
Let us remark the similarity between Eqs. (9) and (10). In fact, Eq. (9) identifies with the log-likelihood of virtual data in which each couple (φ_i, ψ_j) is observed D^n_ij = p̄^n_ij × C times for the nth residue (C being an arbitrary constant). This corresponds to the well-known correspondence between the log-likelihood and the Kullback-Leibler divergence between the empirical data distribution and the parametrized one (see for instance [23]).
In the case where data points correspond to couples of angles, we must consider that the support of the densities p_q is a torus, i.e., that they are doubly circular. The most natural circular extension of the univariate Gaussian is the wrapped normal distribution. However, the von Mises distribution is usually considered a better option, being more easily tractable [24]. Moreover, multivariate extensions exist for the latter. In particular, a bivariate version was introduced in Ref. [25], motivated by problems of modelling torsional angles in molecules, and a pseudo-maximum-likelihood method was proposed [24] to estimate its parameters. A so-called cosine version was also investigated [26], and an Expectation-Maximization (EM) algorithm was used [26,27] to solve a problem that is almost identical to ours.
Here, we adopt the same bivariate periodic sine model as Ref. [25]:

p(φ, ψ; ζ^L, ζ^S) = T(κ_1, κ_2, λ)^{-1} exp[ g(φ, ψ) ],   (11)

with

g(φ, ψ) = κ_1 cos(φ − φ_0) + κ_2 cos(ψ − ψ_0) + λ sin(φ − φ_0) sin(ψ − ψ_0),   (12)

and κ_1, κ_2 ≥ 0 and λ² < κ_1 κ_2. A difficulty is that the integration constant is expressed as an infinite series, depending on the parameters (κ_1, κ_2, λ):

T(κ_1, κ_2, λ) = 4π² Σ_{m=0}^{∞} binom(2m, m) [ λ² / (4 κ_1 κ_2) ]^m I_m(κ_1) I_m(κ_2),   (13)

where I_m denotes the modified Bessel function of the first kind of order m. In Ref. [25], expressions of κ_1, κ_2, λ are given as functions of the parameters (σ²_1, σ²_2, ρ) of a bivariate Gaussian:

κ_1 = 1 / [σ²_1 (1 − ρ²)],  κ_2 = 1 / [σ²_2 (1 − ρ²)],  λ = ρ / [σ_1 σ_2 (1 − ρ²)],   (14)

where ρ ∈ (−1, 1) denotes the normalized correlation coefficient between the two components of the bivariate Gaussian. These expressions are valid only in the case where σ²_1 and σ²_2 are small. They are easily inverted as

σ²_1 = κ_2 / (κ_1 κ_2 − λ²),  σ²_2 = κ_1 / (κ_1 κ_2 − λ²),  ρ = λ / √(κ_1 κ_2).   (15)

Using (15), we can replace a Gaussian mode p^n_q by a periodized version with approximately the same location and the same spread. This is not specific to the Gaussian case, so it also holds for the bivariate von Mises-type model.
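The density and its series normalization constant can be evaluated numerically. A sketch using `scipy.special.iv`, with the series truncated at `m_max` (an assumed tolerance parameter):

```python
import math
from scipy.special import iv  # modified Bessel function I_m

def sine_density(phi, psi, phi0, psi0, k1, k2, lam, m_max=50):
    """Bivariate von Mises sine-model density (a sketch after Ref. [25]).

    The normalizing constant is the infinite series
    T = 4*pi^2 * sum_m C(2m, m) * (lam^2 / (4 k1 k2))^m * I_m(k1) * I_m(k2),
    truncated at `m_max`; lam^2 < k1*k2 is assumed, so the series
    converges quickly.
    """
    t = 4.0 * math.pi ** 2 * sum(
        math.comb(2 * m, m) * (lam ** 2 / (4.0 * k1 * k2)) ** m
        * iv(m, k1) * iv(m, k2)
        for m in range(m_max))
    e = math.exp(k1 * math.cos(phi - phi0) + k2 * math.cos(psi - psi0)
                 + lam * math.sin(phi - phi0) * math.sin(psi - psi0))
    return e / t
```

For lam = 0, the series reduces to its first term and the density factorizes into a product of two univariate von Mises densities.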
In the following section, "Maximum likelihood estimation for bivariate sine mixtures", we derive the equations describing an original approach for solving problem (3) by maximum likelihood.

Maximum likelihood estimation for bivariate sine mixtures
Let Y = (y_1, ..., y_D) stand for D iid data points. We assume that each y_d is sampled from a Q-class mixture model, and we use the notation C_d to refer to the random class attached to y_d, taking values in (1, ..., Q). For each d, we have

p(y_d; θ) = Σ_{q=1}^{Q} γ_q p_q(y_d; ζ_q),   (16)

with unknown parameters θ = (γ, ζ) = (γ, ζ^L, ζ^S), including:
• normalized weights γ = (γ_q),
• location parameters ζ^L = (ζ^L_q), where ζ^L_q = (φ_q, ψ_q) is specific to class q,
• shape parameters ζ^S = (ζ^S_q), where ζ^S_q = (κ_1q, κ_2q, λ_q).

We would like to estimate θ according to the maximum likelihood principle: θ̂ = argmax_θ p(Y; θ). Equivalently, θ̂ maximizes the log-likelihood, which reads

L(θ) = Σ_{d=1}^{D} log Σ_{q=1}^{Q} γ_q p_q(y_d; ζ_q).

In the following, we first describe what an Expectation-Maximization (EM) algorithm adapted to this maximum likelihood problem would be, and then remark that the maximization step of the EM cannot be solved analytically. We therefore turn to a solution based on a well-grounded gradient-based optimization scheme, and derive explicit expressions for the gradient terms on the basis of the expectation step of the EM. At the end of this section, we show how to include into the optimization scheme several Ramachandran probability maps corresponding to several protein residues.

Expectation-Maximization (EM) algorithm
The EM algorithm is a reference solution to determine θ̂ by iterative local optimization.
Each EM iteration consists in solving the following auxiliary problem:

θ^new = argmax_θ Q(θ; θ^old),   (17)

where Q is the expectation of the log-likelihood of the "complete" dataset:

Q(θ; θ^old) = E[ log p(Y, C; θ) | Y; θ^old ],   (18)

with C = (C_1, ..., C_D). On the one hand, along classical derivations, we get

Q(θ; θ^old) = Σ_{q=1}^{Q} Σ_{d=1}^{D} P_qd log( γ_q p_q(y_d; ζ_q) ),   (19)

where P_qd = P_q(y_d), with

P_q(y) = Pr(C = q | y; θ^old) = γ^old_q p_q(y; ζ^old_q) / Σ_r γ^old_r p_r(y; ζ^old_r).   (20)
On the other hand,

Q(θ; θ^old) = Σ_{q=1}^{Q} Σ_{d=1}^{D} P_qd log γ_q + Σ_{q=1}^{Q} Σ_{d=1}^{D} P_qd log p_q(y_d; ζ_q),   (21)

so that the optimization problem (17) splits in two parts at each iteration. The first subproblem,

γ^new = argmax_{γ : Σ_q γ_q = 1} Σ_{q=1}^{Q} Σ_{d=1}^{D} P_qd log γ_q,   (22)

has a simple, explicit solution:

γ^new_q = (1/D) Σ_{d=1}^{D} P_qd.   (23)

Unfortunately, the second subproblem,

ζ^new = argmax_ζ Σ_{q=1}^{Q} Σ_{d=1}^{D} P_qd log p_q(y_d; ζ_q),   (24)

cannot be solved analytically for the sine model, neither for the shape parameters ζ^S nor for the location parameters ζ^L. As a consequence, exact closed-form EM formulas do not exist for the sine model. To the best of our knowledge, the same holds for other von Mises-type models, such as the cosine version of [26]; we surmise that the EM algorithm used therein solves the maximization step in an approximate way. We rather propose a different solution, relying on a well-grounded gradient-based optimization scheme (namely, the L-BFGS-B algorithm [28]) applied to the log-likelihood itself.
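The E step and the closed-form weight update can be sketched as follows (an assumed helper; the component densities p_q(y_d; ζ^old) are supposed to have already been evaluated into a (Q, D) array):

```python
import numpy as np

def em_weight_step(gamma, densities):
    """One E step plus the closed-form M step for the weights (sketch).

    `densities[q, d]` holds p_q(y_d; zeta_old). The E step computes the
    responsibilities P_qd = Pr(C_d = q | y_d; theta_old); the M step for
    the weights is gamma_q = (1/D) * sum_d P_qd.
    """
    joint = gamma[:, None] * densities     # gamma_q * p_q(y_d)
    resp = joint / joint.sum(axis=0)       # responsibilities P_qd
    return resp, resp.mean(axis=1)         # (P_qd, updated weights)
```

The updated weights automatically satisfy the sum-to-one constraint, since each column of the responsibility matrix sums to 1.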

Gradient-based log-likelihood maximization
Fisher's identity [29] relates the gradient of Q to the gradient of the log-likelihood L:

∇_θ L(θ^old) = ∇_θ Q(θ; θ^old) |_{θ = θ^old}.   (27)

This property is very useful when the M step is not closed-form, since it allows one to replace the non-explicit EM iterations by explicit gradient-based iterations, directly applicable to the log-likelihood.
Partial derivative w.r.t. the weights γ. Given Eqs (19), (22) and (27), we have

∂L/∂γ_q = Σ_{d=1}^{D} P_qd / γ_q.   (28)

Optimization w.r.t. the weights must be conducted under the constraints of nonnegativity and sum-to-one. The latter can easily be handled using the simple reparameterization γ_q = γ̃_q / Σ_r γ̃_r. It is easy to establish that

∂L/∂γ̃_q = (1 / Σ_r γ̃_r) ( Σ_{d=1}^{D} P_qd / γ_q − D ).

Partial derivative w.r.t. the shape parameters ζ^S. Given Eqs (19), (24) and (27), we have

∂L/∂ζ^S_q = Σ_{d=1}^{D} P_qd ∂/∂ζ^S_q log p(y_d; ζ^L_q, ζ^S_q),

where p(y_d; ζ^L_q, ζ^S_q) is a sine density defined by (11). Explicit expressions for the partial derivatives with respect to each shape parameter κ_1q, κ_2q and λ_q follow by differentiating the exponent (12) and the normalizing constant T where, given the expression of T (Eq. (13)) and the identity I′_m(u) = (m/u) I_m(u) + I_{m+1}(u) (see [30]), the derivatives of T are themselves expressed as infinite series. Optimization must be performed under the nonlinear inequality constraint λ² < κ_1 κ_2. A simpler alternative consists in replacing λ by ρ = λ/√(κ_1 κ_2) in the parameterization, so that the constraint becomes ρ ∈ (−1, 1). We then need to replace Eq. (12) by

g(φ, ψ) = κ_1 cos(φ − φ_q) + κ_2 cos(ψ − ψ_q) + ρ √(κ_1 κ_2) sin(φ − φ_q) sin(ψ − ψ_q),

and to modify the shape-parameter derivatives accordingly.

Partial derivative w.r.t. the location parameters ζ^L. Given Eqs (19), (24) and (27), we have

∂L/∂ζ^L_q = Σ_{d=1}^{D} P_qd ∂/∂ζ^L_q log p(y_d; ζ^L_q, ζ^S_q),

the derivatives of the exponent (12) with respect to φ_q and ψ_q following straightforwardly from those of the trigonometric terms.
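The overall strategy, L-BFGS-B applied to the (negative) log-likelihood with reparameterized weights, can be sketched on a simplified stand-in model: a univariate von Mises mixture instead of the bivariate sine model, with numerical rather than analytic gradients. All names below are ours:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import i0  # modified Bessel function I_0

def fit_vonmises_mixture(y, q=2):
    """Gradient-based mixture fit with L-BFGS-B (simplified sketch).

    The parameter vector stacks unnormalized weights g_q (normalized
    inside the objective, mirroring gamma_q = g_q / sum_r g_r),
    locations mu_q and concentrations kappa_q.
    """
    def nll(theta):
        g, mu, kappa = np.split(theta, 3)
        gamma = g / g.sum()
        # component densities, shape (q, len(y))
        dens = np.exp(kappa[:, None] * np.cos(y[None, :] - mu[:, None]))
        dens /= 2.0 * np.pi * i0(kappa)[:, None]
        return -np.sum(np.log(gamma @ dens))

    theta0 = np.concatenate([np.ones(q),
                             np.linspace(-np.pi / 2, np.pi / 2, q),
                             np.ones(q)])
    bounds = ([(1e-6, None)] * q           # unnormalized weights > 0
              + [(-np.pi, np.pi)] * q      # locations
              + [(1e-6, None)] * q)        # concentrations > 0
    res = minimize(nll, theta0, method='L-BFGS-B', bounds=bounds)
    g, mu, kappa = np.split(res.x, 3)
    return g / g.sum(), mu, kappa
```

For the actual sine model, the objective and the analytic gradients derived above would be supplied through the `jac` argument of `minimize`, with ρ replacing λ so that only box constraints remain.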

Case of multiple datasets
In the case where N residues are available, each conformation is characterized by a unique weight vector, whereas its location and shape parameters are specific to each residue. The identification problem then consists in estimating:
• Q normalized weights γ = (γ_q) for the protein conformations (classes),
• 5NQ = 3NQ + 2NQ shape and location parameters specific to each conformation and each residue, respectively ζ^S_qn = (κ_1qn, κ_2qn, λ_qn) and ζ^L_qn = (φ_qn, ψ_qn).
The log-likelihood then reads

L(θ) = Σ_{n=1}^{N} Σ_{d=1}^{D_n} log Σ_{q=1}^{Q} γ_q p(y_dn; ζ^L_qn, ζ^S_qn),

where the nth residue corresponds to D_n observed pairs of angles y_dn.
The gradient component relative to each shape or location parameter can still be calculated using the expressions derived in the previous section, while the gradient component relative to the weight parameters is obtained by summing Eq. (28) over all residues.
The scheme developed here was used to calculate the relative weights of the conformations by fitting the probability Ramachandran maps obtained using TALOS-N [1].

Figures
Figure S1: Boxes for backbone angles used as inputs for the run Sic1_1, obtained from the Ramachandran maps using a threshold of 0.01. The boxes and the corresponding sequence are colored according to the considered residue.

Figure S2: Boxes for backbone angles used as inputs for the run Sic1_2, obtained from the Ramachandran maps using a threshold of 0.011. The boxes and the corresponding sequence are colored according to the considered residue.

Figure S3: Boxes for backbone angles used as inputs for the run pSic1_1, obtained from the Ramachandran maps using a threshold of 0.01. The boxes and the corresponding sequence are colored according to the considered residue. The pT and pS residues are marked as X and Z in the sequences.

Figure S4: Boxes for backbone angles used as inputs for the run pSic1_2, using a threshold of 0.011. The boxes and the corresponding sequence are colored according to the considered residue. The pT and pS residues are marked as X and Z in the sequences.

Figure S6 (3 next pages): Superimposition of the MERA (φ, ψ) distributions obtained on residues of pSic1 with the (φ, ψ) input boxes for TAiBP. The size of the points on the MERA distribution is large for predicted probability values larger than 0.005 and small for the other probability values. The TAiBP input boxes are colored in magenta and green for the duplicated TAiBP runs pSic1_1 and pSic1_2.

Figure S10: Distances between the profiles P_q (Eq. 11 of the main text) for local gyration radii between the conformations selected from different fittings of SAXS curves (BioEn1, BioEn2, BioEn3). The limit of 8 Å used for the superposed plots of profiles (Figure 5 of the main text) is drawn in red on the scale of distance.

Figure S11: Distances between the profiles P_q (Eq. 11 of the main text) for local gyration radii between the conformations selected from fitting of SAXS curves (BioEn1, BioEn2, BioEn3) and of Ramachandran maps (RamaMix). The limit of 8 Å used for the superposed plots of profiles (Figure 5 of the main text) is drawn in red on the scale of distance.

Figure S12: map

Table S4: Atom re-ordering used during the iBP calculation step within the first, the last and the inner residues of the peptide fragment. The order is described by the list of atom names, the signs "-" and "+" describing atoms located in the previous and the next residues of the primary sequence.

... and also pooled with pSic1_1 and pSic1_2 to produce pSic1_13 and pSic1_23. For each SAXS curve and set of protein conformations, after ten runs starting from random values of the populations and performed on the whole set of conformations, all conformations for which the sum of populations over the ten runs was larger than 0.01 were gathered, and a second run of ten additional BioEn calculations was performed on this reduced set of conformations. The average and standard deviation values of the populations obtained for each selected conformation from the second set of BioEn runs are given in the Table.

Table S6: Conformations and populations selected by fitting of the Ramachandran maps using RamaMix. For each set of protein conformations, 100 runs were performed starting from random values of the populations. The few optimizations which did not converge were discarded: 2 for pSic1_13 and for pSic1_23. The backbone angles φ and ψ were allowed to move by up to 15°. The populations of conformations for the converged runs were averaged, and these mean values are given as percentages in the Table, along with the corresponding standard deviation values. The labels of conformations also selected by BioEn are written in bold. Labels larger than 200 correspond to conformations from pSic1_3.