De novo protein conformational sampling using a probabilistic graphical model

Bhattacharya, Debswapna; Cheng, Jianlin

doi:10.1038/srep16332

Published: 06 November 2015

Scientific Reports volume 5, Article number: 16332 (2015)

Abstract

Efficient exploration of protein conformational space remains challenging, especially for large proteins, when assembling discretized structural fragments extracted from a protein structure database. We propose a fragment-free probabilistic graphical model, FUSION, for conformational sampling in continuous space and assess its accuracy using ‘blind’ protein targets with a length of up to 250 residues from the CASP11 structure prediction exercise. The method reduces sampling bottlenecks, exhibits strong convergence and demonstrates better performance than the popular fragment assembly method, ROSETTA, on relatively larger proteins with a length of more than 150 residues in our benchmark set. FUSION is freely available through a web server at http://protein.rnet.missouri.edu/FUSION/.

Introduction

Successfully predicting protein three-dimensional structures of near-experimental accuracy from their amino acid sequence requires efficient navigation of the astronomically large conformational space¹ accessible to proteins. Fragment assembly approaches^2,3,4 enable rapid exploration of conformational space by restricting local conformations (i.e., fragments) to those observed in experimentally-solved structures extracted from the Protein Data Bank (PDB) and assembling the fragments to form complete structures. Such a locally restrained search strategy has proven to be an extremely powerful method to fold small proteins (<100 residues) with reasonable accuracy⁵. For larger proteins, however, the inherent rigidity and incomplete coverage of these discrete fragments often impose kinetic limitations in sampling^6,7, hindering the possibility of accurate de novo protein structure prediction.

Significant progress has been made recently to overcome the limitations of fragment-based methods by performing probabilistic sampling guided by local structural preferences^8,9,10. Although promising and mathematically attractive, these approaches are either based on coarse-grained (i.e., C_α) representations of protein structure^8,9, or they assume ideality in backbone planarity¹⁰ (i.e., ω-angles). A coarse-grained model is of limited use in high-resolution protein structure prediction because of the one-to-many correspondence between C_α traces and the full atomic detail of a protein’s backbone. Also, small but cumulative and often systematic deviations from ideality in backbone planarity exist¹¹, which, if ignored, might lead to minor distortions in the structure. Assuming ideal bond lengths and bond angles, the minimum angular degrees of freedom needed to accurately place the backbone Cartesian coordinates (x, y, z of three atoms: N, C_α and C) of a residue are three dihedral angles (ϕ, ψ, ω). This is the granularity used by typical fragment assembly methods, such as ROSETTA²; however, fragment-free de novo sampling at this grain has not yet been demonstrated, to our knowledge.

Here, we propose an Input-Output Hidden Markov Model¹² (IOHMM) to capture the preferences of the dihedral angles associated with protein backbone (ϕ, ψ, ω) given its sequence as shown in Fig. 1a. An IOHMM is based on a non-homogeneous Markov chain, where emission and transition probabilities depend on the input. IOHMM is, therefore, an appropriate choice for protein structure prediction, where the goal is to sample protein conformation given its sequence (i.e., input). The proposed model, FUSION, captures local relationships between protein sequence and structural features and allows for probabilistic sampling of conformational space of the protein backbone in full-atomic detail (i.e., at the same granularity as fragment assembly) from a continuous space different from the discrete space of fragment assembly.

Results

In this section, we first briefly describe the architecture of FUSION, then describe sampling strategies and finally present an evaluation of its performance from various perspectives.

Architecture of FUSION

FUSION captures sequential dependencies between protein sequence (input) and structural space (output) through a Markov chain of hidden states. In each slice, as presented in Fig. 1b, an input node (A) captures the protein’s sequence space. Connections between the input nodes represent the transition probabilities between residues along the protein chain. Output (i.e., emission) nodes correspond to structural space, modeled using secondary structure (S), dihedral angle pair (D: ϕ, ψ) and peptide bond conformation (P: ω). The hidden node (H) is a discrete node that can adopt 30 states (the optimal number; see Methods), where each of these states specifies which mixture component is chosen among the possible emission distributions. The optimal number of hidden states and all other associated parameters were determined by training the model using a maximum likelihood method on a large set of representative experimentally-solved protein structures.

Generating protein conformation

In the trained model, each hidden node value is associated with preferences of secondary structure types and backbone geometry, conditioned on sequence. This provides a convenient way of generating protein structural features compatible with its sequence. For a given protein sequence, a corresponding hidden node sequence can be sampled from one end to the other through plausible paths in the transition matrices of input and hidden nodes. After obtaining a particular sequence of hidden node values, emission values for the output nodes are drawn from the corresponding conditional probability distributions. It is also possible to seamlessly resample random-length segments of the protein using the forward-backtrack algorithm¹³. Furthermore, the inclusion of secondary structure information into the model allows for sampling the conformational space associated with the amino acid sequence and, optionally, secondary structure when the latter is available (e.g., predicted from the amino acid sequence). The sampled dihedral angles (ϕ, ψ, ω), conditioned on sequence-based observations, can then be readily converted into Cartesian coordinates, giving rise to a protein backbone in full-atomic detail. Repeated resampling of random stretches of dihedral angles in FUSION mimics fragment replacement in fragment assembly methods, but in a probabilistic way, which reduces intrinsic sampling bottlenecks imposed by a discretized fragment library (e.g., boundary effects²).

The probabilistic nature of FUSION facilitates its effective integration as a proposal distribution in Markov Chain Monte Carlo (MCMC) simulations, under the control of an empirical force field. We used the classic Metropolis-Hastings MCMC approach¹⁴: random stretches (segments of 3 to 15 residues) of the current candidate structure, x, having a dihedral angle sequence d, are resampled to propose a new sequence of dihedral angles d′, resulting in the next candidate structure, x′, and the move is accepted or rejected using the standard Metropolis-Hastings acceptance criterion. Simulations were carried out using the low-resolution scoring function of ROSETTA¹⁵, together with ambiguous sequence-derived predicted information. FUSION’s model-based conditional sampling approach removes a major bottleneck of using fragment assembly as a proposal distribution: fragment assembly implicitly introduces a system-specific bias into the force field, which is difficult to quantify¹⁶. With fragment assembly, it is therefore generally impossible to satisfy the condition of detailed balance¹⁴, which is a fundamental prerequisite to ensure that simulations sample the Boltzmann distribution of the applied force field.

Blind assessment of FUSION

We blindly tested the generality and accuracy of FUSION using 42 protein targets with a sequence length less than 250 residues that were simultaneously under investigation in X-ray crystallography or NMR spectroscopy laboratories during the 11th community-wide experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP11). A reduced representation of protein structure was adopted that used the backbone atoms and a side-chain centroid to generate up to 10,000 low-resolution models for each protein sequence within a limited prediction window of three days. The reduced models were then expanded by adding side chains using the smoothed backbone-dependent rotamer library^17,18 to produce all-atom decoys.

Angular preferences

To investigate whether FUSION captures the angular preference of dihedral angles (ϕ, ψ, ω) observed in proteins, we ranked the all-atom decoy population for each target using DFIRE¹⁹ statistical potential and compared the joint histograms of (ϕ, ψ), (ϕ, ω), (ψ, ω) angles from the lowest scoring decoy as well as the lowest C_α-rmsd (root mean square deviation of alpha-carbon coordinates after optimal structural superposition) decoy in the set of the top five low-scoring decoys (i.e., best of five) with that of their experimental structures. As shown in Fig. 2, the distribution of (ϕ, ψ) angles was in close agreement with the observations in both cases and covered the entire allowed space of the Ramachandran plot²⁰. The distributions of (ϕ, ω) and (ψ, ω), despite correctly capturing the major peaks, revealed noticeable deviations in ω angles compared to their experimental counterparts. However, these apparent outliers might be due to the restraints imposed by the use of ideal bond lengths and angles during the simulations, to some degree.

Secondary structure propensity

In addition to capturing the dihedral angle distribution, FUSION decoys revealed excellent similarity in overall secondary structure content compared to the experimental structures. In Table 1, we present the secondary structure content of the experimental structures and FUSION decoys. Over the entire benchmark set, having ~34% helix (α-helix, 3₁₀-helix and π-helix) and ~30% β-strand (extended strand and isolated β-bridge), both lowest scoring and the best of five decoys contained ~32% helix and ~23% β-strand. It should be noted, however, that formation of β-strand residues requires specific nonlocal interaction (i.e., hydrogen bonding), which is beyond the scope of a Markovian model like FUSION and was primarily achieved by the scoring function.

Table 1 Secondary structure contents in the experimental structures and FUSION decoys.

Nature of sampled energy landscape

To study the energy landscape encountered during FUSION simulations, we examined the relationship between DFIRE energy score and C_α-rmsd of decoy populations. In Fig. 3, we show the two-dimensional distribution of conformations as a function of the DFIRE energy score (y axis) and the C_α-rmsd to the native state (x axis) for a diverse set of targets with different topologies and sequence lengths. Strong convergence was observed in several cases, as indicated by a distinct funnel-shaped energy landscape. FUSION produced convergent sampling across a broad spectrum of target lengths, ranging from small targets with a relatively simple fold (like T0773-D1; Fig. 3a) to larger proteins having complex topologies (like T0776-D1; Fig. 3f). The energy landscape encountered by FUSION over the entire benchmark set is presented in Supplementary Fig. 1.

Extent and distribution of conformational sampling

To further examine the degree of conformational sampling done by FUSION, we investigated the proportion of good decoys (having a C_α-rmsd below 6 Å with the native structure), the accuracies of the decoys having the lowest DFIRE scores and the best decoys out of the five lowest DFIRE scores. Table 2 reports each of these measures for all the targets in the benchmark set. For 24 out of 42 targets, FUSION generated some good decoys with C_α-rmsd less than or equal to 6 Å. For 15 targets, the best of the top five lowest scoring decoys selected by DFIRE from all the decoys generated by FUSION had an accuracy better than 6 Å, even though the percentage of good decoys is not always high. As expected, smaller targets tend to have a much higher proportion of good decoys as well as a higher accuracy than larger targets. Nevertheless, for some fairly large proteins having more than 200 residues, the low-scoring conformations sampled by FUSION reached close to the 6 Å mark. For instance, the best of five low-scoring decoys for target T0760-D1, a 210-residue β protein domain, achieved an accuracy of 6.66 Å. However, for target T0849-D1, a 236-residue mostly helical protein domain, the best of five low-scoring decoys achieved an accuracy of 8.71 Å.
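The three per-target measures described above can be computed from a list of (score, C_α-rmsd) pairs. The following is a minimal sketch, assuming the decoys are already scored and superposed; the function name `summarize_decoys` and the numbers in the usage example are illustrative and are not taken from the benchmark.

```python
def summarize_decoys(decoys, cutoff=6.0, top_n=5):
    """Summarize a decoy set given as (energy_score, ca_rmsd) pairs.

    cutoff: rmsd threshold (in angstroms) below which a decoy counts as good.
    top_n:  number of lowest-scoring decoys considered for "best of five".
    """
    if not decoys:
        raise ValueError("decoy list is empty")
    # Lower energy score means a more favorable decoy.
    ranked = sorted(decoys, key=lambda d: d[0])
    good = sum(1 for _, rmsd in decoys if rmsd < cutoff)
    return {
        "good_fraction": good / len(decoys),
        "lowest_score_rmsd": ranked[0][1],
        "best_of_top_rmsd": min(rmsd for _, rmsd in ranked[:top_n]),
    }


# Invented toy values, for illustration only.
toy = [(-120.0, 7.5), (-118.0, 5.2), (-110.0, 9.1),
       (-100.0, 4.8), (-95.0, 12.0), (-90.0, 3.9)]
print(summarize_decoys(toy))
```

Note that the best decoy overall (3.9 Å in the toy data) need not be among the five lowest-scoring decoys, which is why the proportion of good decoys and the best-of-five accuracy are reported separately.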

Table 2 Accuracy of FUSION decoys.

To gain additional insights into the nature of the decoy population, especially for larger proteins, we examined the Gaussian kernel density estimation for the accuracy of decoys generated by FUSION. In Fig. 4, we show the distribution and degree of sampling for three targets with a sequence length of more than 200 residues. For T0760-D1 (Fig. 4a), the range of sampled conformational space is diverse with a high density of decoy population between 15 Å and 20 Å and reaching an accuracy of 5.53 Å. For T0805-D1 (Fig. 4b), the conformation space is less diverse with a definite peak near 10 Å. The best decoy attained 5.91 Å C_α-rmsd. For T0849-D1 (Fig. 4c), the distribution is multimodal with many peaks between 6 Å and 20 Å with the best decoy reaching an accuracy of ~6 Å. The degree and distribution of conformational sampling for all targets are presented in Supplementary Fig. 2.

Comparisons with fragment-assembly

We compared FUSION with the popular fragment assembly method ROSETTA², which constructs a library of fragments from PDB using sequence profile, secondary structures and other sequence-derived features. ROSETTA then assembles these fragments to produce the final structure. A direct comparison between FUSION and ROSETTA, therefore, is not fair because ROSETTA has a clear advantage in its use of multiple sequence information during fragment selection. Moreover, we did not exclude homologous fragments in order to realize the full potential of ROSETTA. On the other hand, FUSION does not have such an advantage, since the training dataset is non-homologous to the benchmark set and was curated well before CASP11, and it is a model-based sampling approach rather than a fragment assembly method based on a fragment library. However, FUSION simulations used ambiguous distance restraints derived from sequence-based predicted residue-residue contacts as an additional pseudo energy term, which were not used in ROSETTA. We, therefore, decided to compare the accuracy of the best decoys generated by ROSETTA and FUSION. The comparison offers some interesting insights.

Out of 16 smaller proteins with a length of less than 100 residues, ROSETTA outperformed FUSION in 14 cases in terms of the accuracy of the best decoy, as shown in Table 3. For instance, for target T0759-D1, a small 34-residue protein domain, the best decoy produced by ROSETTA had a C_α-rmsd of 0.67 Å compared to the native protein, outperforming the best decoy generated by FUSION with a C_α-rmsd of 2.38 Å by a large margin. However, for larger proteins with more than 150 residues, in terms of accuracy of the best decoy, FUSION performed better than ROSETTA. For eight out of nine targets with more than 150 residues, the best models generated by FUSION were more accurate than those generated by ROSETTA. As shown in Fig. 5, in six out of nine cases, the best models generated by FUSION were reasonably accurate with a C_α-rmsd less than 6 Å, while ROSETTA failed to reach the accuracy of 6 Å in any of the cases. Moreover, for target T0849-D1 with a sequence length of 236 residues, the best decoy generated by FUSION attained 6.01 Å C_α-rmsd, while the best decoy generated by ROSETTA had an accuracy of 9.01 Å.

Table 3 Accuracy of best decoys generated by FUSION and ROSETTA.

Discussion

This study introduces a probabilistic approach for sampling of a protein backbone in full atomic detail in continuous space, free from a fragment library. The sampled conformation has a reasonable stereochemistry, which is reflected by its realistic angles and secondary structure. Its ability to incorporate noisy predicted information during simulation and complete coverage of the conformational space accessible to proteins makes it fundamentally different from prior fragment assembly approaches.

An analysis of the performance of the proposed method, FUSION, in a blind assessment revealed its capability to perform convergent sampling, covering a large spectrum of conformational space accessible to a protein sequence. It performs favorably, especially for larger proteins, producing more accurate decoys than fragment assembly techniques and opening the possibility of predicting near-native structural models even for large proteins in a de novo manner.

An obvious next step is to extend the model to capture both backbone and side chain conformational bias. Given the large number of degrees of freedom in the side chains of a protein molecule, this will pose a formidable computational challenge. Integrating multiple sequence alignment information into the model is another possible direction for future investigation.

To facilitate usage of the FUSION method by life scientists around the world, a public web server has been made freely available at http://protein.rnet.missouri.edu/FUSION/, where users can access and submit FUSION modeling jobs. Instructions on submitting and retrieving modeling jobs are also provided at the website. Due to limited computational resources and to ensure a reasonable turn-around time, the maximum number of decoys per job submission is limited to 10,000.

Methods

Parameterization of protein conformational space

Before formulating a probabilistic model capturing detailed sequence to structure relationships, mathematical parameterization of protein conformational space is essential. Twenty naturally occurring amino acid residues usually specify protein sequence space. Due to their intrinsic stereochemistry, these residues give rise to distinct population distributions in Ramachandran space²⁰. Analysis of high-resolution experimental structures^21,22,23 has shown that it is convenient to consider these distributions in eight classes: (1) glycines not preceding prolines, (2) prolines not preceding prolines, (3) β-branched amino acid residues, isoleucines and valines, not preceding prolines, (4) all amino acids except glycines, prolines, isoleucines and valines not preceding prolines, (5) glycines preceding prolines, (6) prolines preceding prolines, (7) β-branched residues isoleucines and valines preceding prolines and (8) all amino acids except glycine, proline, isoleucine and valine preceding prolines. We use these eight classes of amino acid residues to represent protein sequence space.
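The eight-class encoding above depends only on a residue's identity and on whether the next residue is a proline. A minimal sketch of the mapping follows, assuming a one-letter amino acid sequence as input; the function name `residue_classes` is illustrative, and the class numbers follow the enumeration in the paragraph above.

```python
def residue_classes(sequence):
    """Map a one-letter amino acid sequence to the eight residue classes (1-8)."""
    classes = []
    for i, aa in enumerate(sequence):
        # A residue "precedes proline" when the next residue in the chain is P.
        pre_pro = i + 1 < len(sequence) and sequence[i + 1] == "P"
        if aa == "G":
            base = 1          # glycine
        elif aa == "P":
            base = 2          # proline
        elif aa in "IV":
            base = 3          # beta-branched isoleucine / valine
        else:
            base = 4          # all remaining amino acids
        # Classes 5-8 mirror classes 1-4 for residues preceding a proline.
        classes.append(base + 4 if pre_pro else base)
    return classes


print(residue_classes("GAPPIVG"))  # [1, 8, 6, 2, 3, 3, 1]
```

The last residue of a chain has no successor, so in this sketch it is always assigned to one of the "not preceding proline" classes.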

On the structural side, we adopt a backbone-only representation of proteins, where each amino acid residue in a protein chain can be characterized using three angular degrees of freedom, the ϕ, ψ and ω dihedral angles, assuming ideal bond lengths and bond angles²⁴. Due to the presence of steric hindrance and electrostatic interactions, backbone dihedral angle pairs (ϕ, ψ) cluster together in distinct regions of the Ramachandran plot in naturally-occurring protein structures. Densely populated regions correspond to low energy conformations found in common elements of secondary structures, most significantly, right-handed α-helices, left-handed α_L-turns and extended β-strands. We, therefore, considered three-state secondary structure types (helix, strand and coil) to capture this preference. Furthermore, we included the peptide ω angles, which have been found to exhibit systematic variations in (ϕ, ψ) space in proteins^11,25. This parameterization is simple, yet adequate to describe protein backbone conformation in atomic detail.

Formulating the probabilistic graphical model

We briefly describe the most important aspects of the proposed model in Fig. 1. For each slice, i, a residue type identifier, A_i, specifies which of the eight classes of residue types serves as input, and a hidden variable, H_i, can adopt 30 different discrete states (see below). Each of these states corresponds to a specific emission distribution over secondary structures (S_i: helix, strand and coil), dihedral angle pairs (D_i: ϕ, ψ) and peptide bond conformations (P_i: ω). Conformational space of a protein with n residues is specified by the following probability distribution:

where the sum runs over all possible hidden node sequences H = (H₁, …, H_n).
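For orientation, one factorization that is consistent with the model description is sketched below. It is an assumption inferred from the text (hidden states that depend on the previous hidden state and the current input class, and emissions that depend only on the current hidden state), not a transcription of the published equation; in particular, it omits any explicit term for the connections between input nodes.

```latex
P(S, D, P \mid A) \;=\; \sum_{H} \prod_{i=1}^{n}
    P(H_i \mid H_{i-1}, A_i)\,
    P(S_i \mid H_i)\,
    P(D_i \mid H_i)\,
    P(P_i \mid H_i),
\qquad P(H_1 \mid H_0, A_1) \equiv P(H_1 \mid A_1)
```

Under this reading, marginalizing over H costs O(n K²) operations for K hidden states using the forward algorithm, rather than requiring an explicit sum over all Kⁿ hidden paths.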

We model the discrete nodes A and S using conditional probability tables. In order to capture the angular preferences of the backbone dihedral angle pair node (D), we use a mixture of bivariate von Mises distributions (the cosine variant), which is well suited for this purpose²⁶. The bivariate von Mises distribution specifies dihedral angle pairs (ϕ, ψ), both ranging from –π to π, as points on a torus. The probability density function is given by:

$$f(\phi, \psi) = c(\kappa_1, \kappa_2, \kappa_3)\, \exp\left[\kappa_1 \cos(\phi - \mu) + \kappa_2 \cos(\psi - \nu) - \kappa_3 \cos(\phi - \mu - \psi + \nu)\right]$$

where c(κ₁, κ₂, κ₃) is a normalizing constant, μ and ν are the means for ϕ and ψ, respectively, κ₁ and κ₂ are their concentrations, and κ₃ is related to their correlation.

Angular preference of the ω dihedral angle node of a peptide bond (P) is modeled using a mixture of von Mises distributions²⁷, which can be considered the circular equivalent of the Gaussian distribution. The von Mises distribution takes the circular nature of angular data into account and represents dihedral angles ranging from –π to π as points on a circle. The probability density function has the following form:

$$f(\omega \mid \lambda, \kappa) = \frac{\exp\left[\kappa \cos(\omega - \lambda)\right]}{2\pi I_0(\kappa)}$$

where λ is the mean angle, κ > 0 is a concentration parameter and I₀ is the modified Bessel function of the first kind and of order zero.
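To make the ω model concrete, the standard von Mises density and a mixture of such densities can be evaluated with the Python standard library alone. This is a minimal sketch: the Bessel function is computed from its power series (adequate for moderate κ), and the mixture weights, means and concentrations in the usage example are illustrative assumptions, not the trained FUSION parameters.

```python
import math


def bessel_i0(x, terms=100):
    """Modified Bessel function of the first kind, order zero (power series)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        # Ratio of consecutive series terms: (x/2)^2 / (k+1)^2.
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total


def von_mises_pdf(omega, mean, kappa):
    """Von Mises density at angle omega (radians)."""
    return math.exp(kappa * math.cos(omega - mean)) / (2.0 * math.pi * bessel_i0(kappa))


def mixture_pdf(omega, components):
    """Mixture density; components is a list of (weight, mean, kappa)."""
    return sum(w * von_mises_pdf(omega, mu, kappa) for w, mu, kappa in components)


# Hypothetical two-component mixture: a dominant trans peak near pi and a
# minor cis peak near 0. Weights and concentrations are made up.
omega_mixture = [(0.95, math.pi, 50.0), (0.05, 0.0, 50.0)]
print(mixture_pdf(math.pi, omega_mixture))
```

Because the density depends on ω only through cos(ω − λ), angles of –π and π give the same value, which is the property that makes the distribution suitable for dihedral angles.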

Training data, parameter estimation and model selection

As training data, we collected 1,740 non-redundant protein domains, covering different SCOP folds, from the SABmark dataset, version 1.65²⁸. Residue class and angle information was extracted directly from the training data, whereas three-state secondary structures (helix, strand and coil) were assigned using DSSP²⁹. The training dataset contains 270,350 observations.

Parameter estimation for FUSION was done using Stochastic Expectation-Maximization (S-EM)³⁰, as implemented in the Mocapy⁺⁺ software package³¹. In each iteration, the S-EM algorithm consisted of two steps: (1) for each observation in the training set, plausible hidden nodes were resampled using the forward-backtrack algorithm¹³, which allocated each observation in the training set to a specific hidden state (E-step); (2) the parameters were updated using maximum likelihood, assuming the model was fully observed (M-step). The S-EM algorithm is known to be a better choice than the classic EM algorithm on large datasets due to its computational efficiency and its ability to avoid convergence to local optima³⁰.

The optimal size of the hidden node is a hyperparameter that has to be determined separately, and choosing it well is crucial for the model to succeed. If the size is too low, the model will be too coarse; if it is too high, the model will overfit. We estimated the optimal hidden node size using the Akaike Information Criterion (AIC)³², a well-established model selection criterion:

$$\mathrm{AIC} = -2 \log L(\theta \mid d) + 2n$$

where L(θ|d) is the likelihood of the model given the data d and n is the number of parameters. The AIC value reaches a minimal value for the optimal model. The AIC was calculated for hidden node sizes of 10 to 100 (with a step size of 5), using a likelihood obtained after convergence of the S-EM algorithm. Since the nature of the training process is stochastic, parameter estimation for each hidden node size was repeated four times with different starting conditions. For a model with a hidden node size of 30, the AIC value reached its minimum value, resulting in 7,812 parameters (Supplementary Fig. 3). We chose this model as the optimum one.
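The selection procedure can be sketched as follows, using the standard definition of the AIC. How the four repeated training runs per hidden node size were combined is not stated in the text; taking the lowest AIC across runs is an assumption made here for illustration, and the log-likelihoods and parameter counts in the example are invented.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 log L + 2 n (lower is better)."""
    return -2.0 * log_likelihood + 2.0 * n_params


def select_hidden_size(runs_by_size):
    """Pick the hidden node size with the lowest AIC.

    runs_by_size maps a hidden node size to a list of
    (log_likelihood, n_params) tuples, one per training run.
    Assumption: the best (lowest-AIC) run represents each size.
    """
    best_aic = {
        size: min(aic(ll, n) for ll, n in runs)
        for size, runs in runs_by_size.items()
    }
    return min(best_aic, key=best_aic.get), best_aic


# Invented numbers, for illustration only.
toy_runs = {
    10: [(-1000.0, 100)],
    30: [(-900.0, 150), (-905.0, 150)],
    50: [(-895.0, 300)],
}
best_size, table = select_hidden_size(toy_runs)
print(best_size, table)
```

In the toy data, the largest model has the highest likelihood but is penalized for its parameter count, so the intermediate size wins, which mirrors the trade-off between coarseness and overfitting described above.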

Conformational sampling

For a protein sequence of n residues, the amino acid residues can be readily mapped to the residue classes (A₁, …, A_n). The plausible values of the hidden nodes, H_i, are then sampled from one end to the other, from the distribution P(H_i | A_i = a_i, H_{i−1} = h_{i−1}). Based on the sequence of hidden node values, samples for the corresponding emission nodes are drawn from the conditional probability distributions given the sampled hidden state, i.e., P(S_i | H_i = h_i), P(D_i | H_i = h_i) and P(P_i | H_i = h_i).
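The sampling pass described above can be sketched with a toy model. Everything numeric here is a placeholder: the parameters are random rather than trained, only three hidden states are used instead of 30, and the (ϕ, ψ) pair is drawn from two independent von Mises distributions as a simplification of the bivariate von Mises emission used by FUSION. The names `make_toy_model` and `sample_conformation` are hypothetical.

```python
import math
import random

N_CLASSES = 8   # residue classes used as input
N_HIDDEN = 3    # toy size; the trained model in the text uses 30


def _random_distribution(rng, n):
    weights = [rng.random() + 0.1 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def make_toy_model(rng):
    """Random placeholder parameters; NOT the trained FUSION parameters."""
    return {
        # init[a][h] = P(H_1 = h | A_1 = a)
        "init": [_random_distribution(rng, N_HIDDEN) for _ in range(N_CLASSES)],
        # trans[a][p][h] = P(H_i = h | A_i = a, H_{i-1} = p)
        "trans": [[_random_distribution(rng, N_HIDDEN) for _ in range(N_HIDDEN)]
                  for _ in range(N_CLASSES)],
        # ss[h] = distribution over helix (H), strand (E), coil (C)
        "ss": [_random_distribution(rng, 3) for _ in range(N_HIDDEN)],
        # (mean phi, mean psi, shared concentration) per hidden state
        "phi_psi": [(rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi), 8.0)
                    for _ in range(N_HIDDEN)],
        # (mean omega, concentration): sharply peaked at the trans value
        "omega": [(math.pi, 60.0) for _ in range(N_HIDDEN)],
    }


def wrap(angle):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def sample_conformation(model, classes, rng):
    """Sample (hidden state, secondary structure, phi, psi, omega) per residue."""
    residues, h_prev = [], None
    for a in classes:
        idx = a - 1  # classes are numbered 1-8 in the text
        probs = model["init"][idx] if h_prev is None else model["trans"][idx][h_prev]
        h = rng.choices(range(N_HIDDEN), weights=probs)[0]
        ss = rng.choices("HEC", weights=model["ss"][h])[0]
        mu, nu, kappa = model["phi_psi"][h]
        phi = wrap(rng.vonmisesvariate(mu, kappa))
        psi = wrap(rng.vonmisesvariate(nu, kappa))
        om_mu, om_kappa = model["omega"][h]
        omega = wrap(rng.vonmisesvariate(om_mu, om_kappa))
        residues.append((h, ss, phi, psi, omega))
        h_prev = h
    return residues


rng = random.Random(0)
toy_model = make_toy_model(rng)
for residue in sample_conformation(toy_model, [1, 8, 6, 2, 3, 3, 1], rng):
    print(residue)
```

Because each emission is drawn from a continuous distribution selected by the hidden state, repeated calls produce different angle sequences for the same input classes, in contrast to a lookup in a finite fragment library.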

Once we have sampled a sequence of hidden values, (H₁, …, H_n), a sequence of secondary structure types (S₁, …, S_n), a sequence of (ϕ, ψ) angle pairs (D₁, …, D_n) and a sequence of ω dihedral angles of the peptide bonds (P₁, …, P_n), given an appropriate sequence of residue classes (A₁, …, A_n), resampling a sub-sequence from position l to m can then be done using the forward-backtrack algorithm¹³. The algorithm involves two steps. In the first step, the forward variables are calculated for each possible hidden node value k in each slice j ∈ (l, …, m), using the forward algorithm³³. Subsequently, the hidden node values, h_j, for positions l to m are sampled with probability proportional to the forward variables combined with the hidden-state transition probabilities. In the second step, emission nodes at each position j ∈ (l, …, m) are sampled from the conditional probability distribution given the sampled hidden node value h_j. When secondary structure information is available, or predicted from the protein sequence, the same sampling and resampling strategies can be applied simply by treating the secondary structure types (S_i) of the corresponding sequence positions i as observed. This unique conditional sampling approach makes it possible to incorporate observed structural features to guide the sampling of dihedral angles.
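The hidden-state resampling step can be sketched in its textbook form: a forward pass over the window, followed by backward sampling from position m down to l. This is an illustration of the generic forward-backtrack procedure under assumed data structures, not the Mocapy++ implementation; the argument layout (`init`, `trans`, `emit`) is hypothetical. Setting `emit[j][k]` to 1 for every k corresponds to a slice with no observed emissions, while an observed secondary structure would contribute P(S_j | H_j = k).

```python
import random


def forward_backtrack(init, trans, emit, l, m, states, rng):
    """Resample hidden states for slices l..m (inclusive), keeping the rest fixed.

    init[k]        = P(H_0 = k)                      (input-conditioned)
    trans[j][p][k] = P(H_j = k | H_{j-1} = p)        for j >= 1 (input-conditioned)
    emit[j][k]     = likelihood of the observed emissions at slice j given H_j = k
    states         = current hidden state sequence (not modified)
    """
    n_states = len(init)
    alpha = []  # normalized forward variables for slices l..m
    for j in range(l, m + 1):
        if j == 0:
            prior = init
        elif j == l:
            # The slice to the left of the window is fixed.
            prior = trans[j][states[j - 1]]
        else:
            prev = alpha[-1]
            prior = [sum(prev[p] * trans[j][p][k] for p in range(n_states))
                     for k in range(n_states)]
        row = [prior[k] * emit[j][k] for k in range(n_states)]
        total = sum(row)
        alpha.append([v / total for v in row])

    new_states = list(states)
    for j in range(m, l - 1, -1):
        weights = alpha[j - l]
        if j + 1 < len(states):
            # Condition on the state to the right (fixed at m+1, else just sampled).
            nxt = new_states[j + 1]
            weights = [weights[k] * trans[j + 1][k][nxt] for k in range(n_states)]
        new_states[j] = rng.choices(range(n_states), weights=weights)[0]
    return new_states


# Toy example: two hidden states, uniform transitions, and emissions that
# only allow state 1 inside the resampled window.
uniform = [[0.5, 0.5], [0.5, 0.5]]
trans = [None] + [uniform] * 4
emit = [[1.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
print(forward_backtrack([0.5, 0.5], trans, emit, 1, 3, [0, 0, 0, 0, 0], random.Random(0)))
```

Conditioning on the fixed states at positions l − 1 and m + 1 is what lets a resampled segment join its flanking regions without a discontinuity in the hidden path.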

Simulation protocol

For each protein in the benchmark set, we predicted a three-state secondary structure (helix, strand and coil) from the amino acid sequence using machine-learning based secondary structure predictors PSIPRED³⁴ and Raptor-X³⁵. To reduce the effect of noisy prediction on the modeling performance, we flagged the secondary structure as observed only when the consensus confidence (confidence of secondary structure ∈ [0, 1]) for a residue was above 0.5. For the rest of the residues, secondary structures were left hidden, allowing flexibility during the simulations.

We used ROSETTA’s low-resolution scoring function, E_rosetta, as one component of FUSION’s energy function to guide the simulations, accessed through its Python-based interface, PyRosetta³⁶. Briefly, it includes terms for van der Waals hard sphere repulsion (vdw), residue environment (env), residue pair (pair), C_β packing density (cb), secondary structure packing [helix–helix pairing (hh), helix-strand pairing (hs), strand-strand pairing (ss), strand pair distance (rsigma) and sheet formation from strands (sheet)], plus radius of gyration (rg). The details for each of these terms have been described elsewhere¹⁵. In general, the ROSETTA low-resolution scoring function favors compact structures with buried hydrophobic residues and paired β strands. To further guide the sampling, we added ambiguous distance restraints as an additional pseudo energy term, using sequence-based residue-residue contacts predicted by NNcon³⁷ and PhyCMAP³⁸ [two amino acid residues are considered to be in contact if the distance between their C_β atoms (C_α for glycine) in the experimental structure is less than 8 Å]. The contact energy was defined as a function of atom pair distance restraint³⁹ between C_β atoms (C_α for glycine):

where, x is the distance between the corresponding atoms for a contact pair, lb is the lower bound (1.5 Å), ub is the upper bound (8 Å) and rswitch is a constant of 0.5. We retained only the top L/5 predicted contacts (sorted by confidence of prediction ∈ [0, 1]) from each predictor, where L is the sequence length of the protein. In order to further account for low accuracy in sequence-based predicted contacts, contact energy was evaluated within ± δ neighboring residues of a predicted contact pair [i, j], for small values of δ, and the minimum energy value was taken as the ambiguous contact energy E_ij (e.g., for δ = 1, the ambiguous contact energy, E_ij, of a predicted contact pair [i, j] would be the minimum of the contact energy evaluated at [i, j], [i ± 1, j], [i, j ± 1] and [i ± 1, j ± 1]). Summing up E_ij values over the top L/5 predicted contact pairs from each contact predictor resulted in the contact-derived restraint energy, E_contact. The value of δ was set as a logarithmic function of the sequence separation between the residues under consideration. Based on our preliminary simulation runs, such an ambiguous (less than residue-level precision) definition of contact not only compensates for the noise in contact prediction, but also facilitates achieving an optimal balance between contact-derived restraint energy and general physical chemistry, which is implicit in the ROSETTA scoring function. The total energy, E_total (x), of a conformation x with dihedral angle sequence d, is a linear combination of the ROSETTA low-resolution scoring function, E_rosetta, and the contact-derived restraint energy function, E_contact.

Subsequently, Boltzmann’s law was used to convert the energies into probabilities:

$$P_{\mathrm{total}}(x) \propto \exp\left[-\beta\, E_{\mathrm{total}}(x)\right]$$

where the inverse temperature, β, was set to 2.0 kT.
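The ambiguous contact energy described earlier in this section takes the minimum of a pairwise restraint energy over a small neighborhood of each predicted contact. The sketch below illustrates only that minimization. The pairwise term `flat_bottom_energy` is a stand-in (zero between the bounds, quadratic outside) and is an assumption: it is not the restraint function used in the paper, which also involves the rswitch constant. The distance matrix is invented.

```python
def flat_bottom_energy(x, lb=1.5, ub=8.0):
    """Stand-in pairwise restraint: zero within [lb, ub], quadratic outside.

    Assumption for illustration; not the paper's restraint function.
    """
    if x < lb:
        return (lb - x) ** 2
    if x > ub:
        return (x - ub) ** 2
    return 0.0


def ambiguous_contact_energy(dist, i, j, delta, pair_energy=flat_bottom_energy):
    """Minimum restraint energy over residue pairs within +/- delta of (i, j).

    dist is a symmetric matrix of C-beta distances (C-alpha for glycine).
    """
    n = len(dist)
    best = float("inf")
    for di in range(-delta, delta + 1):
        for dj in range(-delta, delta + 1):
            a, b = i + di, j + dj
            # Skip positions that fall off the chain or collapse onto one residue.
            if 0 <= a < n and 0 <= b < n and a != b:
                best = min(best, pair_energy(dist[a][b]))
    return best


# Invented 4-residue distance matrix (angstroms), for illustration only.
toy_dist = [
    [0.0, 4.0, 8.0, 12.0],
    [4.0, 0.0, 4.0, 7.0],
    [8.0, 4.0, 0.0, 4.0],
    [12.0, 7.0, 4.0, 0.0],
]
print(ambiguous_contact_energy(toy_dist, 0, 3, delta=0))  # 16.0
print(ambiguous_contact_energy(toy_dist, 0, 3, delta=1))  # 0.0
```

With δ = 0 the predicted pair (0, 3) is penalized for being 12 Å apart, whereas with δ = 1 the restraint is satisfied by a neighboring pair, which is how the ambiguity absorbs small registry errors in predicted contacts.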

For a transition from a dihedral angle sequence d to d′ in the FUSION model, the Metropolis-Hastings acceptance criterion can be expressed as:

$$\alpha(d \to d') = \min\left(1, \frac{P(d')\, Q(d' \to d)}{P(d)\, Q(d \to d')}\right)$$

where α(d → d′) is the acceptance probability corresponding to the transition from state d to d′; moreover, P(d′) and P(d) are the probabilities of d′ and d according to the target distribution, while Q(d′ → d) and Q(d → d′) are the probabilities of a move from state d′ to state d and vice versa, according to the proposal distribution.

Since we used FUSION solely as the proposal distribution and the transition in dihedral sequence from state d to d′ results in a transition of conformation from x to x′, the Metropolis-Hastings expression reduces to:

$$\alpha(x \to x') = \min\left(1, \frac{P_{\mathrm{total}}(x')\, P_{\mathrm{fusion}}(d)}{P_{\mathrm{total}}(x)\, P_{\mathrm{fusion}}(d')}\right)$$

where P_total (x) is the scoring function derived probability described above and P_fusion (d) is the product of the probabilities of dihedral angles in d according to FUSION, conditioned on the residue classes and, optionally, secondary structure types.
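The acceptance rule can be sketched in log space, which avoids overflow when energies are large. This is a minimal illustration under stated assumptions: P_total is taken to be proportional to exp(−β E_total), and the callables `propose` and `energy` as well as their signatures are hypothetical placeholders for the FUSION resampling step and the scoring function.

```python
import math
import random


def mh_accept_prob(e_old, e_new, logp_fusion_old, logp_fusion_new, beta=2.0):
    """Acceptance probability min(1, P_total(x') P_fusion(d) / (P_total(x) P_fusion(d'))).

    Assumes P_total(x) is proportional to exp(-beta * E_total(x)).
    """
    log_ratio = -beta * (e_new - e_old) + (logp_fusion_old - logp_fusion_new)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def run_mcmc(x0, propose, energy, n_iter, rng, beta=2.0):
    """Generic Metropolis-Hastings loop that returns the lowest-energy state seen.

    propose(x, rng) -> (x_new, logp_fusion_old, logp_fusion_new)   (hypothetical)
    energy(x)       -> float                                        (hypothetical)
    """
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_iter):
        x_new, logp_old, logp_new = propose(x, rng)
        e_new = energy(x_new)
        if rng.random() < mh_accept_prob(e, e_new, logp_old, logp_new, beta):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


# Toy usage: a one-dimensional "conformation" with a quadratic energy and a
# symmetric proposal (equal proposal log-probabilities cancel in the ratio).
def toy_propose(x, rng):
    return x + rng.uniform(-0.5, 0.5), 0.0, 0.0


best_x, best_e = run_mcmc(3.0, toy_propose, lambda x: x * x, 2000, random.Random(0))
print(best_x, best_e)
```

The proposal term matters whenever the proposal is not symmetric: a move to a conformation that the proposal distribution strongly favors is accepted less readily, so that the proposal's own bias does not distort the sampled distribution.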

We performed 20,000 MCMC iterations to generate each low-resolution model by resampling random stretches of 3 to 15 residues and selected the structure with the highest probability (i.e., lowest energy). The lowest energy structure was further relaxed using a smooth reparameterized version of ROSETTA’s low-resolution scoring function^2,15.

Sources of experimental PDB structures in the benchmark set

The experimental PDB structures used in the 42-protein benchmark set were downloaded from the CASP11 website at http://predictioncenter.org/download_area/CASP11/targets/. The domain definitions and the PDB accession codes were provided by CASP assessors at http://predictioncenter.org/casp11/domains_summary.cgi.

Additional Information

How to cite this article: Bhattacharya, D. and Cheng, J. De novo protein conformational sampling using a probabilistic graphical model. Sci. Rep. 5, 16332; doi: 10.1038/srep16332 (2015).

References

1. Levinthal, C. Are there pathways for protein folding? J. Chim. Phys. 65, 44–45 (1968).
2. Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268, 209–225 (1997).
3. Chikenji, G., Fujitsuka, Y. & Takada, S. A reversible fragment assembly method for de novo protein structure prediction. The Journal of Chemical Physics 119, 6895–6903 (2003).
4. Chikenji, G., Fujitsuka, Y. & Takada, S. Shaping up the protein folding funnel by local interaction: lesson from a structure prediction study. Proc. Natl. Acad. Sci. USA 103, 3141–3146 (2006).
5. Bradley, P., Misura, K. M. & Baker, D. Toward high-resolution de novo structure prediction for small proteins. Science 309, 1868–1871 (2005).
6. Hegler, J. A., Lätzer, J., Shehu, A., Clementi, C. & Wolynes, P. G. Restriction versus guidance in protein structure prediction. Proc. Natl. Acad. Sci. USA 106, 15302–15307 (2009).
7. Kim, D. E., Blum, B., Bradley, P. & Baker, D. Sampling bottlenecks in de novo protein structure prediction. J. Mol. Biol. 393, 249–260 (2009).
8. Hamelryck, T., Kent, J. T. & Krogh, A. Sampling realistic protein conformations using local structural bias. PLoS Comput. Biol. 2, e131 (2006).
9. Zhao, F., Li, S., Sterner, B. W. & Xu, J. Discriminative learning for protein conformation sampling. Proteins: Structure, Function and Bioinformatics 73, 228–240 (2008).
10. Boomsma, W. et al. A generative, probabilistic model of local protein structure. Proc. Natl. Acad. Sci. USA 105, 8932–8937 (2008).
11. Berkholz, D. S., Driggers, C. M., Shapovalov, M. V., Dunbrack, R. L. & Karplus, P. A. Nonplanar peptide bonds in proteins are common and conserved but not biased toward active sites. Proc. Natl. Acad. Sci. USA 109, 449–453 (2012).
12. Bengio, Y. & Frasconi, P. Input-output HMMs for sequence processing. IEEE Transactions on Neural Networks 7, 1231–1249 (1996).
13. Cawley, S. L. & Pachter, L. HMM sampling and applications to gene finding and alternative splicing. Bioinformatics 19, ii36–ii41 (2003).
14. Gilks, W. R., Richardson, S. & Spiegelhalter, D. J. Introducing Markov chain Monte Carlo. Markov Chain Monte Carlo in Practice 1, 19 (1996).
15. Rohl, C. A., Strauss, C. E., Misura, K. M. & Baker, D. Protein structure prediction using Rosetta. Methods Enzymol. 383, 66–93 (2004).
16. Przytycka, T. Significance of conformational biases in Monte Carlo simulations of protein folding: Lessons from Metropolis–Hastings approach. Proteins: Structure, Function and Bioinformatics 57, 338–344 (2004).
17. Shapovalov, M. V. & Dunbrack, R. L. A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure 19, 844–858 (2011).
18. Kuhlman, B. & Baker, D. Native protein sequences are close to optimal for their structures. Proc. Natl. Acad. Sci. USA 97, 10383–10388 (2000).
19. Zhou, H. & Zhou, Y. Distance‐scaled, finite ideal‐gas reference state improves structure‐derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11, 2714–2726 (2002).
20. Ramachandran, G., Ramakrishnan, C. & Sasisekharan, V. Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–99 (1963).
21. Lovell, S. C. et al. Structure validation by Cα geometry: ϕ, ψ and Cβ deviation. Proteins: Structure, Function and Bioinformatics 50, 437–450 (2003).
22. Ho, B. K. & Brasseur, R. The Ramachandran plots of glycine and pre-proline. BMC Struct. Biol. 5, 14 (2005).
23. Karplus, P. A. Experimentally observed conformation-dependent geometry and hidden strain in proteins. Protein Sci. 5, 1406–1420 (1996).
24. Engh, R. A. & Huber, R. Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallographica Section A: Foundations of Crystallography 47, 392–400 (1991).
25. MacArthur, M. W. & Thornton, J. M. Deviations from planarity of the peptide bond in peptides and proteins. J. Mol. Biol. 264, 1180–1195 (1996).
26. Mardia, K. V., Taylor, C. C. & Subramaniam, G. K. Protein bioinformatics and mixtures of bivariate von Mises distributions for angular data. Biometrics 63, 505–512 (2007).
27. Mardia, K. V. & Jupp, P. E. Directional Statistics. Vol. 494 (John Wiley & Sons, 2009).
28. Van Walle, I., Lasters, I. & Wyns, L. SABmark—a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 21, 1267–1268 (2005).
29. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
30. Nielsen, S. F. The stochastic EM algorithm: estimation and asymptotic results. Bernoulli, 457–489 (2000).
31. Paluszewski, M. & Hamelryck, T. Mocapy++ - A toolkit for inference and learning in dynamic Bayesian networks. BMC Bioinformatics 11, 126 (2010).
32. Burnham, K. P. & Anderson, D. R. Model Selection and Multimodel Inference: a Practical Information-Theoretic Approach. (Springer Science & Business Media, 2002).
33. Durbin, R. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. (Cambridge University Press, 1998).
34. Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202 (1999).
35. Wang, Z., Zhao, F., Peng, J. & Xu, J. Protein 8-class secondary structure prediction using conditional neural fields. Proteomics 11, 3786–3792 (2011).
36. Chaudhury, S., Lyskov, S. & Gray, J. J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).
37. Tegge, A. N., Wang, Z., Eickholt, J. & Cheng, J. NNcon: improved protein contact map prediction using 2D-recursive neural networks. Nucleic Acids Res. 37, W515–W518 (2009).
38. Wang, Z. & Xu, J. Predicting protein contact map using evolutionary and physical constraints by integer programming. Bioinformatics 29, i266–i273 (2013).
39. Raman, S. et al. NMR structure determination for larger proteins using backbone-only data. Science 327, 1014–1018 (2010).

Acknowledgements

We thank the Rosetta and PyRosetta communities for making their software freely available and the CASP11 organizers and assessors for providing comprehensive data. We acknowledge financial support from the US National Institutes of Health (NIH) grant (R01GM093123) to JC.

Author information

Authors and Affiliations

Department of Computer Science, University of Missouri, Columbia, 65211, MO, USA
Debswapna Bhattacharya & Jianlin Cheng
Informatics Institute, University of Missouri, Columbia, 65211, MO, USA
Jianlin Cheng
Bond Life Science Center, University of Missouri, Columbia, 65211, MO, USA
Jianlin Cheng

Contributions

D.B. and J.C. designed the research. D.B. implemented the method and generated the data. D.B. and J.C. analyzed the data and wrote the paper.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Electronic supplementary material

Supplementary Information

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Cite this article

Bhattacharya, D., Cheng, J. De novo protein conformational sampling using a probabilistic graphical model. Sci Rep 5, 16332 (2015). https://doi.org/10.1038/srep16332

Received: 10 June 2015
Accepted: 13 October 2015
Published: 06 November 2015
DOI: https://doi.org/10.1038/srep16332
