Quality and bias of protein disorder predictors

Nielsen, Jakob T.; Mulder, Frans A. A.

doi:10.1038/s41598-019-41644-w


Article
Open access
Published: 26 March 2019


Scientific Reports volume 9, Article number: 5137 (2019) Cite this article


Abstract

Disorder in proteins is vital for biological function, yet it is challenging to characterize. Therefore, methods for predicting protein disorder from sequence are fundamental. Currently, predictors are trained and evaluated using data from X-ray structures or from various biochemical or spectroscopic data. However, the prediction accuracy of disorder predictors is not calibrated, nor is it established whether predictors are intrinsically biased towards one of the extremes of the order-disorder axis. We therefore generated and validated a comprehensive experimental benchmarking set of site-specific and continuous disorder, using deposited NMR chemical shift data. This novel experimental data collection is fully appropriate and represents the full spectrum of disorder. We subsequently analyzed the performance of 26 widely-used disorder prediction methods and found that these vary noticeably. At the same time, a distinct bias for over-predicting order was identified for some algorithms. Our analysis has important implications for the validity and the interpretation of protein disorder, as utilized, for example, in assessing the content of disorder in proteomes.


Introduction

Interest in intrinsically disordered proteins (IDPs) has grown immensely over the past decades. IDPs can serve a large range of functions due to their enhanced sampling of conformational space compared to structured proteins, and their involvement in many important biological processes and diseases has been discovered recently^{1,2,3,4,5,6,7}. Although experimental characterization of IDPs is very challenging, protein sequence composition has distinct biases and this has inspired the development of a large number of computational methods for predicting disorder from sequence^8,9. Recently, predictions of disorder by various methods have been compiled into databases^10,11,12 enabling consensus predictions, and meta-methods have emerged that predict disorder based on output from other predictors^13,14,15. Protein disordered region (DR) prediction has been assessed periodically through the critical assessment of structure prediction (CASP) initiative¹⁶. DR predictions did not improve from CASP8 to CASP9¹⁷, and only slightly for CASP10¹⁸. This apparent stagnation in the accuracy of disorder predictors would suggest that the development of new, more sophisticated predictors would not have sufficient merit, and DR predictions were no longer evaluated in subsequent CASP assessments.

We argue that this stagnation can be attributed to the vague authority of the evaluation, caused by insufficient quality of the data used to evaluate (and train) the predictors: In CASP, DR predictors were evaluated using missing density in X-ray structures as the disorder criterion. However, regions in X-ray structures might falsely appear ordered due to biases in non-native conditions required for X-ray crystallography characterization. In addition, since only proteins amenable to X-ray diffraction are included, such data sets are imbalanced in the sense that missing residues are relatively rare (only 2.4% in the set analyzed here) causing balance problems in the model building. As a complement, disorder analysis can be done for proteins in solution, as done in the DisProt database^19,20, and this data collection has frequently been used to train and evaluate disorder predictors²¹. Unfortunately, DisProt suffers from a heterogeneous compilation of data from diverse experimental sources, such as CD and sensitivity to proteolytic degradation, which lack position-specific information. Several false positive IDPs were indeed found in DisProt in a previous analysis²². A more serious issue arises from the fact that all currently applied evaluation criteria are binary classifiers, which ignore meaningful, intermediate order or a continuous range of structure^1,23,24, and therewith limit disorder prediction to a low-precision binary-classification problem. A more balanced dataset with a higher precision and accuracy would renew the potential in the development of bioinformatics methods for predicting disorder from sequence. For this purpose, we resorted to experimental data from NMR spectroscopy.

It is well-established that proteins can be studied with high accuracy in solution under near-native conditions by NMR spectroscopy. First, the structure-determination process provides an ensemble of structures where each model is consistent with the experimental data^{25,26,27,28,29}. Second, and more quantitatively, nuclear spin relaxation rates provide information about the time-scale and amplitude of dynamics in proteins^30,31,32, capturing and validating the variability in the NMR structures. Unfortunately, spin relaxation experiments and data analysis are relatively complicated to pursue, are not applicable for all time scales or for IDPs, and therefore there is very little data available for highly dynamic sites in proteins³³. Thirdly, chemical shifts are very sensitive to the local structure, are measured routinely and with very high precision for both structured proteins and IDPs^34,35, and have been used extensively to report on protein structure and dynamics^36,37. In particular, chemical shifts and their deviation from random coil values have been used to determine and quantify order/disorder and conformational propensities in IDPs^{38,39,40,41,42,43}. Modern molecular dynamics (MD) simulations reproduce experimental dynamical data with increasing accuracy⁴⁴ and, in particular, spin relaxation data has been used as an exquisite standard to benchmark MD force fields⁴⁵. IDPs can be simulated with high accuracy in the description of local conformational equilibria, and a very close agreement has been established between the degree of order/disorder in IDPs and secondary chemical shifts^46,47,48.

Recently, we introduced the Chemical shift Z-score for assessing Order/Disorder (the CheZOD score)²², which is based on deviations from random coil chemical shifts (RCCSs) using our refined formulation of RCCS reference values⁴⁹. In contrast to other methods for describing order/disorder, this CheZOD Z-score provides a position-specific and continuous measure of order/disorder in proteins. Furthermore, the corresponding CheZOD database of such Z-scores for 117 proteins studied at near-native conditions is diverse and balanced, containing equal amounts of disordered and ordered residues²². Here, we rigorously benchmark the performance of 26 disorder prediction methods by assessing the agreement between the estimated probabilities of disorder and the experimental Z-scores for each predictor, and use this result to rank the accuracy of the predictors. We observed that the accuracy of the predictors depends on the type of features applied, the method of optimization, and that the newest predictors are generally the most accurate. Some predictors are biased towards over-predicting order. Our analysis suggests that current DR predictions are limited by the quality of the training data rather than by the capacity of the data mining approaches. Improved predictors can therefore be anticipated.

Results

Measures of disorder and flexibility in protein structures: p53 as an example

To illustrate the process of disorder assignment, we consider the human oncogene protein p53, which contains ordered as well as disordered domains and is often used for illustrating predictions of disorder and interactions in IDPs^50,51. p53 is interesting because of its involvement in more than 50% of human cancers and many diverse biological processes due to its multitude of conformations^46,47,48. Estimated disorder probabilities for a large number of prediction methods (Fig. 1a, obtained from the genesilico server¹³) show agreement for some regions, but also substantial differences between the individual predictors. It is not possible to identify the most appropriate predictor a priori, although that choice would have a dramatic impact on the prediction of disordered regions (see Supplementary Fig. S1 for prediction examples for 5 additional proteins). Consensus predictions from MobiDB-lite¹¹ (Fig. 1b) and D²P²¹⁰ (Fig. 1c) suggest disorder outside of the structured domains and higher probability of disorder for the loops in the core domain (e.g. res. 181–191). However, disorder is also predicted for part of a rigid internal beta-strand in the core domain (res. 156–162) and for the entire folded tetramerization domain. When the DisProt database²⁰ (Fig. 1d) is used to assign disorder, two loop regions are assigned as confident disorder (res. 114–120 and 182–187), whereas the linker between the core domain and tetramerization domain (res. 293–312) shows ambiguous disorder. The remaining residues are classified as context-dependent, meaning that these regions cannot be assigned unequivocally to a disordered/ordered state. X-ray structures for the p53 core domain have missing densities for the ends of some of the sequence constructs. In contrast, internal residues with missing densities were only observed for two of the 12 chains for the loop comprising residues Lys120 and Ser121, which were also classified confidently as disordered in the DisProt database (Fig. 1e). A continuous measure for local disorder/order, for which data is more abundant and balanced, is the local structural variation in an NMR ensemble. Here we introduce two types of structural order parameters, S and T, based on NMR ensemble variation in dihedral angles and the Cα internal distances, respectively (see Online Methods). These order parameters span from zero to unity, ranging from complete disorder to order, and are in qualitative agreement with disorder predictions (e.g. for the two confident DisProt disorder regions) and show dips in order/increase in flexibility in all the loop regions of the core domain (Fig. 1h). Finally, we provide experimental disorder through the introduction of a continuous site-specific descriptor derived from assigned chemical shifts²² for p53⁵² (see Fig. 1i). According to these Z-scores, the core domain and tetramerization domain are ordered, whereas several loops in the core domain are disordered to varying degree (Fig. 1i), for example the loop comprising residues Lys120 and Ser121. There is a very close agreement between disorder from Z-scores and structural flexibility in the NMR ensemble (Fig. 1f,g). A more comprehensive systematic comparison, for a large set of proteins, reveals good agreement between CheZOD Z-scores and other measures of disorder, including structural variability in MD simulations^53,54 (see Supplementary Results 1 and Supplementary Figs S2–S5).

Benchmarking the performance of disorder predictors

Above, a qualitative agreement was observed between Z-scores and estimated disorder probabilities for p53 with some noteworthy differences between individual predictors. To analyze the agreement systematically, disorder predictions were obtained for the 117 proteins in the CheZOD database as described in Online Methods. The calculated Z-scores were compared to the estimated disorder probabilities for a large set of different disorder predictors (see Table 1 and Online Methods) with the aim of identifying the best methods as those having the best agreement between estimated disorder probabilities and Z-scores. Figure 2 shows scatter plots of the Z-scores vs. the estimated probabilities (Z vs. p) for each predictor. It is seen that most predictors provide relatively high estimated probabilities of disorder for residues with low Z-scores and correspondingly lower probabilities for residues with high Z-scores. Qualitative agreement is observed, but the predictions are clearly different, with different qualities and biases in the correlation with Z-scores. To assess this agreement quantitatively, we take full advantage of the continuous descriptor of disorder by determining the Pearson correlation coefficient, R_P, of agreement (see Fig. 3). This number is ideal for ranking the predictors from the best (largest absolute value) to the worst. As Z-scores increase with order while p is a measure of disorder, –1 indicates a perfect correlation and 0 expresses a complete lack of correlation. It is seen that binary predictors show poor correlation, while the newer, continuous methods SPOT-disorder⁵⁵, MFDp2¹⁴ and AUCpreD⁵⁶ predict best (Table 1 and Fig. 3). Furthermore, the genesilico metapredictors¹³ perform slightly better than all the methods used by the metapredictors but slightly inferior to the newer methods mentioned above (Table 1 and Fig. 3).
The ESpritz⁵⁷ methods perform increasingly well, when trained on DisProt data, X-ray data, and NMR data, respectively (Table 1 and Fig. 3). Two methods that use NMR data for training – s2D⁵⁸ and DynaMine⁵⁹ – were also included. These methods were trained on continuous-valued target data; i.e. chemical shift derived secondary structure populations for s2D and local fast dynamics, as defined by the order parameter, for DynaMine. Here we interpret the predicted populations of non-alpha-helix/beta-sheet as the probability of disorder and use a bijective transformation of the predicted order parameters to convert it to a pseudo-probability (see Online Methods). Judged by the Pearson correlation coefficient, these two methods are ranked in the middle for predicting Z-scores. The Spearman rank correlation coefficient, R_S, describing the agreement with a monotonic relationship between p and Z (not necessarily linear) was also calculated, and showed the same trend for the predictors (see Table 1 and Supplementary Fig. S6).
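The two correlation measures used for ranking can be illustrated with a minimal standard-library sketch. The function names and the toy values below are hypothetical and are not taken from the CheZOD database; the code only shows how R_P and R_S behave when Z-scores (order) are compared with predicted disorder probabilities.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-residue values, for illustration only.
z_scores = [1.0, 2.5, 4.0, 9.0, 12.0, 15.0]        # high Z = ordered
p_disorder = [0.95, 0.90, 0.70, 0.40, 0.15, 0.05]  # high p = disordered

r_p = pearson(z_scores, p_disorder)
r_s = spearman(z_scores, p_disorder)
```

Because Z increases with order while p measures disorder, a good predictor gives values close to –1 for both coefficients; R_S reaches –1 for any strictly monotonic decreasing relationship, even a non-linear one.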

Table 1 Performance of disorder predictors.

Full size table

It is evident from Fig. 2 that predictions and Z-scores cluster in four quadrants due to the underlying bimodal distribution of Z-scores²² and the binary nature of the classification used for training the methods. A very slight over-representation of “medium-range” Z-scores (close to 8.0) for average probabilities (close to 0.5) is seen only for the best ranked methods and IUPred⁶⁰. To enable comparison with previous benchmarks, we also performed analysis for a binary classification of disorder using the definition Z < 8 for disorder. This Z-score threshold provides the optimal agreement for a binary classification of order/disorder for all prediction methods on average (see Supplementary Fig. S10). A good predictor should optimize the fraction of correctly identified disordered residues (true positives, TP) while simultaneously minimizing the fraction of false positives (FP). ROC curves display TP vs. FP as a function of the probability threshold and the corresponding area under this curve (AUC) is an aggregate measure of the quality of a predictor that is not affected by any skew/bias of the estimated probabilities. A perfect classifier would yield AUC = 1, whereas random guessing gives AUC = 0.5. ROC curves for all predictors are shown in Fig. 3 and the AUC values are listed in Table 1. The non-binary methods display AUCs ranging from 0.733 (MetaDisorder3D) to 0.890 (SPOT-disorder) and reiterate the trends described above for the ranking of the predictors (see Table S1 and Supplementary Fig. S7).
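As a rough illustration of the binary evaluation, the sketch below labels residues with Z < 8 as disordered and computes the AUC through its rank interpretation: the probability that a randomly chosen disordered residue receives a higher predicted p than a randomly chosen ordered one, with ties counted as one half. This is equivalent to the area under the ROC curve, but the function name, the handling of residues with Z exactly equal to 8, and the simple pairwise loop are assumptions made for clarity rather than the procedure used in the paper.

```python
def roc_auc(z_scores, p_disorder, z_threshold=8.0):
    """AUC for disorder prediction, with disorder defined as Z < z_threshold."""
    # Positive class: disordered residues. Residues with Z equal to the
    # threshold are treated as ordered here (an assumption of this sketch).
    pos = [p for z, p in zip(z_scores, p_disorder) if z < z_threshold]
    neg = [p for z, p in zip(z_scores, p_disorder) if z >= z_threshold]
    if not pos or not neg:
        raise ValueError("both ordered and disordered residues are required")
    wins = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A predictor that ranks every disordered residue above every ordered one gives 1.0, a constant predictor gives 0.5, and the value does not change under any monotonic rescaling of p, which is why the AUC is insensitive to skew in the estimated probabilities.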

It is apparent from Fig. 2 that some predictors are continuous, while others are more bimodal. In addition, for some methods predictions cluster on one side, suggesting a prediction bias. To quantify this bias of over-predicting order or disorder, the average probability of predicting low Z-scores (pZL for Z-scores < 8.0) and high Z-scores (pZH, Z-scores > 8.0) was calculated for each method. An unbiased method would have an average probability pZA = (pZL + pZH)/2 close to 0.5. At the same time, methods with good discrimination between order and disorder will display a large probability difference, pZD = pZL − pZH. Figure 4 plots the average probability (pZA, bias) against the probability difference. It is seen that DISOPRED2⁴, DisEMBL hotloops⁶¹ and GlobPlot⁶² are biased towards under-predicting disorder (e.g. using pZA < 0.3). On the other hand, no methods over-predict disorder (no method has pZA > 0.7). Along the other axis, SPOT-disorder has the highest probability difference suggesting the best (formal) discrimination between order and disorder. The above findings are mirrored in a classical confusion-based analysis (see Supplementary Table S2) except that for DISOPRED2 and DisEMBL hotloops a probability cut-off different from p = 0.5 was used, and therefore no significant over-prediction of order was found by this analysis. GlobPlot and ESpritz-Xray⁵⁷ methods have False Negative Rates (FNRs) as high as 0.98 and 0.718, respectively, but at the other end of the extreme, the methods with the highest False Positive Rates (FPRs), ESpritz_DisProt and DISPROT⁶³ (VSL2b), have FPRs of 0.415 and 0.401, respectively, and do not over-predict disorder to a similar extent.
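The bias and discrimination measures defined above follow directly from the per-residue data. The sketch below is a minimal illustration; the function name and the dictionary return format are hypothetical choices.

```python
def bias_metrics(z_scores, p_disorder, z_threshold=8.0):
    """Return pZL, pZH, pZA (bias) and pZD (discrimination) for one predictor."""
    low = [p for z, p in zip(z_scores, p_disorder) if z < z_threshold]
    high = [p for z, p in zip(z_scores, p_disorder) if z > z_threshold]
    if not low or not high:
        raise ValueError("need residues both below and above the threshold")
    pzl = sum(low) / len(low)    # mean predicted disorder for low Z-scores
    pzh = sum(high) / len(high)  # mean predicted disorder for high Z-scores
    return {
        "pZL": pzl,
        "pZH": pzh,
        "pZA": (pzl + pzh) / 2,  # close to 0.5 for an unbiased method
        "pZD": pzl - pzh,        # large for good order/disorder discrimination
    }
```

With the cut-offs quoted in the text, a method with pZA < 0.3 would be flagged as under-predicting disorder and one with pZA > 0.7 as over-predicting it.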

Discussion

Residues with missing X-ray densities are relatively rare, with only 2.4% of the residues being non-observed in the dataset tested here (see Methods) and 8.6% in a set used for training SPOT-disorder⁵⁵ (see Supplementary Discussion and Supplementary Table S2), and the disordered regions identified in X-ray data are relatively short (Supplementary Table S2). Conversely, long regions of disordered residues as well as completely disordered proteins are abundant in the DisProt database^19,20 (see Supplementary Discussion). This pronounced difference between the two data sources has long been realized, and complementary methods dedicated to predicting either short or long regions of disorder have been developed by training on X-ray or DisProt data, respectively^57,64,65. Interestingly, yet perhaps not surprisingly, dedicated subversions of predictors show the best performance when evaluated on the same type of data as were used for training^21,66. To elaborate on this, the CheZOD database was divided into different subsets chosen to represent data sets with different characteristics, e.g. content of disorder and size of disordered regions (see Supplementary Discussion). It was found that the ranking of the prediction methods was generally preserved and that the performance on the different subsets reflects the data used for training the methods (see Supplementary Discussion and Supplementary Figs S8 and S9). Since the CheZOD database is diverse and balanced, containing both structured proteins with short and long disordered loops as well as completely disordered proteins²², it is ideal for assessing the performance of predictors of general disorder of no particular flavor.

NMR-derived Z-scores for proteins in the CheZOD database have been applied here in an attempt to rigorously benchmark the performance of a large number of disorder predictors (see Table 1). Contrary to CASP chronological extrapolations outlined above, it was found that the most recent predictors feature improved performance. Notably, the newer implementation of DISOPRED, DISOPRED3⁶⁷, performs significantly better than the older version, DISOPRED2⁴. Several trends in the performance of the predictors related to the type of inputs and optimization procedure were observed. Older methods and methods that focus on speed use only amino acid (AA) sequence-based features, such as AA composition, physicochemical properties, interaction energies and sequence complexity, and display comparatively poorer performance. Inclusion of evolutionary information derived from multiple sequence alignment profiles expands the repertoire with complementary features. The group of predictors here that use evolutionary (Evo) information generally perform better than the predictors without it (Table 1). Finally, the metapredictors that use estimated disorder probabilities from other predictors display very good performance.

To compare the authority of different data sources to judge disorder, we perform a comparison across data for the same methods by deriving traditional binary classifier metrics: the AUC and the Matthews correlation coefficient (MCC) (see e.g.¹⁸). MCC is a balanced measure of correlation that considers false and true positives as well as their negatives. The AUC and MCCs were calculated and compared to values reported in the literature for testing against DisProt²¹ and X-ray data, as summarized previously⁹ (see Table S1 and Fig. 5). We find that values for both AUC and MCC are significantly higher for the same predictors when compared to the DisProt and X-ray evaluation sets, respectively (Fig. 5 and Table S1). This strongly suggests that the CheZOD Z-score classifier is more predictable and more accurate, in the sense that it contains fewer misclassifications.
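For reference, the MCC can be computed from the four confusion-matrix counts as sketched below. Treating disorder (Z < 8) as the positive class, using p ≥ 0.5 as the prediction cut-off, and returning 0 when the denominator vanishes are assumptions of this sketch; as noted above, some predictors use a different probability cut-off.

```python
import math

def confusion_counts(z_scores, p_disorder, z_threshold=8.0, p_cutoff=0.5):
    """Count TP, TN, FP, FN with disorder (Z < z_threshold) as the positive class."""
    tp = tn = fp = fn = 0
    for z, p in zip(z_scores, p_disorder):
        is_disordered = z < z_threshold
        predicted_disordered = p >= p_cutoff
        if is_disordered and predicted_disordered:
            tp += 1
        elif is_disordered:
            fn += 1
        elif predicted_disordered:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 by convention if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def error_rates(tp, tn, fp, fn):
    """False negative rate and false positive rate."""
    return fn / (fn + tp), fp / (fp + tn)
```

For example, hypothetical counts of TP = 40, TN = 45, FP = 5 and FN = 10 give an MCC of about 0.70, an FNR of 0.2 and an FPR of 0.1.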

The analysis presented here provides a guideline for selecting the most appropriate predictor for assessing disorder and for avoiding intrinsic bias. As a case in point, DISOPRED2 was used to estimate the content of disordered residues in various proteomes, revealing a content of ca. 33% in Eukaryotes⁴. Importantly, our analysis now shows that DISOPRED2 markedly under-predicts disorder, suggesting that protein disorder in eukaryotes is even more prevalent than previously assumed.

Conclusions

We have demonstrated that validated, balanced NMR chemical shift data of proteins can be used to benchmark widely-used disorder predictors. Cross-data comparison of the performance for the same predictors demonstrated that the CheZOD dataset is more appropriate than previously utilized sources. A detailed analysis revealed that the most recent and most advanced prediction methods display the best performance, and bias for under-predicting disorder was evaluated quantitatively. We provided several performance measures to help researchers make an informed decision for selection of the most appropriate disorder prediction method.

Methods

Production of disorder probabilities for the proteins in the benchmarking set

The genesilico metaserver (http://iimcb.genesilico.pl/metadisorder/) was used to obtain estimated probabilities of disorder for a range of different disorder prediction methods (see Table 1 in main text) including their own meta-predictors. Furthermore, we added predictions from several other methods where parallel batch job submission was possible using their servers: SPOT-disorder⁵⁵, MFDp2¹⁴, AUCpreD⁵⁶, three versions of ESpritz⁵⁷ based on different training data, viz. X-ray missing density, NMR ensemble structural disorder classification and DisProt disorder. Classic DisEMBL binary predictions were replaced by continuous predictions using the automatic job submission system at http://dis.embl.de. DynaMine⁵⁹ and s2D⁵⁸, which predict continuous NMR data, were also included. Predictions of populations of secondary structure types from s2D were interpreted using the sum of the estimated populations of alpha-helix and beta-sheet as a probability of order, as before²⁴. Predictions of the order parameter S² from DynaMine were converted to a probability of disorder using the bijective transformation $p=\sqrt{1-{S}^{2}}$. To summarize, the prediction methods tested were: MetaDisorder including MD/MD2/3D variants¹³, SPOT-disorder⁵⁵, AUCpreD (with/without evolution)⁵⁶, MFDp2¹⁴, PrDOS⁶⁸, RONN⁶⁹, DISpro⁷⁰, DISOPRED2⁴, DISOPRED3⁶⁷, s2D⁵⁸, DynaMine⁵⁹, ESpritz NMR/Xray/DisProt variants⁵⁷, DISPROT⁶³ (VSL2b) (also referred to as PONDR), IUPred long/short variants⁶⁰, Pdisorder (http://www.softberry.com/), DisEMBL coils/remark465/hotloops variants⁶¹ and GlobPlot⁶².
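The two conversions applied to the s2D and DynaMine outputs can be written as one-line functions. The sketch below assumes that the s2D populations are fractions between 0 and 1 and that the predicted S² lies in the interval [0, 1]; the function names are hypothetical.

```python
import math

def s2d_disorder_probability(p_helix, p_sheet):
    """Disorder pseudo-probability from s2D populations.

    The sum of the alpha-helix and beta-sheet populations is read as the
    probability of order, so the remainder is the probability of disorder.
    """
    return 1.0 - (p_helix + p_sheet)

def dynamine_disorder_probability(s2):
    """Disorder pseudo-probability from a predicted order parameter S^2."""
    if not 0.0 <= s2 <= 1.0:
        raise ValueError("S^2 is expected to lie in [0, 1]")
    return math.sqrt(1.0 - s2)
```

The square-root mapping is monotonic, so it changes the shape of the relationship with the Z-scores (and hence R_P) but not the rank ordering of residues (and hence not R_S or the AUC). For instance, S² = 0.75 maps to p = 0.5.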

The set of structured proteins with chemical shifts

The database of structured proteins described before⁴⁹ was used. However, in the present study we did not exclude entries homologous to proteins from the CheZOD database, leading to a final set of 896 proteins with assigned chemical shifts. From this set, 222 protein structures were determined by X-ray crystallography whereas the remaining 674 were determined by NMR spectroscopy. A trimmed unbiased set of X-ray structures was derived from the set of 222 proteins by removing entries if (i) the biologically significant oligomerization state was not a monomer, (ii) larger ligands were present, (iii) the protein sequence of the X-ray structure and the corresponding sequence of assigned chemical shifts differed for more than 10% of the residues. These criteria resulted in a reduced database of 90 entries. For both sets of X-ray structures, residues in the X-ray sequence (SEQRES record) that were absent in the coordinate section (i.e. those mentioned in the REMARK 465) were identified. Following this procedure, we identified 717 missing residues in the set of 222 X-ray structures compared to 30495 residues that were observed in the structure, and similarly 234/13581 for the reduced set of 90 entries. Note that only residues with assigned chemical shifts in the corresponding NMR study were included in the above analysis. Within the set of entries corresponding to NMR structures, the 100 with the highest fraction of residues with CheZOD Z-scores < 5.0 were selected and used for comparison with the parameters (see below) describing structural variation in the corresponding NMR ensemble of structures. Furthermore, we identified 23 proteins from the refDB database⁷¹ described above having chemical shifts assigned for all backbone atom types that had available simulated molecular dynamics trajectories in the Dynameomics database^53,54. The Z-scores were compared to the rms Cα coordinate fluctuations within the MD trajectories for these proteins.

Definition of torsion angle and coordinate variations and order parameters

The dihedral angle order parameter S_HW of Hyberts, Wagner and co-workers⁷² is defined as:

$${S}_{HW}(\theta )=\frac{1}{N}\sqrt{{(\sum _{i=1}^{N}\sin ({\theta }_{i}))}^{2}+{(\sum _{i=1}^{N}\cos ({\theta }_{i}))}^{2}}$$

(1)

for an ensemble of N structures, where θ_i is the value of a particular dihedral angle θ in the i^th member of the ensemble. Based on the backbone dihedral angles ϕ and ψ, the sequence-specific backbone dihedral angle parameter, D_i, for residue i in a protein sequence is defined as:

$${D}_{i}=\frac{1}{6}\sum _{j=i-1,\,i,i+1}({S}_{HW}({\varphi }_{j})+{S}_{HW}({\psi }_{j}))$$

(2)

This order parameter is converted to a torsion angle standard deviation, s(i), using the approximate relation⁷²:

$$s(i)=2\arccos (1+\frac{\mathrm{ln}({D}_{i})}{2})$$

(3)
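Equations (1) to (3) can be sketched in a few lines of Python. The input layout (one list of angles per residue, in radians, with one value per ensemble member), the restriction to interior residues, the clamping of the arccos argument, and the conversion of the result to degrees are assumptions of this sketch and are not specified in the text.

```python
import math

def s_hw(angles_rad):
    """Eq. (1): circular order parameter of one dihedral angle over an ensemble."""
    n = len(angles_rad)
    sin_sum = sum(math.sin(a) for a in angles_rad)
    cos_sum = sum(math.cos(a) for a in angles_rad)
    return math.hypot(sin_sum, cos_sum) / n

def backbone_parameter(phi, psi, i):
    """Eq. (2): D_i, the mean S_HW over phi and psi of residues i-1, i, i+1.

    phi[j] and psi[j] hold the angles (radians) of residue j across the ensemble.
    Only interior residues are handled; terminal residues lack a neighbour.
    """
    if i < 1 or i > len(phi) - 2:
        raise ValueError("D_i is defined here for interior residues only")
    total = 0.0
    for j in (i - 1, i, i + 1):
        total += s_hw(phi[j]) + s_hw(psi[j])
    return total / 6.0

def torsion_sd_degrees(d_i):
    """Eq. (3): approximate torsion angle standard deviation s(i), in degrees."""
    arg = 1.0 + math.log(d_i) / 2.0
    # Guard against rounding (D_i marginally above 1) and very small D_i,
    # for which the argument leaves the domain of arccos.
    arg = max(-1.0, min(1.0, arg))
    return math.degrees(2.0 * math.acos(arg))
```

A perfectly rigid residue has D_i = 1 and therefore s(i) = 0, while lower D_i values map to larger angular spreads; with this formula D_i = 0.5 corresponds to roughly 98°.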

A parameter describing the variation in Cartesian coordinates for a specific residue is derived from the inter-atomic variance matrix (IVM) following a procedure akin to the FindCore algorithm⁷³. Each element, v_ij in the variance matrix is defined as:

$${v}_{ij}=\frac{1}{N}\sum _{k=1}^{N}{({d}_{ijk}-{\bar{d}}_{ij})}^{2},\,{\bar{d}}_{ij}=\frac{1}{N}\sum _{k=1}^{N}{d}_{ijk}$$

(4)

where d_ijk is the Cα(i)-Cα(j) distance for conformer k in the ensemble.

Each row, v_i, in the matrix, excluding diagonal and next-to-diagonal elements, v_ii and v_ij with |i-j| = 1 is sorted numerically and indexed by increasing rank:

$${\lambda }_{i1} < {\lambda }_{i2} < \cdots < {\lambda }_{in}$$

(5)

where ${\lambda }_{ij}={v}_{iq}$ is the j’th smallest element of the row v_i and n denotes the total number of such variance elements – i.e. the number of residues minus 3.

The residue coordinate variation, t(i), is then calculated as the weighted average:

$$t(i)=\frac{{\sum }_{j=1}^{n}{w}_{j}\sqrt{{\lambda }_{ij}}}{{\sum }_{j=1}^{n}{w}_{j}},\,{w}_{j}={e}^{-\beta {(\frac{j}{n})}^{2}}$$

(6)

where β = 10.0 is used here. The parameters, s and t, describing the residue angle and coordinate variation, respectively, are then converted to the corresponding order parameters, S and T, using:

$${\rm{S}}=\frac{1}{(1+{(\frac{{\rm{s}}}{{{\rm{s}}}_{0}})}^{2})}\,{\rm{a}}{\rm{n}}{\rm{d}}\,{\rm{T}}=\frac{1}{(1+{(\frac{{\rm{t}}}{{{\rm{t}}}_{0}})}^{2})}$$

(7)

where s₀ = 75° and t₀ = 1.5 Å were used here as the reference values.
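Equations (4) to (7) can be combined into a short sketch that goes from Cα coordinates of an ensemble to the coordinate variation t(i) and the order parameters. The input layout (`ca_coords[k][i]` is the (x, y, z) position, in Å, of the Cα atom of residue i in conformer k) and the function names are hypothetical. For terminal residues the row contains one element more than for interior residues, so the sketch uses the actual row length as n.

```python
import math

def coordinate_variation(ca_coords, beta=10.0):
    """Eqs. (4)-(6): residue coordinate variation t(i) for every residue."""
    n_conf = len(ca_coords)
    n_res = len(ca_coords[0])
    if n_res < 4:
        raise ValueError("at least 4 residues are needed")
    t = []
    for i in range(n_res):
        row = []
        for j in range(n_res):
            if abs(i - j) <= 1:
                continue  # skip diagonal and next-to-diagonal elements
            d = [math.dist(ca_coords[k][i], ca_coords[k][j]) for k in range(n_conf)]
            mean = sum(d) / n_conf
            row.append(sum((x - mean) ** 2 for x in d) / n_conf)  # v_ij, Eq. (4)
        row.sort()  # lambda_i1 <= lambda_i2 <= ..., Eq. (5)
        n = len(row)
        # Weights decay with rank, so the smallest variances dominate, Eq. (6).
        weights = [math.exp(-beta * ((j + 1) / n) ** 2) for j in range(n)]
        weighted = sum(w * math.sqrt(v) for w, v in zip(weights, row))
        t.append(weighted / sum(weights))
    return t

def order_parameter(value, reference):
    """Eq. (7): map a variation (s or t) to an order parameter in (0, 1]."""
    return 1.0 / (1.0 + (value / reference) ** 2)

S0_DEGREES = 75.0    # reference value for the angle variation s
T0_ANGSTROM = 1.5    # reference value for the coordinate variation t
```

With these definitions, S = `order_parameter(s, S0_DEGREES)` and T = `order_parameter(t, T0_ANGSTROM)`; a variation equal to its reference value gives an order parameter of 0.5, and a rigid ensemble gives 1.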

The Jensen-Shannon divergence, JSD⁷⁴, describes the similarity between two (discrete) probability distributions, P and Q.

$$JSD(P,Q)=\frac{1}{2}D(P||M)+\frac{1}{2}D(Q||M)$$

(8)

where M is the average of the distributions

$$M=\frac{1}{2}(P+Q)$$

(9)

and D is the Kullback-Leibler divergence:

$$D(P||M)=\sum _{i}P(i)\mathrm{log}(\frac{P(i)}{M(i)})$$

(10)

Here we calculate JSD for the distributions of Z-scores corresponding to above/below the reference values s₀ = 75° and t₀ = 1.5 Å for the residue angle and coordinate variation, respectively, and for residues corresponding to observed residues in X-ray structures vs. missing residues (REMARK 465).
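Equations (8) to (10) translate directly into code for two discrete distributions defined over the same bins. The sketch below uses the natural logarithm, which is an assumption because the base of the logarithm is not stated; with this choice the JSD ranges from 0 (identical distributions) to ln 2 (non-overlapping distributions). Terms with P(i) = 0 are taken to contribute zero.

```python
import math

def kl_divergence(p, m):
    """Eq. (10): Kullback-Leibler divergence D(P||M) of discrete distributions."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def jsd(p, q):
    """Eqs. (8)-(9): Jensen-Shannon divergence between P and Q."""
    if len(p) != len(q):
        raise ValueError("P and Q must be defined over the same bins")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # Eq. (9)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

In the present context P and Q would be normalized histograms of Z-scores for the two groups of residues being compared, for example observed vs. missing residues in X-ray structures.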

Data Availability

The full database containing protein sequences, BMRB id, and CheZOD Z-scores is available at http://www.protein-nmr.org./.

References

Dyson, H. J. & Wright, P. E. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol 6, 197–208 (2005).
Article CAS Google Scholar
Wright, P. E. & Dyson, H. J. Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol 16, 18–29 (2015).
Article CAS Google Scholar
van der Lee, R. et al. Classification of intrinsically disordered regions and proteins. Chem Rev 114, 6589–6631 (2014).
Article Google Scholar
Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337, 635–645 (2004).
Article CAS Google Scholar
Uversky, V. N., Oldfield, C. J. & Dunker, A. K. Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys 37, 215–246 (2008).
Article CAS Google Scholar
Romero, P., Obradovic, Z. & Dunker, A. K. Natively disordered proteins: functions and predictions. Appl Bioinformatics 3, 105–113 (2004).
Article CAS Google Scholar
Midic, U., Oldfield, C. J., Dunker, A. K., Obradovic, Z. & Uversky, V. N. Unfoldomics of human genetic diseases: illustrative examples of ordered and intrinsically disordered members of the human diseasome. Protein Pept Lett 16, 1533–1547 (2009).
Article CAS Google Scholar
Atkins, J. D., Boateng, S. Y., Sorensen, T. & McGuffin, L. J. Disorder Prediction Methods, Their Applicability to Different Protein Targets and Their Usefulness for Guiding Experimental Studies. Int J Mol Sci 16, 19040–19054 (2015).
Article CAS Google Scholar
Meng, F. C., Uversky, V. N. & Kurgan, L. Comprehensive review of methods for prediction of intrinsic disorder and its molecular functions. Cellular and Molecular Life Sciences 74, 3069–3090 (2017).
Article CAS Google Scholar
Oates, M. E. et al. D²P²: database of disordered protein predictions. Nucleic Acids Research 41, D508–D516 (2013).
Piovesan, D. et al. MobiDB 3.0: more annotations for intrinsic disorder, conformational diversity and interactions in proteins. Nucleic Acids Research 46, D471–D476 (2018).
Article CAS Google Scholar
Di Domenico, T., Walsh, I. & Tosatto, S. C. E. Analysis and consensus of currently available intrinsic protein disorder annotation sources in the MobiDB database. Bmc Bioinformatics 14 (2013).
Kozlowski, L. P. & Bujnicki, J. M. MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinformatics 13, 111 (2012).
Article ADS Google Scholar
Mizianty, M. J., Peng, Z. & Kurgan, L. MFDp2. Intrinsically Disordered. Proteins 1, e24428 (2013).
Google Scholar
Schlessinger, A., Punta, M., Yachdav, G., Kajan, L. & Rost, B. Improved Disorder Prediction by Combination of Orthogonal Approaches. PLoS One 4 (2009).
Moult, J., Pedersen, J. T., Judson, R. & Fidelis, K. A large-scale experiment to assess protein structure prediction methods. Proteins 23, ii–v (1995).
Monastyrskyy, B., Fidelis, K., Moult, J., Tramontano, A. & Kryshtafovych, A. Evaluation of disorder predictions in CASP9. Proteins 79(Suppl 10), 107–118 (2011).
Monastyrskyy, B., Kryshtafovych, A., Moult, J., Tramontano, A. & Fidelis, K. Assessment of protein disorder region predictions in CASP10. Proteins 82, 127–137 (2014).
Sickmeier, M. et al. DisProt: the Database of Disordered Proteins. Nucleic Acids Res 35, D786–793 (2007).
Piovesan, D. et al. DisProt 7.0: a major update of the database of disordered proteins. Nucleic Acids Res 45, D219–D227 (2017).
Necci, M., Piovesan, D., Dosztanyi, Z., Tompa, P. & Tosatto, S. C. E. A comprehensive assessment of long intrinsic protein disorder from the DisProt database. Bioinformatics 34, 445–452 (2018).
Nielsen, J. T. & Mulder, F. A. A. There is Diversity in Disorder—“In all Chaos there is a Cosmos, in all Disorder a Secret Order”. Frontiers in Molecular Biosciences 3 (2016).
Toth-Petroczy, A. et al. Structured States of Disordered Proteins from Genomic Sequences. Cell 167, 158–170.e112 (2016).
Sormanni, P. et al. Simultaneous quantification of protein order and disorder. Nat Chem Biol 13, 339–342 (2017).
Wuthrich, K. Protein-structure determination in solution by nmr-spectroscopy. J Biol Chem 265, 22059–22062 (1990).
Wagner, G., Hyberts, S. G. & Havel, T. F. NMR structure determination in solution - a critique and comparison with x-ray crystallography. Ann Rev Biophys Biomol Struct 21, 167–198 (1992).
Brunger, A. T. & Nilges, M. Computational challenges for macromolecular structure determination by x-ray crystallography and solution nmr-spectroscopy. Q Rev Biophys 26, 49–125 (1993).
Guntert, P. Structure calculation of biological macromolecules from NMR data. Q Rev Biophys 31, 145–237 (1998).
Wuthrich, K. NMR studies of structure and function of biological macromolecules (Nobel Lecture). Angew Chem Int Ed 42, 3340–3363 (2003).
Palmer, A. G., Kroenke, C. D. & Loria, J. P. Nuclear magnetic resonance methods for quantifying microsecond-to-millisecond motions in biological macromolecules. Nucl Magn Reson. Biol Macromol, Pt B 339, 204–238 (2001).
Palmer, A. G. NMR characterization of the dynamics of biomacromolecules. Chem Rev 104, 3623–3640 (2004).
Mittermaier, A. & Kay, L. E. Review - New tools provide new insights in NMR studies of protein dynamics. Science 312, 224–228 (2006).
Ulrich, E. L. et al. BioMagResBank. Nucleic Acids Research 36, D402–D408 (2008).
Felli, I. C. & Pierattelli, R. Recent progress in NMR spectroscopy: toward the study of intrinsically disordered proteins of increasing size and complexity. IUBMB Life 64, 473–481 (2012).
Brutscher, B. et al. NMR Methods for the Study of Intrinsically Disordered Proteins Structure, Dynamics, and Interactions: General Overview and Practical Guidelines. Adv Exp Med Biol 870, 49–122 (2015).
Wishart, D. S. & Sykes, B. D. Chemical-shifts as a tool for structure determination. Nucl Magn Reson, Pt C 239, 363–392 (1994).
Wishart, D. S. & Case, D. A. Use of chemical shifts in macromolecular structure determination. Nucl Magn Reson. Biol Macromol, Pt A 338, 3–34 (2001).
Berjanskii, M. V. & Wishart, D. S. A Simple Method To Predict Protein Flexibility Using Secondary Chemical Shifts. J Am Chem Soc 127, 14970–14971 (2005).
Marsh, J. A., Singh, V. K., Jia, Z. & Forman-Kay, J. D. Sensitivity of secondary structure propensities to sequence differences between alpha- and gamma-synuclein: implications for fibrillation. Protein Sci 15, 2795–2804 (2006).
Camilloni, C., De Simone, A., Vranken, W. F. & Vendruscolo, M. Determination of Secondary Structure Populations in Disordered States of Proteins Using Nuclear Magnetic Resonance Chemical Shifts. Biochemistry 51, 2224–2231 (2012).
Kjaergaard, M. & Poulsen, F. M. Disordered proteins studied by chemical shifts. Prog Nucl Magn Reson Spectrosc 60, 42–51 (2012).
Tamiola, K. & Mulder, F. A. A. Using NMR chemical shifts to calculate the propensity for structural order and disorder in proteins. Biochem Soc Trans 40, 1014–1020 (2012).
Kragelj, J., Ozenne, V., Blackledge, M. & Jensen, M. R. Conformational propensities of intrinsically disordered proteins from NMR chemical shifts. Chemphyschem 14, 3034–3045 (2013).
Best, R. B. & Lindorff-Larsen, K. Editorial overview: Theory and simulation: Interpreting experimental data at the molecular level. Curr Opin Struct Biol 49, IV–VI (2018).
Showalter, S. A. & Bruschweiler, R. Validation of molecular dynamics simulations of biomolecules using NMR spin relaxation as benchmarks: Application to the AMBER99SB force field. J Chem Theory Comput 3, 961–975 (2007).
Joerger, A. C. & Fersht, A. R. Annu Rev Biochem 77, 557–582 (2008).
Oldfield, C. J. et al. Flexible nets: disorder and induced fit in the associations of p53 and 14-3-3 with their partners. BMC Genomics 9 (2008).
Meek, D. W. Regulation of the p53 response and its relationship to cancer. Biochem J 469, 325–346 (2015).
Nielsen, J. T. & Mulder, F. A. A. POTENCI: prediction of temperature, neighbor and pH-corrected chemical shifts for intrinsically disordered proteins. J Biomol NMR 70, 141–165 (2018).
Uversky, V. N. p53 Proteoforms and Intrinsic Disorder: An Illustration of the Protein Structure-Function Continuum Concept. Int J Mol Sci 17 (2016).
Xue, B., Brown, C. J., Dunker, A. K. & Uversky, V. N. Intrinsically disordered regions of p53 family are highly diversified in evolution. Biochim Biophys Acta - Proteins and Proteomics 1834, 725–738 (2013).
Ayed, A. et al. Latent and active p53 are identical in conformation. Nat Struct Biol 8, 756–760 (2001).
Benson, N. C. & Daggett, V. Dynameomics: Large-scale assessment of native protein flexibility. Protein Sci 17, 2038–2050 (2008).
van der Kamp, M. W. et al. Dynameomics: A Comprehensive Database of Protein Dynamics. Structure 18, 423–435 (2010).
Hanson, J., Yang, Y., Paliwal, K. & Zhou, Y. Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks. Bioinformatics 33, 685–692 (2017).
Wang, S., Ma, J. & Xu, J. AUCpreD: proteome-level protein disorder prediction by AUC-maximized deep convolutional neural fields. Bioinformatics 32, i672–i679 (2016).
Walsh, I., Martin, A. J., Di Domenico, T. & Tosatto, S. C. ESpritz: accurate and fast prediction of protein disorder. Bioinformatics 28, 503–509 (2012).
Sormanni, P., Camilloni, C., Fariselli, P. & Vendruscolo, M. The s2D method: simultaneous sequence-based prediction of the statistical populations of ordered and disordered regions in proteins. J Mol Biol 427, 982–996 (2015).
Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T. & Vranken, W. F. From protein sequence to dynamics and disorder with DynaMine. Nat Commun 4, 2741 (2013).
Dosztanyi, Z., Csizmok, V., Tompa, P. & Simon, I. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21, 3433–3434 (2005).
Linding, R. et al. Protein disorder prediction: implications for structural proteomics. Structure 11, 1453–1459 (2003).
Linding, R., Russell, R. B., Neduva, V. & Gibson, T. J. GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res 31, 3701–3708 (2003).
Vucetic, S., Brown, C. J., Dunker, A. K. & Obradovic, Z. Flavors of protein disorder. Proteins 52, 573–584 (2003).
Hirose, S., Shimizu, K., Kanai, S., Kuroda, Y. & Noguchi, T. POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics 23, 2046–2053 (2007).
Shimizu, K., Hirose, S. & Noguchi, T. POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix. Bioinformatics 23, 2337–2338 (2007).
Walsh, I. et al. Comprehensive large-scale assessment of intrinsic protein disorder. Bioinformatics 31, 201–208 (2015).
Jones, D. T. & Cozzetto, D. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 31, 857–863 (2015).
Ishida, T. & Kinoshita, K. PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res 35, W460–464 (2007).
Yang, Z. R., Thomson, R., McNeil, P. & Esnouf, R. M. RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 21, 3369–3376 (2005).
Cheng, J., Sweredoski, M. J. & Baldi, P. Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Min. Knowl. Discov. 11, 213–222 (2005).
Zhang, H. Y., Neal, S. & Wishart, D. S. RefDB: A database of uniformly referenced protein chemical shifts. J Biomol NMR 25, 173–195 (2003).
Hyberts, S. G., Goldberg, M. S., Havel, T. F. & Wagner, G. The solution structure of eglin c based on measurements of many NOEs and coupling constants and its comparison with X-ray structures. Protein Sci 1, 736–751 (1992).
Snyder, D. A. & Montelione, G. T. Clustering algorithms for identifying core atom sets and for assessing the precision of protein structure ensembles. Proteins: Structure, Function, and Bioinformatics 59, 673–686 (2005).
Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans Inform Theory 37, 145–151 (1991).
Hrabe, T. et al. PDBFlex: exploring flexibility in protein structures. Nucleic Acids Res 44, D423–D428 (2016).
Canadillas, J. M. et al. Solution structure of p53 core domain: structural basis for its instability. Proc Natl Acad Sci USA 103, 2109–2114 (2006).
Rowell, J. P., Simpson, K. L., Stott, K., Watson, M. & Thomas, J. O. HMGB1-facilitated p53 DNA binding occurs via HMG-Box/p53 transactivation domain interaction, regulated by the acidic tail. Structure 20, 2014–2024 (2012).
Wong, T. S. et al. Biophysical characterizations of human mitochondrial transcription factor A and its binding to tumor suppressor p53. Nucleic Acids Res 37, 6765–6783 (2009).

Acknowledgements

We thank Dr. Lukasz P. Kozlowski for help with batch submissions on the genesilico server.

Author information

Authors and Affiliations

Interdisciplinary Nanoscience Center (iNANO), Aarhus University, Gustav Wieds Vej 14, 8000, Aarhus C, Denmark
Jakob T. Nielsen & Frans A. A. Mulder
Department of Chemistry, Aarhus University, Langelandsgade 140, 8000, Aarhus C, Denmark
Jakob T. Nielsen & Frans A. A. Mulder

Authors

Jakob T. Nielsen
Frans A. A. Mulder

Contributions

The project was designed and developed by F.A.A.M. and J.T.N. J.T.N. performed the mathematical and statistical analysis of the data and produced the figures. The paper was written by F.A.A.M. and J.T.N.

Corresponding authors

Correspondence to Jakob T. Nielsen or Frans A. A. Mulder.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Nielsen, J.T., Mulder, F.A.A. Quality and bias of protein disorder predictors. Sci Rep 9, 5137 (2019). https://doi.org/10.1038/s41598-019-41644-w

Received: 04 December 2018
Accepted: 13 March 2019
Published: 26 March 2019
DOI: https://doi.org/10.1038/s41598-019-41644-w
