Introduction

Intrinsically disordered proteins (IDPs) fail to form a specific stable 3D structure under native conditions. Instead, they are in a statistical equilibrium involving several more or less unfolded conformations dictated by the local amino acid sequence. The structural dynamics and flexibility found in IDPs has been linked to key biological processes involving regulatory and signaling functions1,2,3. This has led to a growing interest in the structural characterization of IDPs4,5,6. Biophysical techniques for characterizing protein structure, such as X-ray crystallography, small angle X-ray scattering7, and NMR spectroscopy can be used for characterizing disorder experimentally. However, the experimental characterization of IDPs is time-consuming, laborious, and expensive. To mitigate this problem, a large number of computational methods that aim to predict disorder from sequence have therefore emerged8,9.

Contemporary disorder prediction methods are trained on sets of protein sequences with experimentally annotated disorder/order classification. Recently, we discussed the shortcomings of current disorder classification procedures and introduced the use of NMR spectroscopic data as an alternative benchmark10. In short, X-ray crystallography is a widely used criterion for judging disorder, where missing electron density is interpreted as disorder. However, the requirement of producing crystals is not commensurate with the observation of disordered residues. Another frequently used source of annotation, containing more cases of disorder, is the community-maintained DisProt database, which provides annotations based on data from various experimental sources. Unfortunately, DisProt has inconsistent annotations due to the heterogeneous composition of techniques, often lacks position-specific information (e.g. annotation derived from CD and sensitivity to proteolytic degradation), and contains false classifications in some cases11. As an alternative, Nuclear Magnetic Resonance (NMR) spectroscopy can provide an accurate and residue-specific description of the structure and dynamics of IDPs. For example, the local variation in NMR ensembles has been used to define a disorder classification12,13. However, this classification depends also on the local precision of the NMR ensemble, which can vary substantially depending on the amount of available constraints and the protocol used to enforce the constraints and derive the structural ensemble. Furthermore, all available classifiers are binary, ignoring potentially meaningful intermediate disorder14,15. The primary NMR observables, the chemical shifts, are measured routinely, are very precise and provide information on the local structure of proteins in solution16. Due to their dynamic nature, IDPs exhibit statistically-averaged “random coil” chemical shifts17. Conversely, secondary chemical shifts (i.e. the deviation of measured chemical shifts from random coil chemical shifts) indicate formation of structure, and have therefore been used to quantify order/disorder and conformational propensities in IDPs16,18,19,20,21,22,23,24,25.

Previously, we introduced the CheZOD Z-score11, which is based on secondary chemical shifts, and quantifies the degree of local disorder on a continuous scale. Z-score profiles were derived for a set of 117 carefully selected, representative proteins (herein referred to as the "117" database), revealing a diverse spectrum of disorder. It was demonstrated that the Z-score scale, besides being a reliable measure of disorder, also agrees well with other measures of disorder, such as missing densities in X-ray structures, structural variation in NMR-derived structural ensembles and positional variation in MD trajectories10. The CheZOD database was used to benchmark the performance of 26 disorder prediction methods10 by assessing the agreement between the estimated probabilities of order and the experimental Z-scores. A modest correlation was found (best method shows an absolute Spearman correlation coefficient of 0.638), and all prediction methods proved inadequate to predict intermediate disorder/order. Furthermore, it was found that the accuracy of disorder prediction methods was limited by the quality of the training data.

We present here ODiNPred; Prediction of protein Order and Disorder by evaluation on NMR data. ODiNPred was trained on a greatly expanded version of the CheZOD database, with experimental continuous-valued disorder Z-scores for 1325 protein sequences (herein referred to as the "1325" database), which spans a comparable number of disordered and ordered residues. ODiNPred uses a deep neural network and 157 residue-specific sequence features to predict a real-valued Z-score of disorder, which can be converted to a probability of disorder. Previously, the “117” database was used to derive a comprehensive and detailed benchmarking of prediction methods10. To align with this analysis, ODiNPred was evaluated on the “117” database in a cross-validation setting and was found to outperform 26 recently-tested prediction methods with a Spearman correlation coefficient between observed and predicted Z-scores of 0.649. Prediction accuracy was equal to that of SPOT-disorder, suggesting that this algorithm is the best currently available that is not trained on NMR data. Furthermore, the performance of ODiNPred was assessed on the full 1325 protein sequences in a cross-validation setting, where it demonstrated superior performance. Four biologically relevant examples are provided to illustrate the utility of ODiNPred to comprehensively categorize protein order and disorder. ODiNPred is accessible at https://st-protein.chem.au.dk/odinpred.

Results and discussion

Training ODiNPred on a database with balanced order and disorder

The CheZOD dataset of 1325 protein sequences and their corresponding Z-scores was used to train ODiNPred. This database was constructed in a way to ensure balanced amounts of disordered and ordered residues (see “Methods” section). A histogram of all pooled Z-scores reveals a bimodal distribution (Fig. 1), as was previously seen before for the "117" database11, with Z-scores raging between − 5.0 and 16.15, where the lower end of the scale corresponds to fully disordered residues. The diversity, dynamic range, and balance of the CheZOD database is apparent from the visualization in Supplementary Fig. S1. A threshold value, Z = 8.0, is used to distinguish between disordered and ordered residues10. Residues with Z-scores < 3.0 can be considered fully disordered11, whereas 3.0 < Z < 8.0 corresponds to cases with fractional formation of local, ordered structure. Conversely, residues with Z > 11.0 correspond to segments of regular secondary structure or structured rigid loops, whereas 8.0 < Z < 11.0 corresponds to flexible loops between ordered segments. To investigate whether the distribution of experimental Z-scores could be interpreted as two broad classes of order and disorder, the distribution was fitted to a weighted sum of skew-normal distributions26 as described in “Methods” section. Indeed, a close fit was observed, and this model distribution is henceforth applied here for the statistical inference of disorder probabilities based on experimental Z-scores (see “Methods” section). According to this model distribution, the fraction of disordered residues in our database was 36.3%.

Figure 1
figure 1

Histogram of all Z-scores in the “1325” CheZOD database used for training ODiNPred. Fits to skew-normal distributions (see “Methods” section) are shown with dashed lines in green and red for disordered, and ordered residues, respectively. A full magenta line indicates the sum of the two distributions.

Disorder propensities for amino acid types

It is well-established that individual amino acid types have different disorder propensities27,28,29. Analysis of the cumulative distribution of Z-scores for individual amino-acid types gives a much more detailed picture along the full scale of disorder and reveals a trend that agrees well with previous findings (Fig. 2); Secondary structure-breaking amino acids, such as glycine, have a larger number of low Z-scores (higher disorder propensity), hydrophobic residues such as tryptophan and isoleucine have the smallest disorder propensities, while hydrophilic/charged side chains give neutral disorder propensities. Figure 2 reinforces that glutamine may be used as a representative for the average behavior of disordered residues21. Furthermore, proline displays a very distinct pattern of being disorder-promoting to structured regions, but not to highly disordered segments30,31.

Figure 2
figure 2

Cumulative distributions functions (CDFs) for Z-scores in the “1325” database used for training ODiNPred. The CDF for all residues combined is shown as a black curve, and CDFs corresponding to specific amino acid types are shown with different colors (see legend). Z-score median values are highlighted by filled circles.

Performance of ODiNPred evaluated by blind prediction of Z-scores

ODiNPred uses a deep neural network to predict continuous-valued disorder Z-scores by training on the "1325" database and used 157 unique sequence input features as described in “Methods” section. ODiNPred was trained in a tenfold cross-validation setting (see “Methods” section) that allows for blind evaluation of predictions for the 1325 sequences (and for any subset of these as discussed below). We note that cross-validation is not biased by training, since, per construction of the database (see “Methods” section), all sequences in each subset contain no more than 50% sequence identity to any of the sequences in the training set, and only include 5.7% identity on average. Figure 3 shows observed versus predicted Z-scores for the 1325 sequences in the CheZOD database. A good agreement was observed when evolutionary features were included (RPearson = 0.759), and when these were left out (RPearson = 0.731). Disorder was predicted for the 1325 sequences using three other popular methods, the fast and accurate MobiDB-lite32 and the two top-performing methods from the previous benchmark study on the "117" sequences10: SPOT-disorder33 and MFDp234. Compared to ODiNPred, a weaker correlation is apparent from the scatter plot in Fig. 3. As an alternative, we also evaluated the Spearman rank correlation coefficients (RSpearman), which compare the ranking of order probabilities to the ranking of Z-scores, without assuming that these should be linearly dependent. The highest RSpearman was obtained for ODiNPred (Fig. 3 and Table 1). To estimate the performance of our new prediction algorithm, we compared these results with a comprehensive selection of 26 contemporary predictors from a recent benchmark10. In this analysis, RSpearman was computed for the “117” database. It should be noted that, although some of the sequences from the “117” database were present in the “1325” database as well, all metrics were carried out in a strict cross-validation setting (see “Methods” section) to ensure proper blind predictions for validation. The values for RSpearman are shown as a bar plot in Fig. 4. It is apparent that ODiNPred and SPOT-disorder stand out as best-performing. Furthermore, when not using evolutionary features (which leads to significant time savings), ODiNPred is significantly more accurate than other methods. We also note that ODiNPred performs noticeably better than several other NMR-based methods such as ESpritz-NMR, S2D, and Dynamine, where the latter were trained on continuous-valued target data derived from NMR spectroscopy.

Figure 3
figure 3

Performance of selected disorder prediction methods on the “1325” cross-validation set. Scatter plots show (a) predicted Z-scores (Zpred) vs. observed Z-scores (Zobs) for ODiNPred for the merged cross-validation sets from CheZOD (see “Methods” section). (bd). Probability of order (porder equal to 1 minus the probability of disorder) vs. Z-scores for (b) SPOT-disorder, (c) MFDp2, and (d) MobiDB-lite. Note that MobiDB-lite provides fractions of consensus disorder among eight different fast predictors, hence values are restricted to the rational fractions: 0/8, 1/8, …, 8/8, and white noise with amplitude 0.03 was added to the predictions to allow for better visualization (nota bene: the correlation was computed prior to the adding of noise for graphical display).

Table 1 Spearman rank correlation coefficients for the four disorder predictors compared in Fig. 3. In case of ODiNPred the correlation is between observed and predicted Z-scores. In the other cases the correlation is derived between observed Z-scores and estimated probabilities of order.
Figure 4
figure 4

Performance of ODiNPred and other methods on the “117” benchmark set. Spearman rank correlation coefficients are shown as colored bars (see legend to Table 1). A dotted line separates methods that employ evolutional features (top) from those that do not (bottom).

Performance of ODiNPred on other benchmark data

To follow standard benchmarking procedures for disorder predictors, we also evaluated the performance of ODiNPred on CASP935 and CASP1036 datasets using the binary disorder/order classifier provided by CASP for target values. The area under the receiver operating characteristics curve, AUC (see “Methods” section), captures the ability to simultaneously detect disordered residues while also preventing false classification of ordered residues as disordered. The AUC was used in previous CASP evaluations to assess the performance of predictors. ODiNPred was evaluated using the estimated probabilities of disorder, and for CASP9 and CASP10 datasets, we obtained AUC = 0.760 and 0.790, respectively. In this comparison, ODiNPred is not deemed to be among the best predictors (which range between 0.56 and 0.855 for CASP9 and 0.599 to 0.907 for CASP1036). It should be noted, however, that in CASP9 and CASP10 disorder is highly under-represented, contributing only 9.2% and 5.9% of the target set, respectively10. Such an overwhelming imbalance suggests that the predictors trained on these datasets were trained to recognize features of order and would, consequently, be overly focused on identifying ordered residues correctly. Indeed, it was found that some predictors that are trained on X-ray data overestimate order10. The classical AUC for ROC (AUC-ROC) can be optimistic in cases with pronounced class imbalance. In contrast, the precision-recall would not have this optimism bias and will, in principle, be more suited for imbalanced test sets37. We, therefore, derived the AUC for the precision-recall curve (AUC-PR) and found an area of 0.365, compared to ranges between 0.193 and 0.603 in the CASP10 evaluation36. However, precision-recall analysis mainly focuses on the performance of the classifier of the minority class37, which is neither optimal, as predictors should be balanced to accurately recognize both disorder and order. This impedes an objective comparison with CASP. Instead, we argue that the CheZOD database is likely better suited than the CASP targets for assessing the quality of disorder predictions, since it contains balanced order/disorder that matches experimental observations, has more accurate disorder classification by continuous-valued targets for disorder, and has successfully been used to benchmark disorder predictors10. To substantiate this point, we derived the Mathews Correlation Coefficient (MCC) and AUC-ROC for the “1325” cross-validation set, using a threshold CheZOD value Z = 8 to demarcate disorder and order. Following this definition, AUC-ROC = 0.914 and MCC = 0.690. These numbers exceed performance indicators for other predictors as well as for various benchmark data sets. For example, for CASP10, the highest MCC is 0.53136. This result is in line with our previous study10, where also other predictors perform better on the “117” database than on DisProt or CASP X-ray datasets in terms of MCC and AUC-ROC. This reinforces the notion that the NMR-derived Z-score is a reliable and predictable classifier of protein disorder.

Importance of features

Application of noisy features will lead to over-fitting and will have a negative impact on the performance of neural networks. There appears to be little consensus on how to prune neural networks38. Here, Gaussian noise was applied after the first hidden layer to prevent over-fitting. In order to test for over-fitting with a more systematic procedure that provides specific insights, we implemented the permutation importance procedure39 to rank features on how important they are in predicting the target scores. This was implemented for each of the 10 validation datasets, shuffling each feature one-by-one (with 5 repetitions) while keeping all other features constant. The optimized neural network construct with all other parameters as described in the “Methods” section was applied. The squared Pearson Correlation (R2) between the prediction and the targets were calculated for the cross-validation subset and averaged over the 5 shuffle rounds. A feature was considered important if R2 decreased on shuffling it and vice versa. The specific magnitude of the change in R2 is not straightforward to interpret (it depends, among other factors, on the total number of features), but the relative magnitude of the change in R2 on shuffling was used to rank the features from most to least important (see Figure S2 in the Supporting Information). It was observed that the most important features were those accounting for (i) hydrophobic clusters, (ii) predicted secondary structure, and (iii) evolutionary relationship. These advanced features proved to be more important than the traditionally-applied single amino acid contributions and single univariate features derived from amino-acid specific properties such as isoelectric point. This can be explained by the fact that simple features can be constructed by a linear combination of the amino acid distribution features, whereas the more advanced features will be more orthogonal to the others. Features that account for repeats and linear motifs in a sequence had the least importance. We argue, that this is because these features are sparse (have relatively few non-zero values) and therefore do not contain much information on a statistical basis. However, these might be important for specific cases or for predicting contextual disorder/order as discussed below. In conclusion, all features were found to be important as groups and we, therefore, chose to keep all features.

Evaluation of ODiNPred on four important examples of disorder in biology

To validate and to illustrate the application of ODiNPred, it was applied to four well-studied proteins that were not part of the “1325” database:

Case 1: The human oncogene protein p53 is involved in numerous protein–protein interactions, which is reflected in its large span of conformations40. Recently, we analyzed disorder predictions, disorder annotations and Z-scores for p53, and found that disorder prediction was challenging10. With ODiNPred, however, we are now able to demonstrate a close correspondence between predicted Z-scores and those derived from experimental data (Fig. 5A)41 (RPearson = 0.76). Both the disordered terminal regions are correctly identified, as well as the internal disordered region between the two ordered domains. Concurrently, the DNA binding domain (middle part) and the tetramerization domain (res. 325–355) are predicted to be structured (high Z-scores). Furthermore, fluctuations in experimental Z-scores reveal relatively flexible loops (Z-scores between ca. 3 and 10) in the DNA binding domain (middle part), and ODiNPred correctly reproduces these flexible loops, albeit in some cases with slightly larger length or amplitude. It is worth noting that two patches in the otherwise disordered N-terminal domain are predicted to have intermediate order; stretches centered around residue 25 and 50. Indeed, the former region forms a small alpha-helix whereas the latter become structured upon binding of e.g. HMGB142.

Figure 5
figure 5

Applications of ODiNPred for disorder prediction. Top panels: Profiles of predicted Z-scores from ODiNPred (black) compared to Z-scores from experimental NMR data (green). Bottom panels: Derived probabilities of disorder estimated by ODiNPred (see “Methods” section) highlighting predicted disordered and ordered residues using red and blue bars, respectively. (A) Human oncogene protein p53 (Uniprot P04637, Z-scores from published chemical shifts41 and shifts deposited in the BioMagResBank entry 17,760). (B) Human Prion Protein (Uniprot P04156. BMRB id 440245). (C) DFD Drosophila HOX transcription factor (Uniprot P07548, BMRB id 2762154) segment. (D) TDP-43 (Uniprot Q13148). Z-scores were derived from NMR data from four separately studied domains: (i) N-terminal domain (NTD) residues 3–89 (BMRB id 34081)62, (ii) RNA recognition motif 1 (RRM1) residues 91–190 (BMRB id 18765)63, (iii) RRM2 domain residues 191–264 (BMRB id 19922)64, (iv) Low complexity domain (LCD) residues 268–413 (BMRB id 26823)59.

Case 2: The human prion protein (hPrP) is associated with fatal transmissible spongiform encephalopathies43,44. The structure determined by NMR reveals a folded C-terminal domain and a disordered N-terminus45. The C-terminal domain shows high experimental Z-scores and is correctly predicted as structured by ODiNPred (Fig. 5B). Again, it is noticeable that the two larger flexible loops in the C-terminal structured domain are correctly identified by ODiNPred to have increased flexibility. The N-terminal ~ 100 residues have low experimental and predicted Z-scores and, hence, are predicted correctly as disordered by ODiNPred. The quadruple octa-repeat (OR) regions (residues P60-Q91, Fig. 5B) have degenerate chemical shifts for the four repeats and consequently a repeating pattern of Z-scores. Noticeably, the Z-scores for the OR are slightly higher, on average, compared to the other residues in the flanking disordered region. The OR repeat is involved in the misfolding of PrP and constitutes an aggregation locus, influenced by intrinsic flexibility and environmental conditions46,47,48. The OR can adopt stable turn-like structures in the presence of co-factors, such as metal ions and sulfated glycans49,50,51,52, and these local structures were demonstrated to be transiently present under native conditions. This transient structure is reflected in elevated Z-scores for the OR. Coincidently, ODiNPred predicts slightly higher Z-scores for this segment.

Case 3: The “deformed” (DFD) HOX transcription factor controls the development of the labial and prothorax segments in Drosophila53. A segment of DFD containing a conserved 60-residue DNA-binding homeodomain and the 30 preceding residues (T337–K426) was studied by NMR spectroscopy and other biophysical characterization techniques54. ODiNPred predicts the C-terminal DNA-binding domain to be ordered whereas the 30 N-terminal residues are predicted as mostly disordered (Fig. 5C), in agreement with experiment. More specifically, intermediate Z-scores, 6 < Z < 11, are predicted for residues 2–18 (numbering as in Fig. 5C). Indeed, residues with intermediate experimental Z-scores are part of this segment. Residues 8–11 were demonstrated to be more rigid than the remainder of the disordered N-terminal region by NMR relaxation analysis and MD simulations. This segment with reduced flexibility is specifically recognized by other co-transcription factors54.

Case 4: The protein TDP-4355 binds to chromosomally integrated trans-activation response element (TAR) DNA and represses HIV-1 transcription56. In addition, it is implicated in amyotrophic lateral sclerosis (ALS) and neurodegenerative diseases57,58. ODiNPred correctly identifies the folded domains of the N-terminus and the two RNA recognition motifs, having both large experimental and predicted Z-scores (Fig. 5D). ODiNPred correctly predicts the flexible linkers between the isolated domains, which are disordered as defined by Z-scores < 8.0. Furthermore, ODiNPred correctly predicts the C-terminal low complexity domain (LCD) to be disordered, with a segment in the middle of intermediate order, as judged by intermediate observed and predicted Z-scores (residues 320–340). This specific segment has been shown to form a transient α-helix59,60 and is involved in liquid–liquid phase separation. The LCD is prone to pathological aggregation, with the α-helical segment mediating tertiary contacts that lead to oligomerization61, and mutations in this segment are correlated to ALS. Clearly, ODiNPred is able to accurately pinpoint segments with intermediate Z-scores, which corresponds to sites of important biological function which are implicated in human pathology.

The full spectrum of order and disorder

ODiNPred extends the repertoire of disorder prediction beyond the categorical and binary disordered and ordered states. We have demonstrated here that ODiNPred can accurately predict flexible parts of otherwise structured proteins as well as segments with transient structure or reduced flexibility within disordered regions (see Fig. 5). Such segments are potentially important for the biological function of many proteins, as the remarkable spatio-temporal heterogeneity of IDPs is closely linked to their interaction promiscuity and multifunctionality65. The ensemble of conformations sampled by IDPs constitutes a pre-existing equilibrium in which conformations are available for binding and interaction with ligands or other macromolecules66. Furthermore, for structured proteins, the unbound states of flexible loops contain transiently formed conformations that may resemble ligand-bound states67. Segments that are partially folded, or have transient residual structure (semi-foldons), as well as segments that can fold dependent on interactions (inducible foldons) are widespread in IDPs and are important for biological function68. Protein segments can also be conditionally disordered or transiently disordered depending on the environment and interactions69. In addition, a semi-disordered region is prone to aggregation in fully unstructured regions but disposed to local unfolding that exposes the hydrophobic core to aggregation in structured globular proteins. These important “semi-disordered segments” can be identified experimentally as segments with intermediate Z-scores. Since ODiNPred is trained with data covering the full spectrum of disorder, it can predict semi-disorder with confidence, as demonstrated above.

Disordered segments often interact with other proteins serving an important functional role70. The IDEAL database annotates a number of protein-binding IDR segments referred to as “protein segments” (ProSs)71. Interactions of IDRs have been categorized to occur for three types of segments; LCRs (Low Complexity Regions), SLiMs (Small Linear Motifs), and MoRFs (Molecular Recognition Features)72. LCRs are identified by their lower sequence complexity and often by their repetitiveness and are associated with the more generic role of mediating protein liquid–liquid phase separation73. SLiMs are distinct short conserved sequences that often mediate the interaction with specific proteins74 , and are collected in the ELM database75. Features that relate to SLiMs and LCDs were included in the features used by ODiNPred. These features appeared to have limited importance for predicting disorder (see above), which might be due to their limited number of non-zero values. However, these might be important for predicting local variation in disorder and contextual disorder. In contrast to SLiMs and LCDs, MoRFs are a much more general class of longer segments (10–70 residues) that gain some degree of structure upon binding to their targets76. MoRFs appear to have some latent propensity to form secondary structure, that could be predicted from sequence features as envisioned in the FELLS analysis77 . Other attempts have been made to predicts these segments from more general sequence features with some degree of success78,79,80,81,82 . The ANCHOR/ANCHOR2 method differs from other methods, estimating the disordering binding propensity as the product of the disorder probability and the estimated amino acid pair energy gained upon binding83,84,85,86. For example, for p53 discussed above, ANCHOR/ANCHOR2 assigns a high probability for disordered binding to the N-terminal region (residues 20–60) and the C-terminal region (residues 380–398)87, which was also predicted by MoRFMPM80,87. These regions contain confirmed MoRFs88,89,90 and are annotated as ProSs in the IDEAL database. In agreement, ODiNPred, predicts intermediate probabilities of disorder. This suggests that the MoRF segments have some pre-existing order bias, which is reflected in the dispersion of the average chemical shifts, and thereby gives rise to an elevation of experimental Z-scores, which are mirrored by ODiNPred. This highlights the potency of ODiNPred to predict beyond the order/disorder dichotomy and demonstrates that a comprehensive disorder classification—capturing complex concepts such as intermediate and contextual disorder—is now within reach.

Conclusions

A new disorder prediction method, ODiNPred, was presented. ODiNPred uses a deep neural network and is trained on experimental NMR-derived Z-scores from 1325 proteins, applying 157 sequence features. When evaluated against a previous benchmark set consisting of 117 proteins, ODiNPred ranked second of 27 prediction methods, with SPOT-disorder showing marginally better performance. When evaluated in a cross-validation setting against the more extensive "1325" database presented herein, ODiNPred displays performance that even surpassed SPOT-disorder. Predictions with ODiNPred can provide key insights, as highlighted with four example cases. ODiNPred can be freely accessed at https://st-protein.chem.au.dk/odinpred.

Methods

Datasets

The CheZOD database was expanded to contain 1325 protein sequences with a balanced overall content of disordered residues applying an iterative procedure of adding complementary datasets. As an initial construct, we used the database for training the POTENCI procedure17 (containing most of the proteins from the original CheZOD database) keeping the 178 entries having at least 20% residues with Z-scores < 5.0 (fIDR5 > 0.2; fIDR5 denoting the fraction of residues with Z-scores below 5). Subsequently, this database was expanded with new sequences and their corresponding Z-score profiles derived from chemical shifts deposited in the BMRB database91. Entries were only considered if experimental conditions were native, non-denaturing, and non-complexed, as described before11. Since the BMRB database has an over-representation of structured proteins, precautions were taken to favor proteins with more disorder in order to construct a balanced database. This was accomplished by adding sequences with progressively more order in separate steps. In each new step, a new candidate set was stripped for sequences with more than 50% sequence identity among themselves and against the previous iteration of the database. Increasingly stricter demands were imposed on the new data to enforce balance. Specifically, in the first iteration, it was required that fIDR5 > 0.5 and that the average number of assigned chemical shifts per residue was greater than 2.0 (ACSR > 2.0). In the following iterations, it was required that (i) ACSR > 4.0 for 0.5 > fIDR5 > 0.3, (ii) ACSR > 5.0 for 0.3 > fIDR5 > 0.2, (iii) ACSR > 6.0 for 0.2 > fIDR5 > 0.1, and (iv) ACSR > 6.0 for 0.1 > fIDR5 > 0.05 while keeping only the 100 sequences with the largest number of residues. Finally, the new CheZOD database was complemented with sequences from the larger database of structured proteins derived from RefDB92, requiring that ACSR > 6 or ACSR > 5.5 and fIDR5 > 0.05. It should be noted here that the procedure of requiring fewer ACSR for disordered proteins does not lead to significant statistical bias, since Z-scores are largely unaffected by the number of chemical shifts used to derive them, whereas, in contrast, Z-scores scale approximately linearly with the number of chemical shifts for completely structured residues11.

The newly derived CheZOD dataset, containing 1325 protein sequences and their corresponding Z-score values, was then split randomly into 10 disjoint sub-datasets with 132 or 133 entries each. Each of these sub-datasets was utilized for different purposes: (i) a training set used for learning the weights of the neural network, (ii) a testing set used for evaluating the goodness-of-fit of the model after each epoch within the neural network optimization algorithm, and (iii) a validation set used for blind evaluation of the neural network optimized with complementary datasets. Eight sub-datasets were applied for training, whereas a single was used for testing and validation (i.e. 1,060 sequences for training and 132/133 each for testing and validation). The definition of the training/testing/validation sets was varied systematically to define models 1 to 10. In model n, sub-dataset n and n − 1 were used for validation and testing, respectively, whereas the remaining sub-datasets 1, 2, …, n − 2, n + 1, …, and 10 were used for training. By this procedure, the combined predictions for the validation sets constitute a tenfold cross-validation set. The 10 different models provide slightly different, albeit not independent, predictions for a new protein sequence not present in the "1325" CheZOD database. In such a case, the final ODiNPred Z-score prediction is the average of the predictions of the 10 models. The standard deviation within the 10 different predictions provides an estimate of the precision of the prediction and is further used to estimate the probability of disorder using statistical inference.

Any sequence from the “1325” was always validated in a cross-validation setting. This means that, for a given sequence, only one specific model from the 10 sub-models was used for the prediction. Namely the model, for which the particular sequence was part of the validation subset, and hence not used for neither training nor testing of the neural network. This procedure was also referred to here as blind testing.

The neural network

Deep neural nets have high capabilities in finding complex relationships between input and output data. ODiNPred uses a feed-forward network93 implemented using Tensorflow94 with an input layer, five fully connected hidden layers and an output layer with one node and a linear activation function. The input to the network is a matrix of size equal to the length of the protein and the number of features per residue. For the network to learn reliably from the features, each input feature was normalized by its mean and standard deviation. The neural network was set up differently for two cases (i) without and (ii) with evolutionary features. For the case without evolution, the hidden layers contained 40, 10, 25, 40, and 8 neurons, whereas for the case with evolution the hidden layers contained 128, 80, 20, 15, and 10 neurons. In both cases, the response of the first hidden layer was penalized using L2 regularization and the remaining hidden layers used a rectified unit activation function (ReLU). To prevent overfitting, Gaussian noise was applied after the first hidden layer, with a standard deviation of 0.1. The Adam optimization algorithm95 was used during training with a learning rate of 0.0001. The mean squared error between the predicted and observed Z-scores was determined for both training and testing datasets after each iteration. The training involved 100 iterations with a batch size of 50 and a standard back-propagation algorithm96. The model giving the lowest mean squared error for the testing set was chosen.

Sequence features

ODiNPred encodes the sequence as a comprehensive set of 157 features derived from sequence attributes such as frequency of amino acids (AA), sequence complexity, secondary structure propensities, and identification of patterns in the sequence such as binding motifs, repeats, and accumulation of identical charges. Known methods such as the Chou–Fasman algorithm97 and Tango98,99,100 were also applied for the derivation of some of the sequence features. 27 complementary features accounting for evolutionary relatedness to other sequences were based on sequence alignment profiles generated by BLAST and Clustal101,102. ODiNPred predictions can be run optionally with or without calculating and applying these additional features. Most features apply averaging within a sliding window along the sequence. A detailed definition of all features is provided in the “Appendix: Online methods” and Supplementary Table S1.

Disorder predictions

The distribution of experimental Z-scores was fitted to a weighted sum of two skew normal distributions26

$$d\left(x\right)={f}_{D}\psi \left(Z,{\mu }_{D},{\sigma }_{D},{\alpha }_{D}\right)+\left(1-{f}_{D}\right)\psi \left(Z,{\mu }_{O},{\sigma }_{O},{\alpha }_{O}\right)$$

where the skew normal distribution, ψ, is defined in terms of location, scale, and skewness parameters μ, σ, and σ, respectively, as:

$$\psi \left(Z,\mu ,\sigma ,\alpha \right)=2\phi \left({z}_{norm}\right)2\Phi \left(\alpha {z}_{norm}\right), {z}_{norm}=(Z-\mu )/\sigma$$

and ϕ is the standard normal probability density function and Φ is the corresponding cumulative distribution function for the normal distribution. The distribution of the experimental Z-scores agrees well with this model as visualized in Fig. 1 as evidenced by a Hellinger distance of 0.003673103. The shape parameters were found by the fitting procedure and the fraction of disordered residues in the training set, fD, were found to be 0.3626.

ODiNPred provides a predicted Z-score, Zpred. A probability of disorder, pD, is estimated using the fitted shape parameters and a reference fraction of disordered residues, fDref = 0.333104.

$${p}_{D}=\left(\frac{{\pi }_{D}}{{\pi }_{D}+{\pi }_{O}}\right)$$

where

$${\pi }_{S}={\int }_{-\infty }^{\infty }{f}_{D}\psi \left(Z,{\mu }_{S},{\sigma }_{S},{\alpha }_{S}\right)\phi ((Z-{Z}_{pred})/{Z}_{err})dZ$$

with S = D or O denotes the state of order or disorder and again ϕ is the standard normal distribution. The Z-score is predicted using ODiNPred's 10 different cross-validation models and the standard deviation, sZ is extracted from all 10 model predictions and is an indicator of the precision of the prediction. The actual error in the prediction, Zerr, was compared to the standard deviation for all pooled ODiNPred Z-score predictions for the 1325 sequences. The observations of standard deviations were collected in bins and a clear relationship between precision (average sZ in the bin) and accuracy (average Zerr) was observed (Supplementary Fig. S3). We used this relationship to estimate the error, Zerr, on the predicted Z-scores as applied in the equation above.

Measuring performance

To assess the performance of ODiNPred, the Pearson correlation coefficient is calculated when comparing observed against predicted Z-scores, where a value of 1 indicates a perfect correlation and 0 expresses a complete lack of correlation. The Spearman rank correlation coefficient, describing the agreement with a monotonic relationship was evaluated when comparing disorder probabilities and Z-scores. When assessing the performance against benchmarks with binary classifiers, such as the CASP datasets (for which experimental classification is available), classical binary confusion-matrix parameters, i.e. true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), were analyzed. The area under a parametric curve is derived using the estimated probabilities of disorder as the parameter to be varied. A receiver operator curve (ROC) is the true positive rate (or recall), TPR = TP/(TP + FN), vs. the false positive rate (false alarms), FPR = FP/(FP + TN), parameterized by the probability threshold, and the corresponding area under this curve (AUC) is an aggregate measure of the quality of the correlation36. A perfect classifier would yield AUC = 1, whereas random guessing gives AUC = 0.5.

Implementation of ODiNPred web application

The ODiNPred web application (run in python) is located at https://st-protein.chem.au.dk/odinpred (Fig. 6). Input data can be uploaded following instructions on the server. ODiNpred can take up to 100 protein sequences as input at a time and predicts their Z-scores and disorder probabilities. It is possible to run predictions with or without evolutionary features included. Running without evolutionary features decreases the run time from approximately one minute to a few seconds per entry on the current server. Including evolutionary features is recommended for predictions of individual sequences, as it produces higher prediction accuracy (see Results). The prediction results are sent by email to the user as a text file and a plot.

Figure 6
figure 6

Screenshot of ODiNPred at https://st-protein.chem.au.dk/odinpred.