Abstract
Intrinsically disordered proteins and regions (collectively, IDRs) are pervasive across proteomes in all kingdoms of life, help to shape biological functions and are involved in numerous diseases. IDRs populate a diverse set of transiently formed structures and defy conventional sequence–structure–function relationships (ref. 1). Developments in protein science have made it possible to predict the three-dimensional structures of folded proteins at the proteome scale (ref. 2). By contrast, there is a lack of knowledge about the conformational properties of IDRs, partly because the sequences of disordered proteins are poorly conserved and also because only a few of these proteins have been characterized experimentally. The inability to predict structural properties of IDRs across the proteome has limited our understanding of the functional roles of IDRs and how evolution shapes them. As a supplement to previous structural studies of individual IDRs (ref. 3), we developed an efficient molecular model to generate conformational ensembles of IDRs and thereby to predict their conformational properties from sequences (refs. 4,5). Here we use this model to simulate nearly all of the IDRs in the human proteome. Examining conformational ensembles of 28,058 IDRs, we show how chain compaction is correlated with cellular function and localization. We provide insights into how sequence features relate to chain compaction and, using a machine-learning model trained on our simulation data, show the conservation of conformational properties across orthologues. Our results recapitulate observations from previous studies of individual protein systems and exemplify how to link, at the proteome scale, conformational ensembles with cellular function and localization, amino acid sequence, evolutionary conservation and disease variants. Our freely available database of conformational properties will encourage further experimental investigation and enable the generation of hypotheses about the biological roles and evolution of IDRs.
Data availability
Simulation trajectories and precalculated conformational properties for the 28,058 IDRs of the pLDDT-based set are available at https://sid.erda.dk/cgi-sid/ls.py?share_id=AVZAJvJnCO, which also includes a full list of amino acid sequences, sequence features and conformational properties of the 28,058 IDRs in the pLDDT-based set (Supplementary Table 3). In the online database, directories are organized by splitting UniProt IDs into two-character units (XX/XX/XX), followed by an additional four-character unit for UniProt IDs of ten characters (XX/XX/XX/XXXX). For example, data for the IDRs of UniProt ID A0A494C086 can be accessed by entering IDRome/A0/A4/94/C086 in the Filter paths. Raw data to reproduce the results presented in this work are available at https://github.com/KULL-Centre/_2023_Tesei_IDRome and have been deposited at Zenodo (https://doi.org/10.5281/zenodo.10251736).
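The directory rule described above can be sketched as a small helper. This is a minimal illustration rather than code from the study: the function name `idrome_directory` is hypothetical, and applying the `IDRome` prefix to six-character IDs is an assumption extrapolated from the ten-character example given in the text.

```python
def idrome_directory(uniprot_id: str, root: str = "IDRome") -> str:
    """Build the database directory path for a 6- or 10-character UniProt ID.

    Hypothetical helper: follows the XX/XX/XX(/XXXX) rule stated in the text.
    """
    uid = uniprot_id.strip().upper()
    if len(uid) not in (6, 10):
        raise ValueError(f"expected a 6- or 10-character UniProt ID, got {uid!r}")
    # The first six characters are split into three two-character units.
    units = [uid[i:i + 2] for i in range(0, 6, 2)]
    # Ten-character IDs contribute one additional four-character unit.
    if len(uid) == 10:
        units.append(uid[6:])
    return "/".join([root, *units])


print(idrome_directory("A0A494C086"))  # IDRome/A0/A4/94/C086
```

The resulting string is what would be entered in the Filter paths field of the online database.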
Window-averaged pLDDT scores were obtained from Supplementary Table 1 of a previous study (ref. 2). SPOT-Disorder v.1 predictions were obtained from https://github.com/IPritisanac/AF2.IDR (ref. 7). Orthologous sequences were retrieved from the OMA database (Nov 2022 release). UniProt releases 2021_04, 2022_05 and 2023_03 were accessed as described previously (ref. 62) and as detailed in the Methods. AlphaFold predicted aligned errors were retrieved from https://alphafold.ebi.ac.uk/ (refs. 97,98). The Gene Ontology Data Archive version 2023-11-15 (https://doi.org/10.5281/zenodo.10162580) was obtained from https://current.geneontology.org/ontology/go-basic.obo. UniProt IDs of proteins in membraneless organelles were retrieved from CD-CODE v.1_04 (ref. 27). Missense variants were obtained from ClinVar (release 2020-11). Frameshift variants were retrieved from Zenodo (https://doi.org/10.5281/zenodo.7311150; ref. 56). The number of publications per gene was obtained from Find My Understudied Genes (v.1.0.0; https://fmug.amaral.northwestern.edu/).
Code availability
Custom code to reproduce the results and figures of this study is available at https://github.com/KULL-Centre/_2023_Tesei_IDRome and at Zenodo (https://doi.org/10.5281/zenodo.10251736). CALVADOS is available at Zenodo (https://doi.org/10.5281/zenodo.6914053) and can be run using Google Colab through https://github.com/KULL-Centre/_2023_Tesei_IDRome/blob/main/IDRLab.ipynb.
References
Holehouse, A. S. & Kragelund, B. B. The molecular basis for cellular function of intrinsically disordered protein regions. Nat. Rev. Mol. Cell Biol. https://doi.org/10.1038/s41580-023-00673-0 (2023).
Akdel, M. et al. A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067 (2022).
Ghafouri, H. et al. PED in 2024: improving the community deposition of structural ensembles for intrinsically disordered proteins. Nucleic Acids Res. 52, D536–D544 (2024).
Tesei, G., Schulze, T. K., Crehuet, R. & Lindorff-Larsen, K. Accurate model of liquid–liquid phase behavior of intrinsically disordered proteins from optimization of single-chain properties. Proc. Natl Acad. Sci. USA 118, e2111696118 (2021).
Tesei, G. & Lindorff-Larsen, K. Improved predictions of phase behaviour of intrinsically disordered proteins by tuning the interaction range. Open Res. Europe 2, 94 (2023).
Ruff, K. M. & Pappu, R. V. AlphaFold and implications for intrinsically disordered proteins. J. Mol. Biol. 433, 167208 (2021).
Alderson, T. R., Pritišanac, I., Kolarić, D., Moses, A. M. & Forman-Kay, J. D. Systematic identification of conditionally folded intrinsically disordered regions by AlphaFold2. Proc. Natl Acad. Sci. USA 120, e2304302120 (2023).
Piovesan, D., Monzon, A. M. & Tosatto, S. C. E. Intrinsic protein disorder and conditional folding in AlphaFoldDB. Protein Sci. 31, e4466 (2022).
Brotzakis, Z. F., Zhang, S. & Vendruscolo, M. AlphaFold prediction of structural ensembles of disordered proteins. Preprint at bioRxiv https://doi.org/10.1101/2023.01.19.524720 (2023).
Thomasen, F. E. & Lindorff-Larsen, K. Conformational ensembles of intrinsically disordered proteins and flexible multidomain proteins. Biochem. Soc. Trans. 50, 541–554 (2022).
Das, R. K., Huang, Y., Phillips, A. H., Kriwacki, R. W. & Pappu, R. V. Cryptic sequence features within the disordered protein p27Kip1 regulate cell cycle signaling. Proc. Natl Acad. Sci. USA 113, 5616–5621 (2016).
Martin, E. W. et al. Valence and patterning of aromatic residues determine the phase behavior of prion-like domains. Science 367, 694–699 (2020).
González-Foutel, N. S. et al. Conformational buffering underlies functional selection in intrinsically disordered protein regions. Nat. Struct. Mol. Biol. 29, 781–790 (2022).
Lindorff-Larsen, K. & Kragelund, B. B. On the potential of machine learning to examine the relationship between sequence, structure, dynamics and function of intrinsically disordered proteins. J. Mol. Biol. 433, 167196 (2021).
Zheng, W. et al. Inferring properties of disordered chains from FRET transfer efficiencies. J. Chem. Phys. 148, 123329 (2018).
Sherry, K. P., Das, R. K., Pappu, R. V. & Barrick, D. Control of transcriptional activity by design of charge patterning in the intrinsically disordered ram region of the notch receptor. Proc. Natl Acad. Sci. USA 114, E9243–E9252 (2017).
Riback, J. A. et al. Stress-triggered phase separation is an adaptive, evolutionarily tuned response. Cell 168, 1028–1040 (2017).
Bremer, A. et al. Deciphering how naturally occurring sequence features impact the phase behaviours of disordered prion-like domains. Nat. Chem. 14, 196–207 (2022).
Ibrahim, A. Y. et al. Intrinsically disordered regions that drive phase separation form a robustly distinct protein class. J. Biol. Chem. 299, 102801 (2023).
Teilum, K., Olsen, J. G. & Kragelund, B. B. Globular and disordered—the non-identical twins in protein-protein interactions. Front. Mol. Biosci. 2, 40 (2015).
Staller, M. V. et al. Directed mutational scanning reveals a balance between acidic and hydrophobic residues in strong human activation domains. Cell Syst. 13, 334–345 (2022).
Sabari, B. R., Dall’Agnese, A. & Young, R. A. Biomolecular condensates in the nucleus. Trends Biochem. Sci. 45, 961–977 (2020).
Diner, I. et al. Aggregation properties of the small nuclear ribonucleoprotein U1-70K in Alzheimer disease. J. Biol. Chem. 289, 35296–35313 (2014).
Greig, J. A. et al. Arginine-enriched mixed-charge domains provide cohesion for nuclear speckle condensation. Mol. Cell 77, 1237–1250 (2020).
Chang, F. T. M. et al. PML bodies provide an important platform for the maintenance of telomeric chromatin integrity in embryonic stem cells. Nucleic Acids Res. 41, 4447–4458 (2013).
Lyons, H. et al. Functional partitioning of transcriptional regulators by patterned charge blocks. Cell 186, 327–345 (2023).
Rostam, N. et al. CD-CODE: crowdsourcing condensate database and encyclopedia. Nat. Methods 20, 673–676 (2023).
Nott, T. J. et al. Phase transition of a disordered nuage protein generates environmentally responsive membraneless organelles. Mol. Cell 57, 936–947 (2015).
Pak, C. W. et al. Sequence determinants of intracellular phase separation by complex coacervation of a disordered protein. Mol. Cell 63, 72–85 (2016).
Das, R. K. & Pappu, R. V. Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proc. Natl Acad. Sci. USA 110, 13392–13397 (2013).
Sawle, L. & Ghosh, K. A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins. J. Chem. Phys. 143, 085101 (2015).
Zarin, T. et al. Proteome-wide signatures of function in highly diverged intrinsically disordered regions. eLife 8, e46883 (2019).
Zheng, W., Dignon, G., Brown, M., Kim, Y. C. & Mittal, J. Hydropathy patterning complements charge patterning to describe conformational preferences of disordered proteins. J. Phys. Chem. Lett. 11, 3408–3415 (2020).
Huihui, J. & Ghosh, K. Intrachain interaction topology can identify functionally similar intrinsically disordered proteins. Biophys. J. 120, 1860–1868 (2021).
Yamazaki, H., Takagi, M., Kosako, H., Hirano, T. & Yoshimura, S. H. Cell cycle-specific phase separation regulated by protein charge blockiness. Nat. Cell Biol. 24, 625–632 (2022).
Mao, A. H., Crick, S. L., Vitalis, A., Chicoine, C. L. & Pappu, R. V. Net charge per residue modulates conformational ensembles of intrinsically disordered proteins. Proc. Natl Acad. Sci. USA 107, 8183–8188 (2010).
Lin, Y.-H. & Chan, H. S. Phase separation and single-chain compactness of charged disordered proteins are strongly correlated. Biophys. J. 112, 2043–2046 (2017).
Cohan, M. C., Shinn, M. K., Lalmansingh, J. M. & Pappu, R. V. Uncovering non-random binary patterns within sequences of intrinsically disordered proteins. J. Mol. Biol. 434, 167373 (2022).
DelRosso, N. et al. Large-scale mapping and mutagenesis of human transcriptional effector domains. Nature 616, 365–372 (2023).
Johansson, K. E., Mashahreh, B., Hartmann-Petersen, R., Ravid, T. & Lindorff-Larsen, K. Prediction of quality-control degradation signals in yeast proteins. J. Mol. Biol. 435, 167915 (2023).
Ruff, K. M. in Intrinsically Disordered Proteins (eds Kragelund, B. B. & Skriver, K.) Ch. 18, 347–389 (Springer, 2020).
Lotthammer, J. M. et al. Direct prediction of intrinsically disordered protein conformational properties from sequences. Nat. Methods https://doi.org/10.1038/s41592-023-02159-5 (2024).
Wang, J. et al. A molecular grammar governing the driving forces for phase separation of prion-like RNA binding proteins. Cell 174, 688–699 (2018).
Fisher, R. S. & Elbaum-Garfinkle, S. Tunable multiphase dynamics of arginine and lysine liquid condensates. Nat. Commun. 11, 4628 (2020).
Schmidt, H. B., Barreau, A. & Rohatgi, R. Phase separation-deficient TDP43 remains functional in splicing. Nat. Commun. 10, 4890 (2019).
Sørensen, C. S. & Kjaergaard, M. Effective concentrations enforced by intrinsically disordered linkers are governed by polymer physics. Proc. Natl Acad. Sci. USA 116, 23124–23131 (2019).
Hantschel, O. et al. A myristoyl/phosphotyrosine switch regulates c-Abl. Cell 112, 845–857 (2003).
Harmon, T. S., Holehouse, A. S., Rosen, M. K. & Pappu, R. V. Intrinsically disordered linkers determine the interplay between phase separation and gelation in multivalent proteins. eLife 6, e30294 (2017).
Asselin, L. et al. Mutations in the KIF21B kinesin gene cause neurodevelopmental disorders through imbalanced canonical motor activity. Nat. Commun. 11, 2441 (2020).
Ahmed, S. B. M. & Prigent, S. A. Insights into the Shc family of adaptor proteins. J. Mol. Signal. 12, 2 (2017).
Manetti, F. LIM kinases are attractive targets with many macromolecular partners and only a few small molecule regulators. Med. Res. Rev. 32, 968–998 (2011).
Mégarbané, H. et al. An autosomal-recessive form of cutis laxa is due to homozygous elastin mutations, and the phenotype may be modified by a heterozygous fibulin 5 polymorphism. J. Invest. Dermatol. 129, 1650–1655 (2009).
Langenhove, T. V. et al. Genetic contribution of FUS to frontotemporal lobar degeneration. Neurology 74, 366–371 (2010).
Liu, Q. et al. Whole-exome sequencing identifies a missense mutation in hnRNPA1 in a family with flail arm ALS. Neurology 87, 1763–1769 (2016).
Schmidt, A. et al. Predicting the pathogenicity of missense variants using features derived from AlphaFold2. Bioinformatics 39, btad280 (2023).
Mensah, M. A. et al. Aberrant phase separation and nucleolar dysfunction in rare genetic diseases. Nature 614, 564–571 (2023).
Banani, S. F. et al. Genetic variation associated with condensate dysregulation in disease. Dev. Cell 57, 1776–1788 (2022).
Rocha, J. J. et al. Functional unknomics: systematic screening of conserved genes of unknown function. PLOS Biol. 21, e3002222 (2023).
Richardson, R. A. K., Navarro, H. T., Amaral, L. A. N. & Stoeger, T. Meta-research: understudied genes are lost in a leaky pipeline between genome-wide assays and reporting of results. eLife 12, RP93429 (2023).
Janson, G., Valdes-Garcia, G., Heo, L. & Feig, M. Direct generation of protein conformational ensembles via machine learning. Nat. Commun. 14, 774 (2023).
Le Mercier, P. et al. SwissBioPics—an interactive library of cell images for the visualization of subcellular location data. Database 2022, baac026 (2022).
The UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Hanson, J., Yang, Y., Paliwal, K. & Zhou, Y. Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks. Bioinformatics 33, 685–692 (2016).
Hekkelman, M. L., de Vries, I., Joosten, R. P. & Perrakis, A. AlphaFill: enriching AlphaFold models with ligands and cofactors. Nat. Methods 20, 205–213 (2022).
Anderson, J. A., Glaser, J. & Glotzer, S. C. HOOMD-blue: a Python package for high-performance molecular dynamics and hard particle Monte Carlo simulations. Comput. Mater. Sci. 173, 109363 (2020).
Flyvbjerg, H. & Petersen, H. G. Error estimates on averages of correlated data. J. Chem. Phys. 91, 461–466 (1989).
Borgia, A. et al. Consistent view of polypeptide chain expansion in chemical denaturants from multiple experimental methods. J. Am. Chem. Soc. 138, 11714–11726 (2016).
Aronovitz, J. & Nelson, D. Universal features of polymer shapes. J. Phys. 47, 1445–1456 (1986).
Hensen, U., Gräter, F. & Henchman, R. H. Macromolecular entropy can be accurately computed from force. J. Chem. Theory Comput. 10, 4777–4781 (2014).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
McGibbon, R. T. et al. MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys. J. 109, 1528–1532 (2015).
Holehouse, A. S., Das, R. K., Ahad, J. N., Richardson, M. O. & Pappu, R. V. CIDER: resources to analyze sequence-ensemble relationships of intrinsically disordered proteins. Biophys. J. 112, 16–21 (2017).
Montepietra, D. et al. FRETpredict: a Python package for FRET efficiency predictions using rotamer libraries. Preprint at bioRxiv https://doi.org/10.1101/2023.01.27.525885 (2023).
Rotkiewicz, P. & Skolnick, J. Fast procedure for reconstruction of full-atom protein models from reduced representations. J. Comput. Chem. 29, 1460–1465 (2008).
Fuertes, G. et al. Decoupling of size and shape fluctuations in heteropolymeric sequences reconciles discrepancies in SAXS vs. FRET measurements. Proc. Natl Acad. Sci. USA 114, E6342–E6351 (2017).
Gomes, G.-N. W. et al. Conformational ensembles of an intrinsically disordered protein consistent with NMR, SAXS, and single-molecule FRET. J. Am. Chem. Soc. 142, 15697–15710 (2020).
Humphrey, W., Dalke, A. & Schulten, K. VMD—Visual Molecular Dynamics. J. Mol. Graph. 14, 33–38 (1996).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
The Gene Ontology Consortium. The Gene Ontology knowledgebase in 2023. Genetics 224, iyad031 (2023).
Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using networkx. In Proc. 7th Python in Science Conference (eds Varoquaux, G., Vaught, T. & Millman, J.) 11–15 (2008).
Brunner, E. & Munzel, U. The nonparametric Behrens–Fisher problem: asymptotic theory and a small-sample approximation. Biom. J. 42, 17–25 (2000).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Thulin, M. Modern Statistics with R (Eos Chasma Press, 2021).
Noguchi, K., Konietschke, F., Marmolejo-Ramos, F. & Pauly, M. Permutation tests are robust and powerful at 0.5% and 5% significance levels. Behav. Res. Meth. 53, 2712–2724 (2021).
Mashahreh, B. et al. Conserved degronome features governing quality control associated proteolysis. Nat. Commun. 13, 7588 (2022).
Koren, I. et al. The eukaryotic proteome is shaped by E3 ubiquitin ligases targeting C-terminal degrons. Cell 173, 1622–1635 (2018).
Chang, Y.-W., Hsieh, C.-J., Chang, K.-W., Ringgaard, M. & Lin, C.-J. Training and testing low-degree polynomial data mappings via linear SVM. J. Mach. Learn. Res. 11, 1471–1490 (2010).
Schölkopf, B., Smola, A., Williamson, R. & Bartlett, P. New support vector algorithms. Neural Comput. 12, 1207–1245 (2000).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Chao, T.-H., Rekhi, S., Mittal, J. & Tabor, D. P. Data-driven models for predicting intrinsically disordered protein polymer physics directly from composition or sequence. Mol. Syst. Des. Eng. 8, 1146–1155 (2023).
Dignon, G. L., Zheng, W., Kim, Y. C., Best, R. B. & Mittal, J. Sequence determinants of protein phase behavior from a coarse-grained model. PLOS Comput. Biol. 14, e1005941 (2018).
Lu, A. X. et al. Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning. PLOS Comput. Biol. 18, e1010238 (2022).
Altenhoff, A. M. et al. OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. 49, D373–D379 (2020).
Katoh, K. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2017).
Tiemann, J. K. S., Zschach, H., Lindorff-Larsen, K. & Stein, A. Interpreting the molecular mechanisms of disease variants in human transmembrane proteins. Biophys. J. 122, 2176–2191 (2023).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Varadi, M. et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2021).
Acknowledgements
We thank T. Mittag, X. Salvatella, J. Riback, and R. Pappu for comments and suggestions, and A. Holehouse for discussions and sharing work before publication. We acknowledge access to computational resources from the Biocomputing Core Facility at the Department of Biology, University of Copenhagen, from the Resource for Biomolecular Simulations (ROBUST; supported by the Novo Nordisk Foundation; NNF18OC0032608), and the Danish National Supercomputer for Life Sciences (Computerome). This work is a contribution from the PRISM (Protein Interactions and Stability in Medicine and Genomics) centre funded by the Novo Nordisk Foundation (to K.L.-L.; NNF18OC0033950). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 101025063.
Author information
Contributions
G.T., K.E.J. and K.L.-L. conceived and planned the study. G.T. and A.I.T. performed and analysed molecular simulations and performed GO analyses. N.J. generated the set of missense variants. G.T. and J.B. developed the machine-learning models. G.T. and F.E.K. analysed orthologous sequences. F.P. developed the Google Colab notebooks. K.E.J. generated the sets of human and orthologous sequences. G.T. performed the other analyses with input from N.J., F.P., K.E.J. and K.L.-L. G.T. and K.L.-L. wrote the manuscript with contributions from all authors. K.L.-L. supervised the study.
Ethics declarations
Competing interests
K.L.-L. holds stock options in and is a consultant for Peptone. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature thanks David De Sancho and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Analyses of pLDDT scores and UniProt annotations.
a, Distributions of the window-averaged pLDDT score of residues in the pLDDT-based set (grey), in the SPOT-based set (red), and in IDRs from 531 proteins unique to the SPOT-based set (magenta). b,c, Window-averaged pLDDT scores in (b) all IDRs and in (c) IDRs between folded domains with ν ≤ 0.45 (orange), 0.45 < ν ≤ 0.55 (cyan), and ν > 0.55 (teal). Hatched and open violin plots show results for the pLDDT-based and SPOT-based sets, respectively. Horizontal lines show medians and quartiles calculated over (b) n = 161,748; 88,325; 3,278,851; 2,442,776; 511,556; 429,591 and (c) n = 13,988; 8,534; 374,190; 275,164; 92,770; 66,276 pLDDT scores. d,e, Fraction of IDRs (d) and average sequence length of IDRs (e) as a function of the fraction of residues within UniProt “Zinc finger”, “Coiled coil”, “Transmembrane”, and “Domain” annotations longer than ten amino acids, fdomain. Results for the pLDDT-based (6,204 IDRs with fdomain > 0, 22%) and SPOT-based sets (4,095 IDRs with fdomain > 0, 19%) are shown in grey and red, respectively. f–i, fdomain averaged over (f) all IDRs (n = 616; 361; 18,546; 13,292; 8,896; and 7,730), (g) IDRs between folded domains (n = 91; 56; 2,470; 1,613; 1,777; and 1,214), (h) N- and C-terminal IDRs (n = 415; 203; 12,144; 6,531; 4,776; and 3,601), and (i) IDPs (n = 43; 12; 1,346; 517; 74; and 50) with ν ≤ 0.45 (orange), 0.45 < ν ≤ 0.55 (cyan), and ν > 0.55 (teal). Hatched and empty bars show results for the pLDDT-based and SPOT-based sets, respectively. Data in e–i are displayed as mean ± s.e.m. j, Comparison between the number of residues within UniProt “Zinc finger”, “Coiled coil”, “Transmembrane”, and “Domain” annotations longer than ten amino acids in the pLDDT-based (grey) and SPOT-based (red) sets of IDRs, and in the full-length proteins of both sets (blue). IDP, fully intrinsically disordered protein.
Extended Data Fig. 2 Representative intra-chain energy maps, and distributions of the conformational parameters asphericity, Δ, and prolateness, S, for a selection of IDRs from our database.
The UniProt IDs of the IDRs are (a) P08621, (b) P06748, (c) Q5VUA4, (d) Q6PD62, (e) Q15648, (f) Q13148, (g) P98177, (h) Q7KZ85, (i) P06454, (j) Q5BKZ1, (k) O60763, and (l) Q99733. The data are averaged over five ensembles of 1,000 weakly correlated conformations (Extended Data Fig. 5a), each obtained from an independent molecular dynamics simulation. Top, energy maps are calculated using the non-ionic Ashbaugh–Hatch potential of the CALVADOS 2 model. Bottom, lines and vertical shaded areas show distributions and confidence intervals of the ensemble averages estimated from the standard deviation of Δ (light shade) and S (dark shade) over five independent simulations.
Extended Data Fig. 3 Validation of the CALVADOS model.
a, Relative difference between predicted and experimental radii of gyration for a set of 64 proteins that were not used in training the CALVADOS 2 model, and for Sic1. Bars show \(({R}_{{\rm{g}}}^{{\rm{calc}}}-{R}_{{\rm{g}}}^{\exp })/{R}_{{\rm{g}}}^{\exp }\) where \({R}_{{\rm{g}}}^{{\rm{calc}}}\) is averaged over n = 5 independent simulation replicas. Data are displayed as the mean ± s.d. b, Correlation between \({R}_{{\rm{g}}}^{\exp }\) and \({R}_{{\rm{g}}}^{{\rm{calc}}}\) for the same proteins shown in a. c, Correlation of predicted and experimental average FRET efficiencies, ⟨EFRET⟩, for a set of 64 proteins that were not used in training the CALVADOS 2 model, and for Sic1. d–k, Comparison between Rg distributions from CALVADOS simulations (colour) and from ensembles from the Protein Ensemble Database (black) for (d) Nup49, (e) NLS, (f) NUS, (g) IBB, (h) NUL, (i) Sic1, (j) KISS1 and (k) NWASPV. Vertical lines show the average radius of gyration from simulations (colour), from the Protein Ensemble Database (solid black), and from experiments (dashed black); we note that some of the ensembles in the Protein Ensemble Database may have been generated by so-called reweighting methods, but that the ensembles in the database correspond to the unweighted structures. In Supplementary Table 4, we provide the full list of references for the experimental Rg and ⟨EFRET⟩ values shown in this figure.
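The quantity plotted in panel a can be illustrated with a short calculation. This is a sketch under stated assumptions, not the study's analysis code: the function name and the numerical values are hypothetical, and computing the s.d. over per-replica relative differences is one plausible reading of "mean ± s.d." over the n = 5 replicas.

```python
import statistics


def relative_rg_difference(rg_calc_replicas, rg_exp):
    """Mean and s.d. of (Rg_calc - Rg_exp) / Rg_exp over simulation replicas."""
    rel = [(rg - rg_exp) / rg_exp for rg in rg_calc_replicas]
    # Sample standard deviation (n - 1 in the denominator) across replicas.
    return statistics.mean(rel), statistics.stdev(rel)


# Hypothetical radii of gyration (nm) from five replicas and one experiment.
mean_rel, sd_rel = relative_rg_difference([2.95, 3.02, 3.10, 2.98, 3.05], rg_exp=3.00)
print(f"relative difference = {mean_rel:+.3f} ± {sd_rel:.3f}")
```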
Extended Data Fig. 4 Correlation between the variance of residue–residue distances in CALVADOS simulations and the confidence of the relative positions of residue pairs in AlphaFold predictions.
a–d, Standard deviation (s.d.) of intra-chain distances calculated from simulations (a,c) and AlphaFold predicted aligned errors (PAE) (b,d) for IDRs of nucleophosmin (UniProt ID P06748) (a,b) and transcription elongation factor SPT6 (UniProt ID Q7KZ85) (c,d). e–l, PAE as a function of s.d. for IDRs in proteins with UniProt IDs (e) P08621, (f) P06748, (g) Q5VUA4, (h) Q6PD62, (i) Q15648, (j) Q13148, (k) P98177, and (l) Q7KZ85. Pearson correlation coefficients, r, and linear fits (solid lines) are calculated for separations along the linear sequence, ∣i − j∣, in the range ∣i − j∣min ≤ ∣i − j∣ ≤ N, where ∣i − j∣min = 4 and N is the sequence length. m–t, Pearson r for the correlation between PAE and s.d. as a function of the shortest separation along the linear chain, ∣i − j∣min, included in the analysis; data are shown for IDRs in proteins with UniProt IDs (m) P08621, (n) P06748, (o) Q5VUA4, (p) Q6PD62, (q) Q15648, (r) Q13148, (s) P98177, and (t) Q7KZ85.
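The correlation analysis in panels e–t can be sketched as follows. The helper names are hypothetical, the matrices are toy data, and including both (i, j) and (j, i) entries is an assumption made here because predicted aligned errors are generally not symmetric.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pae_sd_correlation(pae, sd, min_sep=4):
    """Pearson r between PAE and distance s.d. for pairs with |i - j| >= min_sep."""
    n = len(pae)
    sd_vals, pae_vals = [], []
    for i in range(n):
        for j in range(n):
            if abs(i - j) >= min_sep:
                sd_vals.append(sd[i][j])
                pae_vals.append(pae[i][j])
    return pearson_r(sd_vals, pae_vals)


# Toy 20-residue example in which PAE is an exact linear function of the s.d.
n_res = 20
sd = [[math.sqrt(abs(i - j)) for j in range(n_res)] for i in range(n_res)]
pae = [[2.0 * sd[i][j] + 1.0 for j in range(n_res)] for i in range(n_res)]
print(pae_sd_correlation(pae, sd, min_sep=4))
```

Scanning `min_sep` over a range of values reproduces the kind of curve shown in panels m–t, where Pearson r is reported as a function of the shortest sequence separation included.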
Extended Data Fig. 5 Sampling, distributions and dependence on the location within the full-length protein of conformational properties in the pLDDT-based set of IDRs.
a, Values of the autocorrelation function (ACF) of the Rg for lag times of one (grey), two (red), and three (blue) frames as a function of sequence length. Data are displayed as mean ± s.d. over n = 5 independent simulation replicas. b, S.d. of the apparent Flory scaling exponent, ν, as a function of ν for proteins of different sequence length. c, S.d. of the ratio of the mean-squared end-to-end distance and the mean-squared radius of gyration, \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \), as a function of \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \) for proteins of different sequence length. d, S.d. of the conformational entropy per residue, Sconf/N, as a function of Sconf/N for proteins of different sequence length. S.d. values are calculated over n = 5 independent simulation replicas. Dotted lines show the average s.d. e, Correlation between \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \) calculated from simulation trajectories and the approximate relation \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle =(2\nu +1)\times (2\nu +2)\) (ref. 15). f,g, Distributions of (f) \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \) and (g) the conformational entropy per residue, Sconf/N, for IDRs in the human proteome; note the logarithmic scale. h–j, Correlation between (h) asphericity, Δ, and ν; (i) prolateness, S, and ν; and (j) Sconf/N and ν. r is the Pearson correlation coefficient. k, Normalized distributions of the absolute difference between ν values, ∣Δν∣, for IDRs from different proteins (grey), within the same protein (blue), and separated by folded domains within the same protein (teal). Vertical bars show mean ± s.e.m. of ∣Δν∣. l,m, Distributions of (l) ν and (m) Sconf/N for IDRs in the different categories of localization within the full-length protein: N-terminal (32.2%), C-terminal (24.8%), between folded domains (15.5%), and IDPs (5.2%). For this analysis we annotated IDRs to be located between folded domains if they are preceded and followed by at least 100 residues which do not overlap with any of the IDRs identified in this work. IDP, fully intrinsically disordered protein.
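The approximate relation in panel e can be checked numerically with a simple scaling argument. The sketch below assumes, purely for illustration, that mean-squared intra-chain distances follow ⟨r_ij²⟩ = b²|i − j|^(2ν) for every residue pair (the prefactor b² cancels in the ratio); the function names are hypothetical and this is not the analysis used in the study.

```python
def ratio_from_nu(nu):
    """Closed-form estimate of <Ree^2>/<Rg^2> = (2*nu + 1) * (2*nu + 2)."""
    return (2 * nu + 1) * (2 * nu + 2)


def ratio_from_scaling(nu, n_beads=2000):
    """<Ree^2>/<Rg^2> from explicit sums, assuming <r_ij^2> = |i - j|**(2*nu)."""
    # There are (n_beads - k) bead pairs at sequence separation k.
    sum_sq = sum((n_beads - k) * k ** (2 * nu) for k in range(1, n_beads))
    rg2 = sum_sq / n_beads ** 2            # <Rg^2> = (1/N^2) * sum_{i<j} <r_ij^2>
    ree2 = (n_beads - 1) ** (2 * nu)       # end-to-end separation is N - 1 bonds
    return ree2 / rg2


print(ratio_from_nu(0.5))  # 6.0, the ideal-chain value
for nu in (0.4, 0.5, 0.6):
    print(nu, ratio_from_nu(nu), ratio_from_scaling(nu))
```

For long chains the explicit sums converge to the closed form, which is why the relation is approximate for IDRs of finite length and with sequence-dependent deviations from uniform scaling.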
Extended Data Fig. 6 Analysis of the CD-CODE database, and relationships between sequence and conformational properties for IDRs in the pLDDT-based and SPOT-based sets.
a,b, Analysis of the association between IDR conformational properties and protein localization in membraneless organelles (as reported in the CD-CODE database; ref. 27). Distributions of (a) ν and (b) Sconf/N for IDRs in ‘driver’ proteins (shaded bars) and in proteins that are not a part of the examined subset (black lines). c,d, Distributions of (c) ν and (d) Sconf/N for IDRs in ‘member’ proteins (shaded bars) and in proteins that are not a part of the examined subset (black lines). Histograms for proteins enriched in compact and expanded IDRs are shown in orange and teal, respectively. P values are estimated from one-sided Brunner–Munzel tests using t-distributions and the reported degrees of freedom (DoF). Standard errors of Cohen’s d values are estimated through 105 bootstraps. e–q, Comparison between sequence features that affect compaction in pLDDT-based and SPOT-based sets of IDRs. We show NARDINI z-scores for (e) basic–acidic patterning, z(δ+−), (f) acidic patterning, z(Ω−), and (g) aromatic patterning, z(Ωπ); (h) sequence charge decoration, SCD; (l) charge segregation, κ; (m) Sconf/N; (n) sequence hydropathy decoration, SHD; (o) average stickiness, ⟨λ⟩; (p) fraction of charged residues, FCR; and (q) sequence length, N, as a function of ν. i–k, (i) z(δ+−), (j) z(Ω−), and (k) z(Ωπ) as a function of Sconf/N. Results are shown for all IDRs in the pLDDT-based (grey) and SPOT-based (red) sets, and for IDRs with fdomain = 0 in the pLDDT-based set (blue). r,s, Short IDRs are on average more highly charged and expanded than longer IDRs. (r) ν and (s) Sconf/N as a function of sequence length, N, for the human IDRs in the pLDDT-based (grey) and SPOT-based (red) sets. Data are displayed as mean ± s.e.m. t, Normalized distributions of the NCPR for IDRs with N ≤ 200 (full lines) and with N > 200 (dotted lines) in the pLDDT-based (grey) and SPOT-based (red) sets.
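The effect-size estimate mentioned for panels a–d can be illustrated with a short bootstrap. This is a hedged sketch only: the caption does not state which variant of Cohen's d was used, so the pooled-standard-deviation form is an assumption, the function names are hypothetical, the data are synthetic, and the default number of resamples is an arbitrary choice for illustration.

```python
import random
import statistics


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / pooled_var ** 0.5


def bootstrap_se(x, y, n_boot=1000, seed=0):
    """Standard error of Cohen's d from resampling each group with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        xb = rng.choices(x, k=len(x))
        yb = rng.choices(y, k=len(y))
        estimates.append(cohens_d(xb, yb))
    return statistics.stdev(estimates)


# Synthetic scaling exponents for a subset of IDRs and for the remaining IDRs.
rng = random.Random(1)
subset = [rng.gauss(0.50, 0.03) for _ in range(200)]
others = [rng.gauss(0.52, 0.03) for _ in range(2000)]
print(cohens_d(subset, others), bootstrap_se(subset, others))
```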
Extended Data Fig. 7 Analyses of QCDPred scores and charged residues for IDRs in the pLDDT-based set.
a, QCDPred score (red) and ⟨λ⟩ (grey) as a function of ν. b, Average QCDPred score as a function of ν and ⟨λ⟩. c, QCDPred score (red) and fraction of negatively charged residues, f− (black), as a function of ν. d,g, Average fraction of positively charged residues Arg (red) and Lys (black) as a function of (d) ν and (g) Sconf/N. e,h, Average fraction of negatively charged residues Glu (red) and Asp (black) as a function of (e) ν and (h) Sconf/N. f,i, Average net charge per residue, NCPR, as a function of (f) ν and (i) Sconf/N. Data are displayed as mean ± s.e.m. calculated within bins of width Δν = 0.015 and ΔSconf/N = 0.05 kB. The sample sizes in each bin are (a,c–f) n = 4; 3; 5; 3; 5; 6; 9; 11; 8; 16; 14; 26; 23; 28; 41; 49; 98; 135; 212; 473; 854; 1,921; 3,319; 4,724; 5,393; 4,987; 3,405; 1,461; 484; 182; 76; 43; 16; 9; 4; and (g–i) n = 5; 9; 4; 20; 29; 93; 406; 1,290; 2,555; 3,204; 3,520; 3,483; 3,415; 3,005; 2,517; 1,895; 1,208; 686; 320; 161; 90; 49; 29; 20; 12; 6; 8; 4; 3; 3; 3.
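The binned averages described in this caption (mean ± s.e.m. within bins of fixed width, such as Δν = 0.015) can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function name `binned_mean_sem`, the `origin` parameter, and the use of the sample standard deviation (n − 1 in the denominator) for the s.e.m. are assumptions.

```python
import math
from collections import defaultdict


def binned_mean_sem(x, y, width, origin=0.0):
    """Group y by fixed-width bins of x.

    Returns {bin_index: (n, mean, sem)}, where bin_index k covers
    [origin + k * width, origin + (k + 1) * width).
    """
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[math.floor((xi - origin) / width)].append(yi)

    summary = {}
    for index in sorted(groups):
        values = groups[index]
        n = len(values)
        mean = sum(values) / n
        if n > 1:
            # Sample variance (n - 1); s.e.m. = sqrt(variance / n).
            variance = sum((v - mean) ** 2 for v in values) / (n - 1)
            sem = math.sqrt(variance / n)
        else:
            sem = float("nan")  # s.e.m. is undefined for a single point
        summary[index] = (n, mean, sem)
    return summary


if __name__ == "__main__":
    # Made-up (nu, NCPR) pairs for illustration only.
    nu = [0.501, 0.505, 0.520, 0.525]
    ncpr = [1.0, 3.0, 2.0, 4.0]
    for index, (n, mean, sem) in binned_mean_sem(nu, ncpr, 0.015, origin=0.5).items():
        print(index, n, mean, sem)
```

Returning the per-bin count n alongside the mean and s.e.m. mirrors the way the caption reports a sample size for every bin.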
Extended Data Fig. 8 SVR models for ν and Sconf/N.
a,g, Permutation importance of the sequence features used in the SVR models for the prediction of (a) ν and (g) Sconf/N. b–f, Average sequence features as a function of ν calculated from simulations (grey) and using the SVR model (red); data are shown for (b) sequence hydropathy decoration, SHD; (c) sequence charge decoration, SCD; (d) fraction of charged residues, FCR; (e) average stickiness, ⟨λ⟩; and (f) charge segregation, κ. h–j, Average sequence features as a function of Sconf/N calculated from simulations (grey) and using the SVR model (red); data are shown for (h) ⟨λ⟩; (i) SHD; and (j) SCD. Data are displayed as mean ± s.e.m. calculated within bins of width Δν = 0.015 and ΔSconf/N = 0.05 kB. The sample sizes in each bin are (b–f) n = 4; 3; 5; 3; 5; 6; 9; 11; 8; 16; 14; 26; 23; 28; 41; 49; 98; 135; 212; 473; 854; 1,921; 3,319; 4,724; 5,393; 4,987; 3,405; 1,461; 484; 182; 76; 43; 16; 9; 4; and (h–j) n = 5; 9; 4; 20; 29; 93; 406; 1,290; 2,555; 3,204; 3,520; 3,483; 3,415; 3,005; 2,517; 1,895; 1,208; 686; 320; 161; 90; 49; 29; 20; 12; 6; 8; 4; 3; 3; 3. k–n, Testing the SVR models. k,m, Correlation between (k) ν and (m) Sconf/N from simulations and corresponding predictions of the SVR models for a held-out test set of 2,795 distinct sequences from the set of 28,058 IDRs identified in this work. l,n, Correlation between (l) ν and (n) Sconf/N values from simulations and corresponding predictions of the SVR models for 611 IDRs in the 531 proteins that are unique to the SPOT-based set of IDRs.
Extended Data Fig. 9 Relationships between the conformational properties of IDRs and the incidence of pathogenic missense and frameshift variants.
a–c, Distributions of (a) the conformational entropy per residue, Sconf/N; (b) the pLDDT scores of the substituted residues; and (c) the scaling exponent, ν, for 11,173 benign (black line) and 1,656 pathogenic (grey bars) variants of IDRs without UniProt domain annotations (fdomain = 0). d,e, Distributions of the difference in (d) ΔSconf,SVR/N and (e) ΔνSVR between variant and wild type for benign (black line) and pathogenic (grey bars) variants of IDRs with fdomain = 0. g,h, Distributions of the difference in (g) ΔSconf,SVR/N and (h) ΔνSVR between variant and wild type for 5,039 benign (or of unknown significance, black line) and 637 pathogenic (grey bars) frameshift variants identified by Mensah et al.56. P values are estimated from one-sided Brunner–Munzel tests using t-distributions and the reported degrees of freedom (DoF). Standard errors of Cohen’s d values are estimated through 10⁵ bootstraps. f, Average number of ClinVar missense variants per IDR. i, Average number of publications per IDR obtained from the “Find my Understudied Gene” tool59. Data in f and i are displayed as mean ± s.e.m. calculated over n = 25; 87; 498; 18,546; 8,864; and 32 IDRs.
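The effect sizes in this caption are reported as Cohen’s d with standard errors from bootstrapping. A minimal standard-library sketch of that procedure is shown below. It is not the authors' code: the function names `cohens_d` and `bootstrap_se`, the pooled-standard-deviation form of d, and the independent resampling of each group at its own size are assumptions. The default `n_boot` is kept small so the example runs quickly.

```python
import math
import random
import statistics


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_variance = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_variance)


def bootstrap_se(a, b, n_boot=1000, seed=0):
    """Standard error of Cohen's d from resampling each group with replacement."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    estimates = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        estimates.append(cohens_d(resampled_a, resampled_b))
    # The spread of the bootstrap estimates approximates the standard error.
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Synthetic values for illustration only (not data from the study).
    rng = random.Random(42)
    pathogenic = [rng.gauss(0.50, 0.03) for _ in range(60)]
    benign = [rng.gauss(0.52, 0.03) for _ in range(200)]
    print(cohens_d(pathogenic, benign), bootstrap_se(pathogenic, benign))
```

The sketch assumes each group has enough distinct values that a resample never has zero pooled variance; very small or heavily tied samples would need a guard against division by zero.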
Supplementary information
Supplementary Table 1
P values from one-sided Brunner–Munzel tests and Cohen’s d are calculated using (i) all IDRs in the pLDDT-based set, (ii) only IDRs with fdomain = 0 in the pLDDT-based set, and (iii) all IDRs in the SPOT-based set. The file reports only Gene Ontology “molecular function” terms with P < 10⁻⁵ for each of the three sets of IDRs for the association with either ν, Sconf/N, or both. The file has four tabs: (i) asymptotic P values calculated using t-distributions, (ii) Cohen’s d with standard errors estimated via bootstrapping, (iii) P values obtained via bootstrapping, and (iv) P values obtained via bootstrapping where we sampled with replacement both sets using the sample size of the smaller set.
Supplementary Table 2
P values from one-sided Brunner–Munzel tests and Cohen’s d are calculated using (i) all IDRs in the pLDDT-based set, (ii) only IDRs with fdomain = 0 in the pLDDT-based set, and (iii) all IDRs in the SPOT-based set. The file reports only Gene Ontology “cellular component” terms with P < 10⁻⁵ for each of the three sets of IDRs for the association with either ν, Sconf/N, or both. The file has four tabs: (i) asymptotic P values calculated using t-distributions, (ii) Cohen’s d with standard errors estimated via bootstrapping, (iii) P values obtained via bootstrapping, and (iv) P values obtained via bootstrapping where we sampled with replacement both sets using the sample size of the smaller set.
Supplementary Table 3
Sequence length; ν; Sconf/N; ⟨Ree²⟩/⟨Rg²⟩; S; Δ; ⟨λ⟩; z(δ+−); z(Ωπ); z(Ω−); SHD; SCD; κ; FCR; NCPR; Booleans indicating whether the IDR is N-terminal, C-terminal, between folded domains, or an IDP; QCDPred score; fdomain; and amino acid sequence for the 28,058 IDRs in the pLDDT-based set. IDP, fully intrinsically disordered protein.
Supplementary Table 4
List of digital object identifiers (DOIs) of research articles reporting small-angle X-ray scattering and Förster resonance energy transfer (FRET) data used to validate the CALVADOS model as shown in Extended Data Fig. 3.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tesei, G., Trolle, A.I., Jonsson, N. et al. Conformational ensembles of the human intrinsically disordered proteome. Nature 626, 897–904 (2024). https://doi.org/10.1038/s41586-023-07004-5
DOI: https://doi.org/10.1038/s41586-023-07004-5