  • Article
Conformational ensembles of the human intrinsically disordered proteome

Abstract

Intrinsically disordered proteins and regions (collectively, IDRs) are pervasive across proteomes in all kingdoms of life, help to shape biological functions and are involved in numerous diseases. IDRs populate a diverse set of transiently formed structures and defy conventional sequence–structure–function relationships (ref. 1). Developments in protein science have made it possible to predict the three-dimensional structures of folded proteins at the proteome scale (ref. 2). By contrast, there is a lack of knowledge about the conformational properties of IDRs, partly because the sequences of disordered proteins are poorly conserved and also because only a few of these proteins have been characterized experimentally. The inability to predict structural properties of IDRs across the proteome has limited our understanding of the functional roles of IDRs and how evolution shapes them. As a supplement to previous structural studies of individual IDRs (ref. 3), we developed an efficient molecular model to generate conformational ensembles of IDRs and thereby to predict their conformational properties from sequences (refs. 4,5). Here we use this model to simulate nearly all of the IDRs in the human proteome. Examining conformational ensembles of 28,058 IDRs, we show how chain compaction is correlated with cellular function and localization. We provide insights into how sequence features relate to chain compaction and, using a machine-learning model trained on our simulation data, show the conservation of conformational properties across orthologues. Our results recapitulate observations from previous studies of individual protein systems and exemplify how to link, at the proteome scale, conformational ensembles with cellular function and localization, amino acid sequence, evolutionary conservation and disease variants. Our freely available database of conformational properties will encourage further experimental investigation and enable the generation of hypotheses about the biological roles and evolution of IDRs.

Fig. 1: Schematic illustration of the approach used to obtain conformational properties for all of the IDRs in the human proteome.
Fig. 2: Analysis of the association between IDR compaction and GO annotations.
Fig. 3: Sequence features that determine compaction.
Fig. 4: Conservation of conformational and sequence properties.
Fig. 5: Proteome-wide search for conformational buffering, and relationship between conformational properties of IDRs and incidence of pathogenic missense variants.

Data availability

Simulation trajectories and precalculated conformational properties for the 28,058 IDRs of the pLDDT-based set are available at https://sid.erda.dk/cgi-sid/ls.py?share_id=AVZAJvJnCO, which also includes a full list of amino acid sequences, sequence features and conformational properties of the 28,058 IDRs in the pLDDT-based set (Supplementary Table 3). In the online database, directories are organized by splitting UniProt IDs into two-character units (XX/XX/XX), followed by an additional four-character unit for UniProt IDs of ten characters (XX/XX/XX/XXXX). For example, data for the IDRs of UniProt ID A0A494C086 can be accessed by entering IDRome/A0/A4/94/C086 in the Filter paths. Raw data to reproduce the results presented in this work are available at https://github.com/KULL-Centre/_2023_Tesei_IDRome and have been deposited at Zenodo (https://doi.org/10.5281/zenodo.10251736).
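
The directory layout described above can be sketched as a small helper. This is an illustrative sketch rather than code from the authors' repository: the function name `idrome_path` is hypothetical, and the `IDRome` root is taken only from the single example given in the text.

```python
def idrome_path(uniprot_id: str, root: str = "IDRome") -> str:
    """Build a database path from a UniProt ID using the XX/XX/XX(/XXXX) layout.

    Six-character IDs are split into three two-character units; ten-character
    IDs get an additional four-character unit. The default root directory is
    an assumption based on the example in the Data availability statement.
    """
    uid = uniprot_id.strip().upper()
    if len(uid) == 6:
        units = [uid[0:2], uid[2:4], uid[4:6]]
    elif len(uid) == 10:
        units = [uid[0:2], uid[2:4], uid[4:6], uid[6:10]]
    else:
        raise ValueError(f"expected a 6- or 10-character UniProt ID, got {uid!r}")
    return "/".join([root, *units])


# The example from the text: prints IDRome/A0/A4/94/C086
print(idrome_path("A0A494C086"))
```

The resulting string is what would be entered in the Filter paths field of the online database; for a six-character ID the same rule simply yields three units.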

Window-averaged pLDDT scores were obtained from Supplementary Table 1 of a previous study (ref. 2). SPOT-Disorder v.1 predictions were obtained from https://github.com/IPritisanac/AF2.IDR (ref. 7). Orthologous sequences were retrieved from the OMA database (Nov 2022 release). UniProt releases 2021_04, 2022_05 and 2023_03 were accessed as described previously (ref. 62) and as detailed in the Methods. AlphaFold predicted aligned errors were retrieved from https://alphafold.ebi.ac.uk/ (refs. 97,98). The Gene Ontology Data Archive version 2023-11-15 (https://doi.org/10.5281/zenodo.10162580) was obtained from https://current.geneontology.org/ontology/go-basic.obo. UniProt IDs of proteins in membraneless organelles were retrieved from CD-CODE v.1_04 (ref. 27). Missense variants were obtained from ClinVar (release 2020-11). Frameshift variants were retrieved from Zenodo (https://doi.org/10.5281/zenodo.7311150; ref. 56). The number of publications per gene was obtained from Find My Understudied Genes (v.1.0.0; https://fmug.amaral.northwestern.edu/).

Code availability

Custom code to reproduce the results and figures of this study is available at https://github.com/KULL-Centre/_2023_Tesei_IDRome and at Zenodo (https://doi.org/10.5281/zenodo.10251736). CALVADOS is available at Zenodo (https://doi.org/10.5281/zenodo.6914053) and can be run using Google Colab through https://github.com/KULL-Centre/_2023_Tesei_IDRome/blob/main/IDRLab.ipynb.

References

  1. Holehouse, A. S. & Kragelund, B. B. The molecular basis for cellular function of intrinsically disordered protein regions. Nat. Rev. Mol. Cell Biol. https://doi.org/10.1038/s41580-023-00673-0 (2023).

  2. Akdel, M. et al. A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067 (2022).

  3. Ghafouri, H. et al. PED in 2024: improving the community deposition of structural ensembles for intrinsically disordered proteins. Nucleic Acids Res. 52, D536–D544 (2024).

  4. Tesei, G., Schulze, T. K., Crehuet, R. & Lindorff-Larsen, K. Accurate model of liquid–liquid phase behavior of intrinsically disordered proteins from optimization of single-chain properties. Proc. Natl Acad. Sci. USA 118, e2111696118 (2021).

  5. Tesei, G. & Lindorff-Larsen, K. Improved predictions of phase behaviour of intrinsically disordered proteins by tuning the interaction range. Open Res. Europe 2, 94 (2023).

  6. Ruff, K. M. & Pappu, R. V. AlphaFold and implications for intrinsically disordered proteins. J. Mol. Biol. 433, 167208 (2021).

  7. Alderson, T. R., Pritišanac, I., Kolarić, D., Moses, A. M. & Forman-Kay, J. D. Systematic identification of conditionally folded intrinsically disordered regions by AlphaFold2. Proc. Natl Acad. Sci. USA 120, e2304302120 (2023).

  8. Piovesan, D., Monzon, A. M. & Tosatto, S. C. E. Intrinsic protein disorder and conditional folding in AlphaFoldDB. Protein Sci. 31, e4466 (2022).

  9. Brotzakis, Z. F., Zhang, S. & Vendruscolo, M. AlphaFold prediction of structural ensembles of disordered proteins. Preprint at bioRxiv https://doi.org/10.1101/2023.01.19.524720 (2023).

  10. Thomasen, F. E. & Lindorff-Larsen, K. Conformational ensembles of intrinsically disordered proteins and flexible multidomain proteins. Biochem. Soc. Trans. 50, 541–554 (2022).

  11. Das, R. K., Huang, Y., Phillips, A. H., Kriwacki, R. W. & Pappu, R. V. Cryptic sequence features within the disordered protein p27Kip1 regulate cell cycle signaling. Proc. Natl Acad. Sci. USA 113, 5616–5621 (2016).

  12. Martin, E. W. et al. Valence and patterning of aromatic residues determine the phase behavior of prion-like domains. Science 367, 694–699 (2020).

  13. González-Foutel, N. S. et al. Conformational buffering underlies functional selection in intrinsically disordered protein regions. Nat. Struct. Mol. Biol. 29, 781–790 (2022).

  14. Lindorff-Larsen, K. & Kragelund, B. B. On the potential of machine learning to examine the relationship between sequence, structure, dynamics and function of intrinsically disordered proteins. J. Mol. Biol. 433, 167196 (2021).

  15. Zheng, W. et al. Inferring properties of disordered chains from FRET transfer efficiencies. J. Chem. Phys. 148, 123329 (2018).

  16. Sherry, K. P., Das, R. K., Pappu, R. V. & Barrick, D. Control of transcriptional activity by design of charge patterning in the intrinsically disordered RAM region of the Notch receptor. Proc. Natl Acad. Sci. USA 114, E9243–E9252 (2017).

  17. Riback, J. A. et al. Stress-triggered phase separation is an adaptive, evolutionarily tuned response. Cell 168, 1028–1040 (2017).

  18. Bremer, A. et al. Deciphering how naturally occurring sequence features impact the phase behaviours of disordered prion-like domains. Nat. Chem. 14, 196–207 (2022).

  19. Ibrahim, A. Y. et al. Intrinsically disordered regions that drive phase separation form a robustly distinct protein class. J. Biol. Chem. 299, 102801 (2023).

  20. Teilum, K., Olsen, J. G. & Kragelund, B. B. Globular and disordered—the non-identical twins in protein-protein interactions. Front. Mol. Biosci. 2, 40 (2015).

  21. Staller, M. V. et al. Directed mutational scanning reveals a balance between acidic and hydrophobic residues in strong human activation domains. Cell Syst. 13, 334–345 (2022).

  22. Sabari, B. R., Dall’Agnese, A. & Young, R. A. Biomolecular condensates in the nucleus. Trends Biochem. Sci. 45, 961–977 (2020).

  23. Diner, I. et al. Aggregation properties of the small nuclear ribonucleoprotein U1-70K in Alzheimer disease. J. Biol. Chem. 289, 35296–35313 (2014).

  24. Greig, J. A. et al. Arginine-enriched mixed-charge domains provide cohesion for nuclear speckle condensation. Mol. Cell 77, 1237–1250 (2020).

  25. Chang, F. T. M. et al. PML bodies provide an important platform for the maintenance of telomeric chromatin integrity in embryonic stem cells. Nucleic Acids Res. 41, 4447–4458 (2013).

  26. Lyons, H. et al. Functional partitioning of transcriptional regulators by patterned charge blocks. Cell 186, 327–345 (2023).

  27. Rostam, N. et al. CD-CODE: crowdsourcing condensate database and encyclopedia. Nat. Methods 20, 673–676 (2023).

  28. Nott, T. J. et al. Phase transition of a disordered nuage protein generates environmentally responsive membraneless organelles. Mol. Cell 57, 936–947 (2015).

  29. Pak, C. W. et al. Sequence determinants of intracellular phase separation by complex coacervation of a disordered protein. Mol. Cell 63, 72–85 (2016).

  30. Das, R. K. & Pappu, R. V. Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proc. Natl Acad. Sci. USA 110, 13392–13397 (2013).

  31. Sawle, L. & Ghosh, K. A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins. J. Chem. Phys. 143, 085101 (2015).

  32. Zarin, T. et al. Proteome-wide signatures of function in highly diverged intrinsically disordered regions. eLife 8, e46883 (2019).

  33. Zheng, W., Dignon, G., Brown, M., Kim, Y. C. & Mittal, J. Hydropathy patterning complements charge patterning to describe conformational preferences of disordered proteins. J. Phys. Chem. Lett. 11, 3408–3415 (2020).

  34. Huihui, J. & Ghosh, K. Intrachain interaction topology can identify functionally similar intrinsically disordered proteins. Biophys. J. 120, 1860–1868 (2021).

  35. Yamazaki, H., Takagi, M., Kosako, H., Hirano, T. & Yoshimura, S. H. Cell cycle-specific phase separation regulated by protein charge blockiness. Nat. Cell Biol. 24, 625–632 (2022).

  36. Mao, A. H., Crick, S. L., Vitalis, A., Chicoine, C. L. & Pappu, R. V. Net charge per residue modulates conformational ensembles of intrinsically disordered proteins. Proc. Natl Acad. Sci. USA 107, 8183–8188 (2010).

  37. Lin, Y.-H. & Chan, H. S. Phase separation and single-chain compactness of charged disordered proteins are strongly correlated. Biophys. J. 112, 2043–2046 (2017).

  38. Cohan, M. C., Shinn, M. K., Lalmansingh, J. M. & Pappu, R. V. Uncovering non-random binary patterns within sequences of intrinsically disordered proteins. J. Mol. Biol. 434, 167373 (2022).

  39. DelRosso, N. et al. Large-scale mapping and mutagenesis of human transcriptional effector domains. Nature 616, 365–372 (2023).

  40. Johansson, K. E., Mashahreh, B., Hartmann-Petersen, R., Ravid, T. & Lindorff-Larsen, K. Prediction of quality-control degradation signals in yeast proteins. J. Mol. Biol. 435, 167915 (2023).

  41. Ruff, K. M. in Intrinsically Disordered Proteins (eds Kragelund, B. B. & Skriver, K.) Ch. 18, 347–389 (Springer, 2020).

  42. Lotthammer, J. M. et al. Direct prediction of intrinsically disordered protein conformational properties from sequences. Nat. Methods https://doi.org/10.1038/s41592-023-02159-5 (2024).

  43. Wang, J. et al. A molecular grammar governing the driving forces for phase separation of prion-like RNA binding proteins. Cell 174, 688–699 (2018).

  44. Fisher, R. S. & Elbaum-Garfinkle, S. Tunable multiphase dynamics of arginine and lysine liquid condensates. Nat. Commun. 11, 4628 (2020).

  45. Schmidt, H. B., Barreau, A. & Rohatgi, R. Phase separation-deficient TDP43 remains functional in splicing. Nat. Commun. 10, 4890 (2019).

  46. Sørensen, C. S. & Kjaergaard, M. Effective concentrations enforced by intrinsically disordered linkers are governed by polymer physics. Proc. Natl Acad. Sci. USA 116, 23124–23131 (2019).

  47. Hantschel, O. et al. A myristoyl/phosphotyrosine switch regulates c-Abl. Cell 112, 845–857 (2003).

  48. Harmon, T. S., Holehouse, A. S., Rosen, M. K. & Pappu, R. V. Intrinsically disordered linkers determine the interplay between phase separation and gelation in multivalent proteins. eLife 6, e30294 (2017).

  49. Asselin, L. et al. Mutations in the KIF21B kinesin gene cause neurodevelopmental disorders through imbalanced canonical motor activity. Nat. Commun. 11, 2441 (2020).

  50. Ahmed, S. B. M. & Prigent, S. A. Insights into the Shc family of adaptor proteins. J. Mol. Signal. 12, 2 (2017).

  51. Manetti, F. LIM kinases are attractive targets with many macromolecular partners and only a few small molecule regulators. Med. Res. Rev. 32, 968–998 (2011).

  52. Mégarbané, H. et al. An autosomal-recessive form of cutis laxa is due to homozygous elastin mutations, and the phenotype may be modified by a heterozygous fibulin 5 polymorphism. J. Invest. Dermatol. 129, 1650–1655 (2009).

  53. Langenhove, T. V. et al. Genetic contribution of FUS to frontotemporal lobar degeneration. Neurology 74, 366–371 (2010).

  54. Liu, Q. et al. Whole-exome sequencing identifies a missense mutation in hnRNPA1 in a family with flail arm ALS. Neurology 87, 1763–1769 (2016).

  55. Schmidt, A. et al. Predicting the pathogenicity of missense variants using features derived from AlphaFold2. Bioinformatics 39, btad280 (2023).

  56. Mensah, M. A. et al. Aberrant phase separation and nucleolar dysfunction in rare genetic diseases. Nature 614, 564–571 (2023).

  57. Banani, S. F. et al. Genetic variation associated with condensate dysregulation in disease. Dev. Cell 57, 1776–1788 (2022).

  58. Rocha, J. J. et al. Functional unknomics: systematic screening of conserved genes of unknown function. PLOS Biol. 21, e3002222 (2023).

  59. Richardson, R. A. K., Navarro, H. T., Amaral, L. A. N. & Stoeger, T. Meta-research: understudied genes are lost in a leaky pipeline between genome-wide assays and reporting of results. eLife 12, RP93429 (2023).

  60. Janson, G., Valdes-Garcia, G., Heo, L. & Feig, M. Direct generation of protein conformational ensembles via machine learning. Nat. Commun. 14, 774 (2023).

  61. Le Mercier, P. et al. SwissBioPics—an interactive library of cell images for the visualization of subcellular location data. Database 2022, baac026 (2022).

  62. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).

  63. Hanson, J., Yang, Y., Paliwal, K. & Zhou, Y. Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks. Bioinformatics 33, 685–692 (2016).

  64. Hekkelman, M. L., de Vries, I., Joosten, R. P. & Perrakis, A. AlphaFill: enriching AlphaFold models with ligands and cofactors. Nat. Methods 20, 205–213 (2022).

  65. Anderson, J. A., Glaser, J. & Glotzer, S. C. HOOMD-blue: a Python package for high-performance molecular dynamics and hard particle Monte Carlo simulations. Comput. Mater. Sci. 173, 109363 (2020).

  66. Flyvbjerg, H. & Petersen, H. G. Error estimates on averages of correlated data. J. Chem. Phys. 91, 461–466 (1989).

  67. Borgia, A. et al. Consistent view of polypeptide chain expansion in chemical denaturants from multiple experimental methods. J. Am. Chem. Soc. 138, 11714–11726 (2016).

  68. Aronovitz, J. & Nelson, D. Universal features of polymer shapes. Journal de Physique 47, 1445–1456 (1986).

  69. Hensen, U., Gräter, F. & Henchman, R. H. Macromolecular entropy can be accurately computed from force. J. Chem. Theory Comput. 10, 4777–4781 (2014).

  70. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  71. McGibbon, R. T. et al. MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys. J. 109, 1528–1532 (2015).

  72. Holehouse, A. S., Das, R. K., Ahad, J. N., Richardson, M. O. & Pappu, R. V. CIDER: resources to analyze sequence-ensemble relationships of intrinsically disordered proteins. Biophys. J. 112, 16–21 (2017).

  73. Montepietra, D. et al. FRETpredict: a Python package for FRET efficiency predictions using rotamer libraries. Preprint at bioRxiv https://doi.org/10.1101/2023.01.27.525885 (2023).

  74. Rotkiewicz, P. & Skolnick, J. Fast procedure for reconstruction of full-atom protein models from reduced representations. J. Comput. Chem. 29, 1460–1465 (2008).

  75. Fuertes, G. et al. Decoupling of size and shape fluctuations in heteropolymeric sequences reconciles discrepancies in SAXS vs. FRET measurements. Proc. Natl Acad. Sci. USA 114, E6342–E6351 (2017).

  76. Gomes, G.-N. W. et al. Conformational ensembles of an intrinsically disordered protein consistent with NMR, SAXS, and single-molecule FRET. J. Am. Chem. Soc. 142, 15697–15710 (2020).

  77. Humphrey, W., Dalke, A. & Schulten, K. VMD—Visual Molecular Dynamics. J. Mol. Graph. 14, 33–38 (1996).

  78. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).

  79. The Gene Ontology Consortium. The Gene Ontology knowledgebase in 2023. Genetics 224, iyad031 (2023).

  80. Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using NetworkX. In Proc. 7th Python in Science Conference (eds Varoquaux, G., Vaught, T. & Millman, J.) 11–15 (2008).

  81. Brunner, E. & Munzel, U. The nonparametric Behrens–Fisher problem: asymptotic theory and a small-sample approximation. Biom. J. 42, 17–25 (2000).

  82. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

  83. Thulin, M. Modern Statistics with R (Eos Chasma Press, 2021).

  84. Noguchi, K., Konietschke, F., Marmolejo-Ramos, F. & Pauly, M. Permutation tests are robust and powerful at 0.5% and 5% significance levels. Behav. Res. Meth. 53, 2712–2724 (2021).

  85. Mashahreh, B. et al. Conserved degronome features governing quality control associated proteolysis. Nat. Commun. 13, 7588 (2022).

  86. Koren, I. et al. The eukaryotic proteome is shaped by E3 ubiquitin ligases targeting C-terminal degrons. Cell 173, 1622–1635 (2018).

  87. Chang, Y.-W., Hsieh, C.-J., Chang, K.-W., Ringgaard, M. & Lin, C.-J. Training and testing low-degree polynomial data mappings via linear SVM. J. Mach. Learn. Res. 11, 1471–1490 (2010).

  88. Schölkopf, B., Smola, A., Williamson, R. & Bartlett, P. New support vector algorithms. Neural Comput. 12, 1207–1245 (2000).

  89. Breiman, L. Random forests. Machine Learning 45, 5–32 (2001).

  90. Chao, T.-H., Rekhi, S., Mittal, J. & Tabor, D. P. Data-driven models for predicting intrinsically disordered protein polymer physics directly from composition or sequence. Mol. Syst. Des. Eng. 8, 1146–1155 (2023).

  91. Dignon, G. L., Zheng, W., Kim, Y. C., Best, R. B. & Mittal, J. Sequence determinants of protein phase behavior from a coarse-grained model. PLOS Comput. Biol. 14, e1005941 (2018).

  92. Lu, A. X. et al. Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning. PLOS Comput. Biol. 18, e1010238 (2022).

  93. Altenhoff, A. M. et al. OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. 49, D373–D379 (2020).

  94. Katoh, K. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).

  95. Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2017).

  96. Tiemann, J. K. S., Zschach, H., Lindorff-Larsen, K. & Stein, A. Interpreting the molecular mechanisms of disease variants in human transmembrane proteins. Biophys. J. 122, 2176–2191 (2023).

  97. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  98. Varadi, M. et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2021).

Acknowledgements

We thank T. Mittag, X. Salvatella, J. Riback, and R. Pappu for comments and suggestions, and A. Holehouse for discussions and sharing work before publication. We acknowledge access to computational resources from the Biocomputing Core Facility at the Department of Biology, University of Copenhagen, from the Resource for Biomolecular Simulations (ROBUST; supported by the Novo Nordisk Foundation; NNF18OC0032608), and the Danish National Supercomputer for Life Sciences (Computerome). This work is a contribution from the PRISM (Protein Interactions and Stability in Medicine and Genomics) centre funded by the Novo Nordisk Foundation (to K.L.-L.; NNF18OC0033950). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 101025063.

Author information

Contributions

G.T., K.E.J. and K.L.-L. conceived and planned the study. G.T. and A.I.T. performed and analysed molecular simulations and performed GO analyses. N.J. generated the set of missense variants. G.T. and J.B. developed the machine-learning models. G.T. and F.E.K. analysed orthologous sequences. F.P. developed the Google Colab notebooks. K.E.J. generated the sets of human and orthologous sequences. G.T. performed the other analyses with input from N.J., F.P., K.E.J. and K.L.-L. G.T. and K.L.-L. wrote the manuscript with contributions from all authors. K.L.-L. supervised the study.

Corresponding authors

Correspondence to Giulio Tesei or Kresten Lindorff-Larsen.

Ethics declarations

Competing interests

K.L.-L. holds stock options in and is a consultant for Peptone. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature thanks David De Sancho and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Analyses of pLDDT scores and UniProt annotations.

a, Distributions of the window-averaged pLDDT score of residues in the pLDDT-based set (grey), in the SPOT-based set (red), and in IDRs from 531 proteins unique to the SPOT-based set (magenta). b,c, Window-averaged pLDDT scores in (b) all IDRs and in (c) IDRs between folded domains with ν ≤ 0.45 (orange), 0.45 < ν ≤ 0.55 (cyan), and ν > 0.55 (teal). Hatched and open violin plots show results for the pLDDT-based and SPOT-based sets, respectively. Horizontal lines show medians and quartiles calculated over (b) n = 161,748; 88,325; 3,278,851; 2,442,776; 511,556; 429,591 and (c) n = 13,988; 8,534; 374,190; 275,164; 92,770; 66,276 pLDDT scores. d,e, Fraction of IDRs (d) and average sequence length of IDRs (e) as a function of the fraction of residues within UniProt “Zinc finger”, “Coiled coil”, “Transmembrane”, and “Domain” annotations longer than ten amino acids, fdomain. Results for the pLDDT-based (6,204 IDRs with fdomain > 0, 22%) and SPOT-based sets (4,095 IDRs with fdomain > 0, 19%) are shown in grey and red, respectively. f–i, fdomain averaged over (f) all IDRs (n = 616; 361; 18,546; 13,292; 8,896; and 7,730), (g) IDRs between folded domains (n = 91; 56; 2,470; 1,613; 1,777; and 1,214), (h) N- and C-terminal IDRs (n = 415; 203; 12,144; 6,531; 4,776; and 3,601), and (i) IDPs (n = 43; 12; 1,346; 517; 74; and 50) with ν ≤ 0.45 (orange), 0.45 < ν ≤ 0.55 (cyan), and ν > 0.55 (teal). Hatched and empty bars show results for the pLDDT-based and SPOT-based sets, respectively. Data in e–i are displayed as mean ± s.e.m. j, Comparison between the number of residues within UniProt “Zinc finger”, “Coiled coil”, “Transmembrane”, and “Domain” annotations longer than ten amino acids in the pLDDT-based (grey) and SPOT-based (red) sets of IDRs, and in the full-length proteins of both sets (blue). IDP, fully intrinsically disordered protein.

Extended Data Fig. 2 Representative intra-chain energy maps, and distributions of the conformational parameters asphericity, Δ, and prolateness, S, for a selection of IDRs from our database.

The UniProt IDs of the IDRs are (a) P08621, (b) P06748, (c) Q5VUA4, (d) Q6PD62, (e) Q15648, (f) Q13148, (g) P98177, (h) Q7KZ85, (i) P06454, (j) Q5BKZ1, (k) O60763, and (l) Q99733. The data are averaged over five ensembles of 1,000 weakly correlated conformations (Extended Data Fig. 5a), each obtained from an independent molecular dynamics simulation. Top, energy maps are calculated using the non-ionic Ashbaugh–Hatch potential of the CALVADOS 2 model. Bottom, lines and vertical shaded areas show distributions and confidence intervals of the ensemble averages estimated from the standard deviation of Δ (light shade) and S (dark shade) over five independent simulations.

Extended Data Fig. 3 Validation of the CALVADOS model.

a, Relative difference between predicted and experimental radii of gyration for a set of 64 proteins that were not used in training the CALVADOS 2 model, and for Sic1. Bars show \(({R}_{{\rm{g}}}^{{\rm{calc}}}-{R}_{{\rm{g}}}^{\exp })/{R}_{{\rm{g}}}^{\exp }\) where \({R}_{{\rm{g}}}^{{\rm{calc}}}\) is averaged over n = 5 independent simulation replicas. Data are displayed as the mean ± s.d. b, Correlation between \({R}_{{\rm{g}}}^{\exp }\) and \({R}_{{\rm{g}}}^{{\rm{calc}}}\) for the same proteins shown in a. c, Correlation of predicted and experimental average FRET efficiencies, EFRET, for a set of 64 proteins that were not used in training the CALVADOS 2 model, and for Sic1. d–k, Comparison between Rg distributions from CALVADOS simulations (colour) and from ensembles from the Protein Ensemble Database (black) for (d) Nup49, (e) NLS, (f) NUS, (g) IBB, (h) NUL, (i) Sic1, (j) KISS1 and (k) NWASPV. Vertical lines show the average radius of gyration from simulations (colour), from the Protein Ensemble Database (solid black), and from experiments (dashed black); we note that some of the ensembles in the Protein Ensemble Database may have been generated by so-called reweighting methods, but that the ensembles in the database correspond to the unweighted structures. In Supplementary Table 4, we provide the full list of references for the experimental Rg and EFRET values shown in this figure.

Extended Data Fig. 4 Correlation between the variance of residue–residue distances in CALVADOS simulations and the confidence of the relative positions of residue pairs in AlphaFold predictions.

a–d, Standard deviation (s.d.) of intra-chain distances calculated from simulations (a,c) and AlphaFold predicted aligned errors (PAE) (b,d) for IDRs of nucleophosmin (UniProt ID P06748) (a,b) and transcription elongation factor SPT6 (UniProt ID Q7KZ85) (c,d). e–l, PAE as a function of s.d. for IDRs in proteins with UniProt IDs (e) P08621, (f) P06748, (g) Q5VUA4, (h) Q6PD62, (i) Q15648, (j) Q13148, (k) P98177, and (l) Q7KZ85. Pearson correlation coefficients, r, and linear fits (solid lines) are calculated for separations along the linear sequence, |i − j|, in the range |i − j|min ≤ |i − j| ≤ N, where |i − j|min = 4 and N is the sequence length. m–t, Pearson r for the correlation between PAE and s.d. as a function of the shortest separation along the linear chain, |i − j|min, included in the analysis; data are shown for IDRs in proteins with UniProt IDs (m) P08621, (n) P06748, (o) Q5VUA4, (p) Q6PD62, (q) Q15648, (r) Q13148, (s) P98177, and (t) Q7KZ85.
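
The analysis in panels m–t can be sketched roughly as follows: for each choice of the shortest sequence separation, residue pairs at or beyond that separation are selected and the Pearson r between the distance s.d. and the PAE is computed. This is an illustrative sketch on synthetic matrices; the function name is hypothetical, and symmetrizing the PAE matrix and using only the upper triangle are assumptions rather than the procedure confirmed by the source.

```python
import numpy as np

def pearson_vs_min_separation(sd_map, pae_map, min_seps):
    """Pearson r between distance s.d. and PAE for pairs with |i - j| >= min_sep."""
    sd_map = np.asarray(sd_map, dtype=float)
    pae_map = np.asarray(pae_map, dtype=float)
    pae_sym = 0.5 * (pae_map + pae_map.T)   # assumption: average PAE(i,j) and PAE(j,i)
    i, j = np.triu_indices(sd_map.shape[0], k=1)
    sep = j - i                              # sequence separation of each pair
    result = {}
    for m in min_seps:
        mask = sep >= m
        result[m] = np.corrcoef(sd_map[i[mask], j[mask]], pae_sym[i[mask], j[mask]])[0, 1]
    return result

# Synthetic stand-ins for a 60-residue IDR (not real simulation or AlphaFold output)
rng = np.random.default_rng(0)
idx = np.arange(60)
separation = np.abs(idx[:, None] - idx[None, :])
sd_demo = np.sqrt(separation)
pae_demo = 3.0 * np.sqrt(separation) + rng.normal(0.0, 1.0, size=(60, 60))
r_by_min_sep = pearson_vs_min_separation(sd_demo, pae_demo, min_seps=[4, 10, 20])
print(r_by_min_sep)
```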

Extended Data Fig. 5 Sampling, distributions and dependence on the location within the full-length protein of conformational properties in the pLDDT-based set of IDRs.

a, Values of the autocorrelation function (ACF) of the Rg for lag times of one (grey), two (red), and three (blue) frames as a function of sequence length. Data are displayed as mean ± s.d. over n = 5 independent simulation replicas. b, S.d. of the apparent Flory scaling exponent, ν, as a function of ν for proteins of different sequence length. c, S.d. of the ratio of the mean-squared end-to-end distance and the mean-squared radius of gyration, \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \), as a function of \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \) for proteins of different sequence length. d, S.d. of the conformational entropy per residue, Sconf/N, as a function of Sconf/N for proteins of different sequence length. S.d. values are calculated over n = 5 independent simulation replicas. Dotted lines show the average s.d. e, Correlation between \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \) calculated from simulation trajectories and the approximate relation \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle =(2\nu +1)\times (2\nu +2)\) (ref. 15). f,g, Distributions of (f) \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \) and (g) the conformational entropy per residue, Sconf/N, for IDRs in the human proteome; note the logarithmic scale. h–j, Correlation between (h) asphericity, Δ, and ν; (i) prolateness, S, and ν; and (j) Sconf/N and ν. r is the Pearson correlation coefficient. k, Normalized distributions of the absolute difference between ν values, Δν, for IDRs from different proteins (grey), within the same protein (blue), and separated by folded domains within the same protein (teal). Vertical bars show mean ± s.e.m. of Δν. l,m, Distributions of (l) ν and (m) Sconf/N for IDRs in the different categories of localization within the full-length protein: N-terminal (32.2%), C-terminal (24.8%), between folded domains (15.5%), and IDPs (5.2%). For this analysis we annotated IDRs to be located between folded domains if they are preceded and followed by at least 100 residues which do not overlap with any of the IDRs identified in this work. IDP, fully intrinsically disordered protein.
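
The approximate relation in panel e can be checked numerically for an idealized chain. The sketch below assumes that mean-squared intra-chain distances follow a pure power law, ⟨R_ij²⟩ = b²|i − j|^(2ν), and uses ⟨Rg²⟩ = (1/N²) Σ_{i<j} ⟨R_ij²⟩. The function names, the chain length and the segment length b are illustrative assumptions; this is not the analysis code of the study.

```python
import numpy as np

def ratio_from_scaling(nu, n_res=2000, b=1.0):
    """<Ree^2>/<Rg^2> for a chain with <R_ij^2> = b^2 |i - j|^(2 nu)."""
    sep = np.arange(1, n_res)          # sequence separations 1 .. N-1
    n_pairs = n_res - sep              # number of residue pairs at each separation
    r2 = b**2 * sep ** (2.0 * nu)      # mean-squared distance at each separation
    rg2 = np.sum(n_pairs * r2) / n_res**2
    ree2 = b**2 * (n_res - 1) ** (2.0 * nu)
    return ree2 / rg2

def ratio_approx(nu):
    """Approximate relation quoted in the legend: (2 nu + 1)(2 nu + 2)."""
    return (2.0 * nu + 1.0) * (2.0 * nu + 2.0)

for nu in (0.4, 0.5, 0.6):
    print(nu, ratio_from_scaling(nu), ratio_approx(nu))
```

For ν = 0.5 the relation gives the ideal-chain value of 6, and the discrete sum approaches it as the chain length grows.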

Extended Data Fig. 6 Analysis of the CD-CODE database, and relationships between sequence and conformational properties for IDRs in the pLDDT-based and SPOT-based sets.

a,b, Analysis of the association between IDR conformational properties and protein localization in membraneless organelles (as reported in the CD-CODE database27). Distributions of (a) ν and (b) Sconf/N for IDRs in ‘driver’ proteins (shaded bars) and in proteins that are not a part of the examined subset (black lines). c,d, Distributions of (c) ν and (d) Sconf/N for IDRs in ‘member’ proteins (shaded bars) and in proteins that are not a part of the examined subset (black lines). Histograms for proteins enriched in compact and expanded IDRs are shown in orange and teal, respectively. P values are estimated from one-sided Brunner–Munzel tests using t-distributions and the reported degrees of freedom (DoF). Standard errors of Cohen’s d values are estimated through 105 bootstraps. e–q, Comparison between sequence features that affect compaction in pLDDT-based and SPOT-based sets of IDRs. We show NARDINI z-scores for (e) basic–acidic patterning, z(δ+−), (f) acidic patterning, z(Ω−), and (g) aromatic patterning, z(Ωπ); (h) sequence charge decoration, SCD; (l) charge segregation, κ; (m) Sconf/N; (n) sequence hydropathy decoration, SHD; (o) average stickiness, λ; (p) fraction of charged residues, FCR; and (q) sequence length, N, as a function of ν. i–k, (i) z(δ+−), (j) z(Ω−), and (k) z(Ωπ) as a function of Sconf/N. Results are shown for all IDRs in the pLDDT-based (grey) and SPOT-based (red) sets, and for IDRs with fdomain = 0 in the pLDDT-based set (blue). r,s, Short IDRs are on average more highly charged and expanded than longer IDRs. (r) ν and (s) Sconf/N as a function of sequence length, N, for the human IDRs in the pLDDT-based (grey) and SPOT-based (red) sets. Data are displayed as mean ± s.e.m. t, Normalized distributions of the NCPR for IDRs with N ≤ 200 (full lines) and with N > 200 (dotted lines) in the pLDDT-based (grey) and SPOT-based (red) sets.
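
To illustrate the type of statistics reported in panels a–d, the sketch below runs a one-sided Brunner–Munzel test with the t-distribution using `scipy.stats.brunnermunzel` and estimates a bootstrap standard error for Cohen's d. The pooled-variance definition of Cohen's d, the number of bootstrap resamples, the helper names and the synthetic ν values are all assumptions for illustration; they are not taken from the study.

```python
import numpy as np
from scipy.stats import brunnermunzel

def cohens_d(x, y):
    """Cohen's d with a pooled standard deviation (assumed definition)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def bootstrap_se_cohens_d(x, y, n_boot=1000, seed=0):
    """Standard error of Cohen's d from resampling both groups with replacement."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ds = [cohens_d(rng.choice(x, len(x)), rng.choice(y, len(y))) for _ in range(n_boot)]
    return float(np.std(ds, ddof=1))

# Synthetic stand-ins: a hypothetical subset with larger values than the remainder
rng = np.random.default_rng(1)
subset = rng.normal(0.56, 0.03, size=200)
remainder = rng.normal(0.52, 0.03, size=2000)

# alternative="greater": tests whether `subset` tends to take larger values
result = brunnermunzel(subset, remainder, alternative="greater", distribution="t")
d = cohens_d(subset, remainder)
d_se = bootstrap_se_cohens_d(subset, remainder)
print(result.pvalue, d, d_se)
```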

Extended Data Fig. 7 Analyses of QCDPred scores and charged residues for IDRs in the pLDDT-based set.

a, QCDPred score (red) and λ (grey) as a function of ν. b, Average QCDPred score as a function of ν and λ. c, QCDPred score (red) and fraction of negatively charged residues, f (black) as a function of ν. d,g, Average fraction of positively charged residues Arg (red) and Lys (black) as a function of (d) ν and (g) Sconf/N. e,h, Average fraction of negatively charged residues Glu (red) and Asp (black) as a function of (e) ν and (h) Sconf/N. f,i, Average net charge per residue, NCPR, as a function of (f) ν and (i) Sconf/N. Data are displayed as mean ± s.e.m. calculated within bins of width Δν = 0.015 and ΔSconf/N = 0.05 kB. The sample sizes in each bin are (a,c–f) n = 4; 3; 5; 3; 5; 6; 9; 11; 8; 16; 14; 26; 23; 28; 41; 49; 98; 135; 212; 473; 854; 1,921; 3,319; 4,724; 5,393; 4,987; 3,405; 1,461; 484; 182; 76; 43; 16; 9; 4; and (g–i) n = 5; 9; 4; 20; 29; 93; 406; 1,290; 2,555; 3,204; 3,520; 3,483; 3,415; 3,005; 2,517; 1,895; 1,208; 686; 320; 161; 90; 49; 29; 20; 12; 6; 8; 4; 3; 3; 3.
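
The binned averages described above (mean ± s.e.m. within bins of fixed width, with per-bin sample sizes) can be sketched as follows. The function name, the choice of the smallest x value as the first bin edge and the synthetic data are assumptions for illustration; the bin placement used in the study is not stated in this legend.

```python
import numpy as np

def binned_mean_sem(x, y, width):
    """Return (bin centre, mean of y, s.e.m. of y, n) for fixed-width bins in x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Assumption: bins start at the smallest x value
    edges = np.arange(x.min(), x.max() + width, width)
    bin_idx = np.digitize(x, edges) - 1
    rows = []
    for b in np.unique(bin_idx):
        vals = y[bin_idx == b]
        sem = vals.std(ddof=1) / np.sqrt(len(vals)) if len(vals) > 1 else float("nan")
        rows.append((edges[b] + width / 2, vals.mean(), sem, len(vals)))
    return rows

# Synthetic stand-ins for ν and a per-IDR sequence feature (not data from this work)
rng = np.random.default_rng(0)
nu_demo = rng.normal(0.52, 0.03, size=5000)
feature_demo = 0.3 - 0.2 * (nu_demo - 0.52) + rng.normal(0.0, 0.05, size=5000)
for centre, mean, sem, n in binned_mean_sem(nu_demo, feature_demo, width=0.015):
    print(f"{centre:.4f}  {mean:.3f} ± {sem:.3f}  (n = {n})")
```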

Extended Data Fig. 8 SVR models for ν and Sconf/N.

a,g, Permutation importance of the sequence features used in the SVR models for the prediction of (a) ν and (g) Sconf/N. b–f, Average sequence features as a function of ν calculated from simulations (grey) and using the SVR model (red); data are shown for (b) sequence hydropathy decoration, SHD; (c) sequence charge decoration, SCD; (d) fraction of charged residues, FCR; (e) average stickiness, λ; and (f) charge segregation, κ. h–j, Average sequence features as a function of Sconf/N calculated from simulations (grey) and using the SVR model (red); data are shown for (h) λ; (i) SHD; and (j) SCD. Data are displayed as mean ± s.e.m. calculated within bins of width Δν = 0.015 and ΔSconf/N = 0.05 kB. The sample sizes in each bin are (b–f) n = 4; 3; 5; 3; 5; 6; 9; 11; 8; 16; 14; 26; 23; 28; 41; 49; 98; 135; 212; 473; 854; 1,921; 3,319; 4,724; 5,393; 4,987; 3,405; 1,461; 484; 182; 76; 43; 16; 9; 4; and (h–j) n = 5; 9; 4; 20; 29; 93; 406; 1,290; 2,555; 3,204; 3,520; 3,483; 3,415; 3,005; 2,517; 1,895; 1,208; 686; 320; 161; 90; 49; 29; 20; 12; 6; 8; 4; 3; 3; 3. k–n, Testing the SVR models. k,m, Correlation between (k) ν and (m) Sconf/N from simulations and corresponding predictions of the SVR models for a held-out test set of 2,795 distinct sequences from the set of 28,058 IDRs identified in this work. l,n, Correlation between (l) ν and (n) Sconf/N values from simulations and corresponding predictions of the SVR models for 611 IDRs in the 531 proteins that are unique to the SPOT-based set of IDRs.
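
The workflow summarized in this legend (fit a support vector regression model on sequence features, evaluate it on a held-out set, and rank features by permutation importance) can be sketched with scikit-learn as below. The data are synthetic, the three feature columns are hypothetical stand-ins, and the kernel, hyperparameters, feature scaling and 10% test fraction are assumptions; none of these reproduces the models trained in the study.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: two informative features and one pure-noise feature
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=600)  # stand-in target

# Hold out 10% of the sequences for testing (assumed split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Assumed model: RBF-kernel SVR on standardized features
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X_train, y_train)

# Pearson r between held-out targets and predictions
pearson_r = np.corrcoef(y_test, model.predict(X_test))[0, 1]

# Drop in the R^2 score when each feature column is shuffled on the held-out set
importance = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
print(pearson_r, importance.importances_mean)
```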

Extended Data Fig. 9 Relationships between the conformational properties of IDRs and the incidence of pathogenic missense and frameshift variants.

a–c, Distributions of (a) the conformational entropy per residue, Sconf/N; (b) the pLDDT scores of the substituted residues; and (c) the scaling exponent, ν, for 11,173 benign (black line) and 1,656 pathogenic (grey bars) variants of IDRs without UniProt domain annotations (fdomain = 0). d,e, Distributions of the difference in (d) ΔSconf,SVR/N and (e) ΔνSVR between variant and wild type for benign (black line) and pathogenic (grey bars) variants of IDRs with fdomain = 0. g,h, Distributions of the difference in (g) ΔSconf,SVR/N and (h) ΔνSVR between variant and wild type for 5,039 benign (or of unknown significance, black line) and 637 pathogenic (grey bars) frameshift variants identified by Mensah et al.56. P values are estimated from one-sided Brunner–Munzel tests using t-distributions and the reported degrees of freedom (DoF). Standard errors of Cohen’s d values are estimated through 105 bootstraps. f, Average number of ClinVar missense variants per IDR. i, Average number of publications per IDR obtained from the “Find my Understudied Gene” tool59. Data in f and i are displayed as mean ± s.e.m. calculated over n = 25; 87; 498; 18,546; 8,864; and 32 IDRs.

Supplementary information

Reporting Summary

Supplementary Table 1

P values from one-sided Brunner–Munzel tests and Cohen’s d are calculated using (i) all IDRs in the pLDDT-based set, (ii) only IDRs with fdomain = 0 in the pLDDT-based set, and (iii) all IDRs in the SPOT-based set. The file reports only Gene Ontology “molecular function” terms with \(P < 10^{-5}\) for each of the three sets of IDRs for the association with either ν, Sconf/N, or both. The file has four tabs: (i) asymptotic P values calculated using t-distributions, (ii) Cohen’s d with standard errors estimated via bootstrapping, (iii) P values obtained via bootstrapping, and (iv) P values obtained via bootstrapping where we sampled with replacement both sets using the sample size of the smaller set.

Supplementary Table 2

P values from one-sided Brunner–Munzel tests and Cohen’s d are calculated using (i) all IDRs in the pLDDT-based set, (ii) only IDRs with fdomain = 0 in the pLDDT-based set, and (iii) all IDRs in the SPOT-based set. The file reports only Gene Ontology “cellular component” terms with \(P < 10^{-5}\) for each of the three sets of IDRs for the association with either ν, Sconf/N, or both. The file has four tabs: (i) asymptotic P values calculated using t-distributions, (ii) Cohen’s d with standard errors estimated via bootstrapping, (iii) P values obtained via bootstrapping, and (iv) P values obtained via bootstrapping where we sampled with replacement both sets using the sample size of the smaller set.

Supplementary Table 3

Sequence length; ν; Sconf/N; \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \); S; Δ; ⟨λ⟩; z(δ+−); z(Ωπ); z(Ω−); SHD; SCD; κ; FCR; NCPR; Booleans indicating whether the IDR is N-terminal, C-terminal, between folded domains, or an IDP; QCDPred score; fdomain; and amino acid sequence for the 28,058 IDRs in the pLDDT-based set. IDP, fully intrinsically disordered protein.

Supplementary Table 4

List of digital object identifiers (DOIs) of research articles reporting small-angle X-ray scattering and Förster resonance energy transfer (FRET) data used to validate the CALVADOS model as shown in Extended Data Fig. 3.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tesei, G., Trolle, A.I., Jonsson, N. et al. Conformational ensembles of the human intrinsically disordered proteome. Nature 626, 897–904 (2024). https://doi.org/10.1038/s41586-023-07004-5

  • DOI: https://doi.org/10.1038/s41586-023-07004-5
