Conformational ensembles of the human intrinsically disordered proteome

Tesei, Giulio; Trolle, Anna Ida; Jonsson, Nicolas; Betz, Johannes; Knudsen, Frederik E.; Pesce, Francesco; Johansson, Kristoffer E.; Lindorff-Larsen, Kresten

doi:10.1038/s41586-023-07004-5

Article
Published: 31 January 2024

Nature volume 626, pages 897–904 (2024)

Abstract

Intrinsically disordered proteins and regions (collectively, IDRs) are pervasive across proteomes in all kingdoms of life, help to shape biological functions and are involved in numerous diseases. IDRs populate a diverse set of transiently formed structures and defy conventional sequence–structure–function relationships¹. Developments in protein science have made it possible to predict the three-dimensional structures of folded proteins at the proteome scale². By contrast, there is a lack of knowledge about the conformational properties of IDRs, partly because the sequences of disordered proteins are poorly conserved and also because only a few of these proteins have been characterized experimentally. The inability to predict structural properties of IDRs across the proteome has limited our understanding of the functional roles of IDRs and how evolution shapes them. As a supplement to previous structural studies of individual IDRs³, we developed an efficient molecular model to generate conformational ensembles of IDRs and thereby to predict their conformational properties from sequences⁴,⁵. Here we use this model to simulate nearly all of the IDRs in the human proteome. Examining conformational ensembles of 28,058 IDRs, we show how chain compaction is correlated with cellular function and localization. We provide insights into how sequence features relate to chain compaction and, using a machine-learning model trained on our simulation data, show the conservation of conformational properties across orthologues. Our results recapitulate observations from previous studies of individual protein systems and exemplify how to link, at the proteome scale, conformational ensembles with cellular function and localization, amino acid sequence, evolutionary conservation and disease variants. Our freely available database of conformational properties will encourage further experimental investigation and enable the generation of hypotheses about the biological roles and evolution of IDRs.

**Fig. 1: Schematic illustration of the approach used to obtain conformational properties for all of the IDRs in the human proteome.**

**Fig. 2: Analysis of the association between IDR compaction and GO annotations.**

**Fig. 3: Sequence features that determine compaction.**

**Fig. 4: Conservation of conformational and sequence properties.**

**Fig. 5: Proteome-wide search for conformational buffering, and relationship between conformational properties of IDRs and incidence of pathogenic missense variants.**

Data availability

Simulation trajectories and precalculated conformational properties for the 28,058 IDRs of the pLDDT-based set are available at https://sid.erda.dk/cgi-sid/ls.py?share_id=AVZAJvJnCO, which also includes a full list of amino acid sequences, sequence features and conformational properties of the 28,058 IDRs in the pLDDT-based set (Supplementary Table 3). In the online database, directories are organized by splitting UniProt IDs into two-character units (XX/XX/XX), followed by an additional four-character unit for UniProt IDs of ten characters (XX/XX/XX/XXXX). For example, data for the IDRs of UniProt ID A0A494C086 can be accessed by entering IDRome/A0/A4/94/C086 in the Filter paths. Raw data to reproduce the results presented in this work are available at https://github.com/KULL-Centre/_2023_Tesei_IDRome and have been deposited at Zenodo (https://doi.org/10.5281/zenodo.10251736).
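The directory convention described above can be sketched in a few lines of Python. This is only an illustration of the stated splitting rule: the helper name `idrome_path` is hypothetical, and the `IDRome` root folder for six-character IDs is assumed from the single ten-character example given in the text.

```python
def idrome_path(uniprot_id: str, root: str = "IDRome") -> str:
    """Build the database directory path for a UniProt ID (hypothetical helper)."""
    uid = uniprot_id.strip().upper()
    if len(uid) not in (6, 10):
        raise ValueError(f"expected a 6- or 10-character UniProt ID, got {uid!r}")
    # The first six characters are split into three two-character units.
    parts = [uid[i:i + 2] for i in range(0, 6, 2)]
    # Ten-character IDs contribute one additional four-character unit.
    if len(uid) == 10:
        parts.append(uid[6:])
    return "/".join([root, *parts])


print(idrome_path("A0A494C086"))  # IDRome/A0/A4/94/C086
print(idrome_path("P06748"))      # IDRome/P0/67/48
```

The first call reproduces the example path quoted in the text; the second applies the same rule to a six-character ID.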

Window-averaged pLDDT scores were obtained from Supplementary Table 1 of a previous study². SPOT-Disorder v.1 predictions were obtained from https://github.com/IPritisanac/AF2.IDR (ref. ⁷). Orthologous sequences were retrieved from the OMA database (Nov 2022 release). UniProt releases 2021_04, 2022_05 and 2023_03 were accessed as described previously⁶² and as detailed in the Methods. AlphaFold predicted aligned errors were retrieved from https://alphafold.ebi.ac.uk/ (refs. ⁹⁷,⁹⁸). The Gene Ontology Data Archive version 2023-11-15 (https://doi.org/10.5281/zenodo.10162580) was obtained from https://current.geneontology.org/ontology/go-basic.obo. UniProt IDs of proteins in membraneless organelles were retrieved from CD-CODE v.1_04 (ref. ²⁷). Missense variants were obtained from ClinVar (release 2020-11). Frameshift variants were retrieved from Zenodo (https://doi.org/10.5281/zenodo.7311150; ref. ⁵⁶). The number of publications per gene was obtained from Find My Understudied Genes (v.1.0.0; https://fmug.amaral.northwestern.edu/).

Code availability

Custom code to reproduce the results and figures of this study is available at https://github.com/KULL-Centre/_2023_Tesei_IDRome and at Zenodo (https://doi.org/10.5281/zenodo.10251736). CALVADOS is available at Zenodo (https://doi.org/10.5281/zenodo.6914053) and can be run using Google Colab through https://github.com/KULL-Centre/_2023_Tesei_IDRome/blob/main/IDRLab.ipynb.

References

Holehouse, A. S. & Kragelund, B. B. The molecular basis for cellular function of intrinsically disordered protein regions. Nat. Rev. Mol. Cell Biol. https://doi.org/10.1038/s41580-023-00673-0 (2023).
Akdel, M. et al. A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067 (2022).
Ghafouri, H. et al. PED in 2024: improving the community deposition of structural ensembles for intrinsically disordered proteins. Nucleic Acids Res. 52, D536–D544 (2024).
Tesei, G., Schulze, T. K., Crehuet, R. & Lindorff-Larsen, K. Accurate model of liquid–liquid phase behavior of intrinsically disordered proteins from optimization of single-chain properties. Proc. Natl Acad. Sci. USA 118, e2111696118 (2021).
Tesei, G. & Lindorff-Larsen, K. Improved predictions of phase behaviour of intrinsically disordered proteins by tuning the interaction range. Open Res. Europe 2, 94 (2023).
Ruff, K. M. & Pappu, R. V. AlphaFold and implications for intrinsically disordered proteins. J. Mol. Biol. 433, 167208 (2021).
Alderson, T. R., Pritišanac, I., Kolarić, D., Moses, A. M. & Forman-Kay, J. D. Systematic identification of conditionally folded intrinsically disordered regions by AlphaFold2. Proc. Natl Acad. Sci. USA 120, e2304302120 (2023).
Piovesan, D., Monzon, A. M. & Tosatto, S. C. E. Intrinsic protein disorder and conditional folding in AlphaFoldDB. Protein Sci. 31, e4466 (2022).
Brotzakis, Z. F., Zhang, S. & Vendruscolo, M. AlphaFold prediction of structural ensembles of disordered proteins. Preprint at bioRxiv https://doi.org/10.1101/2023.01.19.524720 (2023).
Thomasen, F. E. & Lindorff-Larsen, K. Conformational ensembles of intrinsically disordered proteins and flexible multidomain proteins. Biochem. Soc. Trans. 50, 541–554 (2022).
Das, R. K., Huang, Y., Phillips, A. H., Kriwacki, R. W. & Pappu, R. V. Cryptic sequence features within the disordered protein p27^Kip1 regulate cell cycle signaling. Proc. Natl Acad. Sci. USA 113, 5616–5621 (2016).
Martin, E. W. et al. Valence and patterning of aromatic residues determine the phase behavior of prion-like domains. Science 367, 694–699 (2020).
González-Foutel, N. S. et al. Conformational buffering underlies functional selection in intrinsically disordered protein regions. Nat. Struct. Mol. Biol. 29, 781–790 (2022).
Lindorff-Larsen, K. & Kragelund, B. B. On the potential of machine learning to examine the relationship between sequence, structure, dynamics and function of intrinsically disordered proteins. J. Mol. Biol. 433, 167196 (2021).
Zheng, W. et al. Inferring properties of disordered chains from FRET transfer efficiencies. J. Chem. Phys. 148, 123329 (2018).
Sherry, K. P., Das, R. K., Pappu, R. V. & Barrick, D. Control of transcriptional activity by design of charge patterning in the intrinsically disordered ram region of the notch receptor. Proc. Natl Acad. Sci. USA 114, E9243–E9252 (2017).
Riback, J. A. et al. Stress-triggered phase separation is an adaptive, evolutionarily tuned response. Cell 168, 1028–1040 (2017).
Bremer, A. et al. Deciphering how naturally occurring sequence features impact the phase behaviours of disordered prion-like domains. Nat. Chem. 14, 196–207 (2022).
Ibrahim, A. Y. et al. Intrinsically disordered regions that drive phase separation form a robustly distinct protein class. J. Biol. Chem. 299, 102801 (2023).
Teilum, K., Olsen, J. G. & Kragelund, B. B. Globular and disordered—the non-identical twins in protein-protein interactions. Front. Mol. Biosci. 2, 40 (2015).
Staller, M. V. et al. Directed mutational scanning reveals a balance between acidic and hydrophobic residues in strong human activation domains. Cell Syst. 13, 334–345 (2022).
Sabari, B. R., Dall’Agnese, A. & Young, R. A. Biomolecular condensates in the nucleus. Trends Biochem. Sci. 45, 961–977 (2020).
Diner, I. et al. Aggregation properties of the small nuclear ribonucleoprotein U1-70K in Alzheimer disease. J. Biol. Chem. 289, 35296–35313 (2014).
Greig, J. A. et al. Arginine-enriched mixed-charge domains provide cohesion for nuclear speckle condensation. Mol. Cell 77, 1237–1250 (2020).
Chang, F. T. M. et al. PML bodies provide an important platform for the maintenance of telomeric chromatin integrity in embryonic stem cells. Nucleic Acids Res. 41, 4447–4458 (2013).
Lyons, H. et al. Functional partitioning of transcriptional regulators by patterned charge blocks. Cell 186, 327–345 (2023).
Rostam, N. et al. CD-CODE: crowdsourcing condensate database and encyclopedia. Nat. Methods 20, 673–676 (2023).
Nott, T. J. et al. Phase transition of a disordered nuage protein generates environmentally responsive membraneless organelles. Mol. Cell 57, 936–947 (2015).
Pak, C. W. et al. Sequence determinants of intracellular phase separation by complex coacervation of a disordered protein. Mol. Cell 63, 72–85 (2016).
Das, R. K. & Pappu, R. V. Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proc. Natl Acad. Sci. USA 110, 13392–13397 (2013).
Sawle, L. & Ghosh, K. A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins. J. Chem. Phys. 143, 085101 (2015).
Zarin, T. et al. Proteome-wide signatures of function in highly diverged intrinsically disordered regions. eLife 8, e46883 (2019).
Zheng, W., Dignon, G., Brown, M., Kim, Y. C. & Mittal, J. Hydropathy patterning complements charge patterning to describe conformational preferences of disordered proteins. J. Phys. Chem. Lett. 11, 3408–3415 (2020).
Huihui, J. & Ghosh, K. Intrachain interaction topology can identify functionally similar intrinsically disordered proteins. Biophys. J. 120, 1860–1868 (2021).
Yamazaki, H., Takagi, M., Kosako, H., Hirano, T. & Yoshimura, S. H. Cell cycle-specific phase separation regulated by protein charge blockiness. Nat. Cell Biol. 24, 625–632 (2022).
Mao, A. H., Crick, S. L., Vitalis, A., Chicoine, C. L. & Pappu, R. V. Net charge per residue modulates conformational ensembles of intrinsically disordered proteins. Proc. Natl Acad. Sci. USA 107, 8183–8188 (2010).
Lin, Y.-H. & Chan, H. S. Phase separation and single-chain compactness of charged disordered proteins are strongly correlated. Biophys. J. 112, 2043–2046 (2017).
Cohan, M. C., Shinn, M. K., Lalmansingh, J. M. & Pappu, R. V. Uncovering non-random binary patterns within sequences of intrinsically disordered proteins. J. Mol. Biol. 434, 167373 (2022).
DelRosso, N. et al. Large-scale mapping and mutagenesis of human transcriptional effector domains. Nature 616, 365–372 (2023).
Johansson, K. E., Mashahreh, B., Hartmann-Petersen, R., Ravid, T. & Lindorff-Larsen, K. Prediction of quality-control degradation signals in yeast proteins. J. Mol. Biol. 435, 167915 (2023).
Ruff, K. M. in Intrinsically Disordered Proteins (eds Kragelund, B. B. & Skriver, K.) Ch. 18, 347–389 (Springer, 2020).
Lotthammer, J. M. et al. Direct prediction of intrinsically disordered protein conformational properties from sequences. Nat. Methods https://doi.org/10.1038/s41592-023-02159-5 (2024).
Wang, J. et al. A molecular grammar governing the driving forces for phase separation of prion-like RNA binding proteins. Cell 174, 688–699 (2018).
Fisher, R. S. & Elbaum-Garfinkle, S. Tunable multiphase dynamics of arginine and lysine liquid condensates. Nat. Commun. 11, 4628 (2020).
Schmidt, H. B., Barreau, A. & Rohatgi, R. Phase separation-deficient TDP43 remains functional in splicing. Nat. Commun. 10, 4890 (2019).
Sørensen, C. S. & Kjaergaard, M. Effective concentrations enforced by intrinsically disordered linkers are governed by polymer physics. Proc. Natl Acad. Sci. USA 116, 23124–23131 (2019).
Hantschel, O. et al. A myristoyl/phosphotyrosine switch regulates c-Abl. Cell 112, 845–857 (2003).
Harmon, T. S., Holehouse, A. S., Rosen, M. K. & Pappu, R. V. Intrinsically disordered linkers determine the interplay between phase separation and gelation in multivalent proteins. eLife 6, e30294 (2017).
Asselin, L. et al. Mutations in the KIF21B kinesin gene cause neurodevelopmental disorders through imbalanced canonical motor activity. Nat. Commun. 11, 2441 (2020).
Ahmed, S. B. M. & Prigent, S. A. Insights into the Shc family of adaptor proteins. J. Mol. Signal. 12, 2 (2017).
Manetti, F. LIM kinases are attractive targets with many macromolecular partners and only a few small molecule regulators. Med. Res. Rev. 32, 968–998 (2011).
Mégarbané, H. et al. An autosomal-recessive form of cutis laxa is due to homozygous elastin mutations, and the phenotype may be modified by a heterozygous fibulin 5 polymorphism. J. Invest. Dermatol. 129, 1650–1655 (2009).
Langenhove, T. V. et al. Genetic contribution of FUS to frontotemporal lobar degeneration. Neurology 74, 366–371 (2010).
Liu, Q. et al. Whole-exome sequencing identifies a missense mutation in hnRNPA1 in a family with flail arm ALS. Neurology 87, 1763–1769 (2016).
Schmidt, A. et al. Predicting the pathogenicity of missense variants using features derived from AlphaFold2. Bioinformatics 39, btad280 (2023).
Mensah, M. A. et al. Aberrant phase separation and nucleolar dysfunction in rare genetic diseases. Nature 614, 564–571 (2023).
Banani, S. F. et al. Genetic variation associated with condensate dysregulation in disease. Dev. Cell 57, 1776–1788 (2022).
Rocha, J. J. et al. Functional unknomics: systematic screening of conserved genes of unknown function. PLOS Biol. 21, e3002222 (2023).
Richardson, R. A. K., Navarro, H. T., Amaral, L. A. N. & Stoeger, T. Meta-research: understudied genes are lost in a leaky pipeline between genome-wide assays and reporting of results. eLife 12, RP93429 (2023).
Janson, G., Valdes-Garcia, G., Heo, L. & Feig, M. Direct generation of protein conformational ensembles via machine learning. Nat. Commun. 14, 774 (2023).
Le Mercier, P. et al. SwissBioPics—an interactive library of cell images for the visualization of subcellular location data. Database 2022, baac026 (2022).
The UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Hanson, J., Yang, Y., Paliwal, K. & Zhou, Y. Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks. Bioinformatics 33, 685–692 (2016).
Hekkelman, M. L., de Vries, I., Joosten, R. P. & Perrakis, A. AlphaFill: enriching AlphaFold models with ligands and cofactors. Nat. Methods 20, 205–213 (2022).
Anderson, J. A., Glaser, J. & Glotzer, S. C. HOOMD-blue: a Python package for high-performance molecular dynamics and hard particle Monte Carlo simulations. Comput. Mater. Sci. 173, 109363 (2020).
Flyvbjerg, H. & Petersen, H. G. Error estimates on averages of correlated data. J. Chem. Phys. 91, 461–466 (1989).
Borgia, A. et al. Consistent view of polypeptide chain expansion in chemical denaturants from multiple experimental methods. J. Am. Chem. Soc. 138, 11714–11726 (2016).
Aronovitz, J. & Nelson, D. Universal features of polymer shapes. Journal de Physique 47, 1445–1456 (1986).
Hensen, U., Gräter, F. & Henchman, R. H. Macromolecular entropy can be accurately computed from force. J. Chem. Theory Comput. 10, 4777–4781 (2014).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
McGibbon, R. T. et al. MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys. J. 109, 1528–1532 (2015).
Holehouse, A. S., Das, R. K., Ahad, J. N., Richardson, M. O. & Pappu, R. V. CIDER: resources to analyze sequence-ensemble relationships of intrinsically disordered proteins. Biophys. J. 112, 16–21 (2017).
Montepietra, D. et al. FRETpredict: a Python package for FRET efficiency predictions using rotamer libraries. Preprint at bioRxiv https://doi.org/10.1101/2023.01.27.525885 (2023).
Rotkiewicz, P. & Skolnick, J. Fast procedure for reconstruction of full-atom protein models from reduced representations. J. Comput. Chem. 29, 1460–1465 (2008).
Fuertes, G. et al. Decoupling of size and shape fluctuations in heteropolymeric sequences reconciles discrepancies in SAXS vs. FRET measurements. Proc. Natl Acad. Sci. USA 114, E6342–E6351 (2017).
Gomes, G.-N. W. et al. Conformational ensembles of an intrinsically disordered protein consistent with NMR, SAXS, and single-molecule FRET. J. Am. Chem. Soc. 142, 15697–15710 (2020).
Humphrey, W., Dalke, A. & Schulten, K. VMD—Visual Molecular Dynamics. J. Mol. Graph. 14, 33–38 (1996).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
The Gene Ontology Consortium. The Gene Ontology knowledgebase in 2023. Genetics 224, iyad031 (2023).
Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using networkx. In Proc. 7th Python in Science Conference (eds Varoquaux, G., Vaught, T. & Millman, J.) 11–15 (2008).
Brunner, E. & Munzel, U. The nonparametric Behrens–Fisher problem: asymptotic theory and a small-sample approximation. Biom. J. 42, 17–25 (2000).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Thulin, M. Modern Statistics with R (Eos Chasma Press, 2021).
Noguchi, K., Konietschke, F., Marmolejo-Ramos, F. & Pauly, M. Permutation tests are robust and powerful at 0.5% and 5% significance levels. Behav. Res. Meth. 53, 2712–2724 (2021).
Mashahreh, B. et al. Conserved degronome features governing quality control associated proteolysis. Nat. Commun. 13, 7588 (2022).
Koren, I. et al. The eukaryotic proteome is shaped by E3 ubiquitin ligases targeting C-terminal degrons. Cell 173, 1622–1635 (2018).
Chang, Y.-W., Hsieh, C.-J., Chang, K.-W., Ringgaard, M. & Lin, C.-J. Training and testing low-degree polynomial data mappings via linearSVM. J. Mach. Learn. Res. 11, 1471–1490 (2010).
Schölkopf, B., Smola, A., Williamson, R. & Bartlett, P. New support vector algorithms. Neural Comput. 12, 1207–1245 (2000).
Breiman, L. Random forests. Machine Learning 45, 5–32 (2001).
Chao, T.-H., Rekhi, S., Mittal, J. & Tabor, D. P. Data-driven models for predicting intrinsically disordered protein polymer physics directly from composition or sequence. Mol. Syst. Des. Eng. 8, 1146–1155 (2023).
Dignon, G. L., Zheng, W., Kim, Y. C., Best, R. B. & Mittal, J. Sequence determinants of protein phase behavior from a coarse-grained model. PLOS Comput. Biol. 14, e1005941 (2018).
Lu, A. X. et al. Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning. PLOS Comput. Biol. 18, e1010238 (2022).
Altenhoff, A. M. et al. OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. 49, D373–D379 (2020).
Katoh, K. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2017).
Tiemann, J. K. S., Zschach, H., Lindorff-Larsen, K. & Stein, A. Interpreting the molecular mechanisms of disease variants in human transmembrane proteins. Biophys. J. 122, 2176–2191 (2023).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Varadi, M. et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2021).

Acknowledgements

We thank T. Mittag, X. Salvatella, J. Riback, and R. Pappu for comments and suggestions, and A. Holehouse for discussions and sharing work before publication. We acknowledge access to computational resources from the Biocomputing Core Facility at the Department of Biology, University of Copenhagen, from the Resource for Biomolecular Simulations (ROBUST; supported by the Novo Nordisk Foundation; NNF18OC0032608), and the Danish National Supercomputer for Life Sciences (Computerome). This work is a contribution from the PRISM (Protein Interactions and Stability in Medicine and Genomics) centre funded by the Novo Nordisk Foundation (to K.L.-L.; NNF18OC0033950). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 101025063.

Author information

These authors contributed equally: Giulio Tesei, Anna Ida Trolle

Authors and Affiliations

Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
Giulio Tesei, Anna Ida Trolle, Nicolas Jonsson, Johannes Betz, Frederik E. Knudsen, Francesco Pesce, Kristoffer E. Johansson & Kresten Lindorff-Larsen

Contributions

G.T., K.E.J. and K.L.-L. conceived and planned the study. G.T. and A.I.T. performed and analysed molecular simulations and performed GO analyses. N.J. generated the set of missense variants. G.T. and J.B. developed the machine-learning models. G.T. and F.E.K. analysed orthologous sequences. F.P. developed the Google Colab notebooks. K.E.J. generated the sets of human and orthologous sequences. G.T. performed the other analyses with input from N.J., F.P., K.E.J. and K.L.-L. G.T. and K.L.-L. wrote the manuscript with contributions from all authors. K.L.-L. supervised the study.

Corresponding authors

Correspondence to Giulio Tesei or Kresten Lindorff-Larsen.

Ethics declarations

Competing interests

K.L.-L. holds stock options in and is a consultant for Peptone. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature thanks David De Sancho and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Analyses of pLDDT scores and UniProt annotations.

a, Distributions of the window-averaged pLDDT score of residues in the pLDDT-based set (grey), in the SPOT-based set (red), and in IDRs from 531 proteins unique to the SPOT-based set (magenta). b,c, Window-averaged pLDDT scores in (b) all IDRs and in (c) IDRs between folded domains with ν ≤ 0.45 (orange), 0.45 < ν ≤ 0.55 (cyan), and ν > 0.55 (teal). Hatched and open violin plots show results for the pLDDT-based and SPOT-based sets, respectively. Horizontal lines show medians and quartiles calculated over (b) n = 161,748; 88,325; 3,278,851; 2,442,776; 511,556; 429,591 and (c) n = 13,988; 8,534; 374,190; 275,164; 92,770; 66,276 pLDDT scores. d,e, Fraction of IDRs (d) and average sequence length of IDRs (e) as a function of the fraction of residues within UniProt “Zinc finger”, “Coiled coil”, “Transmembrane”, and “Domain” annotations longer than ten amino acids, f_domain. Results for the pLDDT-based (6,204 IDRs with f_domain > 0, 22%) and SPOT-based sets (4,095 IDRs with f_domain > 0, 19%) are shown in grey and red, respectively. f–i, f_domain averaged over (f) all IDRs (n = 616; 361; 18,546; 13,292; 8,896; and 7,730), (g) IDRs between folded domains (n = 91; 56; 2,470; 1,613; 1,777; and 1,214), (h) N- and C-terminal IDRs (n = 415; 203; 12,144; 6,531; 4,776; and 3,601), and (i) IDPs (n = 43; 12; 1,346; 517; 74; and 50) with ν ≤ 0.45 (orange), 0.45 < ν ≤ 0.55 (cyan), and ν > 0.55 (teal). Hatched and empty bars show results for the pLDDT-based and SPOT-based sets, respectively. Data in e–i are displayed as mean ± s.e.m. j, Comparison between the number of residues within UniProt “Zinc finger”, “Coiled coil”, “Transmembrane”, and “Domain” annotations longer than ten amino acids in the pLDDT-based (grey) and SPOT-based (red) sets of IDRs, and in the full-length proteins of both sets (blue). IDP, fully intrinsically disordered protein.

Extended Data Fig. 2 Representative intra-chain energy maps, and distributions of the conformational parameters asphericity, Δ, and prolateness, S, for a selection of IDRs from our database.

The UniProt IDs of the IDRs are (a) P08621, (b) P06748, (c) Q5VUA4, (d) Q6PD62, (e) Q15648, (f) Q13148, (g) P98177, (h) Q7KZ85, (i) P06454, (j) Q5BKZ1, (k) O60763, and (l) Q99733. The data are averaged over five ensembles of 1,000 weakly correlated conformations (Extended Data Fig. 5a), each obtained from an independent molecular dynamics simulation. Top, energy maps are calculated using the non-ionic Ashbaugh–Hatch potential of the CALVADOS 2 model. Bottom, lines and vertical shaded areas show distributions and confidence intervals of the ensemble averages estimated from the standard deviation of Δ (light shade) and S (dark shade) over five independent simulations.

Extended Data Fig. 3 Validation of the CALVADOS model.

a, Relative difference between predicted and experimental radii of gyration for a set of 64 proteins that were not used in training the CALVADOS 2 model, and for Sic1. Bars show \((R_{\mathrm{g}}^{\mathrm{calc}}-R_{\mathrm{g}}^{\mathrm{exp}})/R_{\mathrm{g}}^{\mathrm{exp}}\), where \(R_{\mathrm{g}}^{\mathrm{calc}}\) is averaged over n = 5 independent simulation replicas. Data are displayed as the mean ± s.d. b, Correlation between \(R_{\mathrm{g}}^{\mathrm{exp}}\) and \(R_{\mathrm{g}}^{\mathrm{calc}}\) for the same proteins shown in a. c, Correlation of predicted and experimental average FRET efficiencies, ⟨E_FRET⟩, for a set of 64 proteins that were not used in training the CALVADOS 2 model, and for Sic1. d–k, Comparison between R_g distributions from CALVADOS simulations (colour) and from ensembles from the Protein Ensemble Database (black) for (d) Nup49, (e) NLS, (f) NUS, (g) IBB, (h) NUL, (i) Sic1, (j) KISS1 and (k) NWASPV. Vertical lines show the average radius of gyration from simulations (colour), from the Protein Ensemble Database (solid black), and from experiments (dashed black); we note that some of the ensembles in the Protein Ensemble Database may have been generated by so-called reweighting methods, but that the ensembles in the database correspond to the unweighted structures. In Supplementary Table 4, we provide the full list of references for the experimental R_g and ⟨E_FRET⟩ values shown in this figure.

Extended Data Fig. 4 Correlation between the variance of residue–residue distances in CALVADOS simulations and the confidence of the relative positions of residue pairs in AlphaFold predictions.

a–d, Standard deviation (s.d.) of intra-chain distances calculated from simulations (a,c) and AlphaFold predicted aligned errors (PAE) (b,d) for IDRs of nucleophosmin (UniProt ID P06748) (a,b) and transcription elongation factor SPT6 (UniProt ID Q7KZ85) (c,d). e–l, PAE as a function of s.d. for IDRs in proteins with UniProt IDs (e) P08621, (f) P06748, (g) Q5VUA4, (h) Q6PD62, (i) Q15648, (j) Q13148, (k) P98177, and (l) Q7KZ85. Pearson correlation coefficients, r, and linear fits (solid lines) are calculated for separations along the linear sequence, ∣i − j∣, in the range ∣i − j∣_min ≤ ∣i − j∣ ≤ N, where ∣i − j∣_min = 4 and N is the sequence length. m–t, Pearson r for the correlation between PAE and s.d. as a function of the shortest separation along the linear chain, ∣i − j∣_min, included in the analysis; data are shown for IDRs in proteins with UniProt IDs (m) P08621, (n) P06748, (o) Q5VUA4, (p) Q6PD62, (q) Q15648, (r) Q13148, (s) P98177, and (t) Q7KZ85.
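
The analysis in panels e–t can be sketched as follows, assuming that the PAE and the s.d. of intra-chain distances are both available as N × N matrices. The function names and the toy matrices are hypothetical. Looping over both (i, j) and (j, i) is an assumption made here because PAE matrices are in general not symmetric.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pae_sd_correlation(pae, sd, min_sep=4):
    """Pearson r between PAE and s.d. for residue pairs with |i - j| >= min_sep."""
    n = len(pae)
    xs, ys = [], []
    for i in range(n):
        for j in range(n):
            if abs(i - j) >= min_sep:
                xs.append(sd[i][j])
                ys.append(pae[i][j])
    return pearson_r(xs, ys)


# Toy matrices for illustration only: PAE depends linearly on the s.d.
n_res = 10
sd_toy = [[math.sqrt(abs(i - j)) for j in range(n_res)] for i in range(n_res)]
pae_toy = [[2.0 * sd_toy[i][j] + 1.0 for j in range(n_res)] for i in range(n_res)]
r_toy = pae_sd_correlation(pae_toy, sd_toy, min_sep=4)
```

Repeating the calculation for increasing values of `min_sep` gives the dependence on ∣i − j∣_min shown in panels m–t.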

Extended Data Fig. 5 Sampling, distributions and dependence on the location within the full-length protein of conformational properties in the pLDDT-based set of IDRs.

a, Values of the autocorrelation function (ACF) of the R_g for lag times of one (grey), two (red), and three (blue) frames as a function of sequence length. Data are displayed as mean ± s.d. over n = 5 independent simulation replicas. b, S.d. of the apparent Flory scaling exponent, ν, as a function of ν for proteins of different sequence length. c, S.d. of the ratio of the mean-squared end-to-end distance and the mean-squared radius of gyration, \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \), as a function of \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \) for proteins of different sequence length. d, S.d. of the conformational entropy per residue, S_conf/N, as a function of S_conf/N for proteins of different sequence length. S.d. values are calculated over n = 5 independent simulation replicas. Dotted lines show the average s.d. e, Correlation between \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \) calculated from simulation trajectories and the approximate relation \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle =(2\nu +1)\times (2\nu +2)\) (ref. ¹⁵). f,g, Distributions of (f) \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \) and (g) the conformational entropy per residue, S_conf/N, for IDRs in the human proteome; note the logarithmic scale. h–j, Correlation between (h) asphericity, Δ, and ν; (i) prolateness, S, and ν; and (j) S_conf/N and ν. r is the Pearson correlation coefficient. k, Normalized distributions of the absolute difference between ν values, ∣Δν∣, for IDRs from different proteins (grey), within the same protein (blue), and separated by folded domains within the same protein (teal). Vertical bars show mean ± s.e.m. of ∣Δν∣.
l,m, Distributions of (l) ν and (m) S_conf/N for IDRs in the different categories of localization within the full-length protein: N-terminal (32.2%), C-terminal (24.8%), between folded domains (15.5%), and IDPs (5.2%). For this analysis we annotated IDRs to be located between folded domains if they are preceded and followed by at least 100 residues which do not overlap with any of the IDRs identified in this work. IDP, fully intrinsically disordered protein.
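
Two of the quantities in this legend lend themselves to a short illustration: the ACF of the R_g at a lag of a few frames (panel a) and the approximate relation between \(\langle {R}_{{\rm{ee}}}^{2}\rangle /\langle {R}_{{\rm{g}}}^{2}\rangle \) and ν (panel e). The sketch below uses hypothetical function names and a toy time series. The normalization of the ACF by the lag-zero sum is one common convention and is an assumption here, as the legend does not specify the estimator.

```python
def acf(series, lag):
    """Normalized autocorrelation of a time series at a given lag (in frames)."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom


def ree2_over_rg2(nu):
    """Approximate <Ree^2>/<Rg^2> for a chain with scaling exponent nu."""
    return (2.0 * nu + 1.0) * (2.0 * nu + 2.0)


# Toy series: a strictly alternating signal is anticorrelated at lag 1.
toy_rg = [1.0 if t % 2 == 0 else -1.0 for t in range(100)]
acf_lag1 = acf(toy_rg, 1)
acf_lag2 = acf(toy_rg, 2)

# Limiting cases of the relation: ideal chain (nu = 0.5) and rigid rod (nu = 1).
ratio_ideal = ree2_over_rg2(0.5)
ratio_rod = ree2_over_rg2(1.0)
```

The relation recovers the textbook values of 6 for an ideal chain and 12 for a rigid rod, which is a quick consistency check on the formula. ACF values close to zero at a lag of one frame indicate that consecutive saved conformations are only weakly correlated.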

Extended Data Fig. 6 Analysis of the CD-CODE database, and relationships between sequence and conformational properties for IDRs in the pLDDT-based and SPOT-based sets.

a,b, Analysis of the association between IDR conformational properties and protein localization in membraneless organelles (as reported in the CD-CODE database²⁷). Distributions of (a) ν and (b) S_conf/N for IDRs in ‘driver’ proteins (shaded bars) and in proteins that are not a part of the examined subset (black lines). c,d, Distributions of (c) ν and (d) S_conf/N for IDRs in ‘member’ proteins (shaded bars) and in proteins that are not a part of the examined subset (black lines). Histograms for proteins enriched in compact and expanded IDRs are shown in orange and teal, respectively. P values are estimated from one-sided Brunner–Munzel tests using t-distributions and the reported degrees of freedom (DoF). Standard errors of Cohen’s d values are estimated through 10⁵ bootstraps. e–q, Comparison between sequence features that affect compaction in pLDDT-based and SPOT-based sets of IDRs. We show NARDINI z-scores for (e) basic–acidic patterning, z(δ₊₋), (f) acidic patterning, z(Ω₋), and (g) aromatic patterning, z(Ω_π); (h) sequence charge decoration, SCD; (l) charge segregation, κ; (m) S_conf/N; (n) sequence hydropathy decoration, SHD; (o) average stickiness, ⟨λ⟩; (p) fraction of charged residues, FCR; and (q) sequence length, N, as a function of ν. i–k, (i) z(δ₊₋), (j) z(Ω₋), and (k) z(Ω_π) as a function of S_conf/N. Results are shown for all IDRs in the pLDDT-based (grey) and SPOT-based (red) sets, and for IDRs with f_domain = 0 in the pLDDT-based set (blue). r,s, Short IDRs are on average more highly charged and expanded than longer IDRs. (r) ν and (s) S_conf/N as a function of sequence length, N, for the human IDRs in the pLDDT-based (grey) and SPOT-based (red) sets. Data are displayed as mean ± s.e.m. t, Normalized distributions of the NCPR for IDRs with N ≤ 200 (full lines) and with N > 200 (dotted lines) in the pLDDT-based (grey) and SPOT-based (red) sets.

Extended Data Fig. 7 Analyses of QCDPred scores and charged residues for IDRs in the pLDDT-based set.

a, QCDPred score (red) and ⟨λ⟩ (grey) as a function of ν. b, Average QCDPred score as a function of ν and ⟨λ⟩. c, QCDPred score (red) and fraction of negatively charged residues, f₋ (black) as a function of ν. d,g, Average fraction of positively charged residues Arg (red) and Lys (black) as a function of (d) ν and (g) S_conf/N. e,h, Average fraction of negatively charged residues Glu (red) and Asp (black) as a function of (e) ν and (h) S_conf/N. f,i, Average net charge per residue, NCPR, as a function of (f) ν and (i) S_conf/N. Data are displayed as mean ± s.e.m. calculated within bins of width Δν = 0.015 and ΔS_conf/N = 0.05 k_B. The sample sizes in each bin are (a,c–f) n = 4; 3; 5; 3; 5; 6; 9; 11; 8; 16; 14; 26; 23; 28; 41; 49; 98; 135; 212; 473; 854; 1,921; 3,319; 4,724; 5,393; 4,987; 3,405; 1,461; 484; 182; 76; 43; 16; 9; 4; and (g–i) n = 5; 9; 4; 20; 29; 93; 406; 1,290; 2,555; 3,204; 3,520; 3,483; 3,415; 3,005; 2,517; 1,895; 1,208; 686; 320; 161; 90; 49; 29; 20; 12; 6; 8; 4; 3; 3; 3.
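
The binned averages reported here, mean ± s.e.m. within bins of fixed width, can be sketched as follows. The function name, the handling of bin edges and the toy data are hypothetical. Bins with fewer than two points are skipped in this sketch because the s.e.m. is undefined for a single value.

```python
import math
from collections import defaultdict


def binned_mean_sem(x, y, width):
    """Group y by bins of x with the given width; return (centre, mean, s.e.m., n)."""
    bins = defaultdict(list)
    for xi, yi in zip(x, y):
        bins[math.floor(xi / width)].append(yi)
    rows = []
    for k in sorted(bins):
        vals = bins[k]
        n = len(vals)
        if n < 2:
            continue  # s.e.m. is undefined for a single value
        m = sum(vals) / n
        var = sum((v - m) ** 2 for v in vals) / (n - 1)
        rows.append(((k + 0.5) * width, m, math.sqrt(var / n), n))
    return rows


# Toy data for illustration only: nu values and an arbitrary per-IDR feature.
nu_toy = [0.501, 0.502, 0.503, 0.520, 0.521]
feature_toy = [1.0, 2.0, 3.0, 10.0, 12.0]
binned = binned_mean_sem(nu_toy, feature_toy, width=0.015)
for centre, m, sem, n in binned:
    print(f"nu ~ {centre:.4f}: {m:.2f} +/- {sem:.2f} (n = {n})")
```

The same function applies to bins of S_conf/N by passing a width of 0.05 in units of k_B.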

Extended Data Fig. 8 SVR models for ν and S_conf/N.

a,g, Permutation importance of the sequence features used in the SVR models for the prediction of (a) ν and (g) S_conf/N. b–f, Average sequence features as a function of ν calculated from simulations (grey) and using the SVR model (red); data are shown for (b) sequence hydropathy decoration, SHD; (c) sequence charge decoration, SCD; (d) fraction of charged residues, FCR; (e) average stickiness, ⟨λ⟩; and (f) charge segregation, κ. h–j, Average sequence features as a function of S_conf/N calculated from simulations (grey) and using the SVR model (red); data are shown for (h) ⟨λ⟩; (i) SHD; and (j) SCD. Data are displayed as mean ± s.e.m. calculated within bins of width Δν = 0.015 and ΔS_conf/N = 0.05 k_B. The sample sizes in each bin are (b–f) n = 4; 3; 5; 3; 5; 6; 9; 11; 8; 16; 14; 26; 23; 28; 41; 49; 98; 135; 212; 473; 854; 1,921; 3,319; 4,724; 5,393; 4,987; 3,405; 1,461; 484; 182; 76; 43; 16; 9; 4; and (h–j) n = 5; 9; 4; 20; 29; 93; 406; 1,290; 2,555; 3,204; 3,520; 3,483; 3,415; 3,005; 2,517; 1,895; 1,208; 686; 320; 161; 90; 49; 29; 20; 12; 6; 8; 4; 3; 3; 3. k–n, Testing the SVR models. k,m, Correlation between (k) ν and (m) S_conf/N from simulations and corresponding predictions of the SVR models for a held-out test set of 2,795 distinct sequences from the set of 28,058 IDRs identified in this work. l,n, Correlation between (l) ν and (n) S_conf/N values from simulations and corresponding predictions of the SVR models for 611 IDRs in the 531 proteins that are unique to the SPOT-based set of IDRs.

Extended Data Fig. 9 Relationships between the conformational properties of IDRs and the incidence of pathogenic missense and frameshift variants.

a–c, Distributions of (a) the conformational entropy per residue, S_conf/N; (b) the pLDDT scores of the substituted residues; and (c) the scaling exponent, ν, for 11,173 benign (black line) and 1,656 pathogenic (grey bars) variants of IDRs without UniProt domain annotations (f_domain = 0). d,e, Distributions of the difference in (d) ΔS_conf,SVR/N and (e) Δν_SVR between variant and wild type for benign (black line) and pathogenic (grey bars) variants of IDRs with f_domain = 0. g,h, Distributions of the difference in (g) ΔS_conf,SVR/N and (h) Δν_SVR between variant and wild type for 5,039 benign (or of unknown significance, black line) and 637 pathogenic (grey bars) frameshift variants identified by Mensah et al.⁵⁶. P values are estimated from one-sided Brunner–Munzel tests using t-distributions and the reported degrees of freedom (DoF). Standard errors of Cohen’s d values are estimated through 10⁵ bootstraps. f, Average number of ClinVar missense variants per IDR. i, Average number of publications per IDR obtained from the “Find my Understudied Gene” tool⁵⁹. Data in f and i are displayed as mean ± s.e.m. calculated over n = 25; 87; 498; 18,546; 8,864; and 32 IDRs.
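
To illustrate the effect-size estimate mentioned in this legend, the sketch below computes Cohen's d and a bootstrap standard error. The pooled-variance definition of d is the common textbook form and is an assumption here, as the legend does not state which variant is used. The function names, the synthetic samples and the reduced number of resamples (2,000 rather than 10⁵) are hypothetical choices that keep the example fast.

```python
import math
import random


def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of the two samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled


def bootstrap_se(a, b, n_boot=2000, seed=0):
    """Standard error of Cohen's d from resampling both groups with replacement."""
    rng = random.Random(seed)
    ds = []
    for _ in range(n_boot):
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        ds.append(cohens_d(ra, rb))
    m = sum(ds) / n_boot
    return math.sqrt(sum((d - m) ** 2 for d in ds) / (n_boot - 1))


# Synthetic samples for illustration only; not data from the paper.
rng = random.Random(1)
group_a = [rng.gauss(0.50, 0.03) for _ in range(200)]
group_b = [rng.gauss(0.48, 0.03) for _ in range(200)]
d_value = cohens_d(group_a, group_b)
d_se = bootstrap_se(group_a, group_b)
print(f"Cohen's d = {d_value:.2f} +/- {d_se:.2f}")
```

Each resample draws both groups with replacement at their original sizes, so the spread of the resampled d values reflects the sampling uncertainty of both distributions.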

Supplementary information

Reporting Summary

Supplementary Table 1

P values from one-sided Brunner–Munzel tests and Cohen’s d are calculated using (i) all IDRs in the pLDDT-based set, (ii) only IDRs with f_domain = 0 in the pLDDT-based set, and (iii) all IDRs in the SPOT-based set. The file reports only Gene Ontology “molecular function” terms with P < 10⁻⁵ for each of the three sets of IDRs for the association with either ν, S_conf/N, or both. The file has four tabs: (i) asymptotic P values calculated using t-distributions, (ii) Cohen’s d with standard errors estimated via bootstrapping, (iii) P values obtained via bootstrapping, and (iv) P values obtained via bootstrapping where we sampled with replacement both sets using the sample size of the smaller set.

Supplementary Table 2

P values from one-sided Brunner–Munzel tests and Cohen’s d are calculated using (i) all IDRs in the pLDDT-based set, (ii) only IDRs with f_domain = 0 in the pLDDT-based set, and (iii) all IDRs in the SPOT-based set. The file reports only Gene Ontology “cellular component” terms with P < 10⁻⁵ for each of the three sets of IDRs for the association with either ν, S_conf/N, or both. The file has four tabs: (i) asymptotic P values calculated using t-distributions, (ii) Cohen’s d with standard errors estimated via bootstrapping, (iii) P values obtained via bootstrapping, and (iv) P values obtained via bootstrapping where we sampled with replacement both sets using the sample size of the smaller set.

Supplementary Table 3

Sequence length; ν; S_conf/N; ⟨R_ee²⟩/⟨R_g²⟩; S; Δ; ⟨λ⟩; z(δ₊₋); z(Ω_π); z(Ω₋); SHD; SCD; κ; FCR; NCPR; Booleans indicating whether the IDR is N-terminal, C-terminal, between folded domains, or an IDP; QCDPred score; f_domain; and amino acid sequence for the 28,058 IDRs in the pLDDT-based set. IDP, fully intrinsically disordered protein.

Supplementary Table 4

List of digital object identifiers (DOIs) of research articles reporting small-angle X-ray scattering and Förster resonance energy transfer (FRET) data used to validate the CALVADOS model as shown in Extended Data Fig. 3.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tesei, G., Trolle, A.I., Jonsson, N. et al. Conformational ensembles of the human intrinsically disordered proteome. Nature 626, 897–904 (2024). https://doi.org/10.1038/s41586-023-07004-5

Received: 12 May 2023
Accepted: 19 December 2023
Published: 31 January 2024
Issue Date: 22 February 2024
DOI: https://doi.org/10.1038/s41586-023-07004-5
