Introduction

The Anfinsen dogma, that a protein's sequence dictates its three-dimensional (3D) structure, was postulated nearly 50 years ago1. It set in motion a quest to find methods to reliably and accurately predict 3D protein structures from their sequence, which became even more important with the full sequencing of the human and other genomes (see https://www.ncbi.nlm.nih.gov/genome). Recent spectacular advances in 3D structure prediction from protein sequences by Artificial Intelligence (AI) programs such as AlphaFold (AF) and RoseTTAFold appear to have revolutionized the field in terms of accuracy and speed2,3. Boosted by their success in predicting structures to near (and sometimes even better than) crystallographic accuracy, the AlphaFold consortium (https://alphafold.ebi.ac.uk) has already made publicly available a series of databases of predicted protein structures, first for the entire human proteome and those of several other organisms4, and more recently for the entire UniProt database of curated sequences5 (https://www.uniprot.org).

However, these AI programs have not tackled the folding issue from a thermodynamic/mechanistic approach, but rather by combining many different observations in a deep learning process6,7. Apart from simple cases of highly homologous sequences, or clearly recognized folding classes, we believe that rapidly ascertaining the degree of confidence of a predicted structure from a few properties measured in solution should become a necessary step. For instance, besides known occurrences of multi-chain proteins, determining the molecular mass M in solution can immediately verify the protein oligomerization state and indicate whether further modeling is needed. On a different level, circular dichroism (CD) spectroscopy, possible on very small sample amounts8, would permit a rapid check of the actual secondary structure content of a predicted 3D structure.

Particularly useful for known single-chain proteins in the AF databases, shape-sensitive hydrodynamic parameters such as the translational diffusion and sedimentation coefficients (\({D_{t(20,w)}^{0}}\), \({s_{{\left( {{20},w} \right)}}^{{0}} }\)) and the intrinsic viscosity ([η]), could provide a robust assessment of the overall fold likeliness. These measurements, requiring little material and with a reasonably quick turnaround, are usually accessible in most research endeavors, especially in core facilities where analytical ultracentrifugation9,10, multi-angle static and dynamic light scattering (MALS and DLS) coupled to size-exclusion chromatography (SEC)11,12 or directly on plate readers13, and SEC-coupled differential viscosimetry14,15, can often be found. On another level, small-angle X-ray scattering (SAXS) measurements can provide the rms radius of gyration Rg and the pair-wise distance distribution function p(r) vs. r16,17,18. Notably, several synchrotron beamlines offer on-line SEC-SAXS (e.g., Table 11.1 in Ref.19), some accepting mailed-in samples for this set-up (e.g., https://www.diamond.ac.uk/Users/Support-for-European-Access-to-Life-Sciences/Applications/Bio-SAXS.html; https://www.embl-hamburg.de/biosaxs/mailin.html; https://www.synchrotron-soleil.fr/en/beamlines/swing; https://bl1231.als.lbl.gov/htsaxs).

Importantly, all these parameters and functions can be calculated, with varying degree of accuracy, from 3D structures. Among the CD spectra computational methods available, we have chosen SESCA, which appears to offer very accurate results for a wide variety of structures20. The computation of the hydrodynamic parameters from atomic level structures is a mature field, with several approaches and corresponding software available, and with an average accuracy comparable to that of the experimentally determined parameters, 2–4%21,22. For the hydrodynamic and the p(r) vs. r distribution calculations, we have employed the extensively validated UltraScan SOlution MOdeler (US-SOMO) public domain suite23,24,25.

This effort has allowed us to produce and make publicly available, from the AlphaFold released predicted protein structures databases, the comprehensive US-SOMO-AF database presented here, containing the corresponding calculated M, \({D_{t(20,w)}^{0}}\), \({s_{{\left( {{20},w} \right)}}^{{0}} }\), [η], p(r) vs. r, CD spectra, and other ancillary information. Note that the AlphaFold databases were generated from the UniProt sequences without being curated any further. For instance, many proteins are synthesized with either an initiator methionine26, a signal peptide27, or a transit peptide28, which will be post-translationally removed. In addition, some proteins are also further processed by removal of one or more propeptide sequences (see https://www.uniprot.org/help/ptm_processing_section). These modifications will affect the calculated parameters in an inverse proportion to protein size. As the mature form will be nearly always purified and studied, we have by default removed whenever possible the UniProt-identified initiator, signal, and transit peptide residues from the AF structures before performing the hydrodynamic, structural and spectroscopic calculations. For the propeptides, we have instead generated alternate AF structure(s) when they were removed (see Supplementary Methods for details).

Based on the calculated values, some analyses regarding the effectiveness of performing a screening of predicted structures against experimental parameters are presented. Advantages, drawbacks, and potential improvements are then discussed.

Results

Database generation and website implementation

The steps leading to the implementation of the US-SOMO-AF database are outlined in “Methods” section and fully described in the Supplementary Methods sections. Briefly, each entry in the entire AF-v1 (and subsequently -v2) databases was first checked against the corresponding entry in the UniProt database to find the (putative) initiator, signal, and transit peptide regions, which were then removed from the AF PDB files. If propeptide sequence(s) were present, additional PDB files were generated with this/these region(s) removed. If more than a single propeptide was present, permuted structures were generated. These extra AF-derived PDB files have “-pp#” appended to the filename (where “#” is a number). Potential disulfides were identified (allowing a better evaluation of the partial specific volume \({\overline{v}}\) and of M) and written as SSBOND records in the curated PDBs, together with HELIX and SHEET information identified using the DSSP29 implementation in UCSF Chimera30. Batch-mode US-SOMO was used to calculate M, \({\overline{v}}\), \({D_{t(20,w)}^{0}}\), \({s_{{\left( {{20},w} \right)}}^{{0}} }\), the derived Stokes’ radius Rs, [η], Rg, the maximum extensions along the principal X, Y and Z axes of the molecule, and the generation of the p(r) vs. r distributions (normalized by the M of the structure). SESCA20 was used to generate 170–270 nm CD spectra.
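The residue-removal step described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the database (that is described in the Supplementary Methods); the function name `strip_residue_ranges` and the residue ranges in the usage note are hypothetical, and the sketch only assumes the standard fixed-column PDB layout (chain ID in column 22, residue number in columns 23–26).

```python
def strip_residue_ranges(pdb_lines, ranges, chain="A"):
    """Return pdb_lines without ATOM/HETATM records of `chain` whose
    residue number falls inside any inclusive (start, end) range.

    Hypothetical helper for illustration only; it does not rename
    chains or write SSBOND/HELIX/SHEET records.
    """
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 26 and line[21] == chain:
            resnum = int(line[22:26])  # PDB columns 23-26: residue sequence number
            if any(start <= resnum <= end for start, end in ranges):
                continue  # atom belongs to a removed region
        kept.append(line)
    return kept
```

For example, `strip_residue_ranges(lines, [(1, 22)])` would drop a hypothetical 22-residue signal peptide while leaving the numbering of the remaining residues untouched.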

In Fig. 1a,b, two screenshots of the US-SOMO-AF webpage (https://somo.genapp.rocks) are shown, with panel a featuring the text/data part and panel b containing the graphic output. The header contains hyperlinks to the US-SOMO, SESCA, and AlphaFold websites, and to Ref.24. It is followed by a warning message concerning the meaningfulness of the calculations when applied to “real” proteins (see “Discussion” section). By hovering the mouse over entries, an explanation will appear below the corresponding field on the right column. A UniProt accession number, or some initial part of it, can be entered in the first field (typing just the initial part and clicking “Search” will provide a list of corresponding entries, including alternate structures). In any case, if the code is present in the database, the corresponding entry will be shown in the “AlphaFold model name” field, followed by the “Title” and “Source” fields as retrieved from the PDB file. If an initiator methionine, a signal or a transit peptide, and/or any propeptide(s) were identified and their atoms removed from the current AF PDB file, their identity and the stretch of residues involved will be listed in the “Post Translational Processing” field, otherwise “none” will appear. The actual residue(s) stretch(es) present in the structure are displayed in the “UniProt residues present” field. Note that if any propeptide stretch was removed from the middle of a structure, the subsequent chain part(s) was/were renamed “B”, “C”…, without renumbering (see Supplementary Methods for details). The dates on which the AF predictions and US-SOMO/SESCA computations were done appear in their corresponding fields, and in between the “Mean confidence” field reports the calculated mean % per-residue confidence, based on the values present in the AF original PDB file.
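As an illustration of the “Mean confidence” field, the sketch below averages the per-residue confidence stored in an AlphaFold PDB file, where the confidence value occupies the B-factor column (columns 61–66). Taking one value per residue from the Cα atom is our assumption for this example; the function name `mean_confidence` is hypothetical and the code is not taken from the actual pipeline.

```python
def mean_confidence(pdb_lines):
    """Mean per-residue confidence (%) read from the B-factor column
    of CA atoms. Assumption: one CA atom per residue."""
    values = [
        float(line[60:66])  # PDB columns 61-66: temperature factor field
        for line in pdb_lines
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not values:
        raise ValueError("no CA atoms found")
    return sum(values) / len(values)
```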

Figure 1

Screenshots of the US-SOMO-AF webpage. Shown are the results for AF-P01029-F1 that includes the removal of the signal sequence and two propeptides. (a) The upper part containing text/data information. (b) The bottom part showing the computed p(r) vs. r distribution and CD spectrum graphs, and the JSmol representation of the structure.

The ten fields that follow report the US-SOMO computed parameters. Since the hydrodynamic parameters were computed with the statistically-based ZENO method31,32,33, standard deviations (SD) can be generated. However, an SD is reported only for [η], as the SDs are tiny for all other parameters. Note that a calculated \({\overline{v}}\) is provided because, besides being used to compute \({s_{{\left( {{20},w} \right)}}^{{0}} }\) from \({D_{t(20,w)}^{0}}\) and M, it could also be used to compute an experimental M from SAXS data18. The bottom two entries report the per-residue % of α-helix and β-sheet as calculated from the HELIX and SHEET records in the curated PDB. They could be compared with CD-derived values, besides comparing experimental and calculated spectra (see below).
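The relations linking these parameters are the classical Svedberg and Stokes–Einstein equations. The sketch below shows them in Python using CGS units and standard conditions (water at 20 °C); the function names and the input values are illustrative assumptions, not database entries or US-SOMO code.

```python
import math

R_GAS = 8.314462618e7        # gas constant, erg/(mol K)
K_BOLTZMANN = 1.380649e-16   # Boltzmann constant, erg/K
T_20C = 293.15               # 20 degrees C in K
RHO_WATER_20C = 0.998234     # density of water at 20 C, g/cm^3
ETA_WATER_20C = 0.01002      # viscosity of water at 20 C, poise


def sedimentation_coefficient(mass, d_t, vbar, rho=RHO_WATER_20C, temp=T_20C):
    """Svedberg equation, s = M D (1 - vbar*rho) / (R T).
    mass in g/mol, d_t in cm^2/s, vbar in cm^3/g; returns s in seconds."""
    return mass * d_t * (1.0 - vbar * rho) / (R_GAS * temp)


def stokes_radius_nm(d_t, eta=ETA_WATER_20C, temp=T_20C):
    """Stokes-Einstein equation, Rs = k T / (6 pi eta D); returns Rs in nm."""
    rs_cm = K_BOLTZMANN * temp / (6.0 * math.pi * eta * d_t)
    return rs_cm * 1.0e7  # cm -> nm


# Illustrative inputs only (not taken from the database).
s = sedimentation_coefficient(mass=66400.0, d_t=6.1e-7, vbar=0.733)
print(f"s = {s / 1e-13:.2f} S, Rs = {stokes_radius_nm(6.1e-7):.2f} nm")
```

Dividing s by 1e-13 converts seconds to Svedberg units, the form in which sedimentation coefficients are usually reported.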

External links for the current entry to both UniProt and AlphaFold websites are placed after the parameters listings. Curated PDB- and mmCIF-formatted files for the entry can be retrieved from the provided hyperlinks, as well as text files with the p(r) vs. r distribution and CD spectrum, and a csv-formatted file containing all the identifying information and the single-value parameters. All these files can be also retrieved as single compressed files (zip or tar.xz). Below these hyperlinks, the computed p(r) vs. r distribution and CD spectrum graphs are presented, followed by a JSmol (https://sourceforge.net/projects/jsmol) representation of the structure (see Fig. 1b).

Controls for the visualization and copying as an image of both graphs are provided. JSmol commands are also available to change the representation and export it. The default representation colors the structure according to the per-residue confidence level (red, lowest; blue, highest), but for a more in-depth analysis we refer the user to the original AF website.

In the end, parameters for a total of 365,198 and 1,002,038 structures were generated from the AF-v1 and -v2 databases, respectively (sequences with multiple predicted segments were not included, as the computations of their parameters are meaningless). The AF-v2 structures, including replacements for all AF-v1 structures, are stored in the freely accessible US-SOMO-AF database.

General data analyses

Although it is beyond the scope of this work to provide extensive data analyses and interpretations, some observations can be made. To begin with, we have randomly selected from the 365,198 AF-v1 curated structures originally present in the US-SOMO-AF-v1 database, a subset containing 41,200 predicted structures with no counterparts in the RCSB PDB34 database (https://www.rcsb.org), and we have analyzed their calculated properties (data provided as a spreadsheet, Supplementary Data 1).

The graphs in Fig. 2 qualitatively illustrate the potential of selected calculated parameters to distinguish between structures, by observing the spread of the Rs (Fig. 2a) and [η] (Fig. 2b) values for a given M value (Rs was chosen as a proxy for either \({s_{{\left( {{20},w} \right)}}^{{0}} }\) or \({D_{t(20,w)}^{0}}\), the experimentally determined parameters). It is evident that Rs alone (Fig. 2a) can already distinguish between structures, and its ability to discriminate, albeit somewhat limited, does not substantially change on increasing M in the interval \({10^{4}}\)–\({10^{5}}\) Da. A significantly larger spread is instead displayed by [η], almost independently of M (Fig. 2b). To provide a measure of the discriminating ability of Rs and [η], we have grouped their values in bins spanning M intervals of 5 kDa, and we have computed the pair-wise % difference between all entries in each bin. Then, we calculated the percentage of pairs whose % difference was above two pre-established cut-offs, 6% and 9%, reflecting the potential experimental errors conservatively estimated around 3% (see Supplementary Methods). The results are presented in Fig. S1, and clearly show that the discriminating ability is practically constant, independent of both the M values and of the number of pairs. For Rs even a 9% cut-off would allow about 70% of the pairs to be discriminated, while for [η] this figure is around 90%. [η] is, however, more affected by potentially flexible regions not properly taken into account by the computations, sometimes leading to suspiciously high values. Indeed, a correlation between increasing [η] values vs. a decreasing % confidence level in the structure prediction can be seen in Fig. 2c, becoming, however, much less defined when the confidence level goes below 50%. In Fig. S2a, we report a Z-scores analysis of this behavior. The shape somewhat follows the number of structures in each bin (Fig. S2b), as expected, since the chances of a sample including an individual from the tail of the distribution increase with sample size. However, the Z-scores seem to flatten out when the confidence level drops below 50%, likely reflecting the lack of [η] value clustering. Finally, Fig. 2d shows in 3D how combining two parameters, Rs and [η], can effectively increase the ability to discriminate. Another important parameter is Rg, but it can rarely be determined by MALS techniques, which have a lower detection limit of ~ 10–11 nm. While SAXS can determine Rg, it can also be used to derive the p(r) vs. r distribution18, which contains more information and can be directly compared with the one computed from structure. Note that the effect of not taking into account the hydration water in the computation of the p(r) vs. r distribution is relatively minor, and its importance decreases as M increases. Therefore, plots involving Rg are not presented here, but could be easily generated from the Supplementary Data 1 spreadsheet.
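The binning and pair-wise comparison described above can be sketched as follows. This is an illustrative reconstruction, not the code used for Fig. S1: the function names are hypothetical, and expressing the % difference relative to the pair mean is an assumption (the exact definition is given in the Supplementary Methods).

```python
from itertools import combinations


def percent_difference(a, b):
    """% difference of a pair. Assumption: relative to the pair mean."""
    return 200.0 * abs(a - b) / (a + b)


def fraction_discriminated(entries, bin_width=5000.0, cutoff=6.0):
    """entries: iterable of (M in Da, parameter value) tuples.
    Groups values into M bins of `bin_width` and returns
    {bin index: fraction of pairs whose % difference exceeds `cutoff`}.
    Bins holding fewer than two entries are skipped."""
    bins = {}
    for mass, value in entries:
        bins.setdefault(int(mass // bin_width), []).append(value)
    result = {}
    for index, values in bins.items():
        pairs = list(combinations(values, 2))
        if pairs:
            above = sum(percent_difference(a, b) > cutoff for a, b in pairs)
            result[index] = above / len(pairs)
    return result
```

Calling the function once with `cutoff=6.0` and once with `cutoff=9.0` mirrors the two thresholds used in the text. The number of pairs grows quadratically with bin population, so very crowded bins may need subsampling.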

Figure 2

Plots of selected calculated parameters for 41,200 AF-v1 predicted structures with no corresponding entries in the solved structures PDB database. (a) Rs vs. M, log–log scale. (b) [η] vs. M, log–log scale. (c) [η] vs. % decreasing mean confidence level, log-lin scale. (d) A 3D plot where M (log scale) is on the vertical Z-axis, and Rs and [η] are on the horizontal X- and Y-axes, respectively (both linear scales).

Since the AF prediction algorithm was trained on the RCSB PDB structures, the exclusion in the above analysis of AF-predicted structures having a counterpart in the RCSB was done to avoid biasing this subset with potentially “correct” calculated parameters. However, it could be also interesting to compare some experimental hydrodynamic parameters with those calculated from both AF-predicted and experimental structures. Unfortunately, in the AF-v1 database there were very few instances that matched the necessary criteria, that is, i. RCSB PDB complete structures of single chain proteins from the same organism also present in the AF-v1 database; ii. without prosthetic groups; iii. having verified sound experimental hydrodynamic parameters, in particular \({D_{t(20,w)}^{0}}\) and \({s_{{\left( {{20},w} \right)}}^{{0}} }\). After perusing Table 2 of Ref.24, only three proteins met these criteria, and the comparisons are presented in Table 1. For two proteins, human carbonic anhydrase and human serum albumin, both AF-predicted and PDB structures produced very similar \({D_{t(20,w)}^{0}}\) and \({s_{{\left( {{20},w} \right)}}^{{0}} }\) values (inter-difference of ~ 0.6–0.9%), with excellent matches with \({D_{t(20,w)}^{0}}\) (− 2.4 to + 1.8%) and somewhat worse with \({s_{{\left( {{20},w} \right)}}^{{0}} }\) (− 3.7 to + 5.4%). Notably, a large inter-difference instead was present for soybean trypsin inhibitor (STI; − 4.6 to 4.8%), with the experimental \({D_{t(20,w)}^{0}}\) matched better by the AF prediction and \({s_{{\left( {{20},w} \right)}}^{{0}} }\) by the PDB structure (this apparently odd fact can be explained by either experimental value being potentially incorrect). The inter-difference could be rationalized by superimposing the structures and calculating the RMSD between them, as reported in Table 1. 
The smallest protein, STI, has the largest RMSD, and this is apparently sufficient to be reflected in the different calculated \({D_{t(20,w)}^{0}}\) and \({s_{{\left( {{20},w} \right)}}^{{0}} }\) values.

Table 1 Comparison between experimental and calculated \({D_{{t({20},w)}}^{{0}} }\) and \({s_{{({20},w)}}^{{0}} }\) for three proteins having a crystallographic structure and a predicted AF-v1 structure.

Selected examples

In Table 2, we have listed 14 entries chosen from the 41,200 mentioned above. They were initially selected to represent intervals from 2.2 to 0.66 in the computed Rg/Rs ratio indicating deviation from globular shape (Rg/Rs ~ 0.77 for a sphere). A suitable range of [η] values was also sought, as well as a good representation of the organisms present in the AF-v1 databases, the presence or absence of a signal peptide, and some spread in the mean % confidence. M, Rg, Rs, and [η] were chosen as the calculated parameters, and the entries are ordered by decreasing M. Connected to Table 2 is Fig. 3, which displays snapshots of the 3D structures for each entry colored according to the per-residue confidence level, followed by the p(r) vs. r and CD plots.
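The sphere reference value of the Rg/Rs ratio follows from a short calculation for a homogeneous hard sphere of radius R, whose Stokes radius equals R (hydration is neglected in this idealized case):

```latex
R_g^{2} = \frac{\int_{0}^{R} r^{2}\, 4\pi r^{2}\, dr}{\int_{0}^{R} 4\pi r^{2}\, dr}
        = \frac{3}{5} R^{2},
\qquad
\frac{R_g}{R_s} = \sqrt{\frac{3}{5}} \approx 0.775
```

Ratios well above this value point to elongated or loosely packed shapes, which is why the ratio serves as a simple shape descriptor.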

Table 2 Some calculated parameters for a selection of AF-v1 predicted structures with no RCSB PDB counterparts, ordered by decreasing molecular mass.
Figure 3

JSmol snapshots of the structures for the entries reported in Table 2, together with the calculated p(r) vs. r and CD plots.

Table 2 and Fig. 3 provide an insightful glimpse of the great variety of predicted structures and their associated calculated parameters, suggesting that performing some of these checks can indeed boost, or question, their reliability. As expected, CD spectra display differences between most structures, and they are a robust check on the predicted secondary structure content. The variability in [η] values in Table 2 appears to confirm its discriminating ability above that of Rs, but clearly it is the p(r) vs. r distribution that would provide the best test, although it is the least rapidly experimentally accessible parameter among those considered.

Comparisons with experimental SAXS-derived data

To strengthen our case, we have conducted a direct comparison between experimentally-derived p(r) vs. r, retrieved from the SASBDB database35 (https://www.sasbdb.org/), and those calculated for the corresponding AF-v1 structures. After a SASBDB search for UniProt codes also present in the AF-v1 database, 473 matching datasets were retrieved. Again, the selection among them was based on the experimental sample being complete, monomeric, and without prosthetic groups, leading to 45 candidates. The final chosen data are presented in Fig. 4, and cover a molecular mass range from 16 to 107 kDa, collected at several SAXS beamlines in either batch or SEC-SAXS mode. In two cases, the corresponding PDB structures were also available. In Fig. 4a, we see a large difference between the SEC-SAXS experimentally-derived p(r) vs. r for the 16 kDa P. falciparum myosin essential light chain36 (black) and that calculated for the AF-Q8IJM4 structure (red), clearly indicating a more extended conformation in solution. In Fig. 4b, the batch-SAXS experimentally-derived p(r) vs. r for the 44 kDa H. sapiens Hsp90 co-chaperone Cdc37 protein37 (black) is significantly different from that calculated from the AF-Q16543 predicted structure (red), hinting at a different domain arrangement in solution. The effect of removing the propeptide segment from an AF predicted structure can be appreciated in Fig. 4c, where the 54 kDa H. sapiens pro-matrix metalloproteinase-1 (MMP-1) studied before (black) and after (blue) propeptide 20–99 segment cleavage (SEC-SAXS unpublished data collected by R. Holland at Diamond, UK) is compared with the corresponding curated AF-P03956 structures (red and magenta, respectively), and with the p(r) vs. r calculated from chain A in the RCSB PDB structure 4AUO38 (green). Here the effect of removing the 9.2 kDa propeptide is noticeable, and subtle differences also appear between the SAXS-derived and calculated p(r) vs. r, with the AF-P03956- and PDB-derived distributions being almost identical. Another complete accordance between AF- and PDB-derived (AF-P50897 and 3GRO, unpublished) p(r) vs. r can be seen in Fig. 4d for the 31 kDa H. sapiens palmitoyl-protein thioesterase 1 (PPT1; red and green, respectively), but both are quite different from the SEC-SAXS derived data39 (black) that point to a more elongated structure. A noticeably more elongated structure is also apparent in Fig. 4e by comparing the SEC-SAXS derived40 p(r) vs. r for the 107 kDa H. sapiens probable ATP-dependent RNA helicase DDX58 (black) with the one calculated for AF-O95786 (red). Finally, in Fig. 4f are two cases where SEC-SAXS-derived and AF-calculated p(r) vs. r yield nearly identical curves, the 72 kDa A. thaliana enhanced disease susceptibility 1 (experimental41, blue; AF-Q9SU72, magenta) and the 25 kDa H. sapiens arpin isoform 1 (experimental42, black; AF-Q7Z6K5, red).
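When comparing an experimental and a calculated p(r) vs. r curve, a first single-number check is the Rg implied by each distribution, obtained from the standard relation Rg² = ∫ r² p(r) dr / (2 ∫ p(r) dr). The sketch below is a minimal illustration with NumPy; the function name `rg_from_pr` is hypothetical, and the analytical p(r) of a homogeneous sphere is used only as a self-check.

```python
import numpy as np


def _trapezoid(y, x):
    """Trapezoidal integration of y(x) on a possibly non-uniform grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def rg_from_pr(r, p):
    """Radius of gyration from a p(r) vs. r distribution.
    The result does not depend on how p(r) is normalized."""
    r = np.asarray(r, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sqrt(_trapezoid(r**2 * p, r) / (2.0 * _trapezoid(p, r))))


# Self-check on the analytical p(r) of a homogeneous sphere of radius 10:
# the two printed numbers should agree closely.
radius = 10.0
r = np.linspace(0.0, 2.0 * radius, 2001)
p_sphere = r**2 * (1.0 - 0.75 * r / radius + r**3 / (16.0 * radius**3))
print(rg_from_pr(r, p_sphere), np.sqrt(0.6) * radius)
```

Because the ratio of integrals cancels any scale factor, the same function can be applied to distributions normalized by M and to SAXS-derived ones without rescaling.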

Figure 4

SAXS-derived p(r) vs. r curves and those calculated from AF and RCSB PDB structures. (a–f) Protein source and names, SASBDB, AF (UniProt) and RCSB PDB accession numbers for each entry are indicated in the boxes within each panel. In all panels the experimentally-derived and the AF-calculated p(r) vs. r are black and red lines, respectively. Additional SAXS-derived and AF-calculated p(r) vs. r present in (c,f) are blue and magenta lines, respectively. Additional PDB-calculated p(r) vs. r (green lines) are present in (c,d).

Conformational variability

To provide an additional test of the discriminatory ability of the hydrodynamic parameters and p(r) vs. r distribution, we have selected the structure of M. musculus Cadherin-16 (O88338; see Table 2, Fig. 3), which contains a number of independently folded domains connected by linkers, and we have run a Discrete Molecular Dynamics (DMD)43,44 simulation to expand its conformational space (see Supplementary Methods for details), followed by hydrodynamic and p(r) vs. r calculations on 100 produced structures. As can be seen in Supplementary Video 1, sufficiently different alternative conformations were generated within an overall frame, allowing an evaluation of the spread in the predicted parameters and their potential discriminating capability. For instance, the Rs spread, 5.88–6.16 nm (a ~ 4.5% change), would be barely above experimental error in distinguishing between the most different conformations in this set, while the spread in [η], 20.8–23.8 cm³/g (a ~ 12.6% change), would clearly allow distinguishing between many conformations (all this set’s individual data are in the Supplementary Data 2 spreadsheet, and the Rs and [η] are reported in each video frame, along with the p(r) vs. r distributions). Even more striking is the variation in the p(r) vs. r distributions, which are also collectively reported in Fig. 5. Thus, even for such a restricted structural variation, comparing experimental and calculated parameters can provide reliable tests of the predicted structures.
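The quoted percentage changes can be reproduced with a one-line calculation. The sketch assumes the spread is expressed relative to the largest value in the set, which is the convention that matches the figures given above; the function name is hypothetical.

```python
def percent_spread(lowest, highest):
    """Spread of a parameter over a set of conformations, in %.
    Assumption: expressed relative to the highest value."""
    return 100.0 * (highest - lowest) / highest


print(round(percent_spread(5.88, 6.16), 1))   # Rs range in nm
print(round(percent_spread(20.8, 23.8), 1))   # [eta] range in cm^3/g
```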

Figure 5

Calculated p(r) vs. r distributions for the 100 conformations generated in the DMD run on the AF-predicted O88338 structure.

Effects of long unstructured, potentially flexible regions

Finally, we have also explored the effect of generating a large number of conformations for AF-predicted unstructured parts in three of the AF-v1 proteins shown in Table 2 and Fig. 3, AF-Q4DE01 (residues 1–72 and 746–957), AF-A0A060D4L2 (residues 1–118), and AF-Q8IJG3 (residues 1–40), by producing over 16,000 full structures for each entry. Since molecular dynamics or even DMD runs would have been prohibitively time-consuming, we used the Monomer Monte Carlo (MMC) simulation tool in the SASSIE-web suite45, followed again by batch-mode US-SOMO to compute the hydrodynamic parameters (see Supplementary Methods for details; an animation of 100 randomly chosen among the generated structures for AF-A0A060D4L2 is presented as Supplementary Video 2). Besides calculating the averages ± SD of each parameter, we also statistically analyzed the data, producing distribution histograms. The results can be seen in Fig. 6, where histograms of the distributions of the calculated Rg/Rs ratio (panels a,c,e) and [η] (panels b,d,f) are shown. The starting conformations and the average ± SD values are reported in each panel’s internal label, and are marked on the plots as solid green, and solid and dashed red vertical lines, respectively. From these graphs, the noticeable increase and spread of calculated values is evident for both parameters, the latter being reflected in the large SD associated with the average values. Some differences can be seen, with [η] more correlated to the size of the unstructured regions (decreasing from panels a,b to panels e,f in Fig. 6) and Rg/Rs apparently better able to pick up a bimodal distribution (Fig. 6 panel c). Note that since the MMC simulations we ran did not involve an energy penalty term in accepting/rejecting conformations at each step, but only an overlap check, these calculated average values cannot be directly compared with potential experimentally-derived values, as also indicated by the large associated SDs. 
Nevertheless, they surely confirm that the conformation of unstructured parts will severely affect the hydrodynamic properties in solution, reinforcing the importance of measuring them.

Figure 6

Histograms of the calculated parameters for the MMC-generated conformations of three AF-predicted structures from Table 2. Shown are the distributions of Rg/Rs (a,c,e) and of [η] (b,d,f) calculated for AF-Q4DE01 (16,520 conformations, (a,b)), AF-A0A060D4L2 (16,666 conformations, (c,d)), and AF-Q8IJG3 (16,367 conformations, (e,f)). In each panel, the vertical green lines mark the location of the starting structure parameters, while the vertical solid and dashed red lines indicate the average ± SD over all conformations (the actual values are reported in each panel’s inside legend).

Discussion

We have presented here a new database stemming from the AlphaFold predicted protein structures databases. We initially worked with the AF-v1 release, whose entries were utilized for the tests reported here, and we have already extended it to the AF-v2 recent release. The resulting US-SOMO-AF database contains calculated hydrodynamic and structural parameters whose experimental determination should be within the reach of scientists working with a particular protein for which a “hard” structure is either currently unavailable or in the making. Indeed, it is interesting to note that crystallographers and cryo-electron microscopists are already suggesting using AF-predicted structures to solve experimental structures by molecular replacement methods46. Performing some rapid tests and comparing the results with those we provide in the US-SOMO-AF database could save them valuable time and perhaps hint at twists that should be applied to a predicted structure to better fit the X-ray, cryo-EM, and NMR data. In this respect, we would like to point out a tool present in the US-SOMO program that allows one to color-code a visualized structure based on the contribution of residues to a particular set of distances in a p(r) vs. r distribution47. For instance, this could provide an easier identification of domains that under- or over-contribute to that set of distances. This is another reason why we chose to produce real-space p(r) vs. r distributions instead of reciprocal-space simulated SAXS intensity vs. scattering vector curves, for which a wide variety of methods, often quite computationally intensive, exist48. More in-depth analyses could be subsequently performed on case-by-case basis.

For a more general application, assessing the reliability of a predicted structure could lead to better designed function/structure relationship experiments. The availability of the US-SOMO-AF database has the distinctive advantage of allowing a quick comparison without the need to master the expertise necessary to soundly calculate the relevant solution parameters.

There are, of course, a series of drawbacks associated with these computations. First and foremost, the AF predictions treat all proteins as single-chain entities. Efforts are underway (see Ref.6) to cope with this issue by allowing multi-chain predictions, and when an evolution in that sense appears in the AF database (only a general tool is presently available, see https://alphafold.ebi.ac.uk/faq), all parameters could be re-calculated for a new set.

A second evident drawback resides in the post-translational modifications that many proteins undergo. None were considered by the AF team, and we have made an important first step by removing the initiator methionine, signal and transit peptides, and producing alternate structures with/without propeptides. This resulted in ~ 11% (~ 110,000 out of ~ 1,002,000) of the AF-v2 structures being modified by our procedures, a sizeable amount. The most important remaining modification affecting the calculated parameters is glycosylation (e.g., see Table 1 in Ref.49). While UniProt provides a list of potential glycosylation sites for entries, and publications describing them when available, presently there is no direct way to obtain the composition of each carbohydrate associated with a particular site. This is a pity, as methods for building complex carbohydrates are already available and/or under development (see Ref.50), and it should be relatively straightforward to automatically add them at the appropriate sites. Indeed, this has just been independently advocated in a recent letter51. Even in the absence of time-consuming molecular dynamics minimization steps, this simple addition could increase the reliability of calculated hydrodynamic and structural parameters. While we hope that such an important step will be taken at the UniProt and/or AlphaFold databases level, users that need to refine the calculations on a predicted structure after having manually added any prosthetic group can easily do so by using one of the downloadable (http://somo.aucsolutions.com) US-SOMO versions.

The third drawback is the handling of flexibility, especially if large unstructured parts are predicted. Here the US-SOMO-AF database can only raise red flags, such as very high predicted [η] values associated with visualized extended, unstructured parts. Dealing with these issues requires much longer calculations involving either Monte Carlo methods or Brownian dynamics simulations (see Ref.52), which would require a major effort to be applied systematically on > 1,000,000 structures. While our simple test with three proteins (Fig. 6) just shows the complexity of the problem, comparing some experimental parameters with those calculated on a starting AF structure would still be quite informative.

All current data have also been deposited in Dryad (https://datadryad.org), which promises preservation. We expect to maintain the website as long as computational resources are available and community interest continues. The website leverages a framework (see Supplementary Methods) which is actively maintained, greatly simplifies website maintenance and updates, and is being used by multiple projects, some since 2013. Our plan is to update the database as new AlphaFoldDB datasets are released. However, modifications or additions to AlphaFoldDB released datasets (e.g., inclusion of carbohydrates) may require us to seek additional funding and/or solicit community contributions to enhance our processing pipeline and/or its component programs (hydrodynamic, structural and spectra calculations). We welcome any group or individual that wishes to host or contribute to the website, database or processing pipeline. They can contact us through the feedback tab of the website or email us directly.

All considered, we believe that the publicly available (https://somo.genapp.rocks) US-SOMO-AF database described here will become a useful tool allowing the research community, by comparing one or more experimentally-determined parameters with the corresponding computed ones, to quickly evaluate the compatibility in solution of an AlphaFold predicted protein structure.

Methods

Production of the results presented in this paper required five major steps: collect the AlphaFold entries and additional metadata; prepare the structures for hydrodynamic, structural and CD calculations; compute the hydrodynamic, structural and CD properties; build a database containing the hydrodynamic properties and additional metadata; and finally build a website allowing users convenient access to the database.

After downloading the AlphaFold-v1 and -v2 databases, we prepared the structures by removing the post-translational processing regions, where present, identified from the UniProt website. We utilized US-SOMO23,24,25 to compute hydrodynamic and structural properties. The US-SOMO suite uses a bead modeling strategy which takes into account the theoretical amount of “bound” hydration water, and the ZENO computational algorithm31,32,33 was employed to calculate the hydrodynamic parameters in a rigid-body frame. US-SOMO was also used to compute the p(r) vs. r distribution on not-hydrated structures, using SAXS-related parameters. To compute the CD spectra, we used SESCA20.
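To make the notion of a p(r) vs. r distribution concrete, the sketch below builds an unweighted histogram of all pair-wise distances from a set of coordinates. It is a deliberately simplified illustration, not the US-SOMO algorithm: every atom is given equal weight (US-SOMO uses SAXS-related parameters), no normalization by M is applied, and the function name and the 1 Å default bin width are assumptions.

```python
import numpy as np


def pair_distance_histogram(coords, bin_width=1.0):
    """Unweighted pair-distance histogram from an (N, 3) coordinate array.
    Returns (bin centers, pair counts). Needs at least two points.
    Memory scales as N^2, so large structures would need chunking."""
    coords = np.asarray(coords, dtype=float)
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs**2).sum(axis=-1))
    upper = np.triu_indices(len(coords), k=1)  # count each pair once
    d = dists[upper]
    nbins = int(np.floor(d.max() / bin_width)) + 1
    edges = np.arange(nbins + 1) * bin_width
    counts, _ = np.histogram(d, bins=edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts
```

The largest populated bin gives an estimate of the maximum dimension of the structure, and the histogram shape can be compared qualitatively with a SAXS-derived distribution.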

All the computed results were collected and inserted into a database. Full descriptions for all these steps can be found in the Supplementary Methods section.