Influence of spatial structure on protein damage susceptibility: a bioinformatics approach

Aging research is a very active field in which the deterioration or decline of various physiological features is studied. Here we consider the molecular level, which can also affect the macroscopic level. The proteinogenic amino acids differ in their susceptibility to non-enzymatic modification. Some of these modifications can lead to protein damage and thus affect the form and function of proteins. To assess this, it is important to know the distribution of amino acids between the protein shell/surface and the core. This was investigated in this study for all known structures of peptides and proteins available in the PDB. The results show that the shell contains less susceptible amino acids than the core, with the exception of thermophilic organisms. Furthermore, proteins could be classified according to their susceptibility. This classification can then be used in applications such as phylogeny, aging research, molecular medicine, and synthetic biology.


Hydrophobicity classification
We gather information about the share of hydrophobic AAs in every protein to give a relative representation of the bonding potential of the hydrophobic interactions within the proteins. When conformational changes occur, these interactions can easily lead to aggregation. Note that this information does not refer to protein surface hydrophobicity! It is generally known that hydrophobic and hydrophilic AAs differ in their distribution 1. This is because proteins are in an aqueous environment and therefore have to be hydrophilic at their surface. There is no definitive classification of hydrophobicity. Here we follow the descriptions of Voet and Voet 2, Lesk 3 and Berg, Stryer and Tymoczko 4, with some minor adjustments because we are only interested in hydrophobic potential. That means all nonpolar AAs, namely alanine, phenylalanine, glycine, isoleucine, leucine, methionine, proline and valine, are classified as hydrophobic. Two borderline cases, tyrosine and tryptophan, are also treated as hydrophobic because of their aromatic rings (Tab. 1). Glycine, for example, has a side chain so small that the attributes of the backbone dominate. In unbound form it would therefore be hydrophilic, and in bound form hydrophobic. Since we are working with bound AAs here, it is classified as hydrophobic. The resulting inconsistency for the first and last AA in the chain is neglected. Suppl.
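The classification above can be illustrated with a minimal Python sketch. It is not part of the original pipeline: the function name `hydrophobic_fraction` and the choice to normalize by the number of standard AAs in the sequence are assumptions made here for illustration only.

```python
# Hydrophobic set as classified in the text: the nonpolar AAs
# (A, F, G, I, L, M, P, V) plus the borderline aromatic cases Y and W.
HYDROPHOBIC = frozenset("AFGILMPVYW")
STANDARD = frozenset("ACDEFGHIKLMNPQRSTVWY")


def hydrophobic_fraction(sequence: str) -> float:
    """Return the share of hydrophobic AAs among the standard AAs.

    Assumption: non-standard characters (e.g. 'X', 'U', 'O') are ignored,
    and an empty sequence yields 0.0.
    """
    residues = [aa for aa in sequence.upper() if aa in STANDARD]
    if not residues:
        return 0.0
    hydrophobic = sum(1 for aa in residues if aa in HYDROPHOBIC)
    return hydrophobic / len(residues)


if __name__ == "__main__":
    # Two hydrophobic (A, A) and two hydrophilic (K, K) residues.
    print(hydrophobic_fraction("AAKK"))
```

A relative value of this kind allows proteins of different length to be compared, which is the purpose stated in the text.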

Spatial accessibility of secondary structures
Suppl.
2) The scripts #1-#9 (see Appendix) were used in the program Cloud2 version 14.3.20 (Heiko Stark, Jena, Germany, URL: https://starkrats.de) to extract the respective amino acids (AAs). This resulted in different sequence files ('all', 'core', 'hull', ...) modified according to the respective spatial information. This yields 9 different data sets, which for simplicity are referred to here as classified 3D-data. The following steps have to be carried out separately for every data set.
3) Script #10 was used to retrieve additional information, such as descriptions and organisms, from the structure files so that it can be included later.

4) Multi-conformational entries could not be handled at this point. Script #11 was therefore used to compile a list of all multi-conformational IDs so that these sequences can be excluded later.
Steps #2 to #4 can also be carried out in one step, but during development we arrived at this modular approach, which makes it possible to alter one aspect without having to recalculate all the spatial data.
5) The Perl (v5.22.1) program 'preprocess_description_table.pl' combines the information gathered in steps #2 to #4. It adds the additional information to the headers of the calculated sequences and removes the sequences whose IDs are on the multi-conformation exclusion list. The resulting files have the ending '..._descr...' added to them and are referred to here as preprocessed classified 3D-data.
6) The program 'Scorer_3D.pl' is a modification of the scoring program from Fichtner et al. (2019) 5. It leaves out the GO annotation and uses the scores from the table in 'Scores.txt'. For every data set, a peptide and a protein score file are created, as well as a histogram and a statistics file. Note that a further step follows (step #7) that will very likely change the final statistics. See also the supplement of Fichtner et al. (2019) 5 for further details.

7) Because the sequences are reduced in step #2 (except for the data set 'all'), the allocation to peptides and proteins does not work as intended in step #6: proteins can be downgraded to peptides if their reduced sequence falls below the threshold of 100 AAs. The program 'pep_prot_fixer.pl' reallocates these false-positive peptides. It takes the peptide and the protein score files, compares them with the preprocessed classified 3D-data and makes the necessary corrections. It also removes all entries with an 'X' amount higher than the chosen threshold; we used 0.05 as the default value. The created score files have the ending '..._fixed...' and are the final result of the chain. Additionally, the program 'no_length_0.pl' can be applied to obtain a copy without sequences of length 0, which can occur especially in peptides because of the differentiation between core and hull (e.g. if there is no core).
8) An additional analysis is made in step #8 on the 'preprocessed classified 3D-data'. The program 'amino_acid_counter.pl' counts the occurrences of all characters. The results are discussed in our paper, and Suppl. 3 contains additional diagrams for this approach.
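The reallocation logic of step #7 can be sketched as follows. This is a hedged illustration in Python, not the code of 'pep_prot_fixer.pl': the function names, the input layout (a mapping from ID to reduced sequence and full-length sequence size), the interpretation of the 'X' amount as a fraction of the reduced sequence, and the treatment of exactly 100 AAs as a protein are all assumptions.

```python
PROTEIN_MIN_LENGTH = 100  # threshold of 100 AAs named in the text
MAX_X_FRACTION = 0.05     # default value named in the text


def keep_entry(sequence: str, max_x_fraction: float = MAX_X_FRACTION) -> bool:
    """Assumed filter: drop entries whose share of 'X' exceeds the threshold."""
    if not sequence:
        # Empty sequences are kept here; the text handles them in a
        # separate, optional step ('no_length_0.pl').
        return True
    return sequence.count("X") / len(sequence) <= max_x_fraction


def reallocate(entries):
    """Split entries into peptides and proteins by their FULL length.

    entries: dict mapping an ID to (reduced_sequence, full_length), where
    full_length is the length of the unreduced sequence (hypothetical layout).
    """
    peptides, proteins = {}, {}
    for entry_id, (reduced, full_length) in entries.items():
        if not keep_entry(reduced):
            continue
        # Classify by the unreduced length so that a short core or hull
        # does not turn a protein into a peptide.
        if full_length >= PROTEIN_MIN_LENGTH:
            proteins[entry_id] = reduced
        else:
            peptides[entry_id] = reduced
    return peptides, proteins


if __name__ == "__main__":
    example = {
        "p1": ("A" * 30 + "X", 150),  # short hull of a protein, few 'X'
        "p2": ("AXXA", 150),          # too many 'X', removed
        "p3": ("ACD", 20),            # genuine peptide
    }
    print(reallocate(example))
```

The point of the sketch is the design choice described in the text: the class of an entry is decided by the full sequence, while the stored sequence remains the spatially reduced one.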

Extended Graham Scan
For our calculations we used the Graham scan (https://en.wikipedia.org/wiki/Graham_scan). With this algorithm, the convex hull of a finite set of points in 2D can be calculated.
In a second step we check whether a line connecting two atoms in the hull is longer than the minimal distance (6 Å or 7 Å). If so, the line is replaced by two or more lines connecting the two atoms via the closest indented node(s). This finally results in the concave hull. This simplified procedure is generalized to sectional planes along the three spatial axes to identify all surface-associated atoms.
In this way we obtain the concave hull, but only for topologies of genus zero (i.e. without holes). The code can be found at https://github.com/heikostark/Projekte/blob/cloud2/mesh.inc (lines 251 to 704 contain the code for the extended Graham scan).
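To make the first step concrete, the following Python sketch computes a 2D convex hull with a standard Graham scan and then flags the hull edges that exceed a chosen distance. It is an illustration only and does not reproduce the code in mesh.inc: the function names are hypothetical, and the replacement of long edges by indented nodes (the actual extension towards the concave hull) is deliberately left out.

```python
import math
from functools import cmp_to_key


def cross(o, a, b):
    """Cross product of OA x OB; positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def graham_scan(points):
    """Return the convex hull in counter-clockwise order, pivot first."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    # Pivot: lowest point, leftmost on ties.
    pivot = min(pts, key=lambda p: (p[1], p[0]))

    def by_angle(a, b):
        turn = cross(pivot, a, b)
        if turn > 0:
            return -1
        if turn < 0:
            return 1
        # Same direction from the pivot: nearer point first.
        da = (a[0] - pivot[0]) ** 2 + (a[1] - pivot[1]) ** 2
        db = (b[0] - pivot[0]) ** 2 + (b[1] - pivot[1]) ** 2
        return (da > db) - (da < db)

    rest = sorted((p for p in pts if p != pivot), key=cmp_to_key(by_angle))
    hull = [pivot]
    for p in rest:
        # Remove points that would create a clockwise or straight turn.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def long_edges(hull, min_distance):
    """Return hull edges longer than min_distance (e.g. 6 or 7 Angstrom)."""
    n = len(hull)
    edge_count = n if n > 2 else max(n - 1, 0)
    edges = []
    for i in range(edge_count):
        a, b = hull[i], hull[(i + 1) % n]
        if math.dist(a, b) > min_distance:
            edges.append((a, b))
    return edges


if __name__ == "__main__":
    square = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
    hull = graham_scan(square)
    print(hull)
    print(long_edges(hull, 1.5))
```

In the procedure described above, each flagged edge would be the starting point for the indentation step; applying this per sectional plane is what generalizes the 2D scan to the 3D structure.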

Statistics of the influence of the spatial protein structure on the susceptibility
Note that these are not the statistics from the statistics file (see Suppl. 3, step #6). Since those statistics are changed by a subsequent step, the values here were obtained with an R script that can be found in the supplementary data. It is called 'boxplot_proteins' and also contains the code for the plots in Fig. 2.

Example Comparison with and without water and ligands
We took the file for the protein human oxyhaemoglobin (PDB entry: 1hho), which is provided in the PDB with water and ligands. For the comparison we analyzed this protein (which is also shown in Figure 1 of the main text) with and without these additional molecules (see Suppl. Our analysis showed that the protein shell contains 148 amino acids in the presence of water and ligands and 167 in their absence. That means 19 amino acids are no longer shielded once water and ligands are removed. These AAs show a spatially heterogeneous distribution. However, the difference between core and shell remains (see http://damage.stark-jena.de).

Suppl. Fig. 3: Representation of the amino acids with (a) and without (b) the water/ligand components of the protein oxyhaemoglobin (PDB entry: 1hho). A dot represents the averaged centre of an amino acid (red when in the protein shell (PS), yellow when in the protein core (PC)) or of a water molecule/ligand (blue). Red lines show the cross-linking from the surface calculation, which defines the surface (concave hull). All unlinked dots are assigned to the PC. This graphic was created with Cloud2 version 15.7.22 (https://starkrats.de).

Further amino acid distribution visualizations
In the PDB data the information is resolved at the level of atoms, which are allocated to different AAs. Entries other than the 20 standard AAs are also listed. Besides the very rare selenocysteine ('U'; only 191 occurrences) and pyrrolysine ('O'; only 2 occurrences), these include unknown positions ('X'). Other molecules such as ligands or water on the protein surface are listed as well. These entries are combined by Cloud2 into a single 'X' at the end of the sequence. For the creation of the diagrams only the 20 standard AAs were considered; in particular, 'X's were not included in the calculation. For a closer look at AA distributions, Suppl. Fig. 3 shows the distribution of all standard AAs over all proteins we analyzed from the PDB.
Suppl. Figure 4: Normalized amino acid distribution for all proteins with structure information from PDB. This diagram was created with LibreOffice Calc version 6.4.6.2.
In the following, some normalized versions of the AA distributions with different focuses are shown. Suppl. Fig. 4 shows how the AAs are distributed over protein shell (PS) and core (PC). Here the actual values of an AA for PS and PC are each divided by the sum of both values for that AA and expressed in percent, so that PS + PC equals 100 % for every AA. This shows the distribution of the respective AA over PS and PC. As expected, the PC dominates most of the entries because the complete PDB data set contains many larger proteins. With increasing size, the share of AAs considered to be in the core generally increases. For more examples we refer to our website http://damage.stark-jena.de.
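The PS/PC normalization described above can be written as a short Python sketch. The function name `shell_core_percent` and the dictionary-based input are assumptions for illustration; the original diagrams were created with LibreOffice Calc.

```python
STANDARD_AAS = "ACDEFGHIKLMNPQRSTVWY"


def shell_core_percent(shell_counts, core_counts):
    """Return, per standard AA, its share in shell and core in percent.

    shell_counts / core_counts: dicts mapping a one-letter AA code to its
    absolute count in the protein shell (PS) and protein core (PC).
    AAs that occur in neither part are skipped (assumption).
    """
    result = {}
    for aa in STANDARD_AAS:
        ps = shell_counts.get(aa, 0)
        pc = core_counts.get(aa, 0)
        total = ps + pc
        if total == 0:
            continue
        # PS share + PC share equals 100 % for every AA.
        result[aa] = (100.0 * ps / total, 100.0 * pc / total)
    return result


if __name__ == "__main__":
    shell = {"A": 1, "K": 3}
    core = {"A": 3, "K": 1}
    print(shell_core_percent(shell, core))
```

Because each AA is normalized on its own total, rare and frequent AAs become directly comparable in their shell/core preference.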

Further comparison of enzyme properties
Suppl. Figure 9:

Comparison between organisms
Our analysis revealed visible differences between organisms from different kingdoms (Suppl. Fig. 9). Bacteria show more unsusceptible proteins than other organisms, while animals show more susceptible proteins. The plants, which seem neither susceptible nor unsusceptible, occupy a special position.

Calculation and normalization
The values for the Supplement Figures 9 and 10 were calculated in two cases, depending on the relation between the mean score of the protein shell (PS) and the mean score of the protein core (PC). In case 2 a normalization is applied so that its values lie in the range of 1 to 2; overall, all values therefore lie in the range of 0 to 2.
case 1 (mean score PS < mean score PC): mean score PS / mean score PC
case 2 (mean score PS > mean score PC): 1 + (1 - (mean score PC / mean score PS))
Suppl. Figure 10: Cladogram of the genera Sulfolobus, Escherichia, Arabidopsis, Saccharomyces, Drosophila and Homo. The leaves/clades show the distribution of the peptides and proteins with regard to the difference between protein core (PC) and shell (PS) (see difference calculation). These graphics were created with Enzyme2 version 9.7.22 (https://starkrats.de) and LibreOffice Impress version 6.4.6.2.
Suppl. Figure 11: Sorted representation of the organisms with regard to the susceptibility between protein shell (PS) and protein core (PC) (see difference calculation). Values above 1 mean a more susceptible PS compared to the PC and values below 1 mean a more susceptible PC compared to the PS. This diagram was created with LibreOffice Calc version 6.4.6.2.
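The two-case difference calculation defined under 'Calculation and normalization' can be sketched in Python as follows. The function name `ps_pc_difference` is hypothetical. Two points are assumptions not stated in the text: mean scores are taken to be strictly positive (which the stated range of 0 to 2 implies), and equal mean scores are mapped to 1, the value both cases approach.

```python
def ps_pc_difference(mean_ps: float, mean_pc: float) -> float:
    """Map the PS/PC relation of mean scores onto the range 0 to 2.

    Values below 1: PS score lower than PC score (case 1).
    Values above 1: PS score higher than PC score (case 2).
    """
    if mean_ps <= 0 or mean_pc <= 0:
        # Assumption: the formula is only meaningful for positive scores.
        raise ValueError("mean scores must be positive")
    if mean_ps < mean_pc:
        # Case 1: plain ratio, lies between 0 and 1.
        return mean_ps / mean_pc
    if mean_ps > mean_pc:
        # Case 2: mirrored ratio, lies between 1 and 2.
        return 1 + (1 - (mean_pc / mean_ps))
    # Assumption: equal means are placed at the boundary value 1.
    return 1.0


if __name__ == "__main__":
    print(ps_pc_difference(1.0, 2.0))  # case 1
    print(ps_pc_difference(2.0, 1.0))  # case 2
```

Mirroring the ratio in case 2 makes the measure symmetric around 1: a shell score twice the core score (1.5) lies as far above 1 as a core score twice the shell score (0.5) lies below it.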
A more detailed analysis of the distribution shows differences between organisms in the susceptibilities of PS and PC. These are most evident in the Archaea, which on average have a more susceptible PS than PC (Suppl. Fig. 10). It is noticeable that Rattus norvegicus and Mus musculus differ greatly from their closest relatives (this may be because they are experimental animals and certain protein classes are therefore overrepresented). Otherwise, the organisms separate into unicellular and multicellular groups.
For more examples we refer to our website http://damage.stark-jena.de.

Top and Bottom 100
For analysis and as an overview, we compiled, for every protein part distinguished in our work, a list of the top 100 and bottom 100 entries of the susceptibility calculation. The file is called "TopBot100.ods". The entries also contain the organism, and across the lists we counted the occurrences of some specific organisms. This makes it possible to see and compare which organisms rank relatively high or low in the respective protein part.