Introduction

It is well accepted that geometrical representations of chemical structures contain not only descriptive information but insights of the native configuration of the represented molecules. In the case of proteins, it has been observed that their tridimensional (3D) structure provides information about their function in living organisms1. Using graphic approaches to study biological and medical systems can provide an intuitive vision and useful insights for helping analyze complicated relations therein, as indicated by many previous studies on a series of important biological topics (particularly for the topics of enzyme kinetics2,3,4,5, protein folding rates6,7,8,9, and low-frequency internal motion10,11).

Thus, the use of 3D molecular descriptors (MDs) can be considered as an approach for inferring information about structural properties and their related quantities. A good number of prediction models that link 3D chemical structures with activity or properties (QSAR/QSPR) have been generated from 3D-MDs, which have been extensively used for the characterization of organic molecules and small chemical systems12. However, in the case of proteins a few biomacro-molecular indices have been proposed for sequence codification and spatial information extraction13,14,15. This indicates that the approaches based on MDs have not been completely exploited, and it could be considered a field subjected to further theoretical development in protein science.

The modelling of physicochemical properties and biological interactions for proteins require the extraction of information regarding sequence, spatial configuration and the chemical characteristics of every amino acid present on the structure12,16,17,18. Thus, it is important to generate new 3D-MDs for proteins that consider all these features present in 3D structures that provide new, non-redundant information and a more complete characterization of them.

Marrero-Ponce et al. introduced a new set of MDs that consider topology (2D) related characteristics for organic molecules19,20,21,22,23, which has been included in QuBiLs MAS (Quadratic, Bilinear and N-Linear Maps based on graph-theoretic electronic-density Matrices and Atomic Weightings) software24. These 0-2D and chiral MDs were obtained codifying the structural information, using algebraic bilinear forms, and considering electronic density graph-based matrices. Based on their performance and seeking a generalization of this mathematical proposal (N-linear algebraic forms, related to tensor algebra), the definition of geometrical 3D-MDs for organic molecules was also proposed. This approach allowed the use of N-linear algebraic forms as well as other mathematical considerations such as metrics and aggregation operators to increase the information extraction for the resultant indices25,26,27. The aforementioned approach was named QuBiLs MIDAS (Quadratic, Bilinear and N-Linear Maps based on N-tuple Spatial Metric [(Dis)-Similarity] Matrices and Atomic Weightings)28 and several preliminary studies with the QuBiLS-MIDAS 3D-MDs demonstrated a satisfactory behavior, suggesting that this algebraic strategy yields information-rich indices of relevance in chemoinformatic studies26.

There are several applications in protein science such as the prediction of protein structural classes29 and the folding rate of proteins30, which have defined benchmark data sets that have been used in numerous articles31,32,33,34. It has been observed that the amino acid sequence and the various interactions between every amino acid present on a protein, could give information concerning the global stability of the native structure and folding process, indicating that the folding rate of proteins do not consider solely thermodynamic factors35. Therefore, the folding rate of proteins could provide information about the function of a protein based on its geometrical and topological configurations36,37.

Regarding structural class prediction, it has been used as a tool to predict protein function and evolution since the 1970s38. Based on the importance and amount of information related to these two properties, several computational methodologies have been proposed for their calculation. Considering the case of structural classification, there are several methods proposed for this purpose: the amino acid composition (AAC)33, pair-coupled amino acid composition39, pseudo amino acid composition (PseAAC)14, and a mathematical based strategy considering bilinear descriptors40. Concerning protein folding rate, there are several indices that consider the topology/geometry of proteins and the number of contacts between amino acids for the prediction of this property15,30,41,42,43.

The major disadvantage of the AAC-based methods is the reduced consideration of the interaction effects generated by the sequence of the protein, generating lower quality on the prediction. There have been several approaches based on PseAAC that were proposed to improve the prediction of these type of descriptors44,45,46,47,48,49. Regarding the descriptors generated for protein folding, they consider geometrical/topological concepts, distance between the residues in contact as well as long- and short-range interactions based on the conformation of the protein. However, the disadvantage of these approaches is that they do not consider the whole 3D nature of proteins and the information contained on it, since it has been proven that folding rate does not only depend on sequence36.

As demonstrated by a series of recent publications50,51,52,53,54,55 and summarized in a comprehensive review56, to develop a really useful predictor for a biological system, it can be recommended to follow Chou’s 5-step rule which contains the following steps: (a) select or construct a valid benchmark dataset to train and test the predictor; (b) represent the samples with an effective formulation that can truly reflect their intrinsic correlation with the target to be predicted; (c) introduce or develop a powerful algorithm to conduct the prediction; (d) properly perform cross-validation tests to objectively evaluate the anticipated prediction accuracy; (e) establish a user-friendly web-server for the predictor that is accessible to the public. Papers presented for developing a new sequence-analyzing method or statistical predictor by observing the guidelines of Chou’s 5-step rules have the following notable merits: (1) crystal clear in logic development, (2) completely transparent in operation, (3) easily to repeat the reported results by other investigators, (4) with high potential in stimulating other sequence-analyzing methods, and (5) very convenient to be used by the majority of experimental scientists.

The main aim of this study is the introduction of a new class of 3D protein MDs based on N-linear algebraic forms that consider several mathematical tools as concept generalization for enhanced information extraction from proteins. The utility of these novel 3D-biomacro-molecular indices will be evaluated by the prediction of SCOP-structural classes of proteins and its folding rate by using Linear Discriminant Analysis (LDA) and Multiple Linear Regression (MLR) techniques, respectively.

Theoretical Framework

The concept of algebraic based (bilinear) 3D-MDs was proposed in 2015 by Marrero- Ponce et al. as a tool for protein structural codification40, and an initial extension of a geometric distance matrix12,57 for a protein was obtained.

However, the use of tensor algebra to codify relations between more than 2 atoms (3 and 4 atoms) has been used for organic molecules as a strategy for obtaining more information from the geometrical 3D molecular structure26. In this work, the N-tuple algebraic form concept (N = 3) will be evaluated for the calculation of 3D-protein descriptors.

Definitions for the total and amino acid level 3D protein descriptors based on three-linear forms

The definition for any kth three-linear biomacro-molecular descriptors for a protein must consider a canonical basis set and the application of N-linear forms (maps) in a \({{\mathbb{R}}}^{n}\) space; Eq. (1) indicates the mathematical expression for this definition:

$${}_{tr}{}^{k}L=t{r}^{k}(\bar{x},\bar{y},\bar{p})=\sum _{i=1}^{n}\,\sum _{j=1}^{n}\,\sum _{l=1}^{n}\,{z}_{ijl}^{k}{x}^{i}{y}^{j}{p}^{l}$$
(1)

This trilinear form could be defined by using matrices as follows,

$${}_{{\boldsymbol{tr}}}{}^{{\boldsymbol{k}}}{\boldsymbol{L}}=[{\boldsymbol{X}}]\,{{\mathbb{Z}}}^{{\boldsymbol{k}}}{[{\boldsymbol{Y}}]}^{{\boldsymbol{T}}}{[{\boldsymbol{P}}]}^{{\boldsymbol{T}}}={X}_{(1\times {\bf{n}}\times 1)}\,{{\mathbb{Z}}}_{({\bf{n}}\times {\bf{n}}\times {\bf{n}})}^{{\boldsymbol{k}}}{Y}_{({\bf{n}}\times 1\times 1)}{P}_{(1\times 1\times {\bf{n}})}$$
(2)

where, \({}_{tr}{}^{k}L\,\) is the resulting trilinear form MD, n is the number of amino acids (aa) present on the protein, \([X],\,[Y],\,[P]\) are the macro-molecular vectors containing x1,, xn, y1,…,yn and p1,, pn elements, which are the physicochemical properties of every aa present in the protein structure58,59. A Table indicating all physicochemical properties considered on this study is available on the Supplementary Material SMI-A. The kth total three-tuple-(dis)similarity matrices (T-TDSM) (\({{\mathbb{Z}}}^{k}\)) is a three-order tensor whose elements \({z}_{ijl}^{k}\) are calculated by using relationships (multi-metrics) between three aa. These relationships will be discussed in Section 2.4.

Based on the physicochemical nature of the properties used for the macromolecular vectors conformation, the following algebraic forms could be defined: (1) Trilinear Canonical (when all macro-molecular vectors are configured differently, that is, using 3 different aa properties) (see Fig. 1), (2) Trilinear linear (when 2 of the macro-molecular vectors are the identity vector and the other one is an aa property), (3) Trilinear bilinear (when 2 macro-molecular vectors have the same configuration (that is to say, by using the same aa property) and the other one is the identity vector), (4) Trilinear quadratic bilinear (when 2 macro-molecular vectors have the same configuration and the other one has a different aa property from the previous), and (5) Trilinear cubic (when all the macro-molecular vectors have the same configuration, i.e., use the same aa property).

Figure 1
figure 1

Schematic indication of the transformation of the information contained on macro-molecular vectors using spatial information of the protein (Three-Tuple-(Dis)Similarity-Matrices) (TDSM) and algebraic forms. Where n is the number of amino acids present on the protein, [X], [Y], [P] are macro-molecular vectors; zkijl are elements of the TDSM and trL is the resulting MD. These algebraic forms are defined by the physicochemical nature of the macro-molecular vectors.

Moreover, the definition of aa-based kth three-linear MDs for every aa in the protein is shown in Eq. (3):

$${}_{tr}{}^{k}L_{aa}=t{r}^{aa,k}(\bar{x}\,,\,\bar{y},\bar{p})=\sum _{i=1}^{n}\,\sum _{j=1}^{n}\,\sum _{l=1}^{n}\,{z}_{ijl}^{aa,k}{x}^{i}{y}^{j}{p}^{l}=[X]\,{{\mathbb{Z}}}^{aa,k}{[Y]}^{T}{[P]}^{T}\,\forall \,aa=1,2,\ldots ,n$$
(3)

where, x1,, xn, y1,, yn and p1,, pn are the components of the macro-molecular vectors.

The kth amino acid-level three-tuple-(dis)similarity matrices (A-TDSM) (\({{\mathbb{Z}}}^{aa,k}\)) with elements \({z}_{ijl}^{aa,k}\) are computed by considering the following rules:

$$\begin{array}{ll}{z}_{ijl}^{aa,k}={z}_{ijl}^{k} & {\rm{if}}\,{\boldsymbol{i}}\wedge {\boldsymbol{j}}\wedge {\boldsymbol{l}}={\bf{a}}{\bf{a}}\\ {z}_{ijl}^{aa,k}=\frac{2}{3}{z}_{ijl}^{k} & {\rm{if}}\,{\boldsymbol{i}},{\boldsymbol{j}}\vee {\boldsymbol{j}},{\boldsymbol{l}}\vee {\boldsymbol{i}},{\boldsymbol{j}}={\bf{a}}{\bf{a}}\\ {z}_{ijl}^{aa,k}=\frac{1}{3}{z}_{ijl}^{k} & {\rm{if}}\,{\boldsymbol{i}}\,\vee {\boldsymbol{j}}\vee {\boldsymbol{l}}={\bf{a}}{\bf{a}}\\ {z}_{ijl}^{aa,k}=0 & {\rm{otherwise}}\end{array}$$
(4)

Consequently, if a protein contains “B” aa in its structure, the T-TDSM (\({{\mathbb{Z}}}^{k}\)) can be expressed as the sum of “B” aa-level matrices (\({{\mathbb{Z}}}^{aa,k}\)) (see Fig. 2). From this concept, after the application of algebraic maps on every A-TDSM, we will obtain “B” aa-level indices, denoted as \({}_{tr}L_{aa}\) (see Eq. (3)), which will be stored on an array (see Fig. 3).

Figure 2
figure 2

Graphical representation of the differences on the computation between (a) total and (b) amino acid-based tensors for the novel 3D algebraic MDs for a simple example, i.e., truncated peptide PDB file (5WRX).

Figure 3
figure 3

Novel molecular descriptors calculation illustration. (A) Protein structure is filtered considering a protein representation (Section 2.4.) for calculating the relationship between two (metrics, SMI-C) and three amino acids (multi-metrics, Table 2). (B,C) The computation of the macromolecular vectors considers a group of physicochemical properties and the sequence of the structure (Section 2.1.). (D) The T-TDSM can be filtered considering several groups of amino acids to evaluate their role for a certain application (Section 2.2.). (E) The non-stochastic tensor is raised to the kth power (−12 to 12) applying a Haddamard matrix product, to evaluate the interactions between amino acids (Section 2.4.). (F,G) The non-stochastic tensor can be normalized using the simple stochastic and the mutual probability methods, respectively. (Section 2.5.). (H) The total tensor can be split into amino acid-based tensors (Section 2.1.). (I) The application of N-algebraic forms allows the transformation of the extracted information present on the macromolecular vectors and the tensors (Section 2.1.). (J) The obtained amino acid-based indices are stored in a Local Amino Acidic Invariant (LAI) (Section 2.1.). (K) The use of aggregation operators is proposed as a fusion operation for the LAI (Section 2.3.)

This array will be designated as LAI (Local Amino Acidic Invariant) as a correspondence of the LOVI vector for organic molecules (Local Vertex Invariant)60,61. From the LAI vector, the total (whole-protein) three-linear indices can be calculated by using aggregation operators (which is a generalization concept for merging components)62. These aggregation operators will be discussed in Section 2.3. The general calculation scheme for these novel biomacro-molecular indices is shown in Fig. 3.

Definition for the group-based 3D protein MDs considering three-linear forms

If we consider clusters of aa classified in terms of their activity/properties on solution or their probability to generate a certain secondary structure (see Table 1), group-based indices can be computed by choosing the selected aa-based indices stored in the LAI. Consequently, a new vector denominated Local Group-based Amino Acidic Invariant (LAIG) is generated. Considering the concept of aggregator operators, a new type of general indices based on aa groups could be generated. This operation allows to evaluate the influence of certain aa in a variety of applications on protein science.

Table 1 Amino acids groups considered for the computation of the novel 3D algebraic biomacro-molecular descriptors for proteins.

Generation of novel protein mds from amino acid-based indices using aggregation operators

An invariant could be defined as a generalization procedure for merging different components to obtain one fused expression. The hypothesis that the most appropriate global definition of a natural system may not necessarily be additive is our initiative to propose this tool as an alternative for the generation of MDs. As proof of the concept, in the work done by Barigye et al.62, it was demonstrated that other operators besides the sum could yield better correlations with determined chemical properties. These invariants (aggregator operators) are classified in four major groups that are presented as follows: (i) Norms (or Metrics) Invariants: Minkowski norms (N1, N2, N3). Note that the N1 corresponds to the linear combination (summation) of the elements in LAI; (ii) Mean Invariants (first statistical moments): Geometric mean (G), arithmetic mean (M), quadratic mean (P2), power mean of third degree (P3) and harmonic mean (A); (iii) Statistical Invariants (highest statistical moments): Variance (V), skewness (S), kurtosis (K), standard deviation (SD), variation coefficient (CV), range (R), percentile 25 (Q1), percentile 50 (Q2), percentile 75 (Q3), inter-quartile range (I50), maximum trL (MX) and minimum trL (MN); and iv) Classical Invariants: Autocorrelation (AC), Gravitational (GV), Total Information Content (TIC), Mean Information Content (MIC), Standardized Information Content (SIC), Total Sum (TS) and Kier-Hall Connectivity (KH).

These invariants are applied to the LAI vector that contains the aa based indices as a strategy to obtain a series of global (or local: aa-based or group-based) indices that could contain orthogonal information from the use of the metric invariant N1. A Table indicating all formulae for the aggregation operators proposed is indicated on SMI-B.

Definition of the three-tuple-(Dis) similarity matrix (TDSM) for physicochemical information extraction

Macro-molecular graphs allow the study of chemical interactions in biological systems to obtain more information on the behavior shown on experimental observations63,64; protein geometric (3D) representations indicate the distribution of its constituent amino acids in space. It is important to mention that the stability and maintenance of this complex structure relies on the inter-residue interactions65. Regarding this graphical approach, the aa on the protein can be considered as pseudo-vertices, which possess spatial coordinates defined by a chosen carbon representation. Alpha carbon (Cα) has been the most used representation for protein geometrical/topological studies12,15,64,66, however, there were studies where Beta Carbon (Cβ) was considered as a simple atom(pseudo-node)-based representation67.

In this report, we propose two additional representations (Amide Carbon (AB) and the average of the coordinates of all atoms in the amino acid (AVG)) to observe the behavior and information content that these representations could bring respect the other existing representations. Furthermore, all interactions and bonding between these pseudo vertices are considered as connections between them. Here, all these interactions between amino acids will be computed by considering relationships (multi-metrics) among three aa \(({z}_{ijl}^{k})\). Therefore, three-tuple spatial-(dis)similarity matrices \(({{\mathbb{Z}}}^{k})\) will be generated as a representation of the bio-macro-molecular structure.

The formal definitions of elements \({z}_{ijl}^{k}\) of the matrix \({{\mathbb{Z}}}^{k}\) are indicated as follows (see Eq. (5)) (See Fig. 4):

$$\begin{array}{rcl}{z}_{ijl}^{k} & = & \begin{array}{cc}T{T}_{ijl} & {\rm{if}}\,{\bf{i}}\wedge {\bf{j}}\wedge {\bf{l}}\,are\,not\,equal\end{array}\\ & = & \begin{array}{cc}{D}_{ijl} & {\rm{if}}\,{\bf{i}},{\bf{j}}\vee {\bf{j}},{\bf{l}}\vee {\bf{i}},{\bf{l}}\,are\,equal\end{array}\\ & = & \begin{array}{cc}0 & {\rm{otherwise}}\end{array}\end{array}$$
(5)

where, \(T{T}_{ijl}\) is a measure for ternary relations of amino acids (multi-metric), \({D}_{ijl}\) is a measure for duplex relation of amino acids (metric between 2 amino acids).

Figure 4
figure 4

Computation of the Three-Tuple-(Dis) Similarity Matrix (TDSM) for an example truncated peptide (5WRX). Zijl is the value resulting of the use of a multi-metric (Bond Angle, Triangle Perimeter) (see Table 2). The obtained tensor has n × n × n dimensions, where n is the number of amino acids on the protein.

From Eq. (5) we can observe that, when the aa i, j or l on the protein are different, the measure used for calculation is ternary. The ternary measures used for the computation of the indices are indicated in Table 2. However, when a multi-metric cannot be computed (two aa are the same), then it could be reduced to an inferior measure (duplex relation). The duplex measures used for the computation are indicated in SMI-C. It is important to remark that when a ternary measure is selected to codify the information of the protein, is mandatory to select at least one duplex measure or metric. Nevertheless, the selection of a metric is not mandatory when the ternary measures are related to the Volume, Bond Angle and Dihedral Angle measures (see Fig. 5).

Table 2 Multi-metrics available for the calculation of the novel 3D algebraic MDs for proteins. In bold, the software ID number of the multi-metric is indicated.
Figure 5
figure 5

Selection of multi-metrics or metrics for the definition of the Three-Tuple-(Dis) Similarity Matrix (TDSM) on the truncate peptide 5WRX by using AB representation. A multi-metric is considered (a) Complete when it considers not only the relationships between 3 amino acids (multi-metrics, here Triangle Perimeter), but also relationships between 2 amino acids (metrics, here Euclidean Distance). A multi-metric is considered (b) Non-Complete when it considers only the relationships between 3 amino acids (relationships between 2 amino acids are defined as zero in the TDSM). Moreover, the diagonal of the tensor (conformed by all the tensor elements where i = j = l), could have zero values if the measure was applied considering every aa as a reference or they could be different from zero values if the measure was applied considering the center of mass of the protein.

There are two possibilities regarding the application of multi-metrics or metrics on the protein structure, these could be amino acid-based, or protein mass center-based. In the first option, the multi-metric is calculated considering the distance functions against every aa, consequently, the elements zijl of the T-TDSM when i = j = l, are zero. For the second case, the multi-metric is calculated considering the selected metric of each amino acid to the mass center of the protein, and all elements zijl on the T-TDSM are different from zero; this approach may offer a better discrimination among protein spatial structures given that it provides information about the centrality of aa residues.

The kth three-tuple-(dis)similarity matrix is obtained by performing a Hadamard matrix product12. This procedure performs the power operation in every element of the three-tuple-(dis)similarity matrices. The exponent k is a real number whose values can be positive or negative; when the parameter k is negative, the reciprocal operation is computed. This operation aims for the information extraction accounted by the intra-molecular forces that occur in the protein structure due the residues present in every aa. The range of values to evaluate this product could be from −12 to 12, e.g. k = −1 is related to the gravitational potential, k = −2 is related to the Coulomb potential (See Fig. 6 for more details).

Figure 6
figure 6

Application of the Hadamard Matrix Product on the Three-Tuple-(Dis) Similarity Matrix (TDSM) for the example truncated peptide 5WRX.

When normalizing procedures are not employed (see below section 2.6) for the elements of \({{\mathbb{Z}}}^{k}\), these matrices are designed as the kth non-stochastic three-tuple-(dis)similarity matrices (NS-T-TDSM) \(({}_{ns}{\mathbb{Z}}_{k})\).

Probabilistic transformations of the TDSM

Although normalization methods for geometrical matrices are not usually employed, there are several descriptors which use this concept for organic molecules and RNA secondary structures, protein sequences and viral surfaces68,69,70,71,72. There are advantages of using normalized matrices such as information standardization and as a tool for the computation of different kth three-linear MDs25.

Since probabilistic transformations have only been applied for two-tuple matrices, a generalization for these concepts will be used to normalize the kth non-stochastic three-tuple-(dis)similarity matrices obtained from the computation described computation above. In this study, two probability schemes could be applied: a) simple stochastic and b) mutual probability transformations.

The kth simple-stochastic three-tuple-(dis)similarity matrices \({}_{ss}{\mathbb{Z}}_{k}\) (SS-T-TDSM) and kth mutual probability three-tuple-(dis)similarity matrices \({}_{mp}{\mathbb{Z}}_{k}\) (MP-T-TDSM), which are obtained from \({}_{ns}{\mathbb{Z}}_{k}\), have been defined as follows:

$${}_{ss}z_{i\,jl}^{k}=\frac{{}_{ns}z_{i\,jl}^{k}}{{S}_{jl}}=\frac{{}_{ns}z_{i\,jl}^{k}}{{\sum }_{j=1}^{n}\,{\sum }_{k=1}^{n}\,{}_{ns}z_{i\,jl}^{k}}$$
(6)
$${}_{mp}z_{ijl}^{k}=\,\frac{{}_{ns}z_{ijl}^{k}}{{S}_{ijl}}=\frac{{}_{ns}z_{ijl}^{k}}{{\sum }_{i=1}^{n}\,{\sum }_{j=1}^{n}\,{\sum }_{k=1}^{n}\,{}_{ns}z_{ijl}^{k}}$$
(7)

where, \({}_{ns}z_{ijl}^{k}\) are the elements of the kth non-stochastic three-tuple-(dis)similarity matrices. Sjl is the summation of all entries of the two-tuple matrix corresponding to each aa i in a three-tuple matrix for the simple stochastic case whereas for the mutual probability scheme, Sijl is the summation of all elements of the tensor \({}_{ns}{\mathbb{Z}}_{k}\) (see Fig. 7).

Figure 7
figure 7

Application of probabilistic transformations on the Three-Tuple-(Dis) Similarity Matrix (TDSM). The simple stochastic transformation (SS) consists on dividing every element of a 2D matrix for the sum of all elements in that 2D matrix. The mutual probability procedure consists on dividing every element of a 2D matrix for the sum of all elements in the tensor (3D matrix).

Computational calculation of the new proposed protein MDs

These novel 3D algebraic MDs can be generated by using the in-house software MuLiMs MCoMPAs (at ToMoCoMD-CAMPS system), an open access java-based software. The software allows the user to evaluate all the theoretical configurations presented above and it is available at http://www.tomocomd.com/; it runs on all operative systems available and it presents two versions, a graphical user interface (GUI) version and console version for calculations on a high-performance computing system (HPC).

Application Of The N-Linear 3d Algebraic Biomacro-Molecular Descriptors To The Prediction Of Folding Rate And Scop Structural Classification Of Proteins

Benchmark datasets

The training set used for the modelling of the folding rate of proteins (80 proteins) was proposed by Ouyang31. It is important to mention that the case “2BLM” was removed from the set since this case considers only the alpha carbon representation. The test set used here (17 proteins) was proposed by Ruiz-Blanco36.

The set used for protein structural classification (204 proteins) was proposed by K.C. Chou based on the SCOP classification (52 all alpha, 61 all beta, 45 alpha/beta and 46 alpha + beta)39. This set was divided into two groups, 149 proteins were used for the training set and 55 were used for the test set. The details about how this separation was done could be found in Marrero-Ponce et al.40 (see also section 3.1). The structures (pdb files) of the protein and the respective protein representations (pdbx files) could be found as SMII-1 and SMII-2.

Novel 3D algebraic MDs calculation and dimensionality reduction

The software MuLiMs-MCoMPAs (acronym for Multi-Linear Maps based on N-Metric & Contact Matrices of 3D-Protein and Amino-Acids Weightings) belonging to the ToMoCoMD-CAMPS suite (acronym for TOpological MOlecular COMputational Design-Computed-Aided Modelling in Protein Science) allows the computation of these novel protein descriptors. However, in order to reduce the number of MDs to evaluate, analysis of collinearity between indices and information redundancy were performed to obtain 10 suggested theoretical configurations (here designed as projects). The projects designed and used in the present study are shown in SMII-3. From these projects, a total of 20.263 MDs were generated on an HPC with the following computational characteristics: 16 cores Intel (R) Xeon (R) E5-2630 v3, 2.4 GHz of speed and 64 GB RAM using MuLiMs console version.

After the computation of the indices, additional dimensionality reduction procedures were performed. First, non-supervised and supervised procedures considering an information theoretic approach were employed for the reduction of the number of descriptors73,74. The software used for this purpose is known as IMMAN75. In addition to these reductions, a final supervised reduction was performed using subset filters which considered 2 search methods, Best First and Greedy Stepwise. The software used for this purpose was WEKA (version 3.8)76.

Development of the regression and classification models

The folding rate modelling was performed using the software MOBYDIGS77, that combines Multiple Linear Regression (MLR) with a wrapper method based on Genetic Algorithm (GA). The GA was set up with the following considerations: population size: 100; reproduction/mutation rate show starts on 0.5 but it is changed from 0 to 1 while doing the exploration; selection method started on 0.5, but it was changed to 1 and 0 to evaluate more selection options. Several experiments were performed for the construction of models that considered only trilinear indices and the combination between trilinear and bilinear indices.

From the chosen test set, based on the prediction error obtained for all models, four proteins were excluded from the test set (outliers). These outliers were: pdb1jo8, pdb1spr_A, pdb1t8j, pdb2vik.

The protein structural classification was performed by using the software WEKA76, that combines the Linear Discriminant Analysis (LDA) with a subset method that uses two searching strategies: Best First and Greedy Stepwise, as well as a wrapper method. Several experiments were carried out for the generation of mathematical models that considered only trilinear indices and the combination between trilinear and bilinear indices.

Assessment of the models

Depending on the modelling technique, several statistical parameters were selected for the resulting mathematical expressions validation. Regarding the case of MLR, the leave one out cross validation (Q2loo) was used as a fitness function. The models were assessed as well considering the Y-scrambling (a(Q2))78 validation method and the bootstrapping technique (Q2boot)79, to reduce the possibility of casual correlation between the selected MDs and for the assessment of the predictive power of the models.

Results and comparison with other approaches

The use of these novel biomacro-molecular descriptors for proteins as a main component for the generation of predictive mathematical models was proposed to evaluate the performance of these models against mathematical expressions generated using other MDs proposed in the literature. As a result, several models for the prediction of folding rate of proteins considering MLR as a modelling strategy and several models for the structural classification of proteins considering the SCOP dataset, using LDA as a modelling strategy, were obtained. The best ranked models and the comparison table are shown below.

Folding rate evaluation

This section presents the equations and statistical parameters for the best two models obtained for folding rate prediction considering only trilinear indices (Eqs 8 and 9) and the best two models obtained for folding rate prediction considering the combination of the trilinear and bilinear indices (Eqs 10 and 11). These equations are presented below:

$$\mathrm{ln}\,({\rm{k}})=0.0123\ast {\rm{A}}-0.0315\ast {\rm{B}}+{\rm{19.100}}\ast {\rm{C}}+7029.6\ast {\rm{D}}$$
(8)

where,

A = AVG_TS[7]_N1_Tr_M33(M3)_MP-8_o_RPU_KA_PAH-ISA-HWS_MCoMPAs

B = AVG_N3_TrQB_M55(M15)_SS-2_T_KA_PAH-ISA_MCoMPAs

C = AVG_Q1_TrC_M58(M15)_SS0_T_KA_PAH_MCoMPAs

D = AVG_GV[5]_MX_TrF_M41(M5)_MP7_o_T_KA_PBS_MCoMPAs

$$\mathrm{ln}\,({\rm{k}})=-0.0323\ast {\rm{A}}+20.3011\ast {\rm{B}}+7205.01\ast {\rm{C}}-1.7572$$
(9)

where,

A = AVG_N3_TrQB_M55(M15) _SS-2_T_KA_PAH-ISA_MCoMPAs

B = AVG_Q1_TrC_M58(M15) _SS0_T_KA_PAH_MCoMPAs

C = AVG_GV[5]_MX_TrF_M41(M5) _MP7_o_T_KA_PBS_MCoMPAs

$$\mathrm{ln}\,({\rm{k}})=-44766.6\ast {\rm{A}}-0.96157\ast {\rm{B}}+0.20729\ast {\rm{C}}-3.25903\ast {\rm{D}}+25.4265$$
(10)

where,

A = CB_Q2_B_M19_NS-3_T_LGP[ + 12.0]_LGL[4–11]_PAH-PBS_MCoMPAs

B = CB_K_Q_M5_NS-1_T_LGP[1-3]_KDS_MCoMPAs

C = CB_K_B_M2_SS-1_FBS_KA_MM-ECI_MCoMPAs

D = CB_MIC_N1_TrQB_M45(M8)_SS2_o_T_KA_PAH-Z3_MCoMPAs

$$\mathrm{ln}\,({\rm{k}})=-42920.3\ast {\rm{A}}+0.17709\ast {\rm{B}}-3.22386\ast {\rm{C}}+26.0880$$
(11)

where,

A = CB_Q2_B_M19_NS-3_T_LGP[ + 12.0]_LGL[4-11]_PAH-PBS_MCoMPAs

B = CB_K_B_M2_SS-1_FBS_KA_MM-ECI_MCoMPAs

C = CB_MIC_N1_TrQB_M41(M5)_SS2_o_T_KA_PAH-Z3_MCoMPAs

As can be observed from Table 3, the bootstrapping correlation coefficient Q2boot calculated for each model presents a value greater than 0.73, which indicates the robustness of the calibrated models against perturbations over the training set. Moreover, the best ranked model was obtained with the combination of trilinear and bilinear indices and its Q2 value is 0.797 (Eq. 11). In addition, the parameters derived from Y-scrambling tests [a(Q2)] have in all cases values around −0.137, indicating low propensity to random correlations in predictions. Folding rate depends on the tridimensional structure and specific contact sites along the structure. The correlation obtained between the studied property and the set of proteins indicates that there is an increased amount of information related to the proposed descriptors. Consequently, it could be observed that these proposed descriptors extract orthogonal and novel information complementary to the bilinear algebraic indices. Regarding the composition of the indices that conform the equations, it can be observed that the protein representations Cβ and AVG are present in all these models, indicating that these novel representations proposed extract more information that the Cα representation.

Table 3 Best models obtained for the folding rate prediction of 96 proteins using these novel molecular descriptors.

Furthermore, the similarity between the standard deviation (SDEP) values in training and test sets suggest that the obtained modes have a general applicability.

Regarding the statistical parameters obtained considering the external set of proteins (test set), the overall Q2ext is higher than 0.78 (explains more than the 78% of the total variance), which indicates the high predictive capability of the models respect to this property. Moreover, the model with the highest Q2ext is Eq. 10 with 0.86; this model was generated considering only trilinear indices. Based on the configuration of the descriptors used for the modelling, it could be observed that the mathematical tools such as operation aggregators (all the selected operators are different from the linear combination, which validates this theoretical statement), the normalization procedures (Simple stochastic and Mutual probability), steric physicochemical properties (PAH and PBS), and considering a protein mass center-based multi-metric and metric distance function calculation (which is a generalization that considers the whole protein structure), allowed a strong correlation between the indices and the response variable.

Concerning other MDs obtained to correlate the folding rate of proteins, it can be observed that the cross-validation correlation coefficient is the highest reported value for this application. Table 4 indicates all the values obtained for the training and test sets using the aforementioned descriptors. The values obtained in this study are superior to the value reported in the other reports.

Table 4 Comparison of the training and test set’s folding rate statistical parameters of several existing molecular descriptors for proteins against this approach.

Finally, all the best ranked models and its statistical parameters are indicated on SMIII-D.

Protein structural classification evaluation

The statistical values for the best four models obtained for SCOP protein structural classification are presented in Table 5; of which two of them are obtained with trilinear indices (Equations 12 and 13), whereas the other two are obtained with combinations of trilinear and bilinear indices (Equations 14 and 15).

Table 5 Best models obtained for the protein secondary structural classification of 204 proteins using these novel MDs.

As it can be observed from Table 5, the overall number of variables in all the best models presented is between 9 and 19, suggesting that these training models have an high accuracy and a relatively low amount of variables on the prediction of structural classes regarding the training set. The best models obtained on the training set were equations (14 and 15) with an Acc. value of 99.33. It is important to mention that these models were obtained using the combination of trilinear and bilinear indices. Since the structural classification of proteins considers the amount of secondary structures (alpha helixes and beta sheets) present on the structure, the trilinear indices extract structural information in a higher degree than bilinear indices alone based on the results obtained. This statement can be supported by the generalizations applied on the mathematical definition of the indices, that allow more and non-redundant information from the protein structure.

Regarding the composition of the indices that conform the equations, it can be observed that the protein representations Cβ, AVG and AB are present in all these models, indicating that these novel representations proposed extract more information that the Cα representation.

Evaluating the MCC values for the training set, it can be observed that the values for all models are above 0.88, which indicates that the models have low classification errors due false positives and false negatives.

Regarding the results obtained for the external prediction, it can be observed that all models have a correct classification percentage above 89.09%, which indicates a high prediction value using the model resulting from the training set. The model with the highest prediction value is equation (15) with an Acc. value of 98.18%. The MCC value for this model is 0.943 which indicates a very low number of false positives and false negatives on the prediction.

Based on the configuration of the used descriptors on the classification models generated, it is possible to observe that several mathematical tools such as different metrics used for the definition of the distance between two amino acids, the local descriptors, and the use of several aggregation operators, allow better information extraction for this property classification models.

Concerning other descriptors generated to predict the secondary structural classification, the comparison between the reported statistical parameters used to evaluate the classification models using those descriptors and our models, it can be observed that the models proposed in this study have a higher classification percentage for the training and test sets (Table 6). All the best ranked models and its statistical parameters are indicated on SMIII-E.

Table 6 Comparison of the training set’s protein structural classification correct classification percentage of several existing molecular descriptors against this approach.

Conclusion and Future Research

The definition of a new type of 3D MDs based on N-linear algebraic forms allowed the codification of geometrical and topological information regarding relationships between three amino acids on a protein by the evaluation and comparison of the selected statistical parameters obtained for two representative applications in protein science (folding rate and secondary structural classification). Consequently, these MDs constitute an alternative for the generation of proteins physicochemical properties’ and function predictive models.

Two new (AB and AVG) and two commonly used (Cα and Cβ) computing protein representations were evaluated for protein geometrical information extraction. Based on the results obtained from this study, it was observed that the higher information extraction was obtained when the proposed protein descriptors considered the beta carbon (Cβ) and the pseudo amino acid (AVG) representations.

As future research, we suggest using spherical truncating methods and generalized aggregation operators as another generalization strategy for the generation of these novel MDs. These mathematical tools could improve the information extraction from the proteins’ graphical representations.

Moreover, we suggest the evaluation of these novel biomacro-molecular descriptors for proteins in multi-reference studies (several representative protein science applications), that consider several benchmark data sets, to identify for what types of applications, these novel indices could perform better than the previous proposed approaches and how much orthogonal information can these molecular descriptors can obtain.

As pointed out in K.C. Chou’s review80 and demonstrated in a series of recent publications (see, e.g.50,51,81) user-friendly and publicly accessible web-servers represent the future direction for developing useful prediction methods and computational tools. Many webservers have significantly increased the impacts of bioinformatics on medical science82, driving medicinal chemistry into an unprecedented revolution83, we shall make efforts in our future work to provide a webserver for the topic presented in this paper.