Tensor Algebra-based Geometrical (3D) Biomacro-Molecular Descriptors for Protein Research: Theory, Applications and Comparison with other Methods

In this report, a new type of tridimensional (3D) biomacro-molecular descriptors for proteins is proposed. These descriptors make use of multi-linear algebra concepts based on the application of 3-linear forms (i.e., Canonical Trilinear (Tr), Trilinear Cubic (TrC), Trilinear-Quadratic-Bilinear (TrQB) and so on) as a specific case of the N-linear algebraic forms. The kth 3-tuple similarity-dissimilarity spatial matrices (tensor form) are used to transform and represent the chemical information contained in the relationships between three amino acids of a protein. Several metrics (Minkowski-type, wave-edge, etc.) and multi-metrics (triangle area, bond angle, etc.) are proposed for extracting the interaction information, as well as probabilistic transformations (e.g., simple stochastic and mutual probability) to achieve matrix normalization. A generalized procedure is proposed in which amino acid-level indices are fused by means of aggregation operators to calculate the descriptors. The results show that the proposed 3D biomacro-molecular indices perform better than other approaches in SCOP-based discrimination and in the prediction of protein folding rates by using simple linear parametric models. It can be concluded that the proposed method allows the definition of 3D biomacro-molecular descriptors that contain orthogonal information capable of providing better models for applications in protein science.

It is well accepted that geometrical representations of chemical structures contain not only descriptive information but also insights into the native configuration of the represented molecules. In the case of proteins, it has been observed that their tridimensional (3D) structure provides information about their function in living organisms 1. Graphic approaches to the study of biological and medical systems can provide an intuitive view and useful insights that help to analyze the complicated relations therein, as indicated by many previous studies on a series of important biological topics (particularly enzyme kinetics 2-5, protein folding rates 6-9, and low-frequency internal motion 10,11).
Definitions for the total and amino acid level 3D protein descriptors based on three-linear forms. The definition of any $k$th three-linear biomacro-molecular descriptor for a protein must consider a canonical basis set and the application of N-linear forms (maps) in an $\mathbb{R}^{n}$ space. In terms of the tensor elements, this trilinear form can be written as follows:

$$L_{tr}^{k} = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{l=1}^{n} z_{ijl}^{k}\, x_{i}\, y_{j}\, p_{l}$$

where $L_{tr}^{k}$ is the resulting trilinear form MD, $n$ is the number of amino acids (aa) present in the protein, and $x_1,\dots,x_n$, $y_1,\dots,y_n$ and $p_1,\dots,p_n$ are the elements of the macro-molecular vectors, which hold the physicochemical properties of every aa present in the protein structure 58,59. A table listing all the physicochemical properties considered in this study is available in Supplementary Material SMI-A. The $k$th total three-tuple-(dis)similarity matrix (T-TDSM) is a third-order tensor whose elements $z_{ijl}^{k}$ are calculated by using relationships (multi-metrics) between three aa. These relationships will be discussed in Section 2.4.
Based on the physicochemical nature of the properties used to build the macro-molecular vectors, the following algebraic forms can be defined: (1) Trilinear canonical (when all macro-molecular vectors are configured differently, that is, using 3 different aa properties) (see Fig. 1), (2) Trilinear linear (when 2 of the macro-molecular vectors are the identity vector and the other one is an aa property), (3) Trilinear bilinear (when 2 macro-molecular vectors have the same configuration, that is, they use the same aa property, and the other one is the identity vector), (4) Trilinear quadratic bilinear (when 2 macro-molecular vectors have the same configuration and the other one uses a different aa property), and (5) Trilinear cubic (when all the macro-molecular vectors have the same configuration, i.e., use the same aa property).
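To make the five configurations concrete, the following minimal Python sketch evaluates a trilinear form as a triple sum over a third-order tensor and three vectors. It is an illustration only and not the MuLiMs-MCoMPAs implementation: the function name `trilinear_form`, the property values and the all-ones tensor are hypothetical, and the "identity vector" is assumed here to be a vector of ones.

```python
def trilinear_form(z, x, y, p):
    """Evaluate the sum over i, j, l of z[i][j][l] * x[i] * y[j] * p[l]."""
    n = len(x)
    return sum(
        z[i][j][l] * x[i] * y[j] * p[l]
        for i in range(n)
        for j in range(n)
        for l in range(n)
    )


# Hypothetical 3-residue example: a tensor of ones and made-up property values.
n = 3
z = [[[1.0] * n for _ in range(n)] for _ in range(n)]
prop_a = [1.0, 2.0, 3.0]  # hypothetical aa property A
prop_b = [0.5, 1.5, 2.0]  # hypothetical aa property B
prop_c = [2.0, 0.0, 1.0]  # hypothetical aa property C
ones = [1.0] * n          # ASSUMPTION: the "identity vector" is all ones

# The five algebraic forms differ only in how the three vectors are chosen.
forms = {
    "canonical": trilinear_form(z, prop_a, prop_b, prop_c),
    "linear": trilinear_form(z, ones, ones, prop_a),
    "bilinear": trilinear_form(z, prop_a, prop_a, ones),
    "quadratic_bilinear": trilinear_form(z, prop_a, prop_a, prop_b),
    "cubic": trilinear_form(z, prop_a, prop_a, prop_a),
}
```

With a tensor of ones, each form reduces to the product of the three vector sums, which gives a quick sanity check on the implementation.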
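Moreover, the aa-based $k$th three-linear MDs for every aa in the protein are defined as shown in Eq. (3):

$$L_{tr}^{aa} = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{l=1}^{n} z_{ijl}^{aa,k}\, x_{i}\, y_{j}\, p_{l} \qquad (3)$$

where $x_1,\dots,x_n$, $y_1,\dots,y_n$ and $p_1,\dots,p_n$ are the components of the macro-molecular vectors.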
The $k$th amino acid-level three-tuple-(dis)similarity matrices (A-TDSM), with elements $z_{ijl}^{aa,k}$, are computed from the elements of the T-TDSM according to a set of rules. Consequently, if a protein contains "B" aa in its structure, the T-TDSM can be expressed as the sum of "B" aa-level matrices (see Fig. 2). From this concept, after the application of algebraic maps to every A-TDSM, we obtain "B" aa-level indices, denoted as $L_{tr}^{aa}$ (see Eq. (3)), which are stored in an array (see Fig. 3). This array is designated as LAI (Local Amino Acidic Invariant), by analogy with the LOVI (Local Vertex Invariant) vector for organic molecules 60,61. From the LAI vector, the total (whole-protein) three-linear indices can be calculated by using aggregation operators (a generalization of the concept of merging components) 62. These aggregation operators will be discussed in Section 2.3. The general calculation scheme for these novel biomacro-molecular indices is shown in Fig. 3.
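The additivity described above (the T-TDSM as the sum of "B" aa-level matrices, and the LAI entries as aa-level indices) can be sketched in Python as follows. The assignment rules for the A-TDSM are not reproduced in this text, so the sketch assumes an equal-share rule purely for illustration: every tensor element is split evenly among its three index positions. The names `aa_level_tensors` and `lai`, as well as the numerical values, are hypothetical.

```python
def trilinear_form(z, x, y, p):
    """Evaluate the sum over i, j, l of z[i][j][l] * x[i] * y[j] * p[l]."""
    n = len(x)
    return sum(
        z[i][j][l] * x[i] * y[j] * p[l]
        for i in range(n)
        for j in range(n)
        for l in range(n)
    )


def aa_level_tensors(z):
    """Split z into n tensors, one per amino acid, that add up to z.

    ASSUMPTION (not taken from the paper): each element z[i][j][l] is shared
    equally among its three index positions, so amino acid a receives
    z[i][j][l] / 3 once for every index in (i, j, l) that equals a.
    """
    n = len(z)
    parts = [[[[0.0] * n for _ in range(n)] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for l in range(n):
                for a in (i, j, l):
                    parts[a][i][j][l] += z[i][j][l] / 3.0
    return parts


# Hypothetical 3-residue tensor and property vectors.
n = 3
z = [[[float(i + 2 * j + 3 * l) for l in range(n)] for j in range(n)] for i in range(n)]
x = [1.0, 2.0, 3.0]
y = [0.5, 1.5, 2.0]
p = [2.0, 0.0, 1.0]

# LAI: one aa-level index per residue; their sum recovers the total index.
lai = [trilinear_form(part, x, y, p) for part in aa_level_tensors(z)]
total_index = trilinear_form(z, x, y, p)
```

Because the trilinear form is linear in the tensor, any rule that partitions the T-TDSM exactly yields aa-level indices whose sum equals the total index; other aggregation operators then replace this plain sum.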
Definition for the group-based 3D protein MDs considering three-linear forms. If we consider clusters of aa classified in terms of their activity/properties in solution or their propensity to form a certain secondary structure (see Table 1), group-based indices can be computed by selecting the corresponding aa-based indices stored in the LAI. Consequently, a new vector, called the Local Group-based Amino Acidic Invariant (LAI_G), is generated. By applying aggregation operators to this vector, a new type of general indices based on aa groups can be generated. This operation makes it possible to evaluate the influence of certain aa in a variety of applications in protein science.
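As a rough illustration of how total and group-based indices could be derived from the LAI vector, the Python sketch below applies a few generic aggregation operators to the full vector and to a group-restricted sub-vector. The sequence, the LAI values and the "hydrophobic" group are hypothetical (Table 1 is not reproduced here), and the four operators shown are examples only; the operators actually proposed are those listed in SMI-B.

```python
import math
import statistics


def aggregate(values):
    """Apply a few illustrative aggregation operators to a list of indices."""
    return {
        "n1": sum(abs(v) for v in values),              # Manhattan norm
        "n2": math.sqrt(sum(v * v for v in values)),    # Euclidean norm
        "mean": statistics.fmean(values),               # arithmetic mean
        "stdev": statistics.pstdev(values),             # population std. dev.
    }


sequence = "AKVL"                 # hypothetical 4-residue protein
lai = [3.0, -4.0, 1.0, 2.0]       # hypothetical aa-level indices (one per residue)
hydrophobic = {"A", "V", "L"}     # hypothetical aa group

# LAI_G keeps only the aa-level indices of residues that belong to the group.
lai_g = [value for aa, value in zip(sequence, lai) if aa in hydrophobic]

total_indices = aggregate(lai)
group_indices = aggregate(lai_g)
```

Each operator produces a different whole-protein (or group-level) descriptor from the same LAI vector, which is the sense in which non-additive operators may carry complementary information.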

Generation of novel protein MDs from amino acid-based indices using aggregation operators.
An invariant can be defined as a generalization procedure for merging different components to obtain one fused expression. We propose this tool as an alternative for the generation of MDs on the hypothesis that the most appropriate global definition of a natural system may not necessarily be additive. As a proof of concept, the work by Barigye et al. 62 demonstrated that operators other than the sum can yield better correlations with certain chemical properties. These invariants (aggregation operators) are classified into four major groups. They are applied to the LAI vector, which contains the aa-based indices, as a strategy to obtain a series of global (or local: aa-based or group-based) indices that could contain orthogonal information with respect to the use of the metric invariant N1. A table with the formulae of all the proposed aggregation operators is given in SMI-B.

www.nature.com/scientificreports

Definition of the three-tuple-(Dis) similarity matrix (TDSM) for physicochemical information extraction. Macro-molecular graphs allow the study of chemical interactions in biological systems in order to obtain more information on the behavior seen in experimental observations 63,64; geometric (3D) representations of a protein indicate the distribution of its constituent amino acids in space. It is important to mention that the stability and maintenance of this complex structure rely on the inter-residue interactions 65. In this graphical approach, the aa of the protein can be considered as pseudo-vertices, which possess spatial coordinates defined by a chosen atom representation. The alpha carbon (Cα) has been the most widely used representation in protein geometrical/topological studies 12,15,64,66; however, there have been studies where the beta carbon (Cβ) was considered as a simple atom (pseudo-node)-based representation 67.
In this report, we propose two additional representations, the Amide Carbon (AB) and the average of the coordinates of all atoms in the amino acid (AVG), in order to assess the behavior and information content of these representations with respect to the existing ones. Furthermore, all interactions and bonds between these pseudo-vertices are considered as connections between them. Here, all these interactions between amino acids are computed by considering relationships (multi-metrics) among three aa, $z_{ijl}^{k}$. Therefore, three-tuple spatial-(dis)similarity matrices are generated as a representation of the bio-macro-molecular structure.
Table 2 (fragment). Ternary measures ($T_{XYZ}$), including agreement coefficient-based measures and the bond angle (angle between sides; M39-M40), computed from the coordinates of three amino acids of a protein.

The formal definition of the elements $z_{ijl}^{k}$ of the TDSM is given in Eq. (5) (see Fig. 4), where $TT_{ijl}$ is a measure for ternary relations of amino acids (multi-metric) and $D_{ijl}$ is a measure for duplex relations of amino acids (a metric between 2 amino acids). From Eq. (5) we can observe that, when the aa i, j and l of the protein are all different, the measure used for the calculation is ternary. The ternary measures used for the computation of the indices are indicated in Table 2. However, when a multi-metric cannot be computed (two aa are the same), it is reduced to a lower-order measure (duplex relation). The duplex measures used for the computation are indicated in SMI-C. It is important to remark that, when a ternary measure is selected to codify the information of the protein, it is mandatory to select at least one duplex measure or metric. Nevertheless, the selection of a metric is not mandatory when the ternary measures are related to the Volume, Bond Angle and Dihedral Angle measures (see Fig. 5).
There are two possibilities for applying multi-metrics or metrics to the protein structure: amino acid-based or protein mass center-based. In the first option, the multi-metric is calculated from the distance functions between amino acids; consequently, the elements $z_{ijl}$ of the T-TDSM with i = j = l are zero. In the second option, the multi-metric is calculated from the selected metric between each amino acid and the mass center of the protein, and all elements $z_{ijl}$ of the T-TDSM are different from zero; this approach may offer better discrimination among protein spatial structures, given that it provides information about the centrality of the aa residues.
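The amino acid-based option can be illustrated with a small Python sketch that builds an n × n × n tensor using the triangle perimeter as the ternary measure and the Euclidean distance as the duplex measure. This is an assumption-laden illustration rather than the software's algorithm: the function name `tdsm_triangle_perimeter`, the `complete` flag and the coordinates are hypothetical, and the mass center-based option is not covered.

```python
import math


def tdsm_triangle_perimeter(coords, complete=True):
    """Build an n x n x n tensor from pseudo-vertex coordinates.

    - three distinct indices: triangle perimeter (ternary measure)
    - two distinct indices: Euclidean distance (duplex measure), used only
      when complete=True; otherwise the element is left at zero
    - i == j == l: zero (amino acid-based option)
    """
    n = len(coords)
    z = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for l in range(n):
                distinct = sorted({i, j, l})
                if len(distinct) == 3:
                    a, b, c = (coords[m] for m in distinct)
                    z[i][j][l] = math.dist(a, b) + math.dist(b, c) + math.dist(a, c)
                elif len(distinct) == 2 and complete:
                    a, b = (coords[m] for m in distinct)
                    z[i][j][l] = math.dist(a, b)
    return z


# Hypothetical coordinates of three pseudo-vertices (a 3-4-5 right triangle).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
z_complete = tdsm_triangle_perimeter(coords, complete=True)
z_non_complete = tdsm_triangle_perimeter(coords, complete=False)
```

For these coordinates every element with three distinct indices equals the perimeter 12, elements with two distinct indices hold the pairwise distance in the complete case, and the diagonal stays at zero.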
The $k$th three-tuple-(dis)similarity matrix is obtained by performing a Hadamard matrix product 12, that is, by raising every element of the three-tuple-(dis)similarity matrix to the power k. The exponent k is a real number that can be positive or negative; when k is negative, the reciprocal operation is computed. This operation aims to extract the information associated with the intra-molecular forces that occur in the protein structure due to the residues present in every aa. The values of k evaluated can range from −12 to 12; e.g., k = −1 is related to the gravitational potential and k = −2 is related to the Coulomb potential (see Fig. 6 for more details).
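The element-wise power can be sketched in a few lines of Python. The function name `hadamard_power` and the sample tensor are hypothetical, and leaving zero entries at zero for negative exponents is an assumption made here to avoid division by zero; the text does not state how such entries are handled.

```python
def hadamard_power(z, k):
    """Raise every element of a third-order tensor to the power k.

    ASSUMPTION: zero entries are kept at zero so that negative exponents
    (reciprocals) do not divide by zero.
    """
    return [
        [[value ** k if value != 0 else 0.0 for value in row] for row in plane]
        for plane in z
    ]


# Hypothetical 2 x 2 x 2 tensor of non-negative (dis)similarity values.
z = [
    [[0.0, 2.0], [2.0, 4.0]],
    [[2.0, 4.0], [4.0, 0.0]],
]

z_inverse = hadamard_power(z, -1)   # k = -1: element-wise reciprocal
z_squared = hadamard_power(z, 2)    # k = 2: element-wise square
```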
When normalization procedures (see Section 2.6 below) are not applied to the elements of the TDSM, these matrices are designated as the $k$th non-stochastic three-tuple-(dis)similarity matrices (NS-T-TDSM).

Probabilistic transformations of the TDSM. Although normalization methods are not usually applied to geometrical matrices, several descriptors use this concept for organic molecules, RNA secondary structures, protein sequences and viral surfaces 68-72. Normalized matrices offer advantages such as the standardization of information, and they serve as a tool for the computation of different $k$th three-linear MDs 25.
Since probabilistic transformations have only been applied to two-tuple matrices, a generalization of these concepts is used to normalize the $k$th non-stochastic three-tuple-(dis)similarity matrices obtained from the computation described above. In this study, two probability schemes can be applied: a) simple stochastic and b) mutual probability transformations.
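The $k$th simple-stochastic three-tuple-(dis)similarity matrices (SS-T-TDSM) and the $k$th mutual probability three-tuple-(dis)similarity matrices (MP-T-TDSM), both obtained from the NS-T-TDSM, are defined as follows:

$${}^{ss}z_{ijl}^{k} = \frac{{}^{ns}z_{ijl}^{k}}{S_{jl}}, \qquad S_{jl} = \sum_{j'=1}^{n}\sum_{l'=1}^{n} {}^{ns}z_{ij'l'}^{k} \qquad (6)$$

$${}^{mp}z_{ijl}^{k} = \frac{{}^{ns}z_{ijl}^{k}}{S_{ijl}}, \qquad S_{ijl} = \sum_{i'=1}^{n}\sum_{j'=1}^{n}\sum_{l'=1}^{n} {}^{ns}z_{i'j'l'}^{k} \qquad (7)$$

where ${}^{ns}z_{ijl}^{k}$ are the elements of the $k$th non-stochastic three-tuple-(dis)similarity matrix. For the simple stochastic case, $S_{jl}$ is the summation of all entries of the two-tuple matrix corresponding to each aa i in the three-tuple matrix, whereas for the mutual probability scheme, $S_{ijl}$ is the summation of all elements of the non-stochastic tensor (see Fig. 7).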
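The two normalization schemes can be sketched in Python directly from the verbal definitions of S_jl and S_ijl given above. The function names and the sample tensor are hypothetical, and returning zeros when a normalizing sum is zero is an assumption added for robustness.

```python
def simple_stochastic(z):
    """Divide each two-tuple slice z[i] by the sum of its own entries."""
    result = []
    for plane in z:
        slice_sum = sum(sum(row) for row in plane)
        # ASSUMPTION: a slice whose entries sum to zero is left as zeros.
        result.append(
            [[value / slice_sum if slice_sum else 0.0 for value in row] for row in plane]
        )
    return result


def mutual_probability(z):
    """Divide every element by the sum of all entries of the tensor."""
    total = sum(sum(sum(row) for row in plane) for plane in z)
    return [
        [[value / total if total else 0.0 for value in row] for row in plane]
        for plane in z
    ]


# Hypothetical 2 x 2 x 2 non-stochastic tensor.
z_ns = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]

z_ss = simple_stochastic(z_ns)
z_mp = mutual_probability(z_ns)

# Sums used to check the normalization: one per slice, and one for the tensor.
ss_slice_sums = [sum(sum(row) for row in plane) for plane in z_ss]
mp_total = sum(sum(sum(row) for row in plane) for plane in z_mp)
```

After the simple stochastic transformation each slice sums to one, whereas after the mutual probability transformation the whole tensor sums to one.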

Computational calculation of the new proposed protein MDs.
These novel 3D algebraic MDs can be generated by using the in-house software MuLiMs-MCoMPAs (part of the ToMoCoMD-CAMPS system), an open-access Java-based program. The software allows the user to evaluate all the theoretical configurations presented above and is available at http://www.tomocomd.com/; it runs on all available operating systems and comes in two versions, a graphical user interface (GUI) version and a console version for calculations on high-performance computing (HPC) systems.
Table 2 (footnote fragment). ... (XY) in the protein, respectively; $d_M$ is the Mahalanobis distance; n is the dimension (3); k is the number of combinations (i, j) with i < j [(1, 2), (1, 3) and (2, 3)]; $\bar{U}$ is the arithmetic mean of the variable U. The values of the subscript "i" (1, 2, 3) stand for the atoms (X, Y, Z), respectively (e.g., for the combination (1, 2), $U_1$ and $U_2$ represent the atoms X and Y); $r_{XY}$ is the Pearson correlation between variables X and Y; and $p_{XY}$ is the topological distance between the amino acids containing atoms X and Y.

MuLiMs-MCoMPAs (acronym for Multi-Linear Maps based on N-Metric & Contact Matrices of 3D-Protein and Amino-Acids Weightings), which belongs to the ToMoCoMD-CAMPS suite (acronym for TOpological MOlecular COMputational Design-Computed-Aided Modelling in Protein Science), allows the computation of these novel protein descriptors. However, in order to reduce the number of MDs to evaluate, analyses of collinearity between indices and of information redundancy were performed to obtain 10 suggested theoretical configurations (designated here as projects). The projects designed and used in the present study are shown in SMII-3. From these projects, a total of 20,263 MDs were generated with the MuLiMs console version on an HPC system with the following characteristics: 16 cores, Intel(R) Xeon(R) E5-2630 v3 at 2.4 GHz, and 64 GB of RAM.
After the computation of the indices, additional dimensionality reduction procedures were performed. First, non-supervised and supervised procedures considering an information theoretic approach were employed for the reduction of the number of descriptors 73,74 . The software used for this purpose is known as IMMAN 75 . In addition to these reductions, a final supervised reduction was performed using subset filters which considered 2 search methods, Best First and Greedy Stepwise. The software used for this purpose was WEKA (version 3.8) 76 .

Development of the regression and classification models.
The folding rate modelling was performed using the software MOBYDIGS 77, which combines Multiple Linear Regression (MLR) with a wrapper method based on a Genetic Algorithm (GA). The GA was set up as follows: population size, 100; reproduction/mutation rate, initially 0.5 and then varied between 0 and 1 during the exploration; selection method, initially 0.5 and then set to 1 and 0 to evaluate more selection options. Several experiments were performed to build models that considered only trilinear indices as well as the combination of trilinear and bilinear indices.
Based on the prediction errors obtained for all models, four proteins were excluded from the chosen test set as outliers: pdb1jo8, pdb1spr_A, pdb1t8j and pdb2vik.
The protein structural classification was performed by using the software WEKA 76 , that combines the Linear Discriminant Analysis (LDA) with a subset method that uses two searching strategies: Best First and Greedy Stepwise, as well as a wrapper method. Several experiments were carried out for the generation of mathematical models that considered only trilinear indices and the combination between trilinear and bilinear indices.
Figure caption (fragment). ... (see Table 2). The obtained tensor has n × n × n dimensions, where n is the number of amino acids in the protein.

Assessment of the models. Depending on the modelling technique, several statistical parameters were selected to validate the resulting mathematical expressions. In the case of MLR, the leave-one-out cross-validation coefficient ($Q^{2}_{loo}$) was used as the fitness function. The models were also assessed by Y-scrambling 79, to reduce the possibility of chance correlation between the selected MDs and the response and to assess the predictive power of the models.
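The two validation ideas named above, leave-one-out cross-validation and Y-scrambling, can be sketched in Python for the simplest case of a single descriptor. This is only an illustration and not the MOBYDIGS procedure: the data are hypothetical, the regression is univariate, Q² is computed as 1 − PRESS/SS_tot with the overall mean of the response, and Y-scrambling is summarized by the mean Q² over a fixed number of random shuffles.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def q2_loo(xs, ys):
    """Leave-one-out Q2: refit without each point and predict that point."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot


def y_scrambling(xs, ys, n_rounds=20, seed=0):
    """Mean Q2_loo after randomly shuffling the response n_rounds times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        scores.append(q2_loo(xs, shuffled))
    return sum(scores) / n_rounds


# Hypothetical descriptor values and responses (roughly y = 2x + 1).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
response = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2]

q2 = q2_loo(descriptor, response)
q2_scrambled = y_scrambling(descriptor, response)
```

A model that captures a real relationship should keep a high Q² on the original response and lose it once the response is scrambled; a scrambled Q² close to the original would point to chance correlation.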
Results and comparison with other approaches. The use of these novel biomacro-molecular descriptors for proteins as the main component for the generation of predictive mathematical models was proposed in order to evaluate the performance of these models against mathematical expressions generated using other MDs proposed in the literature. As a result, several models for the prediction of the folding rate of proteins, with MLR as the modelling strategy, and several models for the structural classification of proteins on the SCOP dataset, with LDA as the modelling strategy, were obtained. The best ranked models and the comparison table are shown below.

Figure 5. Selection of multi-metrics or metrics for the definition of the Three-Tuple-(Dis) Similarity Matrix (TDSM) on the truncated peptide 5WRX by using the AB representation. A multi-metric is considered (a) Complete when it considers not only the relationships between 3 amino acids (multi-metrics, here Triangle Perimeter), but also relationships between 2 amino acids (metrics, here Euclidean Distance). A multi-metric is considered (b) Non-Complete when it considers only the relationships between 3 amino acids (relationships between 2 amino acids are defined as zero in the TDSM). Moreover, the diagonal of the tensor (formed by all the tensor elements where i = j = l) has zero values if the measure is applied considering every aa as a reference, or non-zero values if the measure is applied considering the center of mass of the protein.

Folding rate evaluation. This section presents the equations and statistical parameters for the best two models obtained for folding rate prediction considering only trilinear indices (Eqs 8 and 9) and the best two models obtained for folding rate prediction considering the combination of the trilinear and bilinear indices (Eqs 10 and 11).
As can be observed from Table 3, the bootstrapping correlation coefficient $Q^{2}_{boot}$ calculated for each model is greater than 0.73, which indicates the robustness of the calibrated models against perturbations of the training set. Moreover, the best ranked model was obtained with the combination of trilinear and bilinear indices, and its $Q^{2}$ value is 0.797 (Eq. 11). In addition, the parameters derived from the Y-scrambling tests [$a(Q^{2})$] have values around −0.137 in all cases, indicating a low propensity to chance correlations in the predictions. The folding rate depends on the tridimensional structure and on specific contact sites along the structure. The correlation obtained between the studied property and the set of proteins indicates that there is an increased amount of information related to the proposed descriptors. Consequently, it can be observed that the proposed descriptors extract orthogonal and novel information that is complementary to the bilinear algebraic indices. Regarding the composition of the indices that make up the equations, it can be observed that the protein representations Cβ and AVG are present in all these models, indicating that these representations extract more information than the Cα representation.
Furthermore, the similarity between the standard deviation (SDEP) values in the training and test sets suggests that the obtained models have general applicability.
Regarding the statistical parameters obtained for the external set of proteins (test set), the overall $Q^{2}_{ext}$ is higher than 0.78 (it explains more than 78% of the total variance), which indicates the high predictive capability of the models with respect to this property. Moreover, the model with the highest $Q^{2}_{ext}$ is Eq. 10, with 0.86; this model was generated considering only trilinear indices. Based on the configuration of the descriptors used for the modelling, it can be observed that several mathematical tools allowed a strong correlation between the indices and the response variable: the aggregation operators (all the selected operators are different from the linear combination, which supports this theoretical statement), the normalization procedures (simple stochastic and mutual probability), the steric physicochemical properties (PAH and PBS), and the protein mass center-based calculation of the multi-metric and metric distance functions (a generalization that considers the whole protein structure).
Concerning other MDs used to correlate the folding rate of proteins, it can be observed that the cross-validation correlation coefficient obtained here is the highest value reported for this application. Table 4 lists the values obtained for the training and test sets using the aforementioned descriptors. The values obtained in this study are superior to those reported in the other studies.
Finally, all the best ranked models and their statistical parameters are given in SMIII-D.
Protein structural classification evaluation. The statistical values for the best four models obtained for SCOP protein structural classification are presented in Table 5. Two of them were obtained with trilinear indices (Equations 12 and 13), whereas the other two were obtained with combinations of trilinear and bilinear indices (Equations 14 and 15).
As can be observed from Table 5, the number of variables in the best models presented is between 9 and 19, suggesting that these models achieve a high accuracy with a relatively low number of variables in the prediction of structural classes for the training set. The best models obtained on the training set were Equations 14 and 15, with an Acc. value of 99.33. It is important to mention that these models were obtained using the combination of trilinear and bilinear indices. Since the structural classification of proteins considers the amount of secondary structures (alpha helices and beta sheets) present in the structure, the results suggest that the trilinear indices extract structural information to a higher degree than bilinear indices alone. This statement is supported by the generalizations applied in the mathematical definition of the indices, which allow more, and non-redundant, information to be extracted from the protein structure.

Regarding the composition of the indices that make up the equations, it can be observed that the protein representations Cβ, AVG and AB are present in all these models, indicating that these representations extract more information than the Cα representation.
Evaluating the MCC values for the training set, it can be observed that the values for all models are above 0.88, which indicates that the models have low classification errors due to false positives and false negatives.
Regarding the results obtained for the external prediction, it can be observed that all models have a correct classification percentage above 89.09%, which indicates a high prediction value using the model resulting from the training set. The model with the highest prediction value is equation (15) with an Acc. value of 98.18%. The MCC value for this model is 0.943 which indicates a very low number of false positives and false negatives on the prediction.
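For reference, the Matthews correlation coefficient (MCC) used above can be computed from the entries of a confusion matrix. The Python sketch below covers the binary case only, with hypothetical counts that are not taken from this study; the function name `matthews_corrcoef_binary` is likewise an illustrative choice, and returning zero when the denominator vanishes is a common convention assumed here.

```python
import math


def matthews_corrcoef_binary(tp, tn, fp, fn):
    """MCC for a binary confusion matrix (true/false positives/negatives)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # ASSUMPTION: by convention MCC is set to 0 when a margin is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical confusion-matrix counts.
mcc_perfect = matthews_corrcoef_binary(tp=10, tn=10, fp=0, fn=0)
mcc_random = matthews_corrcoef_binary(tp=5, tn=5, fp=5, fn=5)
mcc_example = matthews_corrcoef_binary(tp=50, tn=45, fp=5, fn=0)
```

An MCC of 1 corresponds to a classification without false positives or false negatives, and values near 0 correspond to chance-level agreement, which is why values above 0.88 are read as few misclassifications of either kind.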
Based on the configuration of the descriptors used in the classification models generated, it can be observed that several mathematical tools, such as the different metrics used to define the distance between two amino acids, the local descriptors, and the use of several aggregation operators, allow better information extraction for the classification models of this property.
Concerning other descriptors generated to predict the secondary structural classification, a comparison of the reported statistical parameters of the classification models built with those descriptors against our models shows that the models proposed in this study have a higher correct classification percentage for the training and test sets (Table 6). All the best ranked models and their statistical parameters are given in SMIII-E.

Conclusion and Future Research
The definition of a new type of 3D MDs based on N-linear algebraic forms allowed the codification of geometrical and topological information regarding relationships between three amino acids in a protein, as shown by the evaluation and comparison of the selected statistical parameters obtained for two representative applications in protein science (folding rate and secondary structural classification). Consequently, these MDs constitute an alternative for the generation of predictive models of protein physicochemical properties and function.
Two new (AB and AVG) and two commonly used (Cα and Cβ) protein representations were evaluated for the extraction of protein geometrical information. Based on the results obtained in this study, the greatest information extraction was obtained when the proposed protein descriptors considered the beta carbon (Cβ) and the pseudo amino acid (AVG) representations.
As future research, we suggest using spherical truncating methods and generalized aggregation operators as another generalization strategy for the generation of these novel MDs. These mathematical tools could improve the information extraction from the proteins' graphical representations.

Table 6. Comparison of the training set's protein structural classification correct classification percentage of several existing molecular descriptors against this approach.