# Tensor Algebra-based Geometrical (3D) Biomacro-Molecular Descriptors for Protein Research: Theory, Applications and Comparison with other Methods

## Abstract

In this report, a new type of tridimensional (3D) biomacro-molecular descriptors for proteins is proposed. These descriptors make use of multi-linear algebra concepts based on the application of 3-linear forms (i.e., Canonical Trilinear (Tr), Trilinear Cubic (TrC), Trilinear-Quadratic-Bilinear (TrQB) and so on) as a specific case of the N-linear algebraic forms. The kth 3-tuple similarity-dissimilarity spatial matrices (tensor form) are used to transform and represent the chemical information contained in the relationships between three amino acids of a protein. Several metrics (Minkowski-type, wave-edge, etc.) and multi-metrics (triangle area, bond angle, etc.) are proposed for extracting interaction information, as well as probabilistic transformations (e.g., simple stochastic and mutual probability) to achieve matrix normalization. A generalized procedure is proposed in which amino acid-level indices are fused by means of aggregation operators to calculate the descriptors. The results demonstrate that the proposed 3D biomacro-molecular indices perform better than other approaches in SCOP-based discrimination and in the prediction of protein folding rates using simple linear parametric models. It can be concluded that the proposed method allows the definition of 3D biomacro-molecular descriptors that contain orthogonal information capable of providing better models for applications in protein science.

## Introduction

It is well accepted that geometrical representations of chemical structures contain not only descriptive information but also insights into the native configuration of the represented molecules. In the case of proteins, it has been observed that their tridimensional (3D) structure provides information about their function in living organisms1. Using graphic approaches to study biological and medical systems can provide an intuitive vision and useful insights that help to analyze the complicated relations therein, as indicated by many previous studies on a series of important biological topics (particularly enzyme kinetics2,3,4,5, protein folding rates6,7,8,9, and low-frequency internal motion10,11).

Thus, the use of 3D molecular descriptors (MDs) can be considered as an approach for inferring information about structural properties and their related quantities. A good number of prediction models that link 3D chemical structures with activity or properties (QSAR/QSPR) have been generated from 3D-MDs, which have been extensively used for the characterization of organic molecules and small chemical systems12. However, in the case of proteins only a few biomacro-molecular indices have been proposed for sequence codification and spatial information extraction13,14,15. This indicates that MD-based approaches have not been fully exploited in protein science, and that the field is open to further theoretical development.

The modelling of physicochemical properties and biological interactions of proteins requires the extraction of information regarding the sequence, the spatial configuration and the chemical characteristics of every amino acid present in the structure12,16,17,18. Thus, it is important to generate new 3D-MDs for proteins that consider all these features of 3D structures and that provide new, non-redundant information and a more complete characterization of them.

Marrero-Ponce et al. introduced a new set of MDs that consider topology (2D) related characteristics for organic molecules19,20,21,22,23, which have been included in the QuBiLS-MAS (Quadratic, Bilinear and N-Linear Maps based on graph-theoretic electronic-density Matrices and Atomic Weightings) software24. These 0-2D and chiral MDs were obtained by codifying the structural information using algebraic bilinear forms and electronic density graph-based matrices. Based on their performance and seeking a generalization of this mathematical proposal (N-linear algebraic forms, related to tensor algebra), the definition of geometrical 3D-MDs for organic molecules was also proposed. This approach allowed the use of N-linear algebraic forms as well as other mathematical tools, such as metrics and aggregation operators, to increase the information extracted by the resultant indices25,26,27. The aforementioned approach was named QuBiLS-MIDAS (Quadratic, Bilinear and N-Linear Maps based on N-tuple Spatial Metric [(Dis)-Similarity] Matrices and Atomic Weightings)28, and several preliminary studies with the QuBiLS-MIDAS 3D-MDs demonstrated a satisfactory behavior, suggesting that this algebraic strategy yields information-rich indices of relevance in chemoinformatic studies26.

There are several applications in protein science, such as the prediction of protein structural classes29 and of the folding rate of proteins30, for which benchmark data sets have been defined and used in numerous articles31,32,33,34. It has been observed that the amino acid sequence and the various interactions between the amino acids of a protein can give information concerning the global stability of the native structure and the folding process, indicating that the folding rate of proteins does not depend solely on thermodynamic factors35. Therefore, the folding rate of proteins could provide information about the function of a protein based on its geometrical and topological configurations36,37.

Structural class prediction has been used as a tool to predict protein function and evolution since the 1970s38. Based on the importance and amount of information related to these two properties, several computational methodologies have been proposed for their calculation. For structural classification, the proposed methods include the amino acid composition (AAC)33, the pair-coupled amino acid composition39, the pseudo amino acid composition (PseAAC)14, and a mathematically based strategy considering bilinear descriptors40. Concerning protein folding rate, there are several indices that consider the topology/geometry of proteins and the number of contacts between amino acids for the prediction of this property15,30,41,42,43.

The major disadvantage of the AAC-based methods is that they give little consideration to the interaction effects generated by the sequence of the protein, which lowers the quality of the prediction. Several PseAAC-based approaches have been proposed to improve the predictions of this type of descriptor44,45,46,47,48,49. The descriptors generated for protein folding consider geometrical/topological concepts, the distance between the residues in contact, and long- and short-range interactions based on the conformation of the protein. However, the disadvantage of these approaches is that they do not consider the whole 3D nature of proteins and the information contained in it, even though it has been proven that folding rate does not depend only on sequence36.

As demonstrated by a series of recent publications50,51,52,53,54,55 and summarized in a comprehensive review56, to develop a really useful predictor for a biological system it is recommended to follow Chou’s 5-step rule, which contains the following steps: (a) select or construct a valid benchmark dataset to train and test the predictor; (b) represent the samples with an effective formulation that can truly reflect their intrinsic correlation with the target to be predicted; (c) introduce or develop a powerful algorithm to conduct the prediction; (d) properly perform cross-validation tests to objectively evaluate the anticipated prediction accuracy; (e) establish a user-friendly web-server for the predictor that is accessible to the public. Papers that present a new sequence-analyzing method or statistical predictor following the guidelines of Chou’s 5-step rule have the following notable merits: (1) crystal clear logic development, (2) completely transparent operation, (3) results that are easy for other investigators to reproduce, (4) high potential for stimulating other sequence-analyzing methods, and (5) great convenience for the majority of experimental scientists.

The main aim of this study is the introduction of a new class of 3D protein MDs based on N-linear algebraic forms that incorporate several mathematical tools as concept generalizations for enhanced information extraction from proteins. The utility of these novel 3D biomacro-molecular indices will be evaluated through the prediction of the SCOP structural classes of proteins and of their folding rates by using Linear Discriminant Analysis (LDA) and Multiple Linear Regression (MLR) techniques, respectively.

## Theoretical Framework

The concept of algebraic-based (bilinear) 3D-MDs was proposed in 2015 by Marrero-Ponce et al. as a tool for protein structural codification40, and an initial extension of a geometric distance matrix12,57 for a protein was obtained.

However, the use of tensor algebra to codify relations between more than 2 atoms (3 and 4 atoms) has been used for organic molecules as a strategy for obtaining more information from the geometrical 3D molecular structure26. In this work, the N-tuple algebraic form concept (N = 3) will be evaluated for the calculation of 3D-protein descriptors.

### Definitions for the total and amino acid level 3D protein descriptors based on three-linear forms

The definition for any kth three-linear biomacro-molecular descriptors for a protein must consider a canonical basis set and the application of N-linear forms (maps) in a $${{\mathbb{R}}}^{n}$$ space; Eq. (1) indicates the mathematical expression for this definition:

$${}_{tr}{}^{k}L=t{r}^{k}(\bar{x},\bar{y},\bar{p})=\sum _{i=1}^{n}\,\sum _{j=1}^{n}\,\sum _{l=1}^{n}\,{z}_{ijl}^{k}{x}^{i}{y}^{j}{p}^{l}$$
(1)

This trilinear form could be defined by using matrices as follows,

$${}_{{\boldsymbol{tr}}}{}^{{\boldsymbol{k}}}{\boldsymbol{L}}=[{\boldsymbol{X}}]\,{{\mathbb{Z}}}^{{\boldsymbol{k}}}{[{\boldsymbol{Y}}]}^{{\boldsymbol{T}}}{[{\boldsymbol{P}}]}^{{\boldsymbol{T}}}={X}_{(1\times {\bf{n}}\times 1)}\,{{\mathbb{Z}}}_{({\bf{n}}\times {\bf{n}}\times {\bf{n}})}^{{\boldsymbol{k}}}{Y}_{({\bf{n}}\times 1\times 1)}{P}_{(1\times 1\times {\bf{n}})}$$
(2)

where $${}_{tr}{}^{k}L\,$$ is the resulting trilinear form MD, n is the number of amino acids (aa) present in the protein, and $$[X],\,[Y],\,[P]$$ are the macro-molecular vectors containing the elements x1, …, xn, y1, …, yn and p1, …, pn, which are the physicochemical properties of every aa present in the protein structure58,59. A table listing all the physicochemical properties considered in this study is available in the Supplementary Material SMI-A. The kth total three-tuple-(dis)similarity matrix (T-TDSM) ($${{\mathbb{Z}}}^{k}$$) is a third-order tensor whose elements $${z}_{ijl}^{k}$$ are calculated by using relationships (multi-metrics) between three aa. These relationships will be discussed in Section 2.4.
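As a minimal sketch of Eqs. (1) and (2), the trilinear form can be read as the full contraction of a third-order tensor with three property vectors. The helper name `trilinear_form`, the random tensor and the property values below are illustrative assumptions, not part of the software or data used in this study.

```python
import numpy as np

def trilinear_form(Z, x, y, p):
    """Eq. (1): sum over i, j, l of z_ijl * x_i * y_j * p_l."""
    return float(np.einsum("ijl,i,j,l->", Z, x, y, p))

# Toy protein with n = 3 residues; every number here is a placeholder.
n = 3
rng = np.random.default_rng(0)
Z = rng.random((n, n, n))        # stand-in for the tensor Z^k
x = np.array([1.0, 2.0, 3.0])    # first property vector (made-up values)
y = np.array([0.5, 0.1, 0.4])    # second property vector (made-up values)
p = np.ones(n)                   # a vector of ones as the third argument

value = trilinear_form(Z, x, y, p)

# The same number obtained from the explicit triple sum of Eq. (1).
loops = sum(Z[i, j, l] * x[i] * y[j] * p[l]
            for i in range(n) for j in range(n) for l in range(n))
assert np.isclose(value, loops)
```

Passing the same vector three times would give a cubic form, whereas three different vectors would give the canonical form.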

Based on the physicochemical nature of the properties used for the macromolecular vectors conformation, the following algebraic forms could be defined: (1) Trilinear Canonical (when all macro-molecular vectors are configured differently, that is, using 3 different aa properties) (see Fig. 1), (2) Trilinear linear (when 2 of the macro-molecular vectors are the identity vector and the other one is an aa property), (3) Trilinear bilinear (when 2 macro-molecular vectors have the same configuration (that is to say, by using the same aa property) and the other one is the identity vector), (4) Trilinear quadratic bilinear (when 2 macro-molecular vectors have the same configuration and the other one has a different aa property from the previous), and (5) Trilinear cubic (when all the macro-molecular vectors have the same configuration, i.e., use the same aa property).
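The five cases listed above depend only on which property is assigned to each macro-molecular vector. The following sketch makes that rule explicit; the function name is hypothetical, `None` is used here as an assumed marker for the identity vector, and the property codes are plain labels.

```python
from collections import Counter

IDENTITY = None  # assumed marker for the identity vector (no aa property)

def classify_trilinear_form(prop_x, prop_y, prop_p):
    """Name the algebraic form implied by the three vector assignments."""
    props = [prop_x, prop_y, prop_p]
    n_identity = props.count(IDENTITY)
    named = [q for q in props if q is not IDENTITY]
    # Multiplicities of the properties in use, e.g. [2, 1] for (a, a, b).
    counts = sorted(Counter(named).values(), reverse=True)
    if n_identity == 0 and counts == [1, 1, 1]:
        return "canonical"
    if n_identity == 2 and counts == [1]:
        return "linear"
    if n_identity == 1 and counts == [2]:
        return "bilinear"
    if n_identity == 0 and counts == [2, 1]:
        return "quadratic-bilinear"
    if n_identity == 0 and counts == [3]:
        return "cubic"
    return "not listed"  # combinations the text does not enumerate

print(classify_trilinear_form("PAH", "ISA", "HWS"))
print(classify_trilinear_form("PAH", "PAH", "ISA"))
```

Combinations that the list above does not cover, such as two different properties plus the identity vector, are reported as "not listed" rather than guessed.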

Moreover, the definition of aa-based kth three-linear MDs for every aa in the protein is shown in Eq. (3):

$${}_{tr}{}^{k}L_{aa}=t{r}^{aa,k}(\bar{x}\,,\,\bar{y},\bar{p})=\sum _{i=1}^{n}\,\sum _{j=1}^{n}\,\sum _{l=1}^{n}\,{z}_{ijl}^{aa,k}{x}^{i}{y}^{j}{p}^{l}=[X]\,{{\mathbb{Z}}}^{aa,k}{[Y]}^{T}{[P]}^{T}\,\forall \,aa=1,2,\ldots ,n$$
(3)

where x1, …, xn, y1, …, yn and p1, …, pn are the components of the macro-molecular vectors.

The kth amino acid-level three-tuple-(dis)similarity matrices (A-TDSM) ($${{\mathbb{Z}}}^{aa,k}$$) with elements $${z}_{ijl}^{aa,k}$$ are computed by considering the following rules:

$$\begin{array}{ll}{z}_{ijl}^{aa,k}={z}_{ijl}^{k} & {\rm{if}}\,{\boldsymbol{i}}\wedge {\boldsymbol{j}}\wedge {\boldsymbol{l}}={\bf{a}}{\bf{a}}\\ {z}_{ijl}^{aa,k}=\frac{2}{3}{z}_{ijl}^{k} & {\rm{if}}\,{\boldsymbol{i}},{\boldsymbol{j}}\vee {\boldsymbol{j}},{\boldsymbol{l}}\vee {\boldsymbol{i}},{\boldsymbol{l}}={\bf{a}}{\bf{a}}\\ {z}_{ijl}^{aa,k}=\frac{1}{3}{z}_{ijl}^{k} & {\rm{if}}\,{\boldsymbol{i}}\,\vee {\boldsymbol{j}}\vee {\boldsymbol{l}}={\bf{a}}{\bf{a}}\\ {z}_{ijl}^{aa,k}=0 & {\rm{otherwise}}\end{array}$$
(4)

Consequently, if a protein contains “B” aa in its structure, the T-TDSM ($${{\mathbb{Z}}}^{k}$$) can be expressed as the sum of “B” aa-level matrices ($${{\mathbb{Z}}}^{aa,k}$$) (see Fig. 2). From this concept, after the application of algebraic maps to every A-TDSM, we obtain “B” aa-level indices, denoted as $${}_{tr}L_{aa}$$ (see Eq. (3)), which are stored in an array (see Fig. 3).

This array will be designated as LAI (Local Amino Acidic Invariant) as a correspondence of the LOVI vector for organic molecules (Local Vertex Invariant)60,61. From the LAI vector, the total (whole-protein) three-linear indices can be calculated by using aggregation operators (which is a generalization concept for merging components)62. These aggregation operators will be discussed in Section 2.3. The general calculation scheme for these novel biomacro-molecular indices is shown in Fig. 3.
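The decomposition of Eqs. (3) and (4) can be sketched as follows, under the reading that each tensor element is weighted by the fraction of its three indices that equal the chosen aa (1, 2/3, 1/3 or 0). The function names and the random inputs are assumptions made for illustration only.

```python
import numpy as np

def aa_level_tensor(Z, aa):
    """Weight z_ijl by (number of indices among i, j, l equal to aa) / 3."""
    n = Z.shape[0]
    is_aa = (np.arange(n) == aa).astype(float)
    hits = is_aa[:, None, None] + is_aa[None, :, None] + is_aa[None, None, :]
    return Z * hits / 3.0

def local_invariants(Z, x, y, p):
    """One trilinear index per residue, i.e. the LAI vector."""
    return np.array([
        np.einsum("ijl,i,j,l->", aa_level_tensor(Z, aa), x, y, p)
        for aa in range(Z.shape[0])
    ])

# Placeholder tensor and property vectors for a 4-residue toy protein.
rng = np.random.default_rng(1)
Z = rng.random((4, 4, 4))
x, y, p = rng.random(4), rng.random(4), rng.random(4)

lai = local_invariants(Z, x, y, p)
total = np.einsum("ijl,i,j,l->", Z, x, y, p)

# The aa-level tensors add up to the total tensor, so the LAI entries
# add up to the total trilinear index.
assert np.isclose(lai.sum(), total)
```

Because the weights of every element sum to one over all residues, plain summation of the LAI recovers the whole-protein index, while other aggregation operators yield different global descriptors.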

### Definition for the group-based 3D protein MDs considering three-linear forms

If we consider clusters of aa classified in terms of their activity/properties in solution or their probability of generating a certain secondary structure (see Table 1), group-based indices can be computed by selecting the corresponding aa-based indices stored in the LAI. Consequently, a new vector called the Local Group-based Amino Acidic Invariant (LAIG) is generated. By applying aggregation operators to this vector, a new type of general indices based on aa groups can be generated. This operation makes it possible to evaluate the influence of certain aa in a variety of applications in protein science.
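A group-based vector of this kind amounts to filtering the LAI by residue type. In the sketch below the function name, the one-letter sequence, the LAI values and the example group (the aromatic residues F, W and Y, chosen only for illustration and not taken from Table 1) are all assumptions.

```python
def group_invariant(sequence, lai, group):
    """Keep the LAI entries whose residue belongs to the chosen group."""
    if len(sequence) != len(lai):
        raise ValueError("sequence and LAI must have the same length")
    return [value for residue, value in zip(sequence, lai) if residue in group]

# Hypothetical 6-residue sequence with made-up aa-level indices.
sequence = "AFWGYK"
lai = [0.8, 1.5, 2.1, 0.3, 1.9, 0.7]
aromatic = {"F", "W", "Y"}   # illustrative group only

laig = group_invariant(sequence, lai, aromatic)
group_index = sum(laig)      # summation is one possible aggregation operator
```

Any other aggregation operator could replace the final sum to obtain a different group-based index.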

### Generation of novel protein MDs from amino acid-based indices using aggregation operators

An invariant could be defined as a generalization procedure for merging different components to obtain one fused expression. The hypothesis that the most appropriate global definition of a natural system may not necessarily be additive motivates the proposal of this tool as an alternative for the generation of MDs. As a proof of concept, the work of Barigye et al.62 demonstrated that operators other than the sum could yield better correlations with certain chemical properties. These invariants (aggregation operators) are classified into four major groups: (i) Norms (or Metrics) Invariants: Minkowski norms (N1, N2, N3). Note that N1 corresponds to the linear combination (summation) of the elements in the LAI; (ii) Mean Invariants (first statistical moments): geometric mean (G), arithmetic mean (M), quadratic mean (P2), power mean of third degree (P3) and harmonic mean (A); (iii) Statistical Invariants (higher statistical moments): variance (V), skewness (S), kurtosis (K), standard deviation (SD), variation coefficient (CV), range (R), percentile 25 (Q1), percentile 50 (Q2), percentile 75 (Q3), inter-quartile range (I50), maximum trL (MX) and minimum trL (MN); and (iv) Classical Invariants: Autocorrelation (AC), Gravitational (GV), Total Information Content (TIC), Mean Information Content (MIC), Standardized Information Content (SIC), Total Sum (TS) and Kier-Hall Connectivity (KH).

These invariants are applied to the LAI vector, which contains the aa-based indices, to obtain a series of global (or local: aa-based or group-based) indices that could contain information orthogonal to that obtained with the metric invariant N1. A table with the formulae of all the proposed aggregation operators is given in SMI-B.
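A few of these operators can be sketched with their textbook definitions. The exact formulae are those of SMI-B, which is not reproduced here, so the expressions below (for instance the population variance, and the use of absolute values in the Minkowski norms) are assumptions, as are the function names.

```python
import numpy as np

def minkowski_norm(lai, order):
    """(sum |v|^order)^(1/order); for order 1 and non-negative entries
    this reduces to the plain summation."""
    lai = np.asarray(lai, dtype=float)
    return float(np.sum(np.abs(lai) ** order) ** (1.0 / order))

def aggregate(lai):
    """Fuse an LAI vector into several global indices (subset of operators)."""
    lai = np.asarray(lai, dtype=float)
    return {
        "N1": minkowski_norm(lai, 1),
        "N2": minkowski_norm(lai, 2),
        "N3": minkowski_norm(lai, 3),
        "M": float(lai.mean()),             # arithmetic mean
        "V": float(lai.var()),              # population variance (assumed)
        "SD": float(lai.std()),             # population standard deviation
        "R": float(lai.max() - lai.min()),  # range
        "MX": float(lai.max()),
        "MN": float(lai.min()),
    }

indices = aggregate([3.0, 4.0])   # made-up two-residue LAI
```

Each key of the returned dictionary corresponds to one global descriptor derived from the same set of aa-level indices.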

### Definition of the three-tuple-(dis)similarity matrix (TDSM) for physicochemical information extraction

Macro-molecular graphs allow the study of chemical interactions in biological systems and help to explain the behavior seen in experimental observations63,64; protein geometric (3D) representations indicate the distribution of the constituent amino acids in space. It is important to mention that the stability and maintenance of this complex structure relies on the inter-residue interactions65. In this graphical approach, the aa of the protein can be considered as pseudo-vertices, which possess spatial coordinates defined by a chosen carbon representation. The alpha carbon (Cα) has been the most used representation for protein geometrical/topological studies12,15,64,66; however, some studies have considered the beta carbon (Cβ) as a simple atom (pseudo-node)-based representation67.

In this report, we propose two additional representations, the Amide Carbon (AB) and the average of the coordinates of all atoms in the amino acid (AVG), to examine the behavior and information content of these representations with respect to the existing ones. Furthermore, all interactions and bonds between these pseudo-vertices are considered as connections between them. Here, all these interactions between amino acids will be computed by considering relationships (multi-metrics) among three aa $$({z}_{ijl}^{k})$$. Therefore, three-tuple spatial-(dis)similarity matrices $$({{\mathbb{Z}}}^{k})$$ will be generated as a representation of the bio-macro-molecular structure.

The formal definitions of elements $${z}_{ijl}^{k}$$ of the matrix $${{\mathbb{Z}}}^{k}$$ are indicated as follows (see Eq. (5)) (See Fig. 4):

$$\begin{array}{rcl}{z}_{ijl}^{k} & = & \begin{array}{cc}T{T}_{ijl} & {\rm{if}}\,{\bf{i}}\wedge {\bf{j}}\wedge {\bf{l}}\,are\,not\,equal\end{array}\\ & = & \begin{array}{cc}{D}_{ijl} & {\rm{if}}\,{\bf{i}},{\bf{j}}\vee {\bf{j}},{\bf{l}}\vee {\bf{i}},{\bf{l}}\,are\,equal\end{array}\\ & = & \begin{array}{cc}0 & {\rm{otherwise}}\end{array}\end{array}$$
(5)

where, $$T{T}_{ijl}$$ is a measure for ternary relations of amino acids (multi-metric), $${D}_{ijl}$$ is a measure for duplex relation of amino acids (metric between 2 amino acids).

From Eq. (5) we can observe that, when the aa i, j and l of the protein are all different, the measure used for the calculation is ternary. The ternary measures used for the computation of the indices are indicated in Table 2. However, when a multi-metric cannot be computed (two aa are the same), it is reduced to a lower-order measure (duplex relation). The duplex measures used for the computation are indicated in SMI-C. It is important to remark that when a ternary measure is selected to codify the information of the protein, it is mandatory to select at least one duplex measure or metric. Nevertheless, the selection of a metric is not mandatory when the ternary measures are related to the Volume, Bond Angle and Dihedral Angle measures (see Fig. 5).

There are two possibilities regarding the application of multi-metrics or metrics to the protein structure: they can be amino acid-based or protein mass center-based. In the first option, the multi-metric is calculated considering the distance functions between the aa themselves; consequently, the elements $${z}_{ijl}$$ of the T-TDSM are zero when i = j = l. In the second case, the multi-metric is calculated considering the selected metric from each amino acid to the mass center of the protein, and all elements $${z}_{ijl}$$ of the T-TDSM are different from zero; this approach may offer a better discrimination among protein spatial structures, given that it provides information about the centrality of the aa residues.
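For the amino acid-based option, Eq. (5) can be sketched by choosing one ternary and one duplex measure. The triangle area and the Euclidean distance are used here purely as examples; the function names, the coordinates, and the choice of taking the distance between the two distinct residues when two indices coincide are assumptions.

```python
import numpy as np

def triangle_area(a, b, c):
    """Area of the triangle spanned by three 3D points."""
    return 0.5 * float(np.linalg.norm(np.cross(b - a, c - a)))

def build_tdsm(coords):
    """Amino acid-based tensor following the three cases of Eq. (5)."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    Z = np.zeros((n, n, n))
    for i in range(n):
        for j in range(n):
            for l in range(n):
                distinct = {i, j, l}
                if len(distinct) == 3:
                    # Ternary relation: three different residues.
                    Z[i, j, l] = triangle_area(coords[i], coords[j], coords[l])
                elif len(distinct) == 2:
                    # Duplex relation: distance between the two residues.
                    u, v = distinct
                    Z[i, j, l] = np.linalg.norm(coords[u] - coords[v])
                # i == j == l stays zero in the amino acid-based option.
    return Z

# Made-up pseudo-vertex coordinates for a 3-residue toy example.
coords = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
Z = build_tdsm(coords)
```

With these coordinates the three residues form a 3-4-5 right triangle, so the ternary entries hold its area and the duplex entries hold its side lengths.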

The kth three-tuple-(dis)similarity matrix is obtained by performing a Hadamard matrix product12, that is, by raising every element of the three-tuple-(dis)similarity matrix to the power k. The exponent k is a real number whose values can be positive or negative; when k is negative, the reciprocal operation is computed. This operation aims to extract the information accounted for by the intra-molecular forces that occur in the protein structure due to the residues present in every aa. The range of values to evaluate this product could be from −12 to 12, e.g. k = −1 is related to the gravitational potential, k = −2 is related to the Coulomb potential (see Fig. 6 for more details).
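The element-wise power can be sketched as below. Leaving zero entries at zero, so that negative exponents do not divide by zero, is an assumption of this sketch, as is the function name.

```python
import numpy as np

def hadamard_power(Z, k):
    """Raise every nonzero element of Z to the power k (element-wise)."""
    Z = np.asarray(Z, dtype=float)
    out = np.zeros_like(Z)
    nonzero = Z != 0
    # Zero entries are kept at zero (assumed convention for negative k).
    out[nonzero] = Z[nonzero] ** k
    return out

# Made-up 2 x 2 x 1 tensor with one zero entry.
Z = np.array([[[2.0], [0.0]], [[4.0], [1.0]]])
Z_inv = hadamard_power(Z, -1)   # reciprocal of each nonzero entry
Z_sq = hadamard_power(Z, 2)     # square of each entry
```

The same function works for tensors of any shape, since the power is applied entry by entry.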

When normalizing procedures are not employed (see section 2.6 below) for the elements of $${{\mathbb{Z}}}^{k}$$, these matrices are designated as the kth non-stochastic three-tuple-(dis)similarity matrices (NS-T-TDSM) $$({}_{ns}{\mathbb{Z}}_{k})$$.

### Probabilistic transformations of the TDSM

Although normalization methods for geometrical matrices are not usually employed, several descriptors use this concept for organic molecules, RNA secondary structures, protein sequences and viral surfaces68,69,70,71,72. Normalized matrices offer advantages such as information standardization, and they serve as a tool for the computation of different kth three-linear MDs25.

Since probabilistic transformations have only been applied to two-tuple matrices, a generalization of these concepts will be used to normalize the kth non-stochastic three-tuple-(dis)similarity matrices obtained from the computation described above. In this study, two probability schemes can be applied: (a) simple stochastic and (b) mutual probability transformations.

The kth simple-stochastic three-tuple-(dis)similarity matrices $${}_{ss}{\mathbb{Z}}_{k}$$ (SS-T-TDSM) and kth mutual probability three-tuple-(dis)similarity matrices $${}_{mp}{\mathbb{Z}}_{k}$$ (MP-T-TDSM), which are obtained from $${}_{ns}{\mathbb{Z}}_{k}$$, have been defined as follows:

$${}_{ss}z_{ijl}^{k}=\frac{{}_{ns}z_{ijl}^{k}}{{S}_{jl}}=\frac{{}_{ns}z_{ijl}^{k}}{{\sum }_{j=1}^{n}\,{\sum }_{l=1}^{n}\,{}_{ns}z_{ijl}^{k}}$$
(6)

$${}_{mp}z_{ijl}^{k}=\frac{{}_{ns}z_{ijl}^{k}}{{S}_{ijl}}=\frac{{}_{ns}z_{ijl}^{k}}{{\sum }_{i=1}^{n}\,{\sum }_{j=1}^{n}\,{\sum }_{l=1}^{n}\,{}_{ns}z_{ijl}^{k}}$$
(7)

where $${}_{ns}z_{ijl}^{k}$$ are the elements of the kth non-stochastic three-tuple-(dis)similarity matrices. For the simple stochastic case, $${S}_{jl}$$ is the summation of all entries of the two-tuple matrix corresponding to each aa i in the three-tuple matrix, whereas for the mutual probability scheme, $${S}_{ijl}$$ is the summation of all elements of the tensor $${}_{ns}{\mathbb{Z}}_{k}$$ (see Fig. 7).
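Both normalizations can be sketched in a few lines: the simple stochastic scheme divides each two-tuple slice (fixed i) by its own total, and the mutual probability scheme divides every element by the total of the whole tensor. The function names, the toy tensor and the handling of zero totals are assumptions.

```python
import numpy as np

def simple_stochastic(Z):
    """Divide each slice Z[i, :, :] by the sum of its entries."""
    Z = np.asarray(Z, dtype=float)
    S = Z.sum(axis=(1, 2), keepdims=True)   # one total per residue i
    # Slices whose total is zero are left as zeros (assumed convention).
    return np.divide(Z, S, out=np.zeros_like(Z), where=(S != 0))

def mutual_probability(Z):
    """Divide every element by the sum of all elements of the tensor."""
    Z = np.asarray(Z, dtype=float)
    total = Z.sum()
    return Z / total if total != 0 else np.zeros_like(Z)

# Made-up non-stochastic tensor with entries 0..7.
Z_ns = np.arange(8, dtype=float).reshape(2, 2, 2)
Z_ss = simple_stochastic(Z_ns)
Z_mp = mutual_probability(Z_ns)
```

After the first transformation each slice sums to one, and after the second the whole tensor sums to one.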

### Computational calculation of the new proposed protein MDs

These novel 3D algebraic MDs can be generated with the in-house software MuLiMs-MCoMPAs (part of the ToMoCoMD-CAMPS system), an open-access Java-based program. The software allows the user to evaluate all the theoretical configurations presented above and is available at http://www.tomocomd.com/; it runs on all available operating systems and comes in two versions: a graphical user interface (GUI) version and a console version for calculations on high-performance computing (HPC) systems.

## Application of the N-Linear 3D Algebraic Biomacro-Molecular Descriptors to the Prediction of Folding Rate and SCOP Structural Classification of Proteins

### Benchmark datasets

The training set used for modelling the folding rate of proteins (80 proteins) was proposed by Ouyang31. It is important to mention that the case “2BLM” was removed from the set because it provides only the alpha carbon representation. The test set used here (17 proteins) was proposed by Ruiz-Blanco36.

The set used for protein structural classification (204 proteins) was proposed by K.C. Chou based on the SCOP classification (52 all alpha, 61 all beta, 45 alpha/beta and 46 alpha + beta)39. This set was divided into two groups: 149 proteins were used for the training set and 55 for the test set. The details of how this separation was done can be found in Marrero-Ponce et al.40 (see also section 3.1). The structures (pdb files) of the proteins and the respective protein representations (pdbx files) can be found in SMII-1 and SMII-2.

### Novel 3D algebraic MDs calculation and dimensionality reduction

The software MuLiMs-MCoMPAs (acronym for Multi-Linear Maps based on N-Metric & Contact Matrices of 3D-Protein and Amino-Acids Weightings), belonging to the ToMoCoMD-CAMPS suite (acronym for TOpological MOlecular COMputational Design-Computed-Aided Modelling in Protein Science), allows the computation of these novel protein descriptors. However, in order to reduce the number of MDs to evaluate, analyses of the collinearity between indices and of information redundancy were performed to obtain 10 suggested theoretical configurations (here designated as projects). The projects designed and used in the present study are shown in SMII-3. From these projects, a total of 20,263 MDs were generated with the MuLiMs console version on an HPC system with the following characteristics: 16 Intel(R) Xeon(R) E5-2630 v3 cores at 2.4 GHz and 64 GB of RAM.

After the computation of the indices, additional dimensionality reduction procedures were performed. First, non-supervised and supervised procedures considering an information theoretic approach were employed for the reduction of the number of descriptors73,74. The software used for this purpose is known as IMMAN75. In addition to these reductions, a final supervised reduction was performed using subset filters which considered 2 search methods, Best First and Greedy Stepwise. The software used for this purpose was WEKA (version 3.8)76.

### Development of the regression and classification models

The folding rate modelling was performed using the software MOBYDIGS77, which combines Multiple Linear Regression (MLR) with a wrapper method based on a Genetic Algorithm (GA). The GA was set up as follows: population size, 100; reproduction/mutation rate, initially 0.5 and then varied between 0 and 1 during the exploration; selection method, initially 0.5 and then changed to 1 and 0 to evaluate more selection options. Several experiments were performed to build models that considered only trilinear indices as well as models that combined trilinear and bilinear indices.

Based on the prediction errors obtained for all models, four proteins were excluded from the chosen test set as outliers: pdb1jo8, pdb1spr_A, pdb1t8j and pdb2vik.

The protein structural classification was performed using the software WEKA76, which combines Linear Discriminant Analysis (LDA) with a subset method that uses two search strategies, Best First and Greedy Stepwise, as well as a wrapper method. Several experiments were carried out to generate mathematical models that considered only trilinear indices as well as models that combined trilinear and bilinear indices.

#### Assessment of the models

Depending on the modelling technique, several statistical parameters were selected to validate the resulting mathematical expressions. For MLR, the leave-one-out cross-validation coefficient (Q2loo) was used as the fitness function. The models were also assessed with the Y-scrambling validation method (a(Q2))78 and the bootstrapping technique (Q2boot)79, to reduce the possibility of chance correlation between the selected MDs and the response and to assess the predictive power of the models.
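The leave-one-out statistic can be sketched with its usual definition, Q2loo = 1 − PRESS/TSS, where PRESS accumulates the squared error of predicting each protein from a model fitted without it. This is a generic illustration and not the MOBYDIGS implementation; the function name and the toy data are assumptions.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out Q2 for an ordinary least-squares model with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])   # design matrix with intercept
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i           # drop sample i from the fit
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return float(1.0 - press / tss)

# Made-up descriptor column and two made-up responses.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_exact = 2.0 * X[:, 0] + 1.0              # perfectly linear response
y_noisy = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

q2_exact = q2_loo(X, y_exact)
q2_noisy = q2_loo(X, y_noisy)
```

A response that is exactly linear in the descriptors gives a value of essentially 1, and any scatter lowers it; repeating the calculation on shuffled responses is the idea behind Y-scrambling.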

### Results and comparison with other approaches

The use of these novel biomacro-molecular descriptors for proteins as a main component for the generation of predictive mathematical models was proposed to evaluate the performance of these models against mathematical expressions generated using other MDs proposed in the literature. As a result, several models for the prediction of folding rate of proteins considering MLR as a modelling strategy and several models for the structural classification of proteins considering the SCOP dataset, using LDA as a modelling strategy, were obtained. The best ranked models and the comparison table are shown below.

#### Folding rate evaluation

This section presents the equations and statistical parameters for the best two models obtained for folding rate prediction considering only trilinear indices (Eqs 8 and 9) and the best two models obtained for folding rate prediction considering the combination of the trilinear and bilinear indices (Eqs 10 and 11). These equations are presented below:

$$\mathrm{ln}\,({\rm{k}})=0.0123\ast {\rm{A}}-0.0315\ast {\rm{B}}+{\rm{19.100}}\ast {\rm{C}}+7029.6\ast {\rm{D}}$$
(8)

where,

A = AVG_TS[7]_N1_Tr_M33(M3)_MP-8_o_RPU_KA_PAH-ISA-HWS_MCoMPAs

B = AVG_N3_TrQB_M55(M15)_SS-2_T_KA_PAH-ISA_MCoMPAs

C = AVG_Q1_TrC_M58(M15)_SS0_T_KA_PAH_MCoMPAs

D = AVG_GV[5]_MX_TrF_M41(M5)_MP7_o_T_KA_PBS_MCoMPAs

$$\mathrm{ln}\,({\rm{k}})=-0.0323\ast {\rm{A}}+20.3011\ast {\rm{B}}+7205.01\ast {\rm{C}}-1.7572$$
(9)

where,

A = AVG_N3_TrQB_M55(M15)_SS-2_T_KA_PAH-ISA_MCoMPAs

B = AVG_Q1_TrC_M58(M15)_SS0_T_KA_PAH_MCoMPAs

C = AVG_GV[5]_MX_TrF_M41(M5)_MP7_o_T_KA_PBS_MCoMPAs

$$\mathrm{ln}\,({\rm{k}})=-44766.6\ast {\rm{A}}-0.96157\ast {\rm{B}}+0.20729\ast {\rm{C}}-3.25903\ast {\rm{D}}+25.4265$$
(10)

where,

A = CB_Q2_B_M19_NS-3_T_LGP[ + 12.0]_LGL[4-11]_PAH-PBS_MCoMPAs

B = CB_K_Q_M5_NS-1_T_LGP[1-3]_KDS_MCoMPAs

C = CB_K_B_M2_SS-1_FBS_KA_MM-ECI_MCoMPAs

D = CB_MIC_N1_TrQB_M45(M8)_SS2_o_T_KA_PAH-Z3_MCoMPAs

$$\mathrm{ln}\,({\rm{k}})=-42920.3\ast {\rm{A}}+0.17709\ast {\rm{B}}-3.22386\ast {\rm{C}}+26.0880$$
(11)

where,

A = CB_Q2_B_M19_NS-3_T_LGP[ + 12.0]_LGL[4-11]_PAH-PBS_MCoMPAs

B = CB_K_B_M2_SS-1_FBS_KA_MM-ECI_MCoMPAs

C = CB_MIC_N1_TrQB_M41(M5)_SS2_o_T_KA_PAH-Z3_MCoMPAs

As can be observed from Table 3, the bootstrapping correlation coefficient Q2boot calculated for each model is greater than 0.73, which indicates the robustness of the calibrated models against perturbations of the training set. Moreover, the best ranked model was obtained with the combination of trilinear and bilinear indices, and its Q2 value is 0.797 (Eq. 11). In addition, the parameters derived from the Y-scrambling tests [a(Q2)] are in all cases around −0.137, indicating a low propensity to chance correlations in the predictions. Folding rate depends on the tridimensional structure and on specific contact sites along the structure. The correlation obtained between the studied property and the set of proteins indicates that the proposed descriptors capture an increased amount of information. Consequently, these descriptors extract orthogonal and novel information that is complementary to the bilinear algebraic indices. Regarding the composition of the indices in the equations, the protein representations Cβ and AVG are present in all these models, indicating that these representations extract more information than the Cα representation.

Furthermore, the similarity between the standard deviation (SDEP) values in the training and test sets suggests that the obtained models have general applicability.

Regarding the statistical parameters obtained for the external set of proteins (test set), the overall Q2ext is higher than 0.78 (i.e., the models explain more than 78% of the total variance), which indicates the high predictive capability of the models with respect to this property. Moreover, the model with the highest Q2ext is Eq. 10 with 0.86; this model was generated considering only trilinear indices. Based on the configuration of the descriptors used for the modelling, it can be observed that mathematical tools such as the aggregation operators (all the selected operators are different from the linear combination, which validates this theoretical statement), the normalization procedures (simple stochastic and mutual probability), the steric physicochemical properties (PAH and PBS), and the protein mass center-based calculation of the multi-metric and metric distance functions (a generalization that considers the whole protein structure) allowed a strong correlation between the indices and the response variable.

Concerning other MDs used to correlate the folding rate of proteins, the cross-validation correlation coefficient obtained here is the highest value reported for this application. Table 4 lists the values obtained for the training and test sets using the aforementioned descriptors. The values obtained in this study are superior to those reported in the other studies.

Finally, all the best ranked models and their statistical parameters are listed in SMIII-D.

#### Protein structural classification evaluation

The statistical values for the best four models obtained for SCOP protein structural classification are presented in Table 5. Two of them were obtained with trilinear indices (Equations 12 and 13), whereas the other two were obtained with combinations of trilinear and bilinear indices (Equations 14 and 15).

As can be observed from Table 5, the number of variables in the best models is between 9 and 19, suggesting that these models achieve a high accuracy on the training set with a relatively low number of variables. The best models on the training set were Equations 14 and 15, with an Acc. value of 99.33%. It is important to mention that these models were obtained using the combination of trilinear and bilinear indices. Since the structural classification of proteins considers the amount of secondary structure elements (alpha helices and beta sheets) present in the structure, the results suggest that adding trilinear indices extracts more structural information than bilinear indices alone. This statement is supported by the generalizations applied in the mathematical definition of the indices, which allow more, and non-redundant, information to be extracted from the protein structure.

Regarding the composition of the indices that make up the equations, the Cβ, AVG and AB protein representations are present in all of these models, indicating that these representations extract more information than the Cα representation.

Evaluating the MCC values for the training set, it can be observed that the values for all models are above 0.88, which indicates that the models make few classification errors due to false positives and false negatives.

Regarding the results obtained for the external prediction, all models have a correct classification percentage above 89.09%, which indicates a high predictive ability of the models fitted on the training set. The model with the highest prediction value is Equation 15, with an Acc. value of 98.18%. The MCC value for this model is 0.943, which indicates a very low number of false positives and false negatives in the prediction.
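
The two classification statistics quoted above can be sketched as follows. This is a minimal binary (one class versus the rest) formulation; how the multi-class SCOP results in Table 5 were reduced to a single Acc. and MCC value is not restated here, so the binary setting and the function names are assumptions made for illustration.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Percentage of correctly classified objects."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Usage with made-up counts (not counts from this study)
print(accuracy(tp=50, tn=45, fp=3, fn=2), mcc(tp=50, tn=45, fp=3, fn=2))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why a value close to 1 implies few false positives and few false negatives at the same time.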

Based on the configuration of the descriptors used in the classification models, it is possible to observe that several mathematical tools, such as the different metrics used to define the distance between two amino acids, the local descriptors, and the several aggregation operators, allow better information extraction for the classification models of this property.
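
Two of the tools named above can be illustrated with a short sketch: a Minkowski-type distance between the coordinates of two amino acids, and a few aggregation operators that fuse amino acid-level indices into one protein-level value. The operator names, the `AGGREGATORS` table and the `aggregate` helper are hypothetical and do not reproduce the MuLiMs-MCoMPAs implementation or its full operator set.

```python
import statistics

def minkowski(a, b, p=2.0):
    """Minkowski distance of order p between two 3D points (p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Illustrative aggregation operators acting on a list of local (amino acid) indices
AGGREGATORS = {
    "sum": sum,                                        # plain linear combination
    "mean": statistics.fmean,                          # arithmetic mean
    "norm2": lambda v: sum(x * x for x in v) ** 0.5,   # Euclidean norm
    "stdev": statistics.pstdev,                        # population standard deviation
}

def aggregate(local_indices, operator="mean"):
    """Fuse amino acid-level indices into a single protein-level descriptor."""
    return AGGREGATORS[operator](local_indices)

# Usage with made-up coordinates and indices
print(minkowski((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
print(aggregate([1.0, 2.0, 3.0], "norm2"))
```

Operators other than the plain sum respond to the distribution of the local indices rather than only to their total, which is one way to read the observation that the selected operators differ from the linear combination.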

Concerning other descriptors used to predict the secondary structural classification, a comparison of the reported statistical parameters of the classification models built with those descriptors against our models shows that the models proposed in this study have a higher classification percentage for the training and test sets (Table 6). All the best ranked models and their statistical parameters are listed in SMIII-E.

## Conclusion and Future Research

The definition of a new type of 3D MDs based on N-linear algebraic forms allowed the codification of geometrical and topological information regarding relationships between three amino acids of a protein, as shown by the evaluation and comparison of the selected statistical parameters obtained for two representative applications in protein science (folding rate and secondary structural classification). Consequently, these MDs constitute an alternative for the generation of predictive models of protein physicochemical properties and function.

Two new (AB and AVG) and two commonly used (Cα and Cβ) protein representations were evaluated for the extraction of protein geometrical information. Based on the results of this study, the greatest information extraction was obtained when the proposed protein descriptors used the beta carbon (Cβ) and the pseudo amino acid (AVG) representations.

As future research, we suggest using spherical truncating methods and generalized aggregation operators as another generalization strategy for the generation of these novel MDs. These mathematical tools could improve the information extraction from the proteins’ graphical representations.

Moreover, we suggest evaluating these novel biomacro-molecular descriptors for proteins in multi-reference studies (several representative protein science applications) that consider several benchmark data sets, in order to identify the types of applications for which these novel indices perform better than previously proposed approaches and how much orthogonal information these molecular descriptors can capture.

As pointed out in K.C. Chou’s review80 and demonstrated in a series of recent publications (see, e.g.50,51,81), user-friendly and publicly accessible web-servers represent the future direction for developing useful prediction methods and computational tools. Many web-servers have significantly increased the impact of bioinformatics on medical science82, driving medicinal chemistry into an unprecedented revolution83. We shall therefore make efforts in our future work to provide a web-server for the topic presented in this paper.

## Data Availability

The MuLiMs-MCoMPAs software and the respective user manual are freely available online at www.tomocomd.com.

## References

1.

Bui, T. N. & Sundarraj, G. An efficient genetic algorithm for predicting protein tertiary structures in the 2D HP model. in Proceedings of the 2005 conference on Genetic and evolutionary computation - GECCO ’05 385, https://doi.org/10.1145/1068009.1068072 (ACM Press, 2005).

2.

Chou, K. C. & Forsén, S. Graphical rules for enzyme-catalysed rate laws. Biochem. J. 187, 829–835 (1980).

3.

Chou, K. C., Forsén, S. & Zhou, G. Q. Three schematic rules for deriving apparent rate constants. Chem. Scr. 109–113 (1980).

4.

Chou, K. C., Carter, R. E. & Forsén, S. A new graphical method for deriving rate equations for complicated mechanisms. Chem. Scr. 82–86 (1981).

5.

Li, T. T. & Chou, K. C. The flow of substrate molecules in fast enzyme-catalyzed reaction systems. Chem. Scr. 192–196 (1980).

6.

Chou, K.-C. Applications of graph theory to enzyme kinetics and protein folding kinetics: Steady and non-steady-state systems. Biophys. Chem. 35, 1–24 (1990).

7.

Chou, K. & Forsén, S. Diffusion-controlled effects in reversible enzymatic fast reaction systems - critical spherical shell and proximity rate constant. Biophys. Chem. 12, 255–263 (1980).

8.

Chou, K., Li, T. & Forsén, S. The critical spherical shell in enzymatic fast reaction systems. Biophys. Chem. 12, 265–269 (1980).

9.

Shen, H.-B., Song, J. & Chou, K.-C. Prediction of protein folding rates from primary sequence by fusing multiple sequential features. Journal of Biomedical Science and Engineering 2 (2009).

10.

Chou, K.-C. Low-frequency collective motion in biomacromolecules and its biological functions. Biophys. Chem. 30, 3–48 (1988).

11.

Chou, K. C., Chen, N. Y. & Forsén, S. The biological functions of low-frequency phonons: 2. Cooperative effects. Chem. Scr. 18, 126–132 (1981).

12.

Todeschini, R. & Consonni, V. Molecular Descriptors for Chemoinformatics. Molecular Descriptors for Chemoinformatics 2, (Wiley-VCH Verlag GmbH & Co. KGaA, 2009).

13.

Cai, Y.-D., Feng, K.-Y., Lu, W.-C. & Chou, K.-C. Using LogitBoost classifier to predict protein structural classes. J. Theor. Biol. 238, 172–176 (2006).

14.

Chou, K.-C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Bioinforma. 43, 246–255 (2001).

15.

Plaxco, K. W., Simons, K. T. & Baker, D. Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol. 277, 985–994 (1998).

16.

Randić, M., Zupan, J., Balaban, A., Vikić-Topić, D. & Plavšić, D. Graphical Representation of Proteins. Chem. Rev. 111, 790–862 (2011).

17.

Ruiz-Blanco, Y. B. et al. Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone. BMC Bioinformatics 18, 1–14 (2017).

18.

Agüero, G. TI2BioP: Topological Indices to BioPolymers. Mol2Net 1, 1–3 (2015).

19.

Marrero Ponce, Y., Torrens, F., García-Domenech, R., Ortega-Broche, S. E. & Zaldivar, V. R. Novel 2D TOMOCOMD-CARDD molecular descriptors: atom-based stochastic and non-stochastic bilinear indices and their QSPR applications. J. Math. Chem. 44, 650–673 (2008).

20.

Marrero Ponce, Y. Total and local (atom and atom type) molecular quadratic indices: significance interpretation, comparison to other molecular descriptors, and QSPR/QSAR applications. Bioorg. Med. Chem. 12, 6351–6369 (2004).

21.

Castillo-Garit, J. A., Martinez-Santiago, O., Marrero Ponce, Y., Casañola-Martín, G. M. & Torrens, F. Atom-based non-stochastic and stochastic bilinear indices: Application to QSPR/QSAR studies of organic compounds. Chem. Phys. Lett. 464, 107–112 (2008).

22.

Marrero Ponce, Y. Linear Indices of the “Molecular Pseudograph’s Atom Adjacency Matrix”: Definition, Significance-Interpretation, and Application to QSAR Analysis of Flavone Derivatives as HIV-1 Integrase Inhibitors. J. Chem. Inf. Comput. Sci. 44, 2010–2026 (2004).

23.

Marrero Ponce, Y., Torrens, F., Alvarado, Y. J. & Rotondo, R. Bond-based global and local (bond, group and bond-type) quadratic indices and their applications to computer-aided molecular design. 1. QSPR studies of diverse sets of organic chemicals. J. Comput. Aided. Mol. Des. 20, 685–701 (2006).

24.

Valdés-Martiní, J. R. et al. QuBiLS-MAS, open source multi-platform software for atom- and bond-based topological (2D) and chiral (2.5D) algebraic molecular descriptors computations. J. Cheminform. 9, 1–26 (2017).

25.

Garcia-Jacas, C. et al. N-Linear Algebraic Maps for Chemical Structure Codification: A Suitable Generalization for Atom-pair Approaches? Curr. Drug Metab. 15, 441–469 (2014).

26.

García-Jacas, C. et al. N-tuple topological/geometric cutoffs for 3D N-linear algebraic molecular codifications: variability, linear independence and QSAR analysis. SAR QSAR Environ. Res. 27, 949–975 (2016).

27.

García-Jacas, C. et al. Examining the predictive accuracy of the novel 3D N-linear algebraic molecular codifications on benchmark datasets. J. Cheminform. 8, 1–16 (2016).

28.

García-Jacas, C. et al. QuBiLS-MIDAS: A parallel free-software for molecular descriptors computation based on multilinear algebraic maps. J. Comput. Chem. 35, 1395–1409 (2014).

29.

Chou, K.-C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10–19 (2005).

30.

Nölting, B. et al. Structural determinants of the rate of protein folding. J. Theor. Biol. 223, 299–307 (2003).

31.

Ouyang, Z. & Liang, J. Predicting protein folding rates from geometric contact and amino acid sequence. Protein Sci. 17, 1256–1263 (2008).

32.

Ruiz-Blanco, Y. B. et al. A Hooke’s law-based approach to protein folding rate. J. Theor. Biol. 364, 407–417 (2015).

33.

Chou, K.-C. A novel approach to predicting protein structural classes in a (20–1)-D amino acid composition space. Proteins Struct. Funct. Bioinforma. 21, 319–344 (1995).

34.

Chou, K.-C. & Shen, H.-B. FoldRate: A Web-Server for Predicting Protein Folding Rates from Primary Sequence. Open Bioinforma. J. 3, 31–50 (2009).

35.

Shakhnovich, E. Protein Folding Thermodynamics and Dynamics: Where Physics, Chemistry and Biology Meet. Chem. Rev. 106, 1559–1588 (2009).

36.

Ruiz-Blanco, Y. et al. A Hooke’s law-based approach to protein folding rate. J. Theor. Biol. 364, 407–417 (2015).

37.

Breda, A., Valadares, N. F., De Souza, O. N. & Garratt, R. C. Ch A06: Protein Structure, Modelling and Applications. Bioinforma. Trop. Dis. Res. A Pract. Case-Study Approach 1–41, https://doi.org/10.1177/0009922817691536 (2007).

38.

Xu, H. N., Huang, W. N. & He, C. H. Modeling for extraction of isoflavones from stem of Pueraria lobata (Willd.) Ohwi using n-butanol/water two-phase solvent system. Sep. Purif. Technol. 62, 590–595 (2008).

39.

Chou, K.-C. A Key Driving Force in Determination of Protein Structural Classes. Biochem. Biophys. Res. Commun. 264, 216–224 (1999).

40.

Marrero Ponce, Y. et al. Novel 3D bio-macromolecular bilinear descriptors for protein science: Predicting protein structural classes. J. Theor. Biol. 374, 125–137 (2015).

41.

Gromiha, M. & Selvaraj, S. Comparison between long-range interactions and contact order in determining the folding rate of two-state proteins: Application of long-range order to folding rate prediction. J. Mol. Biol. 310, 27–32 (2001).

42.

Zhou, H. & Zhou, Y. Folding Rate Prediction Using Total Contact Distance. Biophys. J. 82, 458–463 (2002).

43.

Munoz, V. & Eaton, W. A. A simple model for calculating the kinetics of protein folding from three-dimensional structures. Proc. Natl. Acad. Sci. 96, 11311–11316 (1999).

44.

Xiao, X., Shao, S.-H., Huang, Z.-D. & Chou, K.-C. Using pseudo amino acid composition to predict protein structural classes: Approached with complexity measure factor. J. Comput. Chem. 27, 478–482 (2006).

45.

Xiao, X., Lin, W.-Z. & Chou, K.-C. Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes. J. Comput. Chem. 29, 2018–2024 (2008).

46.

Xiao, X., Wang, P. & Chou, K.-C. Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image. J. Theor. Biol. 254, 691–696 (2008).

47.

Zhou, X.-B., Chen, C., Li, Z.-C. & Zou, X.-Y. Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J. Theor. Biol. 248, 546–551 (2007).

48.

Zhang, T.-L. & Ding, Y.-S. Using pseudo amino acid composition and binary-tree support vector machines to predict protein structural classes. Amino Acids 33, 623–629 (2007).

49.

Chen, C., Zhou, X., Tian, Y., Zou, X. & Cai, P. Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network. Anal. Biochem. 357, 116–121 (2006).

50.

Chen, W., Feng, P.-M., Lin, H. & Chou, K.-C. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 41, e68–e68 (2013).

51.

Lin, H., Deng, E.-Z., Ding, H., Chen, W. & Chou, K.-C. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 42, 12961–12972 (2014).

52.

Liu, Z., Xiao, X., Qiu, W.-R. & Chou, K.-C. iDNA-Methyl: Identifying DNA methylation sites via pseudo trinucleotide composition. Anal. Biochem. 474, 69–77 (2015).

53.

Hussain, W., Khan, Y. D., Rasool, N., Khan, S. A. & Chou, K.-C. SPalmitoylC-PseAAC: A sequence-based model developed via Chou’s 5-steps rule and general PseAAC for identifying S-palmitoylation sites in proteins. Anal. Biochem. 568, 14–23 (2019).

54.

Hussain, W., Khan, Y. D., Rasool, N., Khan, S. A. & Chou, K.-C. SPrenylC-PseAAC: A sequence-based model developed via Chou’s 5-steps rule and general PseAAC for identifying S-prenylation sites in proteins. J. Theor. Biol. 468, 1–11 (2019).

55.

Khan, Y. D. et al. pSSbond-PseAAC: Prediction of disulfide bonding sites by integration of PseAAC and statistical moments. J. Theor. Biol. 463, 47–55 (2019).

56.

Chou, K.-C. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236–247 (2011).

57.

Nikolić, S., Trinajstić, N., Mihalić, Z. & Carter, S. On the geometric-distance matrix and the corresponding structural invariants of molecular systems. Chem. Phys. Lett. 179, 21–28 (1991).

58.

Marrero Ponce, Y. et al. Protein linear indices of the ‘macromolecular pseudograph α-carbon atom adjacency matrix’ in bioinformatics. Part 1: Prediction of protein stability effects of a complete set of alanine substitutions in Arc repressor. Bioorg. Med. Chem. 13, 3003–3015 (2005).

59.

Ortega-Broche, S. E., Marrero Ponce, Y., Díaz, Y. E., Torrens, F. & Pérez-Giménez, F. tomocomd-camps and protein bilinear indices - novel bio-macromolecular descriptors for protein research: I. Predicting protein stability effects of a complete set of alanine substitutions in the Arc repressor. FEBS J. 277, 3118–3146 (2010).

60.

Todeschini, R. & Consonni, V. New Local Vertex Invariants and Molecular Descriptors Based on Functions of the Vertex Degrees. MATCH - Commun. Math. Comput. Chem. 64, 359–372 (2010).

61.

Balaban, A. Local versus Global (i.e. Atomic versus Molecular) Numerical Modeling of Molecular Graphs. J. Chem. Inf. Comput. Sci. 34, 398–402 (1994).

62.

Barigye, S. J. et al. Relations frequency hypermatrices in mutual, conditional, and joint entropy-based information indices. J. Comput. Chem. 34, 259–274 (2012).

63.

Lin, S. & Lapointe, J. Theoretical and experimental biology in one. Biomed. Sci. Eng. 6, 435–442 (2013).

64.

Di Paola, L., De Ruvo, M., Paci, P., Santoni, D. & Giuliani, A. Protein Contact Networks: An Emerging Paradigm in Chemistry. Chem. Rev. 113, 1598–1613 (2013).

65.

Nelson, D. L. & Cox, M. M. Lehninger Principles of Biochemistry. (Macmillan Learning, 2017).

66.

Gonzalez-Diaz, H., Vilar, S., Santana, L. & Uriarte, E. Medicinal Chemistry and Bioinformatics - Current Trends in Drugs Discovery with Networks Topological Indices. Curr. Top. Med. Chem. 7, 1015–1029 (2007).

67.

Mishra, A., Rana, P. S., Mittal, A. & Jayaram, B. D2N: Distance to the native. Biochim. Biophys. Acta - Proteins Proteomics 1844, 1798–1807 (2014).

68.

Marrero Ponce, Y., González-Díaz, H., Zaldivar, V. R., Torrens, F. & Castro, E. A. 3D-Chiral quadratic indices of the ‘molecular pseudograph’s atom adjacency matrix’ and their application to central chirality codification: classification of ACE inhibitors and prediction of σ-receptor antagonist activities. Bioorg. Med. Chem. 12, 5331–5342 (2004).

69.

Ramos de Armas, R., González Díaz, H., Molina, R. & Uriarte, E. Markovian Backbone Negentropies: Molecular descriptors for protein research. I. Predicting protein stability in Arc repressor mutants. Proteins Struct. Funct. Bioinforma. 56, 715–723 (2004).

70.

Gonzáles-Díaz, H. et al. Markovian chemicals ‘in silico’ design (MARCH-INSIDE), a promising approach for computer-aided molecular design I: discovery of anticancer compounds. J. Mol. Model. 9, 395–407 (2003).

71.

Klein, D. J., Palacios, J. L., Randić, M. & Trinajstić, N. Random Walks and Chemical Graph Theory. J. Chem. Inf. Comput. Sci. 44, 1521–1525 (2004).

72.

Carbó-Dorca, R. Stochastic transformation of quantum similarity matrices and their use in quantum QSAR (QQSAR) models. Int. J. Quantum Chem. 79, 163–177 (2000).

73.

Bonchev, D. Information Theoretic Characterization of Chemical Structures (1983). Series: Chemometrics series. Ed. Research Studies Press. ISBN-10: 0471900877. ISBN-13: 978-0471900870.

74.

Barigye, S. J., Marrero-Ponce, Y., Pérez-Giménez, F. & Bonchev, D. Trends in information theory-based chemical structure codification. Mol. Divers. 18, 673–686 (2014).

75.

Pino, R. W. et al. IMMAN: free software for information theory-based chemometric analysis. Mol. Divers. 19, 305–319 (2015).

76.

Appendix B - The WEKA workbench. In Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition (eds Witten, I. H., Frank, E., Hall, M. A. & Pal, C. J.) 553–571, https://doi.org/10.1016/B978-0-12-804291-5.00024-6 (Morgan Kaufmann, 2017).

77.

Todeschini, R., Consonni, V., Mauri, A. & Pavan, M. MobyDigs: software for regression and classification models by genetic algorithms. Data Handling in Science and Technology 23 (2003).

78.

Tropsha, A., Gramatica, P. & Gombar, V. K. The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR Comb. Sci. 22, 69–77 (2003).

79.

Léger, C., Politis, D. N. & Romano, J. P. Bootstrap Technology and Applications. Technometrics 34, 378–398 (1992).

80.

Chou, K.-C. & Shen, H.-B. REVIEW: Recent advances in developing web-servers for predicting protein attributes. Nat. Sci. 01, 63–92 (2009).

81.

Liu, B. et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 43, W65–W71 (2015).

82.

Chou, K.-C. Impacts of Bioinformatics to Medicinal Chemistry. Curr. Top. Med. Chem. 11, 218–234 (2015).

83.

Chou, K.-C. An Unprecedented Revolution in Medicinal Chemistry Driven by the Progress of Biological Science. Curr. Top. Med. Chem. 17, 2337–2358 (2017).

84.

Zhang, T.-L., Ding, Y.-S. & Chou, K.-C. Prediction protein structural classes with pseudo-amino acid composition: Approximate entropy and hydrophobicity pattern. J. Theor. Biol. 250, 186–193 (2008).

85.

Cai, Y.-D., Liu, X.-J., Xu, X. & Chou, K.-C. Prediction of protein structural classes by support vector machines. Comput. Chem. 26, 293–296 (2002).

86.

Chen, K., Kurgan, L. A. & Ruan, J. Prediction of protein structural class using novel evolutionary collocation-based sequence representation. J. Comput. Chem. 29, 1596–1604 (2008).

## Acknowledgements

Yovani Marrero-Ponce (M.-P., Y) thanks the program Profesor convitado for a post-doctoral fellowship to work at Valencia University in 2018–2019. M.-P., Y also acknowledges the support from USFQ “Chancellor Grant 2017–2018 (Project ID11192)”. C.R.G.J. acknowledges the support from “Consejo Nacional de Ciencia y Tecnología (CONACYT)” for the endowed chair 501/2018 at “Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE)”. F. Javier Torres thanks the USFQ POLY-GRANTS program for financial support. The present study was performed using the resources of the USFQ’s High Performance Computing System (HPC-USFQ).

## Author information

M.-P.Y., G.-J.C., T.J.E. and C.-T.E. proposed the theory of the MuLiMs-MCoMPAs indices, supervised the applications and the design of the GUI, and prepared the manuscript. T.E., J.T.F. and V.-R.R. worked on the definition of the MuLiMs-MCoMPAs indices and on the computational implementation of the API and GUI interfaces, performed the QSAR and other statistical analyses, and prepared the manuscript. All authors read and approved the final manuscript.

Correspondence to Yovani Marrero-Ponce.

## Ethics declarations

### Competing Interests

The authors declare no competing interests.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
