PrESOgenesis: A two-layer multi-label predictor for identifying fertility-related proteins using support vector machine and pseudo amino acid composition approach

Successful spermatogenesis and oogenesis are the two genetically independent processes preceding embryo development. To date, several fertility-related proteins have been described in mammalian species. Nevertheless, further studies are required to discover more proteins associated with germ cell development and embryogenesis in order to shed more light on these processes. This work builds on our previous software (OOgenesis_Pred) and extends its algorithms to predict new fertility-related proteins and their classes (embryogenesis, spermatogenesis and oogenesis) using a support vector machine with features based on Chou’s pseudo-amino acid composition. The results of five-fold cross-validation and of an independent test demonstrated that the method predicts fertility-related proteins and their classes with an accuracy of more than 80%. Moreover, feature selection methods identified important properties of fertility-related proteins that allow their accurate classification. Based on the proposed method, a two-layer classifier named “PrESOgenesis” (https://github.com/mrb20045/PrESOgenesis) was developed. At the first layer, the tool identifies a query sequence (protein or transcript) as a fertility-related or non-fertility-related protein; at the second layer, it classifies each predicted fertility-related protein as embryogenesis, spermatogenesis or oogenesis.

disregard the sequence order information. As mentioned above, the DC offers 400 descriptors, calculated as [2]:

$$F(r, s) = \frac{N_{rs}}{N_{\text{total}}}, \quad r, s = 1, 2, \ldots, 20$$

where $N_{rs}$ is the number of dipeptides represented by amino acid type r and type s, and $N_{\text{total}}$ is the total number of all possible dipeptides.
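The dipeptide composition can be illustrated with a minimal Python sketch. It assumes that the denominator is the number of overlapping dipeptides in the sequence (N − 1 for a sequence of length N) and that pairs containing non-standard residues are skipped; the function name `dipeptide_composition` is hypothetical and is not part of PrESOgenesis.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def dipeptide_composition(sequence):
    """Return a dict mapping each of the 400 dipeptides to its fraction.

    Assumption: the denominator is the number of overlapping dipeptides,
    i.e. len(sequence) - 1.
    """
    seq = sequence.upper()
    total = len(seq) - 1
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    dc = dipeptide_composition("ACAC")
    print(len(dc), dc["AC"], dc["CA"])
```

Returning all 400 keys, including those with zero frequency, keeps the feature vector at a fixed length regardless of the query sequence.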

Autocorrelation features
Autocorrelation features are based on the distribution of amino acid properties along the sequence.
In this study, we used eight amino acid properties to evaluate these features: normalized average hydrophobicity scales, average flexibility indices, polarizability parameter, free energy of solution in water, residue accessible surface area in tripeptide, steric parameter, relative mutability and residue volume. Three types of autocorrelation features are defined here and described below.
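Normalized Moreau-Broto autocorrelation descriptors can be computed as:

$$ATS(d) = \frac{1}{N - d} \sum_{i=1}^{N-d} P_i P_{i+d}$$

Moran autocorrelation descriptors applied to a protein sequence can be calculated as:

$$I(d) = \frac{\frac{1}{N - d} \sum_{i=1}^{N-d} (P_i - \bar{P})(P_{i+d} - \bar{P})}{\frac{1}{N} \sum_{i=1}^{N} (P_i - \bar{P})^2}$$

where d is the lag of the autocorrelation, N is the sequence length, $P_i$ and $P_{i+d}$ are the values of the considered property for the amino acids at positions i and i + d, respectively, and $\bar{P}$ is the average of the considered property P along the sequence, i.e.,

$$\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$$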
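The two autocorrelation descriptors named above can be sketched in Python as follows. The sketch assumes the commonly used forms stated in the comments and takes as input a list of property values already mapped onto the residues; the function names and the toy values are hypothetical and do not correspond to any of the eight property scales used in this study.

```python
def moreau_broto(props, lag):
    """Normalized Moreau-Broto autocorrelation (assumed form):
    ATS(d) = (1 / (N - d)) * sum_{i=1..N-d} P_i * P_{i+d}
    """
    n = len(props)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(props)")
    return sum(props[i] * props[i + lag] for i in range(n - lag)) / (n - lag)


def moran(props, lag):
    """Moran autocorrelation (assumed form):
    I(d) = [(1 / (N - d)) * sum (P_i - mean) * (P_{i+d} - mean)]
           / [(1 / N) * sum (P_i - mean)^2]
    Raises ZeroDivisionError if the property is constant along the sequence.
    """
    n = len(props)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(props)")
    mean = sum(props) / n
    numerator = sum(
        (props[i] - mean) * (props[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    denominator = sum((p - mean) ** 2 for p in props) / n
    return numerator / denominator


if __name__ == "__main__":
    toy_values = [1.0, 2.0, 3.0, 4.0]  # hypothetical property values
    print(moreau_broto(toy_values, 1), moran(toy_values, 1))
```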

Composition, transition and distribution features (CTD)
These descriptors are [...] The distribution descriptor represents the distribution of every property along the sequence. For each encoded group of every attribute, five distribution descriptors are calculated: the positions, as a percentage of the sequence length, of the first residue, 25% residue, 50% residue, 75% residue, and 100% residue of that group, respectively.
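A minimal Python sketch of the distribution descriptor for one encoded group is given below. It assumes that the sequence has already been encoded into group labels for a single attribute, that the k-th percentile residue is taken as occurrence number floor(fraction × count) with a minimum of 1, and that positions are reported as a percentage of sequence length. Rounding conventions differ between implementations, so this is one possible reading and not necessarily the convention used in this work; the function name and the encoded string are hypothetical.

```python
import math


def distribution_descriptor(encoded, group):
    """Return the five distribution values for one group label.

    `encoded` is a string (or list) of group labels, one per residue.
    Each value is the position of the first, 25%, 50%, 75% and 100%
    occurrence of `group`, expressed as a percentage of sequence length.
    """
    positions = [i + 1 for i, label in enumerate(encoded) if label == group]
    if not positions:
        return [0.0] * 5  # group absent from the sequence
    count = len(positions)
    values = []
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        # 1-based index of the occurrence marking this fraction (assumed rule)
        k = max(1, math.floor(fraction * count))
        values.append(100.0 * positions[k - 1] / len(encoded))
    return values


if __name__ == "__main__":
    # Hypothetical encoding of an 8-residue sequence into groups 1, 2, 3
    print(distribution_descriptor("12213311", "1"))
```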

Conjoint Triad descriptors
Conjoint triad descriptors were developed by Shen et al. (2007). They summarize the attributes of protein pairs according to a classification of the amino acids. In this way, every protein sequence is represented by a vector of amino acid triad descriptors. To reduce the dimension of the vector space, the 20 amino acids were clustered into groups according to the dipoles and volumes of their side chains. The conjoint triad descriptors are computed as follows:

$$d_i = \frac{f_i - \min\{f_1, f_2, \ldots, f_{343}\}}{\max\{f_1, f_2, \ldots, f_{343}\}}$$

where V is the vector space of the sequence features; each feature ($v_i$) represents a triad composed of three consecutive amino acids; F is the frequency vector corresponding to V, and the value of the i-th dimension of F ($f_i$) is the frequency with which the $v_i$-triad appears in the protein sequence [2].
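The conjoint triad computation can be sketched in Python as below. The 343 dimensions imply seven amino acid classes (7 × 7 × 7 = 343); the class membership used here is the grouping commonly attributed to Shen et al. (2007) and should be treated as an assumption, as should the normalization d_i = (f_i − min f) / max f written in the comments. The function name is hypothetical.

```python
from itertools import product

# Assumed seven-class grouping (by side-chain dipole and volume)
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {
    aa: str(index + 1) for index, group in enumerate(CLASSES) for aa in group
}


def conjoint_triad(sequence):
    """Return a dict of 343 normalized triad frequencies.

    Normalization (assumed form): d_i = (f_i - min(f)) / max(f).
    Non-standard residues are dropped before triads are counted, which
    makes their neighbours adjacent; a stricter version could reject them.
    """
    encoded = [AA_TO_CLASS[aa] for aa in sequence.upper() if aa in AA_TO_CLASS]
    freqs = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(encoded) - 2):
        freqs["".join(encoded[i:i + 3])] += 1
    f_min = min(freqs.values())
    f_max = max(freqs.values())
    if f_max == 0:  # sequence shorter than three residues
        return {triad: 0.0 for triad in freqs}
    return {triad: (f - f_min) / f_max for triad, f in freqs.items()}


if __name__ == "__main__":
    descriptors = conjoint_triad("AGVAGVKKDE")
    print(len(descriptors), descriptors["111"])
```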

Quasi-sequence-order descriptors
Sequence-order descriptors were developed by Chou (2000). They are calculated from the distance matrix between each pair of the 20 amino acids.

Sequence-order-coupling number
The d-th rank sequence-order-coupling number is calculated as:

$$\tau_d = \sum_{i=1}^{N-d} (d_{i,i+d})^2, \quad d = 1, 2, \ldots, \text{maxlag}$$

where $d_{i,i+d}$ is the distance between the two amino acids at positions i and i + d, and N is the sequence length. Two physicochemical distance matrices were used, Schneider-Wrede and Grantham, with maxlag = 30, so a total of 30 × 2 = 60 sequence-order-coupling features were extracted.
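A minimal Python sketch of the sequence-order-coupling numbers follows, assuming the form τ_d = Σ (d_{i,i+d})² summed over i = 1, ..., N − d. The distance function is passed in as a parameter; the toy distance below is hypothetical and stands in for a real physicochemical distance matrix, whose values are not reproduced here.

```python
def sequence_order_coupling(sequence, distance, maxlag):
    """Return [tau_1, ..., tau_maxlag] for the given sequence.

    `distance(a, b)` must return the distance between residues a and b.
    Assumed form: tau_d = sum_{i=1..N-d} distance(R_i, R_{i+d}) ** 2.
    """
    n = len(sequence)
    if not 0 < maxlag < n:
        raise ValueError("maxlag must satisfy 0 < maxlag < len(sequence)")
    return [
        sum(distance(sequence[i], sequence[i + d]) ** 2 for i in range(n - d))
        for d in range(1, maxlag + 1)
    ]


# Hypothetical one-dimensional scale used only to build a toy distance
TOY_SCALE = {"A": 0.0, "C": 1.0, "D": 3.0}


def toy_distance(a, b):
    """Toy distance: absolute difference on the hypothetical scale."""
    return abs(TOY_SCALE[a] - TOY_SCALE[b])


if __name__ == "__main__":
    print(sequence_order_coupling("ACDA", toy_distance, 3))
```

Running the same function once per distance matrix with maxlag = 30 would give the 2 × 30 = 60 features described above.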

Quasi-sequence-order descriptors
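For each amino acid type, a quasi-sequence-order descriptor can be calculated as:

$$X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w \sum_{d=1}^{\text{maxlag}} \tau_d}, \quad r = 1, 2, \ldots, 20$$

Here, $f_r$ is the normalized occurrence of amino acid type r, w is the weighting factor, and $\tau_d$ is the d-th rank sequence-order-coupling number.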
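The quasi-sequence-order descriptor for the 20 amino acid types can be sketched in Python as below, assuming the form X_r = f_r / (Σ f_r + w Σ τ_d). The sequence-order-coupling numbers are supplied as a precomputed list, and the weighting factor is left as a parameter because its value is not stated here; the function name and the example inputs are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def quasi_sequence_order(sequence, taus, weight):
    """Return a dict mapping each amino acid type r to X_r.

    Assumed form: X_r = f_r / (sum_r f_r + weight * sum_d tau_d),
    where f_r is the occurrence of type r divided by sequence length
    and `taus` holds the sequence-order-coupling numbers tau_1..tau_maxlag.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    f = {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}
    denominator = sum(f.values()) + weight * sum(taus)
    return {aa: f[aa] / denominator for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Hypothetical coupling numbers and weighting factor
    x = quasi_sequence_order("AACD", taus=[1.0, 2.0], weight=0.5)
    print(x["A"], x["C"], x["D"])
```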