AFP-LSE: Antifreeze Proteins Prediction Using Latent Space Encoding of Composition of k-Spaced Amino Acid Pairs

Species living in extremely cold environments resist freezing through antifreeze proteins (AFPs). Apart from being essential for various organisms living at sub-zero temperatures, AFPs have numerous industrial applications. They bear very little resemblance to one another and cannot be easily identified using simple search algorithms such as BLAST and PSI-BLAST. The diverse AFPs found in fishes (Types I, II, III, and IV, and antifreeze glycoproteins (AFGPs)) are sub-types that show low sequence and structural similarity, making their accurate prediction challenging. Although several machine-learning methods have been proposed for the classification of AFPs, more reliable prediction methods are required. In this paper, we propose a novel machine-learning-based approach for the prediction of AFP sequences using latent space learning through a deep auto-encoder. For latent space pruning, we use the output of the auto-encoder with a deep neural network classifier to learn the non-linear mapping between the protein sequence descriptor and the class label. A comprehensive ablation study is performed, and the proposed method is evaluated in terms of widely used performance measures. In particular, the proposed method achieved a Matthews correlation coefficient of 0.52, an F-score of 0.49, and a Youden's index of 0.81 on an independent test dataset, thereby outperforming the existing methods for AFP prediction.

of AFPs found in fishes namely Type I, II, III, IV and AFGP 15 , have no significant similarities in structures and sequences; rather, they demonstrate some homology to different protein families from which they are assumed to have evolved 18,19 . This inconsistency makes their in-silico identification using conventional search tools such as BLAST 20 and PSI-BLAST 21 unfavorable and increases the complexity of the development of a reliable prediction model due to the lack of common features.
Researchers have proposed several computational strategies such as machine learning to achieve superior results for this diversified classification problem. Kandaswamy et al. proposed a framework named AFP-Pred, which is considered to be a pioneering work in this direction, to utilize machine learning 22 . In this method, a feature vector containing 119 attributes was obtained by encoding each sequence, from which dominant features were selected using the ReliefF approach to train the random forest (RF) classifier. Yu et al. proposed a web-based predictor named iAFP 23 , which utilized n-peptide composition to obtain the feature set. Superior features were selected using the genetic algorithm, and the resultant features were employed to train a support vector machine (SVM). Xiaowei et al. used position-specific scoring matrix (PSSM) profiles with an SVM classifier to develop a web-based AFP predictor called AFP_PSSM 24 . Mondal et al. used the sequence order information from Chou's pseudo amino acid composition (PseAAC) with an SVM to develop an algorithm for AFP prediction (AFP-PseAAC) 25 . Yang et al. developed an ensemble-based learning method named AFP-Ensemble 26 , in which the RF classifier was trained for predicting AFPs. As they performed the evaluation on a non-standard dataset, their results are not discussed in this study. Xiao et al. developed a predictor named iAFP-Ense 27 by incorporating evolutionary information into PseAAC using RF classifiers; however, the classifier was not evaluated on an independent test dataset. Khan et al. performed segmentation of protein sequences to divide them into two groups for amino acid composition (AAC) and di-peptide composition analyses 28 . The dominating features were selected using information gain and ranker methods, and classification was performed using the RF classifier. A web-based predictor for AFPs called CryoProtect 29 is proposed using the RF classifier. 
The predictor used AAC and di-peptide composition as features for the classifier. The classification of AFP from other protein families is an example of a class imbalance problem. A widely adopted technique to deal with the unbalanced dataset is resampling 30 . Simple resampling techniques involve over-sampling, in which records from the minority class are randomly duplicated, and under-sampling, which executes a random removal of some records from the majority class. However, over-sampling has been reported to pose the problem of overfitting 31 and under-sampling leads to the loss of information 32 . To overcome these limitations Nath et al. adopted K-means clustering with ensemble prediction algorithms to predict AFPs 19 .
The aforementioned methods have shown a reasonable improvement in prediction performance. However, there is a need for an improved method to obtain the desired results. In particular, to the best of our knowledge, none of the methods discussed above have achieved a balanced accuracy value of 90% or above on the standard dataset.
In this work, we utilize the composition of k-spaced amino acid pairs (CKSAAP) for the numerical representation of the amino acid sequence, which has been successfully adopted by several researchers to address various prediction problems [33][34][35] . A part of this work was presented in 36 , where we explored the discrimination power of k = 0 to 13-spaced amino acid pairs. More specifically, we observed that a gap of k = 8 provides the best classification performance.
In recent times, deep learning has been used in various bio-informatics applications 37,38 . It has also been very successfully employed for classification problems 39 . The novelty of our work is that, for the first time, a deep-learning-based technique has been proposed for the classification of AFP sequences. As the dataset is significantly small in size and, with k = 8, the number of descriptors of the CKSAAP scheme is 3600, the training of the model becomes an ill-posed problem.
In this paper, we propose a novel machine-learning-based approach using the concept of latent space learning through a task-specific deep auto-encoder. An auto-encoder, generally used for feature compression 40 , is now utilized to perform composite functions, i.e., to extract significant features from the encoding scheme and to perform the prediction task. The auto-encoder is modified to learn minimally redundant and maximally relevant latent space features, and hence, the feature length is drastically reduced. Exploiting only these important attributes, the classifier achieves superior performance.
A thorough ablation study is performed on the model to obtain the optimal values of the hyperparameters and the latent space size. The best model produces superior results on the evaluation parameters, including the Matthews correlation coefficient (MCC), Youden's index, balanced accuracy, and F1 score. The workflow of the proposed method and the ablation studies performed are shown in Fig. 1, and the details are discussed in later sections.

Methods
Evaluation parameters. AFP prediction is considered a classification problem. Accordingly, we use standard threshold-dependent parameters, including sensitivity, specificity, accuracy, MCC, balanced accuracy, Youden's index, and F1 score, to evaluate the performance of the proposed classifier. These parameters can be evaluated using the following equations:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
$$\text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$
$$\text{Youden's index} = \text{Sensitivity} + \text{Specificity} - 1$$
$$\text{F1 score} = \frac{2\,TP}{2\,TP + FP + FN}$$
Here TP, FP, TN, and FN represent true positive (correctly classified AFP), false positive (incorrect classification of non-AFP as AFP), true negative (correctly classified non-AFP), and false negative (incorrect classification of AFP as non-AFP), respectively. Thus, sensitivity indicates the fraction of AFPs correctly classified as AFPs, and specificity indicates the fraction of non-AFPs correctly classified as non-AFPs. Accuracy indicates the ratio of the total number of correctly classified samples to the total number of samples. As the test dataset is highly imbalanced, the parameters that assess the predictor's quality while accounting for the imbalanced distribution of the test data must be emphasized. For example, MCC considers the TP, TN, FP, and FN values and is regarded as a balanced measure even if the test dataset is imbalanced. MCC ranges from −1 to 1, with −1 indicating the worst binary classification and 1 indicating the best binary classification. Furthermore, balanced accuracy, which is defined as the average of the recall obtained on each class, is usually used when the test dataset is imbalanced.
Dataset.
The benchmark dataset 22 is obtained to assess the performance of our approach. The dataset was constructed by initially obtaining 221 AFPs from the Pfam database as seed. A stringent threshold, (E = 0.001), was chosen during the PSI-BLAST to remove any redundancy from the data. A manual check was performed to remove any non-AFPs, and finally, the CD-HIT program was used to reduce the sequence identity to 40%. The total number of proteins in the positive dataset is 481. The negative dataset has 9493 non-AFPs, which do not have overlap with the AFPs. These positive and negative datasets were divided into two subsets for training and testing. For a fair comparison, the subsets are maintained to be quantitatively equal to the subsets used in the previous approaches i.e., 300 AFPs and 300 non-AFPs in the training subset, and 181 AFPs and 9193 non-AFPs in the test subset. The selection of proteins from the dataset was randomized to ensure generalization. Some methods have utilized an imbalanced training dataset to investigate the influence of the number of non-AFPs on the prediction performance 41 . Therefore, to determine the effect of data distribution, we performed an ablation study with 600, 900, and 1200 negative training samples during training while maintaining a constant number of positive samples i.e., 300.
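The subset sizes quoted above can be checked with a short sketch. This is only an illustration of a randomized balanced split under stated assumptions: the function name `split_indices`, the use of integer indices in place of actual sequences, and the seed are all hypothetical and are not taken from the authors' implementation.

```python
import random

def split_indices(n_pos=481, n_neg=9493, n_train_pos=300, n_train_neg=300, seed=0):
    """Randomly split positive/negative sample indices into training and test subsets."""
    rng = random.Random(seed)          # fixed seed only to make the sketch repeatable
    pos = list(range(n_pos))           # stand-ins for the 481 AFP sequences
    neg = list(range(n_neg))           # stand-ins for the 9493 non-AFP sequences
    rng.shuffle(pos)
    rng.shuffle(neg)
    return {
        "train_pos": pos[:n_train_pos],
        "train_neg": neg[:n_train_neg],
        "test_pos": pos[n_train_pos:],
        "test_neg": neg[n_train_neg:],
    }

split = split_indices()
sizes = {name: len(idx) for name, idx in split.items()}
print(sizes)
```

With the default arguments the subset sizes match the balanced configuration described in the text. The text does not state how the test subset is composed when 600, 900, or 1200 negative training samples are used, so the sketch makes no claim about those cases.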
Feature extraction. Composition of k-spaced amino acid pairs. Several machine-learning approaches have been utilized to perform the prediction task for AFPs 28,42 . The fundamental task in developing a computational classification model is the translation of protein sequences into interpretable numerical features; the conversion of a sequence into a numerical vector is therefore indispensable. Various encoding schemes that employ numerous protein features have been developed to extract diverse information from protein sequences. Because an individual feature extraction strategy was believed to capture only partial knowledge of the target 26 , numerous studies have combined multiple feature extraction methods to enhance the classification performance 23,24,26,27 . However, recent studies have observed that a single viable feature extraction method, e.g., CKSAAP, can by itself yield satisfactory prediction performance [43][44][45] . Thus, we utilized the CKSAAP encoding scheme in the AFP-CKSAAP method 36 .
This encoding method emphasizes the significance of amino acid pairs and has been utilized in various classification methods 34,35,46 . The feature vector is obtained by calculating the frequency of amino acid pairs separated by j residues, for j = 0, 1, 2, …, k. The representation is based on the frequency of k-spaced amino acid pairs in a local sequence window. If k = 2, the k-spaced pairs for j = 0, 1, and 2 are considered. For each value of j, the corresponding feature vector F_j, i.e., F_0, F_1, and F_2 as shown in Eqs. (9), (10), and (11), respectively, is evaluated, each having a length of 400. The final feature vector F is computed by concatenating the individual feature vectors as shown in Eq. (12). The value of each descriptor is calculated by dividing the number of occurrences of that amino acid pair by the total number of j-spaced residue pairs, N_j = L − j − 1, where L is the length of the protein sequence. In Fig. 2, only a few windows have been highlighted for the purpose of illustration. However, in practice, all the amino acid pairs are covered in overlapping windows with the respective gap values. It is evident from Eq. (12) and Fig. 2 that the CKSAAP encoding scheme also covers the basic information captured by preceding features, including amino acid composition (AAC), dipeptide composition (DPC), and tripeptide composition (TPC), which have been proven to play a vital role in AFP prediction in earlier studies 22,28,29 .
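The encoding described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `cksaap`, the alphabetical ordering of the 400 pairs, and the choice to skip pairs containing non-standard residues are all assumptions made here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                            # 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]    # 400 ordered pairs
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}

def cksaap(sequence, k=8):
    """Concatenate F_0 ... F_k; the result has 400 * (k + 1) entries."""
    features = []
    length = len(sequence)
    for gap in range(k + 1):
        counts = [0] * len(PAIRS)
        n_pairs = length - gap - 1              # N_j = L - j - 1 residue pairs for this gap
        for i in range(max(n_pairs, 0)):
            pair = sequence[i] + sequence[i + gap + 1]
            idx = PAIR_INDEX.get(pair)
            if idx is not None:                 # assumption: ignore non-standard residues
                counts[idx] += 1
        denom = n_pairs if n_pairs > 0 else 1   # avoid division by zero on short sequences
        features.extend(c / denom for c in counts)
    return features

vector = cksaap("ACDEFGHIKLMNPQRSTVWY", k=8)
print(len(vector))
```

For k = 8 the vector has 400 × 9 = 3600 entries, which matches the descriptor length stated in the text.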
Incremental feature selection. Selection of key representative parameters is important for improving the prediction performance of a classifier. AFP-CKSAAP has been thoroughly evaluated to determine the optimal value of k by manually performing the sequential forward selection method to determine the best-suited feature. The best performance of the classifier was obtained by maintaining the gap value k = 8 36 . It is also evident from the references that an attribute vector obtained from a very large value of k will include redundant features and may not contribute toward prediction 33,47 . Owing to the significance of maintaining this value of k, in this study, we perform all the performance analyses by maintaining the constant gap value of k = 8.
From Eq. (12), it can be inferred that the gap value k = 8 in CKSAAP yields a feature vector of length 3600. In AFP-CKSAAP, we utilized all the features for classification using a deep neural network that produced satisfactory results, outperforming the previously proposed methods by a fair margin. However, when the algorithm is trained with few training samples of large feature dimension, the AFP-CKSAAP algorithm may fail to generalize to new samples. Therefore, in this study, we intend to achieve satisfactory prediction using a reduced number of features. This could be done by dimension reduction using existing methods such as principal component analysis 48 , Gini index 49 , and mutual information 50 . However, auto-encoders have recently also been used effectively for dimension reduction 51,52 . An auto-encoder, which is an unsupervised algorithm, has emerged as a successful neural network framework that learns to represent the input data in far fewer dimensions and regenerates an output approximately similar to the input fed to it. The principal feature of this algorithm is its ability to reconstruct the input using substantially fewer features by constraining the latent space. The properties of the latent space in the auto-encoder make it a favorable candidate for feature compression in this study. The details of the architecture of the auto-encoder and its utilization in this study are discussed in later sections.
Latent space learning for AFP classification. In this study, we design a novel auto-encoder-based classification model for the prediction of AFPs. The proposed model is a combination of an auto-encoder and a classifier. By simultaneously training the auto-encoder and the classifier, we successfully learned a noise-free latent space representation, which is composed of variables that have learned the least redundant and most relevant attributes of the input data.
The architecture of the proposed model is shown in Fig. 3.
Network specifications. Auto-encoder. An auto-encoder is an unsupervised learning algorithm that aims to learn to reproduce the input using fewer dimensions. We propose to use a multilayer auto-encoder architecture that has been regularized to be sparse to generate a compressed latent space. By imposing a sparsity penalty during training, the model learns the most informative and discriminative features for AFP classification from the input data as a byproduct 40 . The architecture is composed of three sections: (i) an encoder with some hidden layers, (ii) a latent space, which represents the encoded input in reduced dimensions by ignoring the noise in the input 53 , and (iii) a decoder that regenerates the input from the latent space variables. The number of hidden layers and the number of neurons in each layer of the encoder and decoder are varied to obtain reasonable performance. In this study, the encoder and decoder are each composed of five layers, including four hidden layers. The number of neurons in the input layer of the encoder is equal to the length of the attribute vector, the number of neurons in the first hidden layer is 50, the numbers of neurons in the second and third hidden layers of the encoder are 25 each, and the fourth hidden layer has 10 neurons. The number of neurons in the latent space is systematically altered to obtain the best performance. The best performance was achieved with four neurons in the latent space. The decoder is the complement of the encoder; this symmetry ensures a smooth encoding and decoding procedure 54 . Therefore, the number of neurons in the first hidden layer of the decoder is equal to that in the last hidden layer of the encoder, and so on, i.e., the numbers of neurons in the first, second, third, and fourth hidden layers of the decoder are 10, 25, 25, and 50, respectively. The latent space represents the learned representative features and is the middle layer of the auto-encoder.
It is shared between the encoder and decoder, serving as the final layer for the encoder and the input layer for the decoder. In the proposed model, the latent space has been regularized to be sensitive to the unique statistical features of the input by adding a regularization term in the loss function. Therefore, the model retrieves the information by using the most discriminative features only, essentially serving the classification task. Thus, the classifier is trained on the dominant features, and the decoder is trained to regenerate the input from the latent variables.
Classifier. The classifier is designed to process the latent space variables generated by the auto-encoder module. For the classification, a similar approach as in AFP-CKSAAP 36 i.e., multilayer perceptron (MLP), is implemented. The architecture of the classifier, as shown in Fig. 3, is composed of three hidden layers and an output layer. The final layer of the encoder, which is the latent space, serves as an input layer for the classifier. Therefore, the input layer of the classifier has 4 neurons, each hidden layer has 10 neurons, and the number of neurons in the output layer is equivalent to the number of classes.
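The layer sizes given in the text can be traced with a plain NumPy forward pass. This is a shape-checking sketch only, not the authors' Keras model: the weights are random and untrained, dropout and the sparsity penalty are omitted, and the helper names (`init_layers`, `forward`) and the choice of two output neurons for the two classes are assumptions made for illustration.

```python
import numpy as np

ENCODER = [3600, 50, 25, 25, 10, 4]     # input, four hidden layers, latent space
DECODER = ENCODER[::-1]                 # mirrored: 4, 10, 25, 25, 50, 3600
CLASSIFIER = [4, 10, 10, 10, 2]         # latent input, three hidden layers, two classes

def init_layers(sizes, rng):
    """Random (weight, bias) pairs for consecutive layer sizes; untrained stand-ins."""
    return [(rng.normal(0.0, 0.05, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, layers, output_activation=relu):
    """Apply ReLU on every layer except the last, which uses output_activation."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        x = output_activation(x) if i == len(layers) - 1 else relu(x)
    return x

rng = np.random.default_rng(0)
encoder = init_layers(ENCODER, rng)
decoder = init_layers(DECODER, rng)
classifier = init_layers(CLASSIFIER, rng)

x = rng.random((5, 3600))                                   # five stand-in CKSAAP vectors
latent = forward(x, encoder)                                # shape (5, 4)
reconstruction = forward(latent, decoder)                   # shape (5, 3600)
probabilities = forward(latent, classifier, softmax)        # shape (5, 2)
print(latent.shape, reconstruction.shape, probabilities.shape)
```

The point of the sketch is that the same four latent values feed both the decoder and the classifier, which is what allows the two modules to be trained together.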
Training method. The model, consisting of two modules, the auto-encoder module and the classifier module as shown in Fig. 3, is trained using Keras (TensorFlow) in Python for 1000 epochs with a variant of the gradient descent algorithm called RMSprop 55 . Each layer of the auto-encoder module uses a rectified linear unit (ReLU) as an activation function to avoid a vanishing gradient. Furthermore, a dropout layer with a rate of 30% is used after each layer for better generalization and to avoid overfitting. For the classification module, ReLU has been used as an activation function for all the layers except the output layer, where the softmax function is used to generate class prediction probabilities.
Figure 3. Architecture of the proposed model for AFP classification. The encoder is composed of an input layer and four hidden layers and embeds the observation to the latent space. The output layer of the encoder is the latent space, connected to the last hidden layer of the encoder, and serves as the input for the decoder and classifier. The decoder is the complement of the encoder and decodes the representation to the original space. The classifier is a fully connected four-layered multilayer perceptron and is tuned to perform the prediction task.
(2020) 10:7197 | https://doi.org/10.1038/s41598-020-63259-2 www.nature.com/scientificreports
The proposed model generates two types of outputs: (i) a decoded feature vector, and (ii) a class label of input protein. For the auto-encoder and classifier modules, we used different loss functions to minimize their respective error values. To train the auto-encoder, we use a mean squared error (MSE) loss function, whereas the classifier module is optimized by minimizing the binary cross entropy between the true class and predicted class labels. The MSE is calculated between the input and decoded feature vectors of the auto-encoder. The results of MSE values for all the auto-encoder models are presented in Table 1.
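The two loss terms can be written out explicitly in NumPy. The text does not state how the reconstruction and classification losses are combined, so the weighted sum below, its weights `w_rec` and `w_cls`, and the function names are assumptions for illustration; the sparsity regularization term applied to the latent space is left out.

```python
import numpy as np

def mse(x, x_hat):
    """Mean squared error between input and decoded feature vectors."""
    return float(np.mean((x - x_hat) ** 2))

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Binary cross entropy between true labels and predicted probabilities."""
    p = np.clip(p_pred, eps, 1.0 - eps)     # keep log() finite
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

def joint_loss(x, x_hat, y_true, p_pred, w_rec=1.0, w_cls=1.0):
    """Assumed combination: weighted sum of reconstruction and classification losses."""
    return w_rec * mse(x, x_hat) + w_cls * binary_cross_entropy(y_true, p_pred)

x = np.array([[0.2, 0.0, 0.8]])
y = np.array([1.0, 0.0])
uninformative = np.array([0.5, 0.5])
print(joint_loss(x, x, y, uninformative))   # perfect reconstruction, chance-level classifier
```

With a perfect reconstruction and chance-level class probabilities, the total reduces to the cross-entropy term alone, which equals ln 2.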

Results
Herein, we present the results of the experiments performed for the evaluation of the model. The training dataset is randomly divided into two subsets, i.e., training and validation, with the ratio of 90:10, i.e., out of 600 samples, 540 samples were used for training and 60 samples for validation. We used early stopping with the patience of 50 epochs to avoid overfitting, and we stopped the training if the model stopped improving. The metric in the early stopping was validation loss, and the training was stopped at approximately 700 epochs. The best model was obtained by performing the ablation study, the details of which are discussed later in the text.
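The stopping rule described above can be illustrated in plain Python. This sketch mimics patience-based early stopping on a recorded list of validation losses; the function name and return convention are hypothetical, and it is not the Keras callback the authors presumably used.

```python
def early_stopping_epoch(val_losses, patience=50):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has not improved for `patience` epochs.
    """
    best = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch      # new best validation loss
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch            # patience exhausted
    return len(val_losses) - 1, best_epoch      # ran through every recorded epoch

n_train, n_val = int(600 * 0.9), int(600 * 0.1)            # 90:10 split of 600 samples
toy_losses = [1.0, 0.9, 0.8] + [0.85] * 10                 # improvement ends at epoch 2
print(n_train, n_val, early_stopping_epoch(toy_losses, patience=3))
```

In the toy example a patience of 3 is used so the behaviour is easy to follow; the study used a patience of 50 epochs.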
Ablation study. In this work, we perform an ablation study to obtain a simple overall architecture. This is motivated by the fact that the latent space is sparsely populated. This sparse space eliminates redundancies and determines the compression factor that can be reached. To this end, a benchmark architecture is evaluated with various modifications in the design, and the performance of each model is observed. One must choose an optimal number of neurons in the latent space so that the feature vector is significantly reduced while the decoder is still able to regenerate the input using these features. Furthermore, the latent space serves as the input layer of the classifier network, which makes it crucial. Considering the significance of the latent variables, in this study, we evaluated the models with a varying number of latent space variables. Additionally, we intended to observe the behavior of the model with respect to the data distribution in the training dataset. The existing studies, with some exceptions, have been conducted on the balanced training dataset of the benchmark data. For a fair comparison, we used a similar configuration of the training and test datasets. However, to evaluate the robustness of the proposed method, we also train it using an unbalanced dataset.
Effect of latent variables. In the first ablation study, we observe the effect of varying the number of variables in the latent space while maintaining a constant balanced data distribution for training. Since the latent space is sparsely populated, it satisfies the limitation on the compression factor. Therefore, we start the evaluation with 25 latent space variables (LV). The number of latent variables is then systematically reduced in steps of 5 neurons, and each configuration is evaluated. After evaluating the performance of the model for LV = 5, the latent space variables are further reduced one at a time. For each configuration, 20 simulation runs are performed, and the values of the statistical parameters such as MCC, Youden's index, balanced accuracy, F1 score, and MSE are observed. The mean values of Youden's index and the MSE for the reconstruction error are depicted in Figs. 4 and 5, respectively.
Effect of data distribution. Another ablation study was performed to observe the sensitivity of the model to the training data distribution. To this end, AFPs and non-AFPs were combined in three distinct subsets having AFP to non-AFP ratios of 1:1, 1:2, and 1:3. Additionally, the effect of the latent space variables on the data distribution was considered; therefore, the training was performed on incremental latent space variables. Yang et al. studied the effect of an imbalanced training dataset and reported that their classifier does not handle the imbalanced data and assigns most of the samples to the majority class 26 ; their results are therefore not appreciable. However, the proposed classifier (AFP-LSE) tends to learn further motif information when the number of training samples is increased. The appreciable values of the performance metrics in Table 1 suggest that the performance of the classifier can be improved by utilizing the supplementary information from the negative class. As the availability of AFP datasets is limited, previous studies have been conducted on a small balanced dataset. Therefore, for comparison, we report the performance of the classifier trained using similar configurations.
Performance evaluation and comparison with contemporary methods. After an analysis of the results obtained from the ablation study performed to determine the optimal parameters and the size of the latent space, the best model is selected as the classifier for AFPs and is named AFP-LSE. The model is trained with CKSAAP-encoded samples with k = 8, with the number of latent space variables LV = 4, and with a 1:1 ratio of AFPs to non-AFPs in the training dataset. The model is evaluated on an independent test dataset, and its results on the statistical parameters are better than those obtained by the previously reported methods.
This study evaluates the performance of the classifier on the parameters reflecting the true efficacy of the classifier by considering the imbalanced condition of the training and testing datasets. Therefore, we emphasize the parameters MCC, balanced accuracy, and Youden's index due to their insensitivity toward imbalance in classes. The best model showed the MCC value of 0.52, balanced accuracy of more than 90%, and Youden's index value of 0.81. The performance of AFP-LSE is compared with those of the existing methods as shown in Table 2. Based on the prediction results, AFP-LSE achieved superior performance on all the statistical measures. Particularly, improvements of approximately 2% and 5% in the balanced accuracy and Youden's index, respectively, were observed when compared with the corresponding values for the best classifier in the literature i.e., CryoProtect 29 . Similarly, the best values of the MCC and F-score were demonstrated by AFP_PSSM 24 , whereas the proposed classifier shows improvements of approximately 52% and 68%, respectively, for the aforementioned parameters.
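The measures emphasized above can all be computed from the four confusion-matrix counts using their standard definitions. The sketch below uses made-up counts purely for illustration; they are not results from this study, and the function name is hypothetical.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Standard threshold-dependent measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "youden": sensitivity + specificity - 1,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Hypothetical counts for an imbalanced test set (not taken from the paper).
metrics = classification_metrics(tp=90, fp=100, tn=800, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because Youden's index equals twice the balanced accuracy minus one, the reported index of 0.81 corresponds to a balanced accuracy of 0.905, in line with the statement that it exceeds 90%.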
Prediction of novel AFP candidates. Considering the extreme rarity of AFPs within entire organism proteomes, herein, we perform the screening of novel AFP candidate proteins. An independent dataset containing 10 candidate AFPs was obtained from the INTERPRO 56 database. The sequences in this independent test dataset were not present in the positive or negative datasets of AFP-LSE. The prediction results of AFP-LSE were compared with those of a PSI-BLAST search of the UNIPROT 57 and SWISSPROT 58 databases at E = 0.1. AFP-LSE predicted 9 proteins as AFPs and only 1 protein as non-AFP. Interestingly, the same protein is also classified as non-AFP by PSI-BLAST. Compared with AFP-LSE, PSI-BLAST retrieved only 4 out of 10 candidate sequences as AFPs, as shown in Table 3. The NCBI database annotated 4 out of 10 sequences as hypothetical or unnamed proteins; a further three were characterized as Type I antifreeze or AFP-like domain-containing proteins, whereas the annotations of the remaining three are shown in Table 3. The performance of AFP-LSE suggests that it can be effectively utilized for the annotation of hypothetical proteins.

Discussion
Due to the limited availability of AFP samples, the available dataset is skewed; therefore, the classification of AFPs from non-AFPs poses a class imbalance problem, which is challenging for machine-learning algorithms 59 . In addition to this class imbalance, some AFP sub-types are rare: within the AFP class, only a few sub-types are abundant, which leads to intra-class imbalance and introduces outlier artifacts when designing a reliable classifier. In contrast, in typical classification problems, e.g., the prediction of lysine acetylation sites in proteins or the identification of protein-protein binding sites, a substantially large number of positive and negative samples is available in the datasets; hence, they do not suffer from the problem of class imbalance or intra-class variation 33,60,61 . Another challenge faced in the classification of AFPs is the variation in the sequences of AFPs, which subsequently produces features with low inter-class and high intra-class variance. These inevitable phenomena are consequences of the similarity exhibited by AFPs with the different protein families from which they are assumed to have evolved 18,19 and of the low sequence similarity among different AFPs. The principal component analysis (PCA) projection of the CKSAAP features, which is discussed later in the text, provides explicit evidence in Fig. 6(a,b) that AFPs and non-AFPs overlap, suggesting that the development of an AFP classifier using linear methods is an arduous task.
For an insightful understanding of CKSAAP representation-based classification of AFPs using the given dataset, we present a comparison of the PCA and AFP-LSE methods. For visual assessments, the data were projected on two dimensions utilizing the top two eigenvectors in the case of PCA and two latent spaces in the case of AFP-LSE. As shown in Fig. 6(c), the proposed non-linear auto-encoder-based latent space encoding (AE-LSE) presents superior learning capabilities and maps the AFPs and non-AFPs in separate regions in contrast to the linear unsupervised sub-space learning method of PCA depicted in Fig. 6(a), which fails to do so, revealing that both classes are inseparable in a linear sense.
The same eigenvectors and latent space, obtained during training from PCA and AE-LSE, respectively, are then utilized to project the test data. Differences in the mapping capabilities for AFPs can be observed for the PCA and AE-LSE methods in Fig. 6(b,d), respectively. It can be observed in the bottom right of Fig. 6(d) that the AE-LSE method forms clusters of AFP samples. Although there is some overlap with non-AFPs, the overall separability of the data projected through the AE-LSE method is better than that of the data linearly projected by PCA, indicating that the discovery of unknown groups using PCA is difficult. This helps in understanding the working principle of the proposed method and the motivation for the development of non-linear auto-encoder-based learning of the latent space.
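The PCA side of this comparison, fitting on the training data and reusing the same top two eigenvectors for the test data, can be sketched with NumPy. The data here are random stand-ins with a small feature dimension for speed, and the function names are hypothetical; the study itself projects 3600-dimensional CKSAAP vectors.

```python
import numpy as np

def fit_pca(x_train, n_components=2):
    """Return the training mean and the top principal directions (as columns)."""
    mean = x_train.mean(axis=0)
    centered = x_train - mean
    # Rows of vt are eigenvectors of the covariance matrix, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components].T

def project(x, mean, components):
    """Project samples onto principal directions learned from the training data."""
    return (x - mean) @ components

rng = np.random.default_rng(0)
x_train = rng.normal(size=(60, 30))     # stand-in for training feature vectors
x_test = rng.normal(size=(10, 30))      # stand-in for test feature vectors

mean, components = fit_pca(x_train)
train_2d = project(x_train, mean, components)
test_2d = project(x_test, mean, components)    # same eigenvectors reused for test data
print(train_2d.shape, test_2d.shape)
```

Reusing the training mean and eigenvectors for the test data keeps the two scatter plots in the same coordinate system, which is what makes the training and test panels comparable.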
The proposed method can contribute toward the design of a superior mapping function, resulting in a reduction of dimensions while retaining the information that separates the AFP from the non-AFP samples. Recently, many researchers have shown interest in auto-encoder-based models 62 . However, to the best of our knowledge, no auto-encoder-based classifier has been proposed for the classification of protein sequences. The proposed model can be used for the prediction of other types of proteins as well, for instance, bioluminescent proteins (BLPs) 63 and extracellular matrix (ECM) proteins 64 . In particular, it can be utilized for dimensionality reduction in highly non-linear classification problems where the number of attributes exceeds the number of training samples. To avoid overfitting, we used regularization techniques such as dropout and batch normalization in this study. For future studies, we recommend a transfer learning approach in which the AFP-LSE model is first trained on a closely related classification task and later fine-tuned on the AFP dataset. However, transfer learning and other training strategies are beyond the scope of this study. The Python implementation of the proposed algorithm has been made public, and interested users can apply the algorithm to their problem of interest. The algorithm is available at https://github.com/Shujaat123/AFP-LSE. In the near future, we would like to explore auto-encoder-based classifiers further for other bioinformatics problems.

Conclusion
The prediction of AFPs is a challenging classification problem owing to the unavailability of a substantial dataset and the inherent diversity in their sequences and structures, and it has been addressed by various researchers. In the proposed prediction method, each protein sequence was encoded using CKSAAP with k = 8. The results of our previous study showed that these features can significantly contribute to the classification performance. For classification, we proposed a novel machine-learning-based method for AFP prediction. The method uses an auto-encoder for feature compression, and these reduced features are used to train the neural-network-based classifier. A comparison of the proposed non-linear mapping method with the linear projection approach of PCA demonstrated the superior classification capabilities of the proposed method. A comprehensive ablation study was performed for a better understanding of the effect of the latent space variables as well as the impact of the training data distribution, and widely used performance measures were evaluated. The method yields excellent classification results on the benchmark dataset, outperforming the existing methods, particularly yielding an MCC value of 0.52 with a Youden's index of 0.81.