Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction

The use of raw amino acid sequences as input to deep learning models for protein function prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, whereas deep learning models require inputs of the same shape. To accomplish this, zeros are usually appended to each sequence up to an established common length, in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is still unknown. We propose and implement four novel strategies for padding amino acid sequences, and we analyse their impact in a hierarchical Enzyme Commission number prediction problem. Results show that padding affects model performance even when convolutional layers are involved. In contrast to most deep learning work, which focuses mainly on architectures, this study highlights the relevance of the often-overlooked padding step and raises awareness of the need to refine it for better performance. The code of this analysis is publicly available at https://github.com/b2slab/padding_benchmark.
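As a minimal illustration of the standard scheme discussed above, the following sketch appends zeros to integer-encoded amino acid sequences up to a common length. The function name and toy encoding are our own; the paper's actual implementation is in the linked repository.

```python
from typing import List

def zero_pad(sequences: List[List[int]], max_len: int) -> List[List[int]]:
    """Append zeros to each integer-encoded sequence up to max_len
    (standard 'post' zero-padding; longer sequences are truncated)."""
    padded = []
    for seq in sequences:
        seq = seq[:max_len]  # truncate sequences longer than max_len
        padded.append(seq + [0] * (max_len - len(seq)))
    return padded

# Toy example: three sequences of different lengths, common length 6.
batch = zero_pad([[3, 1, 4], [1, 5, 9, 2, 6], [5]], max_len=6)
```

In practice, libraries such as Keras provide equivalent utilities (e.g. `pad_sequences`), but the manual version makes the transformation explicit.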

All linear model tables were created with stargazer v.5.2.2 [1]. Note: *p<0.1; **p<0.05; ***p<0.01.

Table S4: Linear model on F1-score to analyse the effect of changing from the standard dense padding to sparse paddings. The reference levels were omitted.

Table S6: Linear model on F1-score to analyse how enzyme type might affect the performance of some padding types differently. The reference levels were omitted.

In Figure S8, activations seem to be more spread across the PC space than in Figure 3, the analogous representation for task 2. This happens for all types of padding except zoom-padding, whose distribution is very similar to that of zoom-padding for task 2.
In order to quantify the differences observed in the first principal component of the activations of the convolutional filters for task 2 models, we built the explanatory model described by Equation 1. Results are shown in Table S7. All terms are statistically significant, which means there are differences in PC1 according to enzyme type and padding type. The terms for enzyme types 2, 3 and 4 are negative, while those for types 1, 5, 6 and 7 are positive.
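The PC1 scores analysed above can be obtained by projecting the activation matrix onto its first principal component. The sketch below (our own helper, not the paper's pipeline) does this with a plain SVD on the mean-centered matrix.

```python
import numpy as np

def first_pc_scores(activations: np.ndarray) -> np.ndarray:
    """Project each sample (row) of a (samples x filters) activation
    matrix onto its first principal component."""
    centered = activations - activations.mean(axis=0)
    # Right singular vectors of the centered matrix are the PCs.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

# Toy activation matrix: 20 samples, 8 convolutional filters.
rng = np.random.default_rng(0)
pc1 = first_pc_scores(rng.normal(size=(20, 8)))
```

These per-sample PC1 scores would then serve as the response variable of the linear model, with enzyme type and padding type as explanatory factors.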
The selected parameters were drop_per=0.2 and drop_hid=0.5 for all the models; n_1=314, n_2=77, n_3=8 for task 1 and n_1=313, n_2=76 for task 2; pool_size=10 for stack_conv; f1=64 and k1=5 for 1_conv; f2=10, k21=1, k22=3, k23=5, k24=9, k25=15 for stack_conv. As activation functions, we used the Rectified Linear Unit (ReLU) for the hidden layers [2] and softmax for the output layers. The set of values for the number of neurons in the feed-forward part of the model, which is common to the four architectures, is based on the number of enzyme classes and subclasses. There are 7 classes of enzymes according to the first digit of the EC number (8 if we also count non-enzymes), 76 subclasses according to the second digit (77 with non-enzymes) and 313 categories according to the third digit (314 with non-enzymes). Hence the sizes of the feed-forward layers are 314, 77 and 8, respectively. These values aim to have a biological meaning, since deep neural networks are able to extract hierarchical feature representations and enzyme classification has a tree-structured label space.
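The two activation functions mentioned above can be written compactly. This is a minimal NumPy sketch with our own helper names, not the models' actual code.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Rectified Linear Unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def softmax(x: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, shifted for numerical stability."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Example shapes following the EC-derived layer sizes above:
# a hidden layer of n_1 = 314 units and an 8-way softmax output.
hidden = relu(np.random.randn(1, 314))
probs = softmax(np.random.randn(1, 8))
```

The softmax output sums to one per sample, which is what makes it suitable for the mutually exclusive class layers of the EC hierarchy.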

Performance metrics
The description of the metrics used for evaluating and comparing the performance of the different padding types is shown below. Let TP be the number of true positive classified samples, TN the true negatives, FP the false positives and FN the false negatives:

Accuracy = (TP + TN) / (TP + FP + TN + FN).
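The accuracy formula translates directly into code; the helper below is a hypothetical illustration using the confusion-matrix counts defined above.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified samples:
    (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

# e.g. 40 true positives and 45 true negatives out of 100 samples -> 0.85
acc = accuracy(tp=40, tn=45, fp=10, fn=5)
```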
If precision = TP / (TP + FP) and recall = TP / (TP + FN), then the F1-score can be described as

F1 = 2 · precision · recall / (precision + recall).

The macro F1-score calculates the metric for each label and takes their unweighted mean. This does not take label imbalance into account.
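A sketch of the macro F1-score, assuming per-label (TP, FP, FN) counts are available (the function names are ours; scikit-learn's `f1_score` with `average='macro'` is the usual off-the-shelf choice):

```python
from typing import List, Tuple

def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for one label."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(counts: List[Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-label F1 scores (ignores label imbalance)."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in counts]
    return sum(scores) / len(scores)

# Two labels with (TP, FP, FN) counts: the rare second label
# weighs as much as the common first one in the macro average.
score = macro_f1([(90, 10, 10), (1, 1, 1)])
```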
The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve measures classification performance across different decision thresholds. The ROC curve is obtained by plotting the True Positive Rate (which equals the recall) on the y-axis against the False Positive Rate, FP / (FP + TN), on the x-axis. The AUC, the area under this curve, quantifies how capable the model is of distinguishing the two classes.
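The procedure just described can be sketched as follows: sweep every predicted score as a threshold, collect the (FPR, TPR) points, and integrate with the trapezoidal rule. This is our own illustrative function, not the paper's evaluation code (scikit-learn's `roc_auc_score` is the standard tool).

```python
from typing import List

def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC of the ROC curve for binary labels (1 = positive)."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))  # (FPR, TPR) at this threshold
    pts.append((1.0, 1.0))
    # Trapezoidal integration of TPR over FPR.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# A perfect ranker separates both classes completely -> AUC = 1.0.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```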