Amyloidogenic motifs revealed by n-gram analysis

Amyloids are proteins associated with several clinical disorders, including Alzheimer's and Creutzfeldt-Jakob disease. Despite their diversity, all amyloid proteins can undergo aggregation initiated by short segments called hot spots. To find the patterns defining the hot spots, we trained predictors of amyloidogenicity using n-grams and random forest classifiers. Since amyloidogenicity may depend not on the exact sequence of amino acids but on their more general properties, we tested 524,284 reduced amino acid alphabets of different lengths (three to six letters) to find the alphabet providing the best performance in cross-validation. The predictor based on this alphabet, called AmyloGram, was benchmarked against the most popular tools for the detection of amyloid peptides using an external data set and obtained the highest values of performance measures (AUC: 0.90, MCC: 0.63). Our results showed sequential patterns in amyloids that are strongly correlated with hydrophobicity, a tendency to form β-sheets, and lower flexibility of amino acid residues. Among the most informative n-grams of AmyloGram, we identified 15 that were previously confirmed experimentally. AmyloGram is available as a web server (http://smorfland.uni.wroc.pl/shiny/AmyloGram/) and as the R package AmyloGram. R scripts and data used to produce the results of this manuscript are available at http://github.com/michbur/AmyloGramAnalysis.



S2 Quick Permutation Test (QuiPT)
Permutation tests are commonly used for filtering important n-grams, testing the hypothesis that the occurrence of an n-gram and the value of a target are independent. However, exhaustive testing of permutations is computationally expensive and, as a result, often becomes one of the most limiting factors in these kinds of analyses. Therefore, we developed the Quick Permutation Test, which effectively filters n-gram features without requiring exhaustive testing, using the information gain (mutual information) as the criterion of the importance of a specific n-gram. Using QuiPT we selected the most discriminating n-grams extracted from the hexapeptides of the training data set. Again, the counts of n-grams were binarized (1 if the n-gram was present, 0 if absent). Only n-grams with a p-value smaller than 0.05 were assumed to be informative.

Consider a contingency table for a target y and a feature x (Tab. 2). For example, the entry n_{1,0} is the number of cases when the target is 1 and the feature is 0.

target \ feature |    1    |    0    |  total
-----------------|---------|---------|--------
        1        | n_{1,1} | n_{1,0} | n_{1,·}
        0        | n_{0,1} | n_{0,0} | n_{0,·}
      total      | n_{·,1} | n_{·,0} |    n

Under the hypothesis that x and y are independent, the probability of observing such a contingency table is given by the multinomial distribution in which all probabilities depend only on the marginal distributions. The idea of the permutation test is to reshuffle the labels of features and targets while keeping the total numbers of positives for features and targets fixed. When we impose this constraint on the multinomial distribution, the probability of occurrence of a given contingency table depends on only one entry, n_{1,1}, which is fairly easy to compute. After computing the information gain (IG) for each possible value of n_{1,1} ∈ [0, min(n_{·,1}, n_{1,·})], we get the distribution of information gain under the hypothesis that the target and feature are independent.
We reject the null hypothesis of independence if the IG for the tested feature is above the required quantile of the IG distribution.
The analytic formula for the distribution allows the permutation test to be performed much more quickly. Furthermore, we get exact quantiles even for extreme tails of the distribution, which is not guaranteed by random permutations. For example, for a test at the level α = 10^−8, which often occurs after corrections for multiple testing, the standard deviation of the quantile estimate in the permutation test, sqrt(p(1 − p)/m), is roughly equal to α itself even for a very large number of permutations such as m = 10^8. In the context of n-gram data, we can further speed up our algorithm. Note that the test statistic depends only on n_{·,1}, i.e., the number of positive cases of the feature, because the target y is common to all tested n-gram features. Although we test millions of features, there are only a few distributions that we need to compute, because the number of positives in an n-gram feature is usually small. We take advantage of this fact and compute quantiles only for this small number of distributions. Therefore, the complexity of our algorithm is roughly O(n·p), where n and p represent the number of features and the number of positives, respectively.
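The exact null distribution described above can be sketched in a few lines. The sketch below is an illustration in Python rather than the R used in the accompanying package, and the function names are ours, not AmyloGram's: with both margins fixed, n_{1,1} follows a hypergeometric distribution, so we enumerate every admissible value of n_{1,1}, compute its information gain, and sum the probabilities of tables at least as extreme (by IG) as the observed one.

```python
from math import comb, log2

def information_gain(n11, n1_dot, n_dot1, n):
    """IG (mutual information) of a 2x2 table reconstructed from n11
    and the fixed margins n_{1,.} (target positives) and n_{.,1}
    (feature positives)."""
    n10 = n1_dot - n11
    n01 = n_dot1 - n11
    n00 = n - n11 - n10 - n01
    ig = 0.0
    for nij, ni, nj in [(n11, n1_dot, n_dot1),
                        (n10, n1_dot, n - n_dot1),
                        (n01, n - n1_dot, n_dot1),
                        (n00, n - n1_dot, n - n_dot1)]:
        if nij > 0:
            ig += (nij / n) * log2(nij * n / (ni * nj))
    return ig

def quipt_pvalue(n11_obs, n1_dot, n_dot1, n):
    """Exact p-value: P(IG >= observed IG) under independence with
    fixed margins, where n11 is hypergeometric."""
    ig_obs = information_gain(n11_obs, n1_dot, n_dot1, n)
    # admissible range of n11 (the lower bound exceeds 0 only when
    # the margins are large relative to n)
    lo, hi = max(0, n1_dot + n_dot1 - n), min(n1_dot, n_dot1)
    p = 0.0
    for n11 in range(lo, hi + 1):
        prob = comb(n_dot1, n11) * comb(n - n_dot1, n1_dot - n11) / comb(n, n1_dot)
        if information_gain(n11, n1_dot, n_dot1, n) >= ig_obs - 1e-12:
            p += prob
    return p
```

For instance, with n = 100, 50 target positives and 10 feature positives, a feature whose positives all fall in the positive class gets a tiny p-value, while a feature matching the expected count n_{1,1} = 5 gets a p-value of 1.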
Last, let us point out that QuiPT is very similar to Fisher's exact test. From the derivation provided in reference (Lehmann and Romano, 2008) and elsewhere, it becomes clear that QuiPT is a heuristic for the unsolved problem of a two-tailed Fisher's exact test. In this heuristic, the extremity of a contingency table is defined by its information gain.
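For comparison, a two-sided Fisher's exact test on the same 2x2 table can be written with the same hypergeometric machinery; the only difference is the extremity criterion (table probability instead of information gain). This is an illustrative stdlib-only sketch following the usual two-sided convention, not the implementation referenced in the paper.

```python
from math import comb

def hypergeom_pmf(n11, n1_dot, n_dot1, n):
    """P(N11 = n11) under independence with both margins fixed."""
    return comb(n1_dot, n11) * comb(n - n1_dot, n_dot1 - n11) / comb(n, n_dot1)

def fisher_two_sided(n11_obs, n1_dot, n_dot1, n):
    """Two-sided Fisher's exact test: sum the probabilities of all
    tables at most as probable as the observed one."""
    p_obs = hypergeom_pmf(n11_obs, n1_dot, n_dot1, n)
    lo, hi = max(0, n1_dot + n_dot1 - n), min(n1_dot, n_dot1)
    return sum(hypergeom_pmf(k, n1_dot, n_dot1, n)
               for k in range(lo, hi + 1)
               if hypergeom_pmf(k, n1_dot, n_dot1, n) <= p_obs * (1 + 1e-9))
```

On balanced examples the two extremity criteria select the same set of tables, which is why QuiPT tracks Fisher's test so closely.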

S3 Sensitivity and specificity
Sensitivity and specificity of classifiers with various encodings for every possible combination of sequence lengths in the training and testing data sets.
The classifier based on the best-performing encoding always has good specificity and sensitivity. The color of each square is proportional to the number of encodings in its area. Points represent classifiers based on special encodings: the best-performing encoding, the full amino acid alphabet, and two standard encodings, ADEGHKNPQRST, C, FY, ILMV, W (Kosiol et al., 2004) and AG, C, DEKNPQRST, FILMVWY, H (Melo and Marti-Renom, 2006).

S4 Bootstrap confidence intervals for benchmark
We computed 95% confidence intervals for all classifiers by bootstrapping the results of the benchmark (Efron and Tibshirani, 1994). Briefly, predictions returned by the classifiers were sampled with replacement a number of times equal to the total number of predictions. For each bootstrap sample we computed the performance measures. Repeating this procedure 1000 times, we obtained robust estimates of the 95% confidence intervals, adjusted for multiple comparisons using the Dunn-Šidák correction; therefore, significantly different values of performance measures can be distinguished (see Tab. 3 and Fig. 1).
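The resampling scheme can be sketched as follows. This is a minimal Python illustration with a simple accuracy metric standing in for AUC/MCC; the paper's analysis was done in R and additionally tightened the confidence level with the Dunn-Šidák adjustment, which is omitted here.

```python
import random

def bootstrap_ci(labels, predictions, metric, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for a performance metric: resample
    (label, prediction) pairs with replacement, n_boot times, and take
    the alpha/2 and 1 - alpha/2 quantiles of the metric."""
    random.seed(seed)
    pairs = list(zip(labels, predictions))
    stats = []
    for _ in range(n_boot):
        sample = [random.choice(pairs) for _ in pairs]
        ys, ps = zip(*sample)
        stats.append(metric(ys, ps))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

def accuracy(ys, ps):
    """Stand-in performance measure for the sketch."""
    return sum(y == p for y, p in zip(ys, ps)) / len(ys)
```

The interval always brackets the point estimate, and a perfect classifier collapses it to a single value.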

S5 Pairwise sequence identity between training and benchmark data sets
For each peptide from the pep424 data set we computed its pairwise identity to peptides from the benchmark data set. The pairwise sequence identity was defined as follows (Raghava and Barton, 2006):

identity = I / (A + G)

where:
• I: identical positions,
• A: aligned positions,
• G: internal gap positions.
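Assuming the I/(A + G) definition (our reading of the variables listed from Raghava and Barton, 2006), the identity of an already-aligned pair of peptides can be computed as in this Python sketch, where internal gaps are taken, as a simplification, to be gap columns lying between the first and last gap-free columns:

```python
def pairwise_identity(aln1, aln2):
    """Pairwise sequence identity I / (A + G) for one aligned pair.
    aln1, aln2: equal-length aligned sequences with '-' for gaps.
    I: identical positions, A: aligned (gap-free) positions,
    G: internal gap positions (gap columns between the first and
    last gap-free columns; terminal gaps are ignored)."""
    assert len(aln1) == len(aln2)
    cols = list(zip(aln1, aln2))
    core = [i for i, (a, b) in enumerate(cols) if a != '-' and b != '-']
    identical = aligned = gaps = 0
    for a, b in cols[core[0]:core[-1] + 1]:
        if a == '-' or b == '-':
            gaps += 1
        else:
            aligned += 1
            identical += a == b
    return identical / (aligned + gaps)
```

For example, pairwise_identity("MKV-LIV", "MKVQLIV") gives 6/7 ≈ 0.857 (I = 6, A = 6, G = 1); the peptide names and the gap convention are illustrative, not taken from the data sets.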
We discovered that despite high pairwise identity, peptides may have different amyloidogenic properties. In the case of 270 non-amyloidogenic sequences from pep424, over 295 amyloidogenic peptides from the training data set have a pairwise identity of 100%. 149 amyloidogenic peptides from the pep424 data set have a pairwise identity of 100% with only 69 amyloidogenic and 316 non-amyloidogenic sequences in the benchmark data set (2). In conclusion, in the case of amyloid data, high sequence similarity does not reflect similarity of properties.

S6 Jackknife test
To further estimate the bias of AmyloGram, we performed a jackknife procedure. Using the training data created for the benchmark (269 positive and 746 negative sequences), we trained 961 iterations of AmyloGram, each time leaving one sequence out and then predicting the left-out sequence. This yielded AUC = 0.86 and MCC = 0.52. We also trained an iteration of AmyloGram on the full training data set and made predictions on all input sequences, obtaining AUC = 0.97 and MCC = 0.80.
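The leave-one-out scheme itself is generic; a minimal Python sketch, with a trivial majority-class model standing in for the random forest used by AmyloGram, looks like this:

```python
def jackknife_predictions(X, y, train, predict):
    """Leave-one-out: for every sample i, fit on all remaining samples
    and predict the held-out one; returns one prediction per sample."""
    preds = []
    for i in range(len(X)):
        model = train(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(model, X[i]))
    return preds

# Stand-in "model": always predicts the majority class of the training labels.
def train_majority(X, y):
    return 1 if sum(y) * 2 >= len(y) else 0

def predict_majority(model, x):
    return model
```

The collected predictions can then be fed to any performance measure (AUC, MCC) exactly as in the benchmark.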

S7 Amino acid flexibility/rigidity and size
In the light of recent studies, amyloid peptides could form ring-like structures in the core of aggregating oligomers (Dovidchenko et al., 2016). This may require a certain flexibility of the amino acids involved in the core. However, flexibility/rigidity seems to depend on the size or volume of amino acid residues. Therefore, we checked whether the flexibility measure that stood out in our analysis of amyloid regions could result from a correlation with any size-related feature of amino acids, especially one that was not selected for the encodings. We evaluated the correlation of three size-related properties from the AAIndex database with the flexibility measure chosen by our algorithm (Tab. 4 and Fig. 3). It turned out that only bulkiness is significantly, though moderately, correlated with amino acid flexibility (Tab. 4). Bulkiness was not in the set of 17 physicochemical properties used to create the encodings.

We also computed the average values of the flexibility and size-related measures for each peptide from the AmyLoad database. The average size is very similar for amyloid and non-amyloid peptides and is not related to the visible differences in flexibility. This can be seen on the volcano plots, where the distribution of flexibility differentiates the amyloid and non-amyloid peptides (Fig. 5). The same can also be observed on the violin plots representing the distribution of mean values of all properties for amyloids and non-amyloids (Fig. 4). A slight difference can be observed in the volume of residues; however, this feature was explicitly selected for the encodings and is not correlated with the flexibility measure that we used. Finally, bulkiness behaves somewhat differently: it may differentiate amyloid and non-amyloid hot spots, and it is correlated with our flexibility measure. Therefore, we cannot exclude that this size-related measure may contribute to amyloid propensity.