A signal processing method for alignment-free metagenomic binning: multi-resolution genomic binary patterns

Algorithms in bioinformatics use textual representations of genetic information: sequences of the characters A, T, G and C, represented computationally as strings or sub-strings. Signal and related image processing methods offer a rich source of alternative descriptors, as they are designed to work in the presence of noisy data without the need for exact matching. Here we introduce a method, multi-resolution local binary patterns (MLBP), adapted from image processing to extract local ‘texture’ changes from nucleotide sequence data. We apply this feature space to the alignment-free binning of metagenomic data. The effectiveness of MLBP is demonstrated using both simulated and real human gut microbial communities. Sequence reads or contigs can be represented as vectors and their ‘texture’ compared efficiently using machine learning algorithms that perform dimensionality reduction, to capture eigengenome information, and clustering (here using randomized singular value decomposition and BH-tSNE). The intuition behind our method is that the MLBP feature vectors permit sequence comparisons without the need for explicit pairwise matching. We demonstrate that this approach outperforms existing methods based on k-mer frequencies. The signal processing method, MLBP, thus offers a viable alternative feature space to textual representations of sequence data. The source code for our Multi-resolution Genomic Binary Patterns method can be found at https://github.com/skouchaki/MrGBP.
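To make the idea of a local ‘texture’ descriptor concrete, the following is a minimal sketch of a one-dimensional local binary pattern feature extractor. It is not taken from the MrGBP repository. The integer mapping (T=0, C=1, A=2, G=3), the "neighbour >= centre" thresholding rule, the normalisation of each histogram, and the function names are all assumptions made for illustration; the window lengths p = 2, 4, 6, 8 follow the values mentioned in the Fig. 2 caption.

```python
from collections import Counter

# Assumed integer mapping; the source only states that an integer
# representation is the default, not which values are used.
INTEGER_MAP = {"T": 0, "C": 1, "A": 2, "G": 3}


def to_signal(seq):
    """Map nucleotides to integers, skipping characters outside A/C/G/T."""
    return [INTEGER_MAP[b] for b in seq.upper() if b in INTEGER_MAP]


def lbp_codes(signal, p):
    """Return one binary code per position, using p neighbours (p/2 per side)."""
    if p <= 0 or p % 2:
        raise ValueError("p must be a positive even number")
    half = p // 2
    codes = []
    for i in range(half, len(signal) - half):
        centre = signal[i]
        neighbours = signal[i - half:i] + signal[i + 1:i + half + 1]
        code = 0
        for bit, value in enumerate(neighbours):
            if value >= centre:  # assumed thresholding rule
                code |= 1 << bit
        codes.append(code)
    return codes


def lbp_histogram(signal, p):
    """Normalised histogram over the 2**p possible codes (normalisation assumed)."""
    codes = lbp_codes(signal, p)
    counts = Counter(codes)
    total = len(codes) or 1
    return [counts.get(c, 0) / total for c in range(2 ** p)]


def mlbp_features(seq, window_lengths=(2, 4, 6, 8)):
    """Concatenate the histograms for each window length into one vector."""
    signal = to_signal(seq)
    features = []
    for p in window_lengths:
        features.extend(lbp_histogram(signal, p))
    return features
```

With these window lengths the feature vector has 4 + 16 + 64 + 256 = 340 entries, independent of the sequence length, which is what allows reads or contigs of different lengths to be compared as fixed-length vectors.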

Fig. 2 Visualisation of the simulated metagenomic community by considering various MLBP window lengths (a) p = 2, (b) p ≤ 4, (c) p ≤ 6, and (e) p ≤ 8. Each colour represents a different species (see Figure 1 for key) in the left-hand side figures and a cluster defined by our approach in the right-hand side figures.

4-mer as feature space
We provide a direct comparison of our results with 4-mers as the feature space in our pipeline. The results are shown for the 10 and 100 genome datasets.
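For reference, a 4-mer feature vector of the kind used in this comparison can be sketched as follows. This is an illustrative implementation, not the code used in the pipeline: the function name is hypothetical, k-mers containing characters other than A, C, G and T are simply skipped, and reverse-complement k-mers are counted separately (some binning tools merge them, which is not done here).

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalised k-mer frequencies in lexicographic A/C/G/T order."""
    index = {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # None for k-mers with ambiguous bases
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

For k = 4 this gives a 256-dimensional vector per read or contig, which can be passed to the same dimension reduction and clustering steps as the MLBP features.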

Real Human Gut Metagenomic Data
Supplementary Figure 3 Longitudinal abundance patterns of the 19 identified clusters, see Figure 3. The associated species or groups of species are indicated for each cluster. The x-axis corresponds to the longitudinal sampling over nine days 1. The y-axis corresponds to normalised read coverage. The red box indicates the correlated clusters across longitudinal samples.
Supplementary Figure 4 Each colour shows a cluster defined by our approach in the right-hand side figure.
Computing input similarities...
dimension reduction running time: 35.5877
clustering time: 12.056
-out: 12 clusters has been generated.

Parameter Setting:
1. Numerical representation of nucleotide sequences: the choice of data representation can affect the results (Supplementary Figure 1 and Table 2). The integer representation performs best and has been set as the default in the online code.
2. MLBP window length, which determines the feature length: a longer feature vector may improve the performance, as in other applications. However, increasing the feature length increases the computational complexity and run time (Supplementary Figure 2 and Table 3).
3. Dimension reduction steps: (1) SVD, considering various numbers of eigen factors (Figure 4 and Table 4), and (2) BH-tSNE, where the default parameters have been used, except that keeping 3 dimensions can improve the results for more complex data. A review of BH-tSNE parameter settings can be found at: http://distill.pub/2016/misread-tsne/
4. DBSCAN parameters: DBSCAN has two parameters, (1) epsilon, which indicates how close the points of a cluster must be to each other, and (2) minPts, the minimum number of neighbours a point should have to be assigned to a cluster. These values are usually not known prior to analysis, and there are several ways to select them. One way is to calculate the distance of each point to its nearest neighbour and use the histogram of these distances to select epsilon. After selecting epsilon, a histogram of the number of neighbours of each point within that epsilon can be obtained. Some of the samples do not have enough neighbouring points and can be considered as noise. An implementation of this parameter selection is included in spark_dbscan (https://github.com/alitouka/spark_dbscan). Here, we use minPts = 8 and epsilon = 0.02; to obtain more clusters, our suggestion is to reduce epsilon. Visualisation can help to decide whether the clusters are satisfactory; otherwise the parameters can be changed.
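The epsilon and minPts selection heuristic described for DBSCAN can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the spark_dbscan implementation: the function names are hypothetical, Euclidean distance is assumed, and the brute-force pairwise comparison is only practical for small point sets. Points flagged here are noise candidates; in DBSCAN a point with too few neighbours can still be attached to a cluster as a border point if it lies within epsilon of a core point.

```python
import math


def nearest_neighbour_distances(points):
    """Distance from each point to its nearest other point (brute force)."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def neighbour_counts(points, epsilon):
    """Number of other points within epsilon of each point."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= epsilon)
        for i, p in enumerate(points)
    ]


def histogram(values, bins=10):
    """Equal-width histogram counts, for inspecting either list above."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values match
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def noise_candidates(points, epsilon, min_pts):
    """Indices of points with fewer than min_pts neighbours within epsilon."""
    counts = neighbour_counts(points, epsilon)
    return [i for i, c in enumerate(counts) if c < min_pts]
```

In practice, one would inspect `histogram(nearest_neighbour_distances(points))` on the low-dimensional embedding to choose epsilon, then inspect `histogram(neighbour_counts(points, epsilon))` to choose minPts.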
Moreover, for multi-sample real datasets the coverage information has been added. Consequently, we believe our method does not need many samples to run. Our method performs better for low/medium datasets and also when the sequences are noisy.