Texture analysis in gel electrophoresis images using an integrative kernel-based approach

Texture information could be used in proteomics to improve the quality of the image analysis of proteins separated on a gel. To identify the most suitable technique for exploiting relevant textures, we used several kernel-based machine learning techniques to classify regions of 2-DE images as spot or noise. We evaluated the classification accuracy of each technique on proteins extracted from ten 2-DE images of different tissue types and experimental conditions. The best classification model was FSMKL, a data-integration method based on multiple kernel learning, which achieved AUROC values above 95% while using a reduced number of features. This technique also improves the interpretability of the complex combinations of textures by weighting the importance of each feature in the final model. In particular, the Inverse Difference Moment exhibited the highest discriminating power. Since this feature describes homogeneity, a higher value is associated with a more homogeneous structure. The final model combines different groups of textural features. Here we demonstrated the feasibility of combining different groups of textures for spot detection in 2-DE image analysis.

Co-occurrence matrix: angular second moment, contrast, correlation, sum of squares, and various averages, variances, inverse moments and entropies. Please refer to the MaZda user's manual for the particular equations of these features.
The highest influence in this work is achieved by the Inverse Difference Moment (IDM) textural feature, a measure of local homogeneity:

IDM = \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} \frac{P(i, j)}{1 + (i - j)^2}

where G is the number of grey levels used and P(i, j) is an element of the Grey Level Co-occurrence Matrix (GLCM): the second-order statistical probability of observing the co-occurring grey levels i and j at a particular distance d and a particular angle Θ. Equivalently, the matrix element P(i, j | ∆x, ∆y) is the relative frequency with which two pixels, separated by a pixel distance (∆x, ∆y), occur within a given neighborhood with intensities i and j.
According to this equation, inhomogeneous images yield a low IDM value and homogeneous images a relatively higher one. In our dataset, protein regions have low IDM values (inhomogeneous) and noise regions have high IDM values (homogeneous).
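The IDM computation described above can be sketched directly from a co-occurrence matrix. The following NumPy sketch (function names are ours, not MaZda's) builds a GLCM for a horizontal pixel offset and evaluates IDM on a homogeneous and an inhomogeneous patch:

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Grey Level Co-occurrence Matrix for a pixel offset (dx, dy),
    normalised so the entries sum to one."""
    h, w = img.shape
    P = np.zeros((levels, levels))
    for r in range(max(0, -dy), h - max(0, dy)):
        for c in range(max(0, -dx), w - max(0, dx)):
            P[img[r, c], img[r + dy, c + dx]] += 1
    return P / P.sum()

def idm(P):
    """Inverse Difference Moment: sum_ij P(i, j) / (1 + (i - j)^2)."""
    i, j = np.indices(P.shape)
    return float(np.sum(P / (1.0 + (i - j) ** 2)))

flat = np.zeros((16, 16), dtype=int)                  # homogeneous patch
checker = (np.indices((16, 16)).sum(axis=0) % 2) * 7  # inhomogeneous patch
idm_flat, idm_checker = idm(glcm(flat)), idm(glcm(checker))  # 1.0 vs 0.02
```

The constant patch concentrates all co-occurrence probability on the diagonal (IDM = 1), while the checkerboard places it far off-diagonal, illustrating why inhomogeneous protein spots score low and homogeneous background scores high.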

Run-length matrix
Across a given direction, the run-length matrix counts how many runs of consecutive pixels with the same grey level value occur. The software computes four run-length matrices, one for each direction of pixel runs: horizontal, vertical, 45° and 135°. MaZda calculates five different textural features from this matrix: short run emphasis inverse moment, long run emphasis moment, grey-level non-uniformity, run-length non-uniformity and fraction of image in runs. Please refer to the MaZda user's manual for the particular equations of these features.
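As an illustration, a horizontal run-length matrix and one of the statistics derived from it (short run emphasis) can be computed as follows; this is a sketch with our own function names, not MaZda's implementation:

```python
import numpy as np
from itertools import groupby

def run_length_matrix(img, levels=4):
    """Horizontal run-length matrix: R[g, l - 1] counts runs of grey
    level g with length l along the image rows."""
    R = np.zeros((levels, img.shape[1]))
    for row in img:
        for g, run in groupby(row):
            R[g, len(list(run)) - 1] += 1
    return R

def short_run_emphasis(R):
    """SRE = (1/N) * sum_{g,l} R[g, l] / l^2, N = total number of runs.
    Close to 1 when the image is dominated by single-pixel runs."""
    lengths = np.arange(1, R.shape[1] + 1)
    return np.sum(R / lengths ** 2) / R.sum()

img = np.array([[0, 0, 0, 0],    # one run of length 4
                [1, 2, 1, 2],    # four runs of length 1
                [3, 3, 2, 2]])   # two runs of length 2
sre = short_run_emphasis(run_length_matrix(img))
```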

Model-based textural features
These features are based on a first-order autoregressive model of the image. The model assumes that pixel intensity, relative to the mean image intensity, can be predicted as a weighted sum of the intensities of four neighboring pixels (left, top, top-left and top-right). This group of features aims to capture relations between neighboring pixels (shapes within the image). Please refer to the MaZda user's manual for the particular equations of these features.
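The weighted-sum prediction described above can be illustrated with an ordinary least-squares fit of the four neighbour weights. This is a sketch under the assumption of a simple linear AR model; the function name and synthetic test image are ours, not MaZda's:

```python
import numpy as np

def ar_weights(img):
    """Least-squares estimate of first-order AR weights: each pixel (minus
    the image mean) is predicted as a weighted sum of its left, top,
    top-left and top-right neighbours."""
    z = img.astype(float) - img.mean()
    y = z[1:, 1:-1].ravel()                     # target pixels (interior)
    X = np.column_stack([z[1:, :-2].ravel(),    # left neighbour
                         z[:-1, 1:-1].ravel(),  # top
                         z[:-1, :-2].ravel(),   # top-left
                         z[:-1, 2:].ravel()])   # top-right
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Synthetic image generated with known weights (0.4 left, 0.3 top) plus
# noise; the fit should recover weights close to these values.
rng = np.random.default_rng(0)
z = np.zeros((128, 128))
for r in range(1, 128):
    for c in range(1, 128):
        z[r, c] = 0.4 * z[r, c - 1] + 0.3 * z[r - 1, c] + rng.normal()
theta = ar_weights(z)
```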

Absolute gradient
The gradient of an image measures the spatial variation of grey levels across the image. A high gradient value corresponds to an abrupt change in grey level (for example, from black to white). Five features are derived from the gradient: mean, variance, skewness, kurtosis and the percentage of pixels with a nonzero gradient. Please refer to the MaZda user's manual for the particular equations of these features.
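A minimal sketch of these five gradient-derived statistics, using NumPy's finite-difference gradient rather than MaZda's exact operator (which may differ):

```python
import numpy as np

def gradient_features(img):
    """Absolute-gradient features: mean, variance, skewness and kurtosis
    of the gradient magnitude, plus the percentage of pixels with a
    nonzero gradient."""
    gy, gx = np.gradient(img.astype(float))
    g = np.hypot(gx, gy)
    mu, var = g.mean(), g.var()
    sd = np.sqrt(var) if var > 0 else 1.0   # guard flat images
    return {"mean": mu,
            "variance": var,
            "skewness": np.mean(((g - mu) / sd) ** 3),
            "kurtosis": np.mean(((g - mu) / sd) ** 4),
            "pct_nonzero": 100.0 * np.count_nonzero(g) / g.size}

# A step edge (black half, white half) yields a strong, localised gradient.
step = np.zeros((8, 8))
step[:, 4:] = 255.0
feats = gradient_features(step)
```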

Discrete Haar Wavelet
Wavelet analysis examines the frequency content of an image at different scales. The wavelet transform is applied at up to five scales, in both the horizontal and vertical directions, decomposing the image into 20 frequency channels. Please refer to the MaZda user's manual for the particular equations of these features.
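One level of the 2-D Haar decomposition splits the image into four half-resolution subbands; applying the step at five successive scales gives the 20 frequency channels (four subbands × five scales) mentioned above. A sketch of a single orthonormal Haar step (function name ours):

```python
import numpy as np

def haar_step(img):
    """One level of the 2-D orthonormal Haar transform: returns the LL,
    LH, HL and HH subbands, each half the input size."""
    a = img.astype(float)
    # row direction: low-pass (pair sums) and high-pass (pair differences)
    lo = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2)
    hi = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2)
    # column direction, applied to each half
    ll = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)
    lh = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)
    hl = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
    hh = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
    return ll, lh, hl, hh

rng = np.random.default_rng(1)
img = rng.random((8, 8))
ll, lh, hl, hh = haar_step(img)
```

Because the step is orthonormal, the total energy of the four subbands equals that of the input image, so subband energies can serve directly as textural features.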

The dataset
In order to generate the dataset, ten 1024 × 1024 8-bit 2-DE images were used, corresponding to an experiment investigating the effect of a plant extract on the protein expression of IBR3 human dermal fibroblasts. Spot separation patterns were visualized by silver staining using standard protocols. These images are from the dataset owned by G.-Z. Yang (Imperial College of Science, Technology and Medicine, London) and have been used in several publications. [5][6][7] For each of these ten images, 100 regions of interest (ROIs) were manually segmented: 50 spots representing proteins and 50 representing noise (background, non-protein regions). This yielded a training set with 1000 samples and 274 textural features. We preprocessed the dataset so that each feature follows a standard normal distribution (a mean of zero and a standard deviation of one). The dataset is available for download at http://dx.doi.org/10.6084/m9.figshare.1368643.
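The standardization step can be reproduced with a few lines of NumPy (a sketch; the helper name is ours):

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: rescale each textural feature to zero mean and
    unit standard deviation, as done for the 1000 x 274 training matrix."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0        # guard against constant features
    return (X - mu) / sd

rng = np.random.default_rng(42)
X = rng.normal(50, 10, size=(100, 5))   # stand-in for the real feature matrix
Z = standardize(X)
```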
We also included in the Supplementary Materials the MaZda (.roi) files and the images, so that the manual segmentation process can be reproduced and the particular spots selected for each image can be identified. This information is available for download at http://dx.doi.org/10.6084/m9.figshare.1538606.
With MaZda it is only possible to define up to 16 regions of interest per image, so there are eight (.roi) files for each image. Please refer to the MaZda user's manual for the particular instructions to load an image and its (.roi) files.

Values in parentheses represent coordinates, containing information about distance and direction between pixels. Perc. = percentile derived from the image histogram; Theta and Sigma = vectors of the autoregressive model; InvDfMom = inverse difference moment; Gr. = absolute gradient parameters (kurtosis, mean, skewness, variance and percentage of pixels with nonzero gradient). FSMKL considers that run-length matrix and wavelet textural features are not relevant for the given classification problem.

Table 6. Inter- and intra-variability in the manual spot segmentation process with ten 2-DE images.

Image ID                    1    2    3    4    5    6    7    8    9   10
Clinician A, iteration 1  404  545  539  545  445  539  565  307  539  565
Clinician A, iteration 2  433  551  527  512  412  533  579  306  533  551
Clinician B, iteration 1  397  481  541  497  431  511  539  297  505  556
Clinician B, iteration 2  401  475  512  505  429  523  545  300  471

We identify each image with an Image ID number and present, for each of the two clinicians, the number of spots manually segmented in two consecutive iterations. Mean and standard deviation are calculated at the bottom of the table to measure the inter- and intra-variability.
These features are based on the image histogram, the co-occurrence matrix (information about the grey-level distribution of pairs of pixels a preset distance d = 1, 2, 3, 4 or 5 pixels apart along a given direction with angle Θ = 0°, 45°, 90° or 135°), the run-length matrix (information about sequences of pixels with the same grey level values in a given direction), image gradients (spatial variation of grey level values), auto-regressive models (description of texture based on statistical correlation between neighbouring pixels) and wavelet analysis (information about image frequency content at different scales).