DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture

It is critical, but difficult, to catch the small variations in genomic or other kinds of data that differentiate phenotypes or categories. A plethora of data is available, but the information from its genes or elements is spread arbitrarily, making it challenging to extract relevant details for identification. However, arranging similar genes into clusters makes these differences more accessible and allows for more robust identification of hidden mechanisms (e.g. pathways) than dealing with elements individually. Here we propose DeepInsight, which converts non-image samples into a well-organized image form. Thereby, the power of convolutional neural networks (CNNs), including GPU utilization, can be realized for non-image samples. Furthermore, DeepInsight enables feature extraction through the application of a CNN to non-image samples to seize imperative information, and has shown promising results. To our knowledge, this is the first work to apply a CNN simultaneously to different kinds of non-image datasets: RNA-seq, vowels, text, and artificial.

where j = 1, 2, …, d, and d is the dimension of the samples in the dataset. If, after normalization, any feature value of the validation set or test set is less than 0 or greater than 1, then such feature values are clamped between 0 and 1 to maintain consistency.
The norm-2 normalization tries to preserve the topology of the features to some extent. In this method, the minimum value is adjusted for each feature or attribute, and then a global maximum is used on the logarithmic scale to place the feature values between 0 and 1. Norm-2 is conducted in the following manner:

Min_j = min over the training set of X(j, :)
X(j, :) ← log( X(j, :) − Min_j + 1 )
Max = max(X)
X(j, :) ← X(j, :) / Max

The validation and test sets are adjusted using the training extrema for normalization. If, after adjusting by the minimum values (Min_j), any element of the validation or test set is less than 0, it is clamped at 0. Similarly, if after normalizing by the maximum value (Max) any feature from the validation or test set is above 1, it is clamped to 1.
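As an illustration, a minimal Python sketch of the norm-2 procedure (the released package is in Matlab; the array layout with one feature per row and the function names are our own):

```python
import numpy as np

def norm2_fit(X_train):
    """Learn per-feature minima and a global maximum from the training set.

    X_train: array of shape (n_features, n_samples), i.e., row j holds
    feature j across all training samples.
    """
    mins = X_train.min(axis=1, keepdims=True)        # Min_j for each feature
    X_log = np.log(X_train - mins + 1.0)             # shift so the log argument >= 1
    gmax = X_log.max()                               # global maximum Max
    return mins, gmax

def norm2_apply(X, mins, gmax):
    """Apply the training extrema to any split and clamp values to [0, 1]."""
    X_log = np.log(np.maximum(X - mins + 1.0, 1.0))  # clamp at 0 before the log
    return np.clip(X_log / gmax, 0.0, 1.0)           # clamp at 1 after scaling
```

Fitting on the training split and reusing `mins` and `gmax` on the validation and test splits reproduces the clamping behaviour described above.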

Supplementary File 2 Parameters for DeepInsight
In this supplement, we describe the parameters used for the DeepInsight method.

CNN parameters
Four convolution layers are implemented in a parallel configuration. Bayesian optimization is used to find the best parameters from a range of candidate values. The filter (window) size differs for each parallel layer, whereas parameters such as the number of filters, momentum, and L2-regularization are the same across layers. The maximum number of objective evaluations and the maximum number of epochs are both set to 100. The parameters are summarized in Table S2.1. The ranges of values were applied during the training session, and the values that gave the lowest validation error were selected.
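For illustration, selecting the configuration with the lowest validation error can be sketched as follows. This is a plain exhaustive search standing in for the Bayesian optimization actually used, and the search ranges and the toy objective are hypothetical placeholders, not the values from Table S2.1:

```python
import itertools
import random

# Hypothetical search ranges; the actual ranges are listed in Table S2.1.
search_space = {
    "num_filters": [16, 32, 64],
    "momentum": [0.8, 0.9, 0.95],
    "l2_reg": [1e-4, 1e-3, 1e-2],
}

def validation_error(params):
    # Stand-in for training the four-branch CNN and measuring validation
    # error; a deterministic toy function so the example is runnable.
    return abs(params["num_filters"] - 32) / 64 + params["l2_reg"]

def select_best(space, max_evaluations=100):
    """Evaluate candidate configurations (up to the budget of 100
    objective evaluations) and keep the one with the lowest validation
    error, mirroring the selection rule described above."""
    candidates = [dict(zip(space, v)) for v in itertools.product(*space.values())]
    random.shuffle(candidates)
    return min(candidates[:max_evaluations], key=validation_error)
```

In the real pipeline `validation_error` would train the network and evaluate it on the held-out validation split; the selection rule is unchanged.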

Dimensionality reduction technique
We utilized t-SNE and kernel PCA for finding the locations of features. In the case of t-SNE, if the number of features is less than 5,000 then the exact algorithm is used; otherwise, the Barnes-Hut algorithm is applied to speed up processing. The distance metric used in t-SNE is 'cosine'. For kernel PCA, the two eigenvectors corresponding to the leading eigenvalues are used for the transformation. The kernel type used was 'Gaussian'.

Feature mapping
Once the feature locations are defined using the training set, the next step is to map feature values to these locations. If two or more features occupy the same location, their averaged value is used; i.e., if the locations of three features g_1, g_2 and g_3 coincide at (x, y), then (g_1 + g_2 + g_3)/3 is mapped to this location. This amounts to a lossy compression of features. The validation and test sets use the feature locations obtained from the training set. Empty pixels, i.e., pixels that do not contain any features, are referred to as Base, and their value is fixed at 1.
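A minimal Python sketch of this mapping step (the released package is in Matlab; function and argument names here are our own):

```python
import numpy as np

def map_features_to_image(values, pixel_coords, image_size, base=1.0):
    """Place feature values of one sample on a pixel grid; features that
    share a pixel are averaged, and empty pixels keep the Base value.

    values:       (d,) feature values of one sample
    pixel_coords: length-d sequence of (row, col) pixel locations learnt
                  from the training set
    image_size:   (rows, cols) of the output frame
    base:         value assigned to empty pixels (fixed at 1 in the text)
    """
    img = np.full(image_size, base)
    sums = np.zeros(image_size)
    counts = np.zeros(image_size)
    for v, (r, c) in zip(values, pixel_coords):
        sums[r, c] += v                      # accumulate colliding features
        counts[r, c] += 1
    occupied = counts > 0
    img[occupied] = sums[occupied] / counts[occupied]   # average collisions
    return img
```

For example, two features with values 1 and 2 falling on the same pixel yield 1.5 there, while untouched pixels stay at the Base value.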

Pixel frame
The pixel size can be determined automatically or fixed. The auto mode determines the size m × n by utilizing the distance between the two nearest feature locations (referred to as d_min in equations S6.3 and S6.4 of Supplement File 6). However, this can enlarge the pixel frame, and it is therefore limited by a predefined maximum size for either m or n. If the maximum size is exceeded, the pixel size is adjusted accordingly. In this work, we used maximum pixel sizes of 120×120 and 200×200.
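The auto mode can be sketched as follows; this is a Python illustration rather than the package's Matlab implementation, and the proportional shrinking used to enforce the cap is our own simplification:

```python
import numpy as np

def auto_frame_size(coords, max_size=120):
    """Derive the pixel-frame size from the smallest pairwise distance
    between feature locations (d_min), capped at a predefined maximum."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)                      # ignore self-distances
    d_min = dists.min()                                  # nearest pair of features
    m = int(np.ceil(np.ptp(coords[:, 0]) / d_min)) + 1   # pixels along x
    n = int(np.ceil(np.ptp(coords[:, 1]) / d_min)) + 1   # pixels along y
    if max(m, n) > max_size:                             # enforce the size cap
        scale = max_size / max(m, n)
        m, n = max(int(m * scale), 1), max(int(n * scale), 1)
    return m, n
```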

Supplementary File 3 Results
A dataset is first partitioned into three segments, namely train, validation and test sets. The proportion of train, validation and test is roughly 80:10:10. All the results are on test sets.
For the DeepInsight method, we optimized the parameters using the train and validation sets. The parameters selected are those for which the validation error is minimum. DeepInsight employs norm-1 and norm-2 normalization (as described in the manuscript and Supplement File 1); the validation error is evaluated on both norms, and the norm providing the lowest validation error is used. The validation errors for both norms are depicted in Table S3.1. The validation error for the RNA-seq dataset when the pixel size is 200×200 was also obtained: the value for norm-1 is 0.0233 and for norm-2 it is 0.0179; i.e., norm-2 is selected in this case due to its lower validation error. The test accuracy obtained was 99%.
For pixel size 120×120, the test accuracies obtained are depicted in Table S3.2.
Table S3.2: Accuracy on test set when pixel size 120×120 is used.

Supplementary File 4 Codes description
This package is written in Matlab. It has two main components: transforming into pixels and processing via convolution neural networks (CNNs). A summary of the code and how to use it is discussed herein. As an example dataset, ringnorm-DELVE is provided with the package.

1) Dataset struct
The dataset (dset) should be in the following struct format:

Supplementary File 5

Non-linear dimensionality reduction techniques
In this supplement, we describe two non-linear dimensionality reduction techniques employed in the DeepInsight method. These techniques are t-distributed stochastic neighbor embedding (t-SNE) (Maaten and Hinton, 2008) and kernel principal component analysis (PCA) (Schölkopf et al., 1998).

t-SNE
The t-SNE technique visualizes high-dimensional data on a two or three dimensional plane for clustering samples.
The mapping from higher dimensional space to lower dimensional space happens in a non-linear fashion.
Similar samples are mapped close to each other, while dissimilar samples are mapped apart. This technique is a variant of stochastic neighbor embedding (Hinton and Roweis, 2002) that is easier to optimize, leading to better visualizations.
Many linear dimensionality reduction techniques map data to a 2D plane (DeepInsight does not require a 3D transformation by t-SNE). However, the mapped samples are often highly convoluted, and it becomes very challenging for clustering algorithms to find a reasonable level of grouping. On the other hand, t-SNE has the potential to map very high-dimensional data to a 2D plane while trying to preserve the topology; in other words, with minimum error. This enables an understanding of complex data structures in a lower-dimensional space. However, the processing time of t-SNE can be long. For faster processing, the Barnes-Hut algorithm is used to approximate the joint distributions instead of computing them exactly.
The t-SNE technique has two main steps. In the first step, it constructs a probability distribution over pairs of samples such that similar samples have higher probability and dissimilar samples have lower probability. In the second step, it finds the probability distribution in a 2D plane. Then it minimizes the Kullback-Leibler divergence between the two distributions belonging to lower-and higher-dimensional spaces using a gradient descent method.
t-SNE uses the Euclidean distance (however, in DeepInsight, the cosine distance was used) to compute probabilities. The conditional probability p_{j|i}, used in t-SNE, is a measure of the probability that a sample x_i will pick x_j as its neighbor under a Gaussian distribution. It can be defined as

p_{j|i} = exp( −‖x_i − x_j‖² / 2σ_i² ) / Σ_{k≠i} exp( −‖x_i − x_k‖² / 2σ_i² )

where x ∈ ℝ^d and σ_i is the variance of the Gaussian that is centered at sample x_i. Since t-SNE is only interested in pairwise probabilities, p_{i|i} is set to 0.
If the conditional probabilities p_{j|i} and q_{j|i} are equal, then the mapped points y_i and y_j correctly model the similarity between the higher-dimensional samples x_i and x_j. Therefore, the aim is to model q_{j|i} as close as possible to p_{j|i}. This is done by minimizing the Kullback-Leibler divergence with respect to y_i, using a gradient descent method:

C = Σ_i KL(P_i ‖ Q_i) = Σ_i Σ_j p_{j|i} log( p_{j|i} / q_{j|i} )

where C is the cost function, KL is the Kullback-Leibler divergence function, P_i is the conditional probability distribution over all samples given x_i, and Q_i is the conditional probability distribution over all mapped samples given y_i.
As a consequence of this optimization, mapped samples can be found in the lower-dimensional space that reflect the similarities between samples in the higher-dimensional space.
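The conditional probability described in this section can be computed as follows; a Python sketch using a fixed, shared sigma for simplicity, whereas t-SNE itself tunes σ_i per sample via a perplexity search:

```python
import numpy as np

def conditional_probabilities(X, sigma):
    """p_{j|i}: probability that sample i picks sample j as its neighbour
    under a Gaussian centred at x_i (Euclidean distance, shared sigma)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    P = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                              # p_{i|i} = 0
    return P / P.sum(axis=1, keepdims=True)               # normalise each row
```

Each row of the result sums to 1, and closer samples receive larger neighbour probabilities, as the definition requires.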

Kernel PCA
Kernel PCA is beneficial for visualization, novelty detection and image de-noising. It is an extension of the PCA technique for dimensionality reduction by incorporating kernel functions. These kernel functions help to compute the principal components in much higher dimensional spaces. However, the transformation to these higher dimensional spaces does not explicitly occur.
In kernel PCA, a projection function φ is used to transform samples x ∈ ℝ^d into a feature space. This feature space could be infinite-dimensional. However, instead of explicitly computing this feature space, the kernel trick is used to obtain samples z ∈ ℝ^h (where h < d) in a parsimonious data space.
Assume the projected samples φ(x_1), φ(x_2), …, φ(x_N) are centered, so that their mean is zero. The covariance matrix can then be obtained as

C = E[ φ(x) φ(x)^T ] = (1/N) Σ_{n=1}^{N} φ(x_n) φ(x_n)^T (S5.4)

where E[·] is an expectation function and φ(x_n) is a sample from this dataset. If the data matrix is denoted by Φ = [φ(x_1), φ(x_2), …, φ(x_N)], then equation (S5.4) can be written in matrix form as

C = (1/N) Φ Φ^T (S5.5)

Eigenvalue decomposition (EVD) of the covariance matrix gives

C v_i = λ_i v_i, for i = 1, 2, …, N (S5.6)

Since v_i can be represented as a linear combination of φ(x_1), φ(x_2), …, φ(x_N), we can write v_i = Φ α_i, where α_i is an N-dimensional column vector. Using this equality and equation (S5.5), we can rewrite equation (S5.6) as

(1/N) Φ Φ^T Φ α_i = λ_i Φ α_i (S5.7)

It is easy to eliminate the term Φ from both sides of equation (S5.7). If we also define the kernel K = Φ^T Φ, then

K α_i = N λ_i α_i

that is, α_i is the eigenvector of K (an N × N matrix) corresponding to the eigenvalue N λ_i. For normalization, the condition v_i^T v_i = 1 gives α_i^T K α_i = 1. Thereafter, dimensionality reduction can be applied as

z_i(x) = φ(x)^T v_i = Σ_{n=1}^{N} α_{i,n} k(x, x_n)

So far, we have assumed that the projected data φ(x) has a zero mean. In practice, this is not true. Therefore, the projected data after centralizing gives the kernel

K̃ = K − 1_N K − K 1_N + 1_N K 1_N

where 1_N is an N × N matrix for which every element takes the value 1/N.
A variety of kernel functions can be used. For example, a linear kernel between two samples x and x′ can be k(x, x′) = x^T x′, and a Gaussian kernel can be k(x, x′) = exp( −‖x − x′‖² / σ ).
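A compact numpy sketch of kernel PCA with a Gaussian kernel, following the steps above (kernel matrix, centring with the 1_N matrix, leading eigenvectors, projection); the function and variable names are our own:

```python
import numpy as np

def kernel_pca(X, n_components=2, sigma=1.0):
    """Kernel PCA with a Gaussian kernel: build K, centre it in feature
    space, normalise the eigenvectors of the leading eigenvalues, and
    project the training samples."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / sigma)                              # Gaussian kernel matrix
    one_n = np.full((n, n), 1.0 / n)                     # the 1_N matrix
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # centred kernel
    vals, vecs = np.linalg.eigh(Kc)                      # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]          # leading components
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))  # alpha'K alpha = 1
    return Kc @ alphas                                   # z = sum_n alpha_n k(x, x_n)
```

DeepInsight uses the two leading components (`n_components=2`) to obtain the 2D feature coordinates.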

Description of DeepInsight Pipeline
An overview of the DeepInsight pipeline is depicted in Figure 1b (main manuscript). Here we describe the details of the pipeline. A dataset is subdivided into 3 parts, namely the training set, validation set and test set. The training set is employed to find the locations of attributes or features in a 2D plane. Let a training set consisting of N samples be defined as X = {x_1, x_2, …, x_N}. The attributes of X can be represented as G = {g_1, g_2, …, g_d}, where g_j is a feature vector with N entries. This set is processed through t-SNE or kernel PCA to obtain 2D coordinates {(p_1, q_1), (p_2, q_2), …, (p_d, q_d)}. The coordinates (p_j, q_j) define the location of g_j, where j = 1, 2, …, d.
Next, the convex hull algorithm is applied to find the minimum rectangle covering all the points. Since this rectangle is not necessarily aligned with the horizontal or vertical direction (as required by the CNN architecture), we perform a rotation. For the rotation, the gradient of two corner coordinates of the rectangle (obtained by the convex hull algorithm) is considered. If the coordinates are defined as (x_{c1}, y_{c1}) and (x_{c2}, y_{c2}), then the gradient is defined as (see Figure S6.1)

m = ( y_{c2} − y_{c1} ) / ( x_{c2} − x_{c1} )

This enables computation of the rotation angle, defined as θ = tan⁻¹(m), which provides the rotation matrix

R = [ cos θ  −sin θ ; sin θ  cos θ ]

This rotation matrix is multiplied with the training-set coordinates to provide a horizontal/vertical image frame (shown as red points in Figure 1b of the manuscript). The horizontal and vertical lengths of this frame are given as

A = | x_{f2} − x_{f1} |
B = | y_{f3} − y_{f2} |

where x_{f2} and x_{f1} are the x-axis coordinates of the image frame in the horizontal direction, and y_{f3} and y_{f2} are the y-axis coordinates in the vertical direction (see Figure S6.2). It is then required to convert the Cartesian coordinates to pixel form for processing. This is done by determining the minimum distance between the two closest points, d_min, and scaling the Cartesian coordinates by it, where (a_c, b_c) are the x-axis and y-axis coordinates in the Cartesian plane and (a_p, b_p) are the corresponding coordinates in the pixel frame.
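The rotation and pixel-conversion steps can be sketched in Python as follows (the released package is in Matlab; the two corner points are assumed to come from the convex-hull step, and the function name is our own):

```python
import numpy as np

def to_pixel_frame(coords, corner1, corner2):
    """Rotate 2D feature coordinates so the bounding rectangle is
    axis-aligned, then discretise them using the smallest pairwise
    gap d_min. corner1 and corner2 are two adjacent corners of the
    minimum rectangle found by the convex-hull step."""
    dy = corner2[1] - corner1[1]
    dx = corner2[0] - corner1[0]
    theta = np.arctan2(dy, dx)                      # rotation angle from the gradient
    R = np.array([[np.cos(-theta), -np.sin(-theta)],
                  [np.sin(-theta),  np.cos(-theta)]])
    rotated = coords @ R.T                          # align the frame with the axes
    diffs = rotated[:, None, :] - rotated[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)
    d_min = dists.min()                             # distance of the two closest points
    shifted = rotated - rotated.min(axis=0)         # move the frame to the origin
    return np.round(shifted / d_min).astype(int)    # Cartesian -> pixel coordinates
```

For instance, three collinear points along the 45° diagonal are rotated onto the horizontal axis and fall on consecutive pixel columns.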