Band-based similarity indices for gene expression classification and clustering

The concept of depth induces an ordering from the centre outwards in multivariate data. Most depth definitions are infeasible for dimensions larger than three or four, but the Modified Band Depth (MBD) is a notable exception that has proven to be a valuable tool in the analysis of high-dimensional gene expression data. This depth definition relates the centrality of each individual to its (partial) inclusion in all possible bands formed by elements of the data set. We assess the (dis)similarity between pairs of observations by accounting for such bands and constructing binary matrices associated with each pair. From these, contingency tables are calculated and used to derive standard similarity indices. Our approach is computationally efficient and can be applied to bands formed by any number of observations from the data set. We have evaluated the performance of several band-based similarity indices against that of other classical distances in standard classification and clustering tasks on a variety of simulated and real data sets; the method is not restricted to these indices, however, and its extension to other similarity coefficients is straightforward. Our experiments show the benefits of our technique, with some of the selected indices outperforming, among others, the Euclidean distance.


Methods
Notation. In the context of binary data, given two objects i_1 and i_2, with respective associated binary vectors x_{i_1} and x_{i_2}, the information required to compute a similarity index between x_{i_1} and x_{i_2} (and thus describe the proximity of i_1 and i_2) can be summarised in a 2×2 contingency table (see Fig. 1a). With the usual notation, a is the number of attributes present in both objects i_1 and i_2 (or equivalently, the number of binary features, or bits, equal to 1 in both x_{i_1} and x_{i_2}); b is the number of attributes present in object i_1 and absent from i_2; c is the number of attributes absent from object i_1 and present in i_2; and d is the number of attributes absent from both i_1 and i_2. Clearly, the sum n = a + b + c + d is the total number of attributes, and the fractions p_{i_1} = a/(a + b) and p_{i_2} = a/(a + c) represent the proportion of attributes present in both objects among the attributes present in i_1 and in i_2, respectively. Note that in the context of information retrieval or machine learning these ratios correspond to the terms precision and recall.
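These counts can be obtained directly from the two binary vectors; a minimal sketch (the function name is ours):

```python
def contingency(x1, x2):
    """2x2 contingency counts (a, b, c, d) for two equal-length binary vectors:
    a: present in both; b: only in x1; c: only in x2; d: absent from both."""
    a = sum(1 for u, v in zip(x1, x2) if u and v)
    b = sum(1 for u, v in zip(x1, x2) if u and not v)
    c = sum(1 for u, v in zip(x1, x2) if not u and v)
    d = sum(1 for u, v in zip(x1, x2) if not u and not v)
    return a, b, c, d

a, b, c, d = contingency([1, 1, 0, 1, 0, 0], [1, 0, 1, 1, 0, 0])
precision, recall = a / (a + b), a / (a + c)   # p_{i1} and p_{i2}
```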
On the other hand, given a sample of n d-dimensional observations {y_1, ..., y_n}, where y_i = (y_{i,1}, ..., y_{i,d}), we define the j-band determined by the j distinct elements y_{i_1}, ..., y_{i_j}, with 2 ≤ j ≤ n, as the d-dimensional interval delimited by the minimum and maximum of their coordinates:

[y_{i_1}; \ldots; y_{i_j}] = \{ (y_1, \ldots, y_d) \in \mathbb{R}^d : \min\{y_{i_1,k}, \ldots, y_{i_j,k}\} \le y_k \le \max\{y_{i_1,k}, \ldots, y_{i_j,k}\},\ k \in \{1, \ldots, d\} \}.

However, when referring to j-bands in cases where it is irrelevant to identify which elements define them, we simplify the notation by letting B^{(j)} be the set of all j-bands.

Figure 1. (b) The gray region between the black curves is the band β^{(2)}_5 = [y_2; y_4]. The first coordinate of curve y_1 (green dot) is outside the band; this corresponds to a 0 in entry (5,1) of its associated Boolean matrix M^{(2)}_1. The first coordinate of curve y_3, in red, is inside the band; entry (5,1) of matrix M^{(2)}_3 is 1, accordingly. The corresponding Boolean product is 0, meaning that the band does not include both coordinates. (c) Binary contingency table for coordinate k and the number j of curves forming a band, constructed from the matrices sketched in (b).

www.nature.com/scientificreports/

For each j, we can associate to each y_i a \binom{n}{j} × d binary matrix M^{(j)}_i, whose (p, k) element equals 1 if y_{i,k} lies within the k-th (one-dimensional) interval determined by the p-th j-band in B^{(j)}, and 0 otherwise. The number of 1s within the k-th column represents the number of bands in B^{(j)} whose k-th interval contains y_{i,k}; let us denote this number by N^{(j)}_k(y_i). In a similar way, given the pair of distinct elements (y_{i_1}, y_{i_2}), we can construct the matrix M^{(j)}_{i_1,i_2}, the entry-wise Boolean product of M^{(j)}_{i_1} and M^{(j)}_{i_2}, to identify the bands in B^{(j)} whose k-th interval contains the k-th coordinate of both points y_{i_1} and y_{i_2}. The number of 1s in the k-th column of this matrix is denoted by N^{(j)}_k(y_{i_1}, y_{i_2}).
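A brute-force construction of the set of j-bands and the binary matrices M^{(j)}_i can be sketched as follows (illustrative only; function and variable names are ours):

```python
import numpy as np
from itertools import combinations

def band_matrices(Y, j=2):
    """For each row y_i of Y (n x d), build the C(n, j) x d binary matrix whose
    (p, k) entry is 1 iff y_{i,k} lies in the k-th interval of the p-th j-band."""
    Y = np.asarray(Y, float)
    n, d = Y.shape
    bands = list(combinations(range(n), j))               # index tuples defining each band
    lo = np.array([Y[list(t)].min(axis=0) for t in bands])  # band minima, C(n,j) x d
    hi = np.array([Y[list(t)].max(axis=0) for t in bands])  # band maxima
    return [((lo <= Y[i]) & (Y[i] <= hi)).astype(int) for i in range(n)]

# toy example in the spirit of Fig. 1(b): four 5-dimensional observations
Y = np.random.default_rng(0).normal(size=(4, 5))
M = band_matrices(Y, j=2)
both = M[0] * M[2]               # Boolean product for the pair (y_1, y_3)
N_k_y1 = M[0].sum(axis=0)        # N^(2)_k(y_1), per coordinate
N_k_pair = both.sum(axis=0)      # N^(2)_k(y_1, y_3), per coordinate
```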
Similarity indices for binary data. Some of the most widely used similarity indices for binary data are described below.
• Simple matching (SM) coefficient: Introduced by 23, it provides the simplest similarity measure as the proportion of matching binary attributes, both present and absent, of items i_1 and i_2: S_{SM} = (a + d)/n.
• Jaccard (J) coefficient: It was introduced in 20 and is one of the most frequently used indices; in fact, the dissimilarity index derived from it is a metric. It counts the number of matching present attributes and compares it to the number of attributes present in at least one of the objects: S_J = a/(a + b + c).
• Simpson (S) coefficient: This popular index was introduced in 22 to measure set similarity as the proportion of the smaller set included in the larger one. It is often referred to as Lennon's coefficient 45, and corresponds to the larger of the proportions p_{i_1} and p_{i_2}: S_S = a/\min(a + b, a + c).
• Forbes (F) coefficient: It was derived in the ecology field as the observed-to-expected ratio of the number of species shared by two random samples 21. Despite its good general properties, this index has been routinely excluded from studies and discussions of similarity indices. Nevertheless, it has recently attracted renewed interest 28,46,47 and has been shown to outperform other indices in several situations. The corresponding formula is S_F = na/((a + b)(a + c)).
• Dice (D) coefficient: Presented in 48, and also known as the Sørensen or Czekanowski coefficient, it is computed like the Jaccard index, but gives double weight to matching attributes with value 1: S_D = 2a/(2a + b + c).
• Anderberg (A) coefficient: It was used in 24. It is another variation of the Jaccard coefficient, giving double weight to mismatching attributes: S_A = a/(a + 2(b + c)).
• Ochiai (O) coefficient: This index is described in 49. It represents the geometric mean of p_{i_1} and p_{i_2}, and in several studies (see e.g. 27) it has outperformed other indices. Its expression is S_O = a/\sqrt{(a + b)(a + c)}.
• Russell and Rao (RR) coefficient: Introduced in 50, it represents the proportion of positive matching attributes: S_{RR} = a/n.

All these coefficients are symmetric and non-negative similarity indices.
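The eight coefficients just described can be computed directly from the contingency counts a, b, c, d; a sketch (the conventions for zero denominators are ours, not the paper's):

```python
import math

def similarity_indices(a, b, c, d):
    """Classical binary similarity indices from the 2x2 contingency table.
    Zero-denominator conventions are our own choice for this sketch."""
    n = a + b + c + d
    return {
        "SM": (a + d) / n,                                                   # simple matching
        "J":  a / (a + b + c) if a + b + c else 1.0,                         # Jaccard
        "S":  a / min(a + b, a + c) if min(a + b, a + c) else 1.0,           # Simpson
        "F":  n * a / ((a + b) * (a + c)) if (a + b) and (a + c) else 0.0,   # Forbes
        "D":  2 * a / (2 * a + b + c) if 2 * a + b + c else 1.0,             # Dice
        "A":  a / (a + 2 * (b + c)) if a + 2 * (b + c) else 1.0,             # Anderberg
        "O":  a / math.sqrt((a + b) * (a + c)) if (a + b) and (a + c) else 0.0,  # Ochiai
        "RR": a / n,                                                         # Russell-Rao
    }

s = similarity_indices(2, 1, 3, 0)   # the toy counts discussed later in the text
```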
When an index S with such properties takes values in the interval [0, 1], a symmetric and non-negative dissimilarity index can be defined from it as D = 1 − S. Note that this is not the case for F, which is not always bounded by 1; however, this issue can be circumvented by rescaling the values of F as in 28. Also, note that D_{RR} = 1 − S_{RR} is not reflexive (i.e., not a proper dissimilarity), since D_{RR}(x_i, x_i) is not 0 for vectors x_i not identically equal to 1; it is therefore easy to anticipate that it will normally not produce good results.

Proposed band-based similarity measure for quantitative data. Given the multivariate, quantitative data set under study {y_1, ..., y_n} and the number j used to define the bands, we can associate to each data point y_i a binary matrix M^{(j)}_i, as previously explained. Figure 1(b) exemplifies the computation of these matrices through a toy data set with four 5-dimensional observations. These vectors define six different 2-bands, labelled β^{(2)}_1, ..., β^{(2)}_6, where β^{(2)}_1 is the region determined by y_1 and y_2; β^{(2)}_2 is delimited by y_1 and y_3, and so on. We focus on elements y_1, in green, and y_3, in red, and examine their first coordinate: these are, respectively, outside and inside the band β^{(2)}_5 = [y_2; y_4] defined by the other two elements (gray region). Accordingly, the corresponding values in matrices M^{(2)}_1 and M^{(2)}_3 are 0 and 1, as highlighted in gray in Fig. 1(b). If we consider all the other bands, we find that three of them contain the coordinate y_{1,1} and five contain y_{3,1}; however, only two bands contain both coordinates simultaneously.
The key idea behind our method is the following. We expect that two observations whose coordinates are simultaneously embedded in many bands are more similar than those with coordinates located in few bands at the same time. Note that such similarity is dependent on the data set, i.e., two given elements will have different similarities when the other elements in the set change, in the same way the depth of a point is modified when the data set is altered.
Next, for each coordinate k and for each j, with 2 ≤ j ≤ J, 2 ≤ J ≤ n (following the construction of the MBD), we can build a contingency table like the one shown in Fig. 1(c), where

a^{(j)}_k = N^{(j)}_k(y_{i_1}, y_{i_2}), \quad b^{(j)}_k = N^{(j)}_k(y_{i_1}) - a^{(j)}_k, \quad c^{(j)}_k = N^{(j)}_k(y_{i_2}) - a^{(j)}_k,

and d^{(j)}_k = \binom{n}{j} - a^{(j)}_k - b^{(j)}_k - c^{(j)}_k. In our toy example, a^{(2)}_1 = 2, b^{(2)}_1 = 1, c^{(2)}_1 = 3 and d^{(2)}_1 = 0. At this point, for a given j, it is possible to summarise the information contained in these d tables in two different manners: (a) by adding the corresponding values of all the tables and plugging the results into the chosen index, or (b) by computing the chosen index component-wise and averaging across coordinates. To keep track of similarities between y_{i_1} and y_{i_2}, it is more informative to use the second approach, as confirmed experimentally. Therefore, we used these counts to compute, coordinate by coordinate, the indices described in the previous subsection.
Finally, for a given J, if we denote by S_{k,j} the selected index computed for the pair (k, j), the band-based similarity measure that we propose is simply the average of the coordinate-wise indices across the different values of j:

S^{(J)}(y_{i_1}, y_{i_2}) = \frac{1}{d(J-1)} \sum_{j=2}^{J} \sum_{k=1}^{d} S_{k,j}.

Note that this similarity index is related to shape rather than to spatial proximity, which is appropriate for gene expression data. Note as well that each coefficient built in this way is affine invariant and provides a sample-dependent similarity measure, making it suitable for data-driven tasks such as classification or clustering.

Efficient computation of MBD and contingency tables. A direct implementation of expressions (1) and (2) to compute the MBD leads to nested for loops that become time-consuming as J increases. To overcome this problem, current applications of the MBD 51,52 use the value J = 2, both because it is computationally faster and because the order induced in the sample is very stable in J 34. However, the effective construction of the \binom{n}{j} j-bands formed by j distinct observations from the sample becomes intensive as n increases, even for j = 2.
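As an illustration of this direct (exhaustive) approach, the proposed band-based measure with the Simpson index can be sketched as follows (brute force; helper names and the zero-denominator convention are ours):

```python
import numpy as np
from itertools import combinations

def band_similarity(Y, i1, i2, J=3):
    """Brute-force band-based Simpson similarity between rows i1 and i2 of Y.

    For each j = 2..J and each coordinate k, a contingency table is built from
    the C(n, j) j-bands, the Simpson index is computed coordinate-wise, and the
    results are averaged over k and j."""
    Y = np.asarray(Y, float)
    n, d = Y.shape
    vals = []
    for j in range(2, J + 1):
        for k in range(d):
            col = Y[:, k]
            a = b = c = 0
            for t in combinations(range(n), j):          # every j-band
                lo, hi = col[list(t)].min(), col[list(t)].max()
                in1 = lo <= col[i1] <= hi
                in2 = lo <= col[i2] <= hi
                a += in1 and in2
                b += in1 and not in2
                c += in2 and not in1
            vals.append(a / min(a + b, a + c) if min(a + b, a + c) else 1.0)
    return float(np.mean(vals))
```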
To speed up computations, Torrente et al. 53 provided an implementation of MBD_{n,2} that avoids nested loops. Instead of exhaustively searching for all pairs of points in the sample, the alternative calculation is based on storing the data in an n × d matrix Y and sorting each column in increasing order. Then

MBD_{n,2}(y_i) = \frac{1}{d \binom{n}{2}} \sum_{k=1}^{d} \left[ \binom{\eta_k}{2} + \eta_k (n - \eta_k) + (l_k - 1)(n - l_k - \eta_k + 1) \right],

where η_k = η_{k,y_i} is the multiplicity of y_{i,k} within the k-th column of Y, and l_k = l_{k,y_i} is the position of the first occurrence of y_{i,k} in the same column of the reordered matrix. Also, we assume that \binom{\eta_k}{2} = 0 whenever η_k = 1.
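The sorting-based computation with the tie quantities l_k and η_k can be sketched and checked against naive enumeration over pairs (function names are ours; this is an illustration, not the reference implementation of 53):

```python
from bisect import bisect_left, bisect_right
from itertools import combinations
from math import comb

def mbd2(Y):
    """MBD_{n,2} for each row of Y (list of equal-length rows), via column
    sorting; ties are handled through the first occurrence l and multiplicity eta."""
    n, d = len(Y), len(Y[0])
    cols = [sorted(row[k] for row in Y) for k in range(d)]
    depths = []
    for y in Y:
        total = 0
        for k in range(d):
            l = bisect_left(cols[k], y[k]) + 1           # first occurrence, 1-based
            eta = bisect_right(cols[k], y[k]) - l + 1    # multiplicity of y[k]
            # tied pairs + mixed pairs + (below, above) pairs containing y[k]
            total += comb(eta, 2) + eta * (n - eta) + (l - 1) * (n - l - eta + 1)
        depths.append(total / (d * comb(n, 2)))
    return depths

def mbd2_brute(Y):
    """Naive check: average proportion of coordinates of y_i inside each 2-band."""
    n, d = len(Y), len(Y[0])
    out = []
    for y in Y:
        s = 0.0
        for p, q in combinations(range(n), 2):
            s += sum(min(Y[p][k], Y[q][k]) <= y[k] <= max(Y[p][k], Y[q][k])
                     for k in range(d)) / d
        out.append(s / comb(n, 2))
    return out
```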
For a fixed k with 1 ≤ k ≤ d, a simple combinatorial strategy allows counting the number of pairs of elements from the sample such that the k-th coordinate of y_i lies inside the interval defined by the k-th coordinates of those pairs. The complement principle states that

N^{(2)}_k(y_i) = \binom{n}{2} - \binom{l_k - 1}{2} - \binom{n - l_k - \eta_k + 1}{2},

where again we assume \binom{\alpha}{2} = 0 whenever 0 ≤ α < 2. This principle can also be used to count the number of j-tuples from the sample such that the k-th coordinate of y_i lies between the minimum and the maximum of the k-th coordinates of the j elements:

N^{(j)}_k(y_i) = \binom{n}{j} - \binom{l_k - 1}{j} - \binom{n - l_k - \eta_k + 1}{j},

with \binom{\alpha}{j} = 0 whenever 0 ≤ α < j. Thus, the MBD for J > 2 can be computed, with the computational cost scaling linearly (instead of exponentially) with J, as

MBD_{n,J}(y_i) = \sum_{j=2}^{J} \binom{n}{j}^{-1} \frac{1}{d} \sum_{k=1}^{d} N^{(j)}_k(y_i).

Finally, given two points from the sample, y_{i_1} and y_{i_2}, we make use of the principle of inclusion-exclusion to compute the number of j-tuples containing the k-th coordinates of both points. Thus we write

N^{(j)}_k(y_{i_1}, y_{i_2}) = \binom{n}{j} - \binom{n_{<M}}{j} - \binom{n_{>m}}{j} + \binom{n_{(m,M)}}{j},

where m = \min\{y_{i_1,k}, y_{i_2,k}\}, M = \max\{y_{i_1,k}, y_{i_2,k}\}, and n_{<M}, n_{>m} and n_{(m,M)} denote, respectively, the number of values in the k-th column of Y strictly smaller than M, strictly larger than m, and strictly between m and M. These calculations reveal the shortcut in the method workflow displayed in Fig. 2 (black blocks): instead of computing each j-band and each binary matrix (red blocks), we can obtain the numbers N^{(j)}_k(y_i) and N^{(j)}_k(y_{i_1}, y_{i_2}) directly.

Application to classification and clustering tasks. Finally, we describe how we examine the efficiency of the proposed dissimilarity indices in different standard techniques, run on either simulated models or real data (see the Results section).
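The complement-principle and inclusion-exclusion counts can be checked against exhaustive enumeration over j-tuples; a sketch under our reading of these counting identities (function names ours):

```python
from bisect import bisect_left, bisect_right
from itertools import combinations
from math import comb

def pair_count(col, v1, v2, j):
    """Closed form for N^{(j)}_k(y_{i1}, y_{i2}): number of j-tuples of values
    in `col` whose [min, max] contains both v1 and v2 (inclusion-exclusion)."""
    s = sorted(col)
    n = len(s)
    m, M = min(v1, v2), max(v1, v2)
    below_M = bisect_left(s, M)                      # values strictly below M
    above_m = n - bisect_right(s, m)                 # values strictly above m
    between = max(below_M - bisect_right(s, m), 0)   # values strictly in (m, M)
    return comb(n, j) - comb(below_M, j) - comb(above_m, j) + comb(between, j)

def pair_count_brute(col, v1, v2, j):
    """Naive check by enumerating all j-tuples."""
    m, M = min(v1, v2), max(v1, v2)
    return sum(1 for t in combinations(col, j) if min(t) <= m and max(t) >= M)
```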
A straightforward application of the similarity indices to classification is given by the well-known k Nearest Neighbours (kNN) algorithm 42, a simple, widely used rule that has proven to be very powerful. Though the Euclidean distance is the dissimilarity measure most frequently used with this method, we also assess the algorithm with other classical distances, namely the Manhattan distance, the Minkowski distance for p ∈ {0.25, 0.5, 0.75, 3, 4, 5}, and Pearson's correlation-based distance. The assessment of how each of the selected dissimilarity measures affects the algorithm's performance is carried out in terms of the classification error rate (i.e., the percentage of observations the method misclassifies). To have a common value for all measures, the number k of nearest neighbours used in the simulations was selected by the rule of thumb of setting k equal to the largest odd integer smaller than or equal to the square root of the number of training samples. The only exception is made for one of the real data sets, where two of the classes are very small and selecting k according to this rule resulted in high classification error rates. Ties were broken by choosing the group containing the neighbour closest to the sample to be classified (that is, as if k = 1).
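The rule of thumb for k and the tie-breaking scheme can be sketched as follows (an illustration, not the implementation in the R package class; names are ours):

```python
from collections import Counter
from math import isqrt

def knn_predict(train_X, train_y, x, metric):
    """kNN with k set to the largest odd integer not exceeding sqrt(n_train);
    vote ties are broken by the closest neighbour in a top-voted class
    (i.e., as if k = 1)."""
    n = len(train_X)
    k = max(1, isqrt(n))
    if k % 2 == 0:
        k -= 1
    order = sorted(range(n), key=lambda i: metric(x, train_X[i]))[:k]
    votes = Counter(train_y[i] for i in order)
    top = max(votes.values())
    tied = {cls for cls, v in votes.items() if v == top}
    return next(train_y[i] for i in order if train_y[i] in tied)
```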
For artificial data, we generated B a = 200 pairs of independent training and test sets, and used the test sets to evaluate the classification error. For real data, we estimated this error with 10-fold internal cross-validation B r = 200 times (see 54 ).
Additionally, we selected PAM, a deterministic method, to illustrate the performance of the dissimilarity indices in clustering artificial data with respect to the distances mentioned above. For each dissimilarity measure, the partitions produced by the algorithm were compared to the true ones using the clustering error rate described in 1; that is, for each partition, we permuted the cluster labels and selected the minimum proportion of samples whose labels disagree with the true ones. In addition, we computed the adjusted Rand index (ARI) 55 to assess the quality of the clustering output, comparing the labels produced by the PAM algorithm to the real ones.
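Both evaluation measures can be sketched in a few lines: a label-permutation search for the clustering error rate, and the Hubert-Arabie pair-counting form of the ARI (our own sketch; the error-rate helper assumes the same number of clusters in both labellings):

```python
from itertools import permutations
from math import comb

def clustering_error(true, pred):
    """Minimum disagreement proportion over relabellings of the predicted
    clusters (assumes as many predicted clusters as true classes)."""
    pred_labels = sorted(set(pred))
    best = len(true)
    for perm in permutations(sorted(set(true))):
        mapping = dict(zip(pred_labels, perm))
        best = min(best, sum(mapping[p] != t for p, t in zip(pred, true)))
    return best / len(true)

def adjusted_rand_index(true, pred):
    """ARI from the pair-counting contingency table (Hubert & Arabie)."""
    ct, rows, cols = {}, {}, {}
    for t, p in zip(true, pred):
        ct[(t, p)] = ct.get((t, p), 0) + 1
        rows[t] = rows.get(t, 0) + 1
        cols[p] = cols.get(p, 0) + 1
    n = len(true)
    index = sum(comb(v, 2) for v in ct.values())
    sa = sum(comb(v, 2) for v in rows.values())
    sb = sum(comb(v, 2) for v in cols.values())
    expected = sa * sb / comb(n, 2)
    max_index = (sa + sb) / 2
    return (index - expected) / (max_index - expected) if max_index != expected else 1.0
```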
For real data, we considered complete-linkage hierarchical clustering instead, as the biological results yielded by this method are easier to visualise and interpret.
All these methods were used as implemented in the R statistical environment (http://www.R-project.org), in the packages class, cluster and stats, respectively.

Results
Simulation study. For the simulated data, we considered normally distributed models with different dimension and number of groups, as detailed in Table 1.
Application to classification. We evaluated the performance of kNN with the band-based similarity indices with respect to that resulting from the use of the standard distances. However, prior to such a comparison, we carried out an analysis of the influence of the value of J on the classification performance. To that end, we considered J ∈ {2, 3, . . . , 10} for the models proposed.
For each of these models, we generated 200 training sets with 100 observations in each group, and 200 independent test sets with 25 elements in each class. We computed the classification error rate in each test set.
Supplementary Figure S1 shows the average classification error rate for each model, along with the corresponding standard errors. This evinces the stability of all the methods (except RR) as J increases, in agreement with the stability in J of the ordering induced by the MBD in the data 34, even for values of J that have not previously been reported in the literature. We can also establish a rather persistent ranking of these indices for the selected models: clearly, S is the best option, followed by O, whereas RR and A stand at the opposite end. However, there is a slight trend for performance to deteriorate as J grows, more marked for some indices than for others. According to these results, hereinafter we report only the results corresponding to J = 2 and 3 in the comparison with classical distances, as these are the values suggested for practical use.
Model 1: It is taken from 56 to mimic simple gene expression data and consists of two overlapping groups of points with 100 independent variables, drawn from the normal distributions N(0_{100}, I_{100}) and N(0.5_{100}, I_{100}), where α_s denotes a vector of s identical components equal to α and I_s is the identity matrix of size s × s.
Here, most of the methods behave appropriately; see Fig. 3, upper-left panel. The classification error rates of the Euclidean distance, S, the Manhattan distance and the Minkowski distance with p = 0.75, 3, 4 are similar and very low, with averages below 0.03. Only A, SM and RR have errors an order of magnitude higher, and the Pearson distance gives drastically poor results.

Table 1. Description of the model parameters. The matrix I_n is the identity of dimension n. The matrix A_{i;n} is an n × n matrix with all entries equal to zero, except the i-th position in the main diagonal, which equals 1. Models 1 and 2 consist of two Gaussian clusters; models 3-5 are Gaussian mixture models with fixed levels of overlap.

Model 2: It also has two groups, now with 10 independent variables, the last five of which are noise. The standard deviation of a single noise variable is increased by an amount δ ∈ {0, 1, 2}. The results corresponding to both extreme values are displayed in Fig. 3, top-middle and top-right panels.

When δ = 0, the classical distances classify the new observations better than the band-based dissimilarity indices, but they are closely followed by S: if sorted from lowest to highest average error rate, S ranks 7th and 9th for J = 2 and 3, respectively. Nevertheless, all the band-based measures except RR yield average error rates below 0.05. Again, the Pearson correlation has the worst results. However, this model poses a critical scenario for the classical distances as the parameter δ rises: for δ = 2, their error rates deteriorate and S becomes the best option, whereas, notably, the band-based alternatives (except RR) remain robust to the value of δ.

Models 3-5: These models are difficult scenarios for classification. The error rates are higher than in the previous cases, and their distributions have larger interquartile ranges and contain many outliers for all the methods (see the lower panels of Fig. 3). Despite that, the best techniques are again S and the Euclidean distance, followed by the Manhattan and Minkowski (p = 3, 4) distances.
Application to clustering. Next we evaluated the performance of the band-based indices for clustering methods. We generated 50 samples independently for each cluster, and simulated 200 data sets for Models 1-5. We applied PAM with all the band-derived measures and distances selected.
An inspection of the consistency of the results of each method across different values of J again shows stable behaviour; Supplementary Figures S2 and S3 illustrate, respectively, clustering error rates and ARIs, again with a trend to deteriorate as J increases. As in the classification case, note that Supplementary Figs. S2 and S3 allow the methods to be ranked from best to worst consistently, regardless of the simulated model considered.
With respect to the overall comparison, Tables 2 and 3 display clustering error rates and ARIs, averaged over the 200 simulated data sets, along with their standard errors (in brackets).
The best similarity measures for clustering Model 1 are S and the Euclidean distance, followed by the Minkowski (p = 3) and Manhattan distances. Pearson, RR, A, F and (surprisingly) J have the worst performance, with average clustering errors above 35% and average ARIs below 0.1.
For Model 2, the performance of the different methods resembles that observed in classification. The Euclidean distance, followed by the Manhattan and Minkowski distances (for most of the reported values of p) and then by S, are the best alternatives when δ = 0. However, when δ = 2, the Euclidean and Minkowski (p ≥ 3) distances yield drastically high clustering error rates and low ARI values, as reflected in Tables 2 and 3; the Manhattan and Pearson distances also deteriorate remarkably. By contrast, S stays rather stable and emerges as the best option. The other band-based indices improve their performance, but they are still not capable of beating S.
With respect to Models 3-5, only the Pearson distance, among the classical distances, is a competitor for S, with ARI values slightly smaller than those of S (J = 3) and a better clustering error rate in Model 5. The other classical distances follow these techniques, whereas the use of the other band-based indices, especially A and RR, is clearly discouraged. Thus, in this context, the band-based dissimilarity index S proves consistently competitive with the classical distances, especially the Euclidean one, outperforming it in some of the simulated scenarios. It also beats D, F, O and, notably, the popular Jaccard index, though these have a reasonable performance. On the other hand, RR, A and SM are not efficient at reflecting similarities between samples.

Real gene expression data.
To explore the effectiveness of the band-based indices with gene expression data in both classification and clustering techniques, we analysed different publicly available data sets, from both microarray and RNA-seq technologies.
Microarray data. All the raw data sets were preprocessed following the normalisation described in 2 .
In kNN, the number of genes considered in each iteration was reduced to 200 in a supervised manner, using the B/W criterion 58 . This feature selection was performed for each training set.
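A sketch of our reading of the B/W criterion (the between-to-within sum-of-squares ratio per gene, as described in ref. 58; function and variable names are ours):

```python
import numpy as np

def bw_ratio(X, y):
    """Between-to-within sum-of-squares ratio for each gene (column of X),
    given class labels y; larger values indicate better class separation."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for cls in np.unique(y):
        Xc = X[y == cls]
        centroid = Xc.mean(axis=0)
        between += len(Xc) * (centroid - overall) ** 2
        within += ((Xc - centroid) ** 2).sum(axis=0)
    return between / within

# keep the 200 top-ranked genes of a (samples x genes) training matrix:
# top = np.argsort(bw_ratio(X_train, y_train))[::-1][:200]
```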
For the (unsupervised) clustering approach, we kept the 200 most variable genes and clustered the samples, for which we knew the group labels (i.e., the annotations). We do not report here the results corresponding to the use of PAM, because it produced poor results in all the cases when the number of clusters was set equal to the number of classes. This is a common situation in gene expression data when samples are clustered. Therefore, instead of PAM, we used hierarchical clustering in conjunction with heatmaps, as these can be visually examined.
Lymphoma data set. It is a subset of the data set presented in 6; it contains 62 samples from three of the most prevalent lymphoid malignancies: 42 diffuse large B-cell lymphomas (DLBCL), 9 follicular lymphomas (FL) and 11 chronic lymphocytic leukemias (CLL). Figure 4, top-left panel, displays the distribution of the 10-fold cross-validation classification errors made by kNN, that is, the proportion of wrongly classified elements, for the band-based indices and for the classical distances. A, D, J, all Minkowski distances and also F (when J = 2) and O (when J = 3) produce zero error rates in most of the iterations, and are notably better than the classical Euclidean, Manhattan or Pearson distances. On the contrary, SM and, strikingly, S are slightly worse; RR misclassifies many more samples than the other methods.
On the other hand, after filtering the genes for clustering, we computed, for each pair of samples, the bandbased dissimilarity measures, as well as the other classical distances. We used these to build a hierarchical tree using complete linkage.
We illustrate some of the results produced by the different indices in Fig. 5. In particular, we display the dendrograms and heatmaps corresponding to the Euclidean distance and to S and O, with J = 2, as these are, in general, the best choices. The plots corresponding to the other measures are included in Fig. S4 of the Supplementary material.

Table 3. Average ARI over 200 simulations for all the simulated models; standard errors are written in brackets. Best and second best performance are highlighted in bold and italics, respectively.

With respect to classification, we report in Fig. 4, top-middle panel, the distribution of the classification error rates for the compared methods. Clearly, all Minkowski options provide the best results, with average error rates below 0.02, whereas RR with J = 3 behaves worst. The classical Manhattan, Euclidean and Pearson distances perform similarly to most of the band-based indices, with average error rates between 0.15 and 0.18, except for S, whose error rate is slightly below 0.15. Figure 6 displays the hierarchical clusterings using the Euclidean distance and the band-based indices S and O, with J = 2. The Euclidean distance creates a dendrogram whose two main branches contain both normal and cancer samples; the leftmost branch contains two well-separated sub-branches (normal and cancer), whereas the rightmost one mixes both types. The heatmap evinces that the two groups are indeed difficult to tell apart. This is not surprising, as gene expression in normal gastrointestinal tissues is known to be similar to that of neoplastic samples (see, for example, 60).
The use of S results in a similar situation, regardless of the value of J: both main branches in the tree contain normal and cancerous samples, but the second one contains two neater sub-branches, roughly corresponding to the two classes. An analogous behaviour can be observed for most of the other methods, as illustrated in Fig. S5 of the Supplementary material. Nevertheless, SM and the Minkowski distance with p = 3 reflect the similarities between observations more accurately and find a clearer structure in the data according to disease status: most of the samples of each type are placed together in the dendrogram, forming two distinct branches, with only a few misclassified leaves.
Leukemia data set. This data set was made publicly available in 61 , and contains the gene expression levels in 72 leukemia samples, 25 of which were AML and 47 were ALL; additionally, 38 ALL samples originated from B-cells and 9 from T-cells. After the preprocessing steps, the expression matrix included 3571 genes and 72 samples. We first analysed the data considering two classes, AML and ALL, and next, also distinguishing the three groups.
The results corresponding to classification are illustrated in Fig. 4, top-right and bottom-left panels. The Minkowski distances behave best in kNN for the 2-class case. The Manhattan and Pearson distances, as well as most of our band-based indices, outperform the Euclidean distance; only S and SM (when J = 2) fail to do so, in addition to RR, which, as before, is much worse. With respect to the 3-class case, the performance of S improves drastically, reaching error rates comparable to or even better than those of the Minkowski options. Setting RR aside, the Euclidean, Manhattan and Pearson distances are the worst choices.

With respect to clustering, the top dendrogram in Fig. 7, left panel, shows that the Euclidean distance does not correctly recover the biological truth of the data, as it allocates some of the ALL samples (in fact, all the T-ALL samples; see the hierarchical tree on the vertical axis) together with the AML samples. The Euclidean distance also misplaces one more AML sample in the dendrogram. However, the Simpson index finds the AML/ALL structure, wrongly assigning only 2 samples for J = 2. Moreover, S identifies the subclusters containing T-cell ALL and B-cell ALL samples. O (J = 2) also separates the three classes, though wrongly assigning 6 samples.
Results corresponding to the alternative band-based indices and classical distances are shown in Fig. S6 of the Supplementary material, where we can see that most of the similarity measures are capable of retrieving the 2-group structure. The exceptions are RR, SM and Minkowski with p = 5, which mix both types of cells. However, only F, J and the Manhattan distance identify the three groups, with some wrongly allocated samples.
RNA-seq data. Next Generation Sequencing (NGS) provides more accurate data than microarrays, though its analysis is more challenging and complex. In this section, we additionally test the performance of the proposed indices with two NGS data sets.
Pan-cancer data set. This is a subset of the PANCAN data set 62. The good separability of its samples leads to excellent classification results with the kNN algorithm: the Minkowski distances correctly classify all the samples in 25% of the iterations, though their median (corresponding to 1 misclassified element) matches that of S. The other band-based indices, except for RR, perform slightly worse, but still better than the Pearson, Manhattan and Euclidean distances.
On the other hand, due to the massive size of the data set, for clarity we only visualise the corresponding dendrograms in Fig. 8, where wrongly allocated samples are highlighted. The good separability of the samples allows all the methods to identify the five types of cancer with high accuracy. The Euclidean distance places two kidney cancer samples and one lung cancer sample inside the breast cancer branch, and identifies the kidney cancer type as the most distinct one (last merge of the dendrogram). A similar performance can be observed for O (J = 2), with the same wrongly assigned KIRC and LUAD samples, though this time they are found to have some differences, as they are placed in different sub-branches of the BRCA branch. On the contrary, S (J = 2) assigns only one of the KIRC samples to the BRCA branch, and finds KIRC to be more similar to COAD and LUAD.
None of the classical distances achieves better results (see Fig. S7 of the Supplementary material). For most of them, the number of wrongly allocated samples ranges from 2 to 4, and the Minkowski distance with p = 0.25 splits the LUAD samples into two distinct branches of the hierarchical tree. The other band-based indices have a similar performance, although using J = 3 leads, in general, to slightly worse assignments, especially with the Jaccard index. Notably, RR and SM identify the correct structure without any error for both values of J.
FFvsFFPE data set. This data set was downloaded from the ArrayExpress database 65, under accession number E-MTAB-2523. The experiment included 86 RNA-seq samples representing six different human tissues: bladder carcinoma, prostate carcinoma, colon carcinoma, normal liver, normal colon and reactive tonsil 66. These tissues were stored under two different conditions, formalin-fixed, paraffin-embedded (FFPE) and fresh frozen (FF), which affect nucleic acid extraction. The experimental design defines 9 known distinct classes, indicated in Table 4 along with their sizes. The trimmed RNA-seq reads were mapped against gene regions, and transcripts were annotated using the RNA-Seq Analysis tool of CLC Bio, version 6.0.1; the final processed files contained RPKM values for 19137 genes.

First, as mentioned before, if we stick to the given rule for selecting k, the classification error rates in kNN are remarkably higher than in the previous data sets. This is because some of the groups are very small and do not have enough representatives to obtain the majority vote; in such cases, average error rates are of the order of 0.25. Therefore, we show in Fig. 4, bottom-right panel, the results corresponding to k = 3 for this data set. As in the previous data sets, the Minkowski distances have the best performance, followed by S and SM. The worst behaviour corresponds to the Euclidean and Pearson distances and the RR index, whereas the remaining measures lie in between, with similar average error rates around 0.15. Next, we built the heatmaps and hierarchical trees associated with the clustering procedure; see Fig. 9. The Euclidean distance leads to a heatmap with two differentiated parts, corresponding to FFPE and FF samples, respectively. These are indicated with a vertical black line; meaningful branches, identified by visual inspection, are highlighted with black dots.
Inside the FFPE block, which also includes some FF bladder carcinoma samples, we can identify a branch with all liver and tonsil samples and most FFPE bladder cancer samples, although a neat difference between these cell types is not found. All FFPE prostate and most FFPE colon tumours are clustered together in another branch. On the FF side, the samples are correctly arranged according to the tissue of origin, except for a few FFPE samples. Cancer and normal colon tissues are placed in the same branch, as expected 60 .
Interestingly, the S and O indices, shown in the middle and right panels of Fig. 9, produce a different arrangement. To the left of the tree we find a branch with tonsil samples, grouped in a more consistent manner than with the Euclidean distance; most liver samples are also closely located in the tree. There is a branch containing all prostate cancer samples, subclassified according to the storage condition, in contrast to the previous result. Another sub-branch contains colon samples, regardless of the disease condition. S separates both types of bladder samples into distinct sub-branches, whereas O finds a group with mostly FF samples, while spreading the FFPE ones along the hierarchical tree. Figure S8 in the Supplementary material shows the behaviour of the other techniques. None of them is capable of producing a neat separation between the classes. All the classical distances yield inaccurate branches, except for Minkowski with p = 0.25 and 0.5, which identify most of the tissues of origin but fail to retrieve the FFPE bladder carcinoma branch. On the other hand, most of the band-based similarities produce good results, as they are also able to identify differences involving the tissues of origin. This indicates that, in this data set, the band-based indices generally find more biologically meaningful similarities than the classical distances.
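Like the clustering step, the kNN comparisons reported for these data sets operate on precomputed pairwise dissimilarities rather than on raw coordinates; a minimal sketch using scikit-learn's precomputed-metric mode follows (the helper name and the Euclidean example in the comments are illustrative only, not the study's implementation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(D_train, y_train, D_test_vs_train, k=3):
    """kNN classification from precomputed dissimilarities.

    D_train:         (n_train, n_train) pairwise dissimilarities among
                     the training samples
    D_test_vs_train: (n_test, n_train) dissimilarities from each test
                     sample to every training sample
    Any dissimilarity (band-based, Euclidean, Minkowski, ...) can be
    supplied, since only the matrices are needed.
    """
    clf = KNeighborsClassifier(n_neighbors=k, metric="precomputed")
    clf.fit(np.asarray(D_train), np.asarray(y_train))
    return clf.predict(np.asarray(D_test_vs_train))
```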
It is worth noting that, since the class labels are known, it is possible to filter the genes using the B/W criterion for clustering as well, but using the most variable genes suits the unsupervised philosophy better. Nevertheless, we repeated the analyses selecting the genes with that criterion. The results, in terms of structure recovery, were very similar to the ones previously discussed.
In summary, we conclude from the previous results that, even though not all the dissimilarity coefficients based on bands behave appropriately for every (simulated) data set, when it comes to efficiently assessing biological differences between samples they emerge as suitable alternatives, especially in clustering tasks. In particular, the Simpson and the Ochiai indices are, globally, the best options: they consistently show good, or even the best, performance across our variety of experiments.
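For reference, the B/W gene-filtering criterion used above is the between-to-within sum-of-squares ratio computed gene by gene, as commonly defined in the gene expression literature; a minimal sketch, with illustrative helper names:

```python
import numpy as np

def bw_ratio(X, y):
    """Between-to-within sum-of-squares ratio per gene (columns of X).

    Larger values flag genes whose expression separates the known
    classes well, so genes can be ranked and the top ones retained.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        between += len(Xk) * (mk - overall) ** 2   # class size * squared mean shift
        within += ((Xk - mk) ** 2).sum(axis=0)     # scatter around the class mean
    return between / within

def top_bw_genes(X, y, m):
    """Indices of the m genes with the largest B/W ratio."""
    return np.argsort(bw_ratio(X, y))[::-1][:m]
```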

Discussion
Our findings from the Results section suggest that the proposed technique for constructing a similarity or dissimilarity measure for quantitative data is effective for analysing gene expression data. However, there are always scenarios where a particular dissimilarity/distance measure is not the best option, as has been illustrated in this study.
We have first evaluated the performance of these indices in relation to J, the largest number of different observations considered to form the j-bands. This is computationally feasible thanks to the efficiency of the proposed method. Our experiments show, as expected, that the results produced for different values of J are very similar, sometimes with a tendency to deteriorate as J increases, a fact that has not been previously reported. Therefore, it is enough to set J equal to 2 or 3 to get satisfactory results quickly. We can rank the band-based dissimilarity indices that we have considered from best to worst and identify which of them systematically behave badly (or worse than their competitors) in the analysed simulated data sets, i.e., with distributions that are spherical or have ellipsoidal weighted components. These are the RR and A measures, which seem less suitable for classifying or clustering this kind of data; this is consistent with what has been reported in several studies with respect to RR. Surprisingly, despite being one of the most popular indices, the Jaccard index does not perform as well as expected. Instead, S is clearly the best option in most of the tested data sets, as it achieves lower classification/clustering error rates and higher ARIs than every other dissimilarity measure, including, for instance, the benchmark Euclidean distance.
When analysing gene expression data, the differences among the band-based methods are much smaller, as most of them retrieve the underlying biological structure. The variety of analysed data sets illustrates several situations where some of the band-based indices outperform the classical distances, especially in clustering tasks, earning their right to be considered suitable similarity measures in the gene expression context. For example, the ability to cluster together samples from the same tissue of origin but stored under different conditions, or to better arrange samples according to disease status, appears to be of particular interest. As opposed to the simulated Gaussian case, the Jaccard index performs more competitively in this context.

Conclusions
Different applications, such as pattern recognition or classification of multivariate data, require a similarity measure S or a dissimilarity measure D. The choice of an appropriate measure depends on the data to be analysed and, in particular, on the nature of their variables. Thus, for example, the Euclidean distance is a common measure for continuous data, whilst binary data are often examined with indices such as Jaccard's or Simpson's.
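With the a, b, c, d counts of the 2 x 2 contingency table defined in the Methods section, the binary similarity indices discussed throughout can be written down directly; a minimal sketch covering a subset of them (degenerate tables with an empty row or column are not guarded against, and the study also considers further indices, e.g. Forbes, not listed here):

```python
import numpy as np

def contingency(x1, x2):
    """2x2 contingency counts (a, b, c, d) for two binary vectors,
    with the same meaning as in the Methods section."""
    x1, x2 = np.asarray(x1, dtype=bool), np.asarray(x2, dtype=bool)
    a = np.sum(x1 & x2)    # attribute present in both
    b = np.sum(x1 & ~x2)   # present only in the first
    c = np.sum(~x1 & x2)   # present only in the second
    d = np.sum(~x1 & ~x2)  # absent in both
    return a, b, c, d

def similarity(x1, x2, index="simpson"):
    """Standard binary similarity indices computed from (a, b, c, d)."""
    a, b, c, d = contingency(x1, x2)
    n = a + b + c + d
    table = {
        "jaccard": a / (a + b + c),
        "simpson": a / min(a + b, a + c),
        "ochiai": a / np.sqrt((a + b) * (a + c)),
        "dice": 2 * a / (2 * a + b + c),
        "russell-rao": a / n,          # RR
        "sokal-michener": (a + d) / n  # SM (simple matching)
    }
    return float(table[index])
```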
In this work we propose a similarity/dissimilarity measure for continuous data based on binary features associated with the original observations. To that end, we make use of the idea underlying the construction of the MBD, a data depth notion that has proven useful and efficient in the analysis of complex data, such as gene expression data. More precisely, for each coordinate of a given observation we build a binary vector that records how many bands in the sample contain that coordinate. These vectors allow us to construct contingency tables as in the presence/absence context, which become the input of standard similarity indices. The computation of the contingency tables is very efficient, as it is possible to determine the corresponding values without evaluating each band. Instead, we use a combinatorial approach after sorting the columns of the data in increasing order.
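For J = 2, the binary vector attached to an observation can be sketched band by band; the following direct O(n^2 p) version is for illustration only, since the combinatorial shortcut described above obtains the same counts from the column ranks without enumerating the bands:

```python
import numpy as np
from itertools import combinations

def band_vector(X, x):
    """Binary band-membership vector of an observation x.

    X is an (n, p) sample. For every pair of rows (g, h) and every
    coordinate k, the bit is 1 iff
        min(X[g, k], X[h, k]) <= x[k] <= max(X[g, k], X[h, k]),
    i.e. the k-th coordinate of x lies inside the 2-band defined by
    g and h. The result has length C(n, 2) * p.
    """
    X, x = np.asarray(X, dtype=float), np.asarray(x, dtype=float)
    bits = []
    for g, h in combinations(range(len(X)), 2):
        lo = np.minimum(X[g], X[h])   # lower envelope of the 2-band
        hi = np.maximum(X[g], X[h])   # upper envelope of the 2-band
        bits.append((lo <= x) & (x <= hi))
    return np.concatenate(bits).astype(int)
```

Feeding the band vectors of two observations into the contingency-table counts of the Methods yields the band-based indices. Ignoring ties, the number of 1-bits at a given coordinate equals C(n,2) - C(n_below,2) - C(n_above,2), where n_below and n_above count the sample values strictly below and above that coordinate, which is consistent with the rank-based shortcut mentioned above.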
We have used this strategy to analyse a collection of simulated and real gene expression data sets with supervised and unsupervised classification. In summary, it is apparent that the binary matrices constructed from the bands keep track of similarities appropriately, though not all indices make sound use of this information in every scenario. For simulated data sets with (Gaussian) spherical or ellipsoidal weighted-component distributions, the Simpson index is the best choice, followed by the Ochiai index. In contrast, for more complex real gene expression data, additional band-based indices (e.g., the Dice index, the Forbes index, in agreement with recent work, or the popular Jaccard index) come to the fore as better alternatives than the widely used Euclidean, Manhattan or Pearson distances. In some cases, S and O are not the best options, yet their performance is good enough for them to remain relevant choices. The different Minkowski distances are the most appropriate for classification tasks in these data sets, but are outperformed by S and O in clustering. In brief, S and O are sound options that produce competitive, sometimes complementary, outputs; a joint analysis using both is therefore suggested.
In any case, the technique presented in this study to construct a (dis)similarity measure based on bands from the MBD notion proves to be a suitable alternative, one that can be extended to any of the multiple similarity indices defined for binary data and thus applied to quantitative data.