A data-driven approach reveals disease similarity on a molecular level

Could there be unexpected similarities between different studies, diseases, or treatments at the molecular level, due to common biological mechanisms? To answer this question, we develop a method for computing similarities between the empirical, statistical distributions of high-dimensional, low-sample datasets, and apply it to hundreds of -omics studies. The similarities give rise to dataset-to-dataset networks visualizing the landscape of a large portion of biological data. Potentially interesting similarities connecting studies of different diseases are assembled in a disease-to-disease network. Exploring it, we discover numerous non-trivial connections between Alzheimer’s disease and schizophrenia, asthma and psoriasis, or liver cancer and obesity, to name a few. We then present a method that identifies the molecular quantities and pathways that contribute the most to the identified similarities and could point to novel drug targets or provide biological insights. The proposed method acts as a “statistical telescope” providing a global view of the constellation of biological data; readers can peek through it at: http://datascope.csd.uoc.gr:25000/.

A subset of the BioDataome collection of datasets for data-driven biology 1, which comprises ~5600 datasets of ~260000 samples spanning ~500 diseases, was used in this work. This subset includes 978 datasets from six different -omics measurement technologies (five transcriptomics and one epigenomics). The datasets were selected with the following criteria:
1) They include at least 40 samples. The exact threshold of 40 was chosen arbitrarily as a minimum threshold to obtain statistically reliable results in distributions of about 50000 or even 500000 dimensions (molecular quantities measured).
2) They share no common samples, as common samples may lead to similarities identified due to the shared samples rather than the molecular underpinnings that may reveal interesting biology.
3) They require at least three principal components to explain at least 50% of the variance, as determined by the principal component analysis that we perform as a first step of the proposed method (see Section 2).
In BioDataome, all datasets are automatically annotated with a disease term from the Disease Ontology 2. Here, we further manually checked these automatically assigned disease terms for possible mislabels and also assigned either a disease or a phenotype term (e.g., GSE11761 studies "response to exercise") to any unlabeled datasets. The distribution of the diseases/phenotypes of the datasets is shown in Supplementary Figure 1. Among all diseases, "breast cancer" and "lung cancer" enjoy the maximum share. In Supplementary Figure 2
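The selection criteria above can be sketched in code. This is a minimal illustration, not the authors' pipeline: `passes_selection_criteria` is a hypothetical helper, criterion 3 is read here as "at least three principal components are needed to reach 50% of the variance", and criterion 2 (no shared samples) requires cross-dataset metadata and is therefore omitted.

```python
import numpy as np

def passes_selection_criteria(X, min_samples=40, n_pcs=3, min_var=0.50):
    """Sketch of criteria 1 and 3 for one dataset X (samples x features).

    Criterion 1: at least `min_samples` samples.
    Criterion 3 (one reading): at least `n_pcs` principal components
    are needed to explain a fraction `min_var` of the total variance.
    """
    s, n = X.shape
    if s < min_samples:
        return False
    Xc = X - X.mean(axis=0)                       # center the data
    # Squared singular values of the centered data are the PCA variances.
    var = np.linalg.svd(Xc, compute_uv=False) ** 2
    frac = np.cumsum(var) / var.sum()             # cumulative explained variance
    needed = int(np.searchsorted(frac, min_var)) + 1  # PCs needed to reach min_var
    return needed >= n_pcs
```

A dataset dominated by a single strong component would be rejected, since one principal component alone already explains more than half of the variance.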

The matrix $\Lambda_2^{-1}$ is a diagonal matrix consisting of the inverses of the eigenvalues $\lambda_i^{(2)}$, and the matrix $S$ consists of the inner products $s_{i,j} = u_i^T v_j$ of the (column) eigenvectors of $P$ and $Q$, over all pairs. We use the notation $[X(i,j)]_{n \times n}$ to denote a square $n \times n$ matrix with element $X(i,j)$ on the $i$-th row and $j$-th column, and obtain expression (12).

Using the basic inequality $x - 1 \geq \ln x$ for $x > 0$, which is an equality only when $x = 1$, we see that all terms in (12) are non-negative; hence the value 0 is the minimum possible value of (12), and therefore also of (11). The minimum value of any linear function over a polytope is attained at a vertex. The vertices of $\Pi$ are the permutation matrices. On such a vertex we obtain equalities among eigenvalues, contrary to our hypothesis of discrete spectra.

It is easy to prove the converse, i.e., that when $\Sigma_1 = \Sigma_2$, then $D(\Sigma_1 \| \Sigma_2) = 0$.

In omics datasets the number of variables of datasets DSP, DSQ is relatively large ($n > 50{,}000$), while the number of samples $s$ is much smaller (e.g., $s < 100$). This makes the sample covariance matrices rank deficient, with most eigenvalues equal to zero; this renders formula (4) inapplicable (due to divisions by zero).

In order to make (4) applicable in the aforementioned cases, we process ('curate') the spectrum of the eigenvalues by adding the term $\sigma^2$ to all of them.

To prepare a detailed presentation of this procedure, we mention two rather standard facts for the manipulation of the covariance matrices and their respective eigenvalues.

Next, we consider the eigenvalues indexed in descending order, i.e., $\lambda_1 > \lambda_2 > \dots > \lambda_s$. It is also reasonable to assume that in most typical cases with high-dimensional data and small sample sizes the eigenvalues will not be equal (discrete spectrum). We only maintain the first $c$ eigenvalues and corresponding eigenvectors. The value of $c$ depends on $\alpha$:
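The rank-deficiency problem and the curation fix can be illustrated numerically. A minimal sketch, assuming the standard closed form of the KL divergence between zero-mean Gaussians (which formula (4) evaluates via eigendecompositions); `kl_gaussian` and `curate` are hypothetical helper names, and curation is shown in its "add $\sigma^2$ to every eigenvalue" form, i.e. $\Sigma + \sigma^2 I$:

```python
import numpy as np

def kl_gaussian(S1, S2):
    """KL divergence D(N(0, S1) || N(0, S2)) between zero-mean Gaussians.

    Standard closed form: 0.5 * (tr(S2^-1 S1) - n + ln det S2 - ln det S1).
    Undefined when either covariance has zero eigenvalues.
    """
    n = S1.shape[0]
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(np.linalg.solve(S2, S1)) - n + logdet2 - logdet1)

def curate(S, sigma2):
    """Add sigma2 to every eigenvalue of S; the eigenvectors are unchanged."""
    return S + sigma2 * np.eye(S.shape[0])
```

With $s < n$ the sample covariance is singular and the divergence cannot be evaluated; after curation all eigenvalues are positive and the formula applies.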

Step 1 (filtration): select the smallest index $c$, $1 \le c \le s$, such that the first $c$ eigenvalues account for at least a fraction $\alpha$ of the total variance, i.e., $\sum_{i=1}^{c} \lambda_i \,/\, \sum_{i=1}^{s} \lambda_i \ge \alpha$. The parameter $\alpha$ can be interpreted as the smallest percentage of variance to retain with the selected eigenvectors.
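The filtration step amounts to a cumulative-variance threshold. A small sketch with a hypothetical helper name, assuming the eigenvalues are already sorted in descending order:

```python
import numpy as np

def filtration_index(eigvals, alpha):
    """Smallest c such that the top-c eigenvalues explain at least a
    fraction `alpha` of the total variance.
    `eigvals` must be sorted in descending order."""
    frac = np.cumsum(eigvals) / np.sum(eigvals)   # cumulative explained variance
    return int(np.searchsorted(frac, alpha)) + 1  # first index reaching alpha
```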

In summary, the $\langle\alpha,\sigma\rangle$-curated eigenvalues retain the first $c$ values of the original spectrum, while the remaining ones equal the constant $\sigma^2$. The procedure leaves the eigenvectors intact. All eigenvalues are now positive; hence formula (4) is applicable. In the curated spectrum all the eigenvalues, except the few largest ones, have a small constant value ($\sigma_1^2$ or $\sigma_2^2$), and due to this fact formula (4) is not only applicable but greatly simplified.
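The full curation can be sketched under one consistent reading of the procedure: the smallest $c$ eigenvalues explaining a fraction $\alpha$ of the variance are kept, and all remaining eigenvalues are set to the constant $\sigma^2$ (matching the spectrum assumed by Lemma 5). The helper name is ours.

```python
import numpy as np

def curate_spectrum(eigvals, alpha, sigma2):
    """<alpha, sigma>-curation sketch: keep the leading eigenvalues that
    explain a fraction `alpha` of the variance; set the tail to sigma2.
    Assumes `eigvals` sorted in descending order (zeros padded at the end).
    The eigenvectors are left untouched."""
    ev = np.asarray(eigvals, dtype=float)
    frac = np.cumsum(ev) / ev.sum()
    c = int(np.searchsorted(frac, alpha)) + 1  # filtration index
    out = np.full(ev.shape, sigma2)            # tail: constant sigma2
    out[:c] = ev[:c]                           # head: original spectrum
    return out
```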

Lemma 5: Let $\Sigma_1$ and $\Sigma_2$ be two covariance matrices of dimension $n$, with eigenvectors $u_i$, $v_i$ and eigenvalues $\lambda_i^{(1)}$, $\lambda_i^{(2)}$, respectively. Furthermore, let the last $n - c_1$ eigenvalues of $\Sigma_1$ be equal to $\sigma_1^2$, and the last $n - c_2$ eigenvalues of $\Sigma_2$ be equal to $\sigma_2^2$. The divergence $D(\Sigma_1 \| \Sigma_2)$ is then given by expression (17).

Since the eigenvectors are normalized (that is, they have unit Euclidean norm) and form an orthonormal basis of the whole space, we get $\sum_{i=1}^{n} (u_i^T \cdot v_j)^2 = 1$ for every $j$, yielding expressions (18). Using expressions (18) in the 2nd and 3rd terms of expression (17) for $\mathrm{trace}(\Sigma_2^{-1}\Sigma_1)$ we have:
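The orthonormality identity invoked here can be checked numerically. A small self-contained sketch: two random orthonormal bases are generated via QR decompositions, and the column sums of the squared pairwise inner products are verified to equal 1.

```python
import numpy as np

# Check: for two orthonormal bases U, V of R^n, sum_i (u_i . v_j)^2 = 1
# for every j, because the u_i span the space and each v_j has unit norm.
rng = np.random.default_rng(0)
n = 8
U, _ = np.linalg.qr(rng.normal(size=(n, n)))  # orthonormal columns
V, _ = np.linalg.qr(rng.normal(size=(n, n)))  # orthonormal columns
S = U.T @ V                       # S[i, j] = u_i . v_j
col_sums = (S ** 2).sum(axis=0)   # sum over i, one value per j
assert np.allclose(col_sums, 1.0)
```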

For the 4th term of (17) we obtain expression (20). Using the identities (18) in formula (20), we finally obtain the 4th term of (17).

The problem above selects exactly $k$ variables that make the two datasets appear most similar with respect to the SKL. Equivalently, the problem can be written as:

To validate the proposed dataset similarity method, we randomly divided each dataset into two mutually exclusive parts, A and B (subsequently referred to as "siblings"), and computed the SKL among all pairwise combinations of datasets. We then count how many times each sibling ranks first, second, etc., or greater than 10, over all combinations. We repeat the random splitting 10 times and ensure that each split includes at least 20 samples.

The pseudocode of the "siblings" experiment is given in Supplementary Figure 4
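The bookkeeping of the siblings experiment can be sketched as follows. This is not the authors' pseudocode (which appears in Supplementary Figure 4): `split_siblings` and `sibling_rank` are hypothetical helpers, and the divergence values are stand-ins for the SKL computed by the method described earlier.

```python
import numpy as np

def split_siblings(X, rng, min_samples=20):
    """Randomly split dataset X (samples x features) into two mutually
    exclusive 'siblings' A and B, each with at least `min_samples` samples."""
    s = X.shape[0]
    assert s >= 2 * min_samples, "dataset too small to split"
    idx = rng.permutation(s)
    half = s // 2
    return X[idx[:half]], X[idx[half:]]

def sibling_rank(dist_to_all, sibling_key):
    """Rank (1 = most similar) of the true sibling among all candidates,
    given {dataset_name: divergence} for one query sibling."""
    ordered = sorted(dist_to_all, key=dist_to_all.get)  # ascending divergence
    return ordered.index(sibling_key) + 1
```

Repeating the split several times and tallying how often the true sibling ranks first, second, etc. gives the validation statistic described above.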