Introduction

Of the total of 3,634,743 disease pairs involving 13,034 distinct diseases in clinical data from 13,039,018 individuals1, 78.8% involving all 13,034 diseases have a larger than random probability that they co-occur in one individual. Disease comorbidity, the co-occurrence of distinct diseases in one individual, is an interesting medical phenomenon, and it is important to understand their molecular origins. For example, rheumatoid arthritis, autoimmune thyroiditis, and insulin-dependent diabetes mellitus co-occur, but rheumatoid arthritis and multiple sclerosis do not2. Previously, there have been several efforts to investigate the molecular features responsible for human disease comorbidities3,4,5,6,7,8,9. Some studies focused on particular subsets of diseases4 or ethnic groups, while others investigated the entire human disease network5,6,7,8. For example, ref. 6 applied text mining to search the literature for disease-symptom associations. They then predicted the entire human disease–disease network based on a calculated symptom similarity score. While this approach covers many human diseases, it relies on prior knowledge of symptomatic information; this limits its disease coverage and only explains one phenotype (disease) by another phenotype (symptom). ref. 7 utilized known disease–gene associations from GWAS10 and OMIM combined with a protein–protein interaction network to identify connected disease–gene clusters or modules. Another study also utilized known disease–gene associations and protein–protein interaction networks to characterize disease–disease relationships without requiring gene clusters8; thus, its disease coverage is better than in ref. 7. These studies that used known disease–gene associations are limited by data availability. Indeed, only a small fraction of diseases have known associated genes. For example, ref. 8 only covers 1022 of the 8043 diseases in the Disease Ontology database11, with just 6594 pairs of diseases having a non-zero number of shared genes. Similarly, ref. 7 found that about 59% of 44,551 disease pairs do not share genes and their relationship cannot be resolved based on the shared gene hypothesis. The effect of possibly missed proteins arising from both direct and indirect protein–protein interactions with known interacting proteins are accounted for by the network propagation method in the XD-score8 or by the disease module and network distance of the SAB score7. However, those scores only marginally improve the recall rate of disease pairs that are clinically comorbid compared to that of shared genes in their methods (<10% recall rate by both the XD-score and SAB score).

To address these limitations of existing studies, we developed LeMeDISCO, which extends our recently developed MEDICASCY machine learning approach12 for predicting disease indications and mode of action (MOA) proteins (as well as small molecule drug side effects and efficacy) to predict disease comorbidities and the proteins and pathways responsible for their comorbidity. LeMeDISCO covers 6.5 million pairs of diseases compared to 97,666 pairs by the XD-score8, 44,551 pairs by the SAB score7, and 133,107 pairs by the Symptom Similarity Score6. Assuming that the most enriched comorbid proteins are responsible for disease comorbidity, we determine the most frequent comorbidity enriched MOA proteins. These proteins are then employed in pathway analysis13. As examples, we predict the comorbid diseases, comorbidity enriched MOA proteins, and pathways associated with coronary artery disease (CAD) and ovarian cancer (OC). We note that recently machine learning (ML) methods have been successfully employed in numerous areas of biology12,14,15,16. However, due to MLs “black box” nature, it is not easy to trace back the biological meaning of the predictions and the molecular origin(s) of disease comorbidity. Thus, as in previous works6,7,8, we adopt an explicit score that provides a set of common proteins responsible for comorbidity predictions.

Results

Benchmarking results of LeMeDISCO

To assess its relative performance, we compared the results of LeMeDISCO to three other methods, the XD-score8, the SAB score7, and the Symptom Similarity Score6. The XD-score was calculated as described in ref. 8: Using known disease–gene associations to create a vector representation of the disease by setting 1 for all associated genes and 0 for all others; then the vector was iteratively updated based on the Random Walk with Restart (RWR) algorithm, with a restart probability of p = 0.9 by using the STRING network database. Finally, the XD-score quantifying the relation of two diseases is defined using the updated vectors of two diseases. NG is the number of shared genes between disease pairs8. The SAB score, a protein–protein network-based separation of a disease pair calculated from known disease–gene associations is defined as SAB = <dAB>− (<dAA> +<dBB>)/2, where SAB compares the shortest distances between proteins within each disease A & B7, <dAA> and <dBB> , to the shortest distances <dAB > between A-B protein pairs7. The Symptom Similarity Score was obtained by large scale text mining of the literature for disease-symptom relations represented as a vector, with the similarity score defined as the cosine similarity of the respective vectors6. In this work, a J-score for disease similarity is defined as the Jaccard index17 of two diseases (see Method for details). The disease–disease relations of benchmarking data from Medicare insurance databases are quantified by their relative risk (RR) and φ-score (see Methods section for details)1. The relative risk RR is defined in Eq. 4a and is the probability that two diseases occur in a single individual relative to random. The φ-score is the Pearson’s correlation for binary variables and is defined in Eq. 4b. Diseases in this work are represented by DOID numbers from the Human Disease Ontology database11, and they are in clinical data usually denoted by ICD-9 or ICD-10 classifications18 or their Medical subject headings (MeSH) names19.

Table 1 summarizes the results. We define a true positive comorbidity pair when their clinic log(RR) >0, a predicted positive when XD-score >0, SAB score <0, or the Symptom Similarity Score >0.1 and q value <0.05 for our J-score. Recall is defined as (the number of true positives having score > cutoff or < cutoff for SAB score)/(total number of true positives). We emphasize that in calculating recall, the cutoffs are suggested by the respective work as being either biologically meaningful7,8 or statistically significant6. In addition to Pearson’s correlation coefficient (c.c.), recall and precision, the cutoff independent measures area under the receiver operating characteristic (AUROC) and the area under the precision-recall curve (AUPRC) are also compared.

Table 1 Comparison of LeMeDISCO’s J-score with the XD-score, NG, SAB score and Symptom Similarity Score for correlations with comorbidity quantified by the log(RR) score, φ-score, and recalla.

Mapping the DOIDs to the ICD-9 ID classifications of ref. 1, excluding easy pairs when in the MEDICASCY library two diseases have \(\frac{{shared\; \#\; of\; efficious\; drugs}}{\sqrt{{\#\; disease}1{efficious\; drugs}\times {\#\; disease}2{efficious\; drugs}}} \; > \; 0.9\), we obtain 191,966 disease pairs for use in LeMeDISCO benchmarking. All Pearson’s correlations of the J-score with the log(RR) score (c.c. = 0.116, p value = 0) and φ-score (c.c. = 0.090, p value = 0) are statistically significant (p value <0.05). The recall rate of J-score for this large set is 37.1%, and the AUROC of 0.528 is well better than random of 0.5.

The Permute drug–protein test has an average ± standard deviation from 100 runs 1958.6 ± 144.9 (54.3%) for diseases with non-zero MOA proteins. We note there are still significant correlations, though the absolute c.c. drops from 0.116 to 0.050 (p value = 0.0) for log(RR) and from 0.090 to 0.060 (p value = 0.0) for the φ-score, and the recall drops from 37.1% of true relationships to 8.8% due to that the number of diseases having correctly assigned MOA proteins drops to around half. All other measures are also worse. The p values of the difference between LeMeDISCO and this permutation test are significant for all measures (<0.05).

On average, the Permute drug–disease test only has 55.09 (1.5%) diseases with non-zero MOAs. Its average recall of 0.0137% is much worse than that of the Permute drug–protein test because it loses the correct connections between diseases and proteins. Correlations with both log(RR) (c.c. = 0.0026, p value = 0.24) and the φ-score (c.c. = 0.0029, p value = 0.19) are insignificant. Except for precision, the p values of the difference between LeMeDISCO and this permutation test are all very significant (well below 0.05). The 54.3% average precision is due to its very few predictions (average only ~26). With these few predictions, a random selection of 26 pairs from the 191,966 (75.6% are true positives defined as log(RR) >0) will have a probability of \(\mathop{\sum }\nolimits_{k=14}^{26}{C}_{26}^{k}\) × 0.756k × 0.24426-k = 0.996 of having greater than 54.3% precision. This means the precision is not better than random selection.

To understand the significant difference between the Permute drug–protein and the Permute drug–disease tests, we note that MEDICASCY predicts drug–disease pairs based on two components: One uses the drug’s chemical structure to learn the indications of a drug from those drugs with similar structure. This component is insensitive to whether the drugs’ protein targets change. The other depends on the drug’s protein targets. In the Permute drug–protein test, a permuted drug–protein relation will randomly change the drug’s protein targets to another drug’s. MEDICASCY was applied after the permutation to ensure correct drug–disease relations. Thus, MEDICASCY’s prediction of drug–disease relations still has information from the permuted drug–protein relation and the disease-(through permuted drug)–protein relations are not completely lost. This actually reflects the fact that there are a subset of proteins that occur in many diseases and permuting the drug–protein relationship for this subset does not change the identification of proteins in a given disease. On the other hand, the permuted drug–disease test completely destroys the mapping of the protein (through the drug) to disease.

To compare LeMeDISCO’s J-score to the XD-score, we mapped their ICD-9 disease code to the DOIDs and obtained a subset of 29,658 pairs from their dataset of 97,665 pairs8. As shown in Supplementary Fig. 1 and Table 1, the XD-score has a c.c. of 0.042 (p value = 2.8 × 10−13) with log(RR) and c.c. = 0.071 (p value = 9.7 × 10−35) with the φ-score. Their NG score (the number of shared genes) essentially has no significant correlation with log(RR) with a c.c. of 0.0047 (p value = 0.42) and only shows a correlation of 0.053 (p value = 6.6 × 10−20) with the φ-score. The J-score has much better correlations: c.c. = 0.146 (p value = 0.0) with log(RR), 0.106 (p value = 0.0) with the φ-score. The recall rate of the J-score is 44.5% compared to 6.4% for the XD-score. J-score’s precision (80.6 vs. 77.8%), AUROC (0.531 vs. 0.510), AUPRC (0.812 vs. 0.801) are all better. Supplementary Fig. 1 shows distinct patterns of J-score and XD-score. The data points of the XD-score are mostly concentrated at an XD-score = 0, whereas those of the J-score spread across the full range of 0–1.

For comparison with the SAB score7, the MeSH19 disease names were mapped to their DOIDs. A consensus set of 943 disease pairs from their dataset and ours was obtained. As shown in Supplementary Fig. 2 and Table 1, compared to SAB7, for the 947 disease pairs, LeMeDISCO’s J-score has a c.c. = 0.0986 (p value = 0.0024) with log(RR) and a c.c. = 0.0886 (p value = 0.0065) with the φ-score that are both better than those of the SAB score with a c.c. = −0.0620 (p value = 0.057) with log(RR) and c.c. = −0.0413(p value = 0.205) with the φ-score; both are insignificant. The recall rate of the J-score is 68.6% and is an order of magnitude better than the 8.0% by the SAB score when defining comorbid pairs when the SAB score <0; that is for a biologically meaningful disease–disease relationship7. J-score has AUROC = 0.531 compared to SAB score’s 0.434 that is even worse than random value 0.5 because its dominant SAB score >0 region is worse than random. The J-score also has a better AUPRC (0.798 vs. 0.761). However, J-score’s precision (77.7% vs. 85.3%) is slightly worse. Supplementary Fig. 2 shows that the data points of the SAB score are concentrated in the region SAB >0, whereas those of the J-score spread over the 0–1 region.

Next, a common dataset of 2621 disease pairs was obtained for comparison with the Symptom Similarity Score6. As shown in Supplementary Fig. 3 and Table 1, the Symptom Similarity Score has better correlations of 0.322 (p value = 0.0) than 0.140 (p value = 5.2 × 10−13) by the J-score for the log(RR) and 0.194 (p value = 1.4 × 10−23) than 0.135 (p value = 3.8 × 10−12) by the J-score for the φ-score. It also has better recall (100 vs. 63.7%), AUROC (0.587 vs. 0.512) and AUPRC (0.856 vs. 0.814). However, the Symptom Similarity Score only explains the relationship of one phenotype (symptom) to another phenotype (disease). Nevertheless, all correlations of the J-score are statistically significant. The J-score’s precision is almost identical to that of the Symptom Similarity Score (79.3 vs. 79.6%). We note that Supplementary Figs. 3, 4 show very similar patterns of the J-score and Symptom Similarity Score.

MEDICASCY based MOA protein prediction

The ICD-10 main classification coverage of the 3608 diseases is shown in Fig. 1a. We first examine the number of predicted MOA proteins per indication from MEDICASCY12. Using a q value cutoff of 0.05 and including protein isoforms, the average (median) number of MOA proteins per indication is 1,142.2 (339); the maximal and minimal values are 15,281 (almost half of the total 32,584 screened proteins) for mast cell sarcoma and 0 for 82 diseases. The histogram of the number of MOAs is shown in Fig. 1b. 71.0% (40.6%) of indications have >100 (500) MOA proteins. These associations allowed us to expand the protein repertoire that might be associated with each disease and resemble the statistics from GWAS studies. Below, we describe the use of LeMeDISCO to predict disease comorbidities as well as prioritize these proteins.

Fig. 1: Summary results for 3608 distinct diseases.
figure 1

a ICD-10 main classification coverage across the 3608 diseases. Some diseases are found in multiple groups; they were counted in each group with which they are associated. b Histogram of the number of MOAs. c Frequency (bin size 0.02) and density of the J-score for the ~2 million significant (q value <0.05), non-redundant disease pairs. d Frequency (bin size = 100) and density of the degree (number of edges) of each disease (node). e Fraction of diseases in the giant component of disease–disease network versus the J-score cutoff.

Shared MOA proteins explain disease comorbidity by way of disease–disease relationships

We next examine the overall characteristics of the predicted comorbidity network of 3608 diseases. Eighty-two diseases do not have MOA protein predictions and thus do not have predicted comorbidities. There are a total of 6,507,028 possible pairwise disease associations. Of these, there are 2,137,022 significant pairwise disease associations (q value <0.05) excluding the diagonals given by LeMeDISCO. Out of 3608, 3523 diseases have significant comorbidities. The density and frequency of the J-score for the significant non-redundant pairs is in Fig. 1c, and the density and frequency of the degree (number of edges) for each node (disease) is represented in Fig. 1d. Using a q value cutoff of 0.05, the average (median) number of comorbidities per disease is 608.3 (491). The largest (smallest) number of comorbidities is 2229 for Pneumonia aspiration and the smallest is 1 for these four diseases: glossopharyngeal neuralgia, median arcuate ligament syndrome, toxoplasmosis, hallucinogen dependence. The average closeness \(\pm\) one standard deviation (defined as the reciprocal of the shortest distance to all other nodes: \(\frac{{{{{{{\mathrm{Number}}}}}}\; {{{{{\mathrm{of}}}}}}\; {{{{{\mathrm{other}}}}}}\; {{{{{\mathrm{nodes}}}}}}}}{\sum {{{{{{\mathrm{shortest}}}}}}\; {{{{{\mathrm{distance}}}}}}\; {{{{{\mathrm{to}}}}}}\; {{{{{\mathrm{other}}}}}}\; {{{{{\mathrm{nodes}}}}}}}}\)) of all nodes is 0.535 ± 0.084, indicating that the majority of disease pairs have the shortest distance around 1/0.535. The average betweenness is 1135 ± 1727, i.e., on average, 1135 pairs of diseases have their shortest distance passing through the given disease node. Thus, the disease network is very dense.

The cumulative distribution for the J-score and q values for all of the comorbidities and the top 100 are shown in Supplementary Figs. 5, 6, respectively. The summary statistics of the scores for these thresholds are shown in Supplementary Table 1. What is clear from these figures and Supplementary Table 1, particularly for the top 100 ranked comorbidities, is that the 99.6% top-ranked 100 comorbidities have a q value <0.005. In other words, while a q value threshold of 0.05 is used, in reality, the actual q values employed for subsequent analysis are far more significant.

Around 32.8% of the disease pairs have a q value <0.05. This result is consistent with the 37.1% recall of large-scale benchmarking (see Table 1). As shown in Fig. 1e, the giant component (GP) of the disease–disease network covers the entire network when the J-score is <0.1 and the q value <0.05, i.e., starting from any disease, one can walk to any other disease on the network. As the J-score cutoff increases, the number of diseases in the giant component decreases; however, the decrease is very slow. The rapid decrease only happens around a 0.45 J-score corresponding to an average q value of 1.63 × 10−6 ± 1.81 × 10−4. Thus, the disease network is not only dense, but it is also strongly (i.e., one has to apply a high J-score cutoff to break the network into small GP) and highly significantly (compared to default q value 0.05) connected. These issues will be explored in future work.

LeMeDISCO identified MOA proteins

In addition to the comorbidity predictions, LeMeDISCO also identifies comorbidity enriched MOA proteins. The comorbidity enriched MOA proteins are hierarchically ranked by their CoMOAenrich score (defined in the Methods section). Comparing the top 100 comorbidity enriched MOA proteins (hierarchically ranked by the CoMOAenrich score) with the MEDICASCY top 100 MOA proteins (ranked by q value), 92.5% of the diseases have proteins with a significant overlap p value (<0.05). The cumulative distribution for the CoMOAenrich scores and q values for all the comorbidity enriched MOA proteins and the top 100 are shown in Supplementary Figs. 7, 8, respectively. The summary statistics of the scores for these thresholds are shown in Supplementary Table 1. For the comorbidity enriched MOA proteins ranked by their CoMOAenrich score, 65.9% have a q value <0.005. If one only assesses the top 100 comorbidity enriched MOA proteins, 67.1% have a q value <0.005, which are the proteins used for the global pathway analysis.

Mapping of the LeMeDISCO MOA proteins to significant pathways

The cumulative distribution of the q values for the pathways and the top 100 pathways are shown in Supplementary Fig. 9 and the summary statistics are provided in Supplementary Table 1. As shown in Supplementary Fig. 9, 73.1% of the significant pathways (q value <0.05) have a q value <0.015. About 3453 or 95.7% of the 3608 diseases have significant pathways. We further note that there are some MOA proteins (e.g., AR, NR4A3, and PGR) and pathways (e.g., HSP90 chaperone cycle for steroid hormone receptors, SUMO E3 ligases SUMOylate target proteins, SUMOylation) that are present in approximately a third of the diseases in our library.

Applications of LeMeDISCO

By way of illustration, we applied LeMeDISCO to two disparate diseases, coronary artery disease (CAD) and ovarian cancer (OC).

Coronary artery disease (CAD)

CAD, a leading cause of death worldwide, is caused by narrowed or blocked arteries due to plaques composed of cholesterol or other fatty deposits lining the inner wall of the artery. These plaques result in decreased blood supply to the heart20. We find 2576 significant comorbid diseases (q value <0.05) and 785 (558) comorbidity enriched MOA proteins (genes) (score >0.01), meaning that at least one of the top 100 comorbid disease shares the protein as an MOA protein. This is the p value weighted comorbidity frequency normalized by the number of comorbid diseases used for calculating the frequency. See Methods for more details. Forty-nine significant pathways (q value <0.05) are associated with the top-ranked 100 comorbidity enriched proteins. The top 20 disease comorbidities, top 20 comorbidity enriched MOA proteins, and top 20 significant pathways are shown in Table 2. There are several significant cardiovascular-related comorbidities such as heart disease, cardiovascular system disease, myocardial infarction, and congestive heart failure. Asthma21, diabetes22, and obstructive lung disease23 are also in the top ten with known comorbidities to CAD. CAD is also known to be comorbid with liver disease24, kidney disease25, and hyperthyroidism26. Interestingly, allergic rhinitis is associated with decreased coronary heart disease27. In summary, 14 (70%) of the top 20 predicted comorbid diseases have literature evidence to support these predictions. To further show that these comorbidities with literature evidence cannot be generated randomly, we randomly selected 20 diseases from the 3608 diseases and did a literature search for their associations with CAD. We found nine diseases, far fewer than our list of 14 diseases (see Supplementary Table 2). A further random test selecting 20 from those after excluding LeMeDISCO predicted comorbid diseases to CAD, we find six diseases having literature evidence (see Supplementary Table 3).

Table 2 Top 20 comorbidities (excluding same disease pair, (i.e., CAD-CAD)), top 20 comorbidity enriched MOA proteins (with respect to original disease), and top 20 pathways associated with the prediction CAD results.

Among the top, COX-related comorbidity enriched proteins were found. COX proteins are involved in the synthesis of prostanoids. Prostanoids are structurally like lipids and are involved in thrombosis and other undesirable cardiovascular events28. Several GPCR-related pathways (Class A/1 rhodopsin-like receptors, olfactory signaling pathway, GPCR ligand binding) are among the top five pathways predicted for CAD, consistent with the literature that GPCRs play a crucial role in heart function29.

The above results were obtained without any extrinsic knowledge of CAD. Next, we show how LeMeDISCO can be used to prioritize targets from other studies. A GWAS study identified 155 CAD-associated genes30. While they are associated with CAD, to find out which ones to target is a non-trivial task. Here, we applied LeMeDISCO to prioritize them by examining their frequencies of presence in other diseases. There were 26 comorbidity enriched MOA proteins (score >0.01) and 40 pathways (p value <0.05, but q value <0.20) found from global pathway analysis of the 26 comorbidity enriched MOA proteins. The top disease comorbidities, top 20 comorbidity enriched MOA proteins, and top 20 pathways are shown in Table 3. There were only three significant predicted comorbidities (q value <0.05) by LeMeDISCO. The top two comorbidities are renal artery disease and anuria, both are associated with dysfunction of the kidneys and are related to CAD25,31. Anuria is attributed to failure of the kidneys to produce urine, and renal artery disease occurs when the arteries that supply blood and oxygen to the kidneys narrows. A study found an increase in renal artery stenosis in patients with CAD31. The last comorbid disease is anterior uveitis. Studies showed that anterior uveitis is associated with Kawasaki disease that can lead to heart complication32. Thus, all three have literature evidence.

Table 3 Up to top 20 comorbidities, top 20 comorbidity enriched MOA proteins (with respect to input), and top 20 pathways (ranked by p value since q values are the same) associated with the prediction CAD GWAS-driven LeMeDISCO results using the gene set from ref. 30.

While the top genes are associated with CAD according to the GWAS study of ref. 30, we predicted that they are also associated with the corresponding comorbid diseases—renal artery disease, anuria, and anterior uveitis. For example, VEGFA is predicted to be associated with all three comorbid diseases. It was found that in progressive kidney disease, the VEGFA expression level is attenuated33; in contrast, in uveitis disease, it is increased34. Other top genes are predicted to be associated only with anterior uveitis. Among them, SERPINA1 is a potential causal gene of uveitis35, RAB23 is associated with uveitis in sarcoidosis36, and HHAT has evidence of association with uveitis37.

None of the 40 pathways obtained using the top 26 genes overlaps with the six pathways obtained using the original 155 genes with the same cutoff. The predicted top pathway RAB geranylgeranylation through RAB23/RAB5C genes is part of the signaling network of statin-induced effects of improving cardiac health in Drosophila38.

Ovarian cancer (OC)

LeMeDISCO predicts 1,092 significant comorbidities to OC (q value <0.05), with 282 (171) comorbidity enriched MOA proteins (genes) (score >0.01). There were 159 significant pathways (q value <0.05) from the top 100 comorbidity enriched MOA proteins. The top 20 disease comorbidities, top 20 comorbidity enriched MOA proteins, and all significant pathways are shown in Table 4. It is not surprising that all of the top comorbidities are cancers. The top first comorbid disease is testicular cancer. Although OC and testicular cancer cannot occur in one individual, they are hereditarily associated39. Fallopian tube cancer is considered similar to OC. It was reported that squamous cell carcinoma occurred in the ovary40. Nodular prostate, (the male version of OC), cervical cancer41, and inflammatory breast carcinoma42 are all reproduction-related cancers like OC. OC from lung cancer metastasis occurs in 2–4% of OC patients43. Bile duct cancer is a very rare site of OC metastases44. Peritoneal cancer behaves similarly to OC. Gland cancer is linked to BRCA-positive families, and BRCA is a risk gene for ovarian cancer45. Neurofibroma is reported to mimic ovarian tumors46. Renal cell carcinoma is metastatic to ovarian and fallopian tube cancers47. In total, 14 of the top 20 (70%) comorbidities have literature evidence. Similar to CAD, we did a literature search of 20 randomly selected diseases for their associations with OC. We found only 6 cases; far fewer than our 14 (see Supplementary Table 2). In a further random test selecting 20 from those after excluding LeMeDISCO predicted comorbid diseases to OC, we find four diseases have literature evidence (see Supplementary Table 3).

Table 4 Top 20 comorbidities (excluding same disease pair, (i.e., OC-OC)), top 20 comorbidity enriched MOA proteins (with respect to original disease), and top 20 pathways associated with the prediction OC results.

Eleven of the top 20 enriched MOA proteins are kinases that are cancer-related. The topmost comorbidity enriched MOA protein is TEK, angiopoietin-1 receptor; angiopoietins are found to promote ovarian cancer progression48. Interestingly, TYRO3 is related to drug resistance in OC49. The top predicted pathway by enriched MOAs is MAPK1/MAPK3 signaling that mediates the expression of ERBB2 silencing, OC cell migration, and invasion50. There are also enriched pathways associated with ephrin ligands. Aggressive forms of ovarian cancer have been previously shown to involve upregulated forms of ephrin, such as ephrinA551. There are 14 ephrin-related comorbidity enriched MOA proteins found (all score >0.37).

We next examined a set of 11 genes associated with OC risk from a study that assessed the multiple-gene germline sequences in 95,561 women with OC using LeMeDISCO52. The results for the top 20 comorbidities, seven MOA proteins (score >0.01), and their associated pathways are shown in Table 5. There were 125 significant comorbidities (q value <0.05) predicted and 33 significant pathways (q value <0.05) associated with these seven proteins. The top comorbidity associated with OC was angiosarcoma, a rare cancer of the inner blood and lymph vessels and in very rare cases, it occurs in the ovaries53. Patients with epithelial ovarian cancers show an increased risk of skin cancer54. OC is also considered to have genetic risk factors55. Myxoid leiomyosarcoma is a very rare tumor with similarity to ovarian cancer56, and leiomyosarcoma was reported in the ovaries57. A study found a relationship between hemoglobin levels and interleukin-6 levels in individuals with untreated epithelial ovarian cancer, indicating an inflammatory role in cancer-associated anemia58. Medulloblastoma can arise from ovarian tumors in pregnancy59. Uveal cancer is associated with breast cancer and OC60. OC is part of urinary system neoplasm. In total, 15 (75%) of top 20 comorbidities have literature evidence.

Table 5 Top 20 comorbidities, seven comorbidity enriched MOA proteins (with respect to input), and top 20 pathways associated with the prediction OC GWAS-driven results using the gene set from ref. 52.

The top two comorbidity enriched MOA proteins are RAD51C, RAD51D and belong to 16 of the top 20 pathways. These involve such processes as DNA repair, transcriptional regulation by TP53, DNA double-strand break repair, and reproduction (see Table 5). The top two and the third-ranked MSH6 proteins are shared by all top 100 comorbidities. For example, RAD51C is associated with Fanconi anemia (ranked 63th)61; RAD51D is associated with leiomyosarcoma62, and MSH6 is a risk gene for pancreatic adenocarcinoma (26th)63.

LeMeDISCO web server

The LeMeDISCO web service allows researchers to query our library of 3608 diseases or input a set of pathogenic human genes/proteins and compute their predicted comorbidities, prioritized MOA proteins, and pathways associated. The web service is freely available for academic users at http://sites.gatech.edu/cssb/LeMeDISCO. The programs and input data for reproducing disease–protein, disease–disease relationships, and all LeMeDISCO results as well as benchmark results are available at https://github.com/hzhou3ga/lemedisco.

Discussion

LeMeDISCO is a systematic approach for studying and analyzing possible features underlying the common proteins driving comorbid diseases. The resulting predicted driver proteins and pathways for each disease or input gene set can allow researchers to generate new diagnostic and treatment options and hypotheses. Interestingly, there were some MOA proteins and pathways present across approximately a third of the diseases, implying common disease drivers. The implications of this observation and its relationship to disease origins will be pursued in future work. We do note that the current comorbid disease analysis strongly suggests that the “one target-one disease-one molecule” approach often used in developing disease therapeutics31 is likely too simplistic.

To fully understand the complexities of a disease, one must trace the origin of its pathogenesis, which may be due to a genetic or somatic variant that is somehow related to the disease. However, such variants may also be associated with a disease not previously known to be associated with that disease. Such interrelations can be further investigated by identifying high confidence comorbidity predictions from LeMeDISCO, regardless of whether or not their comorbidity was previously known in the literature. For example, analysis of the comorbid diseases associated with CAD and OC have not only recapitulated known disease comorbidities but have also provided novel insights. The results for CAD yielded high confidence associations between liver diseases and forms of asthma, which can be further investigated through the comorbidity enriched MOA proteins and pathways. Furthermore, the results for OC revealed more high confidence associations to other forms of cancer such as squamous cell carcinoma and lung cancer.

LeMeDISCO not only has applications to the study of the underlying etiology behind a disease but may also be used during the early stages of drug discovery to identify efficacious drugs. Rather than starting with a small molecule or protein target of choice, LeMeDISCO allows one to begin at the level of disease biology, often termed phenotypic drug discovery. In future work, we shall demonstrate the utility of LeMeDISCO in identifying efficacious drugs to treat a given disease. Overall, the results of the current analysis and preliminary applications to drug discovery suggest that LeMeDISCO provides a set of tools for elucidating disease etiology and interrelationships and that a more systems-wide, comprehensive approach to both personalized medicine and drug discovery is required.

We note that some of the predicted MOA proteins are present in around 1/3 of diseases. The top five (AR, NR4A3, PGR, NR3C2, and NR3C1) proteins all belong to the nuclear receptor family and regulate other genes. All have DNA binding sites, especially two zinc finger domains64. The regulatory functions and ubiquity of well-studied zinc fingers in these proteins may explain their frequent presentation as disease MOA proteins65. Even though in our predicted drug targets of the probe drugs, these proteins are not the most frequent ones (e.g., AR is ranked 1135th of 16,762), their disease associations were enriched by MEDICASCY predicted disease–drug relationships.

With the above possible applications, there is also the limitation of the current approach of using FDA-approved DrugBank drugs as probe drugs to tease out the MOA proteins of diseases, i.e., some possible MOA proteins of a given disease might not be the targets of the probe drugs and others might be incorrectly assigned. This will be addressed in future work that includes more diverse small molecule drug libraries and improved virtual ligand screening algorithms to map the drugs to their respective protein targets66. As the probe drug target space is expanded, additional MOA proteins will be discovered. Concomitantly, as the virtual ligand screening algorithms that assign small molecules to their predicted protein targets improve, false positives will be eliminated, and additional true positive proteins might be added. These will result in more accurate MOA protein predictions.

Similar to previous work1,6,7,8, our predicted disease–disease relationships were benchmarked using large-scale clinical data and have only small-scale validation by literature searches. One single relationship requires at least one published work to validate. Large-scale automatic text mining is a feasible way to scale up the validation and build a more confident subset of our predictions67. This is the subject of ongoing studies.

Methods

Overview of LeMeDISCO

A flowchart of LeMeDISCO is shown in Fig. 2. LeMeDISCO employs MEDICASCY12 to predict possible disease MOA proteins. Here, MEDICASCY is applied in prediction mode (i.e., any training drugs having a Tanimoto-Coefficient = 1 to a given input drug is excluded from training) to avoid a strong bias toward drugs in the training set on a set of 2095 FDA-approved drugs68. For each of the 3608 indications, we rank the 2095 probe drugs according to their Z-scores, Zd, defined using the raw score computed by MEDICASCY from:

$${Z}_{d}=\left(\frac{{{{{{{\mathrm{raw}}}}}}}\,{{{{{{\mathrm{score}}}}}}}{{\mbox{--}}}{{{{{{\mathrm{average}}}}}}}\,{{{{{{\mathrm{raw}}}}}}\,{{{{{\mathrm{score}}}}}}}\,{{{{{{\mathrm{of}}}}}}}\,2095\,{{{{{{\mathrm{drugs}}}}}}}}{{{{{{{\mathrm{standard}}}}}}}\,{{{{{{\mathrm{deviation}}}}}}}\,{{{{{{\mathrm{of}}}}}}}\,2095\,{{{{{{\mathrm{raw}}}}}}}\,{{{{{{\mathrm{scores}}}}}}}}\right)$$
(1)
Fig. 2: Schematic representation of LeMeDISCO.
figure 2

a The method for determining the MOA proteins associated with a disease indication via MEDICASCY, and b The method for determining the comorbidities associated with a given disease and its molecular mechanisms via LeMeDISCO.

To predict a drug as having the given indication, we applied a Zd cutoff of 1.65, that approximately corresponds to a p value of 0.05 for the upper-tailed null hypotheses of random variable Zd. Thus, for each indication D, the 2095 probe drugs are separated into two groups: N1 are predicted to have indication D (Zd ≥ 1.65) and N2 (=2095 − N1) are not predicted to have indication D (Zd < 1.65). This is a very loose prediction of a drug’s indication with the advantage that it always predicts some drugs having the indication with its expected statistical confidence. Then, for a given indication D and each protein target, T, in the human proteome of our modeled 32,584 proteins, there are a subset of the drugs (or perhaps none) predicted by FINDSITEcomb2.0 69 to bind T. The relative risk RR(D, T) of the given target T with respect to indication D as:

$${{{{{\rm{RR}}}}}}({{{{{\rm{D}}}}}},{{{{{\rm{T}}}}}})=\frac{{N}_{1}^{T}/{N}_{1}}{{N}_{2}^{T}/{N}_{2}}$$
(2a)

where \({N}_{1}^{T}\) and \({N}_{2}^{T}\) are the numbers of drugs binding to T with and without indication D, respectively. The numerator is the estimation of the probability of drugs having the predicted indication D (Zd ≥ 1.65) that bind to protein T (F1 = \({N}_{1}^{T}/{N}_{1}\)). The denominator is the probability of finding drugs that do not have the predicted indication D but which bind to protein T (F2 = \({N}_{2}^{T}/{N}_{2}\)). This latter probability serves as the background probability that an arbitrary drug will bind to T. When no drug is predicted to bind to protein T, RR(D, T) is set to zero. RR(D, T) = F1/F2 >1 means that a drug having indication D is more likely to bind to T than arbitrary drugs not having the predicted indication D will bind to T.

We then compute the statistical significance of RR(D, T) by calculating a p value using Fisher’s exact test70,71 on the following contingency table:

$$\left(\begin{array}{cc}{N}_{1}^{T} & {N}_{1}-{N}_{1}^{T}\\ {N}_{2}^{T} & {N}_{2}-{N}_{2}^{T}\end{array}\right)$$
(2b)

We define a protein target T as predicted to be a possible MOA target for indication D if its p value <0.05 because it is more likely to be targeted by efficacious drugs than arbitrary drugs. Thus, for each of the 3608 indications, there is a list of predicted possible MOA proteins.

To reduce false positive MOAs, we utilized the human protein atlas database (https://www.proteinatlas.org/about/download, normal_tissue.tsv) of expression profiles for proteins in normal human tissues based on immunohistochemistry using tissue micro arrays72 to filter those proteins that are “not detected” and not “uncertain” in all tested tissues related to an indication. To determine the tissues related to an indication, tissues are mapped to their ICD-10 main codes and indications having the same main codes are related to the tissue.

Using the input of two sets of putative MOA proteins having a p value of <0.05 calculated by Fisher’s exact test70, we calculate their Jaccard index17 J(D1,D2) (J-score) defined in Eq. 3a as

$${{{{{\rm{J}}}}}}{\mbox{-}}{{{{{\rm{score}}}}}}={N}_{s}/({{{{{\rm{N}}}}}}{{{{{\rm{D}}}}}}1+{{{{{\rm{N}}}}}}{{{{{\rm{D}}}}}}2-{N}_{s})$$
(3a)

We then calculate the p value for significance by Fisher’s exact test for the contingency table70 that gives the probability of having overlap ≥Ns by randomly selecting \({N}_{D2}\) out of \({N}_{t}\) proteins70,73:

$$\left(\begin{array}{cc}{N}_{s} & {N}_{D2}-{N}_{s}\\ {N}_{D1} & {N}_{t}-{N}_{D1}\end{array}\right)$$
(3b)

\({N}_{D1}\), \({N}_{D2}\) are the numbers of MOA proteins/genes of disease D1 and D2; Ns is the number of overlapped MOA proteins between D1, D2, and \({N}_{t}\) is the total number of human proteins. The Jaccard index J-score is a statistical measure of the similarity between MOA proteins of D1 and D2, and its value ranges between 0 and 1. Since the null hypothesis of \({N}_{s}\) corresponds to a hypergeometric distribution, the p value of observing the number of overlapped MOA proteins between D1, D2 \(\ge\) \({N}_{s}\) can be calculated using Fisher’s exact test on the table in Eq. 3b71. We will use the J-score for predicting comorbidity and compare it with the observed comorbidity. We note that the J-score is determined by the number of overlapped MOAs, which means that the comorbidity defined by the J-score are not limited by diseases occurring in one individual but rather considers the effect of the malfunctioning proteins in the human population. This is especially true for sex-specific diseases such as ovarian cancer and prostate cancer that may have overlapping MOA proteins; this may result in significant comorbidity between them. In other words, the two diseases may share common driver proteins, although ovarian and prostate cancer could occur unless the individual has both an ovary and a prostate, which is highly unlikely. Similarly, it can predict the comorbidity of rare and common diseases; again, whether this would occur would depend on the presence in a given person of the appropriate set of malfunctioning genes. Therefore, though many of the LeMeDISCO comorbidity predictions are seen in one individual, others may not be. Thus, LeMeDISCO comorbidity predictions are a population-based approach.

To better control the false discovery rate (FDR) due to background noise from statistic errors, we performed the multiple testing correction to the p values for disease–protein and disease–disease associations calculated by Fisher’s exact test by computing the q value using the method described in ref. 74.

In large-scale disease–disease comorbidity calculations, we use the MOAs predicted by MEDICASCY12. In addition, MOA targets between disease pairs can also be derived from experimental data; examples include differential gene expression (GE), Mendelian or somatic mutation profiles comparing disease vs. control normal samples, better vs. worse prognosis samples, or drug-treated vs. control untreated samples75.

Benchmarking of LeMeDISCO

We validated LeMeDISCO’s J-score by correlating it with the observed comorbidity as quantified by (a) the logarithm of relative risk log(RR) score and (b) the φ-score (Pearson’s correlation for binary variables)1. The relative risk (RR) is the probability that two diseases co-occur in a single individual relative to random. Since RR scales exponentially with respect to the strength of two interacting diseases, we use log(RR) for correlation analysis. The log(RR) and φ-score are computed from US Medicare insurance claim data using1:

$${{\log }}({{{{{\rm{RR}}}}}})\,=\,{{\log }}\left(\frac{{n}_{{AB}}/{n}_{{tot}}} {({n}_{A}/{n}_{{tot}}){({n}_{B}/{n}_{{tot}})}}\right)$$
(4a)
$${{\varphi }}{\mbox{-}}{{{{{\rm{score}}}}}}=({n}_{{AB}}* {n}_{{tot}}-{n}_{A}* {n}_{B})/\sqrt{{n}_{A}* {n}_{B}* \left({n}_{{tot}}-{n}_{A}\right)* \left({n}_{{tot}}-{n}_{B}\right)}$$
(4b)

where ntot = total number of patients; nA, nB = number of patients diagnosed with diseases A and B, and nAB = number of patients diagnosed with both diseases A and B.

Permutation tests

Two permutation tests were performed: (a) Permute drug–protein relationships: Randomly permute the predicted drug–protein relations (i.e., randomly replace a drug’s protein targets with another drug’s protein targets predicted by FINDSITEcomb2.0). This acts to transfer the protein targets of a drug (possibly incorrectly) to another drug. This test evaluates the performance of LeMeDISCO if we have the correct drug–disease relations (predicted by MEDICASCY) but the incorrect drug–protein relations. To ensure the correct drug–disease relations after permuting the drug–protein relations, MEDICASCY was applied to the permuted drug–protein relations since MEDICASCY depends on the drug’s protein targets; (b) Permute drug–disease relationships: Randomly permute the predicted drug–disease relations (by randomly replacing a drug’s predicted indications with another drug’s indications). This test evaluates how LeMeDISCO will perform if the drug–protein relations are correct (predicted by FINDSITEcomb2.0), but the drug–disease relations are randomly permuted. In both cases, disease MOAs are derived using the permuted relationships and 100 runs for each test with different random seeds were performed. A p value is calculated from z-score = (LeMeDISCO value-average)/standard deviation to characterize the significance of the difference between LeMeDISCO and the permutation tests.

Identification of key MOA proteins and associated pathways for disease comorbidity

After determining the significant comorbidities for each disease, the p value weighted frequency of shared MOA proteins across the top 100 predicted comorbidities are calculated. We define a p value weighted frequency of an input MOA as follows (i.e., CoMOAenrich score): If MOA protein T is shared by a comorbid indication D and the p value of T associated with D is P, then the weight defined by the min(1.0,−αlogP) is counted as T’s frequency. In practice, we used 10 cancer cell line data76 to optimize the coefficient α to 0.025. We further computed a p value via \({e}^{-\frac{{{{{{{\mathrm{COMOAenrich}}}}}}\; {{{{{\mathrm{score}}}}}}}}{\alpha }}\) where α = 0.025, as previously mentioned. These MOA proteins expand the number of possible molecular players driving disease pathogenesis. An empirically derived CoMOAenrich score (normalized by the number of comorbid indications that is 100) threshold of 0.01 was used, which is equivalent to 1% of the comorbid indications having the MOA proteins with a significant p value (<4.2 × 10−18). Then, up to the top 100 comorbidity enriched MOA proteins for each disease were used in global pathway analysis via Reactome13. The pathways with a p value <0.05 were extracted. The frequency of pathways across diseases was assessed to identify common pathways of disease.

LeMeDISCO usage

As shown in Fig. 2, LeMeDISCO can be used in two different ways: (1) MEDICASCY-driven LeMeDISCO: The comorbidities for any of the 3608 diseases from the MEDICASCY-provided MOA proteins are predicted (Fig. 2a). (2) Pathogenic gene set driven LeMeDISCO: Input your own pathogenic gene set derived from differential gene expression, GWAS, exome analysis, or other experimental/clinical techniques (shown in Fig. 2b). The LeMeDISCO web service allows users to query the LeMeDISCO database as well as input their own set of pathogenic genes to assess the associated comorbidities, MOA proteins, and pathways.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.