Speos: an ensemble graph representation learning framework to predict core gene candidates for complex diseases

Understanding phenotype-to-genotype relationships is a grand challenge of 21st century biology with translational implications. The recently proposed "omnigenic" model postulates that effects of genetic variation on traits are mediated by core genes and proteins whose activities mechanistically influence the phenotype, whereas peripheral genes encode a regulatory network that indirectly affects phenotypes via core gene products. Here, we develop a positive-unlabeled graph representation-learning ensemble approach based on a nested cross-validation to predict core-like genes for diverse diseases, using Mendelian disorder genes for training. Employing mouse knockout phenotypes for external validation, we demonstrate that core-like genes display several key properties of core genes: mouse knockouts of genes corresponding to our most confident predictions give rise to relevant mouse phenotypes at rates on par with the Mendelian disorder genes, and all candidates exhibit core gene properties such as transcriptional deregulation in disease and loss-of-function intolerance. Moreover, as predicted for core genes, our candidates are enriched for drug targets and druggable proteins. In contrast to Mendelian disorder genes, the new core-like genes are enriched for druggable yet untargeted gene products, which are therefore attractive targets for drug development. Interpretation of the underlying deep learning model suggests plausible explanations for our core gene predictions in the form of molecular mechanisms and physical interactions. Our results demonstrate the potential of graph representation learning for the interpretation of biological complexity and pave the way for studying core gene properties and future drug development.


Supplementary Note 2: Design and optimization of GNN
Our model consists of three parts: pre-message passing (pre-MP), message passing (MP), and post-message passing (post-MP) (Supplementary Figure 5). Pre- and post-MP are MLPs built from stacked fully connected layers interspersed with ELU nonlinearities, while the MP module is built from GNN layers interspersed with ELU nonlinearities and instance normalization layers. The pre-MP module consists of one input layer, which transforms the input space into the hidden dimension h of our model, and q further layers, which extract node-level features and feed them into the MP module.
We use the same hidden dimension h across all hidden layers of the model. The MP module is built from r blocks, each consisting of one GNN layer, one nonlinearity, and one normalization layer, and feeds the latent features into the post-MP module. The post-MP module consists of s fully connected layers for node-level pattern recognition and two fully connected layers that transform the latent features into the one-dimensional output space via an intermediate lower-dimensional step. We searched the hyperparameters q and s over all values from 1 to 6 without much difference in performance (data not shown). We therefore settled on q = s = 2, which leaves both modules ample room for feature recognition while keeping the number of model parameters small. When evaluating the dimensionality h of the pre- and post-MP input/output vectors at 10, 20, 30, 40, 50, 75, 100, and 125, we observed no performance change exceeding one standard deviation of the AUROC; we fixed h = 50. Although hidden dimensions larger than 125 were included in the code, those runs ran out of memory on a 40 GB A100 GPU for the large multi-network models and did not return results.
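As a minimal sketch of the module layout described above (the actual implementation uses PyTorch Geometric; the input feature count and the intermediate output width are illustrative assumptions, since the latter is not specified here), the following traces the in/out dimension of every layer through pre-MP, MP, and post-MP:

```python
# Hypothetical sketch of the Speos layer layout, not the exact implementation.
# pre-MP: 1 input layer + q hidden layers; MP: r GNN blocks; post-MP: s hidden
# layers + 2 output layers collapsing to a one-dimensional output.

def layer_dims(n_input_features, h=50, q=2, r=2, s=2):
    """Return the (in, out) dimension of every layer in the model."""
    dims = [(n_input_features, h)]       # pre-MP input layer: input -> h
    dims += [(h, h)] * q                 # q further pre-MP layers
    dims += [(h, h)] * r                 # r GNN (message passing) blocks
    dims += [(h, h)] * s                 # s post-MP layers
    dims += [(h, h // 2), (h // 2, 1)]   # two output layers; the h // 2
                                         # intermediate width is an
                                         # illustrative choice, not from
                                         # the original text
    return dims

dims = layer_dims(n_input_features=96)   # 96 input features is made up
```

Setting r = 0 removes the MP blocks entirely and leaves an ordinary MLP, which is how the MLP baseline below arises.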
For the MP module, we tested 35 adjacency matrices and eleven GNN layers from the repertoire of PyTorch Geometric, of which eight are single-network GNNs, which use only a single adjacency matrix and are evaluated first, and three are multi-network GNNs, which can use multiple adjacency matrices simultaneously. Setting the number of GNN layers to r = 0 collapses the model into an MLP with only pre-MP and post-MP and eliminates the influence of the adjacency matrix and GNN layer. Supplementary Figure 6 shows the influence of different single-network GNN layers and the hyperparameter r on the performance of the model using the adjacency matrices BioPlex 3.0 HEK293T, GRNdb Adipose Tissue, and IntAct Direct Interaction. We evaluated the topology adaptive graph layer (TAG) 8, GraphSAGE 9, the graph transformer layer 10, the Chebyshev graph convolution layer 11, the seminal graph convolution layer (GCN) 12, the simplified graph convolution layer (sGCN) 13, the graph attention layer (GAT) 14, and the graph isomorphism layer (GIN) 15. We conducted the hyperparameter optimization for gene labels of immune dysregulation (Supplementary Figure 6) and cardiovascular disease (not shown). For two of the adjacency matrices, most layers reduce the model performance compared to an MLP (r = 0). Only with IntAct Direct Interaction (Supplementary Figure 6c) does the GNN (r > 0) outperform the MLP. However, across all three adjacency matrices, the TAG layer usually outperforms all other layers. The best performing layers, especially according to Supplementary Figure 6b, c, are TAG, GraphSAGE, and the graph transformer layer. These layers either have mechanisms to block out the neighborhood and retain the self node's features during the convolution (TAG and GraphSAGE) or have a non-linear neighborhood attention mechanism (transformer), allowing them to modulate the incoming message. Simpler GNN layers, however, fail to perform compared to the MLP, which hints at the full neighborhood information being harmful for prediction. Based on these data, we settled on 2 TAG layers, which generally achieve the best performance and do not deteriorate below the performance of an MLP.
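The ability of TAG-style layers to retain the self node's features can be sketched as follows (a simplified illustration with scalar features, no learned weight matrices, and no degree normalization, unlike the real TAG layer): the k = 0 term of the sum over adjacency powers passes the node's own features through untouched, acting as a built-in skip connection.

```python
# Simplified TAG-like propagation: y = sum_k w_k * (A^k x).
# Assumptions: scalar node features, fixed scalar weights w_k instead of
# learned weight matrices, no normalization.

def propagate(adj, x):
    """One hop of aggregation: y[v] = sum of x[u] over the neighbors u of v."""
    return [sum(x[u] for u in neighbors) for neighbors in adj]

def tag_like(adj, x, K=2, weights=(1.0, 1.0, 1.0)):
    """Sum over adjacency powers up to K; the k = 0 term keeps self features."""
    out = [weights[0] * xi for xi in x]  # k = 0: the node's own features
    h = x
    for k in range(1, K + 1):
        h = propagate(adj, h)            # k-hop neighborhood aggregate
        out = [o + weights[k] * hi for o, hi in zip(out, h)]
    return out

# Path graph 0 - 1 - 2 with features [1, 10, 100].
adj = [[1], [0, 2], [1]]
x = [1.0, 10.0, 100.0]
```

Down-weighting the neighborhood terms (weights (1, 0, 0)) recovers exactly the input features, which is the fallback behavior that lets TAG match an MLP when the neighborhood is uninformative.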
After fixing the MP module to 2 layers of TAG, we investigated the performance on all 35 adjacency matrices. In this analysis we also included three multi-network GNN layers that use all adjacency matrices, which are typed and fed into the network simultaneously: RGCN 16, RGAT 17, and FiLM 18. RGAT ran out of memory for all runs on a 40 GB A100 GPU and is not shown. As recent analyses 19 showed that allowing the information flow to bypass the GNN can drastically improve performance, we also evaluated a skip-connection from pre-MP to the output of MP and a concatenation of pre-MP's and MP's output features prior to post-MP (Supplementary Figure 5c). From the single-adjacency runs, only IntAct Direct Interaction and IntAct Physical Association provide a benefit, likely reflecting the bias discussed above (Supplementary Figure 7a). Using all adjacency matrices simultaneously is only helpful if the FiLM layer is used. Furthermore, using a skip-connection or concatenation is not beneficial, possibly because the TAG layer already contains skip-connections and the FiLM layer is flexible enough to block and override unhelpful information. We therefore settled on using the FiLM layer with all adjacency matrices and the TAG layer with IntAct Direct Interaction, without any further skip-connections or concatenations.
Based on the hyperparameter optimization, we settled on two combinations of GNN layer and network: 1. using r = 2 TAG layers on the IntAct Direct Interaction network, and 2. using r = 2 FiLM layers on a merge of all networks. Due to the strict annotation criteria, the IntAct Direct Interaction network is very sparse (Supplementary Figure 7b) and most nodes are isolated, whereas after merging all networks for the FiLM layer, only 33 nodes are not in the large connected component (Supplementary Figure 7c).
The average shortest path lengths between positives and between all genes are much shorter in the combined network than in IntAct Direct Interaction. However, in both graphs positives and unlabeled nodes do not differ in their average shortest path lengths. Also, when combining the networks, the node degree distribution shifts away from that of a scale-free graph, as low-degree nodes become less frequent than mid-degree nodes. Finally, positives in IntAct Direct Interaction have a much higher frequency of other positives in their neighborhood than when all networks are combined.
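The average shortest path statistic compared above can be computed with plain breadth-first search; the following is an illustrative sketch (the toy graph and gene set are made up, and the real networks would require handling of directed edges and large components):

```python
# Average pairwise shortest path length within a gene set (e.g. the positives),
# counting only reachable pairs. Unweighted graphs, so BFS suffices.

from collections import deque

def bfs_dists(adj, src):
    """Hop distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def avg_shortest_path(adj, nodes):
    """Mean shortest path length over ordered reachable pairs within `nodes`."""
    total, pairs = 0, 0
    for s in nodes:
        d = bfs_dists(adj, s)
        for t in nodes:
            if t != s and t in d:
                total += d[t]
                pairs += 1
    return total / pairs if pairs else float("inf")

# Toy graph: a path 0 - 1 - 2 - 3; "positives" at the two ends are 3 hops apart.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

Running `avg_shortest_path` once on the positives and once on all genes gives the two quantities contrasted between the combined network and IntAct Direct Interaction.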

Supplementary Note 3: Network features influencing model performance
Several observations indicate that, while the input features of every node and its incident edges are important for its own prediction, the features of neighboring nodes may be irrelevant or even harmful for the machine learning task. In the initial evaluation of base classifiers (Fig. 1a), the best performing method for recovering held-out positives is N2V+MLP. This method first uses a random-walk-based neural network to project the graph topology into vector space without any node features. These vectors, which describe the position of nodes in the graph, are concatenated with the regular input features of the node of interest, such as GWAS summary statistics and gene expression, and fed into an MLP. Since the MLP part of N2V+MLP processes every gene in isolation, the neighboring nodes' features are not seen by the model. It only processes the input features of the node itself along with the topological information of its position in the graph, encoding its connections to other nodes. At the same time, we established that this topological information is valuable, as N2V+MLP outperforms an MLP that only uses the regular input features in the recovery of held-out positives (Fig. 1a) and in the discovery of unlabeled positives (Fig. 3a). In contrast to N2V+MLP, GNNs process graphs by convolving the features of adjacent nodes along the connecting edges using the message passing framework. Thus, the edges determine "where" the convolution is applied, but the neighboring nodes' features define the content that is processed and propagated through the network.
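The N2V+MLP input construction can be sketched in one line (hypothetical illustration; the feature values and embedding width are made up): the node2vec positional embedding, which encodes topology only, is concatenated with the node's own input features, so the downstream MLP never sees neighboring nodes' features directly.

```python
# Hypothetical sketch of the N2V+MLP input: per-node input features
# (e.g. GWAS summary statistics, gene expression) concatenated with a
# topology-only node2vec embedding of that node.

def build_mlp_input(node_features, n2v_embedding):
    """Concatenate a node's own features with its positional embedding."""
    return list(node_features) + list(n2v_embedding)

x = build_mlp_input([0.3, 1.2], [0.05, -0.4, 0.9])  # made-up values
```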
We observed that GNNs are outperformed by N2V+MLP in recovering held-out positives (Fig. 1a), but perform better than or comparable to N2V+MLP in discovering unlabeled positives (Fig. 3a, 4a). We ascribe this observation to GNN layers regularizing the training by weighting adjacent nodes' features and acting as a low-pass filter, thus lowering the risk of overfitting to known positive examples. However, the performance of different GNN layers varies greatly, and several in fact decrease overall model performance (Supplementary Figure 6), whereas others achieve high prediction performance also for novel core genes (Fig. 3d). N2V+MLP works well for known positives but underperforms for unknowns (Fig. 3c). Since we are predominantly interested in the discovery of unlabeled positives, GNNs are the more promising choice. However, the GCN 12 and RGCN 16 layers generally underperform (Fig. 1b). On the one hand, GCN, which mainly exploits the features of all neighboring nodes, is outperformed by an MLP; on the other hand, N2V+MLP, which exploits edges but no neighboring node features, outperforms the MLP. It thus appears that the node features of neighbors are not helpful, while the edges per se are informative. This explains why TAG 8 ameliorates the dip in performance caused by the GCN layer by incorporating skip-connections, which can bypass the message passing. In this setting, the TAG layer can be understood as a conditional GNN layer which, if the adjacent nodes' features are unhelpful, can revert to an MLP for any given node. This is also mirrored in the fact that TAG always performs at least as well as an MLP if the network topology is unhelpful (Fig. 1b). However, this mechanism neglects the information contained in the topology of the graph alone, i.e.
in the edges themselves, which appear to be helpful as indicated by the N2V+MLP performance. Conceptually, the FiLM 18 layer uses not only the sender node's features for its convolution, but also the receiving node's features and the type of the connecting edge. When we examined which features are most important for FiLM predictions, we found that the influence of the adjacent nodes' input features was negligible for the individually examined nodes (see Supplementary Figure 17), and globally the latent features of surrounding nodes make up only a tiny proportion of all messages in the FiLM layer (Supplementary Figure 18, Supplementary Note 6). This indicates that also for FiLM, the receiving node's features and the edge type are the most important factors determining the message (Supplementary Note 7).
Thus, reminiscent of the N2V+MLP results, the FiLM layer predominantly learns the topology free of the influence of adjacent nodes' features, but has the additional advantage of incorporating edge types.
The fact that FiLM's messages mostly depend on the receiving node's features and the edge type is also reflected in the results shown in Fig. 5. The importance of the edges is almost identical between edges of the same type, with only minute differences caused by the sending node's features. It also indicates that only a handful of the roughly 300 incident edges are important for the prediction. Collectively, these analyses indicate that methods gain performance when they have the option to ignore adjacent nodes' features and instead learn patterns of select incident edges.
Biologically, it is possible that neighborhood functions are encoded in the network topology, e.g. in the form of protein complexes or sufficiently specific interaction wiring in different modules, and are thus learned indirectly by the GNNs as a pattern that captures functions better than individual node features. In addition, the wealth of connections, and network incompleteness, likely also impact the observed phenomena. Novel methods designed to distill the information of edges 20,21 or topological features 22,23 of the graph alongside the node features could therefore be a valuable addition to future iterations of comparable work.

Supplementary Note 4: Impact of network biases on predictions
As the bias of aggregating small-scale literature is visible in network characteristics concerning Mendelian disorder genes (Supplementary Note 1), we monitored how these biased inputs affect the predicted candidate genes and potentially inhibit new insights. TAG, which is trained only on IntAct Direct Interaction, is heavily biased towards predicting genes that have a degree larger than zero in the IntAct Direct Interaction network (Supplementary Figure 11a) and, in the case of immune dysregulation, even exclusively predicts such genes as candidates (Supplementary Figure 11b). This indicates that presence in the IntAct Direct Interaction network, and hence the fact that the involved proteins were deemed 'interesting' enough by researchers to justify their biochemical purification and in vitro interaction studies, i.e. previous perceptions of a gene's importance, was a key feature in their prediction by TAG. The argument that the scientific community's accumulated knowledge, reflected in such focused deeper characterization of relatively few genes, corresponds to underlying biological importance has previously been refuted 1,3,4. The candidate genes predicted by FiLM, which is trained on all networks simultaneously, including IntAct Direct Interaction, and is aware of edge types, show much lower odds ratios (OR) for genes from IntAct Direct Interaction, but still significantly higher than 1 (Supplementary Figure 11a). The Node2Vec in N2V+MLP is also trained on all networks simultaneously but is not aware of edge types, so IntAct Direct Interaction makes up only 0.36% of all edges, further reducing the ORs of genes contained in IntAct Direct Interaction among its candidates. Importantly, when FiLM is trained on all networks except IntAct Direct Interaction and IntAct Physical Association (FiLM Unbiased), we still reap the benefits of GNNs without biasing predictions towards genes included in IntAct. However, even the candidates produced by N2V+MLP and FiLM Unbiased are weakly
enriched for genes involved in IntAct Direct Interaction, reflecting a bias of IntAct Direct Interaction for immune system regulation. This interpretation is supported by the validation of FiLM Unbiased predictions using mouse KOs, in which the OR drops for immune dysregulation compared to the initial FiLM predictions, but not for other disease groups like cardiovascular disease (Supplementary Figure 12a, Supplementary Data 4).
Furthermore, differentially expressed genes and drug targets (Supplementary Figure 12b, c) show comparable levels of enrichment for most diseases whether or not IntAct Direct Interaction is used with FiLM. Intriguingly, for immune dysregulation the enrichment of differentially expressed genes increases when IntAct data are removed, indicating that these networks might also be biased towards genes examined in small-scale mouse experiments, which is consistent with the reasoning that laborious mouse knockout and in vitro studies are more readily done for genes/proteins considered important. Moreover, for immune dysregulation, only candidates produced by the unbiased version of FiLM are enriched for druggable genes that are not yet drug targets (Supplementary Figure 12c, Dr-), indicating that the biases inherited from small-scale literature could indeed prevent the discovery of new drug development opportunities.
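The enrichment statistic used throughout these comparisons is the odds ratio of a 2×2 contingency table, e.g. candidate vs. non-candidate against in-IntAct vs. not-in-IntAct. A minimal sketch (the counts below are made up for illustration; the analyses above additionally compute Fisher's exact test P-values and FDR correction, omitted here):

```python
# Odds ratio of the 2x2 table [[a, b], [c, d]]:
#   a = candidates with the property, b = candidates without it,
#   c = non-candidates with the property, d = non-candidates without it.

def odds_ratio(a, b, c, d):
    """OR = (a / b) / (c / d) = (a * d) / (b * c)."""
    return (a * d) / (b * c)

# Made-up example: 30 of 100 candidates are in IntAct Direct Interaction,
# vs. 200 of 2000 non-candidates.
or_candidates = odds_ratio(30, 70, 200, 1800)
```

An OR above 1 (here about 3.9) indicates enrichment of the property among candidates; the FiLM Unbiased comparison above asks how far this OR drops when the biased networks are withheld from training.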

Supplementary Note 6: Influence of sender, receiver and edge on the message passing dynamics of the FiLM layer
Because the neighboring nodes' features appear to contribute little to the prediction while the edges themselves are relevant, we ascertained the influence of sender, receiver, and edge on the message passing dynamics of the FiLM layer. The FiLM layer introduces an offset β and a linear coefficient γ for every feature of an incoming message m_u^(t) from the sender node u in the neighborhood of v, based on the edge type r and the receiver node v:

x_v^(t+1) = Σ_{r∈R} Σ_{u∈N_r(v)} ( γ_{r,v}^(t) ⊙ m_u^(t) + β_{r,v}^(t) )    (1)

Thus, the neighborhood node's features are only relevant for the first part of the term, while the bias β_{r,v}^(t) depends only on the receiver node v and the edge type r. We can therefore assess the balance between the influence of the neighborhood node's features and the features of the receiving node with the following ratio:

‖γ_{r,v}^(t) ⊙ m_u^(t)‖ / ( ‖γ_{r,v}^(t) ⊙ m_u^(t)‖ + ‖β_{r,v}^(t)‖ )    (2)

which is close to 0 if the message is dominated by the bias term β_{r,v}^(t) and is thus irrelevant of the neighborhood node's features. A high value still does not guarantee a high influence of m_u^(t), but we can assume that it is increasingly relevant if the model decides to modulate it via γ_{r,v}^(t) instead of simply overriding it via β_{r,v}^(t).
The message features passed along the edges of all networks in the first FiLM layer are, in fact, dominated by the bias term β, rendering the sender's latent features irrelevant (Supplementary Fig. 18a). The message features passed along the edges of gene regulatory networks in the second layer (see Supplementary Fig. 18b) are less heavily dominated by the bias term. However, this still does not mean that the actual input features are important, since the latent features of the senders have already been influenced by their incident edges in the first layer. It does, however, indicate that protein-protein networks and gene regulatory networks convey different notions of neighborhoods, which might be influenced by the former being bidirectional and the latter unidirectional.
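The balance diagnostic discussed above, comparing the modulated sender term γ ⊙ m against the receiver- and edge-dependent bias β per message, can be sketched as follows (pure Python, assumed notation; the real analysis operates on the trained model's latent messages):

```python
# Per-message diagnostic: ||gamma ⊙ m|| / (||gamma ⊙ m|| + ||beta||).
# Values near 0 mean the bias term dominates and the sender's features are
# largely irrelevant to the message; values near 1 mean the modulated sender
# term dominates.

import math

def film_message(gamma, m, beta):
    """One FiLM message: gamma ⊙ m + beta, elementwise."""
    return [g * mi + b for g, mi, b in zip(gamma, m, beta)]

def sender_influence(gamma, m, beta):
    """Ratio of the modulated-term norm to the total norm, in [0, 1]."""
    mod = math.sqrt(sum((g * mi) ** 2 for g, mi in zip(gamma, m)))
    bias = math.sqrt(sum(b ** 2 for b in beta))
    return mod / (mod + bias) if mod + bias > 0 else 0.0

# A bias-dominated message: gamma nearly zero, large beta (made-up values).
ratio = sender_influence(gamma=[0.01, 0.01], m=[1.0, 1.0], beta=[2.0, 2.0])
```

Averaging this ratio over all messages of a layer, split by edge type, gives the per-network comparison between the first and second FiLM layers described above.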

Supplementary Note 7: Additional predicted examples
Analogous to the immune dysregulation examples in the main text, we also explored candidate genes predicted by FiLM for cardiovascular disease as suitable targets for drug development.
OBSCN and ITGA7 receive high Consensus Scores (11 and 9, respectively) and their protein products are druggable but not yet targeted by any drug.
OBSCN comprises more than 80 exons and 28 transcript isoforms 25 and encodes the large, modular protein obscurin, which fulfills a wide range of functions in different tissues, including skeletal and heart muscle 26. Specific mutations in the OBSCN gene are implicated in hypertrophic cardiomyopathy 27, age-dependent cardiac remodeling, and arrhythmia 28. Furthermore, obscurin has been implicated in non-muscular functions and pathologies 26. FiLM bases its prediction of OBSCN as a candidate gene on its location downstream of multiple transcription factors across several tissues, implicating it in inflammation (STAT2, ZNF384) 29-31, angiogenesis (BRF1) 32, and immune dysregulation after ischemic damage (IRF3) 33 (Supplementary Figure 16a).
The integrin subunit alpha 7, encoded by ITGA7, is located in the cell membrane, is involved in cell-cell and cell-matrix communication, and has been implicated in the migration and invasion of malignant cells during metastasis formation 34,35. Recently it was shown that mutations in ITGA7 contribute to congenital muscular dystrophy 36, adult-onset cardiac dysfunction 37, and cardiomyopathy 38,39, implicating a role in the etiology of cardiovascular diseases. The fact that FiLM bases its prediction on ITGA7 being located downstream of the estrogen-related receptor alpha, encoded by ESRRA (Supplementary Figure 16b), opens the possibility that this gene mediates sex-related differences in genetic cardiomyopathies 40.
Neither OBSCN nor ITGA7 has been detected in GWAS for any heart-related traits, except PR interval (GCST010321) in the case of OBSCN. Despite this lack of detection, some OBSCN variants are known to contribute to left ventricular non-compaction 41 and dilated cardiomyopathy 42.
Despite the absence of a GWAS signal, Speos identified them as core gene candidates based on their tissue-specific gene expression (Supplementary Figure 16c, d). Especially their high expression in the left ventricle and atrial appendage is vital for their classification, as expected for factors contributing to cardiovascular disease, since these anatomical regions are key players in several related pathophysiologies 41,43. Although the understanding of the role the two genes play in cardiovascular disease is still in its infancy, at least OBSCN has already raised expectations for novel treatments and therapeutics 26, underscoring the value of Speos' predictions for hypothesis development even when significant genome-wide associations have not been detected.
and Mendelian disorder genes for immune dysregulation and cardiovascular disease, respectively. The two left columns show results for the systematically generated BioPlex 3.0 HEK293T and HuRI adjacency matrices. The two right columns show the adjacency matrices IntAct Direct Interaction and STRING (confidence > 0.7), which are largely assembled from hypothesis-driven small-scale data. Connectivity of Mendelian-gene-encoded proteins in the systematic networks is similar to that of unlabeled nodes. In the collated networks, proteins encoded by Mendelian disorder genes show higher assortativity, i.e. a tendency to interact with each other, for both phenotypes. b, in each panel the left bar shows the fraction of nodes in the largest connected component (component 0) versus isolated small components and disconnected nodes. The right bar shows how the positive and unlabeled nodes are distributed among these components. c, the top and bottom rows show the degree distributions of Mendelian disorder genes and unlabeled genes for immune dysregulation and cardiovascular disease, respectively. The two left columns show the adjacencies BioPlex 3.0 HEK293T and HuRI, which are unbiased, systematically generated networks. The two right columns show the adjacencies IntAct Direct Interaction and STRING (confidence > 0.7), which are not systematically generated. The bias towards known disease genes in the two right networks can be seen for both phenotypes. First, the average degree of Mendelian disorder genes is higher than the average degree of unlabeled genes. Second, the degree distribution of the Mendelian disorder genes in STRING does not follow a scale-free degree distribution. On the contrary, nodes with a medium degree are the most abundant, while nodes of low and very high degree are rare.

Supplementary Figure 5 |
GNN Model Architecture. a, the general model architecture of all GNN models used in the experiments. The input features of node v are transformed into latent space by the pre-message passing module, which produces the latent vector x_v pre-MP. This latent vector is fed into the message passing module, where the neighborhood feature aggregation takes place according to the graph shown in panel b. Each layer aggregates one hop in the network. Arrows denote the aggregation operators of the respective GNN layers described in the Methods section. After message passing, the latent vector x_v post-MP contains information of its n-hop neighborhood and is fed into the post-message passing module, which predicts the class of node v. The hyperparameters q, r, and s control the number of layers per module. Not shown are nonlinearity functions and normalization layers. b, the simplified graph structure for the message passing shown in a, with the observed node v in the center. Arrowheads denote the direction of the message passing; circles denote the respective n-hop neighborhoods. c, normal versus alternative information flow through the network. Most commonly, all modules are chained consecutively, each feeding its output to the next. In the 'Skip' setting, the output vectors of the pre-MP and of the MP are summed before being fed into the post-MP module. In the 'Concatenate' setting, the output vectors of the pre-MP and of the MP are concatenated before being jointly fed into the post-message passing module. In this setting, the first layer of the post-message passing module has twice the number of input dimensions.

Supplementary Figure 7 | Network Performance and Properties. a, Boxplots of model performance (y-axis) for different adjacency matrices (x-axis). Adjacency "None" refers to an MLP that does not use any graph information. Boxes represent the interquartile range, colored bars are medians, whiskers extend at most 1.5 times the interquartile range, and outliers are shown individually. The gray bar in the background denotes the interquartile range of all MLP runs. "Normal" indicates the normal information flow from pre-MP to MP to post-MP (Supplementary Figure 5c). "Concat" indicates that the output of pre-MP is concatenated to the output of MP before being passed into post-MP. "Skip" indicates that the output of pre-MP is added to the output of MP using a sum operation before being passed into post-MP. b, Network properties of IntAct Direct Interaction with the label set for immune dysregulation. c, Network properties of all networks merged together with the label set for immune dysregulation. Components: the left bar shows the fraction of the network that is either in the largest connected component (component 0), in microcomponents (smaller than 1% of all nodes), or in isolated nodes that have no incident edge; the right bar shows the distribution of labeled and unlabeled nodes. Paths: each bar shows the number of positives that have other positives in the neighborhood of the indicated size. Color indicates the number of positives in the neighborhood for each node according to the scale on the right. The black bar on the left indicates the number of isolated positives. Degrees: degree distributions of positives and unlabeled nodes. Homophily: the plot shows the percentage of nodes in the neighborhood of a node that either share the same label or have the opposite label. Metrics: additional metrics of the graphs.

Supplementary Figure 13 |
Mouse Knockout validation disease-specificity experiments. Candidate genes for all five disorders are validated against the mouse knockout genes for immune dysregulation. Odds ratio (OR) (right y-axis) for observing disease-relevant phenotypes in mice with knockouts of orthologs of candidate core genes in the indicated convergence score bins (x-axis) of the five classifier methods (colored lines). Gray lines indicate the strength of candidate gene sets (left y-axis) in the corresponding bin for the phenotypes as indicated in the panel. Only ORs with an FDR < 0.05 (Fisher's exact test) are shown. Bars to the right (M) and left (G) of each plot indicate set strength (gray) and OR (colored) of Mendelian genes and GWAS genes for each phenotype. Filled bars represent ORs with an FDR < 0.05; otherwise bars are hollow. Precise P-values, FDR, and n for each test are shown in Supplementary Data 4.

Supplementary Figure 14 | External Validation of Novel Phenotypes. a, b, LoF intolerance and missense mutation intolerance Z-scores of Mendelian genes and the indicated candidate and non-candidate sets generated by the five methods. Shown are group means and 95% confidence intervals of Tukey's HSD test. Colored symbols and error bars indicate P < 0.05 in comparison with the respective non-candidate sets; non-significant sets are shown in gray. The dashed line indicates the mean across all genes. c, Odds ratio (OR) (right y-axis) for observing disease-relevant phenotypes in mice with knockouts of orthologs of candidate core genes in the indicated convergence score bins (x-axis) of the five classifier methods (colored lines). Gray lines indicate the strength of candidate gene sets (left y-axis) in the corresponding bin for the phenotypes as indicated in the panel. Only ORs with an FDR < 0.05 (Fisher's exact test) are shown. Bars to the right (M) and left (G) of each plot indicate set strength (gray) and OR (colored) of Mendelian genes and GWAS genes for each phenotype. Bars representing significant ORs are filled, hollow bars
represent non-significant ORs. d, Odds ratios (ORs) of Mendelian genes (first row) and of candidate genes of the five selected methods (rows) for common complex subtypes of the Mendelian disorder subgroups. ORs with an FDR > 0.05 (Fisher's exact test) are shown in gray. e, Enrichment of drug targets and druggability in Mendelian disorder genes and in the indicated consensus score bins (x-axis) of the five classifier methods (colored lines). Gray lines indicate the strength of candidate gene sets (left y-axis) in the corresponding bin for the phenotypes as indicated in the panel. Only ORs with an FDR < 0.05 are shown. Bars to the right of each plot (M) indicate set strength (gray) and OR (colored) of Mendelian genes for each phenotype. Precise P-values, FDR, and n for each test are shown in Supplementary Data 4. b, Odds ratios (ORs) of Mendelian genes (first row) and of candidate genes of the five selected methods (rows) for common complex subtypes of the five Mendelian disorder groups. ORs with an FDR > 0.05 are shown in gray. c, Enrichment of drug targets and druggability in Mendelian disorder genes and the indicated candidate gene sets. DT: OR of known drug targets. xDC: ratio of the median number of drug-gene interactions per candidate gene to the median of non-candidates; only genes with drug-gene interactions are considered. Dr: OR of druggable genes. Dr-: OR of druggable genes after all drug targets have been removed. Odds ratios with an FDR > 0.05 are grayed out. For all panels, precise P-values, FDR, and n for each test are shown in Supplementary Data 9 & 23.
indicated candidate gene sets. DT: OR of known drug targets. xDC: ratio of the median number of drug-gene interactions per candidate gene to the median of non-candidates; only genes with drug-gene interactions are considered. Ratios with an FDR > 0.05 (U-test) are grayed out. Dr: OR of druggable genes. Dr-: OR of druggable genes after all drug targets have been removed. Odds ratios with an FDR > 0.05 (Fisher's exact test) are grayed out. Precise P-values, FDR, and n for each test in each panel are shown in Supplementary Data 16-29, respectively. candidate genes of the five selected methods (rows) for common complex subtypes of the Mendelian disorder subgroups. ORs with an FDR > 0.05 (Fisher's exact test) are shown in gray. d, Enrichment of drug targets and druggability in Mendelian disorder genes and the indicated candidate gene sets. DT: OR of known drug targets. xDC: ratio of the median number of drug-gene interactions per candidate gene to the median of non-candidates; only genes with drug-gene interactions are considered. Ratios with an FDR > 0.05 (U-test) are grayed out. Dr: OR of druggable genes. Dr-: OR of druggable genes after all drug targets have been removed. Odds ratios with an FDR > 0.05 (Fisher's exact test) are grayed out. Precise P-values, FDR, and n for each test in each panel are shown in Supplementary Data 10-13, respectively.