Preclinical validation of therapeutic targets predicted by tensor factorization on heterogeneous graphs

Incorrect drug target identification is a major obstacle in drug discovery. Only 15% of drugs advance from Phase II to approval, with ineffective targets accounting for over 50% of these failures1–3. Advances in data fusion and computational modeling have independently progressed towards addressing this issue. Here, we capitalize on both approaches with Rosalind, a comprehensive gene prioritization method that combines heterogeneous knowledge graph construction with relational inference via tensor factorization to accurately predict disease-gene links. Rosalind demonstrates a performance increase of 18%-50% over five comparable state-of-the-art algorithms. On historical data, Rosalind prospectively identifies 1 in 4 therapeutic relationships eventually proven true. Beyond efficacy, Rosalind accurately predicts clinical trial successes (75% recall at rank 200) and distinguishes likely failures (74% recall at rank 200). Lastly, Rosalind predictions were experimentally tested in a patient-derived in vitro assay for rheumatoid arthritis (RA), which yielded 5 promising genes, one of which is unexplored in RA.


State-of-the-art Algorithm Comparison, Additional Metrics
Reported below are the mean average precision at rank 500 (mAP@500) and recall at rank 200 (recall@200) performance numbers. Note that in the state-of-the-art comparison we focus only on recall. We have included mAP here to provide additional information about Rosalind's relative performance, but, as mentioned in the manuscript, we do not believe mAP to be a reliable performance metric for these analyses.

Table 5. State-of-the-art comparison. mAP@500 and recall@200 are calculated across the full set of 198 diseases, and reported as values between 0 and 100. Recall@200 is also compared across all algorithms for the full set of diseases (Full) and for RA alone (RA). Recall numbers correspond to the markers shown in Figures 3C and 3D.
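The two metrics above can be sketched for a single disease as follows; this is an illustrative implementation, assuming average precision is normalized by min(number of known genes, k), which may differ from the exact normalization used in the manuscript.

```python
def recall_at_k(ranked_genes, true_genes, k=200):
    """Fraction of the known therapeutic genes recovered in the top-k predictions."""
    hits = sum(1 for g in ranked_genes[:k] if g in true_genes)
    return hits / len(true_genes)

def average_precision_at_k(ranked_genes, true_genes, k=500):
    """Average precision truncated at rank k for one disease; mAP@k is the
    mean of this value across all test diseases."""
    if not true_genes:
        return 0.0
    hits, score = 0, 0.0
    for i, g in enumerate(ranked_genes[:k], start=1):
        if g in true_genes:
            hits += 1
            score += hits / i  # precision at each hit position
    return score / min(len(true_genes), k)
```

For example, with a ranking `["A", "B", "C", "D"]` and known genes `{"A", "C"}`, recall at k=2 is 0.5.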

Aligning State-of-the-art Gene Prioritization with Rosalind Data
To map diseases and gene predictions from Open Targets1, the v3 API was used to match each disease name in the 198-disease test set to its closest match in the Open Targets database, collecting an Orphanet ID for each disease.
Next, all associated genes and scores, sorted according to the Open Targets composite score, were collected for each disease using the API, producing a ranked list of genes for each disease in the test set. Of the 198 test diseases, 184 were mapped successfully for Open Targets.
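The ranking step can be sketched as below. The record fields (`association_score`, `overall`, `target`, `gene_info`, `symbol`) are modeled on the Open Targets v3 response schema but should be treated as illustrative; the actual fetch would go through the v3 search and association endpoints, which are not reproduced here.

```python
def rank_by_composite_score(associations):
    """Sort Open Targets-style association records by the overall composite
    score, descending, and return the gene symbols as a ranked list.
    Field names follow the (assumed) v3 API payload shape."""
    ordered = sorted(associations,
                     key=lambda a: a["association_score"]["overall"],
                     reverse=True)
    return [a["target"]["gene_info"]["symbol"] for a in ordered]
```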
For SCUBA2, the training genes for each of the 198 diseases were provided as the seed genes for the algorithm. The algorithm learns a weighting over a set of gene-gene similarity matrices, and this multiple kernel learning strategy is used to associate seed (training) genes with new genes. Five matrices were used here, as provided in their work: a Markov Diffusion Kernel inspired by heat diffusion, with iteration parameters 2 and 6; and a regularized Laplacian Kernel (RLK) similar to random walks, with scaling factors 1, 10, and 100. Therapeutic genes in the Rosalind training dataset were mapped to ENSEMBL3 IDs, resulting in an 8% loss of genes which could not be mapped successfully, and used as seed genes for learning the kernel weightings. After learning, these weightings were used to rank the genome. The diversity of information sources and access to the training data used in Rosalind help the SCUBA algorithm rank genes successfully. Of the 198 test diseases, 187 were mapped successfully for SCUBA.
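The scoring step that follows kernel-weight learning can be sketched as below. This shows only how learned weights combine the five kernels and rank genes by aggregate similarity to the seed genes; the weight-learning optimization itself is SCUBA's multiple kernel learning procedure and is not reproduced here.

```python
import numpy as np

def rank_with_kernels(kernels, weights, seed_idx):
    """Combine gene-gene similarity kernels with learned weights, then rank
    every gene by its mean similarity to the seed (training) genes.

    kernels  : list of (n_genes, n_genes) similarity matrices
    weights  : learned kernel weights, one per matrix
    seed_idx : indices of the seed genes
    Returns gene indices ordered best-first."""
    combined = sum(w * K for w, K in zip(weights, kernels))
    scores = combined[:, seed_idx].mean(axis=1)  # similarity to seed set
    return np.argsort(-scores)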
For the Bayesian matrix factorization algorithm MACAU4, the conditioning information was taken from that work, using Interpro5, Gene Ontology6, and Uniprot7 as additional context for the genes; similarly, for diseases, literature-based disease features derived from textual term-frequency inverse-document frequency (TF-IDF) occurrences in PubMed were used8. The provided textual terms were not used for the gene targets, as the article material does not provide the means to map them successfully. The disease-gene matrix was defined using the training data from the benchmark described above (using training data from Rosalind), with 10x as many randomly-sampled negative associations (zero-entries) in the matrix for every one positive entry (1-entry). While MACAU performs well at low k (top 20 targets), as the authors show in their work, it suffers in ranking as k is increased and more targets are examined.
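Constructing the zero-entries of the disease-gene matrix can be sketched as follows, assuming uniform random sampling of unobserved (disease, gene) pairs at a 10:1 ratio; the exact sampling scheme used with MACAU may differ.

```python
import random

def sample_negatives(positives, diseases, genes, ratio=10, seed=0):
    """Sample `ratio` unobserved (disease, gene) pairs as zero-entries for
    every observed positive (1-entry). Uniform sampling is an assumption
    of this sketch."""
    rng = random.Random(seed)
    pos = set(positives)
    negatives = set()
    target = ratio * len(pos)
    while len(negatives) < target:
        pair = (rng.choice(diseases), rng.choice(genes))
        if pair not in pos:  # never reuse a known positive as a negative
            negatives.add(pair)
    return sorted(negatives)
```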

The performance across algorithms for the minimal set of diseases present in all methodologies can be found in Fig. 2, shown with recall at k averaged across diseases; the diseases themselves are listed in Table 6. Note that this qualitatively matches Figure 3C.

Table 6. Minimal set of 40 diseases present for all comparison models.