Large-scale computational drug repositioning to find treatments for rare diseases

Rare, or orphan, diseases are conditions afflicting a small subset of people in a population. Although these disorders collectively pose significant health care problems, drug companies require government incentives to develop drugs for rare diseases due to extremely limited individual markets. Computer-aided drug repositioning, i.e., finding new indications for existing drugs, is a cheaper and faster alternative to traditional drug discovery offering a promising venue for orphan drug research. Structure-based matching of drug-binding pockets is among the most promising computational techniques to inform drug repositioning. In order to find new targets for known drugs ultimately leading to drug repositioning, we recently developed eMatchSite, a new computer program to compare drug-binding sites. In this study, eMatchSite is combined with virtual screening to systematically explore opportunities to reposition known drugs to proteins associated with rare diseases. The effectiveness of this integrated approach is demonstrated for a kinase inhibitor, which is a confirmed candidate for repositioning to synapsin Ia. The resulting dataset comprises 31,142 putative drug-target complexes linked to 980 orphan diseases. The modeling accuracy is evaluated against the structural data recently released for tyrosine-protein kinase HCK. To illustrate how potential therapeutics for rare diseases can be identified, we discuss a possibility to repurpose a steroidal aromatase inhibitor to treat Niemann-Pick disease type C. Overall, the exhaustive exploration of the drug repositioning space exposes new opportunities to combat orphan diseases with existing drugs. DrugBank/Orphanet repositioning data are freely available to research community at https://osf.io/qdjup/.


Table of content:
• Text S1. Measuring the chemical correlation with virtual screening (page 2).
• Table S1. The Huang dataset of bound and unbound proteins (page 8).
Text S1. Measuring the chemical correlation with virtual screening.
The chemical correlation was developed to indirectly measure the similarity between binding sites with virtual screening. 1,2 It can be calculated for compound ranks assigned by either ligandor structure-based virtual screening. In the present study, structure-based virtual screening is conducted with AutoDock Vina 3 for target pocket in the Huang dataset 4 against a non-redundant library of 1,515 FDA-approved drugs obtained from the DrugBank database. 5 Docking poses generated by Vina are ranked according to the predicted binding energy. Subsequently, nonparametric Spearman's r correlation coefficient 6 is computed for compound ranks assigned to a pair of pockets. Spearman's r measures the degree of monotonic relationship ranging from +1 to -1, where +1 is a perfect correlation, 0 is the lack of any correlation, and -1 is an anti-correlation.
A high Spearman's r indicates that a pair of pockets not only exhibit high binding affinity toward similar compounds but also do not bind similar ligands.
Text S2. Addressing the early recognition problem with BEDROC.
The Boltzmann-enhanced discrimination of the receiver operating characteristic (BEDROC) 7 is a generalization of the area under the ROC curve (AUC) addressing the early recognition problem.
While the AUC metric is useful to assess the performance of a binary classifier, it fails to discriminate curves with the same AUCs but differing degrees of the early recall. For two ROC curves varying in shape, many applications prefer the curve with a higher proportion of its AUC at a low false positive rate. Classifiers requiring early recognition capabilities include, for instance, virtual screening and the detection of off-targets, where a large number of initial molecules must be reduced to a testable number of promising candidates. Similar to AUC, BEDROC ranges from 0 to 1 and can be interpreted as the probability of a ranked positive to be positioned higher in the ordered list than by a random chance. However, in contrast to the uniform distribution in AUC, BEDROC is based on the exponential distribution with the adjustable exponential factor defining the desired degree of "early recognition". In our study, we use the recommended value of 20, which means that 80% of the maximum contribution to the BEDROC score comes from the first 8% of the ranked list.

Text S3.
Evaluating the structure quality with RMSD, TM-score, and GDT-score.
The root-mean-square deviation (RMSD) measures the similarity between superposed threedimensional protein structures based on Cartesian distances. 8 It can be calculated for Ca atoms or all atoms over the entire length of a protein, as well as for specific regions, such as transmembrane helices, loops, binding pockets, etc. The unit of the RMSD is Angstrom [Å] and high values correspond to low similarities between two structures. Nonetheless, the global RMSD was shown to be the least representative of the degree of structural similarity because it is dominated by the largest error, 9 for instance, different conformation of a single loop can inflate the RMSD between two otherwise identical proteins. Furthermore, the RMSD is strongly lengthdependent complicating the comparison of proteins of different length.
A number of other measures have been developed to provide a statistically meaningful assessment of similarity between biomolecules. An example is the Template Modeling (TM)score quantifying the topological similarity between a pair of protein structures based on the coordinates of Ca atoms. 10 TM-score ranges from 0 to 1 with higher values indicating a higher similarity between protein structures, and the value of 1 is a perfect match between two structures. Scores below 0.17 correspond to randomly chosen unrelated protein structures, whereas scores above 0.5 indicate that two protein structures have the same fold 11 according to the Structural Classification of Proteins (SCOP) 12 and the CATH Protein Structure Classification database. 13 Another metric is the Global Distance Test (GDT)-score reporting the number of Ca atom pairs within distance thresholds of 1, 2, 4, and 8 Å after the superimposition of the query and reference structures. 14 However, these distance cutoffs are subjective and may require target-specific adjustments. 15 Further, the magnitude of the GDT-score for random structure pairs has a similar to the RMSD power-law dependence with the protein length. 10 GDT-score ranges from 0 to 1 with higher values indicating a higher similarity between protein structures.
eMatchSite is a sequence-order independent algorithm to compare ligand-binding sites. 1,16 It assigns a set of residue-level scores extracted from weakly homologous template proteins complexed with small molecules covering various properties of binding ligands and residues. In addition, the evolutionary information is included as sequence and secondary structure profiles, and entropy. An important feature of eMatchSite is its capability to predict pairwise Ca-Ca distances between binding residues upon the optimal alignment of two pockets by machine learning. Based on these distances, it constructs local alignments of pocket residues by solving the assignment problem with the Kuhn-Munkres algorithm. 17,18 Binding site alignments are subsequently assigned a similarity score, called the eMS-score, which measures the overlap of various physicochemical and evolutionary features. eMS-score ranges from 0 for completely dissimilar pockets to 1 for identical pockets, with an optimized threshold of 0.56 accurately distinguishing between pockets binding similar and dissimilar molecules. eFindSite is a structure/evolution-based ligand-binding site prediction approach employing meta threading to identify a set of evolutionarily related templates complexed with ligands. 19,20 These templates are first structurally aligned onto the target with Fr-TM-align 21 followed by the clustering of the centers of mass of bound ligands to identify putative binding sites in the target structure. eFindSite offers a machine learning-based confidence estimation system not only to rank the predicted sites, but also to reliably evaluate the corresponding ranking confidence. This algorithm uses a vector of various features, including the fraction of templates that share a particular site, the cluster multiplicity, the average TM-score of templates to the target, the number and the average confidence of predicted binding residues, and a protein-ligand binding index calculated over predicted binding residues. The assigned confidence estimates the likelihood that the site center is predicted within a distance of 8 Å from the geometrical center of a natively bound ligand.

Text S7. Chemical alignment with KCOMBU and the Tanimoto coefficient.
Comparing the chemical structures of organic molecules has a number of applications in cheminformatics. Techniques employing the graph theory find equivalent atom and bonds in molecules by solving the maximum common substructure (MCS) and/or maximum clique problems. An example of such algorithm is the K(ch)emical structure COMparison using the BUildup algorithm (KCOMBU). 22 This method is capable of finding connected and disconnected MCSs in molecules represented by graphs. In addition to the chemical alignment between two molecules, KCOMBU reports their similarity in terms of the Tanimoto coefficient (TC). 23 Widely used TC is arguably the most reliable similarity measure for low-molecular weight organic molecules 24 . Briefly, the TC compares the extent of commonality or similarity between two sets by defining the ratio of common elements to the non-common elements. TC ranges from 0 for a pair of completely dissimilar compounds to 1 indicating identical molecules. For molecule pairs with the TC greater than 0.4, KCOMBU was demonstrated to correctly match the majority of atoms when compared to their exact 3D superpositions. Therefore, a minimum TC value of 0.4 in KCOMBU should be employed keeping in mind that the atom matching accuracy significantly improves for chemical alignments assigned higher TC values.  RMSD is the root-mean-square deviation computed over ligand heavy atoms.