Identification of Candidate Drugs for Heart Failure using Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Integrated Analysis of Gene Expression between Heart Failure and DrugMatrix Datasets

. Identifying drug target genes in gene expression profiles is not straightforward. Because a drug targets not mRNAs but proteins, mRNA expression of drug target genes is not always altered. In addition, the interaction be-tween a drug and protein can be context dependent; this means that simple drug incubation experiments on cell lines do not always reflect the real situation during active disease. In this paper, I apply tensor decomposition-based unsupervised feature extraction to the integrated analysis of gene expression between heart failure and the DrugMatrix dataset where comprehensive data on gene expression during various drug treatments of rats were reported. I found that this strategy, in a fully unsupervised manner, enables us to identify a combined set of genes and compounds, for which various associations with heart failure were reported.


Introduction
In silico drug discovery is an important task because experimental identification/verification of therapeutic compounds is a time-consuming and expensive process.There are two major trends of in silico drug discovery: the ligand-based approach [1] and structure-based [2] approach.The former is very straightforward; new drug candidates are identified based upon the similarity with known drugs no matter how the similarity is evaluated.Although it is a powerful method, there are some drawbacks; if there are no known drugs for target proteins, then there is no way to find new drug candidates.Even if there are many known drugs for the target protein, it is hopeless to find compounds that are effective but without any similarity with known drugs.The second, structure-based approach, can address these weaknesses.It can identify new therapeutic compounds even without the information about known drugs.Of course, there are some drawbacks in this strategy, too.If the target protein structure is not known, it must be predicted prior to the drug discovery process.Even if the target protein's structure is known, because we need to numerically verify the binding affinity between the ligand compound and target protein, which also requires a large amount of computational resources, structure-based in silico drug discovery is still far from easy to perform.In addition, prediction accuracy of protein structure and of ligand-binding structure is not very high at all.Thus, it would be helpful to have an additional/alternative strategy for in silico drug discovery.
Recently, an alternative approach was proposed that is aimed at finding drug candidates computationally using gene expression profiles of cell lines treated with compounds [3,4].This third approach is not straightforward at all.First of all, because compounds target not mRNAs but proteins, mRNA expression of drug target proteins is not always affected.Therefore, direct identification of a drug target protein in gene expression data cannot be done.Second, gene expression alteration caused by treatment with a compound may be context dependent; in other words, in a cell line, the gene expression difference caused by incubation with a compound may differ from that in diseases.To compensate these difficulties, the gene expression signature strategy was developed.In this approach, gene expression alteration profiles caused by treatment of a cell line with various drug candidates are compared with those of known drugs.If the profiles are similar, then new drug candidates are expected to function similarly to known drugs.Although this third strategy is a useful one, if there are no known drugs for the target disease, this approach cannot function at all as in the case of ligand-based approaches.
Some examples aiming to identify drug-target interaction from gene expression based upon previous knowledges are as follows.Wang et al [5] tried to identify on and off-targets of drugs based upon similarity between drug-induced in vitro genomic expression changes.Iwata et al [6] explored potential target proteins with cell-specific transcriptional similarity using chemical-protein interactome.Lee et al [7] tried drug repositioning for cancer therapy based on large-scale drug-induced transcriptional signatures.Although these are only a few examples, they need pre-knowledge about drugtarget interactions.Alternatively, instead of drug-target interactions, drug-disease interaction is investigated.For example, Cheng et al [8] tried to measure the connectivity between disease gene expression signatures and compound-induced gene expression profiles.Sirota et al [9] also integrated gene expression measurements from 100 diseases and gene expression measurements on 164 drug compounds, yielding predicted therapeutic potentials for these drugs.Iorio et al [10] investigated compound-targeted biological pathways based upon gene expression similarity.They are unsupervised approaches to some extent, but target genes cannot be exploited.
In this paper, I propose a strategy that can infer drug candidates from drug treatmentassociated gene expression profiles without the information about known compounds for diseases.In this strategy, I employ the tensor decomposition (TD)-based unsupervised feature extraction (FE) approach, which is an extension of the recently proposed principal component analysis (PCA)-based unsupervised FE; PCA-based unsupervised FE successfully solved various bioinformatic problems [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29].In this TD-based strategy, tensors were generated using a mathematical product of a gene expression profile of drug-treated cell lines and of a gene expression profile of a disease.Then, pairs of compounds and genes are identified whose mRNA expression alteration is associated with drug-treated cell lines and is coincident with such alteration during disease progression.Biological evaluation of the identified genes and compounds based upon past studies turned out to be promising.
where ( 1 …   ) is a core tensor and     and      are singular value matrices that are supposed to be orthogonal to one another.Because ( 1 …   ) is assumed to be as large as   1 … −1  , it is obviously an overcomplete problem; thus, there are no unique solutions.To solve TD uniquely, I specifically employed the higher-order singular value decomposition [30] (HOSVD) algorithm that tries to attain TD such that smaller number of core tensors and singular value vectors can represent   1 … −1  as much as possible.

Tensor Generation for Integrated Analysis
It is quite common when there is a set of gene expression profiles of human cell lines or model animals treated with various compounds at multiple dose densities.For example, Drug Matrix 1 and LINCS [31] are good examples, although the former comprises only temporal gene expression after drug treatments.Nonetheless, it is not easy to infer a drug's action on diseases by means of only these gene expression profiles; some kind of integrated analysis with disease gene expression profiles is required, but it is not so straightforward.Candidate drugs should satisfy these conditions: • Gene expression in these profiles must significantly decrease or increase with the increasing dose density of compounds.• Gene expression alteration caused by drug treatment must be significantly coincident with that associated with disease progression.
How these two independent significance values can be evaluated is unclear.For example, we can have two sets of significant gene expression alterations of the ith gene, {∆  }, caused by drug treatment and those of the i'th gene, {∆′ ′ }, during disease progression, respectively.First, we need to test whether the two sets of genes are significantly overlapping.Next, when there is a significant overlap, we have to determine whether these two gene expression alteration profiles correlate significantly.Furthermore, because the analysis is usually conducted among multiple compounds, all the significance evaluation must be corrected based upon a multiple comparison criterion.It is obviously a complicated and not a promising strategy.
1 https://ntp.niehs.nih.gov/drugmatrix/index.htmlNevertheless, if we can have gene expression profiles expressed via a tensor,   1 … −1  , where jk (k=1,...,m-1) corresponds to drug candidates, dose density, and disease progression, we can easily evaluate a candidate drug using TD, eq. ( 1).If there are      values that represent significant dependence upon dose densities and disease progression, genes' and compounds' singular value vectors that share core tensor G with larger absolute values with these      s can be used for the selection of genes as well as compounds as follows.
Suppose {  } is a set of indices of genes' or compounds' singular value vectors that are associated with significant dose density dependence as well as disease progression dependence.Genes and compounds can be identified as being associated with significant singular value vector components.For this purpose, P-values are attributed to each ith gene/  th compound assuming a  2 distribution, where   2 [> ] is the cumulative probability that the argument is greater than x assuming the  2 distribution and    and    are standard deviation.After adjusting Pvalues using the Benjamini-Hochberg (BH) criterion [32], genes and compounds that have significant P-values, e.g., less than 0.01, are selected as those contributing to the specified singular value vectors.Nevertheless, because such a tensor can be obtained only when drug treatment is performed on patients, this strategy is useless; if we can test drug efficiency directly on patients, then there is no need for in silico drug discovery.To overcome this discrepancy, I replace   1 … −1  with a product,   1 … ′  •   1 … "  , where   1 … ′−1  is gene expression for the drug treatment of cell lines/model animals, while   1 … "−1  is gene expression for the patients ( − 1 =  ′ + ").Because these two can be obtained independently, we can test any kind of combinations of drug treatments and diseases even after all measurements were performed.
The process performed in the present study following the above procedures is illustrated in Figure 1.

Gene Expression Profiles
Gene expression profiles for drug treatments of rats were retrieved from Drug Matrix under the gene expression omnibus (GEO) ID GSE59905, while heart failure human gene expression was taken from GEO ID 57345.For both datasets, expression files of genes, GSE57345-GPL11532_series_matrix.txt.gz,GSE59905-GPL5426_series_matrix.txt.gz, and GSE59905-GPL5425_series_matrix.txt.gz were directly downloaded from the series matrix.

Various Servers for Enrichment Analysis
To Enrichr [33] and TargetMine [34], 274 gene symbols were uploaded.For Tar-getMine, human was assumed as an organism under study, and the BH criterion was used for P-value correction.

Figure 1
Schematic that illustrates gene-drug pairs identification processes performed in the present study.Gene expression from DrugMatrix,       , and heart failure,     , are multiplied and tensor,         , is generated.It is decomposed into core matrix, G, and compound singular value matrix      , time point singular value matrix,      , human sample singular value matrix,      , and gene singular value matrix,     , are generated.After the temporal dependent      and disease dependent      are identified, associated genes and compounds      and      are selected.Based upon them, genes and compounds as outliers are selected by assuming a   distribution.

Statistical Analysis
All the statistical analyses were performed within the R software.HOSVD was carried out using the hosvd function in the rTensor package.

TD-based Unsupervised FE Was Applied to a Combined Tensor
From gene expression profiles of the rat left ventricle (LV) treated with 218 drugs, we selected four time points (1/4, 1, 3, and 5 days after treatment).Although these do not directly represent drug dose dependence, time course observations can be replaced with dose dependence, because drug dose density is expected to monotonically decrease with time.On the other hand, human heart gene expression profiles are composed of 82 idiopathic dilated cardiomyopathy patients, 95 ischemic patients, and 136 healthy controls, respectively.Among them, 3937 genes sharing gene symbols between human and rat were considered.Then, the generated tensor is which represents the products of gene expression of the ith gene of LV treated with  1 compound at the  2 th time point after the drug treatment and gene expression of the  3 th human heart, respectively.HOSVD was applied to   1  2  3  and core tensor ( 1  2  3  4 ), 1 ≤  1 ≤ 218, 1 ≤  2 ≤ 4, 1 ≤  3 ≤ 313, 1 ≤  4 ≤ 3937 , compound singular value matrix   1  1 , time point singular value matrix,   2  2 , human sample singular value matrix,   3  3 , and gene singular value matrix,   4  , were obtained.Prior to selection of genes and compounds, we need to know which time points singular value vector represents time dependence and which human sample singular value vector represents the distinction between patients and healthy controls (Fig. 2).As for time point singular value vectors, I decided to use the 2 nd time point singular value vector,   2  2 ,  2 = 2, because it has the strongest correlation with days.It also represents reasonable time development.After drug treatment, gene expression gradually increases because it takes a while for a drug treatment to have an effect.Then, after it has a peak on day 1, a monotonic decrease follows.On the other hand, for human sample singular value vectors, the 2 nd and 3 rd ones were selected because they have a clear distinction between patients and healthy controls.
Next, I tried to identify gene singular value vectors and compound singular value vectors associated with core tensor G( 1  2  3  4 ),  2 = 2, 2 ≤  3 ≤ 3 that have larger absolute values (Table 1).One can see that the 2 nd singular value vector of compounds is always associated with top 20 core tensors.The selection of gene singular value vectors is not so trivial.First of all, generally low-ranked gene singular value vectors are listed.This means that gene expression associated with disease progression is not a majority.This is a common situation because the disease usually affects only a limited number of genes.Then, tentatively, I decided to select top 10 gene singular value vectors, 21 st , 25 th , 27 th , 28 th , 33 rd , 36 th , 37 th , 38 th , 41 st , and 42 nd singular value vectors of genes.Using these singular value vectors, P values were attributed to genes and compounds.The attributed P values were adjusted by the BH criterion.Then, 281 probes and 0 compounds associated with adjusted P values less than 0.01 were selected.Because no compounds pass our criteria, I sought another way to select compounds.Fig. 3 shows the histogram of the 2 nd singular value vectors of compounds.There are obviously some outliers.Then, tentatively, I selected 43 compounds having the absolute 2 nd singular value vector components larger than 0.1.

Biological Evaluation of the Selected Compounds and Genes
To see if we can successfully identify biologically relevant compounds and genes, we evaluated these selected genes and compounds.At first, a literature search was performed on the 43 drugs.Then, some heart failure-related studies were identified for most of the 43 drugs (Table 2).This means that biologically relevant drugs were likely to be identified successfully.As for the genes identified, 274 genes associated with the identified 281 probes are shown in Table 3.To evaluate biological reliability of these 274 genes, they were uploaded to various enrichment servers.When they were uploaded to TargetMine, top five tissue enrichment results were related to the heart (Table 4).Top four significant disease enrichment results represent heart failure (Table 5).When they were uploaded to Enrichr, top three OMIM disease enrichment results were related to heart failure (Table 6).Two out of top three MGI Mammalian Phenotype Level 3 enrichment results were also related to heart failure (Table 7).Thus, our identification of genes was also successful.Enrichr also outputted many epigenetic feature enrichment results.Top most significant ENCODE TF-ChiP-seq 2015 is POLR2A_heart_mm9; POLR2A is a transcription factor (TF) reported to be a stable reference gene for gene expression alteration in gene expression studies on rodent and human heart failure [35].This finding suggested that POLR2A is constantly expressive in heart failure, which is coincident with our analysis.Top most significant TF-LOF Expression result from GEO is yy1_227711985_skele-tal_muscle_lof_mouse_gpl8321_gse39009_up.YY1 is a TF reported to play critical roles in cardiac morphogenesis [36].The top most significant ENCODE Histone Modifications 2015 result is H3K36me3_myocyte_mm9; H3K36me3 was reported to play a crucial role in cardiomyocyte differentiation [37].These TFs as well as histone modifications identified by our strategy can be possible drug targets.

Conclusions
In this paper, I introduced a new strategy that integrates disease (heart failure) gene expression profiles with drug treatment-related tissue gene expression profiles.The identified genes as well as compounds have been widely reported to be related to heart failure.Thus, this strategy turned out to be useful for in silico drug discovery.

Figure 2
Figure 2 Left: Time points' singular value vectors.Black circle: 1 st , red triangle: 2 nd , green cross: 3 rd , and blue cross: 4 th singular value vectors, respectively.Pearson's correlation coefficients toward days are -0.72,-0.82, 0.51, and -0.09, respectively.Right: A box plot of human sample singular value vectors.From left to right, the 1 st , 2 nd , and 3 rd singular value vectors are shown.

Figure 3 A
Figure 3 A histogram of 2 nd singular value vectors of compounds.

Table 2
Literature search performed on 43 drugs identified by TD-based unsupervised FE.

Table 3 .
The 274 genes associated with 281 probes identified by TD-based unsupervised FE.

Table 4 .
Top five significant tissue enrichment results of TargetMine.

Table 5 .
Top four disease enrichment results of TargetMine.

Table 6 .
Top three significant OMIM enrichment results of Enrichr.

Table 7 .
Top three significant MGI Mammalian Phenotype Level 3 enrichment results of Enrichr.