Identification of candidate drugs using tensor-decomposition-based unsupervised feature extraction in integrated analysis of gene expression between diseases and DrugMatrix datasets

Identifying drug target genes in gene expression profiles is not straightforward. Because a drug targets proteins and not mRNAs, the mRNA expression of drug target genes is not always altered. In addition, the interaction between a drug and protein can be context dependent; this means that simple drug incubation experiments on cell lines do not always reflect the real situation during active disease. In this paper, I applied tensor-decomposition-based unsupervised feature extraction to the integrated analysis using a mathematical product of gene expression in various diseases and gene expression in the DrugMatrix dataset, where comprehensive data on gene expression during various drug treatments of rats are reported. I found that this strategy, in a fully unsupervised manner, enables researchers to identify a combined set of genes and compounds that significantly overlap with gene and drug interactions identified in the past. As an example illustrating the usefulness of this strategy in drug discovery experiments, I considered cirrhosis, for which no effective drugs have ever been proposed. The present strategy identified two promising therapeutic-target genes, CYPOR and HNFA4; for their protein products, bezafibrate was identified as a promising candidate drug, supported by in silico docking analysis.


Combined analysis of human diseases and rat tissues.
Here, I focused on multiple human diseases and rat tissues for the following reason. If the proposed strategy is tested on only a single combination of human diseases and rat tissues, then the probability of success is rather low. Therefore, it is better to test the strategy on as many combinations as possible. On the other hand, if the number of compounds tested on some rat tissue is not large enough, then this tissue is not suitable for the present study because the purpose of this study is to apply TD-based unsupervised FE to drug discovery; a smaller number of tested compounds means fewer opportunities for identification of an effective drug. Using this criterion, I selected five tissues, i.e. heart, brain, born narrow, kidney, and liver tissues, on which 22-315 drugs have been tested.
Next, I selected diseases with which the five tissues are analysed together. During this selection, diseases that are supposed to take place in the corresponding tissues should be chosen; otherwise, correct drug selection is unlikely. For heart tissue, I used heart failure. For brain tissue, I used post-traumatic stress disorder (PTSD). For bone narrow tissue, I selected acute lymphocytic leukaemia (ALL). For kidney tissue, I used diabetes and cancer. For liver tissue, I chose cirrhosis. The reason why I used two diseases with kidney tissue is to check whether distinct genes are identified for different diseases when I use distinct gene expression profiles.
Heart failure. In gene expression profiles of the rat left ventricle (LV) treated with 218 drugs, I selected four time points (1/4, 1, 3, and 5 days after treatment); this is because there were substantially smaller numbers of compounds tested for other time points than for these four. On the other hand, human heart gene expression profiles represent 82 patients with idiopathic dilated cardiomyopathy, 95 patients with ischemic stroke, and 136 healthy controls. In these profiles, 3937 genes sharing gene symbols between humans and rats were considered. Then the generated tensor is j j j i j j i ji  1 2 , and gene expression of the j 3 th human heart, x j i 313 3937 3  ∈ × , respectively. Higher-order singular value decomposition (HOSVD) was applied to  x j j j i 1 2 3 , and then core tensor     ∈ × × ×  G( ) 1 3 3 , and gene singular value matrix  ∈ ×  u i 3937 3937 4 were obtained. x j j j i 1 2 3     G( ) 1  x j j j j j i x j j j j i 3,5,6,9,10; 3; 4; 1,2,3,5

PTSD.
In gene expression profiles of the rat brain treated with 22 drugs, I selected four time points (1/4, 1, 3, and 5 days after treatment); this is because there were substantially smaller numbers of compounds tested for other time points. Rat hippocampus and amygdala gene expression profiles of 15 males and 15 females are composed of five control samples, five minimal behavioural response samples, and five extreme behavioural response samples, respectively. In these profiles, 7501 genes sharing gene symbols between the two experiments were considered. Then, the generated tensor is j j j j j i jj i j j i j j i , which represents the mathematical product of gene expression of the i th gene of a brain treated with the j 1 th compound at the j 2 th time point after the drug treatment,  ∈ × × x j j i 22 4 7501 1 2 , gene expression of the j 3 th brain region (hippocampus: j 3 = 1, and amygdala: j 3 = 2) of the j 4 th female sample and j 5 th male sample (control samples: 1 ≤ j 4 , j 5 ≤ 5, minimal behavioural response samples: 6 ≤ j 4 , j 5 ≤ 10, and extreme behavioural response samples: 11 ≤ j 4 , j 5 ≤ 15),  ∈ × × x j j i 2 15 7501 3 4 and  ∈ × × x j j i 2 15 7501 3 5 , respectively. HOSVD was applied to  x j j j j j i 1 2 3 4 5 , and then core tensor , singular value matrix of a rat brain region , rat sample singular value matrices  ∈ ×  u j 15 15 4 4 and  ∈ ×  u j 15 15 5 5 , and gene singular value matrix  ∈ ×  u i 7501 7501 6 were obtained. Prior to selection of genes and compounds, we need to know which singular value vector of time points,  u j 2 2 , represents time dependence (Fig. 2). Then, I decided to use the second singular value vectors ( =  2 2 ) for PTSD. In relation to the rat model of PTSD, the set of samples is composed of 30 female and male samples, and each of these groups represents 15 hippocampus samples and 15 amygdala samples. Fifteen rats were subdivided into five controls, five minimal-behavioural-response samples, and five extreme-behavioural-response ones. Figure 3B shows the comparisons of sample singular value vectors =  u k ( 4, 5) j k k between 10 male and female controls (1 ≤ j k ≤ 5), 10 minimal-behavioural-response males and females (6 ≤ j k ≤ 10), and 10 extreme-behaviouralresponse males and females (11 ≤ j k ≤ 15), respectively. Next, the third singular value vector of samples, , turned out to show a significant difference between controls and others. Thus, I decided to to determine which singular value vectors of genes and drugs should be used for selection of genes and drugs in the analysis below. Although the P-value associated with = =  u k , 4, 5 is relatively greater (weaker significance), this drawback is caused by the smaller number of samples. In actuality, the BH criterion correction does not make the P-value non-significant.
Regarding the PTSD rat model,  ALL. From gene expression profiles of the rat bone narrow treated with 77 drugs, I selected four time points (1/4, 1, 3, and 5 days after treatment); this is because there were substantially smaller numbers of compounds tested for other time points than for these four. On the other hand, gene expression profiles of bone narrow in human (child) ALL represents 74 ALL patients at 0, 8, 15, and 33 days after a remission induction therapy. In these profiles, 2597 genes sharing gene symbols between rats and humans were considered. Then, the generated tensor is   and ALL bone narrow gene expression on the j 3 th day after the remission induction therapy of the j 4 th sample,  ∈ × × x j j i 4 74 2597 3 4 , respectively. HOSVD was applied to  x j j j j i 1 2 3 4 , and then core tensor , a singular value matrix of ALL days, , a singular value matrix of ALL patients,  ∈ ×  u j 74 74 4 4 , and gene singular value matrix  ∈ ×  u i 2597 2597 5 were obtained. Prior to selection of genes and compounds, we need to know which singular value vector of time points,  u j 2 2 , represents time dependence (Fig. 2). After that, I decided to use the third vector ( =  3 2 ) for ALL because this vector has the strongest correlations with days.
Regarding ALL, in contrast to the other diseases, the given samples represent not diseases and controls but rather time progression of gene expression after therapy. Then, because the fourth singular value vector of time points (Fig. 3C) 3 3 , has the strongest correlation with time points, I decided to use it for the further analysis in the text below. Although the first one, =  u j 1, 3 3 , also shows a moderate correlation, because the gradient is much smaller, I did not select the first vector.
As for ALL, were studied. Although and compound singular value vectors =  2, 3, 5, 6, 9, 10 1 , which are used for identification of genes with altered expression. Genes encoding drug target proteins are later inferred by the enrichment analysis of these genes in terms of the genes associated with altered gene expression when genes encoding drug target proteins are perturbed. The reason why the number of vectors selected is less than 10, which is the number of …   G( , , ) 1 5 s being analysed, is that each singular value vector is associated with more than one

Figure 4.
Intuitive illustration of the present strategy. Suppose there is a tensor, x j1j2i , which represents the i th gene expression at the j 2 th time point after the j 1 th compound is given to a rat; these data are taken from the DrugMatrix 44 dataset. There is also a matrix, x j3i , which represents the i th gene expression of the j 3 th sample; samples typically include disease samples and control samples. Tensor  x j j j i 1 2 3 was generated as a 'mathematical product' of x j1j2i and x j3i . Then, tensor  x j j j i 1 2 3 is decomposed, and singular value matrix of compounds  u j 1 1 , singular value matrix of time points  u j 2 2 , sample singular value matrix  u j 3 3 , and gene singular value matrix , whose value significantly varies with time, and iii) sample singular value vector  u j 3 3 . These parameters are different between a disease (red filled circles) and control samples (cyan filled circles). Finally, using gene singular value vector  u i 4 and compound singular value vector  u j 1 1 , compounds (filled pink circles) and genes (filled light-green circles) associated with     G( , , , ) 1 2 3 4 s with large enough absolute values are selected. Next, if the selected genes are coincident with the genes associated with a significant alteration when gene X is knocked out (or overexpressed), then the compounds are assumed to target gene X.
Scientific RepoRts | 7: 13733 | DOI:10.1038/s41598-017-13003-0 Diabetes. In gene expression profiles of the rat kidney treated with 253 drugs, I selected four time points (1/4, 1, 3, and 5 days after treatment); this is because there were substantially smaller numbers of compounds tested for other time points than for these four. Kidney gene expression of 50 controls (25 whole kidneys, 12 tubuli, and 13 glomeruli) and 19 diabetic kidney samples were considered. In these data, 3489 genes sharing gene symbols between rats and humans were analysed. Then the generated tensor is 1 2 3 , which represents the products of gene expression of the i th gene of the rat kidney treated with the j 1 th compound at the j 2 th time point after the drug treatment,  ∈ × × x j j i 253 4 3489 1 2 and gene expression of the j 3 th human kidney sample,  ∈ × x j i 69 3489 3 , respectively. HOSVD was applied to  x j j j i 1 2 3 , and then core tensor , human sample singular value matrix  ∈ ×  u j 69 69 3 3 , and gene singular value Prior to selection of genes and compounds, we need to know which singular value vector of time points,  u j 2 2 , represents time dependence (Fig. 2). Then, I decided to use the second singular value vectors ( =  2 2 ) for diabetes.
As for diabetes, the first four singular value vectors of samples, , show a significant difference among four classes (Fig. 3D), which are composed of three healthy tissue classes and one diseased kidney class. Nonetheless, because the first and fourth vectors have greater significance, I decided to employ these two.
Regarding ) that are associated with the first two top-ranked …   G( , , ) 1 4 s, which are used for identification of genes with altered expression. Genes encoding drug target proteins are later inferred by the enrichment analysis of these genes relative to the genes associated with altered gene expression when genes encoding drug target proteins are perturbed.
Renal carcinoma. Rat kidney samples were also combined with human renal carcinoma samples, which represent 101 human patients (one tumour sample and one sample of adjacent non-tumourous renal tissue from each patient) because some correlation between diabetes and renal carcinoma has been reported 29 . If the present strategy works similarly for both diseases, then these results can strengthen its usefulness. Among genes included in rat kidney and renal carcinoma profiles, 4036 genes sharing gene symbols between rats and humans were considered. Accordingly, the generated tensor is j j j i j j i ji , represents time dependence (Fig. 2). Then, I decided to use the second singular value vectors ( =  2 2 ) for renal carcinoma.
Regarding renal carcinoma, the samples represent 101 patients, from whom one tumorous and one healthy tissue sample were taken. In this case, the first several singular value vectors do not show a significant difference between healthy and tumorous tissue samples. Thus, I performed BH corrections on all the P-values attributed to 202 singular value vectors of samples,  ∈ ×  u j 202 202 3 3 . Next, I found that five sample singular value vectors, =  13, 15, 30, 33, 35 3 ( Fig. 3E), can be used to find singular value vectors of genes and drugs for identification of genes and drugs in the analysis below.
In relation to renal carcinoma, were selected, which are used for identification of genes with altered expression. Genes encoding drug target proteins are later inferred by the enrichment analysis of these genes regarding the genes associated with altered gene expression when genes encoding drug target proteins are perturbed. prognosis). In these profiles, 3961 genes sharing gene symbols between rats and humans were considered. Then the generated tensor is

Cirrhosis.
x j j j i 355 4 216 3961 1 2 3 , which represents the products of gene expression of the i th gene of the rat liver treated with the j 1 th compound at the j 2 th time point after the drug treatment,  ∈ × × x j j i 355 4 3961 1 2 , and gene expression of the j 3 th human liver sample,  ∈ × x j i 216 3961 3 , respectively. HOSVD was applied to  x j j j i 1 2 3 , and then core tensor  3 3 , and gene singular value matrix  ∈ ×  u i 3961 3961 4 were obtained. Prior to selection of genes and compounds, we need to know which singular value vector of time points,  u j 2 2 , represents time dependence (Fig. 2). Then, I decided to use the second singular value vectors ( =  2 2 ) for cirrhosis.
As for cirrhosis, the second and sixth singular value vectors of samples, ∈  u j (2,6), 3 3 , show gradual and significant dependence on progression of diseases (Fig. 3F), from good to intermediate to poor prognosis. Accordingly, I decided to use these two. Because analysis of the sixth may be a bad approach, I performed BH criterion corrections on P-vales and confirmed that these two vectors are still significant.
As for cirrhosis, were selected, which are used for identification of genes with altered expression. Genes encoding drug target proteins are later inferred by the enrichment analysis of these genes regarding the genes associated with altered gene expression when genes encoding drug target proteins are perturbed. 1 1 , which were selected in the previous subsections and are listed in Table 1 and Fig. 1, drug candidates and genes encoding drug target proteins were identified as suggested in Fig. 4 Firstly, I considered candidate drugs (for the full list, see Table S2). Except for ALL, the second singular value vector of compounds, =  u j 2, 1 1 , was identified. Thus, I decided to select outliers with it (Fig. 5). In the previous applications where PCA was used for identification of genes, I employed P-values attributed to each gene, assuming a χ 2 distribution. Unfortunately, this strategy could not be applied to the identification of candidate drugs. When trying to identify critical genes, I also observed gene expression of other, non-critical genes. Consequently, I could identify critical genes using comparisons with them. As for the drugs, however, drugs that are unlikely to affect the rat tissues being considered were not tested. Thus, in a sense, all the analysed drugs are seemingly critical. Therefore, I decided to identify drugs belonging to the cluster farthest from the origin. Because the definition of a cluster may be arbitrary, the selection of genes may not be unique (more detailed criteria and the list of selected drugs are presented in supplementary materials, Text S1).

Identification of drugs and drug target proteins. By means of gene and compound singular value vectors
Regarding identification of genes encoding drug target proteins, I utilised the following strategy as suggested in Fig. 4. Firstly, genes were selected using P-values computed from gene singular value vectors listed in Table 1 (P-values were calculated assuming χ 2 distributions for gene singular value vectors; genes associated with P-values adjusted by the BH criterion and less than 0.01 were selected; for the full list, see Table S2). Then, these genes were uploaded to Enrichr 30 to identify genes encoding drug target proteins via 'Single Gene Perturbations from GEO up' and 'Single Gene Perturbations from GEO down' categories. After that, genes associated with adjusted P-values less than 0.01 were selected as genes encoding drug target proteins (Supporting Data S1).

Biological evaluation of the selected compounds and genes.
At first, it was tested whether known drug target proteins were enriched among those identified by the present strategy. To obtain the list of known drug target proteins, I used DINIES 31 . Although it can also infer unknown interactions between drugs and proteins, by uploading a single drug using a DINIES search with parameters 'chemogenomic approach' and 'with learning on all DBs' , I obtained the list of target proteins of individual drugs (Supporting Data S2). Next, target genes were collected for individual drugs, and the collected sets of target genes were generated and used for further analysis in the text below.
To evaluate significant overlaps between the set of target genes predicted by DINIES and those with gene expression profiles, I carried out Fisher's exact test and the uncorrected χ 2 test ( Table 2). After that, I found that most of cases (10 out of 12 tables) are associated with significant P-values (ΔG) for both tests. Odds ratios are also generally approximately equal to three, which is usually considered large enough. Therefore, the proposed strategy improves identification of drug target proteins three-fold over random selection; this enhancement is large enough for effectiveness. Thus, it is obvious that this strategy is successful.
One may wonder whether Table 2 supports the usefulness of the present strategy because, generally, most genes identified as drug target genes by the present strategy are false positives although the inference and previous knowledge are significantly related. In this case, however, the situation is a little bit different from usual cases. For example, proteins inferred in the present study but not supported by previous knowledge may simply reflect the lack of experiments. In this case, these apparent false positives may turn out to be a true positive after suitable experimental validation. Alternatively, proteins supported by previous knowledge but not by the present study (i.e. false negatives) may simply reflect the lack of perturbation experiments. Proteins not included in perturbation experiments cannot be listed as true positives in the present study. This means that more perturbation experiments and additional validation of protein-drug interactions should be performed to validate the present results more precisely.
Readers may also wonder whether this kind of identification is useful because we are dealing with many-to-many correspondence not one-to-one correspondence. Nevertheless, target gene identification by DINIES was largely found to overlap among multiple compounds. As shown in Fig. 6, a substantial proportion of genes was targeted by more than a single compound, except for ALL where only two compounds were selected. In this sense, the proposed strategy successfully identifies a set of genes that share, to some extent, the compounds targeting these genes.

Discussion
A question may arise whether the present study can be useful for real drug discovery experiments because there are more than a hundred genes listed for individual diseases (Table 2). To demonstrate the usefulness of the proposed strategy, in the text below, I specifically consider two genes, CYPOR and HNF4A, listed at the top for cirrhosis; CYPOR is top-ranked according to 'Single Gene Perturbations from GEO up' , whereas HNF4A is   Table 2. Fisher's exact test (P F ) and the uncorrected χ 2 test (P χ 2 ) of known drug target proteins regarding the inference of the present study. Rows: known drug target proteins (DINIES). Columns: Inferred drug target proteins using 'Single Gene Perturbations from GEO up' or 'Single Gene Perturbations from GEO down' . OR: odds ratio. top-ranked based on 'Single Gene Perturbations from GEO down' . At present, there are no drugs directly targeting cirrhosis; thus, the strategy will hold some promise if we can propose some candidate drugs based on this analysis.
Although the CYPOR protein has never been specifically studied in relation to cirrhosis, the activity of cytochrome P450 enzymes, to which CYPOR belongs, is regarded as important in this disease 32 . Accordingly, CYPOR may be an important therapeutic target in cirrhosis. Although no ligand compounds are known for CYPOR, bezafibrate was evaluated as a candidate ligand by in silico analysis (Fig. 7A). Its estimated binding energy, ΔG, is equal to 9.73 kcal, which corresponds to K i = exp(−(ΔG)/(RT)) = 79 nM (T = 300 K), which is small enough to consider bezafibrate a useful ligand (R = 8.31 is the gas constant). Because bezafibrate itself is a known CYP inhibitor 33 , it is plausible that bezafibrate can be a drug for cirrhosis because of effects on CYPOR.
Although HNF4A has long been an orphan receptor, a ligand was identified recently 34 . Among the compounds identified by the present strategy, bezafibrate was estimated to bind to the ligand-binding domain of HNF4A with high affinity (ΔG = 9.44 kcal, which corresponds to 0.13 μM at T = 300 K) by in silico analysis (Fig. 7B). Given that the structure of the ligand-binding domain is supposed to affect that of the DNA-binding domain 35 (even can enhance the binding), and HNF4A is a transcription factor that can affect cirrhosis 36 , bezafibrate may also be a drug for cirrhosis from this point of view.
Although this is only one example and is limited to in silico analysis, if also considering that some researchers reported the efficiency of bezafibrate toward cirrhosis 37,38 , it indicates that the findings here can be a starting point for identification of new candidate drugs for diseases.
To evaluate the validity of this in silico analysis, I tried to assess binding of a possible negative control, morphine, which is unlikely to bind to either CYPOR or HNF4A (Fig. 7C and D). Morphine turned out to have relatively weak binding affinities (ΔG = 8.14 and 8.08 kcal, respectively, both of which correspond to 1.3 μM). This result suggested that our strategy successfully identified drugs that can bind to target proteins with relatively strong binding affinity.
One may wonder whether only ten fold difference (0.13 μM for bezafibrate vs 13 μM for morphine) is significant enough or not. Nonetheless, these values should be evaluated not as the absolute values, but as the relative binding affinity towards proteins. In order to access the general binding affinity towards proteins, I accessed ChEMBL 39 that stores vast number of experimentally validated compounds-protein interactions. Then I have found three records for morphine (CHEMBL409938), 0.32 nM with MOP, 11 nM with OPRD1, and 230 nM with OPRK1. On the other hand, I have found two records for bezafibrate (CHEMBL264374), 33 μM for FABP2 and 44 μM for FABP1. Thus, the best record of morphine in ChEMBL is more than 10 5 times better than that of bezafibrate. Thus, our finding that bezafibrate is ten times better than morphine is definitely remarkable.
Finally, I would like to comment on the related studies. To my knowledge, there are no existing unsupervised methods comparable to the proposed one. The reason is as follows. To perform analyses similar to the present one, in a sense, multi-view data analyses 40 are necessary, because two independent datasets, i.e. DrugMatrix and disease gene expression profiles, must be integrated. There are no existing unsupervised multi-view analytical methods by which more than one feature selection can be performed. For instance, Khan et al. 41 proposed multi-tensor decomposition (thus, by definition, an unsupervised method) and applied it to drug discovery; they could identify only compounds and could not identify any drug target genes because by means of their methodology, only features shared across multiple views (in their case compounds) can be screened. Although Li 42 proposed an integrated method that can identify drug-target protein interactions, the method is fully supervised because it cannot identify new drug-target protein interactions without any pre-knowledge.
Before closing the discussion, I would like to comment on the rationality of the present strategy. It is not obvious that applying TD to a mathematical product to obtain singular value vectors is guaranteed success because the relation between the mathematical product and the obtained singular value vectors is unclear. Nevertheless, the same problem had already existed when PCA was applied to gene expression. PCA is essentially matrix factorisation that approximates a matrix as a matrix product. This means that dimensionality of the resulting matrices can never be the same as that of the original gene expression matrix. As readers can see in eq. ( (7)), the order of the products of singular value vectors is the same as the order of modes which can take any values. In this sense, dimensionality of the original matrix and that of the obtained matrix after application of TD to the original matrix can never match each other. Any kinds of matrix factorisation or TD simply represent assumptions that may become good approximations. If they work (in the present study, this word indicates that the resulting singular value vectors can be biologically interpretable), this means that the assumption is suitable. If not, it is not reliable. In this study, applying TD to a mathematical product clearly worked well, i.e. the obtained singular value vectors are biologically interpretable. Therefore, the proposed strategy is suitable for the present study and for the dataset being analysed, nothing more. I also would like to emphasize that the present strategy cannot fully replace the compounds or structure based approach that can screen huge number of compounds as many as millions. Until there are more gene expression profiles available, the present strategy must be regarded as only a supportive methodology toward these two main stream strategies.

∑ ∑
is assumed to be as large as that of x j 1 … jm − 1i , this is obviously an overcomplete problem; therefore, there are no unique solutions. To solve TD uniquely, I specifically employed the HOSVD algorithm 43 that attempts to attain TD such that a smaller number of core tensors and singular value vectors can represent x j 1 … jm − 1i as much as possible. All the tensors are standardised such that ∑ i x j 1 ,…, jm − 1i = 0, ∑ i x j 1 ,…, jm − 1i 2 = N m before TD is performed in the present study. The advantages over more popular TD -parallel factor analysis (PARAFAC) 43 -are as follows. First of all, PARAFAC is NP-complete; in other words, there are no known algorithms that derive PARAFAC with polynomial time. Especially, in the present analysis, singular value vectors associated with a smaller contribution were often useful;  k is often greater than 100 (see Table 1). Thus, rapid convergence of computation is required, which is not achieved by PARAFAC. Secondly, the advantage of PARAFAC is that it provides one-to-one relations between singular value vectors having common  k . By contrast, as shown in Table 1, the observed correspondence between singular value vectors is not one to one but often many to many, in which case, HOSVD was more suitable. For these two reasons, I decided to conduct HOSVD instead of more popular PARAFAC.
Tensor generation for integrated analysis. Often, there is a set of gene expression profiles of human cell lines or model animals treated with various compounds at multiple dose densities. DrugMatrix 44 and LINCS 45 are good examples although the former comprises only temporal gene expression after treatment with various drugs. Nonetheless, it is not easy to infer a drug's action on diseases by means of only these gene expression profiles. Some kind of integrated analysis with disease gene expression profiles is necessary, but it is not so straightforward. Candidate drugs should satisfy these conditions: • Gene expression in these profiles must significantly decrease or increase with the increasing dose density of a compound. • Gene expression alteration caused by drug treatment must be significantly coincident with that associated with disease progression.
How these two independent significance values can be evaluated is unclear. For example, we can have two sets of significant gene expression alterations of the i th gene, {Δx i }, caused by drug treatment and those of the i′ th gene, {Δx′ i′ }, during disease progression, respectively. Firstly, we need to test whether the two sets of genes significantly overlap. Next, when there is a significant overlap, we have to determine whether these two gene expression alteration profiles correlate significantly. Furthermore, because the analysis is usually conducted among multiple compounds, all the significance evaluation must be corrected based upon a multiple-comparison criterion. This is obviously a complicated and unpromising strategy.
On the other hand, if we can have gene expression profiles expressed via a tensor, x j 1 … jm− 1i , where j k , k = 1, …, m − 1 corresponds to drug candidates, dose density, and disease progression, we can easily evaluate a candidate drug using TD: eq. ( (7) where P χ2 [>x] is the cumulative probability that the argument is greater than x assuming the χ 2 distribution, and σ  m and σ  k are standard deviations. After adjustment of P-values via the BH criterion 46 , genes and compounds that have significant P-values, i.e. less than 0.01, are selected as those contributing to the specified singular value vectors. Nonetheless, because such a tensor can be obtained only when drug treatment is performed on patients, this strategy is useless; if we can test drug efficiency directly on patients, then there is no need for in silico drug discovery. To overcome this obstacle, I replace x j 1 … jm − 1i with a 'mathematical product'; hereafter, this means that each component is defined as a product of two components, i.e. 1 " is gene expression for the patients (m − 1 = m′ + m″). Because these two datasets can be obtained independently, we can test any kind of combinations of drug treatments and diseases even after all measurements were finalised. It is also obvious that this strategy retains the advantages of integrated analysis using PCA-based unsupervised FE because it still does not require assignment of any weights to each gene's expression. On the other hand, it compensates the shortcoming of the integrated analysis of PCA-based unsupervised FE. Because this strategy identifies genes by considering expression of two genes in an integrated manner, there is no need to use common sets among multiple genes selected within an individual dataset as in the PCA based unsupervised FE.

Explanation of the present strategy.
Here is an accessible description of the present strategy (Fig. 4). It is obvious that the success of this seemingly complicated method heavily depends upon whether I can easily (or usually, in other words) obtain such preferable combinations of singular value vectors that have the properties illustrated in Fig. 4. Although this situation is unlikely to happen often, it seems to be possible in many cases as shown in the text above.
Gene expression profiles. Heart failure. Gene expression profiles of the heart for drug treatments of rats were retrieved from DrugMatrix under the gene expression omnibus (GEO) ID GSE59905, whereas human gene expression for heart failure was retrieved from GEO ID 57345. For both datasets, expression files of genes GSE57345-GPL11532_series_matrix.txt.gz, GSE59905-GPL5426_series_matrix.txt.gz, and GSE59905-GPL5425_ series_matrix.txt.gz were directly downloaded from the series matrix.
The rat model of PTSD. Gene expression profiles of the brain for drug treatments of rats were retrieved from DrugMatrix under GEO ID GSE59895, whereas amygdala and hippocampus gene expression for the rat model of PTSD was taken from GEO ID GSE60304. For both datasets, expression files of genes GSE60304_series_matrix. txt.gz, GSE59895-GPL5425_series_matrix.txt.gz, and GSE59895-GPL5426_series_matrix.txt.gz were directly downloaded from the series matrix.
ALL. Gene expression profiles of bone marrow for drug treatments of rats were retrieved from DrugMatrix under GEO ID GSE59894, and ALL human bone marrow gene expression was taken from GEO ID GSE67684. For both datasets, expression files of genes GSE67684-GPL570_series_matrix.txt.gz, GSE67684-GPL96_series_ matrix.txt.gz, GSE59894-GPL5425_series_matrix.txt.gz, and GSE59894-GPL5426_series_matrix.txt.gz were directly downloaded from the series matrix.
Diabetes and renal cancer. Gene expression profiles of kidneys for drug treatments of rats were retrieved from DrugMatrix under GEO ID GSE59913, whereas gene expression for diabetic human kidneys and renal cancer was obtained from GEO ID GSE30122 and GSE40435, respectively. For these datasets, expression files of genes GSE30122_series_matrix.txt.gz, GSE40435_series_matrix.txt.gz, GSE59913-GPL5424_series_matrix.txt. gz, GSE59913-GPL5425_series_matrix.txt.gz, and GSE59913-GPL5426_series_matrix.txt.gz were directly downloaded from the series matrix.