Adapted tensor decomposition and PCA based unsupervised feature extraction select more biologically reasonable differentially expressed genes than conventional methods

Taguchi, Y-h.; Turki, Turki

doi:10.1038/s41598-022-21474-z

Published: 19 October 2022

Y-h. Taguchi¹ &
Turki Turki²

Scientific Reports volume 12, Article number: 17438 (2022)

Abstract

Tensor decomposition- and principal component analysis-based unsupervised feature extraction were proposed almost 5 and 10 years ago, respectively. Although these methods have been successfully applied to a wide range of genome analyses, including drug repositioning, biomarker identification, and the identification of disease-causing genes, two fundamental problems have been identified: the number of genes identified was too small to assume that there were no false negatives, and the histogram of the derived P values was not fully coincident with the null hypothesis that principal component and singular value vectors follow the Gaussian distribution. Optimizing the standard deviation such that the histogram of P values coincides as closely as possible with the null hypothesis increases both the number and the biological reliability of the selected genes. Our contribution is that the improved methods select biologically more reasonable differentially expressed genes than state-of-the-art methods, which must empirically assume negative binomial distributions and a dispersion relation in order to select more expressed genes in preference to less expressed ones; the proposed methods achieve this without either assumption.

Introduction

Identifying differentially expressed genes (DEGs) on the basis of comparative analyses^1,2 has always been difficult. This challenge is attributable to multiple reasons; the primary one is that it is a large p small n problem. In a large p small n problem, it is difficult to select features based on statistical criteria because a small number of samples ($=n$) tends to lead to low significance, while the obtained P values must be heavily corrected for the large number of features ($=p$). This makes it difficult to find features with significance. To resolve this difficulty, many methods specific to gene expression analysis have been proposed. For example, significance analysis of microarrays (SAM)³ adds a small constant to gene expression, thereby avoiding the misidentification of low expressed genes as DEGs. Limma⁴ applied a Bayesian strategy to logarithmic gene expression. Since high-throughput sequencing (HTS) became popular, P values have been attributed to individual genes by assuming that gene expression follows a negative binomial (NB) distribution^5,6, which is one of the simplest positively valued distributions with a tunable mean and variance. In addition to this, the so-called dispersion relation^5,6,

$$\begin{aligned} \frac{\alpha (\mu )}{\mu ^2} = \alpha _0 + \frac{\alpha _1}{\mu }, \end{aligned}$$

(1)

has also been assumed, where $\mu$ and $\alpha$ are the mean and variance, respectively, and $\alpha _0$ and $\alpha _1$ are regression coefficients; to our knowledge, Eq. (1) is purely empirical and lacks a theoretical rationale. Despite these difficulties, many state-of-the-art methods^5,6,7,8,9 have been proposed and widely used in various studies.
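To make the consequence of Eq. (1) concrete, the following minimal sketch evaluates the variance it implies for several mean expression levels. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not taken from any of the cited tools.

```python
def variance_from_mean(mu, alpha0, alpha1):
    """Variance implied by Eq. (1): alpha(mu) / mu**2 = alpha0 + alpha1 / mu."""
    return alpha0 * mu**2 + alpha1 * mu


def squared_cv(mu, alpha0, alpha1):
    """Left-hand side of Eq. (1): variance divided by the squared mean."""
    return variance_from_mean(mu, alpha0, alpha1) / mu**2


# Hypothetical regression coefficients, for illustration only.
ALPHA0, ALPHA1 = 0.01, 1.0

for mu in (1.0, 10.0, 100.0, 1000.0):
    print(mu, variance_from_mean(mu, ALPHA0, ALPHA1), squared_cv(mu, ALPHA0, ALPHA1))
```

Under Eq. (1), the relative variability (variance over squared mean) grows as the mean decreases, which is why methods relying on it are less likely to call weakly expressed genes significant at a given fold change.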

Contrary to these empirical methods, we proposed tensor decomposition (TD)- and principal component analysis (PCA)-based unsupervised feature extraction (FE)¹⁰ that only assumes that principal component (PC) and singular value vectors (SVVs) obey Gaussian distribution. Despite this simplicity, TD- and PCA-based unsupervised FE have been successfully applied to a wide range of genomic analyses. However, there have been two problems: 1. The histogram of the P values is not fully coincident with the null hypothesis that PC and SVV obey Gaussian distribution and 2. The number of genes selected is too small to have no false negatives. In this paper, we have shown that the optimization of standard deviation (SD) in Gaussian distribution can resolve these problems.

We tried optimizing SD for PCA-based unsupervised FE and applied this to two highly curated data sets–MAQC and SEQC. Then, we tested the optimization of SD for TD-based unsupervised FE and applied it to two more realistic problems: (1) drug repositioning for SARS-CoV-2 and (2) the analysis of gene expression of multiple organs treated with multiple drugs, to which TD-based unsupervised FE without SD optimization was already applied.

Our contributions are as follows. First, our methods preferentially select more highly expressed genes as DEGs without the empirical dispersion relation, Eq. (1). Second, our methods can select significant DEGs without assuming a negative binomial distribution, which lacks a rationale, for individual gene expression. Third, our selected DEGs are much more biologically reasonable than those selected by other state-of-the-art methods.

Results

Outlines of TD and PCA based unsupervised FE

In this section, we have briefly explained the algorithm of PCA- and TD-based unsupervised FE (Fig. 1) before explaining how we could improve them.

When a gene expression profile is formatted as a matrix, $x_{ij} \in \mathbb {R}^{N \times M}$, which represents the gene expression of the ith gene of the jth sample, we use PCA-based unsupervised FE. After standardizing $x_{ij}$ as

$$\begin{aligned} \sum _i x_{ij}&= 0 \end{aligned}$$

(2)

$$\begin{aligned} \sum _i x_{ij}^2 &= N, \end{aligned}$$

(3)

a gram matrix $\sum _j x_{ij}x_{i'j} \in \mathbb {R}^{N \times N}$ was diagonalized as

$$\begin{aligned} \sum _{i'} \left( \sum _j x_{ij}x_{i'j}\right) u_{\ell i'} = \lambda _{\ell } u_{\ell i} \end{aligned}$$

(4)

where $u_{\ell i} \in \mathbb {R}^{N \times N}$ is the $\ell$th PC score attributed to gene i. The $\ell$th PC loading attributed to the jth sample can be computed as

$$\begin{aligned} v_{\ell j} = \sum _i x_{ij} u_{\ell i} \in \mathbb {R}^{M \times M}. \end{aligned}$$

(5)

After identifying $v_{\ell j}$, which is associated with a desired property, e.g., the distinction between control and treated samples, we attributed the P values to the gene i using the corresponding PC score, $u_{\ell i}$, as

$$\begin{aligned} P_i = P_{\chi ^2} \left[ > \left( \frac{u_{\ell i}}{\sigma _\ell } \right) ^2\right] \end{aligned}$$

(6)

assuming that $u_{\ell i}$ obeys the Gaussian distribution, where $P_{\chi ^2} [ >x]$ is the cumulative $\chi ^2$ distribution for an argument larger than x and $\sigma _\ell$ is the SD,

$$\begin{aligned} \sigma _\ell &= \sqrt{\frac{1}{N} \sum _{i=1}^N \left( u_{\ell i} - \langle u_{\ell i} \rangle _i \right) ^2} \end{aligned}$$

(7)

$$\begin{aligned} \langle u_{\ell i} \rangle _i &= \frac{1}{N} \sum _{i=1}^N u_{\ell i} \end{aligned}$$

(8)
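The steps in Eqs. (2)–(8) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the gram matrix is diagonalized through a thin SVD, and the $\chi^2$ distribution is assumed to have one degree of freedom, in which case the upper-tail probability of $z^2$ equals $\mathrm{erfc}(|z|/\sqrt{2})$.

```python
import math
import numpy as np


def standardize_columns(x):
    """Eqs. (2)-(3): every sample (column) gets zero mean and sum of squares N."""
    x = x - x.mean(axis=0, keepdims=True)
    return x / x.std(axis=0, keepdims=True)  # population SD, so sum_i x_ij^2 = N


def pca_scores_and_loadings(x):
    """Eqs. (4)-(5): eigenvectors of the gram matrix x @ x.T via a thin SVD."""
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    v = x.T @ u  # v[j, l] = sum_i x_ij u_li
    return u, v, s**2  # s**2 are the eigenvalues lambda_l


def chi2_p_values(u_l):
    """Eqs. (6)-(8), assuming a chi-squared distribution with 1 degree of freedom."""
    sigma = u_l.std()  # Eq. (7), population SD around the mean of Eq. (8)
    z = np.abs(u_l) / sigma
    p = np.array([math.erfc(zi / math.sqrt(2.0)) for zi in z])
    return p, sigma


# Synthetic stand-in for an expression matrix: N = 500 genes, M = 6 samples.
rng = np.random.default_rng(0)
x = standardize_columns(rng.normal(size=(500, 6)))
u, v, lam = pca_scores_and_loadings(x)
p, sigma = chi2_p_values(u[:, 1])  # P values from the second PC score
```

In practice, the index of the PC passed to `chi2_p_values` would be chosen by inspecting the loadings `v` for the desired sample property, as described in the text.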

When we have gene expression that is formatted as a tensor, $x_{ijk} \in \mathbb {R}^{N \times M \times K}$, for the expression of the ith gene at jth sample with the kth condition, we used TD-based unsupervised FE. After standardizing $x_{ijk}$ as

$$\begin{aligned} \sum _i x_{ijk} &= 0 \end{aligned}$$

(9)

$$\begin{aligned} \sum _i x_{ijk}^2 &= N \end{aligned}$$

(10)

Tucker decomposition of $x_{ijk}$

$$\begin{aligned} x_{ijk} = \sum _{\ell _1=1}^N \sum _{\ell _2=1}^M \sum _{\ell _3=1}^K G(\ell _1 \ell _2 \ell _3) u_{\ell _1 i} u_{\ell _2 j} u_{\ell _3 k} \end{aligned}$$

(11)

can be computed with a higher order singular value decomposition (HOSVD)¹⁰. After identifying which $u_{\ell _2 j} \in \mathbb {R}^{M \times M}$ and $u_{\ell _3 k} \in \mathbb {R}^{K \times K}$ are coincident with the target property, e.g., the distinction between control and treated samples specifically under the kth experimental condition, we try to find the $u_{\ell _1 i} \in \mathbb {R}^{N \times N}$ associated with the $G(\ell _1 \ell _2 \ell _3) \in \mathbb {R}^{N \times M \times K}$ having the largest absolute value. Then, the P value is attributed to the ith gene as

$$\begin{aligned} P_i = P_{\chi ^2} \left[ > \left( \frac{u_{\ell _1 i}}{\sigma _{\ell _1}} \right) ^2\right] . \end{aligned}$$

(12)

by also assuming that $u_{\ell _1 i}$ obeys the Gaussian distribution and

$$\begin{aligned} \sigma _{\ell _1} &= \sqrt{\frac{1}{N} \sum _{i=1}^N \left( u_{\ell _1 i} - \langle u_{\ell _1 i} \rangle _i \right) ^2} \end{aligned}$$

(13)

$$\begin{aligned} \langle u_{\ell _1 i} \rangle _i &= \frac{1}{N} \sum _{i=1}^N u_{\ell _1 i}. \end{aligned}$$

(14)
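A compact way to see how Eq. (11) and the choice of $\ell_1$ work is the following NumPy sketch of HOSVD. It is an illustration under assumptions: the helper names are hypothetical, thin SVDs are used (so the first factor matrix has at most $MK$ columns rather than $N$), and the data are synthetic.

```python
import numpy as np


def unfold(x, mode):
    """Matricize a tensor: rows index `mode`, columns index all other modes."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)


def mode_product(x, matrix, mode):
    """Multiply tensor x by `matrix` along axis `mode`."""
    return np.moveaxis(np.tensordot(matrix, x, axes=(1, mode)), 0, mode)


def hosvd(x):
    """Return the core tensor G and one orthonormal factor matrix per mode."""
    factors = [np.linalg.svd(unfold(x, m), full_matrices=False)[0] for m in range(x.ndim)]
    core = x
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Right-hand side of Eq. (11)."""
    x = core
    for mode, u in enumerate(factors):
        x = mode_product(x, u, mode)
    return x


def select_gene_vector(core, l2, l3):
    """Index l1 whose |G(l1, l2, l3)| is largest for the chosen l2 and l3."""
    return int(np.argmax(np.abs(core[:, l2, l3])))


# Synthetic tensor: N = 50 genes, M = 4 samples, K = 3 conditions.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4, 3))
x = (x - x.mean(axis=0)) / x.std(axis=0)  # Eqs. (9)-(10)
core, factors = hosvd(x)
l1 = select_gene_vector(core, l2=1, l3=0)  # l2 and l3 picked by hand here
u_l1 = factors[0][:, l1]                   # vector used in Eq. (12)
```

The vector `u_l1` then plays the same role as the PC score in the matrix case, so the P-value computation of Eqs. (12)–(14) is identical to that of Eqs. (6)–(8).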

For both PCA- and TD-based unsupervised FE, $P_i$ is corrected with the Benjamini-Hochberg (BH) criterion¹⁰; genes i with an adjusted $P_i$ below the threshold value, which is usually 0.01, are then selected.
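For readers unfamiliar with the BH correction, a minimal NumPy sketch of the standard step-up adjustment is given below. The function name is hypothetical; the logic mirrors the usual definition of BH-adjusted P values.

```python
import numpy as np


def benjamini_hochberg(p):
    """Return BH-adjusted P values in the original order of `p`."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P value.
    stepped = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(stepped, 1.0)
    return adjusted


raw = np.array([0.01, 0.04, 0.03, 0.005])
adjusted = benjamini_hochberg(raw)
selected = adjusted < 0.01  # threshold used in the text
```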

Although PCA- as well as TD-based unsupervised FE were successfully applied to a wide range of genomic analyses, there were two weak points:

Too small a number of genes were selected to have no false negatives.
The histogram of $P_i$ did not fully obey the null assumption that $u_{\ell i}$ and $u_{\ell _1 i}$ obey the Gaussian distribution.

In this paper, by fixing these two problems, we have tried to establish a new method at least comparable to or even superior to state-of-art methods.

Trials using highly curated data sets

Application to MAQC dataset

Initially, to assess what the problem is, we compared the performance of PCA-based unsupervised FE with DESeq2, a state-of-art method, using the MAQC¹¹ data set, which has been carefully curated and frequently used for benchmark studies.

Figure 2C shows a scatter plot of genes using $u_{1i}$ and $u_{2i}$. Figure 2A,B show the PC loadings $v_{1j}$ and $v_{2j}$; $v_{1j}$ represents the mean gene expression and $v_{2j}$ represents the differential expression between universal human reference (UHR) and brain. Incidentally, this is reminiscent of the horizontal and vertical axes of an MAPlot; the horizontal axis of an MAPlot represents the mean expression of individual genes, typically the mean logarithmic expression,

$$\begin{aligned} \frac{1}{M} \sum _{j=1}^M \log _2 x_{ij} \end{aligned}$$

(15)

whereas the vertical axis of an MAPlot represents the differential expression between the two classes, typically the mean logarithmic fold change (LFC),

$$\begin{aligned} \frac{1}{M_A} \sum _{j \in A} \log _2 x_{ij} - \frac{1}{M_B} \sum _{j \in B} \log _2 x_{ij} \end{aligned}$$

(16)

where $M_A$ and $M_B (=M-M_A)$ are the sample numbers within the two classes, A and B, respectively, and summations are taken within individual classes. As can be seen in Fig. 2D, which represents the contribution of the PC loadings, $x_{ij}$ can be expressed almost fully in the 2-dimensional space spanned by the first two PCs. Thus, PCA can derive, in a fully unsupervised manner, something that qualitatively corresponds to an MAPlot (Fig. 8), which is usually drawn artificially. Unfortunately, the number of genes selected by the adjusted $P_i$ is nevertheless too small to assume that there are no false negatives (Table 3), and the histogram of $P_i$ can hardly be regarded as obeying the null hypothesis; the left panel of Fig. 3 shows the histogram of $1-P_i$, where the $P_i$s were computed from $u_{2i}$ by Eq. (6) using $\sigma _2$ defined as

$$\begin{aligned} \sigma _2 &= \sqrt{\frac{1}{N} \sum _i \left( u_{2i} - \langle u_{2i} \rangle \right) ^2} \end{aligned}$$

(17)

$$\begin{aligned} \langle u_{2i} \rangle &= \frac{1}{N} \sum _i u_{2i}. \end{aligned}$$

(18)

If $1-P_i$ is coincident with the null hypothesis, the histogram of $1-P_i < 1$ should have a flat distribution and that of $1-P_i \sim 1$ should have a sharp peak.
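The flat part of this expectation can be checked numerically. The sketch below draws purely Gaussian values (no DEGs), computes $P_i$ as in Eq. (6) under the assumption of one degree of freedom, and bins $1-P_i$; the sample size and the number of bins are arbitrary choices made for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=100_000)  # synthetic scores that obey the null hypothesis
sigma = u.std()

# Upper-tail chi-squared probability with 1 degree of freedom (assumed).
p = np.array([math.erfc(abs(ui) / (sigma * math.sqrt(2.0))) for ui in u])

counts, _ = np.histogram(1.0 - p, bins=10, range=(0.0, 1.0))
fractions = counts / counts.sum()
print(fractions)
```

With no outliers present, every bin holds roughly the same fraction of points; genuine DEGs would add a peak in the last bin, near $1-P_i \sim 1$.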

Top ranked genes are coincident with DESeq2

To understand the problem with the $P_i$s computed by PCA-based unsupervised FE, we compared them with those computed by DESeq2, a state-of-the-art method. First, we used the $P_i$s computed by PCA-based unsupervised FE to predict the top 1000 genes ranked by the $P_i$s derived with DESeq2; the area under the curve (AUC) was 0.97. Conversely, we used the $P_i$s computed by DESeq2 to predict the top 1000 genes ranked by the $P_i$s derived with PCA-based unsupervised FE; the AUC was 0.98. This indicates that the top-ranked genes are largely shared between PCA-based unsupervised FE and DESeq2. Thus, the problem with PCA-based unsupervised FE is not the ranking of the genes but the absolute values of the $P_i$s.
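One way to carry out such a cross-prediction is sketched below. The text does not state how the AUC was computed, so this is an assumption: the AUC is obtained from the Mann-Whitney rank statistic, ties are not averaged, and the function names are hypothetical.

```python
import numpy as np


def rank_auc(labels, scores):
    """AUC from the Mann-Whitney U statistic; a higher score means 'more positive'."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty(scores.size)
    ranks[np.argsort(scores)] = np.arange(1, scores.size + 1)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def top_k_auc(p_reference, p_predictor, k=1000):
    """How well p_predictor recovers the k genes with the smallest p_reference."""
    labels = np.zeros(p_reference.size, dtype=bool)
    labels[np.argsort(p_reference)[:k]] = True
    return rank_auc(labels, -p_predictor)  # smaller P value = higher score


# Tiny hand-made example with k = 1.
p_a = np.array([0.001, 0.5, 0.2, 0.9])
p_b = np.array([0.002, 0.4, 0.3, 0.8])
auc = top_k_auc(p_a, p_b, k=1)
```

Swapping the two arguments of `top_k_auc` gives the comparison in the opposite direction, as done in the text.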

Optimization of SD

Based on the observations at the end of the previous subsubsection, we decided to optimize $\sigma _\ell$ such that $u_{\ell i}$ and $u_{\ell _1 i}$ obey the Gaussian distribution. Generally, optimizing the SD to fit the null hypothesis is not easy. For example, Mudge et al¹² had to assume the equivalence between Type I and II errors, which we cannot assume because of the imbalance in numbers between DEGs and the other genes; typically, DEGs are expected to be a minority. We therefore decided to employ an alternative and more empirical approach. To visualize the idea, we show some illustrative examples.

Figure 4 shows a histogram of the variable $x_i$ derived from the Gaussian distribution and outliers. If we attribute the P values to the ith variable with $x_i$

$$\begin{aligned} P_i = P_{\chi ^2} \left[ > \left( \frac{x_i}{\sigma }\right) ^2\right] \end{aligned}$$

(19)

using the SD, $\sigma$, directly computed by all points

$$\begin{aligned} \sigma &= \sqrt{\frac{1}{N}\sum _{i=1}^N \left( x_i - \langle x_i \rangle \right) ^2} \end{aligned}$$

(20)

$$\begin{aligned} \langle x_i \rangle &= \frac{1}{N} \sum _{i=1}^N x_i \end{aligned}$$

(21)

and select outliers associated with adjusted P values $<0.01$, we cannot select any of the outliers (Table 1); this is because the computed SD, $\sigma = \sqrt{\frac{1000 \times 1 +100 \times 5^2}{1000+100}} \approx 1.75$, is larger than that of the Gaussian distribution, $\sigma =1$, because of the outliers. Because the $P_i$s computed with $\sigma =1.75$ are larger than those computed with $\sigma =1$, the procedure fails to recognize the outliers correctly.
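The effect described here can be reproduced with a small simulation. The sketch assumes 1000 standard Gaussian values plus 100 outliers placed at $\pm 5$ (the exact placement of the outliers in Fig. 4 is not given, so this is a guess), a $\chi^2$ distribution with one degree of freedom, and the usual BH adjustment; all names are hypothetical.

```python
import math
import numpy as np


def chi2_p(values, sigma):
    """Upper-tail chi-squared P values with 1 degree of freedom (assumed)."""
    return np.array([math.erfc(abs(v) / (sigma * math.sqrt(2.0))) for v in values])


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P values."""
    n = p.size
    order = np.argsort(p)
    stepped = np.minimum.accumulate((p[order] * n / np.arange(1, n + 1))[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(stepped, 1.0)
    return adjusted


rng = np.random.default_rng(0)
gaussian = rng.normal(0.0, 1.0, size=1000)
outliers = np.repeat([5.0, -5.0], 50)  # hypothetical placement of the 100 outliers
x = np.concatenate([gaussian, outliers])

sigma_all = x.std()  # Eq. (20): inflated by the outliers
n_selected_inflated = int((bh_adjust(chi2_p(x, sigma_all)) < 0.01).sum())
n_selected_unit = int((bh_adjust(chi2_p(x, 1.0)) < 0.01).sum())
print(sigma_all, n_selected_inflated, n_selected_unit)
```

With the inflated SD no point passes the threshold, whereas with the SD of the Gaussian part the outliers are recovered, which motivates searching for a better SD.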

Table 1 Confusion matrix of the Gaussian distribution with outliers and prediction for $x_i$, the histogram for which is given in Fig. 4.

We computed the histogram of $1-P_i$ (Fig. 5A), which is far from the idealized one (Fig. 5C); the latter should be constant, $h(1-P_i)$, up to $1-P_i$ very close to 1 and have a narrow peak near $1-P_i \sim 1$. To optimize the SD, we tried to find an SD such that the histogram for the points not recognized as outliers was as flat as possible, i.e., obeyed the null hypothesis of the Gaussian distribution; specifically, we sought the SD that results in the flattest $h(1-P_i)$ for $1-{\text{adjusted}} \; P_i$ less than the threshold value $1-{\text{adjusted}} \; P_0$ (${\text{adjusted}} \; P_0$ should be small enough). To this end, we minimized the SD of the binned $h_i=h(1-P_i)$, $\sigma _h$,

$$\begin{aligned} \sigma _h&= \sqrt{\frac{\sum _{{\text{adjusted}} \; P_i > {\text{adjusted}}\; P_0} \left( h_i - \langle h_i\rangle \right) ^2}{N( {\text{adjusted}} \; P_0)}} \end{aligned}$$

(22)

$$\begin{aligned} \langle h_i\rangle &= \frac{\sum _{{\text{adjusted}}\; P_i > {\text{adjusted}} \;P_0} h_i }{N({\text{adjusted}} \; P_0)} \end{aligned}$$

(23)

with respect to $\sigma$, where $N({\text{adjusted}} \; P_0)$ is the number of $h_i$s associated with ${\text{adjusted}} \; P_i >{\text{adjusted}} \; P_0$, i.e., those not recognized as outliers and thus regarded as part of the Gaussian distribution. After optimizing the SD, we recomputed $P_i$. Figure 5A,B show the histogram of $1-P_i$ using $\sigma =1.75$ and the optimized SD, respectively; the latter is closer to the idealized histogram of $P_i$, Fig. 5C, than the former.
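The optimization described above can be sketched as a simple grid search. This is an illustrative reading of the procedure, not the authors' code: the binning (50 bins over the range covered by the points not called outliers), the grid of candidate SDs, the synthetic data (20000 Gaussian values plus 2000 outliers at $\pm 5$), and all function names are assumptions, and the $\chi^2$ distribution is again taken to have one degree of freedom.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)


def chi2_p(u, sigma):
    """Upper-tail chi-squared P values with 1 degree of freedom (assumed)."""
    return _erfc(np.abs(u) / (sigma * math.sqrt(2.0)))


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P values."""
    n = p.size
    order = np.argsort(p)
    stepped = np.minimum.accumulate((p[order] * n / np.arange(1, n + 1))[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(stepped, 1.0)
    return adjusted


def sigma_h(u, sigma, p0=0.01, bins=50):
    """SD of the binned histogram of 1 - P_i over points not called outliers."""
    p = chi2_p(u, sigma)
    kept = p[bh_adjust(p) > p0]
    if kept.size < bins:
        return math.inf  # (almost) everything called an outlier: not meaningful
    counts, _ = np.histogram(1.0 - kept, bins=bins, range=(0.0, 1.0 - kept.min()))
    return counts.std()


def optimize_sd(u, grid, **kwargs):
    """Return the candidate SD with the flattest histogram, plus all scores."""
    scores = np.array([sigma_h(u, s, **kwargs) for s in grid])
    return float(grid[int(np.argmin(scores))]), scores


rng = np.random.default_rng(0)
u = np.concatenate([rng.normal(size=20000), np.repeat([5.0, -5.0], 1000)])
grid = np.linspace(0.5, 2.5, 41)
best_sd, scores = optimize_sd(u, grid)
print(u.std(), best_sd)
```

Returning infinity when too few points remain is one way to exclude the trivial solution in which a vanishing SD flags every point as an outlier; a continuous optimizer could replace the grid search once a sensible starting range is known.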

To validate the effectiveness of the optimization of SD, we repeated this procedure 100 times.

Figure 6 shows the dependence of $\sigma _h$ on SD (upper panel) and the comparison between the SD in Eq. (20), the optimized SD, and the SD computed using the $i$s with ${\text{adjusted}} \; P_i > {\text{adjusted}} \; P_0$ (lower panel). In the lower panel, the optimized SD was approximately 1.2, which is much closer to 1 than 1.75, computed by Eq. (20). In addition, the fact that the SD computed using the $i$s with ${\text{adjusted}} \; P_i > {\text{adjusted}} \; P_0$, which is expected to correspond to the Gaussian distribution part in Fig. 4, is almost 1 helps justify our optimization procedure (Fig. 6, lower panel). The reason why SD = 0 with $\sigma _h=0$ in the upper panel of Fig. 6 was not selected as optimal (as having the smallest $\sigma _h$) is that $\sigma =0$ corresponds to nothing selected and is thus meaningless. Using the $P_i$s computed with the optimized SD, we can discriminate the outliers almost perfectly (Table 2).

Table 2 Averaged confusion matrix of Gaussian distribution with outliers and prediction using optimized SD.

Next, we applied this strategy to the MAQC data set. Figure 7 shows $\sigma _h$, defined in Eq. (22), as a function of the SD used to compute $P_i$ in Eq. (19) for the MAQC data set; the optimal SD was 0.05557979. It is close to the SD recomputed using the $i$s with ${\text{adjusted}} \; P_i > {\text{adjusted}} \; P_0$, 0.03871846; moreover, $h(1-P_i)$ derived from the optimal SD looks more idealized (the right panel of Fig. 3). Thus, the optimal SD improved PCA-based unsupervised FE.

Table 3 shows the number of genes selected using DESeq2 (list of genes available as Data S1), the original PCA-based unsupervised FE, and PCA-based unsupervised FE with the optimal SD (list of genes available as Data S2). Although the number of genes selected by the original PCA-based unsupervised FE, 344, is too small to assume that there are no false negatives, the number selected by PCA-based unsupervised FE with the optimal SD, 12252, is large enough to do so. Furthermore, the number selected by DESeq2, 20546, seems too large to be free of false positives, because it is unlikely that more than half of the 40933 genes are distinctly expressed between the brain and controls.

Table 3 The number of genes selected with the original PCA-based unsupervised FE, that with the optimal SD, and DESeq2.

Less expressed genes are less likely to be DEGs

Figure 8 shows the selected genes in an MAPlot. Although we assumed neither the NB distribution nor the dispersion relation, Eq. (1), the distribution of selected genes in the MAPlot is reasonable; genes with the same LFC (vertical axis) are less likely to be selected when associated with a smaller mean expression (horizontal axis). Although this property is explicitly assumed in DESeq2 through the dispersion relation, Eq. (1), PCA-based unsupervised FE seems to possess the property without assuming the dispersion relation explicitly (see the “Discussion” section). On the other hand, DESeq2 selects too many genes, which is less reasonable. This suggests that PCA-based unsupervised FE with optimized $\sigma _\ell$ is a promising method.
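For reference, the MAPlot coordinates of Eqs. (15) and (16) can be computed as in the sketch below. It assumes strictly positive expression values (in practice a pseudocount is often added before taking logarithms, which is not covered here); the function name and the toy data are hypothetical.

```python
import numpy as np


def ma_coordinates(x, in_class_a):
    """Return (mean log2 expression, log2 fold change A vs. B) for every gene."""
    in_class_a = np.asarray(in_class_a, dtype=bool)
    log_x = np.log2(x)
    mean_log = log_x.mean(axis=1)  # Eq. (15)
    lfc = log_x[:, in_class_a].mean(axis=1) - log_x[:, ~in_class_a].mean(axis=1)  # Eq. (16)
    return mean_log, lfc


# Two genes, four samples; the first two samples belong to class A.
x = np.array([[1.0, 1.0, 4.0, 4.0],
              [8.0, 8.0, 8.0, 8.0]])
mean_log, lfc = ma_coordinates(x, [True, True, False, False])
```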

Confirmation using the SEQC dataset

To check whether this result was a coincidence, we repeated all computations on as many as 13 data sets in SEQC¹³, which is yet another curated data set. Coincidence between DESeq2 and PCA-based unsupervised FE (Fig. 9), a reasonable number of selected genes ($\sim 10^3$, Fig. 10), and a lower chance for less expressed genes to be DEGs (Fig. 11) were also observed, as in the case of MAQC. In addition, although the number of genes selected by DESeq2 is too large ($\sim 10^4$) and heavily dependent upon the sample number ($\sim 10^3$ for the smallest sample number, $\sim 10^0$), that selected by PCA-based unsupervised FE is not and is always $\sim 10^3$, regardless of the sample number. Thus, PCA-based unsupervised FE is seemingly superior to DESeq2.

Biological validation

Based on the above results, PCA-based unsupervised FE is seemingly better than DESeq2: it can select a reasonable number of genes regardless of the sample number (Fig. 10), and less expressed genes are unlikely to be DEGs when genes are selected by PCA-based unsupervised FE with optimized SD (Figs. 8, 11), even without assuming the NB distribution and the dispersion relation, Eq. (1), which DESeq2 requires. Nonetheless, these advantages are meaningless if the selected genes are not biologically relevant. To evaluate the selected genes biologically, we uploaded the genes selected using MAQC to Enrichr. As can be seen in Fig. 12, the genes selected by PCA-based unsupervised FE were better than those selected by DESeq2 (the full list of the enrichment analysis is available in Data S1 and S2).

One may still wonder whether other state-of-the-art methods might be better than PCA-based unsupervised FE. To rule out this possibility, we biologically evaluated the genes selected for MAQC using edgeR⁶ (full list of enrichment analysis available in Data S3), voom⁸ (full list of enrichment analysis available in Data S4), and NOISeq⁹ (full list of enrichment analysis available in Data S5); these three methods are clearly even inferior to DESeq2 biologically (Fig. 13).

Drug discovery for SARS-CoV-2

Although we have demonstrated that PCA-based unsupervised FE with optimized SD can outperform other state-of-the-art methods on highly curated data, one might wonder whether this also holds for a more realistic and noisier case. To check whether PCA-based unsupervised FE with optimized SD can outperform DESeq2 on more realistic data sets, we considered drug repositioning for SARS-CoV-2, to which we previously applied TD-based unsupervised FE¹⁴ and its kernelized version¹⁵.

In our implementation, we employed HOSVD to obtain the tensor decomposition, Eq. (11); because HOSVD is equivalent to SVD applied to a matrix obtained by unfolding a tensor, we obtain the identical $u_{\ell i}$ independent of whether PCA or HOSVD is used; the SD used in Eq. (12) can be optimized too. We then applied the optimization of SD and could select 3627 genes associated with adjusted P values of less than 0.1 (list of genes available as Data S6), far more than the 163 genes selected in previous studies^14,15.

Overlap with human genes known to interact with SARS-CoV-2 protein

We evaluated the selected 3627 genes based on the overlap with the human genes known to interact with SARS-CoV-2, as has been done in previous studies^14,15 (Fig. 14). It is obvious that TD-based unsupervised FE with an optimized SD can outperform kernel TD-based unsupervised FE, original (without optimized SD) TD-based unsupervised FE as well as DESeq2 (list of overlap available in Data S7). Thus, it is indeed an outstanding method.

Drug repositioning

We also tried drug discovery using the genes selected by TD-based unsupervised FE with optimized SD. See Table 4 (the full list of drug repositioning is available as Data S6). The first one, imatinib, was once identified as a promising drug toward COVID-19, although it was rejected later¹⁶. The second one, apratoxin A, was reported to be a promising compound based on its protein binding affinity¹⁷. The third and fourth ones are both doxycycline, which was supposed to be a promising drug toward COVID-19¹⁸. The seventh one, trovafloxacin, was reported to be a promising compound based on its protein binding affinity¹⁹. The eighth one, doxorubicin, was also reported to be a promising compound based on its protein binding affinity²⁰. The ninth one, cisplatin, and the tenth one, carboplatin, were proposed as a result of drug repositioning²¹. Seven of the nine distinct compounds among the top 10 entries have been previously reported as drugs toward SARS-CoV-2.

See Table 5. The first, fourth, and tenth ones are all estradiol, which was reported as a promising compound²². The second one, tamoxifen, was reported to inhibit SARS-CoV-2 infection by suppressing viral entry²³. The third one, apratoxin A, has been listed in Table 4, too. The fifth one, MK-886, was reported to be an inhibitor of 3CL protease²⁴, although its efficiency was limited to 40 %. The sixth one, IFN-alphacon1, was reported to be an inhibitor of SARS-CoV²⁵ but not of SARS-CoV-2. The seventh one, arachidonic acid, was generally expected to inhibit SARS-CoV-2 infection²⁶. The eighth one, arsenic, was also generally expected to act against the RdRp of coronavirus²⁷. The ninth one, metoprolol, was reported to be a promising drug toward COVID-19²⁸. Thus, all the top 10 compounds were reported to be promising.

On the other hand, for DESeq2, see Table 6 (the full list of drug repositioning is available in Data S8). The second and third ones are both dexamethasone, the use of which resulted in lower 28-day mortality among those who received either invasive mechanical ventilation or oxygen alone at randomization but not among those receiving no respiratory support²⁹. The seventh one, metformin, suppressed SARS-CoV-2 in cell culture³⁰. The eighth one, etanercept, significantly decreased the risk of developing COVID-19 in patients with rheumatoid arthritis or spondyloarthropathies³¹. The tenth one, lipopolysaccharide, is not a drug compound but a bacterial endotoxin reported to bind to the SARS-CoV-2 spike protein³².

See Table 7. The first and fourth ones are both resveratrol, which inhibits HCoV-229E and SARS-CoV-2 coronavirus replication in vitro³³. The second, third, and fifth ones are all carboplatin, which was proposed as a result of drug repositioning²¹. The seventh one, lipopolysaccharide, is listed in Table 6, too.

The proposed method can predict effective drugs for COVID-19 based on gene expression analysis at least comparably to DESeq2. However, the results of DESeq2 are less significant and tend to list the same compounds multiple times. The proposed method can thus identify more convincing and diverse candidate compounds than DESeq2.

Table 4 Drug perturbations from GEO down.

Table 5 Drug perturbations from GEO up.

Table 6 Drug perturbations from GEO down for A549 by DESeq2.

Table 7 Drug perturbations from GEO up for A549 by DESeq2.

Based on the overlap between human genes known to interact with SARS-CoV-2 proteins and selected genes (Fig. 14) and from the point of drug repositioning, TD-based unsupervised FE with optimized SD is, at least, competitive with DESeq2.

Comparison of methods using multi-organ measurements with multiple drug treatments

One might wonder if the proposed methods, TD- and PCA-based unsupervised FE with optimized SD, are applicable to a more complicated set-up. To investigate this point, we checked the case where multiple drugs are applied to mice and gene expression is measured in multiple tissues, a setting to which we previously applied TD-based unsupervised FE³⁴.

Enrichment of tissue-specific genes

In the previous study³⁴, although we applied TD-based unsupervised FE to gene expression profiles, there existed some problems. First of all, the number of genes selected was too small to have no false negatives.

Table 8 Comparison of selected genes between TD-based unsupervised FE³⁴ and optimal SD with multi-organ data sets.

Using the optimized SD, the number of selected genes increased (Table 8; the full list of the selected genes is available in Data S9). For more details, e.g., the definition of the four gene sets (neurons and testis, muscle, gastrointestine 1 and 2), see the previous study³⁴; this topic is not discussed herein as it is not directly related to the comparison of performance between the original TD-based unsupervised FE and that with the optimized SD. Although an increased number of genes is meaningless if the biological reliability is lower, the biological reliability of the selected genes is also improved (the lower panel of Fig. 15, which corresponds to the present study, is associated with a greater number of cell lines and greater tissue specificity than the upper panel of Fig. 15, which corresponds to the previous study).

Thus, the use of optimized SD is also effective for data sets more complicated than the simple pairwise comparisons between treated and control samples investigated in the previous sections.

Coincidence with drug treatment

We have also performed additional validation of the genes selected by TD-based unsupervised FE with optimized SD associated with adjusted P values less than 0.1 (Table 8, full list is available in Data S10–S13). We have uploaded selected genes to Enrichr³⁶ and evaluated the overlaps between the genes selected and those whose expression was altered with the treatment of the 15 drugs used in this study. Then, we found that all four gene sets in Table 8 had a significant overlap with the genes whose expression was altered with the treatment of 5 of the drugs (acetaminophen, cisplatin, clozapine, doxycycline, and olanzapine) in DrugMatrix, which does not include other drug treatments (Supplementary material). This suggests that TD-based unsupervised FE with optimal SD can correctly recognize drug treatments based on gene expression; this was impossible in the previous study³⁴ because of the very small number of genes selected (Table 8). Thus, considering the optimization of SD enables TD-based unsupervised FE to recognize a greater number of biologically reliable genes than the original TD-based unsupervised FE, which did not include the optimization of SD.

Discussion

In this study, we have introduced the optimization of SD to TD- and PCA-based unsupervised FE and have improved their performance by increasing the number of identified DEGs, which are associated with greater biological reliability. One of the striking features is that genes with lower expression are less likely to be recognized as DEGs even with the same LFC, if the genes are selected by TD- and PCA-based unsupervised FE with optimized SD. In DESeq2, the tendency that less expressed genes are hardly recognized as DEGs is artificially introduced by assuming the dispersion relation, Eq. (1). In contrast, in PCA- and TD-based unsupervised FE, it emerges automatically. Generally, there exists a relationship between the difference, $\Delta$, of two variables, x and y, and LFC as

$$\begin{aligned} \Delta\equiv & {} x-y \end{aligned}$$

(24)

$$\begin{aligned} \hbox {LFC}\equiv & {} \log _2 \frac{x}{y} = \log _2 \left( 1 + \frac{\Delta }{y} \right) \end{aligned}$$

(25)

Then

$$\begin{aligned} \Delta = y (2^{\hbox {LFC}} -1) \end{aligned}$$

(26)

Because $v_{2j}$ (Fig. 2B) corresponds to $\Delta$, if DEGs are identified using $u_{2i}$ that corresponds to $v_{2j}$ as in TD- and PCA-based unsupervised FE (see Eqs. (6) and (12)), DEGs associated with the same LFC are less likely selected for the smaller y that corresponds to $\mu$. This results in the distribution of DEGs in MAPlot (Fig. 8), where genes with the same LFC (vertical axis) are less likely identified as DEGs with smaller gene expression (horizontal axis). Figure 16 shows the MAPlot drawn using two independent random variables obeying the same positive uniform distribution; the red colored region associated with $|\Delta |$ larger than some threshold values qualitatively represents the tendency that indicates that a smaller $x+y$ is less likely selected even with the same LFC, $\log _2 \frac{x}{y}$. Thus, TD- and PCA-based unsupervised FE can introduce the tendency that genes with less expression are less likely to be DEGs, even with the same amount of LFC more naturally than DESeq2, which has to manually introduce a dispersion relation, Eq. (1).
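The argument based on Eq. (26) can be illustrated numerically. The sketch below evaluates $\Delta$ for a fixed LFC at several expression levels and then mimics the setting of Fig. 16 with two independent uniform variables; the threshold of 0.3 on $|\Delta|$, the LFC band around 1, and the sample size are arbitrary illustrative choices, and the function name is hypothetical.

```python
import numpy as np


def delta_from_lfc(y, lfc):
    """Eq. (26): difference x - y implied by a log2 fold change at baseline y."""
    return y * (2.0 ** lfc - 1.0)


# The same LFC of 1 gives a difference proportional to the expression level y.
for y in (1.0, 10.0, 100.0):
    print(y, delta_from_lfc(y, 1.0))

# Two independent positive uniform variables, as in the set-up of Fig. 16.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)
y = rng.uniform(0.0, 1.0, size=100_000)
lfc = np.log2(x / y)
selected = np.abs(x - y) > 0.3              # selection by |Delta| (threshold assumed)
band = np.abs(np.abs(lfc) - 1.0) < 0.1      # points with nearly the same |LFC|
total = x + y
cut = np.median(total[band])
frac_low = selected[band & (total < cut)].mean()    # lower-expression half
frac_high = selected[band & (total >= cut)].mean()  # higher-expression half
print(frac_low, frac_high)
```

Within the band of nearly identical LFC, points with a smaller $x+y$ are selected much less often than those with a larger $x+y$, which is the qualitative tendency described above.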

In addition to this, although DESeq2 assumes the NB distribution, which has no rationale other than that it takes only positive values and has a simultaneously tunable mean and variance, TD- and PCA-based unsupervised FE assume only that $u_{\ell i}$ obeys the Gaussian distribution (Eqs. (6) and (12)), which is more reasonable because Gaussian distributions generally appear when independent random variables are summed up. In fact, NOISeq does not assume the NB distribution either but achieves performance comparable to that of DESeq2 (Fig. 13). In this sense, TD- and PCA-based unsupervised FE can realize the DEG distribution in an MAPlot more naturally than DESeq2.

Another remarkable point of TD- and PCA-based unsupervised FE with optimized SD is that the genes selected using P values do not have to be screened further by LFC. As can be seen in Fig. 10, state-of-the-art methods, including DESeq2, often identify too many DEGs. In these circumstances, LFC is often used to reduce the number of DEGs. Nevertheless, Stupnikov et al.³⁷ found that the coincidence of the selected genes among the various state-of-the-art methods decreases drastically if the genes selected based on P values are further screened with LFC. In this sense, TD- and PCA-based unsupervised FE with optimized SD are more promising than state-of-the-art methods that need screening by LFC to yield a reasonable number of DEGs.

Yet another advantage is that TD- and PCA-based unsupervised FE have already been applied to a wide range of problems. Optimized SD not only improves the performance of PCA- and TD-based unsupervised FE, as can be seen in Figs. 14 and 15, but also requires an alteration only at the last stage, i.e., the P value computation, Eqs. (6) and (12). Thus, the optimized SD is expected to improve the performance in the wide range of problems to which TD- and PCA-based unsupervised FE have been applied.

One might wonder whether the validation should be based upon a ground truth. Nevertheless, we do not think that there is a ground truth for DEGs; which genes are DEGs depends upon the definition of DEGs, since the amount of differential expression is not a discrete variable but a continuous one. We need to decide threshold values, and this choice affects which genes are regarded as DEGs. In contrast, biological significance is more trustworthy. In addition, the purpose of identifying DEGs is to make further use of them in biological studies. Thus, we believe that the proposed methods, which can select biologically more reasonable genes than state-of-the-art methods, are worth publishing.

Conclusions

In this study, we optimized SD to improve TD- and PCA-based unsupervised FE. As a result, not only did the number of obtained DEGs increase and become reasonable, but the histogram of 1-P also became more reliable, i.e., more coincident with the null hypothesis that SVV and PC obey a Gaussian distribution. In addition, TD- and PCA-based unsupervised FE provide a reliable distribution of DEGs in the MAPlot, i.e., less expressed genes are less likely to be selected as DEGs even if they are associated with the same LFC; in DESeq2, this property is implemented manually by assuming a dispersion relation, Eq. (1). The biological reliability of the selected genes is also much better with this method than with other state-of-the-art methods. These points suggest that TD- and PCA-based unsupervised FE are superior to state-of-the-art methods in that they achieve better performance with fewer assumptions.

Methods

Sample R code to perform analyses in this study is available as Data S14.

Gene expression profiles

MAQC

Seven human brain expression profiles were downloaded from SRA³⁸ (ID SRX016359), and seven UHR expression profiles were downloaded from SRA (ID SRX016367). The fourteen FASTQ files were mapped to the hg38 human genome using RapMap³⁹. htseq-count⁴⁰ was used to convert the obtained BAM files to count data files using the GTF file taken from ftp://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz.

SEQC

The SEQC data set¹³ was obtained from Bioconductor⁴¹ as an experiment data package, seqc. It includes the thirteen profiles shown in Fig. 11. For more details, see the vignettes in the seqc package.

The histogram composed of Gaussian distribution and outliers in Fig. 4

The Gaussian part consists of one thousand values drawn from a Gaussian distribution with zero mean and an SD of one. The outliers are 100 values, each equal to 5.
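The construction of this synthetic data set can be sketched in Python as follows. This is a minimal illustration rather than the authors' R code; the function names, the random seed, and the number of histogram bins are assumptions made only for this example.

```python
import random
import statistics

def gaussian_with_outliers(n_gauss=1000, n_out=100, outlier_value=5.0, seed=0):
    """1000 draws from N(0, 1) followed by 100 outliers that all equal 5."""
    rng = random.Random(seed)
    values = [rng.gauss(0.0, 1.0) for _ in range(n_gauss)]
    values += [outlier_value] * n_out
    return values

def histogram(values, n_bins=20):
    """Counts per equal-width bin between the minimum and maximum value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is folded into the last bin.
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts

values = gaussian_with_outliers()
sd_gauss = statistics.pstdev(values[:1000])  # SD of the Gaussian part only
sd_all = statistics.pstdev(values)           # SD of the Gaussian part plus outliers
print(sd_gauss, sd_all)
print(histogram(values))
```

The SD computed over all values is larger than that of the Gaussian part alone, because the outliers contribute to it.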

PCA-based unsupervised FE applied

MAQC

Genes not expressed in any of the 14 samples were excluded. Four rows with the annotations “__no_feature”, “__ambiguous”, “__not_aligned”, and “__alignment_not_unique” were also excluded. As a result, we obtained $x_{ij} \in \mathbb {R}^{40933 \times 14}$. The $x_{ij}$ was processed as described in the main text.
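The two filtering steps can be sketched as follows. This Python fragment is illustrative only: the gene names and counts are made up, the dictionary layout is hypothetical, and "expressed" is assumed here to mean a count greater than zero.

```python
# The four summary rows named in the text.
HTSEQ_SPECIAL_ROWS = {
    "__no_feature",
    "__ambiguous",
    "__not_aligned",
    "__alignment_not_unique",
}

def filter_counts(counts):
    """Drop the summary rows and the genes whose count is zero in every sample.

    counts: dict mapping a row name to a list of per-sample counts.
    """
    return {
        gene: row
        for gene, row in counts.items()
        if gene not in HTSEQ_SPECIAL_ROWS and any(c > 0 for c in row)
    }

# Toy input with three samples (hypothetical values).
toy_counts = {
    "GENE_A": [12, 0, 7],
    "GENE_B": [0, 0, 0],        # not expressed in any sample: removed
    "GENE_C": [3, 5, 1],
    "__no_feature": [100, 90, 95],
    "__ambiguous": [4, 2, 6],
}
filtered = filter_counts(toy_counts)
print(sorted(filtered))
```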

SEQC

Regardless of which of the 13 data sets was considered, only those genes expressed in all samples were considered. An individual data set has a distinct number of rows (genes) and columns (samples). The $x_{ij}$ obtained from an individual data set was processed as described in the main text.

SARS-CoV-2

All processes used were exactly the same as those described in the previous study¹⁴. After obtaining $u_{5i}$, the SD was optimized as described in the main text.

Multi-organ

All processes used were exactly the same as those described in the previous study³⁴. After getting $u_{\ell i}$, the SD was optimized as described in the main text.

Optimization of SD

First, a histogram of $1-P_i$ was computed using the hist function in R with the “break=100” option. Then, the SD of the binned histogram, i.e., of the hc$count values associated with hc$breaks less than the 1-P of the genes whose adjusted P value was less than the threshold value $P_0$, was minimized using the optim function in R. The R code provided in Data S14 shows how to optimize SD for an individual data set.
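The procedure can be approximated by the following Python sketch. It is not the code in Data S14 and rests on several assumptions that should be checked against that file: $P_i$ is taken to be the upper-tail probability of a chi-squared distribution with one degree of freedom evaluated at $(u_i/\sigma )^2$ (the form assumed here for Eqs. (6) and (12)), P values are adjusted by the Benjamini-Hochberg procedure, only bins lying entirely below the smallest $1-P$ of the selected genes are kept, and a simple grid search around the sample SD stands in for R's optim. The toy data and all function names are hypothetical.

```python
import math
import random
import statistics

def p_values(u, sigma):
    """Chi-squared (1 d.o.f.) upper tail of (u_i / sigma)^2, written with erfc."""
    return [math.erfc(abs(v) / (sigma * math.sqrt(2.0))) for v in u]

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # from the largest P downwards
        i = order[rank - 1]
        running_min = min(running_min, p[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def null_bin_sd(u, sigma, p0=0.01, n_bins=100):
    """SD of the histogram counts of 1 - P over the bins below the selected genes."""
    p = p_values(u, sigma)
    adj = bh_adjust(p)
    counts = [0] * n_bins
    for pi in p:
        counts[min(int((1.0 - pi) * n_bins), n_bins - 1)] += 1
    selected = [pi for pi, ai in zip(p, adj) if ai < p0]
    cutoff = 1.0 - max(selected) if selected else 1.0
    kept = [c for k, c in enumerate(counts) if (k + 1) / n_bins <= cutoff]
    return statistics.stdev(kept) if len(kept) >= 2 else math.inf

def optimize_sd(u, p0=0.01, n_grid=61):
    """Grid search between 0.5 and 1.5 times the sample SD (assumed range)."""
    s = statistics.pstdev(u)
    grid = [s * (0.5 + k / (n_grid - 1)) for k in range(n_grid)]
    return min(grid, key=lambda sigma: null_bin_sd(u, sigma, p0))

# Toy data: a Gaussian bulk plus outliers, as in the synthetic histogram example.
rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [5.0] * 100
raw_sd = statistics.pstdev(u)
best_sd = optimize_sd(u)
print(raw_sd, best_sd)
```

The search range is restricted because the objective has a trivial minimum when $\sigma$ is so small that every gene is selected; a local optimizer started from the sample SD avoids that region in a similar way.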

Coincidence between PCA-based unsupervised FE and DESeq2

The coincidence between PCA-based unsupervised FE and DESeq2 was evaluated by AUC (Fig. 9) as follows. First, the top 1000 genes based on P values computed by DESeq2 were regarded as positive and the remaining genes as negative. Then, P values computed by PCA-based unsupervised FE were used to predict the positive genes, and the AUC was computed from this prediction. Next, conversely, the top 1000 genes based on P values computed by PCA-based unsupervised FE were regarded as positive and the remaining genes as negative. Then, P values computed by DESeq2 were used to predict the positive genes, and the AUC was computed again.
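A minimal Python sketch of this two-way evaluation is given below. The AUC is computed through the rank-sum (Mann-Whitney) identity, which is one standard way to obtain it and not necessarily the routine used in the study; the function names and the toy P values are hypothetical, and top_n is reduced from 1000 to 3 to fit the toy data.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity.

    Larger scores should indicate positives; tied scores get average ranks.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, is_pos in zip(ranks, labels) if is_pos)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def coincidence_auc(p_reference, p_predictor, top_n=1000):
    """Label the top_n genes by p_reference as positive, then score every
    gene by -p_predictor so that a small P value predicts a positive."""
    ranked = sorted(range(len(p_reference)), key=lambda i: p_reference[i])
    top = set(ranked[:top_n])
    labels = [i in top for i in range(len(p_reference))]
    return auc([-p for p in p_predictor], labels)

# Toy P values for eight genes (made-up numbers).
p_deseq2 = [0.001, 0.20, 0.03, 0.50, 0.90, 0.04, 0.60, 0.01]
p_pca = [0.002, 0.10, 0.40, 0.70, 0.80, 0.01, 0.30, 0.02]

auc_a = coincidence_auc(p_deseq2, p_pca, top_n=3)  # DESeq2 defines the positives
auc_b = coincidence_auc(p_pca, p_deseq2, top_n=3)  # PCA-based FE defines the positives
print(auc_a, auc_b)
```

The two directions generally give different values because the positive sets differ.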

Enrichment analyses

Enrichment analyses were performed using either Metascape³⁵ or Enrichr³⁶ by uploading gene symbols. If the gene ID was not a gene symbol in individual data sets, the gene ID conversion tool in Database for Annotation, Visualization, and Integrated Discovery (DAVID)⁴²,⁴³ was used for conversion.

DEG identification of SARS-CoV-2 data by DESeq2

We used author-provided adjusted P values and LFC (in supplementary data in their paper) to identify DEGs. If we considered only adjusted P values to identify DEGs, DESeq2 would identify too many genes (Table 9). Thus, we had to consider LFC as well. Table 9 shows the number of DEGs used in this study.

Table 9 The number of DEGs in SARS-CoV-2 study by DESeq2 (based on author-provided supplementary material).

The evaluation of the overlap with human genes known to interact with SARS-CoV-2 proteins is available in Supplementary materials. The best one, that for the ACE2-expressed A549 cell line, is also included in the main text as Fig. 14.

Data availability

The sequencing datasets are available via the NIH/NCBI Sequence Read Archive (SRA) repository using accession numbers SRX016359 and SRX016367; via Bioconductor with the seqc package [https://doi.org/doi:10.18129/B9.bioc.seqc, accessed 10th July 2022]; and via the NIH/NCBI Gene Expression Omnibus (GEO) repository using accession numbers GSE147507 and GSE142068.

References

Taguchi, Y-h. Comparative transcriptomics analysis. In Encyclopedia of Bioinformatics and Computational Biology (eds Ranganathan, S. et al.) 814–818 (Academic Press, 2019). https://doi.org/10.1016/B978-0-12-809633-8.20163-5.
Rapaport, F. et al. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 14, 3158. https://doi.org/10.1186/gb-2013-14-9-r95 (2013).
Tusher, V. G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. 98, 5116–5121. https://doi.org/10.1073/pnas.091062498 (2001).
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47. https://doi.org/10.1093/nar/gkv007 (2015).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550. https://doi.org/10.1186/s13059-014-0550-8 (2014).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140. https://doi.org/10.1093/bioinformatics/btp616 (2009).
McCarthy, D. J., Chen, Y. & Smyth, G. K. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 40, 4288–4297. https://doi.org/10.1093/nar/gks042 (2012).
Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29. https://doi.org/10.1186/gb-2014-15-2-r29 (2014).
Tarazona, S., García, F., Ferrer, A., Dopazo, J. & Conesa, A. NOIseq: a RNA-seq differential expression method robust for sequencing depth biases. EMBnet.journal 17, 18–19. https://doi.org/10.14806/ej.17.B.265
Taguchi, Y-h. Unsupervised Feature Extraction Applied to Bioinformatics (Springer International Publishing, 2020).
Shi, L. et al. The MicroArray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 24, 1151–1161. https://doi.org/10.1038/nbt1239 (2006).
Mudge, J. F., Baker, L. F., Edge, C. B. & Houlahan, J. E. Setting an optimal $\alpha$ that minimizes errors in null hypothesis significance tests. PLoS ONE 7, 1–7. https://doi.org/10.1371/journal.pone.0032734 (2012).
SEQC/MAQC-III Consortium, A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the sequencing quality control consortium. Nature Biotechnology 32, 903–914. https://doi.org/10.1038/nbt.2957 (2014).
Taguchi, Y.-H. & Turki, T. A new advanced in silico drug discovery method for novel coronavirus (SARS-CoV-2) with tensor decomposition-based unsupervised feature extraction. PLoS ONE 15, 1–16. https://doi.org/10.1371/journal.pone.0238907 (2020).
Taguchi, Y.-H. & Turki, T. Application of tensor decomposition to gene expression of infection of mouse hepatitis virus can identify critical human genes and efffective drugs for SARS-CoV-2 infection. IEEE J. Sel. Top. Signal Process. 15, 746–758. https://doi.org/10.1109/JSTSP.2021.3061251 (2021).
Zhao, H., Mendenhall, M. & Deininger, M. W. Imatinib is not a potent anti-SARS-CoV-2 drug. Leukemia 34, 3085–3087. https://doi.org/10.1038/s41375-020-01045-9 (2020).
Naidoo, D., Roy, A., Kar, P., Mutanda, T. & Anandraj, A. Cyanobacterial metabolites as promising drug leads against the mpro and plpro of SARS-CoV-2: An in silico analysis. J. Biomol. Struct. Dyn. 39, 6218–6230. https://doi.org/10.1080/07391102.2020.1794972 (2021).
Dorobisz, K., Dorobisz, T., Janczak, D. & Zatoński, T. Doxycycline in the coronavirus disease 2019 therapy. Ther. Clin. Risk Manag. 17, 1023–1026. https://doi.org/10.2147/tcrm.s314923 (2021).
Gimeno, A. et al. Prediction of novel inhibitors of the main protease (M-pro) of SARS-CoV-2 through consensus docking and drug reposition. Int. J. Mol. Sci. 21, 3793. https://doi.org/10.3390/ijms21113793 (2020).
Jamal, Q. M. S., Alharbi, A. H. & Ahmad, V. Identification of doxorubicin as a potential therapeutic against SARS-CoV-2 (COVID-19) protease: a molecular docking and dynamics simulation studies. J. Biomol. Struct. Dyn. 40, 7960–7974. https://doi.org/10.1080/07391102.2021.1905551 (2021).
MotieGhader, H., Safavi, E., Rezapour, A., Amoodizaj, F. F. & asl Iranifam, R. Drug repurposing for coronavirus (SARS-CoV-2) based on gene co-expression network analysis. Sci. Rep. 11, 21872. https://doi.org/10.1038/s41598-021-01410-3 (2021).
Mansouri, A., Kowsar, R., Zakariazadeh, M., Hakimi, H. & Miyamoto, A. The impact of calcitriol and estradiol on the SARS-CoV-2 biological activity: A molecular modeling approach. Sci. Rep. 12, 717. https://doi.org/10.1038/s41598-022-04778-y (2022).
Zu, S. et al. Tamoxifen and clomiphene inhibit SARS-CoV-2 infection by suppressing viral entry. Signal Transduct. Targeted Therapy 6, 435. https://doi.org/10.1038/s41392-021-00853-4 (2021).
Zhu, W. et al. Identification of SARS-CoV-2 3cl protease inhibitors by a quantitative high-throughput screening. ACS Pharmacol. Transl. Sci. 3, 1008–1016. https://doi.org/10.1021/acsptsci.0c00108 (2020).
Paragas, J., Blatt, L. M., Hartmann, C., Huggins, J. W. & Endy, T. P. Interferon alfacon1 is an inhibitor of SARS-corona virus in cell-based models. Antiviral Res. 66, 99–102. https://doi.org/10.1016/j.antiviral.2005.01.002 (2005).
Ripon, M. A. R., Bhowmik, D. R., Amin, M. T. & Hossain, M. S. Role of arachidonic cascade in covid-19 infection: A review. Prostaglandins Other Lipid Mediators 154, 106539. https://doi.org/10.1016/j.prostaglandins.2021.106539 (2021).
Chowdhury, T., Roymahapatra, G. & Mandal, S. M. In silico identification of a potent arsenic based approved drug darinaparsin against sars-cov-2: Inhibitor of RNA dependent RNA polymerase (RdRp) and necessary proteases. ChemRxiv. https://doi.org/10.26434/chemrxiv.12200495.v1 (2020).
Clemente-Moragón, A. et al. Metoprolol in critically ill patients with COVID-19. J. Am. Coll. Cardiol. 78, 1001–1011. https://doi.org/10.1016/j.jacc.2021.07.003 (2021).
The RECOVERY Collaborative Group, Dexamethasone in hospitalized patients with covid-19. N. Engl. J. Med. 384, 693–704. https://doi.org/10.1056/nejmoa2021436 (2021).
Parthasarathy, H., Tandel, D. & Harshan, K. H. Metformin suppresses SARS-CoV-2 in cell culture. bioRxiv. https://doi.org/10.1101/2021.11.18.469078 (2021).
Salesi, M., Shojaie, B., Farajzadegan, Z., Salesi, N. & Mohammadi, E. TNF-$\alpha$ blockers showed prophylactic effects in preventing COVID-19 in patients with rheumatoid arthritis and seronegative spondyloarthropathies: A case-control study. Rheumatol. Therapy 8, 1355–1370. https://doi.org/10.1007/s40744-021-00342-8 (2021).
Petruk, G. et al. SARS-CoV-2 spike protein binds to bacterial lipopolysaccharide and boosts proinflammatory activity. J. Mol. Cell Biol. 12, 916–932. https://doi.org/10.1093/jmcb/mjaa067 (2020).
Pasquereau, S. et al. Resveratrol inhibits HCoV-229E and SARS-CoV-2 coronavirus replication in vitro. Viruses 13, 354. https://doi.org/10.3390/v13020354 (2021).
Taguchi, Y-h. & Turki, T. Universal nature of drug treatment responses in drug-tissue-wide model-animal experiments using tensor decomposition-based unsupervised feature extraction. Front. Genet. 11, 695. https://doi.org/10.3389/fgene.2020.00695 (2020).
Zhou, Y. et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat. Commun. 10, 1523. https://doi.org/10.1038/s41467-019-09234-6 (2019).
Xie, Z. et al. Gene set knowledge discovery with Enrichr. Curr. Protocols 1, e90. https://doi.org/10.1002/cpz1.90 (2021).
Stupnikov, A. et al. Robustness of differential gene expression analysis of RNA-seq. Comput. Struct. Biotechnol. J. 19, 3470–3481. https://doi.org/10.1016/j.csbj.2021.05.040 (2021).
Leinonen, R., Sugawara, H. & Shumway, M. On behalf of the international nucleotide sequence database collaboration, the sequence read archive. Nucleic Acids Res. 39, D19–D21. https://doi.org/10.1093/nar/gkq1019 (2010).
Srivastava, A., Sarkar, H., Gupta, N. & Patro, R. RapMap: A rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics 32, i192–i200. https://doi.org/10.1093/bioinformatics/btw277 (2016).
Putri, G. H., Anders, S., Pyl, P. T., Pimanda, J. E. & Zanini, F. Analysing high-throughput sequencing data in python with htseq 2.0. Bioinformatics 38, 2943–2945. https://doi.org/10.1093/bioinformatics/btac166 (2022).
Huber, W. et al. Orchestrating high-throughput genomic analysis with bioconductor. Nat. Methods 12, 115–121. https://doi.org/10.1038/nmeth.3252 (2015).
Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57. https://doi.org/10.1038/nprot.2008.211 (2008).
Huang, D. W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13. https://doi.org/10.1093/nar/gkn923 (2008).


Acknowledgements

This work was supported by the Japan Society for the Promotion of Science, KAKENHI [Grant numbers 19H05270, 20K12067, and 20H04848] to YHT.

Author information

Authors and Affiliations

Department of Physics, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo, 112-8551, Japan
Y-h. Taguchi
Department of Computer Science, King Abdulaziz University, Jeddah, 21589, Saudi Arabia
Turki Turki

Contributions

Y.H.T. planned the research and performed analyses. Y.H.T. and T.T. evaluated the results, discussions, and outcomes and drafted and reviewed the manuscript.

Corresponding author

Correspondence to Y-h. Taguchi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.

Supplementary Information 2.

Supplementary Information 3.

Supplementary Information 4.

Supplementary Information 5.

Supplementary Information 6.

Supplementary Information 7.

Supplementary Information 8.

Supplementary Information 9.

Supplementary Information 10.

Supplementary Information 11.

Supplementary Information 12.

Supplementary Information 13.

Supplementary Information 14.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Taguchi, Yh., Turki, T. Adapted tensor decomposition and PCA based unsupervised feature extraction select more biologically reasonable differentially expressed genes than conventional methods. Sci Rep 12, 17438 (2022). https://doi.org/10.1038/s41598-022-21474-z

Received: 07 March 2022
Accepted: 27 September 2022
Published: 19 October 2022
DOI: https://doi.org/10.1038/s41598-022-21474-z
