Identification of miRNA signatures for kidney renal clear cell carcinoma using the tensor-decomposition method

Cancer is a highly complex disease caused by multiple genetic factors. MicroRNA (miRNA) and mRNA expression profiles are useful for identifying prognostic biomarkers for cancer. Kidney renal clear cell carcinoma (KIRC), which accounts for more than 70% of all renal malignant tumour cases, was selected for our analysis. Traditional methods of identifying cancer prognostic markers may not be accurate. Tensor decomposition (TD) is a useful method uncovering the underlying low-dimensional structures in the tensor. The TD-based unsupervised feature extraction method was applied to analyse mRNA and miRNA expression profiles. Biological annotations of the prognostic miRNAs and mRNAs were examined utilizing the pathway and oncogenic signature databases DIANA-miRPath and MSigDB. TD identified the miRNA signatures and the associated genes. These genes were found to be involved in cancer-related pathways, and 23 genes were significantly correlated with the survival of KIRC patients. We demonstrated that the results are robust and not highly dependent upon the databases we selected. Compared with traditional supervised methods tested, TD achieves much better performance in selecting prognostic miRNAs and mRNAs. These results suggest that integrated analysis using the TD-based unsupervised feature extraction technique is an effective strategy for identifying prognostic signatures in cancer studies.


Scientific RepoRtS
| (2020) 10:15149 | https://doi.org/10.1038/s41598-020-71997-6 www.nature.com/scientificreports/ If supervised learning is applied, overfitting can occur. For example, suppose that we are seeking genes associated with aberrant expression that a disease causes. If some of those genes are also associated with gender-dependent expression while others are not, the former might be identified as less coincident with disease progression than the latter. Although gender-dependent expression is biologically acceptable, it is practically difficult to take into consideration in advance. Overlooking genes that are simultaneously associated with disease-causing aberrant expression and gender-dependent expression is possible if we do not intentionally consider gender dependence. Considering labelling strictly often causes this kind of biologically unnecessary screening of genes. In contrast, unsupervised feature extraction, which is a data-driven strategy, allows us to recognize genes associated with gender-dependent aberrant expression if they are dominant, since we do not have to assume what specifically we would like to find in advance. At the same time, regularization (sparse modeling) attempts to minimize the number of features by restricting the sum of coefficients attributed to features and penalizes the use of additional features. The disadvantage of regularization is that we must select the values of parameters that balance the prediction accuracy and the number of features. There are two major issues with supervised feature extraction methods: (i) class labels may not always be true, and (ii) there may be more class labels present in the dataset. As for (i), it is usual for a medical doctor to label samples by visual investigation. This sometimes results in errors, as some tumour samples can accidentally be classified as normal tissues. As for (ii), many diseases are often associated with several subclasses, e.g. cancer subtypes or different stages of disease progression. Thus, it is possible to have insufficient samples to cover all of these known subclasses. This problem that the number of features are more than that of samples can be resolved by employing unsupervised methods; such as principal component analysis (PCA), because they are often used to generate a smaller number of latent variables than samples through the linear combination of original features.
Unsupervised methods are able to identify the underlying structures in the unlabeled dataset. As for (i), because of the unsupervised nature as described above, mislabelling cannot generate incorrect linear combinations, since labels are used only to validate generated features, not to generate features themselves. As for (ii), again, because of the unsupervised nature, subclasses will be automatically reflected in generated features. Thus, even if we do not have enough samples to attribute to all known subclasses, features generated naturally can take these subclasses into account.
The problem with the unsupervised approach is that the linear combination of many features often prevents us from interpreting the newly generated latent variables. An unsupervised methodology that is suitable for the dimension reduction problems is PCA or tensor decomposition (TD)-based unsupervised feature extraction (FE) [14][15][16][17][18][19][20][21][22][23][24][25][26][27] . This method allows selection of a smaller number of features effectively and stably. As can be seen in below, using this approach, at first the latent variables that are associated with samples and are coincident with the desired property, e.g., the distinction between patients and healthy controls, are selected. The latent variables that are attributed to features and corresponds to the selected latent variables attributed to samples are used for selecting limited number of features. Thus, we can have a limited number of features, which allows us to interpret the meaning of the results more easily, since we do not have to deal with all features included in the latent variable. These limited number of features cannot be obtained by simply performing PCA and TD on a given data set and can be obtained only using our approaches described below.
In this paper, tensors specifically refer to mathematical objects having three or more suffices, while matrices refer to tensors with exactly two suffices. PCA is a kind of matrix factorization, while TD is a factorization method applied to a tensor. The advantages of TD over PCA is that TD needs fewer latent variables to factorize. For example, suppose that we have 1,000 features that are formatted as either a 10 × 100 matrix or a 10 × 10 × 10 tensor. PCA applied to a matrix results in two vectors that have 10 and 100 latent variables, respectively; thus, PCA needs in total 110 latent variables to represent 1,000 features. In contrast, TD applied to a tensor results in three vectors, each of which has 10 latent variables. Thus, in total, TD needs only 30 latent variables to represent 1,000 features Fewer latent variables allow TD to capture features in a more efficient manner, and the results are free from overfitting the expression of 1,000 features, unlike with PCA. Figure 1 shows the flowchart of analyses and results in this study. We applied TD-based unsupervised FE to the KIRC dataset retrieved from TCGA. It was found that u mRNA are significantly correlated, we computed the PCC between them, which was 0.905 ( P = 1.63 × 10 −121 ), indicating that they are highly correlated (Fig. 2).

Results
The results of the miRNA signatures and their significant correlated genes are shown in Table 1. A total of 11 miRNAs and 72 genes were identified. To determine if these miRNAs and mRNAs are significantly correlated, we computed the PCC for all 11 × 72 = 792 pairs. Among them, 353 pairs were positively correlated, and 358 pairs were negatively correlated (P-values were less than 0.01 after correcting with the BH criterion). Therefore, 90% of pairs are significantly correlated. Moreover, we could successfully identify significantly correlated pairs of miRNAs and mRNAs. We noted that, among the predicted 11 miRNAs, one miRNA (miR-155) matched the result reported by Lokeshwar et al. 11 . Enrichment analysis. Next, in order to evaluate the biological significance of selected mRNAs, we determined the top 10 oncogenic signatures of the 72 genes reported by MSigDB (Fig. 3, see also Table S1). The results of the top 10 REACTOME pathways reported by MSigDB are summarized in Fig. 4 (see also Table S2). These results suggest that the selected 72 mRNAs are likely related to oncogenesis. In order to further confirm if these 72 mRNAs are related to kidney cancer, we checked if these genes were linked to survival rates (Fig. 5, see also www.nature.com/scientificreports/ Table S3). Among 72 mRNAs, 23 were significantly correlated with the survival of kidney cancer patients. This also highlights the effectiveness of our analysis. We also evaluated the identified 11 miRNAs by DIANA-mirpath. Figure 6 (see also Table S4) shows the enriched disease-related KEGG pathways (P-value < 0.05). The renal cell carcinoma pathway is identified with a significant P-value equal to 0.01613. The top signature in Table S1 [ Fig. 3(I)] is related to the cAMP signaling pathway. Targeting the cAMP pathway is an effective treatment for kidney cancer 28,29 . The second signature in Table S1 [ Fig. 3(II)] is the Snf5 gene expression profile of a murine model (Mouse Embryonic Fibroblast (MEF) cells) that closely resembles that of human SNF5-deficient rhabdoid tumors (pediatric soft tissue sarcoma that arises in the kidney, the liver, and the peripheral nerves) 30 . Impairment of the SWI/SNF chromatin remodeling complex plays an important role in the development and aggressiveness of clear cell renal cell carcinoma 31 . The sixth signature in Table S1 [ Fig. 3 (VI)] comes from a study of the effects of knockdown of the gene family of eukaryotic translation initiation factors (EIF) by RNAi in MCF10A cells. EIF3b is a promising prognostic biomarker and a potential therapeutic target for patients with clear cell renal cell carcinoma 32 , and EIF4GI is a target for cancer therapeutics 33 .
The top pathway in Table S2 [ Fig. 4(I)] is the 'Pathway of regulation of IGF activity by IGFBP' . Studies show that insulin-like growth factors (IGFs) and insulin play a stimulatory role for renal cancer cells 34,35 . Patients with IGF-1 receptor overexpression have a 70% increased risk of death 36 . Moreover, this overexpression has been shown to increase kidney cancer risk in middle-aged male smokers 37 . The second pathway in Table S2    www.nature.com/scientificreports/ is 'Cytokine Signaling in Immune system' . Cytokines are important biomolecules that play essential roles in tumor formation 38 , and they are therapeutic targets 39,40 . The IL-6 cytokine family can serve as useful diagnostic and prognostic biomarkers. In fact, IL-6 is a potential target in cancer therapy 41,42 . Ishibashi et al. reported that IL-6 suppresses the expression of the cytokine signaling-3 (SOCS3) gene and is associated with poor prognosis of kidney cancer patients 43 .
Survival analysis and miRNA-regulated pathway study.  (Table S4 and Fig. 6), 14 are directly related to cancers, except for Hepatitis B and Hepatitis C. Therefore, we correctly identified miRNA signatures that are cancer-related.   Table 1. See Table S1 for numerical data and full descriptions. The analysis was conducted using R 49 . www.nature.com/scientificreports/ Validation of findings using the TCGA and GEO databases. In order to validate the robustness of our findings, we employed an independent dataset to confirm that our results are independent of datasets to some extent. The alternative dataset was downloaded from GEO (GSE16441). The procedures applied to analyze the GEO dataset are similar to those applied to the dataset obtained from TCGA. The only difference is the number of samples, miRNAs, and mRNAs. After repeating the same procedures, we realized that u mRNA l 1 j and u miRNA l 3 j ( l 1 = l 3 = 2) also varied between normal and tumor samples (Fig. 1). P-values computed by the t-test  Table 1. See Table S2 for numerical data and full descriptions. The analysis was conducted using R 49 .  are significantly correlated, we calculated the PCC between them, which was 0. 931 (P-value = 1.58 × 10 −15 ), indicating that they are highly correlated. Next, we checked if the selected miRNAs and mRNAs were common between the TCGA and GEO datasets. We identified three miRNAs-hsa-miR-141, hsa-miR-210, and hsa-miR-200c, which are listed in Table 1. On the other hand, 209 genes were identified. After restricting genes included in both TCGA and GEO datasets, we evaluated the overlap as the confusion matrix ( Table 2).
The P-value determined using the Fisher exact test was 8.97 × 10 −11 and the odds ratio was 19.7. Therefore, the coincidence between selected genes in the TCGA and GEO datasets is significant, and the results obtained for TCGA are robust and not highly dependent upon specific samples.

Superiority of TD over t-test, SAM and limma.
To test the superiority over the conventional methods, we applied the t-test, SAM 44 , and limma 45 to the TCGA and GEO datasets, respectively. After applying these statistical methods, P-values were calculated and adjusted based on the BH criterion. Then, 13,895 genes and 399 miRNAs for TCGA and 12,152 genes and 78 miRNAs for GEO were associated with adjusted P-values less than 0.01 by t-test. At the same time, by SAM, 14,485 genes and 441 miRNAs for TCGA and 16,336 genes and 108 miRNAs for GEO were selected. Finally, limma selected 18,225 genes and 662 miRNAs for TCGA and 28,524 genes and 319 miRNAs for GEO. Relative to the TD method, the t-test, SAM, and limma identified a larger number of genes and miRNAs using the P-values as criteria. If the top ranked (small enough or restricted) number of genes and miRNAs was selected by the t-test, SAM and limma, the coincidence between TCGA and GEO might be compatible. Therefore, we selected the same number of genes and miRNAs by the t-test, SAM and limma as those selected by TD. Only one miRNA and no genes were common between the TCGA and GEO datasets for t-test. We could not reduce the number of genes and miRNAs selected by SAM, since it attributed P = 0 to more genes and miRNAs than those selected by TD for both TCGA and GEO, as did limma for miRNA. Limma could select a reduced number of genes for TCGA and GEO, while no common genes were selected between them. Therefore, we determined that the t-test, SAM, and limma could identify less coincident sets of genes and  Table 1 whose normal logarithmic numbers are proportional to the radii of blue open circles. See Table S4 for numerical values and full description. The analysis was conducted using R 49 . www.nature.com/scientificreports/ miRNAs between TCGA and GEO. In conclusion, this strongly suggests that the proposed method is superior to the t-test, SAM, and limma. Although we cannot deny the possibility that more advanced or sophisticated methods can compete with TD-based unsupervised FE, since this method was fast, simple, and robust and gave us a biologically reasonable set of genes and miRNAs, we believe that the present results are worth publishing, even without more comprehensive comparisons with methods other than limma, SAM, and t-test.

Discussions
In addition to compare with the supervised methods; i.e. t-test, SAM and limma, we benchmarked the TD approach with the unsupervised method, PCA. We did not apply PCA to miRNAs and mRNAs separately and instead applied TD to a tensor that was generated by merging these two. To demonstrate this point, we applied PCA to miRNAs and mRNAs retrieved from TCGA. We noticed that the second PC loading attributed to miRNA and mRNA samples, v (mRNA) 2j and v (miRNA) 2j , were not only mutually correlated but also distinct between tumours and normal tissues; the PCC between them, which was 0.839 (P-value = 2.74 × 10 −87 ), was less significant than that when TD was employed (0.905, P = 1.63 × 10 −121 ). T-test applied to v (mRNA) 2j and v (miRNA) 2j to evaluate significant distinctions between tumours and normal tissues gave us P = 2.33 × 10 −36 for mRNA and P = 2.39 × 10 −77 for miRNA, which are at best comparable ( P = 7.10 × 10 −39 for mRNA and P = 2.13 × 10 −71 for miRNA when TD was employed). To evaluate if PCA could select common genes and miRNAs between TCGA and GEO, we also applied PCA to the GEO dataset and selected 10 miRNAs and 70 mRNAs for TCGA and three miRNAs and 131 mRNAs for GEO. Since three miRNAs selected for GEO were also included in 10 miRNA selected for TCGA, coincidence between GEO and TCGA is the same for miRNAs between TD and PCA. Conversely, since we could find only six genes in common between TCGA and GEO, coincidence for mRNAs is less than with TD, which identified 11 genes in common. In conclusion, although miRNAs and mRNAs can also be successfully identified separately by PCA, the integrated analysis of TD has some advantages over PCA.

Conclusions
In this study, we applied the TD-based unsupervised FE method to the KIRC miRNA expression and gene expression data. The TD-based method can identify miRNA signatures with differential expression between normal tissues and tumors as well as significant correlations between the gene expression data. Selected mRNAs and miRNAs are not only mutually correlated but are also significantly related to various aspects of cancers. This suggests that integrated analysis performed by TD-based unsupervised FE is an effective strategy; it can identify biologically significant pairs of miRNAs and mRNAs despite its simplicity, which is not easy by other strategies.

Materials and methods
Tensors and tensor decomposition (TD). Tensor 17 is a mathematical structure for storing datasets associated with more than two properties. If we measure miRNA and mRNA expression for the samples, we cannot avoid storing these two measurements into two separate matrices. However, by using tensor we can store these two datasets into a tensor, because tensors can have more than two suffixes, which matrices do not have. TD 17 is a mathematical trick that can approximate tensors as the summation of series whose terms are expressed via the outer product of vectors, each of which represent individual property (in this specific example, these vectors correspond to mRNAs, miRNAs, and samples). Figure 7 shows how TD and PCA were applied to miRNA and mRNA expression to select critical miRNAs and mRNAs for KIRC. The miRNAseq and mRNAseq expression data for KIRC were retrieved from the TCGA Data Portal Research Network (https ://gdcpo rtal.nci.nih.gov/). TD is a natural extension of matrix factorization and is regarded as a generalization of the singular value decomposition (SVD) method. It is a useful technique uncovering the underlying low-dimensional structures in the tensor. There are two popular tensor decomposition algorithms: canonical polyadic decomposition (CPD) and Tucker decomposition 46 . The rank decomposition method, CPD, is to express a tensor as the sum of a finite number of rank-one tensors. The Tucker decomposition decomposes a tensor into a so-called core tensor and multiple matrices.

Tensor decomposition methods.
TD-based unsupervised FE was applied to analyze mRNA and miRNA expression profiles. Let x ij (mRNA) denote the expression profiles of the ith mRNA (i = 1, …N) of the jth sample ( j = 1, … M), whereas x kj (miRNA) denotes the expression profiles of the kth miRNA ( k = 1, …K) of the jth sample ( j = 1, … M). Next, we generated a tensor (the rationale for this can be found in 17 ), that is x ijk is subjected to Tucker decomposition as follows: where G ∈ R N×M×K is the core tensor and u l 1 i ∈ R N×N , u l 2 j ∈ R M×M and u l 3 k ∈ R K×K are singular value matrices that are orthogonal. The three matrices can be interpreted of as the principle components for the three modes (properties). The core tensor describes the degree of interaction between the three components 47 . Because Tucker decomposition is not unique, we have to specify how Tucker decomposition was derived. In particular, we chose higher-order singular value decomposition (HOSVD). Given that x ijk ∈ R N×M×K is too large to apply TD, we generated a matrix, which is given by: By applying SVD, we can get u l 1 i and u l 3 k as Then, we can also obtain two u l 2 j that correspond to miRNA and mRNA expression: Selection of genes can be determined using the following quantities, where P χ 2 [> x] is the cumulative probability that the argument is greater than x in a χ 2 distribution.σ l 1 and σ l 3 denote the standard deviations for u l 1 i and u l 3 k , respectively. After the P-values are adjusted by means of the Benjamini-Hochberg (BH) criterion, miRNAs and mRNAs that are associated with adjusted P-values less than 0.01 are selected as those showing differences in expression between controls (normal tissues) and treated samples (tumors). Although the reason CPD was not employed is fully described in my recent book 17 , I briefly describe it here. First of all, CPD cannot give us unique solutions, only initial value-dependent solutions that prevent us from interpreting the results uniquely. Another disadvantage of CPD compared with HOSVD is computation time: CPD is ten times slower than HOSVD 19 . Thus, there is no reason to employ CPD over HOSVD.
Although many other algorithms can compute Tucker decomposition, to our knowledge, HOSVD is the fastest method. In addition, since it works well applied to various problems 17 as well as in the present study, there is no need to employ other algorithms besides HOSVD. Figure 7. Illustration of TD and PCA application to mRNAs and miRNAs expression, respectively. TD workflow (solid line): mRNA and miRNA expression provided as matrices (middle upper) were multiplied to generate a tensor (right upper), which was converted to a matrix with summation sample suffix (lower right). The tensor was decomposed into a product of miRNA and mRNA singular value matrices (middle lower), which was converted to miRNA sample and mRNA sample singular value vectors (left). PCA workflow (broken lines): PCA directly decomposed miRNA and mRNA expression matrices into a product of PC loading attributed to samples and PC scores attributed to miRNA or mRNA. www.nature.com/scientificreports/ Principal component analysis methods. Similar to TD, PCA can also be applied to miRNA and miRNA expression, although separately rather than in an integrated manner. x ij (mRNA) and x kj (miRNA) are normalized such , attributed to ith mRNA, can be obtained as the eigenvector of the gram matrix, , where l is the eigenvalue. The lth PC loading, v survival-associated genes by using OncoLnc 31 . Supplementary Tables S1 to S4 that include numerical data as well as detailed descriptions that correspond to Figs. 3 to 6.