Introduction

One of the most important applications of RNA sequencing is comparing the expression of non-coding RNAs (ncRNAs) between conditions. ncRNAs are RNAs that are transcribed from the genome but not translated into proteins and that perform their biological functions at the RNA level; they include rRNA, tRNA, snRNA, lncRNA, microRNA (miRNA) and others. They play important roles in normal development, physiology and disease1. miRNAs and lncRNAs have been widely studied and have been shown to strongly regulate gene expression2,3,4,5,6. By direct or indirect means, a single miRNA or lncRNA can regulate hundreds of mRNAs.

High-throughput sequencing is a common method for ncRNA research, and genes with large expression differences are often selected for follow-up functional studies9,10. Screening ncRNAs with log2 FC and p-value thresholds alone, as is traditional, can discard valuable information. Many analysis methods have been developed to screen ncRNAs more rigorously, including enrichment analysis methods and databases such as GSEA11, IPA12, DAVID13, Catmap14 and GlobalTest15. These methods have different priorities, but the general idea is the same: to perform functional annotation on the RNA profile. With these methods, however, we can only observe which genes and pathways are associated with ncRNAs. There is no indicator that measures the regulatory function and participation degree of an ncRNA in transcriptome expression, and this gap can cause valuable information to be missed when screening ncRNAs. Here, we developed an algorithm, PDNT, which yields a contribution value (C value) for each ncRNA. The C value is defined as a quantitative indicator of the participation degree of an ncRNA in the transcriptome. The algorithm proceeds as follows: (1) enrich pathways with the differentially expressed genes (DEGs) in the dataset, and use the −lg(p-value) of each pathway as its weight; (2) take the intersection of the ncRNA's target genes and the DEGs, and calculate the proportion of this intersection in each pathway; (3) compute the C value as the weighted sum of these proportions. To verify the utility of the C value, we collected existing sequencing results covering skeletal muscle denervation, Alzheimer's disease, prostate cancer, gastric cancer and adipocyte differentiation. C57BL/6 mice were used as the model of skeletal muscle denervation and APP/PS1 mice as the model of Alzheimer's disease; the prostate cancer, gastric cancer and adipocyte differentiation samples were all from humans16,17,18,19,20.

The proposed PDNT algorithm takes into account the p-value of each enriched pathway and the proportion of ncRNA target genes in each pathway. We aim to quantify the participation degree of ncRNAs in the transcriptome and to improve the efficiency of ncRNA screening after high-throughput sequencing.

Results

The C value of each DE ncRNA is equal to the sum of BP value, CC value, MF value and KEGG value

We calculated the C value of each differentially expressed (DE) miRNA in the skeletal muscle denervation, prostate cancer, Alzheimer's disease and gastric cancer data sets. In addition, we calculated the C value of each DE lncRNA in the skeletal muscle denervation and adipocyte differentiation data sets. The details of these data sets are summarized in Table 1. For each DE ncRNA, separate C values were obtained from the biological process (BP), cellular component (CC), molecular function (MF) and KEGG analyses, and we named them the BP value, CC value, MF value and KEGG value, respectively. The total C value of each DE ncRNA was the sum of its BP value, CC value, MF value and KEGG value. The DE miRNAs were sorted by total C value to obtain the 10 DE miRNAs with the largest C values, named the top10 C value miRNAs (Table 2). The top10 DE miRNAs with the largest absolute Log2 FC (top10 FC miRNAs) and the top10 DE miRNAs with the smallest p-value (top10 p-value miRNAs) were obtained by sorting the DE miRNAs by absolute Log2 FC and p-value, respectively (Supplementary Tables 1, 2). DE lncRNAs were processed in the same way to obtain the top5 C value, top5 FC and top5 p-value lncRNAs for adipocyte differentiation and the top10 C value, top10 FC and top10 p-value lncRNAs for skeletal muscle denervation (Table 3, Supplementary Tables 3–6).
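As a minimal sketch of this summation and ranking step, the Python snippet below adds the four component values for each ncRNA and returns the names with the largest totals. The miRNA names, the numbers and the function names are hypothetical placeholders chosen for illustration; they are not values or code from the study.

```python
from typing import Dict, List

COMPONENTS = ("BP", "CC", "MF", "KEGG")

# Hypothetical component C values (illustrative numbers, not study data).
component_values: Dict[str, Dict[str, float]] = {
    "miR-A": {"BP": 4.0, "CC": 1.5, "MF": 2.0, "KEGG": 0.5},
    "miR-B": {"BP": 1.0, "CC": 1.0, "MF": 1.0, "KEGG": 1.0},
    "miR-C": {"BP": 3.0, "CC": 3.0, "MF": 2.5, "KEGG": 1.0},
}


def total_c_value(parts: Dict[str, float]) -> float:
    """Total C value = BP value + CC value + MF value + KEGG value."""
    return sum(parts[name] for name in COMPONENTS)


def top_k_by_total(values: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Return the k ncRNA names with the largest total C value."""
    ranked = sorted(
        values, key=lambda name: total_c_value(values[name]), reverse=True
    )
    return ranked[:k]
```

Calling `top_k_by_total(component_values, 10)` on a full table would give the analogue of the top10 C value list; the same helper with `k=5` corresponds to the top5 lists.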

Table 1 Description of publicly available data sets used in the meta-analysis.
Table 2 The top10 miRNAs according to C value.
Table 3 The top lncRNAs according to C value.

C value is superior to log2 FC and p-value in miRNA operation results

In each data set, the most significantly enriched IPA canonical pathways were obtained by core analysis (Supplementary Table 7). We took the intersections of the DEGs with the predicted target genes of the top10 C value miRNAs, top10 FC miRNAs and top10 p-value miRNAs, and then calculated the proportion of each intersection in these pathways. In most pathways, the proportion of target genes of the top10 C value miRNAs was significantly larger than that of the top10 FC miRNAs and top10 p-value miRNAs (Fig. 1). We then built a PPI network from the DEGs of each data set and calculated the degree of each node. Nodes with a larger degree are drawn in a darker color and closer to the center. We divided the nodes into a core region (top 20% by degree), a sub core region (top 20%-50% by degree) and a noncore region (bottom 50% by degree) (Fig. 2a,e,i,m). In each PPI network, the predicted target genes of the top10 C value miRNAs, top10 FC miRNAs and top10 p-value miRNAs were labeled in red (Fig. 2). The number of target genes of the top10 C value miRNAs in each region was larger than that of the top10 FC miRNAs and top10 p-value miRNAs, and the C value group was more concentrated in the core region (Fig. 3, Table 4).

Figure 1

Proportion of the three groups in each IPA canonical pathway. (a) Skeletal muscle denervation. (b) Prostate cancer. (c) Alzheimer's disease. (d) Gastric cancer. (FC group: the collection of the predicted target mRNAs of the top10 FC miRNAs; p-value group: the collection of the predicted target mRNAs of the top10 p-value miRNAs; C value group: the collection of the predicted target mRNAs of the top10 C value miRNAs). Drawn with Microsoft Excel.

Figure 2

Partition of the PPI network and distribution of each group in the PPI network. (a,e,i,m) PPI networks of DEGs in the skeletal muscle denervation, prostate cancer, Alzheimer's disease and gastric cancer data sets. The degree of each node was calculated. The larger the degree of a node, the darker its color and the closer it is to the center. The top 20% of nodes are defined as the core region, the top 20%-50% as the sub core region, and the remaining nodes as the noncore region. (b,f,j,n) Distribution of the FC group in the PPI network. (c,g,k,o) Distribution of the p-value group in the PPI network. (d,h,l,p) Distribution of the C value group in the PPI network. Selected nodes are red and unselected nodes are blue. The number of genes in the core, sub core and noncore regions is labeled for each group. STRING v11.0 was used to generate protein interactions, and the resulting network was visualized using Cytoscape v3.7.2. (FC group: the collection of the predicted target mRNAs of the top10 FC miRNAs; p-value group: the collection of the predicted target mRNAs of the top10 p-value miRNAs; C value group: the collection of the predicted target mRNAs of the top10 C value miRNAs).

Figure 3

Statistics on the distribution of each group in the PPI network. (a) Skeletal muscle denervation. (b) Prostate cancer. (c) Alzheimer's disease. (d) Gastric cancer. The ratio of the number of genes in each group in different regions.

Table 4 The ratio of the number of genes in each group in different regions.

Based on an extensive literature search, we identified 14 skeletal muscle growth regulatory miRNAs, 6 Alzheimer's disease associated miRNAs, 6 prostate cancer associated miRNAs and 6 gastric cancer associated miRNAs. When the DE miRNAs were sorted by C value, the sum of the ranks of these miRNAs was significantly smaller than when they were sorted by either of the other two indexes, which means that these miRNAs ranked higher overall (Fig. 4). Compared with sorting by absolute Log2 FC or p-value, sorting by C value moved most of the disease critical miRNAs up in rank (Fig. 4, Supplementary Table 8).
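The rank-sum comparison can be illustrated with a short Python sketch. It assumes that C value and absolute Log2 FC are ranked in descending order and p-value in ascending order, so that rank 1 is always the best candidate; the miRNA names, scores and helper functions are hypothetical and serve only to show the arithmetic.

```python
from typing import Dict, Iterable


def rank_by(scores: Dict[str, float], descending: bool) -> Dict[str, int]:
    """Map each name to its 1-based rank under the given sort direction."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=descending)
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_sum(ranks: Dict[str, int], key_items: Iterable[str]) -> int:
    """Sum of the ranks of the literature-derived key ncRNAs."""
    return sum(ranks[name] for name in key_items)


# Hypothetical scores for five DE miRNAs (not study data).
c_values = {"m1": 9.0, "m2": 7.0, "m3": 5.0, "m4": 3.0, "m5": 1.0}
abs_log2_fc = {"m1": 1.2, "m2": 3.5, "m3": 1.1, "m4": 2.8, "m5": 4.0}
p_values = {"m1": 0.03, "m2": 0.001, "m3": 0.04, "m4": 0.002, "m5": 0.01}
key_mirnas = ["m1", "m3"]  # hypothetical disease critical miRNAs

sum_c = rank_sum(rank_by(c_values, descending=True), key_mirnas)
sum_fc = rank_sum(rank_by(abs_log2_fc, descending=True), key_mirnas)
sum_p = rank_sum(rank_by(p_values, descending=False), key_mirnas)
```

A smaller rank sum means the key miRNAs sit nearer the top of the sorted list, which is the sense in which one index screens better than another here.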

Figure 4

After sorting by C value, the overall ranking of disease critical miRNAs improved. (a) Skeletal muscle denervation. (b) Alzheimer's disease. (c) Prostate cancer. (d) Gastric cancer. Left: the sum of the ranks of disease critical miRNAs under the three indexes. Right: the number of miRNAs that rank up or down. (FC group: the collection of the predicted target mRNAs of the top10 FC miRNAs; p-value group: the collection of the predicted target mRNAs of the top10 p-value miRNAs; C value group: the collection of the predicted target mRNAs of the top10 C value miRNAs).

C value is superior to log2 FC and p-value in lncRNA operation results

In the skeletal muscle denervation data set, we calculated the proportion of the predicted target genes of the top10 C value lncRNAs, top10 FC lncRNAs and top10 p-value lncRNAs in the most enriched IPA canonical pathways, and found that the proportion of genes regulated by the top10 C value lncRNAs was larger than that of the top10 FC lncRNAs and top10 p-value lncRNAs (Fig. 5a). In the PPI network, the number of target genes of the top10 C value lncRNAs in each region was larger than that of the top10 FC lncRNAs and top10 p-value lncRNAs, and the C value group was more concentrated in the core region (Fig. 5b–e, Table 5).

Figure 5

LncRNA operation results for the skeletal muscle denervation data set. (a) The ratio of predicted target genes to the total genes in IPA canonical pathways. The distribution of the predicted target mRNAs of the (b) top10 FC, (c) top10 p-value and (d) top10 C value lncRNAs in the PPI network. The number of genes in the core, sub core and noncore regions is labeled for each group. (e) The ratio of the number of genes in each group in different regions. (FC group: the collection of the predicted target mRNAs of the top10 FC lncRNAs; p-value group: the collection of the predicted target mRNAs of the top10 p-value lncRNAs; C value group: the collection of the predicted target mRNAs of the top10 C value lncRNAs).

Table 5 The ratio of the number of genes in each group in different regions.

Because the adipocyte differentiation data set contains relatively few DE lncRNAs and DE mRNAs, we took the top5 C value lncRNAs, top5 FC lncRNAs and top5 p-value lncRNAs. In the enriched IPA canonical pathways, the proportion of genes regulated by the top5 C value lncRNAs was larger than that of the top5 FC lncRNAs and top5 p-value lncRNAs (Fig. 6a). In the PPI network, the number of target genes of the top5 C value lncRNAs in each region was larger than that of the top5 FC lncRNAs and top5 p-value lncRNAs, and the C value group was more concentrated in the core region (Fig. 6b–e, Table 5). When the DE lncRNAs were sorted by C value, the adipocyte differentiation associated lncRNAs ranked higher overall than when sorted by the other two indexes (Fig. 6f–g, Supplementary Table 8).

Figure 6

LncRNA operation results for the adipocyte differentiation data set. (a) The ratio of predicted target genes to the total genes in IPA canonical pathways. The distribution of the predicted target mRNAs of the (b) top5 FC, (c) top5 p-value and (d) top5 C value lncRNAs in the PPI network. The number of genes in the core, sub core and noncore regions is labeled for each group. (e) The ratio of the number of genes in each group in different regions. (f) The sum of the ranks of adipocyte differentiation associated lncRNAs under the three indexes. (g) The number of adipocyte differentiation associated lncRNAs that rank up or down. (FC group: the collection of the predicted target mRNAs of the top5 FC lncRNAs; p-value group: the collection of the predicted target mRNAs of the top5 p-value lncRNAs; C value group: the collection of the predicted target mRNAs of the top5 C value lncRNAs).

Efficiency comparison of different ncRNAs

First, we analyzed the IPA canonical pathway results and compared the proportion of the C value group in the top10 pathways with that of the other two groups. In the miRNA data sets, the efficiency of the C value group was 61% higher than that of the FC group and 145% higher than that of the p-value group. In the lncRNA data sets, the C value group was 39% higher than the FC group and 78% higher than the p-value group (Table 6). We then analyzed the PPI network results and compared the ratio of the C value group in the core region with that of the other two groups. In the miRNA data sets, the C value group was 10% higher than the FC group and 18% higher than the p-value group; in the lncRNA data sets, it was 85% higher than the FC group and 81% higher than the p-value group (Table 7). In general, there is little difference between the miRNA and lncRNA results, and larger differences occur between individual data sets, which may be related to data set quality.
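The text does not spell out how these percentages were derived. One plausible reading, shown below purely as an assumption, is a relative increase of the C value group's proportion over the comparison group's proportion. The function name and the input numbers are hypothetical.

```python
def relative_improvement_percent(c_group: float, other_group: float) -> float:
    """Relative increase of the C value group over another group, in percent.

    Assumption: improvement = (C - other) / other * 100. The source reports
    percentages but does not state this formula explicitly.
    """
    if other_group == 0:
        raise ValueError("comparison group proportion must be non-zero")
    return (c_group - other_group) / other_group * 100.0


# Hypothetical proportions of pathway genes covered by each group.
example = relative_improvement_percent(c_group=0.75, other_group=0.5)
```

Under this reading, a value of 100 would mean the C value group covers twice the proportion of the comparison group.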

Table 6 Efficiency comparison of C value in IPA canonical pathways.
Table 7 Efficiency comparison of C value in PPI network.

Discussion

After high-throughput sequencing, ncRNAs are commonly screened according to expression differences, but this may discard valuable information and lead to biased results. Although ncRNAs strongly regulate gene expression, there is currently no indicator that characterizes the regulatory function and participation degree of an ncRNA in transcriptome expression to help evaluate and screen ncRNAs. Here we designed a new algorithm, PDNT, to calculate the contribution value (C value), which is defined as a quantitative indicator of the participation degree of an ncRNA in the transcriptome.

To test the superiority of the C value, we compared it with absolute Log2 FC and p-value. Log2 FC reflects the expression change of an ncRNA, and the p-value reflects how significant the change is. Both indexes are obtained for each DE RNA from conventional whole transcriptome sequencing, and many follow-up studies have relied in part on Log2 FC and p-values when selecting target genes9,10,16. We analyzed four miRNA data sets and two lncRNA data sets, and compared the C value with Log2 FC and p-value in each. First, we performed enrichment analysis on the DEGs to obtain the most enriched IPA canonical pathways. The top C value ncRNAs targeted more genes in these pathways than the FC and p-value groups, which suggests that the top C value ncRNAs may have greater regulatory potential for the enriched pathways. Next, we constructed a PPI network from the DEGs, partitioned it by degree and examined the distribution of the three groups across the partitions. The number of target genes of the top C value ncRNAs in each region was greater than that of the other two groups, and a larger proportion of the target genes in the C value group were concentrated in the core region of the PPI network. This suggests that the top C value ncRNAs have a broader and more important influence on the PPI network than the other two groups. Finally, based on a literature search, we obtained key ncRNAs that regulate various pathological/physiological processes and tested how well the three indicators screened for these key ncRNAs in the data sets. Ranking ncRNAs by C value placed these key ncRNAs higher overall than ranking by the other two indicators, which suggests that ncRNAs screened by C value have greater potential for regulating pathological/physiological processes.

To correct the bias caused by considering only expression differences when screening ncRNAs, many analysis methods and databases have been developed, such as GSEA11, IPA12, DAVID13, Catmap14 and GlobalTest15. These methods have different priorities, but the general idea is the same: to perform functional annotation on the RNA profile. With these methods, however, we can only observe which genes and pathways are associated with ncRNAs; there is no measure of the participation degree of an ncRNA in the transcriptome. As a result, we cannot prioritize two ncRNAs whose target genes are similar in number, and when two ncRNAs regulate similar pathways, we cannot judge their participation degree in the expression regulation of the transcriptome. The PDNT algorithm proposed in this study builds on these pathway analysis methods. It makes fuller use of pathway enrichment results to evaluate ncRNAs and integrates more information to improve the efficiency of ncRNA screening. A limitation of this study is that the calculation was based on only one pathway enrichment method. In subsequent work, we will compare the results calculated with different pathway enrichment methods, to provide further guidance for related research.

Based on the above evidence, PDNT is an efficient algorithm for calculating the participation degree of ncRNAs in the transcriptome based on pathway analysis. The PDNT algorithm provides a measure from a different perspective than log2 FC and p-value, and it may provide more clues for effectively evaluating ncRNAs.

Methods

Prediction of ncRNAs’ target mRNAs

miRNA: Target genes of miRNAs were predicted with miRanda-3.3a (http://www.microrna.org/)22, which uses a weighted dynamic programming algorithm to calculate the optimal sequence complementarity between a mature microRNA and a given mRNA. The main parameters were: -sc 140, -en -10, -scale 4, -strict -out.

lncRNA: Target genes of lncRNAs were predicted by co-expression analysis across samples. Weighted Gene Correlation Network Analysis (http://www.r-project.org/)23 was used to calculate Pearson correlation coefficients. Pairs with an absolute Pearson correlation coefficient ≥ 0.90, p-value < 0.01 and FDR < 0.01 were retained.

GO and KEGG pathway enrichment analysis

In this study, the screening criteria for DEGs were p < 0.05 and absolute Log2 FC ≥ 1.
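A minimal Python sketch of this filter is given below. The record type, gene names and numbers are hypothetical; only the two thresholds (p < 0.05 and absolute Log2 FC ≥ 1) come from the text.

```python
from typing import List, NamedTuple


class GeneStat(NamedTuple):
    """Hypothetical per-gene result from a differential expression test."""
    gene: str
    log2_fc: float
    p_value: float


def is_deg(stat: GeneStat, p_cutoff: float = 0.05, lfc_cutoff: float = 1.0) -> bool:
    """A gene is a DEG when p < p_cutoff and |Log2 FC| >= lfc_cutoff."""
    return stat.p_value < p_cutoff and abs(stat.log2_fc) >= lfc_cutoff


def select_degs(stats: List[GeneStat]) -> List[str]:
    """Return the names of genes that pass both thresholds."""
    return [stat.gene for stat in stats if is_deg(stat)]


# Illustrative records (not study data).
stats = [
    GeneStat("geneA", 2.3, 0.001),   # passes both thresholds
    GeneStat("geneB", -1.0, 0.04),   # |Log2 FC| exactly 1 still passes
    GeneStat("geneC", 0.8, 0.001),   # fold change too small
    GeneStat("geneD", 3.0, 0.05),    # p-value is not strictly below 0.05
]
degs = select_degs(stats)
```

Note that the fold-change bound is inclusive while the p-value bound is strict, matching the inequalities as written.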

GO is a database established by the Gene Ontology Consortium (http://www.geneontology.org) that comprises three parts: molecular function, biological process and cellular component. KEGG pathway analysis was based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.genome.ad.jp/kegg/). Fisher's exact test and the χ² test were used. Enrichment analysis of differentially expressed genes was performed using the clusterProfiler R package24, and gene length bias was corrected. Terms and pathways with a corrected p-value less than 0.05 were considered significantly enriched in differentially expressed genes.
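To make the enrichment p-value concrete, the sketch below computes the one-sided Fisher's exact test for over-representation, which is the upper tail of a hypergeometric distribution. This is a generic illustration in Python using only the standard library, not the clusterProfiler implementation; it omits multiple-testing correction and gene length correction, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(background: int, in_pathway: int,
                       n_degs: int, overlap: int) -> float:
    """P(X >= overlap) for X ~ Hypergeometric(background, in_pathway, n_degs).

    background: number of annotated genes considered
    in_pathway: number of background genes belonging to the pathway
    n_degs:     number of DEGs drawn from the background
    overlap:    number of DEGs that belong to the pathway
    """
    upper = min(in_pathway, n_degs)
    total = comb(background, n_degs)
    # Sum the probabilities of observing `overlap` or more pathway genes.
    tail = sum(
        comb(in_pathway, i) * comb(background - in_pathway, n_degs - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy example: 10 background genes, 4 in the pathway, 3 DEGs, 2 of them in the pathway.
p = enrichment_p_value(background=10, in_pathway=4, n_degs=3, overlap=2)
```

The −lg(p-value) of each enriched pathway obtained this way is what the PDNT algorithm uses as a pathway weight.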

C value mathematical model and its calculation

The C value of each DE ncRNA is calculated using the PDNT algorithm (Fig. 7):

$$C\ \mathrm{value}=\sum_{k=1}^{n}{\mathrm{Proportion}}_{k}\times \left(-{\mathrm{log}}_{10}\left({pValue}_{k}\right)\right)$$

Here, pValue is the enrichment p-value of the k-th pathway enriched by DEGs; Proportion is the proportion of the intersection between the ncRNA target genes and the DEGs in the k-th pathway; and n is the number of pathways enriched by DEGs.
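The following Python sketch shows one way the weighted sum could be computed. It makes an explicit assumption that the source does not confirm: the proportion for a pathway is taken as the number of genes shared by the ncRNA targets, the DEGs and the pathway, divided by the total number of genes in that pathway. All names and example values are hypothetical.

```python
from math import log10
from typing import FrozenSet, List, NamedTuple, Set


class Pathway(NamedTuple):
    """Hypothetical enriched pathway: member genes and enrichment p-value."""
    genes: FrozenSet[str]
    p_value: float


def c_value(target_genes: Set[str], degs: Set[str], pathways: List[Pathway]) -> float:
    """Weighted sum over pathways of proportion * (-log10(p-value))."""
    regulated = target_genes & degs  # ncRNA targets that are also DEGs
    total = 0.0
    for pathway in pathways:
        if not pathway.genes:
            continue  # skip empty pathways to avoid division by zero
        # Assumption: the denominator is the pathway's total gene count.
        proportion = len(regulated & pathway.genes) / len(pathway.genes)
        total += proportion * (-log10(pathway.p_value))
    return total


# Toy example (not study data).
pathways = [
    Pathway(frozenset({"a", "b", "c", "d"}), 0.001),  # weight 3
    Pathway(frozenset({"c", "e"}), 0.01),             # weight 2
]
score = c_value({"a", "b", "e", "x"}, {"a", "b", "c", "e"}, pathways)
```

In the toy example the ncRNA regulates two of four genes in the first pathway and one of two in the second, so the score is 0.5 × 3 + 0.5 × 2. If the proportion is instead defined relative to the DEGs in each pathway, only the denominator changes.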

Figure 7

The operation and verification process of the PDNT algorithm.

Ingenuity pathway analysis (IPA) core analysis

IPA core analysis of the DEGs (p < 0.05 and absolute Log2 FC ≥ 1) was performed using IPA (version 81348237, Qiagen), and the top10 canonical pathways by p-value are shown.

PPI network for DEGs

For each data set, the STRING v11.0 database was used to construct the PPI network from the DEGs. The networks were then drawn with Cytoscape v3.7.2 (San Diego, CA, USA).
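The degree-based partition described in the Results (core region: top 20% of nodes by degree; sub core region: top 20%-50%; noncore region: bottom 50%) can be sketched in Python as follows. The edge list format, the tie-breaking rule and the rounding of the cutoffs are assumptions made for this illustration and are not specified in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def node_degrees(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Degree of each node = number of distinct interaction partners."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(partners) for node, partners in neighbors.items()}


def partition_by_degree(degrees: Dict[str, int]) -> Dict[str, List[str]]:
    """Split nodes into core (top 20%), sub core (20%-50%) and noncore (rest)."""
    # Assumption: ties are broken alphabetically and cutoffs are rounded down.
    ranked = sorted(degrees, key=lambda node: (-degrees[node], node))
    core_end = len(ranked) * 20 // 100
    sub_end = len(ranked) * 50 // 100
    return {
        "core": ranked[:core_end],
        "sub core": ranked[core_end:sub_end],
        "noncore": ranked[sub_end:],
    }


def count_targets(regions: Dict[str, List[str]], targets: Set[str]) -> Dict[str, int]:
    """Count how many target genes fall in each region."""
    return {
        name: sum(1 for node in nodes if node in targets)
        for name, nodes in regions.items()
    }
```

Given an edge list exported from STRING, `partition_by_degree(node_degrees(edges))` yields the three regions, and `count_targets` gives the per-region gene counts of the kind reported for each group.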

Retrieval and statistics of key miRNAs and lncRNAs

We searched PubMed (http://www.ncbi.nlm.nih.gov/pubmed) for miRNAs that play important roles in skeletal muscle denervation, Alzheimer's disease, prostate cancer and gastric cancer, respectively. The search terms were "skeletal muscle AND microRNA", "Alzheimer's disease AND microRNA", "prostate cancer AND microRNA", and "gastric cancer AND microRNA". Next, we retrieved lncRNAs that play important roles in skeletal muscle denervation and adipocyte differentiation, using the search terms "skeletal muscle AND lncRNA" and "adipocyte differentiation AND lncRNA". The results are shown in Table 8.

Table 8 The key miRNAs and lncRNAs.

Data analysis

Analyses were performed in R 3.6.1 with the clusterProfiler package. The annotation database was org.Mm.eg.db, which is distributed as an R package.