Systematic comparison and assessment of RNA-seq procedures for gene expression quantitative analysis

RNA-seq is currently considered the most powerful, robust and adaptable technique for measuring gene expression and transcription activation at the genome-wide level. Because the analysis of RNA-seq data is complex, it has prompted a large amount of research on algorithms and methods, which has resulted in a substantial increase in the number of options available at each step of the analysis. Consequently, there is no clear consensus about the most appropriate algorithms and pipelines for analysing RNA-seq data. In the present study, 192 pipelines using alternative methods were applied to 18 samples from two human cell lines, and their performance was evaluated. The raw gene expression signal was quantified by non-parametric statistics to measure precision and accuracy. Differential gene expression performance was estimated by testing 17 differential expression methods. The procedures were validated by qRT-PCR in the same samples. This study weighs up the advantages and disadvantages of the tested algorithms and pipelines, providing a comprehensive guide to the different methods and procedures applied to the analysis of RNA-seq data, both for the quantification of the raw expression signal and for differential gene expression.

In recent years, RNA-seq has emerged as an alternative to classic microarrays for transcriptome analysis [1][2][3][4] . Compared with microarrays, RNA-seq enables the study of novel transcripts and offers higher resolution, a wider detection range and lower technical variability 5,6 . RNA-seq also shows a high degree of agreement with techniques considered the gold standard in transcriptomics, such as qRT-PCR, at both absolute and relative gene expression measurement levels 7 . These facts have driven the great expansion of RNA-seq, making it the first choice in transcriptomic analysis for many scientists. Nevertheless, its widespread use has generated a large amount of research on algorithms and methods that has ultimately produced a lack of consensus about how to analyse RNA-seq data. Thus, although many different algorithms and pipelines have been proposed over the last decade, there is still much debate about which approaches provide the most precise and accurate results, and further research comparing these methods remains necessary.
RNA-seq data analysis typically involves several steps: trimming, alignment, counting and normalization of the sequenced reads and, very often, differential expression (DE) analysis across conditions. Trimming is used to increase the read mapping rate through the elimination of adapter sequences and the removal of poor-quality nucleotides. It must be employed non-aggressively, together with a wisely chosen read length, to avoid unpredictable changes in gene expression 8 and transcriptome assembly 9 . The beneficial effects of this process have been evaluated in reference-based 10,11 and de novo 12,13 RNA-seq analyses. Alignment to a reference genome or transcriptome is normally the second step in the RNA-seq workflow and has been widely evaluated by many authors [14][15][16][17][18][19] . Once the reads have been mapped, they must be assigned to a gene or a transcript, in a process known as counting or quantification. This is followed by a normalization procedure to remove possible sequencing bias. Since counting followed by normalization is a crucial component of RNA-seq data analysis, several methods have been developed and many comparative studies evaluating their suitability have been published [20][21][22][23][24][25][26][27][28] . The final step in most RNA-seq studies is DE analysis; the main concern at this point is how to choose among the many available DE methods.
Figure 1. The left panel (1) represents the raw gene expression quantification workflow. Every box contains the algorithms and methods used for the RNA-seq analysis at the trimming, alignment, counting, normalization and pseudoalignment levels. The right panel (2) represents the algorithms used for differential gene expression quantification. *HTSeq was performed in two modes: union and intersection-strict. **EdgeR exact test, edgeR GLM and NOISeq internally offer three normalization techniques that were evaluated separately.
Gene expression validation using qRT-PCR. To determine candidate genes for qRT-PCR, we first selected from our dataset the 1181 genes expressed in 32 healthy tissues in the RNA-seq data published by Uhlen et al. 54 . We then performed a filtering process, removing all genes with < 4 gene expression units (alignments, FPKM, TPM, etc., depending on the normalization method) in the six DMSO-treated samples for each pipeline. This process resulted in 107 genes that satisfied the filtering criterion in all 192 pipelines (Supplementary Table S2). As these 107 genes were constitutively expressed, both in the 32 healthy tissues and in our two cell lines, we considered them our gene expression reference set, or housekeeping gene set (HKg). From this first list we selected 30 genes: 10 with the highest and 10 with the lowest median non-parametric coefficient of variation (CoV) 55 , plus 10 random genes from the mid-range. Additionally, we selected two housekeeping genes commonly used in RNA studies, GAPDH and ACTB, generating a second list of 32 genes. Starting from the same samples used in RNA-seq, 1 µg of total RNA was reverse transcribed to cDNA using oligo dT from the SuperScript First-Strand Synthesis System for RT-PCR (Thermo Fisher Scientific). Taqman qRT-PCR mRNA assays (Applied Biosystems) were carried out in duplicate on these 32 genes, which are highlighted in Supplementary Table S2. The oligonucleotide probes used for the qRT-PCR assays are reported in Supplementary Table S3.
To measure DGE by qRT-PCR we used the ΔCt method, calculated as ΔCt = Ct(gene) − Ct(normalization factor) for each sample. Three normalization approaches were tested: (a) endogenous control normalization, where the endogenous control was calculated as the mean of the GAPDH and ACTB Ct values for each sample; (b) global median normalization, in which the normalization factor was calculated as the median Ct value of genes with Ct < 35 for each sample; and (c) most stable gene, determined using the four algorithms available in the RefFinder webtool: BestKeeper 56 , NormFinder 57 , geNorm 58 and the comparative delta-Ct method 59 . The gene ECHS1 ranked in the top position and was considered the most stable in our dataset. We detected a bias in GAPDH and ACTB expression caused by the drug treatments: both genes were under-expressed in the treatment conditions, as shown in Supplementary Fig. S1. In view of this, we rejected normalization method (a). Methods (b) and (c) performed equally well, although (c) looked more robust than (b) as it better captures the dispersion of Ct values within each sample. Finally, we chose method (b) (global median normalization) for our downstream analysis.
Pseudoalignment. These methods dispense with the classical alignment process and carry out alignment, counting and normalization in one single step. We applied three commonly used pseudoaligners: Kallisto 26 (v.0.43.1), Sailfish 28 (v.0.9.2) and Salmon 72 (v.0.8.2). Salmon was executed using the FMD and the quasi-mapping-based (QMB) indexing modes. Details about the pseudoaligners are given in Supplementary Note S1.
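As an illustration, the global median ΔCt normalization described above can be sketched in a few lines of Python (a minimal sketch with a hypothetical helper name, `delta_ct_global_median`; the actual analysis was run on the full Ct tables):

```python
import numpy as np

def delta_ct_global_median(ct, detection_limit=35.0):
    """Global median normalization (method (b)): for each sample, the
    normalization factor is the median Ct of the genes detected below
    the cycle threshold, and delta-Ct = Ct(gene) - that median."""
    ct = np.asarray(ct, dtype=float)
    dct = np.empty_like(ct)
    for j in range(ct.shape[1]):                  # one column per sample
        detected = ct[:, j] < detection_limit
        dct[:, j] = ct[:, j] - np.median(ct[detected, j])
    return dct
```

Genes with Ct ≥ 35 are excluded from the median but still receive a ΔCt value.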
Statistical approaches for pipeline precision analysis. Precision was tested on the six DMSO-treated control samples (CLA-T0 and CLB-T0), considering each cell line separately (three replicates per cell line). For this purpose, we used the 107 HKg expressed in the six samples and in the 32 healthy tissues reported previously (Supplementary Table S2). The median precision rank was calculated from the CoV of the 107 genes. The CoV is defined as CoV = MAD / |median(X)|, where MAD is the median absolute deviation for each gene, MAD = median_i(|X_i − median(X)|), with X_i being the expression value of gene X in the i-th sample. This parameter was chosen because it is a non-parametric statistic that quantifies precision independently of the measurement units, and it considers the median variability through the MAD, in the same context as the median. Given this background, we first calculated the CoV value for every single gene within each pipeline and each cell line. Secondly, we ranked the CoV values across the 192 pipelines generated from the combination of all the above algorithms and methods (Fig. 1). Thirdly, we computed the median of the ranks of the 107 HKg in each pipeline and considered this value the precision index of the pipeline. The lower the index, the more precise the pipeline. This procedure is graphically described in the upper panel of Fig. 2.
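The three-step precision index can be sketched as follows (a hypothetical Python helper assuming numpy and scipy, with gene order consistent across pipelines; the study's actual computations were done in R):

```python
import numpy as np
from scipy.stats import rankdata

def nonparametric_cov(x):
    """Non-parametric coefficient of variation: MAD / |median|,
    with MAD = median(|x_i - median(x)|) computed across replicates."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    return np.median(np.abs(x - med)) / abs(med)

def precision_index(expr_by_pipeline):
    """Precision index per pipeline: rank each gene's CoV across
    pipelines, then take the median rank per pipeline.
    The lower the index, the more precise the pipeline.

    expr_by_pipeline : dict mapping pipeline name -> list of per-gene
    replicate vectors (genes x replicates)."""
    names = list(expr_by_pipeline)
    cov = np.array([[nonparametric_cov(gene) for gene in expr_by_pipeline[p]]
                    for p in names])               # pipelines x genes
    ranks = np.apply_along_axis(rankdata, 0, cov)  # rank pipelines gene-wise
    return dict(zip(names, np.median(ranks, axis=1)))
```

Because the CoV is unitless, pipelines reporting counts, FPKM or TPM can be ranked on the same scale.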
Statistical approaches for pipeline accuracy analysis. Accuracy was tested on the six DMSO-treated samples, considering each cell line separately. We calculated the accuracy of each pipeline based on the qRT-PCR expression values of the 32 genes, qRT-PCR being considered a gold standard method due to its accuracy and sensitivity 73 . The Pearson correlation coefficient (r) was used to assess the association between the RNA-seq data and the qRT-PCR Ct values. We then ranked the correlation coefficient of each gene across the pipelines. Finally, a median rank was calculated for each pipeline. The lower the rank, the more accurate the pipeline. This procedure is graphically described in the bottom panel of Fig. 2. All r correlation coefficients were calculated using the corr.test function implemented in the psych 74 R package (v.1.9.12.31).
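The accuracy ranking can be sketched similarly (a hypothetical Python helper; the direction in which the r values are ranked is an assumption of this sketch, motivated by the fact that Ct values fall as expression rises, so stronger agreement with qRT-PCR means a more negative r):

```python
import numpy as np
from scipy.stats import pearsonr, rankdata

def accuracy_rank(expr_by_pipeline, qpcr_ct):
    """Accuracy index per pipeline: correlate each gene's RNA-seq values
    with its qRT-PCR Ct values across samples, rank the coefficients
    across pipelines gene by gene, and take the median rank per pipeline.
    A more negative r (better agreement) receives a lower (better) rank.

    expr_by_pipeline : dict pipeline -> list of per-gene sample vectors
    qpcr_ct          : list of per-gene Ct vectors over the same samples."""
    names = list(expr_by_pipeline)
    r = np.array([[pearsonr(expr_by_pipeline[p][g], qpcr_ct[g])[0]
                   for p in names]
                  for g in range(len(qpcr_ct))])   # genes x pipelines
    ranks = np.apply_along_axis(rankdata, 1, r)    # rank pipelines per gene
    return dict(zip(names, np.median(ranks, axis=0)))
```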
TOP pipeline selection. The best pipelines were determined by summing the precision and accuracy ranks within each cell line, thus giving the same weight to both parameters. The lower the summation, the better the pipeline. We then evaluated 17 DE methods, obtained from the combination of the different DE and normalization approaches; details about these methods are given in Supplementary Note S1. The DE methods were executed using the top 10 pipelines regardless of the normalization step, because most DE methods only accept raw counts as input. All these algorithms were assessed under five experimental contrasts, resulting from the combination of the two cell lines with the two treatment options and DMSO (Fig. 3). Contrasts were classified according to the number of DEGs detected; in descending order of DEGs, the sequence was: CLA-T0 vs. CLB-T0 > CLA-T1 vs. CLA-T0 > CLA-T2 vs. CLA-T0 > CLB-T1 vs. CLB-T0 > CLB-T2 vs. CLB-T0. All the above DEG comparisons were carried out at three FDR cut-offs: FDR < 0.05, FDR < 0.01 and FDR < 0.001. The similarity among the 17 methods was calculated in R with the dist function from the stats package (v.3.5.1), using the Euclidean distance as the distance measure and group average as the linkage method. Dendrograms were depicted using the dendextend package (v.1.13.3). The performance of the DE methods was determined through the measurement of 7 diagnostic test parameters: sensitivity or true positive rate (TPR), specificity or true negative rate (TNR), positive predictive value (PPV), negative predictive value (NPV), accuracy (ACC), area under the receiver operating characteristic curve (AUC) and the Matthews correlation coefficient (MCC). These parameters were calculated using as reference the Benjamini-Hochberg adjusted p-values from a two-sample t-test on the qRT-PCR data. The true positive, true negative, false positive and false negative estimates were defined based on these qRT-PCR adjusted p-values and compared with the adjusted p-values obtained by each RNA-seq DE method.
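The confusion-matrix parameters above follow standard definitions and can be sketched as (a minimal Python illustration; AUC is left out because it requires the underlying score distribution rather than a single TP/TN/FP/FN summary):

```python
import math

def diagnostic_parameters(tp, tn, fp, fn):
    """Six of the seven diagnostic test parameters computed from a
    confusion matrix of true/false positives and negatives."""
    return {
        "TPR": tp / (tp + fn),                   # sensitivity
        "TNR": tn / (tn + fp),                   # specificity
        "PPV": tp / (tp + fp),                   # positive predictive value
        "NPV": tn / (tn + fn),                   # negative predictive value
        "ACC": (tp + tn) / (tp + tn + fp + fn),  # accuracy
        "MCC": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```

The MCC is the most demanding of these parameters, since it only approaches 1 when all four cells of the confusion matrix are favourable.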
Equipment. All the computational procedures for raw gene expression quantification were performed on a 64-bit computer with 264 GB of RAM and 64 CPUs running CentOS 6.9 Linux. The DGE quantification was carried out on a 64-bit computer with 64 GB of RAM and 24 CPUs running either Ubuntu 14.04 Linux or Windows 10.
Precision and accuracy of the evaluated pipelines. Precision was calculated using the 107 housekeeping reference genes (HKg) (Supplementary Table S2) individually for each cell line, as described in "Methods" and in the upper panel of Fig. 2. This analysis showed that pipelines using pseudoalignment algorithms were disproportionately represented among the top-ranked positions. Specifically, the most precise algorithm was Salmon in conjunction with TPM normalization. The least precise pipelines were those in which the normalization process was less strict, such as raw reads, NumReads, effective counts and estimated counts (Supplementary Table S4). The accuracy of the 192 pipelines was estimated by testing 32 of the 107 genes used for the precision analysis by qRT-PCR (highlighted genes in Supplementary Table S2). Accuracy was calculated based on the qRT-PCR results, as described in the "Methods" section and in the bottom panel of Fig. 2. The top-ranked pipelines were those in which HTSeq was the counting method and TMM was the normalization approach. We also noticed that traditional alignment methods like RUM occupied top positions. The most accurate single pipeline used the Sailfish pseudoaligner; however, most pipelines that used a pseudoalignment method ranked in the lowest positions of the accuracy list (Supplementary Table S5).
Finally, we generated a global ranking for all the tested pipelines, simultaneously considering the precision and accuracy on RGEQ. The top 10 ranked pipelines are shown in Table 1, and the complete ordered list is presented in Supplementary Table S6. Pipelines that used the counting algorithm HTSeq with its Union variant and the TMM normalization were the most represented in the highest positions of the pipeline ranking. We also observed that the RUM, Tophat2 and STAR alignment algorithms appeared very frequently at the top positions of this list. We found no pattern for the trimming algorithms. These results clearly show that counting and normalization methods are the most critical steps in the RNA-seq analysis process. Particularly, considering the above results, we concluded that the combination of Trimmomatic + RUM + HTSeq Union + TMM was the most precise and accurate pipeline.

Influence of individual RNA-seq analysis steps on raw gene expression quantification.
We assessed the impact of each individual RNA-seq analysis step on RGEQ. This analysis was performed at five levels: trimming, alignment, counting, normalization and pseudoalignment.
Trimming. We explored the influence of the three trimming algorithms on RGEQ. The three algorithms displayed different patterns regarding surviving paired reads and alignment rate. Cutadapt produced the highest overall rate of surviving paired reads (95.5%) and the lowest alignment rate (93.4%), whereas BBDuk obtained the highest percentage of aligned reads (97.5%) (Supplementary Table S7). Despite these divergences, we did not find statistically significant differences in RGEQ among the three algorithms (FDR > 0.05) (Fig. 4a and Supplementary Table S8).
Alignment. At this level, we evaluated five aligners representing three alignment strategies: alignment against the reference genome, alignment against the reference transcriptome, and a hybrid approach. This process was evaluated on 156 pipelines, because the 36 remaining pipelines were used to assess pseudoalignment performance. Two parameters could have affected the RGEQ ranking at this level: the number of concordant alignments and the unmapped reads output by each aligner. We found large differences in these two parameters among the five aligners. STAR was the algorithm with the highest number of concordant alignments (median = 93%, MAD = 1.0), closely followed by TopHat2 (median = 90.1%, MAD = 2.0). In contrast, Bowtie2 only reached 41.5% (MAD = 1.7) of concordant alignments, showing statistically significant differences with the other four aligners (FDR < 0.05) (Supplementary Fig. S2). Regarding unmapped reads, HiSat2 and STAR outperformed the other aligners (FDR < 0.05), both producing less than 5% of unmapped reads, although HiSat2 was particularly effective, with only 1.6% (MAD = 0.3) unmapped reads (Supplementary Fig. S2).
Table 1. Top 10 pipelines based on the overall precision and accuracy ranking. The lower the median overall precision and accuracy, the better the pipeline performance.
In view of this, we tested the influence of these aligners on RGEQ and, despite their particularities, found no statistically significant differences (FDR > 0.05) between the median ranks of the pipelines involving these alignment algorithms (Fig. 4b and Supplementary Table S9). However, we found a bimodal distribution in the RGEQ ranking values for these algorithms. On investigating the cause of this bimodality, we discovered that it was associated with the normalization method employed: the pipelines that used raw reads, effective counts, estimated counts or coverage reached the lowest ranks. A reanalysis excluding pipelines with these normalization approaches revealed that Bowtie2 pipelines reached poorer RGEQ ranks than pipelines employing the other alignment methods (Fig. 4b).
Counting and normalization. After the alignment step, we evaluated the influence of the counting methods on RGEQ. Pipelines based on Cufflinks and RSEM reached the highest positions in the pipeline ranking, followed by HTSeq- and Stringtie-based pipelines. These last two methods also obtained high-ranking positions, but they showed a bimodal distribution in their rank values that translated into greater variability, similar to that observed in the alignment step (Fig. 4c). Removing the pipelines that used raw reads, effective counts, estimated counts or coverage as the normalization approach eliminated the bimodal distributions and revealed a better performance of the HTSeq pipelines (HTSeq-Inter and HTSeq-Union, Fig. 4c). It should also be noted that the eXpress pipelines showed the poorest rates for the sum of precision and accuracy (Fig. 4c and Supplementary Table S10).
Next, we evaluated the influence of the normalization methods applied after the counting procedure. Pipelines using the Trimmed Mean of M-values (TMM) method performed best on RGEQ. Other normalization methods, such as Relative Log Expression (RLE), which was second best, and Transcripts Per Million (TPM) or Fragments Per Kilobase of transcript per Million mapped reads (FPKM), also reached high-ranking positions for precision and accuracy in the RGEQ analysis (Fig. 4d and Supplementary Table S11). Considering these results, the normalization procedure proved to be an essential step in RGEQ, since the largest statistically significant differences among the pipeline ranks were associated with the normalization method used.
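For reference, the textbook definitions of FPKM and TPM can be sketched as follows (a simplified Python illustration using annotated gene lengths; real quantifiers such as RSEM or Salmon use effective lengths instead, and TMM/RLE scaling factors are computed across samples rather than per sample):

```python
import numpy as np

def fpkm(counts, lengths_bp):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    counts = np.asarray(counts, dtype=float)
    kb = np.asarray(lengths_bp, dtype=float) / 1e3
    return counts / kb / (counts.sum() / 1e6)

def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalize first, then rescale so
    every sample sums to one million, which keeps TPM comparable
    across samples in a way that FPKM totals are not."""
    rate = np.asarray(counts, dtype=float) / (np.asarray(lengths_bp, dtype=float) / 1e3)
    return rate / rate.sum() * 1e6
```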
Quantification by pseudoalignment. The RGEQ analysis was completed with the evaluation of three pseudoalignment algorithms, considering their different alignment modes. These algorithms have a major advantage over traditional alignment methods, as they carry out alignment and counting in a single step. Regarding their performance on RGEQ, we found no statistically significant differences (FDR > 0.05) between their cumulative ranks of precision and accuracy (Fig. 4e and Supplementary Table S12). When we compared these pseudoalignment algorithms with traditional aligners, we found a similar performance on RGEQ (FDR > 0.05) among the algorithms tested from both methodologies (Supplementary Table S13).
Performance of differential gene expression methods. We tested 17 DE methods obtained from the combination of the different DE and normalization approaches. They were tested under six experimental conditions with three biological replicates per condition. Comparisons under these conditions are explained in Fig. 3. The efficiency of the DE procedures was checked at three commonly used FDR cut-offs: FDR < 0.05, FDR < 0.01 and FDR < 0.001.
First, we compared the detection ability of the DE methods. There was great homogeneity among the methods tested in the comparisons with a large number of gene expression changes, but this homogeneity decreased as the compared experimental conditions became more similar (Fig. 5). Consistent with the findings of Seyednasrollah et al. 43 , we found that Cuffdiff was generally the algorithm that detected the smallest number of gene expression changes, whilst SAMseq detected the largest number of changes in all compared conditions. We also observed that SAMseq and Ballgown lost detection power at the FDR < 0.01 and FDR < 0.001 levels. BaySeq, EBSeq and Cuffdiff also lost detection power when the contrasted conditions were more similar. On the other hand, methods derived from edgeR, limma and NOISeq were stable in all comparisons at the three FDR levels analysed (Fig. 5a-e). The similarity analysis of these 17 methods based on the Euclidean distance detected a high correspondence among edgeR, limma and DESeq2 in all scenarios. However, methods such as SAMseq, Cuffdiff, EBSeq and Ballgown showed greater distances from most of the other methods. NOISeq, meanwhile, was the method most variable with respect to the DEG scenario (Supplementary Fig. S3).
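The similarity analysis described in "Methods" (Euclidean distance with group-average linkage) can be reproduced in outline as follows (a Python sketch in which scipy stands in for the R stats/dendextend calls; the `cluster_de_methods` name and toy input are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def cluster_de_methods(deg_counts):
    """Hierarchical clustering of DE methods by the similarity of their
    DEG calls: Euclidean distance with group-average (UPGMA) linkage,
    mirroring the R dist()/average-linkage analysis in Methods.

    deg_counts : 2D array, methods x contrasts, holding the number of
    DEGs each method detects in each contrast."""
    d = pdist(np.asarray(deg_counts, dtype=float), metric="euclidean")
    return linkage(d, method="average")
```

scipy.cluster.hierarchy.dendrogram applied to the returned linkage matrix would then draw the corresponding dendrogram.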
We also analysed 7 diagnostic test parameters for each DE method using qRT-PCR as described in the "Methods" section. The individual analysis of each parameter revealed substantial differences among the 17 methods (Fig. 6). Particularly, regarding the MCC (Fig. 6a), which evaluates the quality of the classification achieved by each method, some DE methods such as NOISeq, Cuffdiff, baySeq and some of the edgeR variants showed a moderate and positive relationship (MCC > 0.3) between their results and the true model. Concerning ACC and AUC (Fig. 6b,c), the most accurate methods, that is, those with a higher ACC, were limma trend, limma voom and baySeq at FDR < 0.001 (ACC = 0.78), while the method with a greater AUC, or in other words, the method with the highest discrimination capacity, was baySeq at FDR < 0.001 (AUC = 0.81).

Scientific Reports
| (2020) 10:19737 | https://doi.org/10.1038/s41598-020-76881-x
Next, we carried out a performance analysis simultaneously considering the 7 parameters in three experimental approaches: performance by number-of-DEGs scenario, performance by statistical significance cut-off, and overall performance. Regarding performance by number of DEGs, we observed great variability among the 17 DE methods, so that the first ranking position was reached by different methods in the five analysed scenarios. Methods such as EBSeq, DESeq2 and baySeq showed highly variable behaviour depending on the number of DEGs in the analysed scenario (Supplementary Fig. S4). On the other hand, the analysis by statistical significance revealed some methods with fairly high and stable performance ranks: limma trend, limma voom and baySeq (with good ranks also observed for edgeR exact and NOISeq, although with suspected greater variability depending on whether they used UQ, FPKM or TMM) (Supplementary Fig. S5).
To carry out the third approach, we considered both the five DEG scenarios and the three statistical significance levels. Taken together, baySeq and NOISeq UQ were the methods that showed the best behaviour, even though both ranked between 9th and 17th place in at least four of the approaches tested (Supplementary Fig. S6). These two methods were closely followed by limma trend, limma voom and edgeR GLM. All these performance analyses, jointly considering all the previous approaches, are summarised in Fig. 7. We highlight that the most balanced method was limma trend since, unlike the other 16 methods, it ranked among the 8 best methods in all the approaches. Other methods with good performance in all scenarios were baySeq, NOISeq FPKM, limma voom and some variants of edgeR GLM.
To complete the DE analysis, we estimated the influence of normalization on DEG detection by the DE methods that admit multiple normalization approaches. We observed subtle differences in DEG detection between the normalization approaches with regard to AUC and ACC, particularly in some analysis scenarios of the NOISeq algorithm, in which FPKM normalization notably surpassed the other methods (Supplementary Fig. S7).

Discussion
The findings of our study point to an optimal workflow for RNA-seq gene expression data analysis, based on the performance of multiple algorithms and pipelines freely available to the scientific community.
In a first stage, we compared 192 pipelines for quantifying raw gene expression. According to our results, the pipeline that obtained the best precision and accuracy rankings, and was therefore the best combination of algorithms, was Trimmomatic + RUM + HTSeq Union + TMM. That said, the combination of methods that made the difference was HTSeq at the counting level and TMM at the normalization level, as they were part of 10 out of the top 10 pipelines. On the other hand, pipelines based on raw alignments, Stringtie's coverage or Salmon's NumReads occupied most of the last positions in the ranking, confirming the importance of a proper combination of counting and normalization methods.
In this work, we also compared the most commonly used methods across the stages of RNA-seq analysis: trimming, alignment, counting and normalization. Interestingly, we did not find substantial differences in RGEQ related to the trimming algorithm or the aligner used. This latter finding is consistent with the recent work of Schaarschmidt et al. on Arabidopsis thaliana samples 84 , which showed that the seven aligners analysed could be applied interchangeably in RNA-seq data analysis. In contrast, counting and normalization proved to be a critical step in our RNA-seq data analysis. This has also been reported by Robert et al. 85 , who demonstrated that the counting step may overestimate or underestimate the level of gene expression. We also detected a preponderance of the HTSeq counting method (under the Union and Intersection-strict variants) in the top-ranked positions of our pipeline list. With respect to data normalization, pipelines based on the TMM method outperformed those using RLE, FPKM, TPM and all the other normalization approaches tested. This finding is consistent with those of Wu et al. 21 and Maza et al. 20 , but disagrees with that of Li et al. 22 , who concluded that gene normalization methods did not improve the gene expression calculations provided by the raw counts.
We also measured the performance of pseudoalignment on RGEQ. Pseudoaligners have recently entered the scene as an alternative to the classic alignment algorithms. One of their main advantages is that they carry out the processes of alignment, counting and normalization in a single step. We have seen that pipelines based on pseudoaligners have good precision in gene expression estimation, but their accuracy is inferior to that of the classic aligners. In any case, pseudoaligners could be a good alternative as RNA-seq data exploration tools, since they execute faster than conventional alignment methods.

Figure 7. Summary of the performance of the RNA-seq differential gene expression analysis methods. This graph includes three experimental approaches for the 17 methods: performance by number-of-DEGs scenario, performance by statistical significance cut-off and overall performance.

Regarding RNA-seq DGE quantification, we established that the performance of the methods depended on the number of DEGs present under the contrasted experimental conditions and on the FDR cut-off used to determine statistical significance. Our systematic analysis revealed that limma trend obtained the best overall performance, closely followed by limma voom, NOISeq FPKM, baySeq and some derivations of edgeR. It is interesting that, even though their assumptions about the underlying distribution differ, the performance of these methods is comparable, as was also described by Rapaport et al. 39 with other sets of methods. Of note, in spite of their similar performance, we detected differences in the behavioural patterns of these methods. Whilst limma trend, limma voom and edgeR showed a good balance among the seven diagnostic test parameters analysed, baySeq and NOISeq tended to be biased in favour of some of them. In the case of baySeq, performance was sustained by the excellent specificity of this method, but we detected a substantial lack of sensitivity. With respect to NOISeq, its performance was supported by good sensitivity and precision, but we found low levels of specificity. A similar pattern of unequal performance between diagnostic test parameters for NOISeq was also reported in the work of Williams et al. 51 , who found that NOISeq exhibited the lowest recall (sensitivity) but agreed with our work regarding the high precision of this algorithm.
Another important aspect of the DE analysis is the possible influence of normalization on the results. Normalization is a critical step in RNA-seq DGE analysis since it enables expression levels to be compared. According to our results, the choice of the normalization method had little effect on DEG detection compared with its effect on RGEQ. These findings are supported by the research of Assefa et al. 86 and Seyednasrollah et al. 43 , who reported a considerable overlap among DE methods regardless of the normalization approach used. However, our results contrast with those of Dillies et al. 25 , Bullard et al. 87 and Zyprych-Walczak et al. 88 , possibly because these authors evaluated several normalization approaches regardless of the DE method, while we only assessed the normalization provided by each DE method. It is also of note that Ballgown and SAMseq performed worse than the other methods when the compared conditions had a low number of DEGs. It must be borne in mind that this behaviour may lead to a high frequency of false-negative deregulated genes in the data analysis.

Data availability
RNA-seq data used in this study are deposited in the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) under the accession number GSE95077.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.