Novel Data Transformations for RNA-seq Data Analysis

We propose eight data transformations for RNA-seq data analysis aiming to make the transformed sample mean to be representative of the distribution center since it is not always possible to transform count data to satisfy the normality assumption. Simulation studies showed that limma based on transformed data by using the rv transformation (denoted as limma+rv) performed best compared with limma based on transformed data by using other transformation methods in term of high accuracy and low FNR, while keeping FDR at the nominal level. For large sample size, limma based on transformed data by using the 8 proposed transformation methods had similar performance to limma based on transformed data by using existing transformation methods for equal library size scenarios. Otherwise, limma based on transformed data by using the rv, lv, rv2, or lv2 transformation, or by using the existing voom transformation performed better than limma based on data from other transformation methods. Real data analysis results showed that limma+ l2 performed best, while limma+ rv also had good performance.

in different properties of the count matrix. Two common properties are sparsity and skewness. Sparsity means that many counts in the count matrix are zero. Skewness means that the histogram of all counts in the count matrix is usually skewed. Skewness indicates that data transformation is required to apply linear regression analysis, which assumes data from normal distributions. Sparsity indicates that the log2 transformation, which is commonly used in gene microarray data, could not be directly applied to RNA-seq data analysis since log2(0) does not exist. It is still expensive to collect RNA-seq data for large sample size. Hence, existing RNA-seq datasets usually have small sample size. To address these two common properties, count distributions, such as Poisson, negative binomial, and inflated Poisson distributions, have been proposed to fit RNA-seq data [6,7]. Commonly used R Bioconductor packages that fit RNA-seq data using count distributions include edgeR [8,9], DESeq [10], and DESeq2 [11]. These methods could borrow information across genes to increase the power of the tests for detecting genes differentially expressed between two conditions (e.g., cases versus controls).
The distributions of counts are not as statistical tractable as normal distributions [12].
Moreover, there are much fewer analytic tools for count distributions than there are for normal distributions in statistical analysis. Law et al. (2014) [12] proposed the voom transformation to transform the count distribution to a distribution close to the normal distribution in RNA-seq data analysis and demonstrated that using limma [13] with voomtransformed count data performed comparable to count-based RNA-seq analysis methods, such as edgeR [8,9], DESeq [10], baySeq [14] and DSS [15].
The voom transformation is a sample-specific transformation, defined as log-counts per million (log-cpm): = log 2 ( + 0.5 + 1.0 × 10 6 ) where rgi is the count of the g-th mRNA gene for the i-th sample, Ri is the total counts (Ri=r1i+r2i+…+rGi) for the i-th sample, g=1, …, G, i=1, …, n, G is the number of mRNA genes, and n is the number of samples.
The goal of the voom transformation is to make the empirical distribution of transformed RNA-seq data closer to a normal distribution so that the moderate t tests (limma) could be used. However, it is not always possible to transform count data to have a distribution closer to a normal distribution in real data analysis [16]. For example, for the SEQC data, that was analyzed in real data analyses part of [12], the empirical distribution (i.e., histogram) of the voom transformed data is still far from a normal distribution ( Figure 1).
The histogram is based on the pooled data Ygi, g=1, …, 92, i=1, …, 8. In this article, we proposed to relax the normality requirement for a data transformation.

Results for simulation studies
In this article, we proposed 8 data transformation methods to improve the voom the simulation scheme in [12]. For each simulated RNA-seq dataset, we applied limma to the transformed data.
In addition, we would like to evaluate if using non-parametric approaches would have better performance than using parametric approaches in analyzing RNA-seq data, the distribution of which is non-normal. Specifically, we used the Wilcoxon rank sum test (denoted it as Wilcoxon) on each gene based on the untransformed counts. We then adjusted p-values to control false discovery rate < 0.05.  (e)). All methods, except for limma with r, rv, and rv2, had median FNR equal to one (i.e., all true positives were identified as negatives), indicating all the methods had lower testing power when sample size is small and samples have different library sizes.
We also did simulations with large sample size (nCases=nControls=100) to compare the performance of the 10 different methods. Supplementary Figure 1(a) showed that when library sizes are equal, the methods (limma with r, rv, r2, and rv2), which are based on the root transformations, had slightly higher median accuracy than Wilcoxon and limma with voom. Supplementary Figure 1  showed that limma with voom, rv, and rv2 had similar FNR and they had lower median FNR than other 7 methods.
Overall, the Wilcoxon test without data transformation can be used to detect differentially expressed genes for RNA-seq data when sample size is large and sample library sizes are unequal. If data transformation is preferred, then limma with voom, rv, or rv2 can be used.
We also did simulation studies with total sample sizes 60 and 100 (i.e.,   Table 1 summarize the results of the eight simulation scenarios. For each simulation scenario, we used one-sample t-test to test the null hypothesis 0 that the mean of the 100 estimated FDR from the 100 simulated datasets is significantly ≤ 0.05 versus the alternative hypothesis that the mean of the 100 estimated FDR from the 100 simulated datasets is significantly > 0.05. If the p-value of the one-sample t-test < 0.05, we reject 0 . We counted the number of rejections for each method. For a given scenario, we also ranked median accuracy for methods that did not reject the null hypothesis 0 . For ranks with ties, average ranks were used. For the methods that rejected the null hypothesis 0 , we set their ranks as missing values. The left panel of Figure 3 is the plot of the number of rejections versus mean rank considering all 8 scenarios and showed that limma with rv had the best performance with high accuracy, while keeping nominal FDR (0.05) in all simulation scenarios. We also can see that (1)  differences. When sample library sizes are un-equal (Figure 4 (b) and Figure 4 (d)), r, l, rv, lv, rv2, and lv2 still had the mean-median difference close to zero. However, r2 and l2 had much larger mean-median difference than zero. While l2 had smaller difference than voom, the r2 transformation had much larger mean-median difference than voom.

Real data analysis
The first two of our real data analysis datasets were based on the SEQC datasets [17]. We  The information about which genes are truly differentially expressed between two groups were determined based on qRT-PCR (Quantitative Real-Time PCR) experimental data. Figure 5 showed that limma with r, l, rv, r2, l2, rv2 had higher accuracy than limma with voom and showed that limma with lv and lv2 had equal accuracies to limma with voom.
We noticed that Wilcoxon, a non-parametric method, failed to detect any true positives, although Wilcoxon had slighly higher accuracy to limma with voom.   Supplementary Table 2 is a summary of the performances of Wilcoxon test and limma with all 9 RNA-seq transformations for real data analyses, we ranked accuracy for two SEQC tests and the proportion of validation for GSE95587, we then obtained the mean ranks to compare the performance. Supplementary Table 2 showed that limma with all 8 proposed transformations had smaller mean ranks than limma with voom in three real data analyses.

Discussion
In this article, we proposed 8 new RNA-seq data transformations to improve the voom transformation for RNA-seq data analysis. The simulation results (Supplementary Table   1) showed that overall limma with rv had the best performance compared to Wilcoxon test, limma with voom, and limma with the other 7 proposed transformations. We also observed that limma with rv2 had the 2 nd best performance when sample size is small (nCases=nControls=3). Limma with the voom transformation and limma with the r2 transformation performed last among the 10 methods compared when nCases=nControls=3. When sample size is medium (nCases=nControls=30 or nCases=nControls=50), limma with voom, rv, and rv2 had the best performance. When sample size is large, limma with r2 had the best performance followed by the Wilcoxon test. Limma with the proposed log-transformation-based methods did not perform as well as other methods in the simulation studies. However, limma with the l and l2 transformation performed the best in the real data analyses. The real data analyses also showed that limma with voom did not perform as well as the other 9 methods. The inconsistency between the results of simulation studies and those of real data analyses is probably due to the fact that the real data are much more complicated and different than the simulated datasets.
Having sample mean close to sample median for pooled data could not guarantee that for each gene, the sample mean is close to the sample median for cases and for controls, respectively. Also, the empirical distribution of the pooled data after data transformation might still be skewed even if the sample mean is very close to sample median. Hence, robust linear regression might improve the performance of limma after data transformation.
The voom transformation was proposed by [12] and has been implemented in the limma We observed that none of the 8 transformations could dominate each other, although they performed better than voom in most scenarios ( Figure 3 and Supplementary Table 1). For example, limma with the r2 transformation performed best when sample size is large and samples have equal library size, but could not beat limma with rv when sample size is small or moderate (e.g., 6, 60 or 100). Another limitation of our study is that in our real data analyses, no independent cohorts are available to do validation. However, we did 100 times of random splits. In each split, we had discovery set and validation set. In future, we will do validation studies when independent validation sets are available. The third limitation of this study is that the 8 proposed data transformations aim to make the sample mean closer to the distribution center after data transformation. However, the transformed distribution might not be close to a normal distribution. Hence, robust linear regression models are needed since ordinary linear regression requires the normality assumption. Future research is warranted on this subject as well.

Conclusions
In simulation studies, limma with the rv transformation performed better than limma with the voom transformation for data with small or moderate sample size. For large sample size, limma with the r2 transformation performed better than limma with the voom transformation. In real data analysis, several (l2, l, r2, r, rv, and rv2) of our proposed transformations performed better than voom. We hope these novel data transformations could provide investigators more powerful differentially expression analysis using RNAseq data.

Eight new data transformations
We The r transformation is defined as: where is the count of the -th gene for the -th sample. The optimal value for the parameter is to minimize the difference between the sample mean and the sample median of the pooled data: The rv transformation is defined as: That is, we do the root transformation for the sample-specific counts per million. The optimal value for the parameter is to minimize the difference between the sample mean and the sample median of the pooled data.
The r2 transformation has the same form as the r transformation: However, the criterion to estimate the optimal value of is different. The optimal value for the parameter is to minimize the difference between the sample mean and the sample median of each sample: where ( ) = ∑ =1 / and ( ) are the sample mean and sample median of the i-th sample.
The rv2 transformation is a combination of the rv transformation and the r2 transformation, defined as: where is the sample-specific counts per million. The optimal value for the parameter is to minimize the difference between the sample mean and the sample median of each sample.
The l transformation is defined as: The optimal value for the parameter is to minimize the difference between the sample mean and the sample median of the pooled data: The lv transformation is defined as: where is the sample-specific counts per million. The optimal value for the parameter is to minimize the difference between the sample mean and the sample median of the pooled data.
The l2 transformation has the same form as the l transformation: However, the criterion to estimate the optimal value of is different. The optimal value for the parameter is to minimize the difference between the sample mean and the sample median of each sample: The lv2 transformation is a combination of the lv transformation and the l2 transformation, defined as: where is the sample-specific counts per million. The optimal value for the parameter is to minimize the difference between the sample mean and the sample median of each sample.

Simulation studies
In each simulation study, we generated 100 datasets. Each dataset contains 10,000 genes, among which 200 genes are differentially expressed between nCases cases and nControls controls. An inverse chi-square distribution with 40 degrees of freedom was used to generate a modest amount of gene-wise biological variation [12]. We set the number of cases equal to the number of controls (i.e., nCases=nControls).
After data transformation, we used Bioconductor package limma to detect differentially expressed genes. We also compared the results based on transformed data with the results of the Wilcoxon rank sum test based on the original counts.
(nCases=nControls=3) with equal or un-equal library sizes in their simulation studies. In our simulation studies, we evaluated the effects of sample size and library size on the performances of the 10 methods (Wilcoxon, voom, r, l, r2, l2, rv, lv, rv2, lv2)  The dataset contains 117 samples, 84 of which are from Alzheimer's cases (ADs) and 33 of which are controls (CONs). We randomly split the 117 samples into two roughly equal parts: a discovery set and a validation set. The discovery set has 42 ADs and 17 CONs.
The validation set has 42 ADs and 16 CONs. We then applied limma after data transformations to the discovery set and the validation set to detect differentially expressed (DE) genes. For the discovery set, we claimed a gene is DE if its FDR-adjusted p-value < 0.05. For the validation set, we claimed a gene is validated DE if it had a raw pvalue < 0.05 in the validation set and it had FDR-adjusted p-value < 0.05 in the discovery set. We repeated the above split-validation procedure 100 times. For each of the 10 methods, we calculated the proportion of the validated DE genes among the DE genes detected in the discovery set. The higher the proportion is, the better performance, the method is.
For the voom transformation in all real data analyses in this article, we followed Law et al. (2014) by first applying TMM scale-normalization [18] and quantile normalization before applying for the voom transformation.

Ethics approval and consent to participate
Not applicable. The real data sets we used are downloaded from public data repositories.

Consent for publication
Not applicable. The real data sets we used are downloaded from public data repositories.