## Introduction

High-throughput and next-generation sequencing data analyses have dominated much of biological research in the last decade. The major challenge is to reduce these large datasets to a manageable form for key biological inference. There has been much effort in the development of statistical tools to interpret the data, especially to identify genes that behave differently between any two samples, for example, between wild type and mutants or across time for a given stimulus1,2,3,4.

To date, the predominant approach is to apply user-defined parameters to select genes for evaluation, such as a two- or threefold difference in expression, sometimes combined with a minimum expression value and/or a statistical null hypothesis (p value) criterion5,6,7. These approaches have provided valuable insights into the underlying differential activation mechanisms. Nevertheless, to overcome the arbitrary or biased use of selection criteria, we require newer methods that provide alternative solutions.

Previously, to reveal how the transcriptional machineries of human and mouse embryonic developmental cells evolve with time, we quantified transcriptome-wide noise (squared coefficient of variation) and used it as a non-parametric metric to observe key differences between the developmental stages8. Here, we adopt a similar approach to track genes that vary or scatter significantly compared with replicate (technical or operator-induced) variability.

## Results and discussion

### Transcriptome-wide scatter

We obtained RNA-Seq datasets from the NCBI GEO database for Escherichia coli in aerobiosis, Saccharomyces cerevisiae in hypoxia, and Mus musculus embryonic stem cells (ESC) with and without Etomoxir (ETO) treatment (see Materials and Methods). After performing Transcripts Per Kilobase Million, or Transcripts Per Million (TPM), normalization of the read counts for all samples, we plotted the transcriptome-wide expression scatter between any two replicates and between the anchor time (t = 0) and the last time point for both E. coli and S. cerevisiae, and between untreated and ETO-treated mouse ESC (Fig. 1a–c).
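As an illustration of the TPM normalization step, the following is a minimal sketch rather than the code used in this study. It assumes raw read counts and gene lengths in base pairs are available as plain lists; the function name `tpm` and the toy values are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to TPM, given gene lengths in base pairs."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Step 1: length-normalize each gene (reads per kilobase).
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rpk)
    if total == 0:
        return [0.0] * len(rpk)
    # Step 2: rescale so that the values of one sample sum to one million.
    return [r / total * 1e6 for r in rpk]


# Toy example with three hypothetical genes.
counts = [100, 200, 300]
lengths = [1000, 2000, 1500]
values = tpm(counts, lengths)
print(values)  # [250000.0, 250000.0, 500000.0]
```

Because every sample is rescaled to the same total, TPM values can be compared directly across replicates, time points and conditions on the same scatter axes.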

In all the replicate plots, we observe a scatter that narrows with higher expression values, resulting in an increase of the Pearson correlation R with increasing expression (Fig. 1a–c, left and middle panels). This is expected, as the effect of noise, such as that due to technical variability, tends to become less significant at higher expression8,9,10. Thus, noise is usually a concern for lowly expressed genes. We also observe for all cell types that the transcriptome-wide expression scatter widens, with decreasing R, when samples from the anchor time are compared with other times (Fig. 1a,b, right panels), or untreated with treated samples (Fig. 1c). This indicates that a number of genes are differentially regulated in time or condition, and that the widening of their expression contributes to the observed scatter.

### Statistical distribution fitting to remove lower expression or noisy genes

It is now known that gene expressions follow certain statistical distributions, such as Pareto (power-law) or lognormal11,12,13,14. Noting that lowly expressed genes are generally prone to noise8,9,10 (Fig. 1), previously we used the statistical distribution fittings (Materials and Methods) to select genes for further evaluation13,14. Here, we adopted the same approach to remove lowly expressed “noisy” genes.

Figure S1 shows the transcriptome-wide distribution of the E. coli and S. cerevisiae data for all time points, and of mouse ESC for control and different treated conditions (Materials and Methods). Comparing a number of statistical distributions using the Akaike information criterion15, we find that the lognormal distribution is the best fit for both E. coli and mouse ESC data, while the Burr distribution is the best fit for S. cerevisiae (Fig. S1 and Table S1). Using the lower-end tail intersection as a threshold, we obtain TPM > 5 for E. coli, and TPM > 2 for both S. cerevisiae and mouse ESC, as the lower expression noise cut-off level. Overall, for subsequent DE gene analysis, we retained 3758, 5330, and 11,787 genes for E. coli, S. cerevisiae and mouse ESC data, respectively.
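To make the model-selection step concrete, the sketch below compares two candidate distributions by their Akaike information criterion, AIC = 2k − 2 ln L, where k is the number of fitted parameters and L the maximized likelihood. It is an illustration under stated assumptions, not the fitting procedure used here: only the lognormal and Pareto distributions are included because their maximum-likelihood estimates have closed forms, the data are synthetic, and the function names are hypothetical.

```python
import math
import random


def aic_lognormal(x):
    """AIC of a lognormal fit using closed-form maximum-likelihood estimates."""
    n = len(x)
    logs = [math.log(v) for v in x]
    mu = sum(logs) / n
    var = sum((l - mu) ** 2 for l in logs) / n  # MLE (biased) variance of log values
    loglik = -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n
    return 2 * 2 - 2 * loglik  # k = 2 parameters (mu, sigma)


def aic_pareto(x):
    """AIC of a Pareto fit; scale = sample minimum, shape by maximum likelihood."""
    n = len(x)
    xm = min(x)
    alpha = n / sum(math.log(v / xm) for v in x)
    loglik = (n * math.log(alpha) + n * alpha * math.log(xm)
              - (alpha + 1) * sum(math.log(v) for v in x))
    return 2 * 2 - 2 * loglik  # k = 2 parameters (xm, alpha)


# Synthetic, strictly positive "expression" values drawn from a lognormal.
rng = random.Random(0)
data = [rng.lognormvariate(2.0, 1.0) for _ in range(500)]

scores = {"lognormal": aic_lognormal(data), "pareto": aic_pareto(data)}
best = min(scores, key=scores.get)  # the lowest AIC indicates the preferred model
print(best)
```

The same comparison extends to any other candidate distribution by computing its maximized log-likelihood numerically and applying the same AIC formula.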

### Quantifying transcriptome-wide scatter as noise

To quantify the transcriptome-wide scatter of the selected genes, we revisit gene expression noise, which is defined as the expression variance over the square of the expression mean (Materials and Methods). Figures 2a and S2 show that transcriptome-wide noise is lower between replicates at any time than between the anchor time (t = 0) and other time points, or between untreated and treated conditions. The higher noise is mainly due to the differentially expressed genes (DE genes). Note that the level of noise between any two replicates is similar (approximately 0.05) for any time point or condition (Fig. 2a). This indicates the level of noise that one could expect between any two experimental samples due to technical, operator or culture media induced variability8,16. Any values beyond this level are most likely a result of the differential transcriptional mechanisms that occur in time, such as for aerobiosis and hypoxia, or between different experimental treatments.

### Identifying differentially expressed genes

The predominant way of identifying DE genes is based on setting an arbitrary expression fold-change cutoff, e.g. 1.5-, 2- or threefold changes17,18. Although these methods are generally acceptable for selecting the most highly variable genes, recent works indicate that even lowly changing genes play key regulatory roles19,20. Hence, a more objective way to identify DE genes can reveal a wider spectrum of the transcriptional processes at play.

Here, we developed software with a graphical user interface (GUI) to overlay and visualize the transcriptome-wide scatter between any 2 samples (replicates/conditions/time points). The scatters are overlaid on each other, and when the expression of any element (gene) of the dataset overlaps with another scatter, its original color (e.g. green) changes (e.g. to orange). In this simple way, we are able to distinguish and separate genes that are not overlapping and are, therefore, differentially expressed.

However, from Fig. 1, it is important to note that gene expression is variable even between replicates, and this fact should also be considered when determining DE genes. Thus, we overlaid the replicate data with the between-condition data as well, and chose the DE genes as the ones that do not overlap in all overlaid scatters. To determine DE genes between anchor time (e.g. t = 0) and target time (e.g. t = 10 min) for E. coli and S. cerevisiae, and between untreated and ETO-treated mouse ESC, we overlaid the anchor time (or untreated) and target time (or treated) replicate data together onto the required axes (Fig. 2b). As the 2 replicates for each of the two conditions result in 4 combinatorial comparisons (replicate 1-condition 1 vs. replicate 1-condition 2, replicate 2-condition 1 vs. replicate 1-condition 2, and so on), we chose DE genes as those that do not overlap in all combinations. In other words, the genes from the two-condition scatter that do not overlap (green dots) are the actual DE genes, considering the replicate combinatorial variability. In this way, we can visualize and track DE genes more objectively for every time point or condition than by setting an arbitrary expression threshold cut-off.

One limitation of this approach, however, is the size of the dot used to represent a gene: a larger dot results in fewer DE genes than a smaller one, as the dot size itself creates more artificial overlap on the scatter plots (Fig. S3). To overcome this, we performed the scatter plot overlay for a range of dot sizes and computed the noise (see above section) for the DE genes, as well as for the remaining (non-DE) genes, for each dot size used (Fig. 2c and Table 1). As expected, as the dot size increases, the number of DE genes decreases.
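The overlay logic and its dependence on dot size can be sketched as follows. This is a simplified illustration, not the ScatLay implementation: it assumes that two dots "overlap" when their Euclidean distance in log10(TPM) space is below the dot size, it uses a brute-force search that would be slow on a full transcriptome, and the function names and toy values are hypothetical.

```python
import math


def non_overlapping(points, reference, dot_size):
    """Indices of points with no reference point closer than dot_size."""
    kept = []
    for i, (px, py) in enumerate(points):
        if all(math.hypot(px - rx, py - ry) >= dot_size for rx, ry in reference):
            kept.append(i)
    return kept


def overlay_de(cond1_reps, cond2_reps, dot_size):
    """Genes whose between-condition dots avoid the replicate scatters in all
    four replicate combinations. Inputs are pairs of log10(TPM) lists."""
    # Between-replicate scatters (replicate 1 vs. replicate 2) of both conditions.
    reference = (list(zip(cond1_reps[0], cond1_reps[1]))
                 + list(zip(cond2_reps[0], cond2_reps[1])))
    de = set(range(len(cond1_reps[0])))
    for a in cond1_reps:          # replicate of condition 1 on the x axis
        for b in cond2_reps:      # replicate of condition 2 on the y axis
            between = list(zip(a, b))
            de &= set(non_overlapping(between, reference, dot_size))
    return sorted(de)


# Four hypothetical genes, two replicates per condition, in log10(TPM).
cond1 = ([1.0, 2.0, 3.0, 0.5], [1.01, 2.02, 2.98, 0.5])
cond2 = ([1.0, 2.9, 3.0, 1.5], [1.0, 3.0, 3.01, 1.52])

small = overlay_de(cond1, cond2, dot_size=0.05)
large = overlay_de(cond1, cond2, dot_size=1.0)
print(small, large)  # [1, 3] []
```

In this toy case, genes 1 and 3 change between conditions and are retained with the small dot, whereas the large dot makes every gene overlap a replicate dot, mirroring the decrease in DE genes with increasing dot size.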

To determine a more objective way to choose the correct dot size for selecting DE genes, we utilized the noise analysis again. As shown in Fig. 2a, the increased noise between conditions compared to replicates is due to DE genes, thus we used the average replicate noise threshold as a means to select the dot size (Fig. 2c). For E. coli, the size is 0.004 log10(TPM) which indicates 1194 DE genes while for S. cerevisiae, the indicated size is 0.001 log10(TPM) resulting in 4455 DE genes. For mouse ESC, dot size of 0.002 log10(TPM) yields 5019 DE genes. However, for simplicity, we used the most conservative dot size of E. coli, 0.004 log10(TPM), for all cell types. For this, we obtained 2061 and 2932 DE genes for S. cerevisiae and mouse ESC, respectively.

In particular, when we evaluate the noise of the DE genes and of the remaining non-DE genes, we find the latter's noise to be similar to replicate noise levels and remarkably lower than the DE genes' noise (Fig. 2d). This confirms that our selected DE genes are responsible for the increased noise observed between time points. Note that the overlay is not restricted to a single pair of replicates; it can also be applied across multiple repeat datasets, taking 2 replicates at a time. Figure S4 and Table S2 show that when the triplicate data, available only for E. coli, are compared in all 3 possible combinations, the total number of DE genes is almost the same (between 1191 and 1194).

### Correlation and PCA show significant response of DE genes

Previously, we used Pearson auto-correlation and principal component (PC) metrics to track the global, local and attractor gene expression responses of several cell types13,21,22,23. In studies of the Toll-like receptor induced immune response, the correlation metrics revealed that immune-related local genes were highly responsive, while myriad global genes showed significantly less response21,22. Similarly, for E. coli, we showed that the subset of attractor genes, crucial for cell state transition, displayed the most pronounced correlation metrics, while the rest of the transcriptome lacked significance23. The PC metrics revealed that the attractor genes tracked an almost identical trajectory to the transcriptome-wide response23. These data indicate that both correlation and PC metrics can be used to test the significance of ScatLay-derived DE genes.

Here, we checked the progressive time response of (i) the whole transcriptome, (ii) the DE genes, and (iii) the rest of the transcriptome without DE genes (non-DE), using the same statistical metrics for E. coli and S. cerevisiae only, as time-series data are not available for mouse (Fig. 3). Both auto-correlation and PC metrics reveal that the DE genes dominate the transcriptome-wide response, while the transcriptome without them (non-DE) shows a highly subdued response. In other words, the ScatLay-derived DE genes are key for the progressive response of both cell types.

### Comparison of ScatLay with other DE gene methods

Next, we compared our results with other commonly used techniques based on DESeq2 and NOISeq methods with the conventional threshold of twofold expression changes and 0.05 p value cut-offs. Notably, ScatLay produces more DE genes than both DESeq2 (261 genes in E. coli, 494 genes in S. cerevisiae and 553 in mouse ESC) and NOISeq (597 genes in E. coli, 1526 genes in S. cerevisiae and 1865 in mouse ESC) (Fig. 4a). One of the reasons for this, based on our noise evaluation (Fig. 4b), is that both methods adopt arbitrary threshold cutoffs that are generally more conservative. The stringent thresholds applied on NOISeq and DESeq2 give rise to DE genes with higher noise level than ScatLay DE genes for all 3 cell types. In this case, our noise analysis could help determine a better threshold cutoff for higher coverage (Fig. 4c). For NOISeq, we observe that, with a p value cut-off at 0.05, expression fold threshold for E. coli and mouse ESC yields a value of 1.75, giving rise to 780 and 2705 DE genes, respectively, while it is 1.5 for S. cerevisiae, providing 2734 DE genes when matched with ScatLay noise benchmark.

For DESeq2, however, any expression fold threshold cutoff above 1 with a p value cutoff of 0.05 yields noise that is greater than the ScatLay noise benchmark for all cell types. This indicates that DESeq2 is very stringent from the outset, and that our noise analysis could be used in conjunction with it to improve the overall coverage of DE genes. Thus, expression noise analysis is a useful tool to provide higher coverage of DE genes, and it can be used in conjunction with both ScatLay and other DE analysis methods such as the widely used DESeq2 or NOISeq, as discussed here.

To obtain a reduced or finer set of DE genes in ScatLay, we derived a method to determine a threshold cutoff based on a p value estimated from kernel density estimation (Materials and Methods). To assess whether a gene is DE, 2D kernel density estimation provides the probability that a gene in the between-condition scatter is overlapped by the between-replicate scatter (Fig. S5a). We applied the conventional p value cutoff of 0.05, in conjunction with ScatLay at scatter dot size 0.004 log10(TPM) (Fig. 2c), and found 815, 1744 and 2091 DE genes in E. coli, S. cerevisiae, and mouse ESC data, respectively. We further added a twofold expression threshold to the ScatLay DE genes, and most of the ScatLay-specific DE genes were eliminated by this criterion (Fig. S5b). With these 2 commonly used arbitrary cutoffs, ScatLay DE genes consist mainly of the overlapping DESeq2 and NOISeq DE genes. Notably, ScatLay still shows higher coverage than NOISeq and DESeq2 (Fig. S5b and Table 2).

Finally, we conducted gene enrichment analysis (Gene Ontology Consortium24) on the DE genes detected by ScatLay with a p value threshold of 0.05. We observed that the 815 DE genes of E. coli in aerobiosis are mostly enriched in cellular respiration, DNA processes and homeostasis (Fig. 5 and Table S3). The 1744 DE genes of S. cerevisiae in hypoxia are largely involved in RNA metabolism, ribosome biogenesis, and methylation, whereas processes such as anatomical structure development, smell sensory perception, and cell cycle are enriched for the 2091 DE genes of mouse ESC in ETO treatment (Fig. 5 and Table S3).

Notably, we observe that a small number of DE genes detected by NOISeq were not picked up by ScatLay at p value < 0.05 (34 genes in E. coli, 171 genes in S. cerevisiae and 478 genes in mouse ESC; Fig. S5b, top panel). Nevertheless, gene enrichment analysis did not show any known biological function for these NOISeq-specific DE genes from E. coli and S. cerevisiae. For mouse ESC, the 478 NOISeq-specific genes are enriched in only 9 biological processes, consisting mostly of regulation of cellular process and phosphorus metabolism (Table S4). On the other hand, the 252 ScatLay-specific DE genes (p value < 0.05) in E. coli show enrichment in serine amino acid metabolism, locomotion, and translation processes. In S. cerevisiae, the 389 ScatLay-specific genes are enriched in 128 biological processes, including ribosome biogenesis, translation, and gluconeogenesis. In mouse ESC, 430 enriched biological processes are detected for the 704 ScatLay-specific genes, such as developmental process, cell cycle phase transition, and regulation of apoptosis (Table S5).

Overall, ScatLay elucidates statistically reliable DE genes with overall higher coverage, without or with threshold cutoff, as compared with DESeq2 and NOISeq. As the 3 methods compared originate from distinct statistical methodologies and assumptions, it is inevitable to obtain a small number of distinct DE genes pertaining to each method. Notably, even with p value and expression threshold cutoff, ScatLay covered almost all the genes of DESeq2. However, NOISeq picks up several distinct DE genes not captured by ScatLay. Nevertheless, further experimental work is necessary to investigate these distinct DE genes captured by each method.

## Conclusion

Here, we developed a new method, implemented in the R programming language with a graphical user interface, to identify and visualize DE genes by overlaying transcriptome-wide expression between samples (replicates, conditions or time points). Unlike approaches that use arbitrary threshold levels to select DE genes, here the genes are checked for replicate variability before sample variability by our noise analyses. Overall, our method provides a novel way to uncover DE genes that is not biased by a user-defined threshold cutoff and is able to produce a larger overall coverage. Nevertheless, we also provide the option of a p value cut-off, derived from the 2D kernel density of between-replicate scatter plots, if further reduction of DE genes is required, for example, to focus only on the highly variable genes.

## Materials and methods

### Data

We obtained RNA-Seq datasets, in raw read counts, for Escherichia coli in aerobiosis (time series; GEO accession number GSE71562)25, Saccharomyces cerevisiae in hypoxia (time series; GEO accession number GSE85595)26, and Mus musculus embryonic stem cells in different treatment or gene knock-out conditions (GEO accession number GSE137138)27.

Briefly, for E. coli, K-12 strain W3110 was grown anaerobically in a 3-L continuously stirred tank bioreactor at pH 7 and 37 °C. The first sample was drawn (t = 0) when an OD of 3 at 600 nm was achieved, and an air supply of 1 L/min was then initiated. Subsequent samples were taken at t = 0.5, 1, 2, 5 and 10 min25.

For S. cerevisiae (strain yMH914 with wildtype HAP1), cells were subjected to 100% nitrogen gas and collected after 0, 5, 10, 30, 60, 120, 180, and 240 min26. Total RNA was extracted and mRNAs were enriched by polyA selection.

Mouse ESCs were derived from blastocysts of 2–6-month-old male mice from C57BL/6 strain. Mouse ESCs from E14tg2a cell lines were cultured in 2i/LIF medium, and treated with H2O (control), or Etomoxir (ETO), or then released from ETO for another 4 days (ETO-released). Wild-type mouse ESCs and Mof-deleted (Mof knock-out or Mof-KO) mouse ESCs were cultured in 2i/LIF medium with Ethanol (WT) or 4-OHT (Mof-KO)27. In this study, we selected only the control and ETO conditions for DE analysis.

In all datasets, the cDNAs were prepared into a sequencing library, multiplexed and sequenced by an Illumina HiSeq 2500 sequencer. In total, there were 4240, 6494 and 17,392 non-zero gene expressions with gene lengths for E. coli, S. cerevisiae and mouse ESC, respectively. For our analysis, we chose replicate data with best pairwise correlation for each species at each time point.

### Statistical distributions fitting

Fitting of gene expression distributions was performed using the maximum-likelihood fitting method (fitdistrplus package28 for parameter fitting and the MASS package29 for log-normal, Pareto, Loglogistic, Weibull and Burr distributions30).

### Gene expression noise

Gene expression noise, $${\eta }^{2}$$, is defined as the gene expression variance ($${\sigma }^{2}$$) over the square of the mean ($${\mu }^{2}$$)8,10,16. To compute transcriptome-wide noise, we first evaluate the noise of each gene (i = 1, …, m) between pairs of replicates or samples (j,k = 1,…,n):

$${\eta }_{i\left(jk\right)}^{2}=\frac{{\sigma }_{i\left(jk\right)}^{2}}{{\mu }_{i\left(jk\right)}^{2}}=2\frac{{\left({x}_{ij}-{x}_{ik}\right)}^{2}}{{\left({x}_{ij}+{x}_{ik}\right)}^{2}}$$

where $${x}_{ij}$$ and $${x}_{ik}$$ are the expression values of the ith gene in the jth and kth replicates/samples, and $${\sigma }_{i\left(jk\right)}^{2}={({x}_{ij}-{x}_{ik})}^{2}/2$$ and $${\mu }_{i\left(jk\right)}^{2}={({x}_{ij}+{x}_{ik})}^{2}/4$$ are the variance and the squared mean expression, respectively. We then summed the noise values of all genes between a pair of samples (j,k = 1,…,n) to calculate the total noise for each transcriptome:

$${\eta }_{\left(jk\right)}^{2}=\sum_{i=1}^{m}{\eta }_{i\left(jk\right)}^{2}$$

where m is the total number of genes.
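The two noise formulas above translate directly into code. The following minimal sketch assumes the two samples are given as equal-length lists of expression values (e.g. TPM) for the same genes; the function names and the handling of genes with zero expression in both samples are illustrative choices, not part of the original method.

```python
def gene_noise(x_j, x_k):
    """Pairwise noise of one gene: variance over squared mean of two values,
    i.e. 2 * (x_j - x_k)^2 / (x_j + x_k)^2."""
    total = x_j + x_k
    if total == 0:
        return 0.0  # assumption: a gene absent in both samples contributes no noise
    return 2.0 * (x_j - x_k) ** 2 / total ** 2


def total_noise(sample_j, sample_k):
    """Sum of per-gene noise over all genes shared by two samples."""
    if len(sample_j) != len(sample_k):
        raise ValueError("samples must cover the same genes")
    return sum(gene_noise(a, b) for a, b in zip(sample_j, sample_k))


# Two hypothetical genes: one unchanged, one changing threefold.
print(gene_noise(10.0, 10.0))                  # 0.0
print(gene_noise(30.0, 10.0))                  # 0.5
print(total_noise([10.0, 30.0], [10.0, 10.0])) # 0.5
```

An unchanged gene contributes zero noise, so the total is driven by genes whose expression differs between the two samples, which is why between-condition comparisons yield higher values than between-replicate comparisons.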

### Probability of differential expression for ScatLay

We select DE genes from the between-condition scatter as those not overlapped by the between-replicate scatters. Thus, the p value used to assess whether a gene is differentially expressed is the probability for its between-condition gene expression vector [$${x}_{i1}, {x}_{i2}$$] (with i = 1, …, m) to fall into the cloud of gene expression scatter between 2 replicates:

$$p= {\int }_{-\infty }^{({x}_{i1}, {x}_{i2})}f({x}_{i1},{x}_{i2})$$

in which f is the estimated kernel density function on between-replicate scatters:

$$f\left({x}_{i1},{x}_{i2}\right)=\frac{1}{2}({G}_{H}\left({x}_{i1}-{X}_{1}\right)+{G}_{H}\left({x}_{i2}-{X}_{2}\right))$$

where $${X}_{k}$$ is the concatenated gene expression vector in 2 conditions at replicate k (k = 1,2), $${G}_{H}$$ is 2D Gaussian kernel function at variance matrix (bandwidth) H, and the variance matrix H was estimated based on $${X}_{1}$$ and $${X}_{2}$$ vectors using hpi function from ks package31.
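To illustrate the kernel density step only, the sketch below evaluates a 2D Gaussian kernel density, built from between-replicate points, at a query point. It is not the procedure of the ks package: the bandwidth is assumed to be a fixed diagonal matrix with standard deviations `h1` and `h2` instead of a plug-in estimate, the integration that yields the p value is not reproduced, and all names and values are hypothetical.

```python
import math


def kde2d(query, points, h1, h2):
    """Density at `query` of a 2D Gaussian KDE over `points`, using a
    diagonal bandwidth (standard deviations h1 and h2 along each axis)."""
    qx, qy = query
    norm = 1.0 / (2.0 * math.pi * h1 * h2 * len(points))
    return norm * sum(
        math.exp(-0.5 * (((qx - x) / h1) ** 2 + ((qy - y) / h2) ** 2))
        for x, y in points
    )


# Hypothetical between-replicate scatter lying close to the diagonal, log10(TPM).
replicate_cloud = [(1.0, 1.02), (1.5, 1.48), (2.0, 2.01), (2.5, 2.47), (3.0, 3.03)]

on_diagonal = kde2d((2.0, 2.0), replicate_cloud, h1=0.2, h2=0.2)
off_diagonal = kde2d((2.0, 3.5), replicate_cloud, h1=0.2, h2=0.2)
print(on_diagonal > off_diagonal)  # True
```

A between-condition point far from the replicate cloud receives a density close to zero, which corresponds to the intuition that such a gene is unlikely to be explained by replicate variability alone.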