Variation-preserving normalization unveils blind spots in gene expression profiling

RNA-Seq and gene expression microarrays provide comprehensive profiles of gene activity, but lack of reproducibility has hindered their application. A key challenge in the data analysis is the normalization of gene expression levels, which is currently performed following an implicit assumption that most genes are not differentially expressed. Here, we present a mathematical approach to normalization that makes no assumption of this sort. We have found that variation in gene expression is much greater than currently believed, and it can be measured with available technologies. Our results also explain, at least partially, the problems encountered in transcriptomics studies. We expect this improvement in detection to help efforts to realize the full potential of gene expression profiling, especially in analyses of cellular processes involving complex modulations of gene expression.

Since the discovery of the structure of DNA by Watson and Crick, molecular biology has progressed increasingly quickly, with rapid advances in sequencing and related genomic technologies. Among these, microarrays and RNA-Seq have been widely adopted to obtain gene expression profiles [1,2,3,4,5] by measuring the concentrations of tens of thousands of mRNA molecules in single assays. Despite their enormous potential [6,7,8,9], problems of reproducibility and reliability [10,11,12] have discouraged their use in some areas, e.g. biomedicine [13,14,15]. In the more mature microarray technologies, issues such as probe design, cross-hybridization, non-linearities and batch effects [16,17] have been identified as possible culprits, but the problems persist [18,19].
The normalization of gene expression, which is required to set a common reference level among samples [20,21,22,23,24], is also reportedly problematic, affecting the reproducibility of both microarray [18,25] and RNA-Seq [19,22,24] results. An underlying assumption of the most widely used normalization methods (such as median and quantile normalization [26] for microarrays, or RPKM [4] and TMM [27] for RNA-Seq) is that most genes are not differentially expressed [24,28]. This lack-of-variation assumption may seem reasonable for many applications, but it has not been confirmed. Furthermore, results obtained with other technologies, particularly qRT-PCR, suggest that it may not be valid [18,22]. Thus, in an attempt to clarify and overcome the limitations imposed by this assumption, we have developed an approach to normalization that does not assume lack of variation. The analysis of a large gene expression dataset [29] using this approach shows that the assumption can severely undermine the detection of variation in gene expression. We have found that large numbers of differentially expressed genes with substantial expression changes are missed when data are normalized with methods that assume lack of variation.
The dataset was obtained from biological triplicates of Enchytraeus crypticus (a globally distributed soil organism used in standard ecotoxicity tests), sampled under 51 experimental conditions (42 treatments and 9 controls), involving exposure to several substances, at several concentrations and durations, according to a factorial design (Table S1). Gene expression was measured using a customized high-density oligonucleotide microarray [29], and the resulting dataset was normalized with four methods. Two of these methods are the most widely used procedures for microarrays, median (or scale) normalization and quantile normalization [26], whereas the other two, designated median condition-decomposition normalization and standard-vector condition-decomposition normalization, have been developed for this study.
With the exception of quantile normalization, all the methods used apply a multiplicative factor to the expression levels in each sample, equivalent to the addition of a number in the usual log2 scale for gene expression levels. Solving the normalization problem consists of finding these correction factors. The problem can be exactly and linearly decomposed into several sub-problems: one within-condition normalization for each experimental condition and one final between-condition normalization for the condition averages [29]. In the within-condition normalizations, the samples (replicates) subjected to each experimental condition are normalized separately, whereas in the final between-condition normalization the average levels for all conditions are normalized together. Because there are no genes with differential expression in any of the within-condition normalizations, the lack-of-variation assumption only affects the final between-condition normalization. The assumption is avoided by using, in this normalization, expression levels only from no-variation genes, i.e. genes that show no evidence of differential expression under a statistical test [29]. Both methods of normalization proposed here follow this condition-decomposition approach.
With median condition-decomposition normalization, all normalization steps are performed with median values, as in conventional median normalization, but only no-variation genes are included in the between-condition step. If all genes were used in this final step, the resulting total normalization factors would be exactly the same as those obtained with conventional median normalization.
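The condition-decomposition scheme can be sketched as follows (a toy Python/NumPy illustration with balanced conditions and a known no-variation set, not the authors' implementation; all sizes and values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
genes, conditions = 200, [[0, 1, 2], [3, 4, 5]]    # 2 conditions x 3 replicates
y = rng.normal(8.0, 1.0, size=(genes, 6))          # toy log2 expression matrix
novar = np.arange(150)                             # assumed no-variation genes

# Within-condition step: one independent normalization per condition,
# centering each replicate's median on the condition's average median.
within = np.zeros(6)
for idx in conditions:
    med = np.median(y[:, idx], axis=0)
    within[idx] = med - med.mean()                 # sums to zero within the condition

# Between-condition step: medians of condition averages, no-variation genes only.
cond_means = np.column_stack([y[:, idx].mean(axis=1) for idx in conditions])
between_k = np.median(cond_means[novar], axis=0)
between_k -= between_k.mean()
between = np.zeros(6)
for k, idx in enumerate(conditions):
    between[idx] = between_k[k]

y_norm = y - (within + between)                    # additive correction, log2 scale
```

Restricting the between-condition step to `novar` is the key difference from conventional median normalization: genes that vary across conditions no longer pull the correction factors.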
For standard-vector condition-decomposition normalization, a vectorial procedure was developed to carry out each normalization step. In a properly normalized dataset, the samples of any experimental condition must be exchangeable. In mathematical terms, the expression levels of each gene can be considered as an s-dimensional vector, where s is the number of samples for the experimental condition. After standardization (mean subtraction and variance scaling), these standard vectors are located on a (s − 2)-sphere. The exchangeability mentioned above implies that, when properly normalized, the distribution of standard vectors must be invariant with respect to permutations of the sample labels and must have zero expected value. These properties allow us to obtain, under fairly general assumptions, a robust estimator of the normalization factors [29].
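For concreteness, a standard vector can be computed as follows (a minimal sketch; the replicate values are made up):

```python
import numpy as np

def stdvec(x):
    """Standard vector: sqrt(s-1) times the centered sample over its norm."""
    x = np.asarray(x, dtype=float)
    s = x.size
    r = x - x.mean()                      # residual (mean-subtracted) vector
    norm = np.linalg.norm(r)
    if norm == 0.0:
        return np.zeros(s)
    return np.sqrt(s - 1) * r / norm

v = stdvec([7.1, 8.3, 7.9])               # 3 replicates of one gene
# components sum to zero and the vector has length sqrt(s-1),
# i.e. it lies on the (s-2)-sphere of radius sqrt(s-1)
```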
An important feature of the approaches to normalization proposed here (linear decomposition into normalization sub-problems per condition, and standard-vector normalization for each sub-problem) is that they do not depend on any particular aspect of the technology of gene expression microarrays or RNA-Seq. The numbers in the input data are interpreted as measured concentrations of mRNA molecules, irrespective of whether they were obtained from fluorescence intensities of hybridized cDNA (microarrays) or from counts of sequenced mRNA fragments (RNA-Seq). Nevertheless, we consider that specific within-sample corrections for each technology are still necessary and must be applied before the between-sample normalizations proposed here. Examples include background correction for microarrays or gene-length normalization (RPKM) for RNA-Seq.
To further explore and compare the outcomes of the normalization methods, they were also applied to a synthetic random dataset [29]. This dataset was generated gene-by-gene with means and variances identical to those of the real dataset, under the assumption that all genes were no-variation genes. In addition, normalization factors equal to those obtained from the real dataset were applied. Thus, the synthetic dataset was very similar to the real one, while complying by construction with the lack-of-variation assumption.
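A toy version of this construction (with a stand-in "real" matrix; the study's generator matched the actual dataset [29]):

```python
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(8.0, 1.0, size=(500, 9))          # stand-in for the real log2 data
mu = real.mean(axis=1)                               # gene-wise sample means
sd = real.std(axis=1, ddof=1)                        # gene-wise sample std deviations
factors = np.median(real, axis=0) - np.median(real)  # stand-in normalization factors

# No-variation synthetic data: every gene keeps its mean and variance in all
# samples, so no gene is differentially expressed by construction; the
# normalization factors are then re-applied on top.
synthetic = rng.normal(mu[:, None], sd[:, None], size=real.shape) + factors
```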
Figure 1 displays the results of applying the four normalization methods to the real and synthetic datasets. Each panel shows the interquartile ranges of expression levels for the 153 samples, grouped in triplicates exposed to each experimental condition. Both median (second row) and quantile normalization (third row) yielded similar outputs for both datasets. In contrast, the condition-decomposition normalizations (fourth and fifth rows) identified marked differences, detecting much greater variation between conditions in the real dataset. Conventional median normalization makes, by design, the median of each sample the same, while quantile normalization makes the full distribution of each sample the same. Hence, if there were differences in medians or distributions of gene expression between experimental conditions, both methods would have removed them. Figures 1G,I show that such variation between conditions was present in the real dataset.
The variation between medians displayed in Figs. 1G,I may seem surprising, given routine expectations based on current methods (Figs. 1C,E). Nevertheless, this variation inevitably results from the imbalance between over- and under-expressed genes. As an illustration, let us consider a case with two experimental conditions, in which the average expression of a given gene is less than the distribution median under one condition, but greater than the median under the other. The variation of this gene alone will change the value of the median to the expression level of the next-ranked gene. Therefore, if the number of over-expressed genes is different from the number of under-expressed genes, and enough changes cross the median boundary, then the median will substantially differ between conditions. Only when the differential expression is balanced or small enough will the median stay the same. This argument applies equally to any other quantile in the distribution of gene expression.
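A minimal numeric illustration of this argument, with hypothetical expression values:

```python
import numpy as np

baseline = np.arange(1.0, 12.0)      # 11 genes in the control, median = 6
treated = baseline.copy()
treated[:3] += 10.0                  # 3 low-expressed genes strongly up-regulated

m0 = np.median(baseline)             # 6.0
m1 = np.median(treated)              # the 3 changes cross the median: it moves to 9.0
```

A median normalization would subtract this 3-unit shift from the treated sample, flattening the genuinely up-regulated genes and spuriously down-shifting all the unchanged ones.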
To clarify how the condition-decomposition normalizations preserved the variation between conditions, we studied the influence of the choice of no-variation genes in the final between-condition normalization. To this end, we obtained the between-condition variation with both methods in two families of cases. In one family, no-variation genes were chosen in decreasing order of p-values from an ANOVA test. In the other family, genes were chosen at random. The first option was similar to the approach implemented to obtain the results presented in Figs. 1G-J, with the difference that there the number of genes was chosen automatically by a statistical test. As shown in Fig. 2A, for the real dataset the random choice of genes resulted in n^(-1/2) decays (n being the number of chosen genes), followed by a plateau. The n^(-1/2) decays reflect the standard errors of the estimators of the normalization factors. Selecting the genes by decreasing p-values, however, yielded a completely different result: up to a certain number of genes the variance remained similar, but for larger numbers of genes the variance dropped rapidly. Figure 2A shows, therefore, that between-condition variation was removed as soon as the between-condition normalizations used genes that varied in expression level across experimental conditions. The big circles in Fig. 2A indicate the working points of the normalizations used to generate the results displayed in Figs. 1G,I. In fact, these points slightly underestimated the variation between conditions: although the statistical test for identifying no-variation genes ensured that there was no evidence of variation, inevitably the expression of some selected genes varied across conditions.
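The n^(-1/2) scaling under a random choice of genes is the generic behaviour of a mean-based estimator, which a small simulation reproduces (illustrative values only, not the study's estimator):

```python
import numpy as np

rng = np.random.default_rng(2)
se = {}
for n in (100, 400, 1600):
    # estimate a shift of 0.4 as the mean of n gene-level measurements,
    # repeated 2000 times to measure the estimator's standard error
    estimates = rng.normal(0.4, 1.0, size=(2000, n)).mean(axis=1)
    se[n] = estimates.std()
# each 4-fold increase in n halves the standard error
```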
Figure 2B displays the results obtained with the synthetic dataset. There were no plateaus when no-variation genes were chosen randomly, only n^(-1/2) decays, and small differences when no-variation genes were selected by decreasing p-values. Big circles show that working points were selected with much larger numbers of genes in the synthetic dataset (Figs. 1H,J) than in the real dataset (Figs. 1G,I). The residual variation, produced by errors in the estimation of the normalization factors, was much smaller than the variation detected in the real dataset, especially for standard-vector condition-decomposition normalization. Overall, Fig. 2 shows that the between-condition variation pictured in Figs. 1G,I is not an artifact caused by using an exceedingly small or extremely particular set of genes in the final between-condition normalization, but that this variation originated from the real dataset.
Finally, Fig. 3A shows the numbers of differentially expressed gene probes (DEGP) identified after normalizing with the four methods, for each of the 42 experimental treatments versus the corresponding control (Table S2). Compared to conventional methods, the number of DEGP detected with the condition-decomposition normalizations was much larger under most treatments, in several cases by more than one order of magnitude. These are statistically significant changes of gene expression, i.e. changes that cannot be explained by chance. More important is the scale of the detected variation, as illustrated by the boxplots in Fig. 3C, showing absolute fold changes of DEGP detected after standard-vector condition-decomposition normalization.
For all treatments, the entire interquartile range of absolute fold change is above 1.5-fold, and for more than two thirds of the treatments the median absolute fold change is greater than 2. This amount of gene expression variation cannot be neglected, and warrants further research to explore its biological significance.
The lack-of-variation assumption underlying the current methods of normalization was self-fulfilling, removing variation in gene expression that was present in the real dataset.
Moreover, it had negative consequences for downstream analyses, as it both removed potentially important biological information and introduced errors in the detection of gene expression. A removal of variation can be understood as errors in the estimation of normalization factors. Considering data and errors vectorially, the length of each vector equals, after centering and up to a constant factor, the standard deviation of the data or error [29]. The addition of an error of small magnitude, compared to the data variance, would have only a minor effect. However, errors of similar or greater magnitude than the data variance may, depending on the lengths and relative angles of the vectors, severely distort the observed data variance. This will in turn cause spurious results in the statistical analyses. Furthermore, the angles between the data and the correct normalization factors (considered as vectors) are random: data reflect biological variation, while normalization factors respond to technical variation. If the experiment is repeated, even with exactly the same experimental settings, the errors in the normalization factors will vary randomly, causing random spurious results in the downstream analyses. This explains, at least partially, the lack of reproducibility found in transcriptomics studies, especially for the detection of small changes of gene expression, because small variations are the most likely to be distorted by errors in the estimates of normalization factors. Accordingly, the largest differences in the numbers of DEGP detected by conventional compared to condition-decomposition methods (Fig. 3A) occurred consistently in the treatments with the smallest magnitudes of gene expression changes, e.g. treatments 28, 29 and 33 (Figs. 3B,C).
In summary, this study demonstrates that large numbers of genes change in expression level across experimental conditions, often strongly, and too extensively to be ignored in the normalization of gene expression data. Further, our approach, which avoids the prevailing lack-of-variation assumption, shows that current normalization methods likely remove and distort important variation in gene expression. It also offers a means to investigate broad changes in gene expression that have remained hidden to date. We expect this to provide revealing insights into diverse biomolecular processes, particularly those involving substantial numbers of genes, such as cell differentiation, toxic responses, diseases with non-Mendelian inheritance patterns and cancer. After years of lagging behind the advances in genome sequencing, we believe that the procedures presented here will assist efforts to realize the full potential of gene expression profiling.

Figure 1: The condition-decomposition normalizations detected a large amount of between-condition variation in the real expression data, in contrast with conventional methods. All 10 panels show interquartile ranges of expression levels of the 153 samples, grouped by the 51 experimental conditions (Ag, blue-yellow; Cu, red-cyan; Ni, green-orange; UV, purple; see Table S1). Black lines indicate medians. Rows and columns correspond to normalization methods and datasets (as labeled), respectively. In the synthetic dataset no gene was differentially expressed between any two conditions.

Table S2: Treatment vs control comparisons, listed in increasing order of the number of differentially expressed gene probes (DEGP) obtained with standard-vector condition-decomposition normalization and limma statistical analysis. This is the same order as in Fig. 3, from left to right.

Supplementary Figures
Figure S2: Number of differentially expressed gene probes per treatment vs control comparison, as in Fig. 3A. Labeling is the same as in Fig. 3A (empty black circles, median normalization; empty red triangles, quantile normalization; filled green circles, median condition-decomposition normalization; filled blue triangles, standard-vector condition-decomposition normalization). Treatment vs control comparisons are ordered according to the number of DEGP identified with standard-vector condition-decomposition normalization, increasing from left to right. This order (not shown) was similar, but not exactly the same, as in Fig. 3A.

Test organism
The test species was Enchytraeus crypticus. Individuals were cultured in Petri dishes containing agar medium, in controlled conditions [30].
The tested Ag forms were as described previously [30].

Spiking procedure
Spiking for the Cu and Ag materials was done as in previous work [30]. For the Ni materials, the Ni-NPs were added to the soil as powder, following the same procedure as for the Cu materials. NiNO3, being soluble, was added to the pre-moistened soil as aqueous dispersions.
The concentrations tested were selected based on the reproduction effect concentrations.

Exposure details
In soil (i.e. for Cu, Ag and Ni) exposure followed the standard ERT [34] with adaptations as follows: twenty adults with well-developed clitellum were introduced into each test vessel, containing 20 g of moist soil (control or spiked). The organisms were exposed for three and seven days under controlled conditions of photoperiod (16:8 h light:dark) and temperature (20 ± 1 °C), without food. After the exposure period, the organisms were carefully removed from the soil, rinsed in deionized water and frozen in liquid nitrogen. The samples were stored at −80 °C until analysis.
For UV exposure, the test conditions [32] were adapted for E. crypticus [35]. The exposure was performed in 24-well plates, where each well corresponded to a replicate and contained 1 ml of ISO water and five adult organisms with clitellum. The test duration was five days, at 20 ± 1 °C.

The cRNA samples were hybridized on custom Gene Expression Agilent Microarrays (4x44k format), with a single-color design [36]. Hybridizations were performed using the Agilent Gene Expression Hybridization Kit, and each biological replicate was individually hybridized on one array. The arrays were hybridized at 65 °C with a rotation of 10 rpm, during 17 h. Afterwards, microarrays were washed using the Agilent Gene Expression Wash Buffer Kit and scanned with the Agilent DNA microarray scanner G2505B.

Data acquisition and analysis
Fluorescence intensity data were obtained with Agilent Feature Extraction Software v. 10.7.3.1, using the recommended protocol GE1 107 Sep09. Quality control was done by inspecting the reports on the Agilent Spike-in control probes. Background correction was provided by the Agilent Feature Extraction software. To ensure an optimal comparison between the different normalization methods, only gene probes with good signal quality (flag IsPosAndSignif = True) in all samples were employed in the analyses. This implied the selection of 18,339 gene probes from a total of 43,750. Analyses were performed with R [37] v. 3.2.0, using the R packages plotrix and RColorBrewer, and the Bioconductor [38] v. 3.1 packages genefilter and limma [39].
The synthetic data was generated gene by gene as normal variates with mean and variance equal, respectively, to the sample mean and sample variance of the real data.The applied normalization factors were those detected from the real data with standard-vector condition-decomposition normalization.
Median normalization was performed by subtracting the median of each sample distribution, and then adding the overall median to preserve the global expression level.Quantile normalization was performed as implemented in the limma package.
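A direct transcription of this median normalization into Python/NumPy (an illustrative sketch, not the study's R code):

```python
import numpy as np

def median_normalize(y):
    """Subtract each sample's median, then add back the overall median
    to preserve the global expression level (log2 scale)."""
    return y - np.median(y, axis=0) + np.median(y)

rng = np.random.default_rng(3)
# toy matrix: 300 genes x 5 samples, with artificial per-sample shifts
y = rng.normal(8.0, 1.0, size=(300, 5)) + np.array([0.0, 0.5, -0.3, 0.2, -0.4])
y_norm = median_normalize(y)
```

After normalization, every sample's median equals the overall median of the input.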
The two condition-decomposition normalizations proceeded in the same way: first, 51 independent within-condition normalizations using all genes; then, a final between-condition normalization, iteratively detecting no-variation genes and normalizing until convergence.
No-variation genes were identified with one-sided Kolmogorov-Smirnov tests, as goodness-of-fit tests against the uniform distribution, carried out on the greatest p-values obtained from an ANOVA test on the complete dataset (see below). The ANOVA test benefited from the already corrected within-condition variances provided by the within-condition normalizations. The KS test was rejected at α = 0.001.
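The detection of no-variation genes can be sketched along these lines (hypothetical p-values and gene counts; a two-sided KS test via SciPy is used here for simplicity, whereas the study uses a one-sided construction):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# toy p-values: 800 no-variation genes (uniform) + 200 varying genes (small p)
p = np.sort(np.concatenate([rng.uniform(size=800),
                            rng.beta(0.1, 10.0, size=200)]))[::-1]

def ks_uniform_top(p_desc, m):
    """KS goodness-of-fit of the m greatest p-values against uniformity,
    rescaled above the (m+1)-th p-value, which acts as the threshold."""
    t = p_desc[m]
    u = (p_desc[:m] - t) / (1.0 - t)
    return stats.kstest(u, 'uniform').pvalue

ok = ks_uniform_top(p, 400)    # mostly no-variation genes: not rejected
bad = ks_uniform_top(p, 999)   # includes the varying genes: strongly rejected
```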
The criterion for convergence for the median condition-decomposition (CD) normalizations was to require that the relative changes in the standard deviation of the normalization factors were less than 1%, or less than 10% for 10 steps in a row.In the case of standard-vector CD normalizations, convergence required that numerical errors were, compared to the estimated statistical errors (see below), less than 1%, or less than 10% for 10 steps in a row.For Figs. 2 and S1, due to the very low number of gene probes in some cases, the thresholds for convergence for 10 steps in a row were increased to 80% and 50%, respectively, for median CD and standard-vector CD normalization.
In standard-vector CD normalization, the distribution of standard vectors was trimmed in each step to remove the 1% most extreme values of variance.
Differentially expressed gene probes were identified with limma (Fig. 3) or t-tests (Fig. S2), using in all cases a FDR threshold of 5%.
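For the t-test route, a minimal sketch with a Benjamini-Hochberg FDR threshold of 5% might look as follows (toy data; limma's moderated statistics are not reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
ctrl = rng.normal(8.0, 0.5, size=(1000, 3))   # control triplicate
trt = rng.normal(8.0, 0.5, size=(1000, 3))    # treatment triplicate
trt[:100] += 4.0                              # 100 truly up-regulated genes

_, p = stats.ttest_ind(trt, ctrl, axis=1)     # per-gene two-sample t-tests

def bh_significant(p, q=0.05):
    """Benjamini-Hochberg step-up: keep the largest k with p_(k) <= k*q/m."""
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    sig = np.zeros(m, dtype=bool)
    sig[order[:k]] = True
    return sig

degp = bh_significant(p, q=0.05)              # FDR threshold of 5%
```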
The reference distribution with permutation symmetry shown in the polar plots of the probability density function in Movies S1-S3 was calculated with the 6 permutations of the empirical standard vectors. The Watson U^2 statistic was calculated with the two-sample test [40]. An equal number of samples for comparison was obtained by sampling with replacement from the permuted standard vectors.

Mathematical Methods
In a gene expression dataset with g genes, c experimental conditions and n samples per condition, the observed expression levels of gene j in condition k can be expressed in log2 scale as

$y_j^{(k)} = x_j^{(k)} + a^{(k)}$,   (1)

where $x_j^{(k)}$ is the vector of true gene expression levels and $a^{(k)}$ is the vector of normalization factors.

Given a sample vector x, the mean vector is $\bar{x}\,\mathbf{1}$ (with $\bar{x}$ the mean of the components of x), and the residual vector is $\tilde{x} = x - \bar{x}\,\mathbf{1}$.

Then, (1) can be linearly decomposed into

$\bar{y}_j^{(k)} = \bar{x}_j^{(k)} + \bar{a}^{(k)}$,   (2)

$\tilde{y}_j^{(k)} = \tilde{x}_j^{(k)} + \tilde{a}^{(k)}$.   (3)

Equations (3) define the within-condition normalizations for each condition k. The scalar values in (2), collected across conditions, are used to obtain the equations on condition means, which decompose in turn into a mean part,

$\bar{\bar{y}}_j = \bar{\bar{x}}_j + \bar{\bar{a}}$,   (4)

and a residual part across conditions,

$\widetilde{\bar{y}}_j = \widetilde{\bar{x}}_j + \widetilde{\bar{a}}$.   (5)

The between-condition normalization is defined by (5). Equations (4) reduce to a single number, which is irrelevant to the normalization. The complete solution for each condition is obtained with $a^{(k)} = \tilde{a}^{(k)} + \bar{a}^{(k)}\,\mathbf{1}$.
The n samples of gene j in a given condition can be modeled with random vectors $X_j$, so that the observed vectors are $Y_j = X_j + a$. The true standard vectors satisfy $E(\tilde{X}_j/\|\tilde{X}_j\|) = \mathbf{0}$, whereas the observed standard vectors verify, as long as $\tilde{a} \neq 0$, that $E(\tilde{Y}_j/\|\tilde{Y}_j\|) \cdot \tilde{a} > 0$ (see Supplementary Text). This motivates the following iterative procedure to solve (3) and (5) (standard-vector normalization): at each step t, the mean standard vector $b^{(t)}$ of the current residuals is computed and used to update the estimate of the normalization factors. At convergence, $\lim_{t\to\infty} b^{(t)} = \mathbf{0}$, which implies that the standard vectors of the normalized data have zero mean. Convergence is faster the more symmetric the empirical distribution of $\tilde{x}_j/\|\tilde{x}_j\|$ is on the unit (n − 2)-sphere. Convergence is optimal with spherically symmetric distributions, such as the Gaussian distribution, because in that case the bias of the observed standard vectors is exactly parallel to the residual normalization factors. Assuming no correlation between genes, an approximation of the statistical error at step t can be obtained from the standard errors of the mean standard vectors. This statistical error is compared with the numerical error to assess convergence.
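As a toy sketch of this fixed-point idea (not the authors' estimator: the data, the relaxation step of 0.1 and the stopping rule below are all illustrative choices), the mean standard vector can be driven to zero iteratively:

```python
import numpy as np

def stdvec_rows(r):
    """Row-wise standard vectors: sqrt(n-1) * centered row / its norm."""
    n = r.shape[1]
    c = r - r.mean(axis=1, keepdims=True)
    norm = np.linalg.norm(c, axis=1, keepdims=True)
    norm[norm == 0.0] = 1.0
    return np.sqrt(n - 1) * c / norm

rng = np.random.default_rng(6)
g, n = 4000, 6
x = rng.normal(rng.uniform(6.0, 10.0, size=g)[:, None], 1.0, size=(g, n))
a_true = np.array([0.5, -0.2, 0.3, -0.4, 0.1, -0.3])   # zero-mean factors
y = x + a_true                                          # observed levels

a_est = np.zeros(n)
for _ in range(500):
    b = stdvec_rows(y - a_est).mean(axis=0)   # mean standard vector of residuals
    a_est += 0.1 * b                          # relax toward the b = 0 fixed point
    if np.linalg.norm(b) < 1e-9:
        break
```

Because the bias of the mean standard vector points along the remaining normalization error, the update shrinks that error at every step; the residual discrepancy from `a_true` reflects only finite-sample noise.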
See Supplementary Text below for a detailed exposition of the mathematical methods, and Movies S1-S5 for an illustration.

The standard vector of a sample vector x is defined as $\mathrm{stdvec}(x) = \sqrt{n-1}\;\tilde{x}/\|\tilde{x}\|$ whenever $\tilde{x} \neq \mathbf{0}$, or otherwise as $\mathrm{stdvec}(x) = \mathbf{0}$, where $\mathbf{0}$ is the all-zeros column vector of dimension n.

For a given number of samples n, all the non-zero standard vectors belong to the (n − 2)-sphere of radius $\sqrt{n-1}$, embedded in the (n − 1)-dimensional hyperplane perpendicular to $\mathbf{1}$. Besides, all the components of a standard vector are equal to the corresponding standardized samples. For the degenerate case of having only two samples (n = 2), the only possible values of a non-zero standard vector are $\pm(1/\sqrt{2},\,-1/\sqrt{2})$.

SM2 Linear decomposition of the normalization problem
Let us consider a gene expression dataset, with g genes and c experimental conditions.
Each condition k has $s_k$ samples. The total number of samples is $s = \sum_{k=1}^{c} s_k$.
Let us denote the observed expression level of gene j in sample i of condition k by $y_{ij}^{(k)}$. We assume that the observed level $y_{ij}^{(k)}$ is equal, in the usual log2 scale, to the addition of the normalization factor $a_i^{(k)}$ to the true gene expression level $x_{ij}^{(k)}$. Solving the normalization problem amounts to finding the normalization factors $a_i^{(k)}$. There is, however, an additional requirement imposed by the methods with which we propose to solve the between-condition normalization. We would like to consider the condition means $\bar{x}_j^{(k)}$ in (36) as sample data across conditions. This only holds when all the conditions have the same number of samples. Otherwise, we balance the condition means so that they result from the same number of samples in all conditions, according to the procedure described in the following.
Let $s_*$ be the minimum number of samples across conditions, $s_* = \min\{s_1, \ldots, s_c\}$. Let $S_j^{(k)}$ be independent random samples (without replacement) of size $s_*$ from the set of indexes $\{1, \ldots, s_k\}$, with one sample per gene j and condition k. Then, the balanced condition means are defined as the means of the observed levels over the sampled indexes. From (22), the balanced condition means verify a relationship similar to (36). Moreover, the average of the sampled normalization factors over the random subsamples equals $\bar{a}^{(k)}$, which implies that (51) are, on average, equivalent to (36). Hence, we use these vectors of balanced condition means as sample data across conditions in the between-condition normalization, projected onto the (n − 1)-dimensional hyperplane of residual vectors.
Residual vectors and standard vectors have been widely studied, especially in relation to elliptically symmetric distributions and linear models [42,43], and to the invariances of probability distributions [44].Here, we consider these vectors from the viewpoint of the problem of normalizing multivariate data, and its relationship with permutation invariance.
It is well known that, for a multivariate distribution with independent and identically distributed components, the expected value of the standard vector is zero [41], given that it is so for each component. We prove this here for completeness, and to show that it is also a necessary consequence of the permutation invariance of the distribution.
Proposition. The expected value of any true (i.e. without normalization issues) standard vector is zero: if the n ≥ 2 samples of feature j are independent and identically distributed, then $E(\tilde{X}_j/\|\tilde{X}_j\|) = \mathbf{0}$.

Proof. Let $\mathcal{P}_n$ be the set of all the permutation matrices in $\mathbb{R}^{n \times n}$. Then, for any $P \in \mathcal{P}_n$, $\tilde{X}_j/\|\tilde{X}_j\|$ is equal in distribution to $P\,\tilde{X}_j/\|\tilde{X}_j\|$. This implies that $E(\tilde{X}_j/\|\tilde{X}_j\|)$ is invariant under every permutation P. The only vectors that are invariant with respect to all possible permutations are those that have all components identical. Therefore, $E(\tilde{X}_j/\|\tilde{X}_j\|) = \alpha\,\mathbf{1}$, with $\alpha \in \mathbb{R}$. However, residual vectors are perpendicular to $\mathbf{1}$, so $\alpha = 0$.

For each true random vector $X_j$, there is an observed random vector $Y_j = X_j + A$, where A is the random vector of normalization factors. The random vectors $X_j$ and A are independent, representing biological and technical variation, respectively. Therefore, and without loss of generality, we assume in what follows a fixed vector of normalization factors a, i.e. we condition on the event {A = a}. We also assume that $P(\tilde{Y}_j = \mathbf{0}) = 0$, which implies that $\tilde{Y}_j/\|\tilde{Y}_j\|$ has length 1 almost surely.
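Both properties (zero expectation of the true standard vectors, and the bias of the observed ones toward the direction of a) are easy to check by simulation; the Gaussian toy data and the zero-mean vector `a` below are arbitrary illustrative choices:

```python
import numpy as np

def stdvec_rows(r):
    """Row-wise standard vectors: sqrt(n-1) * centered row / its norm."""
    n = r.shape[1]
    c = r - r.mean(axis=1, keepdims=True)
    norm = np.linalg.norm(c, axis=1, keepdims=True)
    norm[norm == 0.0] = 1.0
    return np.sqrt(n - 1) * c / norm

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, size=(200000, 4))     # iid samples: true vectors
a = np.array([0.8, -0.3, -0.1, -0.4])          # arbitrary zero-mean factors

m_true = stdvec_rows(x).mean(axis=0)           # expected value ~ 0
m_obs = stdvec_rows(x + a).mean(axis=0)        # biased toward a
cos = m_obs @ a / (np.linalg.norm(m_obs) * np.linalg.norm(a))
```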
In contrast to the true standard vector $\sqrt{n-1}\;\tilde{X}_j/\|\tilde{X}_j\|$, the observed standard vector $\sqrt{n-1}\;\tilde{Y}_j/\|\tilde{Y}_j\|$ is biased toward the direction of $\tilde{a}$, with the result that the expected value is not zero.

Proposition. If the n ≥ 2 samples of feature j are independent and identically distributed, then, whenever $\tilde{a} \neq \mathbf{0}$, $E(\tilde{Y}_j/\|\tilde{Y}_j\|) \neq \mathbf{0}$. When n = 2, there is the additional requirement that $P(\|\tilde{X}_j\| < \|\tilde{a}\|) > 0$. This threshold of detection only occurs for the degenerate case of n = 2.
Proof. Let us consider the projection of $\tilde{Y}_j/\|\tilde{Y}_j\|$ on $\tilde{a}$, compared to the projection of $\tilde{X}_j/\|\tilde{X}_j\|$. When the vectors $\tilde{X}_j$ and $\tilde{a}$ are collinear, $\frac{\tilde{X}_j \cdot \tilde{a}}{\|\tilde{X}_j\|\,\|\tilde{a}\|} = \pm 1$ and $\frac{\tilde{Y}_j \cdot \tilde{a}}{\|\tilde{Y}_j\|\,\|\tilde{a}\|} = \pm 1$, with $\frac{\tilde{Y}_j \cdot \tilde{a}}{\|\tilde{Y}_j\|\,\|\tilde{a}\|} \geq \frac{\tilde{X}_j \cdot \tilde{a}}{\|\tilde{X}_j\|\,\|\tilde{a}\|}$. This is the only case when n = 2. The additional requirement ensures that, for n = 2, the inequality is strict with positive probability. Otherwise, when n > 2 and the vectors $\tilde{X}_j$ and $\tilde{a}$ are not collinear, they lie on a plane.

The vector $\tilde{Y}_j = \tilde{X}_j + \tilde{a}$ is the diagonal of the parallelogram defined by $\tilde{X}_j$ and $\tilde{a}$. Hence the angle between $\tilde{Y}_j$ and $\tilde{a}$ is strictly less than the angle between $\tilde{X}_j$ and $\tilde{a}$, so the cosine of the angle is strictly greater. Thus, $\frac{\tilde{Y}_j \cdot \tilde{a}}{\|\tilde{Y}_j\|\,\|\tilde{a}\|} > \frac{\tilde{X}_j \cdot \tilde{a}}{\|\tilde{X}_j\|\,\|\tilde{a}\|}$.

Due to the permutation symmetries in the distribution of $\tilde{X}_j/\|\tilde{X}_j\|$, when n > 2 the vector $\tilde{X}_j$ has non-zero probability of being not collinear with $\tilde{a}$, i.e. $P\!\left(\left|\frac{\tilde{X}_j \cdot \tilde{a}}{\|\tilde{X}_j\|\,\|\tilde{a}\|}\right| < 1\right) > 0$. Therefore, $E\!\left(\frac{\tilde{Y}_j \cdot \tilde{a}}{\|\tilde{Y}_j\|\,\|\tilde{a}\|}\right) > E\!\left(\frac{\tilde{X}_j \cdot \tilde{a}}{\|\tilde{X}_j\|\,\|\tilde{a}\|}\right) = 0$. Finally, this implies $E(\tilde{Y}_j/\|\tilde{Y}_j\|) \neq \mathbf{0}$.

As a consequence, the normalization of residuals problem may be restated as the problem of finding the normalization factors a from the observed vectors $y_j$, such that the standard vectors $\sqrt{n-1}\,\widetilde{(y_j - a)}/\|\widetilde{(y_j - a)}\|$ are invariant against permutations of the observation labels. Or equivalently, such that these standard vectors have zero mean. The following property provides an approach to the solution.
Proposition. Whenever $\tilde{a} \neq \mathbf{0}$, the component of the expected value of $\tilde{Y}_j/\|\tilde{Y}_j\|$ parallel to $\tilde{a}$ verifies

$0 < E\!\left(\frac{\tilde{Y}_j \cdot \tilde{a}}{\|\tilde{Y}_j\|\,\|\tilde{a}\|}\right) < E\!\left(\frac{\|\tilde{a}\|}{\|\tilde{Y}_j\|}\right)$.

As in (58), when n = 2 we also assume that $P(\|\tilde{X}_j\| < \|\tilde{a}\|) > 0$.

Proof. The first inequality holds from the previous proof. Concerning the second inequality, since $\tilde{Y}_j \cdot \tilde{a} = \tilde{X}_j \cdot \tilde{a} + \|\tilde{a}\|^2$, let us consider the decomposition

$\frac{\tilde{Y}_j \cdot \tilde{a}}{\|\tilde{Y}_j\|\,\|\tilde{a}\|} = \frac{\|\tilde{X}_j\|}{\|\tilde{X}_j + \tilde{a}\|}\,\frac{\tilde{X}_j \cdot \tilde{a}}{\|\tilde{X}_j\|\,\|\tilde{a}\|} + \frac{\|\tilde{a}\|}{\|\tilde{Y}_j\|}$,

so that the second inequality amounts to the first term having negative expectation. Splitting this expectation by the sign of the cosine $\frac{\tilde{X}_j \cdot \tilde{a}}{\|\tilde{X}_j\|\,\|\tilde{a}\|}$, and noting that $\|\tilde{X}_j + \tilde{a}\|$ is larger when the cosine is positive, the positive contributions are outweighed by the negative ones. Back to the initial expected values, it follows that

$E\!\left(\frac{\|\tilde{X}_j\|}{\|\tilde{X}_j + \tilde{a}\|}\,\frac{\tilde{X}_j \cdot \tilde{a}}{\|\tilde{X}_j\|\,\|\tilde{a}\|}\right)^{\!+} < -\,E\!\left(\frac{\|\tilde{X}_j\|}{\|\tilde{X}_j + \tilde{a}\|}\,\frac{\tilde{X}_j \cdot \tilde{a}}{\|\tilde{X}_j\|\,\|\tilde{a}\|}\right)^{\!-}$,

which implies $E\!\left(\frac{\|\tilde{X}_j\|}{\|\tilde{X}_j + \tilde{a}\|}\,\frac{\tilde{X}_j \cdot \tilde{a}}{\|\tilde{X}_j\|\,\|\tilde{a}\|}\right) < 0$.
The Gaussian multivariate distribution, among others, has spherical symmetry besides permutation symmetry. For parametric families with spherical symmetry, the true standard vector $\sqrt{n-1}\;\tilde{X}_j/\|\tilde{X}_j\|$ has uniform distribution over the (n − 2)-sphere. As a result, the components of $\tilde{Y}_j/\|\tilde{Y}_j\|$ perpendicular to $\tilde{a}$ are antisymmetric with respect to the direction of $\tilde{a}$, so that they cancel out in expectation. That is, for parametric families with spherical symmetry, and as long as $\tilde{a} \neq \mathbf{0}$, the expected value $E(\tilde{Y}_j/\|\tilde{Y}_j\|)$ is exactly parallel to $\tilde{a}$.

Author contributions: … performed the experimental work and collected the microarray data. C.P.R. designed and implemented the novel normalization methods. C.P.R. performed the statistical analyses. All the authors jointly discussed the results. C.P.R. drafted the paper, with input from all the authors. All the authors edited the final version of the paper.

Figures
Figure 1

Figure 3: The condition-decomposition normalizations detected much larger numbers of differentially expressed gene probes (DEGP), with substantial fold changes. (A) Number of DEGP for each treatment vs control comparison, obtained after applying the four normalization methods (empty black circles, median normalization; empty red triangles, quantile normalization; filled green circles, median condition-decomposition normalization; filled blue triangles, standard-vector condition-decomposition normalization). Significant differential expression was identified with the R/Bioconductor package limma (see Fig. S2 for results with t-tests). Lower panels show boxplots of absolute values of DEGP fold changes (absolute differences of log2 expression levels), also per treatment vs control comparison, obtained with quantile normalization (B) and standard-vector condition-decomposition normalization (C). Boxplots are colored by treatment, with the same color code as in Fig. 1. In both panels comparisons are ordered according to the number of DEGP identified with standard-vector condition-decomposition normalization, increasing from left to right (Table S2). Dashed horizontal lines in the lower panels indicate references of 1.5-fold and 2-fold changes.
The normalization factors can be understood as sample-wide changes in the concentration of mRNA molecules by multiplicative factors equal to $2^{a_i^{(k)}}$. These changes are caused by technical reasons in the assay and are independent of the biological variation in the true levels $x_{ij}^{(k)}$. In the within-condition normalizations, the features are gene expression levels, with one observation per sample of the corresponding experimental condition. In the between-condition normalization, the features are means of gene expression levels, with one observation per condition.

The organisms were exposed to UV on a daily basis, during 15 minutes per day, to two UV intensities (280-400 nm) of 1669.25 ± 50.83 and 1804.08 ± 43.10 mW/m2, corresponding to total UV doses of 7511.6 and 8118.35 J/m2, respectively. The remaining time was spent under standard laboratory illumination (16:8 h photoperiod).
UV radiation was provided by a UV lamp (Spectroline XX15F/B, Spectronics Corporation, NY, USA; peak emission at 312 nm), and a cellulose acetate sheet was coupled to the lamp to cut off UVC-range wavelengths [35]. Thirty-two replicates per test condition (including a control without UV radiation) were performed to obtain 4 biological replicates with 40 organisms each for RNA extraction. After the exposure period, the organisms were carefully removed from the water and frozen in liquid nitrogen. The samples were stored at −80 °C until analysis. … with the Agilent one-color RNA Spike-In Kit. Purification of the amplified and labeled cRNA was performed with RNeasy columns (Qiagen, Valencia, CA, USA).
where a is a fixed vector of normalization factors. It can be proved, under fairly general assumptions, that the true standard vectors have zero expected value.