Selecting reference genes in RT-qPCR based on equivalence tests: a network based approach

Because quantitative reverse transcription PCR (RT-qPCR) gene expression data are compositional, amounts of quantified RNAs must be normalized using reference genes. However, the two most used methods to select reference genes (NormFinder and geNorm) ignore the compositional nature of RT-qPCR data, and often lead to different results making reliable reference genes selection difficult. We propose a method, based on all pairwise equivalence tests on ratio of gene expressions, to select genes that are stable enough to be used as reference genes among a set a candidate genes. This statistical procedure controls the error of selecting an inappropriate gene. Application to 30 candidate reference genes commonly used in human studies, assessed by RT-qPCR in RNA samples from lymphoblastoid cell lines of 14 control subjects and 26 patients with bipolar disorder, allowed to select 7 reference genes. This selection was consistent with geNorm’s ranking, less with NormFinder’s ranking. Our results provide an important fundamental basis for reference genes identification using sound statistics taking into account the compositional nature of RT-qPCR data. The method, implemented in the SARP.compo package for R (available on the CRAN), can be used more generally to prove that a set of genes shares a common expression pattern.


Model of individual RNA quantified amount
Let consider a set of K different RNAs, of which some will be quantified using qRT-PCR ; let k = 1 . . . K the gene index. This quantification will be done for C conditions ; let c = 1 . . . C the condition index. In each condition, experiment will be done on n c samples ; let i = 1 . . . n c be the sample index. With these notations, the amount of RNA k in the sample i for condition c is given by a random variable, X k,c,i .
For sake of simplicity, expressions will be given in the case of two conditions, C = 2, and the first condition will be considered as the reference one -they can easily be extended to more than two conditions. In the two-conditions case, one can write log X k,c,i = µ k + δ k 1 c=2 + ε k,c,i where µ k is (in log) the average expression level of RNA k in condition 1, δ k the average change in expression level of this RNA in condition 2, 1 c=2 a function that equals 1 if the measure is done in condition 2, 0 otherwise and ε k,c,i an centered random variable (E(ε k,c,i ) = 0), of variance σ 2 k,c , which includes the between-sample variation and the intra-sample variation.
Control genes, also called reference genes or housekeeping genes, are genes that are assumed to have similar expression levels between conditions, that is δ k = 0. In addition, best control genes will be the one with the smallest variances, that is both σ 2 k,1 and σ 2 k,2 minimal. All methods to select control genes will try to select genes that satisfy these conditions.
The experimental setup raises additional difficulties. First, After the retro-transcription step, the amount of the corresponding cDNA is X c k,c,i = ρ k X k,c,i , where ρ k is the retro-transcription yield for RNA k. Because of the extraction of a predefined amount of total DNA after retro-transcription, however, the quantified amount is The result of the quantification is finally Y * k,c,i = λ k X * k,c,i , where λ k is the, unknown, proportionnality constant between the obtained result (typically, C q in the log scale) and the amount of the RNA of interest in the well. Finally, the model for the available data is Both ρ k and λ k are assumed to depend only on the RNA, but neither on the experimental condition, nor on the patient ; unavoidable variations of these values between samples will be included in the error term of the model.

Expression of the ratios
We define R k,k ′ ,c,i the ratio of the measured expressions of genes k and k ′ , in a given sample. We have where ∆ k,k ′ µ = µ ′ k − µ k and so on. We have E(log R k,k ′ ,c,i ) = log c the linear correlation coefficient between values for genes k and k ′ in condition c.

All ratios of ratios method
For each pair of genes, k and k ′ , we consider the eventual change in their ratio, R k,k ′ ,c,i , between the two conditions. In case of such a change, the ratio of these ratios is different from 1 -that is, Consequently, one can estimate ∆ k,k ′ δ, which is the difference of changes of expression of genes k and k ′ between the two conditions. If both genes are control genes, then ∆ k,k ′ δ = 0, so identification of control genes is the identification of pairs of genes for which ∆ k,k ′ δ = 0.
In general, samples are different between the two conditions, hence one cannot access this ratio individual values, and its average will be estimated, in the log-scale, by the difference of the averages of the values in each condition ("two independant samples comparison if mean").
Assuming a Gaussian distribution for ε k,c,i , the test of ∆ k,k ′ δ = 0 can be done for all pairs of genes, either using an usual Student's T-test (and if the test of R k,k ′ ,c,j is non-significant, the pairs of genes k and k ′ will be declared as candidate control genes) or an equivalence test of ∆ k,k ′ δ to 0 (and if the test of R k,k ′ ,c,j is non-significant, the pairs of genes k and k ′ will be declared as candidate control genes).
For an equivalence test, the test criterion is based on the confidence interval of the difference of the means, hence will be given here, assuming a Gaussian distribution and denoting arithmetic means on the samples by overline, by The expectation of this confidence interval will be which includes the average change of expression between the two conditions, but also correlations between genes and their variabilities, through the σ 2 k,k ′ ,c terms. Finally, all possible couples of genes are tests and, according to the methodology described in [Curis et al., BioInformatics, 2019], a graph of genes is built and analyzed.

The geNorm method
The original geNorm model does not specify an explicit model for RNA amounts. It is a descriptive approach in three steps : first, compute all pairwise ratios, second estimate the standard deviation of these ratios for all pairs of RNAs and third, for each RNA, average the standard deviation of all ratios involving it. The second and third step are performed on the binary log scale.
The first step is equivalent to the first one in the previous methd, and with the suggested expression model, it corresponds to computing the R k,k ′ ,c,i values defined above. Using the Vandesompele et al. notations, with indices translated to our conventions, that gives be the average of the whole A k,k ′ set -that is, the average of all available ratios for the (k, k ′ ) pair of RNA, the "grand mean". Let be the average of these ratios, limited to the condition c. We have Application of the usual one-way analysis of variance sum of square decomposition gives is the unbiased estimator of σ 2 k,k ′ , the (k, k ′ ) ratio variance in condition c, in the log scale. Considering only the expectation of this expression, with have E(V 2 k,k ′ ) = 1 log 2 (n 1 − 1)σ 2 k,k ′ ,1 + (n 2 − 1)σ 2 k,k ′ ,2 n 1 + n 2 − 1 + n 1 n 2 ((∆ k,k ′ δ) 2 + σ 2 k,k ′ ,1 + σ 2 k,k ′ ,2 ) (n 1 + n 2 )(n 1 + n 2 − 1) = 1 log 2 1 n 1 + n 2 − 1 n 1 − 1 + n 1 n 2 n 1 + n 2 σ 2 k,k ′ ,1 + n 2 − 1 + n 1 n 2 n 1 + n 2 σ 2 k,k ′ ,2 + n 1 n 2 n 1 + n 2 (∆ k,k ′ δ) 2 To have lighter notations, let n = n1 n2 n1+n2 and A c = n c − 1 + n : the expectation then rewrites as The last step computes, for a given RNA (k), the average of the V k,k ′ for all k ′ = k, which gives where K * is the number of considered genes amongst the K. The genes are then sorted according to this value.
Understanding the exact origin of this sorting order, even considering ideal values instead of the random variables, is not straightforward, partly because of the use of the standard deviation that introduces a square root, and partly because each value involves the relations between a gene and all other genes.
Nevertheless, if one consider the special case of a set of candidate reference genes, with one gene, let's say k = 1, that is not a reference and all other genes are indeed, this means that ∆ k,k ′ δ = 0 except for k = 1 where ∆ k,k ′ δ = d. In that case, and also assuming that variances are the same for all genes and all conditions, then for all reference genes (k > 1) the ∆δ 2 k,k ′ term will be present in only V k,1 , which mean it will have influence roughly of d K * −1 , whereas for the "outlier" gene, it will be present in all V 1,k ′ terms, which mean it will have influence roughly of d -all other parts beeing equal. In other words, the method will detect a gene for which expression changes among a set of non-varying genes, and the more non-varying genes, the easiest is the detection by ranking the M k values.
For more complex models, using the expressions above can give the expected classification of the genes according to the method.
To further explore qualitatively the characteristics of the method, we consider instead the measure Despite there is no simple relation between M k and M * k , factors changing M * k should change M k in the same way, allowing to understand what drives the final ranking of the genes. We have By replacing σ 2 k,k ′ ,c = σ 2 k,c − 2 ρ k,k ′ ,c σ k,c σ k ′ ,c + σ 2 k ′ ,c and ∆ k,k ′ δ = δ ′ k − δ k by their expressions, we then have, after grouping terms, This expression mainly contains two parts, the first part that describes the variability of the gene k and the second part is an average of what happens for all other genes. This second part is all the more negligible, for the ranking, than the number of genes is large, because it basically becomes the same for all genes. Hence, ranking is mainly based on the A 1 σ 2 k,1 + A 2 σ 2 k,2 + n δ 2 k part. Genes with high variations, either between patients (σ 2 k,1 and σ 2 k,2 ) or, in average, between conditions δ 2 k will have high M * k values, hence high M k values, and will be excluded first.
Of note, in the usual algorithm implementation, the M k values are reevaluated after each candidate gene exclusion ; hence, the last term becomes more variable between genes. We can then suspect that ranking of the last genes may be more sensitive to correlations between genes and all other terms appearing in this factor.

The NormFinder method
The original paper presenting this method does not explicitly makes the difference between X k,c,i , the real amount in the original sample, and Y k,c,i , the amount quantified in the sample. In other words, it assumes that ∀k, ρ k = λ k = 1 -an assumption that we will kept in this part.
The NormFinder ranking method is based on the following model, adapting the original paper notations to the one used here : where α k,c is presented as the (average) expression level of gene k in condition c, β c,i as the total amount of RNAs in sample i of condition c (in log scale), and ε NF k,c,i is a random variable assumed to be centered, of variance σ NF k,c 2 and all ε NF k,c,i are assumed to be independant. The model also assumes that α k,c = α c + ε α k,c , where α c is the average expression level of all genes in condition c and ε α k,c is a centered random variable, with V(ε α k,c ) = γ 2 . An important remark should be made on this model : since e βc,i represents the amount of total RNA, one must have X k,c,i ≤ e βc,i , that is e α k,c e ε NF k,c,i ≤ 1. In other words, since the second factor represents noise and should be either greater or less than one, the e α k,c term represents the fraction that the RNA of interest represents in the total RNA and is constrained by e α k,c < 1 ⇐⇒ α k,c < 0. This constraint is not consistent with the hypothesis of a Gaussian distribution for V(ε α k,c ) = γ 2 , assumed to develop the stability measure ; however, it might be acceptable or not, depending on the value of γ compared to α c .
Let define θ c = 1 nc nc i=1 β c the average total amount of RNA in condition c and Z k,c = 1 nc i=1 log X k,c,i the average quantified amount of gene k in condition c, in log-scale. Let α k = α k,1 +α k,2 2 be the average expression level of gene k, all conditions taken together. The stability measure used to rank the gene is then defined, for two conditions, by ρ NF which is the sum of the expectation (with sign removed) and the standard deviation of the deviation of the observed values and the average level in each conditions.
Linking this model with the one presented in introduction is not straightforward, first because they are developed in quite different perspective (absolute amounts in our case, relative amounts in the NormFinder case), and second because of different hypothesis. Basically, one should make correspond log X k,c,i = µ k + δ k 1 c=2 + ε k,c,i = α c + β c,i + ε α k,c + ε NF k,c,i with the additional equations µ k + δ k 1 c=2 = α c + ε α k,c = α 1 + δα 1 c=2 + ε α k,c and e βc,i = K k=1 X k,c,i which gives ε k,c,i = β c,i + ε NF k,c,i and σ 2 k,c = V(β c,i ) + σ NF k,c 2 . Here lies a technical discrepancy between the two models : in our model, ε α k,c is fixed and β c,i is random, while in the NormFinder model, ε α k,c is random and β c,i , while not specified, is also random. Besides, the V(β c,i ) term is difficult to obtain, due to the non-linear relationship between β c,i and X k,c,i ; it depends on all the σ NF k,c 2 values, but also the δ k and µ k .
Neglecting this, we can write which gives |α k,c − α k | = 1 2 δ k and in turn, by letting G c =