
Corruption of the Pearson correlation coefficient by measurement error and its estimation, bias, and correction under different error models

Abstract

Correlation coefficients are abundantly used in the life sciences. Their use ranges from simple exploratory analysis and the construction of association networks for visualization to serving as basic ingredients of sophisticated multivariate data analysis methods. It is therefore important to have reliable estimates for correlation coefficients. In modern life sciences, comprehensive measurement techniques are used to measure metabolites, proteins, gene expressions and other types of data, and all these measurement techniques have errors. Whereas in the old days, with simple measurements, the errors were also simple, that is no longer the case: errors are heterogeneous, non-constant and not independent, and this seriously hampers the quality of the estimated correlation coefficients. We will discuss the different types of errors present in modern comprehensive life science data and show with theory, simulations and real-life data how these affect the correlation coefficients. We will also briefly discuss ways to improve the estimation of such coefficients.

Introduction

The concept of correlation and the correlation coefficient dates back to Bravais1 and Galton2 and found its modern formulation in the work of Fisher and Pearson3,4, whose product moment correlation coefficient \(\rho \) has become the most widely used measure of the linear dependence between two random variables. From the pioneering work of Galton on heredity, the use of correlation (or co-relation, as it was first termed) spread to virtually all fields of research, and results based on it pervade the scientific literature.

Correlations are generally used to quantify, visualize and interpret bivariate (linear) relationships among measured variables. They are the building blocks of virtually all multivariate methods such as Principal Component Analysis (PCA5,6,7), Partial Least Squares regression and Canonical Correlation Analysis (CCA8), which are used to reduce, analyze and interpret high-dimensional omics data sets, and they are often the starting point for the inference of biological networks such as metabolite-metabolite association networks9,10, gene regulatory networks11,12 and co-expression networks13,14.

Fundamentally, correlation and correlation analysis are pivotal for understanding biological systems and the physical world. With the increase of comprehensive measurements in the life sciences (liquid-chromatography and gas-chromatography mass-spectrometry (MS) and nuclear magnetic resonance (NMR) in metabolomics and proteomics; RNA-sequencing in transcriptomics), correlations are used as a first tool for visualization and interpretation, possibly after selection of a threshold to filter the correlations. However, the complexity and the difficulty of estimating correlation coefficients are not fully acknowledged.

Measurement error is intrinsic to every experimental technique and measurement platform, be it a simple ruler, a gene sequencer or a complicated array of detectors in a high-energy physics experiment, and already in the early days of statistics it was known that measurement error can bias the estimation of correlations15. This bias was first called attenuation because it was found that under the error condition considered, the correlation was attenuated towards zero. The attenuation bias has been known and discussed in some research fields16,17,18,19 but it seems to be totally neglected in modern omics-based science. Moreover, contemporary comprehensive omics measurement techniques have far more complex measurement error structures than the simple ones considered in the past on which early results were based.

In this paper we intend to show the impact of measurement errors on the quality of calculated correlation coefficients, and we do this for several reasons. First, to make the omics community aware of the problem. Second, to bring the theory of correlation up to date with current omics measurements by taking more realistic measurement error models into account in the calculation of the correlation coefficient. Third, to propose ways to alleviate the distortion in the estimation of correlation induced by measurement error. We do this by deriving analytical expressions supported by simulations and simple illustrations, and we also use real-life metabolomics data to illustrate our findings.

Measurement Error Models

We start with the simple case of having two correlated biological entities \({x}_{0}\) and \({y}_{0}\) which are randomly varying in a population. This may, e.g., be concentrations of two blood metabolites in a cohort of persons or gene-expressions of two genes in cancer tissues. We will assume that these variables are normally distributed

$$(\begin{array}{l}{x}_{0}\\ {y}_{0}\end{array})\sim N({\boldsymbol{\mu }},{{\boldsymbol{\Sigma }}}_{0})$$
(1)

with underlying mean

$${\boldsymbol{\mu }}=(\begin{array}{l}{\mu }_{{x}_{0}}\\ {\mu }_{{y}_{0}}\end{array})$$
(2)

and variance-covariance  matrix

$${{\boldsymbol{\Sigma }}}_{0}=(\begin{array}{cc}{\sigma }_{{x}_{0}}^{2} & {\sigma }_{{x}_{0}{y}_{0}}\\ {\sigma }_{{x}_{0}{y}_{0}} & {\sigma }_{{y}_{0}}^{2}\end{array}).$$
(3)

Under this model the variance components \({\sigma }_{{x}_{0}}^{2}\) and \({\sigma }_{{y}_{0}}^{2}\) describe the biological variability for \({x}_{0}\) and \({y}_{0}\), respectively. The correlation \({\rho }_{0}\), between \({x}_{0}\) and \({y}_{0}\) is given by

$${\rho }_{0}=\frac{{\sigma }_{{x}_{0}{y}_{0}}}{\sqrt{{\sigma }_{{x}_{0}}^{2}{\sigma }_{{y}_{0}}^{2}}}.$$
(4)

We refer to \({\rho }_{0}\) as the true correlation.

Whatever the nature of the variables \({x}_{0}\) and \({y}_{0}\) and whatever the experimental technique used to measure them, there is always a random error component (also referred to as noise or uncertainty) associated with the measurement procedure. This random error is by its own nature not reproducible (in contrast with systematic error, which is reproducible and can be corrected for) but can be modeled, i.e. described, in a statistical fashion. Such models have been developed and applied in virtually every area of science and technology and can be used to adjust for measurement errors or to describe the bias they introduce. The measured variables will be indicated by \(x\) and \(y\) to distinguish them from \({x}_{0}\) and \({y}_{0}\), their errorless counterparts.

The correlation coefficient \({\rho }_{0}\) is sought to be estimated from these measured data. Assuming that \(N\) samples are taken, the sample correlation \({r}_{N}\) is calculated as

$${r}_{N}=\frac{{\sum }_{i=1}^{N}\,({x}_{i}-\bar{x})({y}_{i}-\bar{y})}{(N-1){s}_{x}{s}_{y}},$$
(5)

where \((\bar{x},\bar{y})\) is the sample mean over \(N\) observations and \({s}_{x},{s}_{y}\) are the usual sample standard deviation estimators. This sample correlation is used as a proxy of \({\rho }_{0}\). The population value of this sample correlation is

$$\rho =\frac{{\rm{E}}[xy]-{\rm{E}}[x]\,{\rm{E}}[y]}{\sqrt{{\rm{E}}[{x}^{2}]-{\rm{E}}{[x]}^{2}}\sqrt{{\rm{E}}[{y}^{2}]-{\rm{E}}{[y]}^{2}}},$$
(6)

and it also holds that

$$\mathop{\mathrm{lim}}\limits_{N\to \infty }\,{r}_{N}=\rho .$$
(7)

We will call \(\rho \) the expected correlation. Ideally, \({\rho }_{0}=\rho \) but this is unfortunately not always the case. In plain words: certain measurement errors do not cancel out if the number of samples increases.
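The distinction between \({r}_{N}\), \(\rho \) and \({\rho }_{0}\) can be illustrated numerically. The snippet below is a minimal sketch using numpy; the parameter values (unit variances, \({\rho }_{0}=0.8\)) are illustrative choices. For error-free data, \({r}_{N}\) converges to \({\rho }_{0}\) as \(N\) grows, in line with Eq. (7):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true parameters (Eq. 3): biological variances and covariance.
sigma2_x0, sigma2_y0, sigma_x0y0 = 1.0, 1.0, 0.8
rho0 = sigma_x0y0 / np.sqrt(sigma2_x0 * sigma2_y0)   # Eq. (4): true correlation

cov = np.array([[sigma2_x0, sigma_x0y0],
                [sigma_x0y0, sigma2_y0]])

# Error-free measurements: the sample correlation r_N converges to rho0.
for N in (10, 100, 100_000):
    x0, y0 = rng.multivariate_normal([0.0, 0.0], cov, size=N).T
    r_N = np.corrcoef(x0, y0)[0, 1]                  # sample correlation, Eq. (5)
    print(N, round(r_N, 3))
```

For small \(N\) the estimate fluctuates noticeably around 0.8; at \(N={10}^{5}\) it is essentially indistinguishable from \({\rho }_{0}\).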

In the following section we will introduce three error models and will show with both simulated and real data how measurement error impacts the estimation of the Pearson correlation coefficient. We will focus mainly on \({\rho }_{0}\) and \(\rho \).

Additive error

The simplest error model is the additive one, where the measured entities \(x\) and \(y\) are modeled as

$$\{\begin{array}{l}x={x}_{0}+{\varepsilon }_{a{u}_{x}}\\ y={y}_{0}+{\varepsilon }_{a{u}_{y}}\end{array}$$
(8)

where it is assumed that the error components \({\varepsilon }_{a{u}_{x}}\) and \({\varepsilon }_{a{u}_{y}}\) are independently normally distributed around zero with variance \({\sigma }_{a{u}_{x}}^{2}\) and \({\sigma }_{a{u}_{y}}^{2}\) and are also independent from \({x}_{0}\) and \({y}_{0}\). The subscripts \(a{u}_{x}\), \(a{u}_{y}\) stand for additive uncorrelated error (\(\varepsilon \)) on variables \(x\) and \(y\).

Variables \(x\) and \(y\) represent measured quantities accessible to the experimenter. This error model describes the case in which the measurement error causes within-sample variability: \(p\) measurement replicates \({x}_{i,1},{x}_{i,2},\ldots {x}_{i,p}\) of observation \({x}_{i}\) of variable \(x\) will all have slightly different values due to the random fluctuation of the error component \({\varepsilon }_{a{u}_{x}}\), and the extent of the variability among the replicates depends on the magnitude of the error variance \({\sigma }_{a{u}_{x}}^{2}\) (and similarly for the \(y\) variable). This can be seen in Fig. 1A, which shows that in the presence of measurement error (i.e. \({\sigma }_{a{u}_{x}}^{2},{\sigma }_{a{u}_{y}}^{2} > 0\)) the two variables \(x\) and \(y\) are more dispersed. Due to the measurement error, the expected correlation coefficient \(\rho \) is always biased towards zero, i.e. \(|\rho | < |{\rho }_{0}|\), as already shown by Spearman15 (see Fig. 1B), who also provided an analytical expression for the attenuation of the expected correlation coefficient as a function of the error components (a modern treatment can be found in reference20):

$$\rho =A{\rho }_{0},$$
(9)

where

$$A=\frac{1}{\sqrt{(1+\frac{{\sigma }_{a{u}_{x}}^{2}}{{\sigma }_{{x}_{0}}^{2}})(1+\frac{{\sigma }_{a{u}_{y}}^{2}}{{\sigma }_{{y}_{0}}^{2}})}}.$$
(10)
Figure 1

(A) Correlation plot of two variables \(x\) and \(y\) (\({\sigma }_{{x}_{0}}^{2}={\sigma }_{{y}_{0}}^{2}=1\)) generated without (\({\sigma }_{a{u}_{x}}^{2}={\sigma }_{a{u}_{y}}^{2}=0\)) and with uncorrelated additive error (\({\sigma }_{a{u}_{x}}^{2}={\sigma }_{a{u}_{y}}^{2}=0.75\)) with underlying true correlation \({\rho }_{0}=0.8\) (model 8). (B) Distribution of the sample correlation coefficient for different levels of measurement error (\({\sigma }_{au}^{2}={\sigma }_{a{u}_{x}}^{2}={\sigma }_{a{u}_{y}}^{2}\)) for a true correlation \({\rho }_{0}=0.8\). (C) The attenuation coefficient \(A\) from Eq. (10) as a function of the measurement error for different levels of the variance \({\sigma }^{2}={\sigma }_{{x}_{0}}^{2}={\sigma }_{{y}_{0}}^{2}\) of the variables \({x}_{0}\) and \({y}_{0}\). See Material and Methods section 6.5.1 for details on the simulations.

Equation (9) implies that in the presence of measurement error the expected correlation is different from the true correlation \({\rho }_{0}\) which is sought to be estimated. The attenuation \(A\) is always strictly smaller than 1 and is a decreasing function of the size of the measurement error relative to the biological variation (see Fig. 1C), as can be seen from Eq. (10). The attenuation of the expected correlation, despite being known since 1904, has only sporadically resurfaced in the statistical literature in psychology, epidemiology and the behavioral sciences (where it is known as attenuation due to intra-person or intra-individual variability, see19 and references therein) but has been largely neglected in the life sciences, despite its relevance.
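Spearman's attenuation (Eqs. (9) and (10)) is easy to verify by simulation. Below is a minimal numpy sketch using illustrative values (\({\sigma }_{{x}_{0}}^{2}={\sigma }_{{y}_{0}}^{2}=1\), \({\rho }_{0}=0.8\), \({\sigma }_{au}^{2}=0.75\), matching the settings of Fig. 1):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
rho0 = 0.8
# Biological (error-free) signal with unit variances.
x0, y0 = rng.multivariate_normal([0, 0], [[1, rho0], [rho0, 1]], size=N).T

s2_au = 0.75                                 # additive uncorrelated error variance
x = x0 + rng.normal(0, np.sqrt(s2_au), N)    # model (8)
y = y0 + rng.normal(0, np.sqrt(s2_au), N)

A = 1 / np.sqrt((1 + s2_au) * (1 + s2_au))   # Eq. (10) with sigma_x0^2 = sigma_y0^2 = 1
r = np.corrcoef(x, y)[0, 1]
print(round(A * rho0, 3), round(r, 3))       # theoretical vs. simulated correlation
```

The simulated correlation matches the attenuated value \(A{\rho }_{0}\approx 0.457\), not the true \({\rho }_{0}=0.8\).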

The error model (8) can be extended to include a correlated error term \({\varepsilon }_{ac}\)

$$\{\begin{array}{l}x={x}_{0}+{\varepsilon }_{a{u}_{x}}+{\varepsilon }_{ac}\\ y={y}_{0}+{\varepsilon }_{a{u}_{y}}\pm {\varepsilon }_{ac}\end{array}$$
(11)

with \({\varepsilon }_{ac}\) normally distributed around zero with variance \({\sigma }_{ac}^{2}\); the correlated error term takes on exactly the same value for \(x\) and \(y\) in a given sample. The ‘±’ models the sign of the error correlation. When \({\varepsilon }_{ac}\) has a positive sign in both \(x\) and \(y\) the error is positively correlated; if the sign is discordant the error is negatively correlated. The subscript ac is used to indicate additive correlated error. The variance for \(x\) is given by

$${\sigma }_{x}^{2}={\sigma }_{{x}_{0}}^{2}+{\sigma }_{a{u}_{x}}^{2}+{\sigma }_{ac}^{2}$$
(12)

and likewise for the variable \(y\). In general, additive correlated error can have different causes depending on the type of instruments and measurement protocols used. For example, in transcriptomics, metabolomics and proteomics, samples usually have to be pretreated (sample work-up) prior to the actual instrumental analysis. Any error in a sample work-up step may affect all measured entities in a similar way21. Another example is the use of internal standards for quantification: any error in the amount of internal standard added may also affect all measured entities in a similar way. Hence, in both cases this leads to (positively) correlated measurement error. In some cases in metabolomics and proteomics the data are preprocessed using deconvolution tools, whereby two co-eluting peaks are mathematically separated and quantified. Since the total area under the curve is constant, a (positive) error in one of the deconvoluted peaks is compensated by a (negative) error in the second peak; this may give rise to negatively correlated measurement error.
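The variance decomposition of Eq. (12) can be checked numerically under model (11). A sketch with numpy and illustrative variance components:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500_000
s2_x0, s2_au, s2_ac = 1.0, 0.3, 0.2          # illustrative variance components

x0 = rng.normal(0, np.sqrt(s2_x0), N)        # biological signal
eps_au = rng.normal(0, np.sqrt(s2_au), N)    # additive uncorrelated error
eps_ac = rng.normal(0, np.sqrt(s2_ac), N)    # shared (correlated) error term
x = x0 + eps_au + eps_ac                     # model (11), x part

# Eq. (12): the three variance components simply add up.
print(round(x.var(), 3), s2_x0 + s2_au + s2_ac)
```

The sample variance of \(x\) reproduces the sum \({\sigma }_{{x}_{0}}^{2}+{\sigma }_{a{u}_{x}}^{2}+{\sigma }_{ac}^{2}=1.5\) up to Monte Carlo error.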

To show the effect of additive uncorrelated measurement error we consider the concentration profiles of three hypothetical metabolites P1, P2 and P3, simulated using a simple dynamic model (see Fig. 2A and Section 6.5.2), to which additive uncorrelated measurement error is added before calculating the pairwise correlations among P1, P2 and P3. Also in this case the magnitude of the correlation is attenuated, and the attenuation increases with the error variance (see Fig. 2B).

Figure 2

Consequences of measurement error when using correlation in systems biology. (A) Time concentration profile of three metabolites P1, P2 and P3 generated through a simple enzymatic metabolic model; 100 profiles are generated by randomly varying the kinetic parameters defining the model and sampled at time 0.4 (a.u.). (B) Average pairwise correlation of P1, P2 and P3 as a function of the variance of the additive uncorrelated error. (C) Inference of a metabolite-metabolite correlation network: two metabolites are associated if their correlation is above 0.623 (see threshold in B). The increasing level of measurement error hampers the network inference (compare the different panels). See Material and Methods section 6.5.2 for details on the simulations.

This has serious repercussions when correlations are used for the definition of association networks, as commonly done in systems biology and functional genomics10,22: measurement error drives correlations towards zero and this impacts network reconstruction. If a threshold of 0.6 is imposed to discriminate between correlated and uncorrelated variables, as usually done in metabolomics23, an error variance of around 15% of the biological variation (see Fig. 2B, point where the correlation crosses the threshold) will attenuate the correlation to the point that metabolites will be deemed not to be associated even if they are biologically correlated, leading to very different metabolite association networks (see Fig. 2C).
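The effect of attenuation on thresholded networks can be sketched directly: with a true correlation just above the cut-off, a modest additive error variance pushes the expected correlation below it and the edge is lost. A toy numpy example (threshold and variances are illustrative, not the paper's simulation):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
rho0, threshold = 0.65, 0.6              # true correlation just above the cut-off

x0, y0 = rng.multivariate_normal([0, 0], [[1, rho0], [rho0, 1]], size=N).T

rs = {}
for s2_au in (0.0, 0.05, 0.15, 0.30):    # additive error variance (sigma_0^2 = 1)
    x = x0 + rng.normal(0, np.sqrt(s2_au), N)   # model (8)
    y = y0 + rng.normal(0, np.sqrt(s2_au), N)
    rs[s2_au] = np.corrcoef(x, y)[0, 1]
    print(f"error var {s2_au:.2f}: r = {rs[s2_au]:.3f}  edge kept: {rs[s2_au] > threshold}")
```

With no error the edge survives; once the error variance reaches roughly 15% of the biological variance, the attenuated correlation \({\rho }_{0}/(1+{\sigma }_{au}^{2})\) drops below 0.6 and the edge disappears, exactly the mechanism behind Fig. 2C.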

Multiplicative error

In many experimental situations it is observed that the measurement error is proportional to the magnitude of the measured signal; when this happens the measurement error is said to be multiplicative. The model for variables sampled in the presence of multiplicative measurement error is

$$\{\begin{array}{l}x={x}_{0}(1+{\varepsilon }_{m{u}_{x}}+{\varepsilon }_{mc})\\ y={y}_{0}(1+{\varepsilon }_{m{u}_{y}}\pm {\varepsilon }_{mc})\end{array}$$
(13)

where \({x}_{0}\), \({y}_{0}\), \({\varepsilon }_{m{u}_{x}}\), \({\varepsilon }_{m{u}_{y}}\) and \({\varepsilon }_{mc}\) have the same distributional properties as before in the additive error case, and the last three terms represent the multiplicative uncorrelated errors in \(x\) and \(y\), respectively, and the multiplicative correlated error.

The characteristics of the multiplicative error and the variance of the measured entities \({\sigma }_{x}^{2}\) depend on the level \({\mu }_{{x}_{0}}\) of the signal to be measured (for a derivation of Eq. (14) see Section 6.1.1):

$${\sigma }_{x}^{2}={\sigma }_{{x}_{0}}^{2}+({\sigma }_{{x}_{0}}^{2}+{\mu }_{{x}_{0}}^{2})({\sigma }_{m{u}_{x}}^{2}+{\sigma }_{mc}^{2}),$$
(14)

while in the additive case the standard deviation is similar for different concentrations and does not depend explicitly on the signal intensity, as shown in Eq. (12). A similar equation holds for the variable y.
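Equation (14) can be verified by simulating model (13) for the \(x\) variable alone. A numpy sketch with illustrative values of \({\mu }_{{x}_{0}}\) and the error variances:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 1_000_000
mu_x0, s2_x0 = 5.0, 1.0                       # illustrative signal level and variance
s2_mu, s2_mc = 0.01, 0.02                     # multiplicative error variances

x0 = rng.normal(mu_x0, np.sqrt(s2_x0), N)
eps_mu = rng.normal(0, np.sqrt(s2_mu), N)     # uncorrelated multiplicative error
eps_mc = rng.normal(0, np.sqrt(s2_mc), N)     # correlated multiplicative error
x = x0 * (1 + eps_mu + eps_mc)                # model (13), x part

var_theory = s2_x0 + (s2_x0 + mu_x0**2) * (s2_mu + s2_mc)   # Eq. (14)
print(round(x.var(), 3), round(var_theory, 3))
```

Note the explicit dependence on \({\mu }_{{x}_{0}}^{2}\): doubling the signal level roughly quadruples the error contribution, unlike the additive case of Eq. (12).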

It has been observed that multiplicative errors often arise from different procedural steps such as sample aliquoting24: this is the case for deep sequencing experiments, where multiplicative error is possibly introduced by pre-processing steps like linker ligation and PCR amplification, which may vary from tag to tag and from sample to sample25. In other cases the multiplicative error arises from the distributional properties of the signal, as in experiments where the measurement comes down to counts, such as RNA fragments in an RNA-seq experiment or numbers of ions in a mass spectrometer, which are governed by Poisson distributions for which the variance is equal to the mean. As another example, in NMR spectroscopy measured intensities are affected by the sample magnetization conditions: fluctuations in the external electromagnetic field or instability of the rf pulses affect the signal in a fashion that is proportional to the signal itself26.

A multiplicative error distorts correlations and this affects the results of any data analysis approach based on correlations. To show the effect of multiplicative error we consider the analysis of a metabolomic data set simulated from real mass-spectrometry (MS) data, to which extra uncorrelated and correlated multiplicative measurement errors have been added. As can be seen in Fig. 3A, the addition of error affects the underlying data structure: in the error-free data only a subset of the measured variables contributes to explaining the pattern in a low-dimensional projection of the data, i.e. has PCA loadings substantially different from zero (Fig. 3B). The addition of extra multiplicative error perturbs the loading structure to the point that all variables contribute equally to the model (Fig. 3C), obscuring the real data structure and hampering the interpretation of the PCA model. This is not necessarily caused by the multiplicative nature of the error, but by the correlated error part. Since the term \({\varepsilon }_{mc}\) is common to all variables, it introduces the same amount of correlation among all the variables, and this leads to all the variables contributing similarly to the latent vector (principal component). One may also observe that the variation explained by the first principal component increases when the correlated measurement error is added.

Figure 3

Consequences of multiplicative (correlated and uncorrelated) measurement error for data analysis. (A) Scatter plot of the overlayed view of the first two components of two PCA models of simulated data sets; one without multiplicative error and one with multiplicative error. For visualization purposes, the scores are plotted in the same graph, but the subspaces spanned by the first two principal components for the two data sets are of course different. The labels on both axes also present the percentage explained variation for the two analyses. (B) Loading plot for the error free data. (C) Loading plot for the data with multiplicative error. See Material and Methods section 6.5.3 for details on the simulations.
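The inflating effect of correlated multiplicative error on the first principal component can be reproduced on synthetic data: a single \({\varepsilon }_{mc}\) shared by all variables (model (13)) inflates all pairwise correlations and the share of variance captured by PC1. The example below is a toy construction, not the paper's simulation; the factor structure, means and error size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 20
# Illustrative error-free data: nonzero mean signals, and only the first 5
# variables share a common biological factor (structured loadings).
factor = rng.normal(size=(n, 1))
X0 = 10 + rng.normal(size=(n, p))
X0[:, :5] += 2 * factor

def pc1_share(X):
    # Fraction of total variance captured by the first principal component.
    ev = np.linalg.eigvalsh(np.cov(X, rowvar=False))  # ascending eigenvalues
    return ev[-1] / ev.sum()

# Correlated multiplicative error: one eps_mc drawn per sample and shared by
# all p variables, as in model (13).
eps_mc = rng.normal(0, 0.4, size=(n, 1))
X = X0 * (1 + eps_mc)

print(round(pc1_share(X0), 2), round(pc1_share(X), 2))
```

Because \({\varepsilon }_{mc}\) multiplies nonzero-mean signals, it adds a common rank-one component to the covariance matrix: PC1's share of the variance jumps, and its loadings spread over all variables instead of the five truly structured ones.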

Realistic error

The measurement process usually consists of different procedural steps, and each step can be viewed as a different source of measurement error with its own characteristics; together they give rise to both additive and multiplicative error components, as is the case for comprehensive omics measurements27. The model for this case is:

$$\{\begin{array}{l}x={x}_{0}(1+{\varepsilon }_{m{u}_{x}}+{\varepsilon }_{mc})+{\varepsilon }_{a{u}_{x}}+{\varepsilon }_{ac}\\ y={y}_{0}(1+{\varepsilon }_{m{u}_{y}}\pm {\varepsilon }_{mc})+{\varepsilon }_{a{u}_{y}}\pm {\varepsilon }_{ac}\end{array}$$
(15)

where all errors have been introduced before and are all assumed to be independent of each other and independent of the true (biological) signals (\({x}_{0}\) and \({y}_{0}\)).

This realistic error model has a multiplicative as well as an additive component and also accommodates correlated and uncorrelated error. It is an extension of a much-used error model for analytical chemical data which only contains uncorrelated error28. From model (15) it follows that the error changes not only quantitatively but also qualitatively with changing signal intensity: the importance of the multiplicative component increases when the signal intensity increases, whereas the relative contribution of the additive error component increases when the signal decreases.
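The qualitative behavior of model (15) can be illustrated by fixing the signal at different levels and inspecting the standard deviation of the measured values: at low intensity the additive component dominates and the relative error is large, while at high intensity the relative error approaches the standard deviation of the multiplicative component. A numpy sketch with illustrative variances:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200_000
s2_au, s2_mu = 4.0, 0.01       # illustrative additive and multiplicative variances

rel_sd = []
for mu in (2.0, 20.0, 200.0):  # low, mid and high signal intensity
    x0 = np.full(N, mu)        # fix the signal to isolate the measurement error
    x = x0 * (1 + rng.normal(0, np.sqrt(s2_mu), N)) + rng.normal(0, np.sqrt(s2_au), N)
    rel_sd.append(x.std() / mu)
    print(f"mu = {mu:6.1f}   sd = {x.std():6.2f}   relative sd = {rel_sd[-1]:.3f}")
```

The relative error falls from about 100% at low intensity (additive-dominated) to about 10% at high intensity, where it levels off at \(\sqrt{{\sigma }_{mu}^{2}}\), in line with the variance \({\mu }^{2}{\sigma }_{mu}^{2}+{\sigma }_{au}^{2}\) implied by model (15) for a fixed signal.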

Since most measurements do not usually fall at the extremes of the dynamic range of the instruments used, the situation in which both additive and multiplicative error are important is realistic. For example, this is surely the case for comprehensive NMR and mass spectrometry measurements, where multiplicative errors are due to sample preparation and carry-over effects (in the case of MS) and the additive error is due to thermal error in the detectors29. To illustrate this we consider an NMR experiment where different numbers of technical replicates are measured for five samples (Fig. 4A,B). We are interested in establishing the correlation patterns across the (binned) resonances. For the sake of simplicity we focus on two resonances, binned at 3.22 and 4.98 ppm. If one calculates the correlation using only one (randomly chosen) replicate per sample, the resulting correlation can be anywhere between −1 and 1 (see Fig. 4C.1). The variability reduces considerably if more replicates are taken and averaged before calculating the correlation (see Fig. 4C), but there is still a rather large variation, induced by the limited sample size. Averaging across the technical replicates reduces variability among the sample means; however, this is not accompanied by an equal reduction in the variability of the correlation estimate. This is because the error structure is not taken into account in the calculation of the correlation coefficient.
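A simplified numerical analogue of this replicate-averaging experiment (additive technical error only, illustrative variances; not the actual NMR data) shows the same behavior: averaging more replicates narrows the spread of the estimated correlation, but with only five samples a substantial spread remains:

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples, rho0 = 5, 0.8        # five biological samples, as in the NMR example
s2_e = 0.5                      # technical (replicate) error variance, assumed additive

x0, y0 = rng.multivariate_normal([0, 0], [[1, rho0], [rho0, 1]], size=n_samples).T

def corr_of_replicate_means(p):
    # Average p technical replicates per sample, then correlate the means.
    x = (x0[:, None] + rng.normal(0, np.sqrt(s2_e), (n_samples, p))).mean(axis=1)
    y = (y0[:, None] + rng.normal(0, np.sqrt(s2_e), (n_samples, p))).mean(axis=1)
    return np.corrcoef(x, y)[0, 1]

spread = []
for p in (1, 10, 100):
    rs = np.array([corr_of_replicate_means(p) for _ in range(2000)])
    spread.append(rs.std())
    print(f"p = {p:3d} replicates   sd of r = {spread[-1]:.2f}")
```

The spread shrinks as \(p\) grows because averaging divides the technical error variance by \(p\), but it cannot remove the variability coming from the five biological samples themselves.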

Figure 4

(A) PCA plot of 5 different samples of fish extract measured with technical replicates (10×) using NMR29. (B) Overlap of the average binned NMR spectra of the 5 samples: the two resonances whose correlation is investigated are highlighted (3.23 and 4.98 ppm). (C) Distribution of the correlation coefficient between the two resonances calculated, taking as input the average over different numbers of technical replicates (see inserts). See Material and Methods section 6.5.4 for more details on the estimation procedure.

Estimation of Pearson’s Correlation Coefficient in Presence of Measurement Error

In the ideal case of an error-free measurement, where the only variability is due to intrinsic biological variation, \(\rho \) coincides with the true correlation \({\rho }_{0}\). If additive uncorrelated error is present, then \(\rho \) is given by Eqs. (9) and (10), which explicitly take the error component into account; in that case \(|\rho | < |{\rho }_{0}|\).

In the next Section we will derive analytical expressions, akin to Eqs. (9) and (10), for the correlation for variables sampled with measurement error (additive, multiplicative and realistic) as introduced in Section 2.

Before moving on, we define more specifically the error components. The error terms in models (11), (13) and (15) are assumed to have the following distributional properties

$$(\begin{array}{c}{\varepsilon }_{a{u}_{x}}\\ {\varepsilon }_{a{u}_{y}}\end{array})\sim N(0,{{\boldsymbol{\Sigma }}}_{A})\,{\rm{and}}\,(\begin{array}{c}{\varepsilon }_{m{u}_{x}}\\ {\varepsilon }_{m{u}_{y}}\end{array})\sim N(0,{{\boldsymbol{\Sigma }}}_{M})$$
(16)

with variance-covariance matrices

$${{\boldsymbol{\Sigma }}}_{A}=(\begin{array}{cc}{\sigma }_{a{u}_{x}}^{2} & 0\\ 0 & {\sigma }_{a{u}_{y}}^{2}\end{array})\,{\rm{and}}\,{{\boldsymbol{\Sigma }}}_{M}=(\begin{array}{cc}{\sigma }_{m{u}_{x}}^{2} & 0\\ 0 & {\sigma }_{m{u}_{y}}^{2}\end{array}),$$
(17)

and

$${\varepsilon }_{mc}\sim N(0,{\sigma }_{mc}^{2})\,{\rm{and}}\,{\varepsilon }_{ac}\sim N(0,{\sigma }_{ac}^{2}).$$
(18)

From definitions (16), (17) and (18) it follows that:

  1. The expected value of the errors \({\rm{E}}[{\varepsilon }_{\alpha }]\) is zero:

    $${\rm{E}}[{\varepsilon }_{\alpha }]=0\,\,\forall \,\alpha \,\,{\rm{in}}\,\{{\rm{au}},{\rm{ac}},{\rm{mu}},{\rm{mc}}\}.$$
    (19)
  2. The covariance between \({x}_{0}\) (\({y}_{0}\)) and the error terms is zero because \({x}_{0}\) (\({y}_{0}\)) and the errors are independent:

    $${\rm{E}}[{x}_{0}{\varepsilon }_{\alpha }]-{\rm{E}}[{x}_{0}]\,{\rm{E}}[{\varepsilon }_{\alpha }]=0\,\,\forall \,\alpha \,\,{\rm{in}}\,\{{\rm{au}},{\rm{ac}},{\rm{mu}},{\rm{mc}}\}.$$
    (20)
  3. The covariance between different error components is zero because the errors are independent of each other:

$${\rm{E}}[{\varepsilon }_{\alpha }{\varepsilon }_{\alpha ^{\prime} }]-{\rm{E}}[{\varepsilon }_{\alpha }]\,{\rm{E}}[{\varepsilon }_{\alpha ^{\prime} }]=0\,\,\forall \,\alpha \ne \alpha ^{\prime} \,{\rm{in}}\,\{{\rm{au}},{\rm{ac}},{\rm{mu}},{\rm{mc}}\}.$$
(21)

The Pearson correlation in the presence of additive measurement error

We show here a detailed derivation of the correlation among two variables \(x\) and \(y\) sampled under the additive error model (11). The variance for variable \(x\) (similar considerations hold for \(y\)) is given by

$${\rm{var}}(x)={\rm{E}}[{x}^{2}]-{\rm{E}}{[x]}^{2}$$
(22)

where

$${\rm{E}}[x]={\rm{E}}[{x}_{0}+{\varepsilon }_{a{u}_{x}}+{\varepsilon }_{ac}]={\mu }_{{x}_{0}}.$$
(23)

and

$$\begin{array}{rcl}{\rm{E}}[{x}^{2}] & = & {\rm{E}}[{x}_{0}^{2}+{\varepsilon }_{a{u}_{x}}^{2}+{\varepsilon }_{ac}^{2}+2{x}_{0}{\varepsilon }_{a{u}_{x}}+2{x}_{0}{\varepsilon }_{ac}+2{\varepsilon }_{a{u}_{x}}{\varepsilon }_{ac}]\\ & = & {\sigma }_{{x}_{0}}^{2}+{\mu }_{{x}_{0}}^{2}+{\sigma }_{a{u}_{x}}^{2}+{\sigma }_{ac}^{2}.\end{array}$$
(24)

It follows that

$${\rm{var}}(x)={\sigma }_{{x}_{0}}^{2}+{\sigma }_{a{u}_{x}}^{2}+{\sigma }_{ac}^{2}.$$
(25)

The covariance of \(x\) and \(y\) is

$${\rm{cov}}(x,y)={\rm{E}}[xy]-{\rm{E}}[x]\,{\rm{E}}[y]$$
(26)

with

$$\begin{array}{rcl}{\rm{E}}[xy] & = & {\rm{E}}[{x}_{0}{y}_{0}+{x}_{0}{\varepsilon }_{a{u}_{y}}\pm {x}_{0}{\varepsilon }_{ac}+{\varepsilon }_{a{u}_{x}}{y}_{0}\\ & & +\,{\varepsilon }_{a{u}_{x}}{\varepsilon }_{a{u}_{y}}\pm {\varepsilon }_{a{u}_{x}}{\varepsilon }_{ac}+{\varepsilon }_{ac}{y}_{0}+{\varepsilon }_{ac}{\varepsilon }_{a{u}_{y}}\pm {\varepsilon }_{ac}^{2}].\end{array}$$
(27)

Considering (20) and (21), Eq. (27) reduces to

$${\rm{E}}[xy]={\rm{E}}[{x}_{0}{y}_{0}]\pm {\rm{E}}[{\varepsilon }_{ac}^{2}]$$
(28)

with

$$\begin{array}{rcl}{\rm{E}}[{x}_{0}{y}_{0}] & = & {\rm{cov}}({x}_{0},{y}_{0})+{\rm{E}}[{x}_{0}]\,{\rm{E}}[{y}_{0}]\\ & = & {\sigma }_{{x}_{0}{y}_{0}}+{\mu }_{{x}_{0}}{\mu }_{{y}_{0}}\end{array}$$
(29)

and

$$\pm {\rm{E}}[{\varepsilon }_{ac}^{2}]=\pm \,{\sigma }_{ac}^{2},$$
(30)

with ± depending on the sign of the measurement error correlation. From Eqs. (23), (28), (29) and (30) it follows

$${\rm{cov}}(x,y)={\sigma }_{{x}_{0}{y}_{0}}\pm {\sigma }_{ac}^{2}.$$
(31)

Plugging (25) and (31) into (6) and defining the attenuation coefficient \({A}^{a}\)

$${A}^{a}=\frac{1}{\sqrt{1+\frac{{\sigma }_{a{u}_{x}}^{2}}{{\sigma }_{{x}_{0}}^{2}}+\frac{{\sigma }_{ac}^{2}}{{\sigma }_{{x}_{0}}^{2}}}\sqrt{1+\frac{{\sigma }_{a{u}_{y}}^{2}}{{\sigma }_{{y}_{0}}^{2}}+\frac{{\sigma }_{ac}^{2}}{{\sigma }_{{y}_{0}}^{2}}}}=\frac{1}{\sqrt{1+{\xi }_{x}^{2}+{\gamma }_{x}^{2}}\sqrt{1+{\xi }_{y}^{2}+{\gamma }_{y}^{2}}},$$
(32)

where \({\xi }_{x}^{2}={\sigma }_{a{u}_{x}}^{2}/{\sigma }_{{x}_{0}}^{2}\), \({\xi }_{y}^{2}={\sigma }_{a{u}_{y}}^{2}/{\sigma }_{{y}_{0}}^{2}\), \({\gamma }_{x}^{2}={\sigma }_{ac}^{2}/{\sigma }_{{x}_{0}}^{2}\) and \({\gamma }_{y}^{2}={\sigma }_{ac}^{2}/{\sigma }_{{y}_{0}}^{2}\); the superscript \(a\) in \({A}^{a}\) stands for additive.

The Pearson correlation in the presence of additive measurement error is obtained as:

$$\rho ={A}^{a}({\rho }_{0}\pm {\gamma }_{x}{\gamma }_{y})$$
(33)

where the sign ± signifies positively and negatively correlated error.

The attenuation coefficient \({A}^{a}\) is a decreasing function of the measurement error ratios, that is, the ratios of the variances of the uncorrelated and the correlated error to the variance of the true signal. Compared to Eq. (9), formula (33) contains an extra additive term, \({\gamma }_{x}{\gamma }_{y}\), expressing the impact of the correlated measurement error relative to the biological variation. In the presence of only uncorrelated error (i.e. \({\sigma }_{ac}^{2}=0\)), Eq. (33) reduces to Spearman's formula for the correlation attenuation given by (9) and (10). As previously discussed, in this case the correlation coefficient is always biased towards zero (attenuated).
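Equation (33) can be verified by Monte Carlo simulation of model (11) with a positively correlated error term. A numpy sketch with illustrative variance components:

```python
import numpy as np

rng = np.random.default_rng(8)
N = 500_000
rho0 = 0.5
s2_au_x, s2_au_y, s2_ac = 0.3, 0.4, 0.2     # illustrative error variances

x0, y0 = rng.multivariate_normal([0, 0], [[1, rho0], [rho0, 1]], size=N).T
eps_ac = rng.normal(0, np.sqrt(s2_ac), N)   # common term -> positively correlated
x = x0 + rng.normal(0, np.sqrt(s2_au_x), N) + eps_ac   # model (11)
y = y0 + rng.normal(0, np.sqrt(s2_au_y), N) + eps_ac

A = 1 / (np.sqrt(1 + s2_au_x + s2_ac) * np.sqrt(1 + s2_au_y + s2_ac))  # Eq. (32)
gx, gy = np.sqrt(s2_ac), np.sqrt(s2_ac)     # gamma_x, gamma_y (sigma_x0 = sigma_y0 = 1)
rho_theory = A * (rho0 + gx * gy)           # Eq. (33), '+' branch

r_sim = np.corrcoef(x, y)[0, 1]
print(round(rho_theory, 3), round(r_sim, 3))
```

The simulated correlation agrees with the prediction of Eq. (33) to within Monte Carlo error, and differs visibly from both \({\rho }_{0}\) and the purely attenuated value \(A{\rho }_{0}\).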

Given the true correlation \({\rho }_{0}\), the expected correlation coefficient (33) is completely determined by the measurement error ratios. Assuming the errors on \(x\) and \(y\) to be the same (\({\sigma }_{a{u}_{x}}^{2}={\sigma }_{a{u}_{y}}^{2}\), \({\sigma }_{m{u}_{x}}^{2}={\sigma }_{m{u}_{y}}^{2}\), an assumption that is not unrealistic if \(x\) and \(y\) are measured with the same instrument under the same experimental conditions during a comprehensive omics experiment) and taking for simplicity \({\sigma }_{{x}_{0}}^{2}={\sigma }_{{y}_{0}}^{2}\), then \({\xi }_{x}={\xi }_{y}=\xi \) and \({\gamma }_{x}={\gamma }_{y}=\gamma \) and Eq. (33) simplifies to:

$$\rho =\frac{{\rho }_{0}\pm {\gamma }^{2}}{1+{\xi }^{2}+{\gamma }^{2}},$$
(34)

and \(\rho \) can be visualized graphically as a function of the uncorrelated and correlated measurement error ratios \(\xi \) and \(\gamma \) as shown in Fig. 5.

Figure 5

The expected correlation coefficient \(\rho \) in the presence of additive measurement error as a function of the uncorrelated (\({\xi }^{2}\)) and correlated (\({\gamma }^{2}\)) measurement error ratios (m.e.r.) for different values of the true correlation \({\rho }_{0}\). (A) Positively correlated error. (B) Negatively correlated error.

When \({\rho }_{0} > 0\) and the error is positively correlated, the correlation \(\rho \) is attenuated towards 0 as the uncorrelated error increases and inflated as the additive correlated error increases (Fig. 5A, which refers to Eq. (34)). If \({\rho }_{0} < 0\) the distortion introduced by the correlated error can be so severe that the correlation \(\rho \) can become positive. When the error is negatively correlated (Fig. 5B), the correlation \(\rho \) is biased towards 0 when \({\rho }_{0} > 0\) (and can change sign), while it can be attenuated or inflated if \({\rho }_{0} < 0\).
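The sign flip for \({\rho }_{0} < 0\) with positively correlated error is easy to reproduce. In the illustrative numpy sketch below (\({\rho }_{0}=-0.2\), \({\sigma }_{ac}^{2}=0.5\), no uncorrelated error), Eq. (34) predicts \(\rho =({\rho }_{0}+{\gamma }^{2})/(1+{\gamma }^{2})=0.2\):

```python
import numpy as np

rng = np.random.default_rng(9)
N = 500_000
rho0 = -0.2                                  # weak negative true correlation
s2_ac = 0.5                                  # strong positively correlated error

x0, y0 = rng.multivariate_normal([0, 0], [[1, rho0], [rho0, 1]], size=N).T
eps_ac = rng.normal(0, np.sqrt(s2_ac), N)    # same draw added to both variables
x, y = x0 + eps_ac, y0 + eps_ac              # model (11), no uncorrelated part

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))                           # positive despite rho0 < 0
```

The observed correlation is close to +0.2: the shared error term has completely reversed the sign of the association.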

A set of rules can be derived to describe quantitatively the bias of \(\rho \). For positively correlated measurement error (for negatively correlated measurement error see Section 6.2), if the true correlation \({\rho }_{0}\) is positive the correlation \(\rho \) is always strictly positive: this is shown in Fig. 6A, where the relationship between \(\rho \) and \({\rho }_{0}\) is explored by means of Monte Carlo simulation (see the figure caption for more details). The magnitude \(|\rho |\) depends on how \({A}^{a}\) (for readability, \(A\) in the following equations) and the additive term \({\gamma }_{x}{\gamma }_{y} > 0\) compensate each other. In particular, when \({\rho }_{0} > 0\)

$$\rho \to \{\begin{array}{ll}0 < \rho < {\rho }_{0} & {\rm{if}}\,{\rho }_{0} > \frac{A}{1-A}{\gamma }_{x}{\gamma }_{y}\\ \rho ={\rho }_{0} & {\rm{if}}\,{\rho }_{0}=\frac{A}{1-A}{\gamma }_{x}{\gamma }_{y}\\ \rho  > {\rho }_{0} & {\rm{if}}\,{\rho }_{0} < \frac{A}{1-A}{\gamma }_{x}{\gamma }_{y}\end{array}$$
(35)
Figure 6

Calculations of the correlation coefficient \(\rho \) (40) as a function of the different realizations of the signal means and of the size of the error components, for different values of the true correlation \({\rho }_{0}\). The shadowed area encloses the maximum and the minimum of the values of \(\rho \) calculated in the simulation using the different error models. The dots represent the realized values of \(\rho \) (only 100 of the \({10}^{5}\) Monte Carlo realizations for different values of the error component variances are shown). The solid lines represent the 5th and the 95th percentiles of the observed values. (A) Additive measurement error with positively correlated error. (B) Multiplicative measurement error with positively correlated error. (C) Realistic case with both additive and multiplicative measurement error with positively correlated error. (D) Additive measurement error with negatively correlated error. (E) Multiplicative measurement error with negatively correlated error. (F) Realistic case with both additive and multiplicative measurement error with negatively correlated error. For more details on the simulations see Material and Methods section 6.5.5.

This means that \(\rho \) is always a biased estimator of the true correlation \({\rho }_{0}\), with the exception of the second case, which occurs only for specific values of \(\gamma \) and \({\rho }_{0}\) and is unlikely to happen in practice.
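The bias described above can be checked by direct simulation. The sketch below (Python, with assumed parameter values and unit biological variances; a simplified illustration, not the authors' simulation code) draws correlated signals, adds uncorrelated and shared additive errors, and compares the sample correlation with the theoretical value from Eq. (34):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
rho0, xi, gamma = 0.6, 0.5, 0.4        # true correlation and error ratios

# biological signals with correlation rho0 and unit variance
x0, y0 = rng.multivariate_normal([0, 0], [[1, rho0], [rho0, 1]], n).T

eps_ac = rng.normal(0, gamma, n)       # shared (positively correlated) error
x = x0 + rng.normal(0, xi, n) + eps_ac
y = y0 + rng.normal(0, xi, n) + eps_ac

r = np.corrcoef(x, y)[0, 1]
rho_theory = (rho0 + gamma**2) / (1 + xi**2 + gamma**2)  # Eq. (34)
print(round(r, 3), round(rho_theory, 3))
```

With these parameter values the correlated error does not fully compensate the attenuation, so the observed correlation is biased towards zero.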

If \({\rho }_{0} < 0\), the expected correlation is always larger than the true correlation (\(\rho > {\rho }_{0}\)), and its sign depends on the size of the correlated error term relative to \(|{\rho }_{0}|\):

$$\rho \to \{\begin{array}{ll}{\rho }_{0} < \rho < 0 & {\rm{if}}\,|{\rho }_{0}| > {\gamma }_{x}{\gamma }_{y}\\ \rho =0 & {\rm{if}}\,|{\rho }_{0}|={\gamma }_{x}{\gamma }_{y}\\ \rho  > 0 & {\rm{if}}\,|{\rho }_{0}| < {\gamma }_{x}{\gamma }_{y}\end{array}$$
(36)

In addition to being biased, the correlation coefficient can thus even change sign; this happens when

$$|{\rho }_{0}| < {\gamma }_{x}{\gamma }_{y}.$$
(37)

The terms \(S=\frac{A}{1-A}{\gamma }_{x}{\gamma }_{y}\) and \(S=\frac{A}{A-1}{\gamma }_{x}{\gamma }_{y}\) in Eqs. (35), (71) and (72) describe limiting surfaces \(S\) of \({\rho }_{0}\) values delineating the regions of attenuation and inflation of the correlation coefficient \(\rho \). As can be seen from Fig. 7, these surfaces are not symmetric with respect to zero correlation, indicating that the behavior of \(\rho \) is not symmetric around 0 with respect to the sign of \({\rho }_{0}\) and of the correlated error.

Figure 7

Limiting surfaces S for the inflation and deflation region of the correlation coefficient in presence of additive measurement error. The surfaces are a function of the uncorrelated (\({\xi }^{2}\)) and correlated (\({\gamma }^{2}\)) measurement error ratios (m.e.r.). (A) S in the case of positively correlated error. (B) S for negatively correlated error. The plot refers to \(\rho \) defined by Eq. (34) with \({\xi }_{x}^{2}={\xi }_{y}^{2}={\xi }^{2}\) and \({\gamma }_{x}^{2}={\gamma }_{y}^{2}={\gamma }^{2}\).

The Pearson correlation in presence of multiplicative measurement error

The correlation in the presence of multiplicative error can be derived using similar arguments and detailed calculations can be found in Section 6.1.1. Here we only state the main result:

$$\rho ={\rho }_{0}(1\pm {\sigma }_{mc}^{2}){A}^{m}\pm {\delta }_{x}{\delta }_{y}{\sigma }_{mc}^{2}{A}^{m}$$
(38)

with \({\delta }_{x}={\mu }_{{x}_{0}}/{\sigma }_{{x}_{0}}\) and \({\delta }_{y}={\mu }_{{y}_{0}}/{\sigma }_{{y}_{0}}\) (biological signal to biological variation ratios), and \({A}^{m}\) the attenuation coefficient (the superscript \(m\) stands for multiplicative):

$${A}^{m}=\tfrac{1}{\sqrt{1+(1+\tfrac{{\mu }_{{x}_{0}}^{2}}{{\sigma }_{{x}_{0}}^{2}})({\sigma }_{m{u}_{x}}^{2}+{\sigma }_{mc}^{2})}\sqrt{1+(1+\tfrac{{\mu }_{{y}_{0}}^{2}}{{\sigma }_{{y}_{0}}^{2}})({\sigma }_{m{u}_{y}}^{2}+{\sigma }_{mc}^{2})}}.$$
(39)

In this case the correlation coefficient depends explicitly on the means of the variables, as an effect of the multiplicative nature of the error component. Our simulations (Fig. 6B) show that if the signal intensity is not too large the correlation can change sign; if the signal intensity is very large the multiplicative error dominates, and the expected correlation \(\rho \) is positive if the correlated error is positive and negative if the errors are negatively correlated. Simulations, however, cannot be exhaustive.
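The role of the signal intensity can be made concrete by evaluating Eqs. (38) and (39) directly. The sketch below (Python, hypothetical parameter values, equal parameters for \(x\) and \(y\) and unit biological variance so that \(\delta ={\mu }_{0}\)) shows a sign flip driven purely by the signal-to-variation ratio:

```python
def expected_corr_mult(rho0, delta, s2_mu, s2_mc, sign=+1):
    """Eq. (38)-(39) with identical parameters for x and y and unit
    biological variance. delta = mu0/sigma0 (signal-to-variation ratio),
    s2_mu / s2_mc: variances of the uncorrelated / correlated
    multiplicative error; sign is +/-1 for the error correlation."""
    A = 1.0 / (1 + (1 + delta**2) * (s2_mu + s2_mc))  # A^m, equal x and y
    return rho0 * (1 + sign * s2_mc) * A + sign * delta**2 * s2_mc * A

# Small signal intensity: rho keeps the sign of rho0 = -0.5 ...
r_small = expected_corr_mult(-0.5, delta=0.5, s2_mu=0.01, s2_mc=0.05)
# ... large signal intensity: positively correlated error flips the sign.
r_large = expected_corr_mult(-0.5, delta=10.0, s2_mu=0.01, s2_mc=0.05)
print(r_small, r_large)
```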

The Pearson correlation in presence of realistic measurement error

When both additive and multiplicative error are present, the correlation coefficient is a combination of formula (33) and (38) (see Section 6.1.2 for detailed derivation):

$$\rho ={\rho }_{0}(1\pm {\sigma }_{mc}^{2}){A}^{r}\pm ({\gamma }_{x}{\gamma }_{y}+{\delta }_{x}{\delta }_{y}{\sigma }_{mc}^{2}){A}^{r},$$
(40)

where the \(\gamma \) and \(\delta \) parameters have been previously defined for the additive and multiplicative case and \({A}^{r}\) is the attenuation coefficient (the superscript \(r\) stands for realistic):

$${A}^{r}=\tfrac{1}{\sqrt{1+(1+\tfrac{{\mu }_{{x}_{0}}^{2}}{{\sigma }_{{x}_{0}}^{2}})({\sigma }_{m{u}_{x}}^{2}+{\sigma }_{mc}^{2})+\tfrac{{\sigma }_{a{u}_{x}}^{2}}{{\sigma }_{{x}_{0}}^{2}}+\tfrac{{\sigma }_{ac}^{2}}{{\sigma }_{{x}_{0}}^{2}}}\sqrt{1+(1+\tfrac{{\mu }_{{y}_{0}}^{2}}{{\sigma }_{{y}_{0}}^{2}})({\sigma }_{m{u}_{y}}^{2}+{\sigma }_{mc}^{2})+\tfrac{{\sigma }_{a{u}_{y}}^{2}}{{\sigma }_{{y}_{0}}^{2}}+\tfrac{{\sigma }_{ac}^{2}}{{\sigma }_{{y}_{0}}^{2}}}}.$$
(41)

General rules governing the sign of the numerator and denominator in Eq. (40) cannot be determined, since it depends on the interplay of the six error components, the true means and products thereof. Within the parameter settings of our simulations, the results presented in Fig. 6C show that the behavior of \(\rho \) under error model (15) is qualitatively similar to that in the presence of only multiplicative error. However, different behavior could emerge with different parameter settings.

Generalized correlated error model

The error models presented in Eqs. (11), (13) and (15) assume a perfect correlation of the correlated errors, since the correlated error term \({\varepsilon }_{ac}\) appears simultaneously in both \(x\) and \(y\); the same holds true for \({\varepsilon }_{mc}\). A more general model that accounts for different degrees of correlation between the error components can be obtained by modifying model (15) (other cases are treated in Section 6.3) to

$$\{\begin{array}{l}x={x}_{0}(1+{\varepsilon }_{m{u}_{x}}+{\varepsilon }_{m{c}_{x}})+{\varepsilon }_{a{u}_{x}}+{\varepsilon }_{a{c}_{x}}\\ y={y}_{0}(1+{\varepsilon }_{m{u}_{y}}+{\varepsilon }_{m{c}_{y}})+{\varepsilon }_{a{u}_{y}}+{\varepsilon }_{a{c}_{y}}\end{array}$$
(42)

where the correlated error components \({\varepsilon }_{m{c}_{x}}\), \({\varepsilon }_{a{c}_{x}}\), \({\varepsilon }_{m{c}_{y}}\) and \({\varepsilon }_{a{c}_{y}}\) are distributed as

$$(\begin{array}{c}{\varepsilon }_{a{c}_{x}}\\ {\varepsilon }_{a{c}_{y}}\end{array})\sim N(0,{{\boldsymbol{\Sigma }}}_{AC})\,{\rm{and}}\,(\begin{array}{c}{\varepsilon }_{m{c}_{x}}\\ {\varepsilon }_{m{c}_{y}}\end{array})\sim N(0,{{\boldsymbol{\Sigma }}}_{MC})$$
(43)

with variance-covariance matrices

$${{\boldsymbol{\Sigma }}}_{AC}=(\begin{array}{cc}{\sigma }_{a{c}_{x}}^{2} & {\sigma }_{a{c}_{xy}}\\ {\sigma }_{a{c}_{xy}} & {\sigma }_{a{c}_{y}}^{2}\end{array})\,{\rm{and}}\,{{\boldsymbol{\Sigma }}}_{MC}=(\begin{array}{cc}{\sigma }_{m{c}_{x}}^{2} & {\sigma }_{m{c}_{xy}}\\ {\sigma }_{m{c}_{xy}} & {\sigma }_{m{c}_{y}}^{2}\end{array}),$$
(44)

where \({\sigma }_{a{c}_{xy}}\) is the covariance between error term \({\varepsilon }_{a{c}_{x}}\) and \({\varepsilon }_{a{c}_{y}}\) and \({\sigma }_{m{c}_{xy}}\) is the covariance between error term \({\varepsilon }_{m{c}_{x}}\) and \({\varepsilon }_{m{c}_{y}}\).

It is possible to derive expressions for the correlation coefficient under model (42), as shown in Section 3.1 and in Sections 6.1.1 and 6.1.2. The only difference is that under this model the terms \({\rm{E}}[{\varepsilon }_{ac}^{2}]\) and \({\rm{E}}[{\varepsilon }_{mc}^{2}]\) in Eqs. (27), (58), (65) and (66) are replaced by \({\rm{E}}[{\varepsilon }_{a{c}_{x}}{\varepsilon }_{a{c}_{y}}]={\sigma }_{a{c}_{xy}}\) and \({\rm{E}}[{\varepsilon }_{m{c}_{x}}{\varepsilon }_{m{c}_{y}}]={\sigma }_{m{c}_{xy}}\), respectively.

From the definition of covariance it follows that

$${\sigma }_{a{c}_{xy}}={\pi }_{ac}\sqrt{{\sigma }_{a{c}_{x}}^{2}{\sigma }_{a{c}_{y}}^{2}}$$
(45)

and

$${\sigma }_{m{c}_{xy}}={\pi }_{mc}\sqrt{{\sigma }_{m{c}_{x}}^{2}{\sigma }_{m{c}_{y}}^{2}},$$
(46)

where \({\pi }_{ac}\) and \({\pi }_{mc}\) are the correlations among the error terms, for which it holds \(-1\le {\pi }_{ac}\le 1\) and \(-1\le {\pi }_{mc}\le 1\). If \({\pi }_{ac}\) and \({\pi }_{mc}\) are negative the errors are negatively correlated. Equation (40) now becomes:

$$\rho ={\rho }_{0}(1+{\pi }_{mc}{\sigma }_{m{c}_{x}}{\sigma }_{m{c}_{y}}){A}^{r}+({\pi }_{ac}{\gamma }_{x}{\gamma }_{y}+{\delta }_{x}{\delta }_{y}{\pi }_{mc}{\sigma }_{m{c}_{x}}{\sigma }_{m{c}_{y}}){A}^{r},$$
(47)

with \({\gamma }_{x}={\sigma }_{a{c}_{x}}/{\sigma }_{{x}_{0}}\), \({\gamma }_{y}={\sigma }_{a{c}_{y}}/{\sigma }_{{y}_{0}}\), and

$${A}^{r}=\tfrac{1}{\sqrt{1+(1+\tfrac{{\mu }_{{x}_{0}}^{2}}{{\sigma }_{{x}_{0}}^{2}})({\sigma }_{m{u}_{x}}^{2}+{\sigma }_{m{c}_{x}}^{2})+\tfrac{{\sigma }_{a{u}_{x}}^{2}}{{\sigma }_{{x}_{0}}^{2}}+\tfrac{{\sigma }_{a{c}_{x}}^{2}}{{\sigma }_{{x}_{0}}^{2}}}\,\,\times \sqrt{1+(1+\tfrac{{\mu }_{{y}_{0}}^{2}}{{\sigma }_{{y}_{0}}^{2}})({\sigma }_{m{u}_{y}}^{2}+{\sigma }_{m{c}_{y}}^{2})+\tfrac{{\sigma }_{a{u}_{y}}^{2}}{{\sigma }_{{y}_{0}}^{2}}+\tfrac{{\sigma }_{a{c}_{y}}^{2}}{{\sigma }_{{y}_{0}}^{2}}}}.$$
(48)

This model generalizes the correlation coefficient among \(x\) and \(y\) from Eq. (40) to account for different strengths of the correlation among the correlated error components. All considerations discussed in the previous sections also apply to this model. Expressions for \(\rho \) in the case of additive and multiplicative error can be found in Sections 6.3.1 and 6.3.2.

By setting \({\sigma }_{a{c}_{x}}^{2}={\sigma }_{a{c}_{y}}^{2}={\sigma }_{ac}^{2}\), \({\sigma }_{m{c}_{x}}^{2}={\sigma }_{m{c}_{y}}^{2}={\sigma }_{mc}^{2}\), and \({\pi }_{ac}={\pi }_{mc}=1\) (perfect correlation), model (40) is obtained, and similarly models (33) and (38).
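This reduction can be verified numerically. The sketch below (Python, hypothetical parameter values, identical parameters for \(x\) and \(y\) and unit biological variances so the variance ratios coincide with the raw error variances) implements Eqs. (47) and (48) next to the perfectly correlated model (40) and checks that \({\pi }_{ac}={\pi }_{mc}=1\) recovers it:

```python
def rho_generalized(rho0, delta, gamma, s_mc, s2_mu, xi2, pi_ac, pi_mc):
    """Eq. (47)-(48) with identical parameters for x and y and unit
    biological variances: delta = mu0/sigma0, gamma = sigma_ac/sigma0,
    xi2 = sigma_au^2/sigma0^2, s_mc = sd of the correlated multiplicative
    error, s2_mu = variance of the uncorrelated multiplicative error."""
    A = 1.0 / (1 + (1 + delta**2) * (s2_mu + s_mc**2) + xi2 + gamma**2)
    c = pi_mc * s_mc**2                  # pi_mc * sigma_mc_x * sigma_mc_y
    return (rho0 * (1 + c) + pi_ac * gamma**2 + delta**2 * c) * A

def rho_realistic(rho0, delta, gamma, s_mc, s2_mu, xi2, sign=+1):
    """Eq. (40)-(41) under the same simplification."""
    A = 1.0 / (1 + (1 + delta**2) * (s2_mu + s_mc**2) + xi2 + gamma**2)
    return (rho0 * (1 + sign * s_mc**2)
            + sign * (gamma**2 + delta**2 * s_mc**2)) * A

# pi_ac = pi_mc = 1 (perfect correlation) recovers model (40):
g = rho_generalized(0.4, 2.0, 0.3, 0.2, 0.01, 0.25, 1.0, 1.0)
r = rho_realistic(0.4, 2.0, 0.3, 0.2, 0.01, 0.25, sign=+1)
print(abs(g - r) < 1e-12)
```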

Correction for Correlation Bias

Because virtually all kinds of measurement are affected by measurement error, the correlation calculated from sampled data is distorted to a degree that depends on the level of the measurement error and on its nature. We have seen that experimental error can inflate or deflate the correlation and that \(\rho \) (and hence its sample realization \(r\)) is almost always a biased estimate of the true correlation \({\rho }_{0}\). An estimator that gives a theoretically unbiased estimate of the correlation coefficient between two variables \(x\) and \(y\), taking into account the measurement error model, can be derived. For simple uncorrelated additive error this is given by Spearman's formula (49): this is a known result which in the past has been presented and discussed in many different fields16,17,18,19. To obtain similar correction formulas for the error models considered here it is sufficient to solve for \({\rho }_{0}\) in the defining Eqs. (33), (38) and (40). The correction formulas are as follows (the ± indicates positively and negatively correlated error):

  1.

    Correction for simple additive error (only uncorrelated error):

    $${\rho }_{0}={A}^{-1}\rho .$$
    (49)
  2.

    Correction for additive error:

    $${\rho }_{\pm }^{corrected}=\frac{1}{{A}^{a}}\rho \mp {\gamma }_{x}{\gamma }_{y}.$$
    (50)
  3.

    Correction for multiplicative error:

    $${\rho }_{\pm }^{corrected}=\frac{1}{{A}^{m}(1\pm {\sigma }_{mc}^{2})}\rho \mp \frac{{\sigma }_{mc}^{2}}{1\pm {\sigma }_{mc}^{2}}{\delta }_{x}{\delta }_{y}.$$
    (51)
  4.

    Correction for realistic error:

$${\rho }_{\pm }^{corrected}=\frac{1}{{A}^{r}(1\pm {\sigma }_{mc}^{2})}\rho \mp \frac{{\gamma }_{x}{\gamma }_{y}+{\delta }_{x}{\delta }_{y}{\sigma }_{mc}^{2}}{1\pm {\sigma }_{mc}^{2}}.$$
(52)

In practice, to obtain a corrected estimate of the correlation coefficient \({\rho }_{0}\), \(\rho \) in (50), (51) and (52) is substituted by \(r\), the sample correlation calculated from the data. The effect of the correction is shown for the realistic error model (15) in Fig. 8, where the true, known error variance components have been used. It should be noted that the corrected correlation can exceed ±1.0. This phenomenon has already been observed and discussed16,30: it arises because the sampling error of a correlation coefficient corrected for distortion is greater than that of an uncorrected coefficient of the same size (at least for uncorrelated additive error4,18,31). When this happens the corrected correlation can be rounded to ±1.019,31.
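A worked example of the correction for additive error, Eq. (50), can be sketched as follows (Python, assumed parameter values, unit biological variances, positively correlated error; a simplified illustration, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
rho0, xi, gamma = 0.5, 0.6, 0.4          # true correlation, error ratios

x0, y0 = rng.multivariate_normal([0, 0], [[1, rho0], [rho0, 1]], n).T
e_c = rng.normal(0, gamma, n)            # shared additive error (sigma0 = 1)
x = x0 + rng.normal(0, xi, n) + e_c
y = y0 + rng.normal(0, xi, n) + e_c

r = np.corrcoef(x, y)[0, 1]              # distorted sample correlation
A = 1.0 / (1 + xi**2 + gamma**2)         # A^a for equal x and y parameters
r_corrected = r / A - gamma**2           # Eq. (50), positively corr. error
print(round(r, 3), round(r_corrected, 3))
```

With the (here known) error variance components, the corrected value recovers the true correlation up to sampling noise.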

Figure 8

Correction of the distortion induced by the realistic measurement error (see Eq. (15)). (A) Pairwise correlations \(\rho \) among 25 metabolites calculated from simulated data with additive and multiplicative measurement error vs the true correlation \({\rho }_{0}\). (B) Corrected correlation coefficients using Eq. (52) and using the known error variance components. See Section 6.5.6 for details on the data simulation.

Estimation of the error variance components

Simulations shown in Fig. 8 have been performed using the known parameters of the error components used to generate the data. In practical applications the error components need to be estimated from the measured data, and the quality of the correction will depend on the accuracy of the error variance estimates.

The case of purely additive uncorrelated measurement error (\({\sigma }_{ac}^{2}=0\)) has been addressed in the past18,19,32: in this case the variance components \({\sigma }_{{x}_{0}}^{2}\) and \({\sigma }_{{y}_{0}}^{2}\) can be substituted with their sample estimates (\({s}_{{x}_{0}}^{2}\) and \({s}_{{y}_{0}}^{2}\)) obtained from measured data, while the error variance components (\({\sigma }_{a{u}_{x}}^{2}\) and \({\sigma }_{a{u}_{y}}^{2}\)) can be estimated if an appropriate experimental design is implemented, i.e. if \(n\) replicates are measured for each observation.
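For this purely additive, uncorrelated case, a replicate-based estimation can be sketched as follows (Python, with hypothetical simulated data; the variance decomposition is the standard within/between split, not code from the cited references):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_rep = 500, 4                        # observations and replicates each
x0 = rng.normal(10.0, 2.0, N)            # error-free biological values
s2_err_true = 0.5                        # additive uncorrelated error var.
X = x0[:, None] + rng.normal(0.0, s2_err_true**0.5, (N, n_rep))

# the pooled within-observation variance estimates the error variance;
# the variance of the replicate means, minus err/n_rep, the biological one
s2_err = X.var(axis=1, ddof=1).mean()
s2_bio = X.mean(axis=1).var(ddof=1) - s2_err / n_rep
print(round(s2_err, 3), round(s2_bio, 3))
```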

Unfortunately, there is no simple and immediate approach to estimate the error components in the other cases, when many variance components need to be estimated (6 error variances in the case of error model (15) and 8 in the case of the generalized model (42), to which the estimates of \({\pi }_{mc}\) and \({\pi }_{ac}\) must be added).

Estimating the error components is not a trivial task. Different approaches can be foreseen, including the use of (generalized) linear mixed models33,34, error covariance matrix formulations29,35,36 or common factor analysis factorization37. None of these approaches is straightforward, and all require extensive mathematical manipulation to be implemented; an accurate investigation of the estimation of the error components is outside the scope of this paper and will be presented in a future publication.

Discussion

Since measurement error cannot be avoided, correlation coefficients calculated from experimental data are distorted to an unknown degree. This distortion has largely been neglected in life science applications, but it can be expected to be considerable when comprehensive omics measurements are taken.

As previously discussed, the attenuation of the correlation coefficient in the presence of additive (uncorrelated) error has been known for more than one century. The analytical description of the distortion of the correlation coefficient in presence of more complex measurement error structures (Eqs. (33), (38) and (40)) has been presented here for the first time to the best of our knowledge.

The inflation or attenuation of the correlation coefficient depends on the relationship between the value of the true correlation \({\rho }_{0}\) and the error components. In most practical cases, \(\rho \) is a biased estimator of \({\rho }_{0}\). In the absence of correlated error there is always attenuation; in the presence of correlated error there can also be an increase (in absolute value) of the correlation coefficient. This has also been observed in regression analysis applied to nutritional epidemiology, and it has been suggested that correlated error can, in principle, be used to compensate for the attenuation38. Moreover, the distortion of the correlation coefficient has implications for hypothesis testing to assess the significance of the measured correlation \(r\).

To illustrate the counterintuitive consequences of correlated measurement error consider the following. Suppose that the true correlation is null. In that case, Eqs. (33), (38) and (40) reduce to

$$\rho =\pm {A}^{a}{\gamma }_{x}{\gamma }_{y},$$
(53)
$$\rho =\pm {A}^{m}{\delta }_{x}{\delta }_{y}{\sigma }_{mc}^{2},$$
(54)

and

$$\rho =\pm ({\gamma }_{x}{\gamma }_{y}+{\delta }_{x}{\delta }_{y}{\sigma }_{mc}^{2}){A}^{r},$$
(55)

which implies that the correlation coefficient is not zero. Moreover, in real-life situations sampling variability is superimposed on this, which may result in estimated correlations of the size found in several omics applications (in metabolomics observed correlations are usually lower than 0.610,23; similar patterns are also observed in transcriptomics39,40) while the true biological correlation is zero.
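This spurious-correlation effect is easy to reproduce. The sketch below (Python, hypothetical parameter values, additive case with unit biological variances) correlates two fully independent signals that share only an error component:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x0 = rng.normal(size=n)                  # independent signals: rho0 = 0
y0 = rng.normal(size=n)
e_c = rng.normal(0, 0.5, n)              # shared additive error, gamma = 0.5
r = np.corrcoef(x0 + e_c, y0 + e_c)[0, 1]
print(round(r, 3))                       # near gamma^2/(1 + gamma^2) = 0.2
```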

The correction equations presented need the input of estimated variances. Such estimates also carry uncertainty and the quality of these estimates will influence the quality of the corrections. This will be the topic of a follow-up paper. Prior information regarding the sizes of the variance components would be valuable and this points to new requirements for system suitability tests of comprehensive measurements. In metabolomics, for example, it would be worthwhile to characterize an analytical measurement platform in terms of such error variances including sizes of correlated error using advanced (and to be developed) measurement protocols.

Distortion of the correlation coefficient also has implications for experimental planning. In the case of additive uncorrelated error, the correction depends explicitly on the sample size \(N\) used to calculate \(r\) and on the numbers of replicates \({n}_{x}\), \({n}_{y}\) used to estimate the intraclass correlation (i.e. the error variance components): since in real life the total number of measurements \(N\times ({n}_{x}+{n}_{y})\) is fixed, there is a trade-off between the sample size and the number of replicates, and the experimenter has to decide whether to increase \(N\) or \({n}_{x}\).

The results presented here are derived under the assumption of normality of both measurement and measurement errors. If \({x}_{0}\) and \({y}_{0}\) are normally distributed, then \(x\) and \(y\) will be, in presence of additive measurement error, normally distributed, with variance given by (12). For multiplicative and realistic error the distribution of \(x\) and \(y\) will be far from normality since it involves the distribution of the product of normally distributed quantities which is usually not normal41. It is known that departure from normality can result in the inflation of the correlation coefficient42 and in distortion43 of its (sampling) distribution and this will add to the corruption induced by the measurement error.

We think that, in general, correlation coefficients are trusted too much at face value, and we hope to have raised some doubts and pointed to precautions in this paper.

Material and Methods

Mathematical calculations

Derivation of ρ in presence of multiplicative measurement error

In presence of purely multiplicative error it holds

$${\rm{E}}[x]={\rm{E}}[{x}_{0}(1+{\varepsilon }_{m{u}_{x}}\pm {\varepsilon }_{mc})]={\mu }_{{x}_{0}}$$
(56)

and

$$\begin{array}{rcl}{\rm{E}}[{x}^{2}] & = & {\rm{E}}[{x}_{0}^{2}+{x}_{0}^{2}({\varepsilon }_{m{u}_{x}}^{2}+{\varepsilon }_{mc}^{2}+2{\varepsilon }_{m{u}_{x}}\pm 2{\varepsilon }_{mc}\pm 2{\varepsilon }_{m{u}_{x}}{\varepsilon }_{mc})]\\ & = & {\sigma }_{{x}_{0}}^{2}+{\mu }_{{x}_{0}}^{2}+{\sigma }_{m{u}_{x}}^{2}({\sigma }_{{x}_{0}}^{2}+{\mu }_{{x}_{0}}^{2})+{\sigma }_{mc}^{2}({\sigma }_{{x}_{0}}^{2}+{\mu }_{{x}_{0}}^{2}),\end{array}$$
(57)

using (19)–(21) to calculate the expectation of the cross terms. For \({\rm{E}}[xy]\) it holds

$$\begin{array}{rcl}{\rm{E}}[xy] & = & {\rm{E}}[{x}_{0}{y}_{0}+{x}_{0}{y}_{0}({\varepsilon }_{mc}\pm {\varepsilon }_{mc}\pm {\varepsilon }_{mc}^{2}\pm {\varepsilon }_{mc}{\varepsilon }_{m{u}_{y}}\\ &  & +\,{\varepsilon }_{mc}{\varepsilon }_{m{u}_{x}}+{\varepsilon }_{m{u}_{x}}+{\varepsilon }_{m{u}_{y}}+{\varepsilon }_{m{u}_{x}}{\varepsilon }_{m{u}_{y}})].\end{array}$$
(58)

Because of the independence of \({x}_{0}\), \({y}_{0}\) and the error terms, the expectations of all cross terms are null except

$$\begin{array}{rcl}\pm {\rm{E}}[{x}_{0}{y}_{0}{\varepsilon }_{mc}^{2}] & = & \pm {\rm{E}}[{x}_{0}{y}_{0}]\,{\rm{E}}[{\varepsilon }_{mc}^{2}]\\ & = & \pm {\sigma }_{mc}^{2}({\sigma }_{{x}_{0}{y}_{0}}+{\mu }_{{x}_{0}}{\mu }_{{y}_{0}}),\end{array}$$
(59)

where \({\rm{E}}[{x}_{0}{y}_{0}]\) is given by Eq. (29). Plugging (56), (57) and (58) in (6), the expected correlation coefficient is

$$\rho =\tfrac{{\sigma }_{{x}_{0}{y}_{0}}\pm {\sigma }_{mc}^{2}({\sigma }_{{x}_{0}{y}_{0}}+{\mu }_{{x}_{0}}{\mu }_{{y}_{0}})}{\sqrt{{\sigma }_{{x}_{0}}^{2}+({\sigma }_{{x}_{0}}^{2}+{\mu }_{{x}_{0}}^{2})({\sigma }_{m{u}_{x}}^{2}+{\sigma }_{mc}^{2})}\sqrt{{\sigma }_{{y}_{0}}^{2}+({\sigma }_{{y}_{0}}^{2}+{\mu }_{{y}_{0}}^{2})({\sigma }_{m{u}_{y}}^{2}+{\sigma }_{mc}^{2})}},$$
(60)

and it can be re-written as (38) by setting \({\delta }_{x}={\mu }_{{x}_{0}}/{\sigma }_{{x}_{0}}\) and \({\delta }_{y}={\mu }_{{y}_{0}}/{\sigma }_{{y}_{0}}\) and defining the attenuation coefficient \({A}^{m}\) (39).

Derivation of ρ in presence of realistic measurement error

To simplify calculations we set

$$\{\begin{array}{rcl}{M}_{x} & = & {x}_{0}(1+{\varepsilon }_{m{u}_{x}}\pm {\varepsilon }_{mc})\\ {A}_{x} & = & {\varepsilon }_{a{u}_{x}}\pm {\varepsilon }_{ac}\end{array}$$
(61)

and similarly we define \({M}_{y}\) and \({A}_{y}\) for variable \(y\). It holds

$${\rm{E}}[{A}_{x}]=0\,{\rm{and}}\,{\rm{E}}[{M}_{x}]={\mu }_{{x}_{0}}$$
(62)

and

$${\rm{E}}[{A}_{x}^{2}]={\sigma }_{a{u}_{x}}^{2}+{\sigma }_{ac}^{2}.$$
(63)

\({\rm{E}}[{M}_{x}^{2}]\) is given by Eq. (57). Because the error components are independent and have zero expectation (see Eqs. (19)–(21)) it holds

$${\rm{E}}[{M}_{x}{A}_{x}]={\rm{E}}[{M}_{x}{A}_{y}]={\rm{E}}[{M}_{y}{A}_{x}]=0,$$
(64)
$${\rm{E}}[{M}_{x}{M}_{y}]={\sigma }_{{x}_{0}{y}_{0}}+{\mu }_{{x}_{0}}{\mu }_{{y}_{0}}\pm ({\sigma }_{{x}_{0}{y}_{0}}+{\mu }_{{x}_{0}}{\mu }_{{y}_{0}}){\sigma }_{mc}^{2},$$
(65)
$${\rm{E}}[{A}_{x}{A}_{y}]=\pm \,{\sigma }_{ac}^{2}.$$
(66)

It follows that

$${\rm{E}}[x]={\mu }_{{x}_{0}},$$
(67)
$$\begin{array}{rcl}{\rm{E}}[{x}^{2}] & = & {\rm{E}}[{M}_{x}^{2}]+{\rm{E}}[{A}_{x}^{2}]+2{\rm{E}}[{M}_{x}{A}_{x}]\\ & = & {\sigma }_{{x}_{0}}^{2}+{\mu }_{{x}_{0}}^{2}+{\sigma }_{m{u}_{x}}^{2}({\sigma }_{{x}_{0}}^{2}+{\mu }_{{x}_{0}}^{2})+{\sigma }_{mc}^{2}({\sigma }_{{x}_{0}}^{2}+{\mu }_{{x}_{0}}^{2})+{\sigma }_{a{u}_{x}}^{2}+{\sigma }_{ac}^{2},\end{array}$$
(68)

and

$$\begin{array}{rcl}{\rm{E}}[xy] & = & {\rm{E}}[{M}_{x}{M}_{y}]+{\rm{E}}[{A}_{x}{A}_{y}]+{\rm{E}}[{M}_{x}{A}_{y}]+{\rm{E}}[{M}_{y}{A}_{x}]\\ & = & {\sigma }_{{x}_{0}{y}_{0}}+{\mu }_{{x}_{0}}{\mu }_{{y}_{0}}\pm ({\sigma }_{{x}_{0}{y}_{0}}+{\mu }_{{x}_{0}}{\mu }_{{y}_{0}}){\sigma }_{mc}^{2}\pm {\sigma }_{ac}^{2}.\end{array}$$
(69)

Plugging (67), (68), and (69) into (6) one gets the expression for the correlation coefficient in presence of additive and multiplicative measurement error:

$$\rho =\tfrac{{\sigma }_{{x}_{0}{y}_{0}}\pm ({\sigma }_{{x}_{0}{y}_{0}}+{\mu }_{{x}_{0}}{\mu }_{{y}_{0}}){\sigma }_{mc}^{2}\pm {\sigma }_{ac}^{2}}{\sqrt{{\sigma }_{{x}_{0}}^{2}+({\sigma }_{{x}_{0}}^{2}+{\mu }_{{x}_{0}}^{2})({\sigma }_{m{u}_{x}}^{2}+{\sigma }_{mc}^{2})+{\sigma }_{a{u}_{x}}^{2}+{\sigma }_{ac}^{2}}\,\,\times \sqrt{{\sigma }_{{y}_{0}}^{2}+({\sigma }_{{y}_{0}}^{2}+{\mu }_{{y}_{0}}^{2})({\sigma }_{m{u}_{y}}^{2}+{\sigma }_{mc}^{2})+{\sigma }_{a{u}_{y}}^{2}+{\sigma }_{ac}^{2}}},$$
(70)

which can be re-written as (40) by using the previously defined \({\gamma }_{x},{\gamma }_{y},{\delta }_{x}\) and \({\delta }_{y}\) and defining the attenuation coefficient \({A}^{r}\) (41).

Behavior of ρ in the case of additive negatively correlated error

For negatively correlated error, when the true correlation is positive

$$\rho \to \{\begin{array}{ll}0 < \rho < {\rho }_{0} & {\rm{if}}\,{\rho }_{0} > \frac{A}{A-1}{\gamma }_{x}{\gamma }_{y}\\ \rho ={\rho }_{0} & {\rm{if}}\,{\rho }_{0}=\frac{A}{A-1}{\gamma }_{x}{\gamma }_{y}\\ \rho  > {\rho }_{0} & {\rm{if}}\,{\rho }_{0} < \frac{A}{A-1}{\gamma }_{x}{\gamma }_{y}\end{array}$$
(71)

Since \(\frac{A}{A-1}{\gamma }_{x}{\gamma }_{y} < 0\), \(\rho \) is always smaller than the true correlation. When the true correlation is negative (\({\rho }_{0} < 0\)) the expected correlation is always negative, but it can be, in absolute value, smaller or larger than the absolute value of the true correlation:

$$\rho \to \{\begin{array}{ll}\rho  < {\rho }_{0} & {\rm{if}}\,\frac{A}{A-1}{\gamma }_{x}{\gamma }_{y} < {\rho }_{0} < 0\\ \rho ={\rho }_{0} & {\rm{if}}\,{\rho }_{0}=\frac{A}{A-1}{\gamma }_{x}{\gamma }_{y}\\ \rho  > {\rho }_{0} & {\rm{if}}\,{\rho }_{0} > \frac{A}{A-1}{\gamma }_{x}{\gamma }_{y}\end{array}$$
(72)

Correlation coefficient under the generalized error model

Additive error

Under the generalized additive correlated error model

$$\{\begin{array}{l}x={x}_{0}+{\varepsilon }_{a{u}_{x}}+{\varepsilon }_{a{c}_{x}}\\ y={y}_{0}+{\varepsilon }_{a{u}_{y}}+{\varepsilon }_{a{c}_{y}}\end{array}$$
(73)

with \({\varepsilon }_{a{c}_{x}}\) and \({\varepsilon }_{a{c}_{y}}\) defined in Eq. (43), the correlation coefficient can be expressed as:

$$\rho ={A}^{a}({\rho }_{0}+{\pi }_{ac}{\gamma }_{x}{\gamma }_{y}),$$
(74)

with \({\gamma }_{x}={\sigma }_{a{c}_{x}}/{\sigma }_{{x}_{0}}\), \({\gamma }_{y}={\sigma }_{a{c}_{y}}/{\sigma }_{{y}_{0}}\), and

$${A}^{a}=\frac{1}{\sqrt{1+\frac{{\sigma }_{a{u}_{x}}^{2}}{{\sigma }_{{x}_{0}}^{2}}+\frac{{\sigma }_{a{c}_{x}}^{2}}{{\sigma }_{{x}_{0}}^{2}}}\sqrt{1+\frac{{\sigma }_{a{u}_{y}}^{2}}{{\sigma }_{{y}_{0}}^{2}}+\frac{{\sigma }_{a{c}_{y}}^{2}}{{\sigma }_{{y}_{0}}^{2}}}}.$$
(75)

Multiplicative error

Under the generalized multiplicative error model

$$\{\begin{array}{l}x={x}_{0}(1+{\varepsilon }_{m{u}_{x}}+{\varepsilon }_{m{c}_{x}})\\ y={y}_{0}(1+{\varepsilon }_{m{u}_{y}}+{\varepsilon }_{m{c}_{y}})\end{array}$$
(76)

with \({\varepsilon }_{m{c}_{x}}\) and \({\varepsilon }_{m{c}_{y}}\) defined in Eq. (43), the correlation coefficient can be expressed as:

$$\rho ={\rho }_{0}(1+{\pi }_{mc}{\sigma }_{m{c}_{x}}{\sigma }_{m{c}_{y}}){A}^{m}+{\delta }_{x}{\delta }_{y}{\pi }_{mc}{\sigma }_{m{c}_{x}}{\sigma }_{m{c}_{y}}{A}^{m}$$
(77)

with

$${A}^{m}=\tfrac{1}{\sqrt{1+(1+\tfrac{{\mu }_{{x}_{0}}^{2}}{{\sigma }_{{x}_{0}}^{2}})({\sigma }_{m{u}_{x}}^{2}+{\sigma }_{m{c}_{x}}^{2})}\sqrt{1+(1+\tfrac{{\mu }_{{y}_{0}}^{2}}{{\sigma }_{{y}_{0}}^{2}})({\sigma }_{m{u}_{y}}^{2}+{\sigma }_{m{c}_{y}}^{2})}}.$$
(78)

General realistic error

Formulas for the correlation coefficient under the generalized realistic correlated error model are to be found in the main text in Eqs. (47) and (48).

Correction of the correlation coefficient under the generalized correlated error model

Additive error

Under the generalized additive correlated error model the corrected correlation coefficient is

$${\rho }^{corrected}=\frac{1}{{A}^{a}}\rho -{\pi }_{ac}{\gamma }_{x}{\gamma }_{y}.$$
(79)

Multiplicative error

Under the generalized multiplicative correlated error model the corrected correlation coefficient is

$${\rho }^{corrected}=\frac{1}{{A}^{m}(1+{\pi }_{mc}{\sigma }_{m{c}_{x}}{\sigma }_{m{c}_{y}})}\rho -\frac{{\pi }_{mc}{\sigma }_{m{c}_{x}}{\sigma }_{m{c}_{y}}}{1+{\pi }_{mc}{\sigma }_{m{c}_{x}}{\sigma }_{m{c}_{y}}}{\delta }_{x}{\delta }_{y}.$$
(80)

Realistic error

Under the generalized realistic correlated error model the corrected correlation coefficient is

$${\rho }^{corrected}=\frac{1}{{A}^{r}(1+{\pi }_{mc}{\sigma }_{m{c}_{x}}{\sigma }_{m{c}_{y}})}\rho -\frac{{\pi }_{ac}{\gamma }_{x}{\gamma }_{y}+{\delta }_{x}{\delta }_{y}{\pi }_{mc}{\sigma }_{m{c}_{x}}{\sigma }_{m{c}_{y}}}{1+{\pi }_{mc}{\sigma }_{m{c}_{x}}{\sigma }_{m{c}_{y}}}.$$
(81)

Simulations

We provide here details on the simulations performed and shown in Figs. 1–4, 6 and 8.

Simulations in Figure 1

\(N=100\) realizations of two variables \(x\) and \(y\) were generated under the model with additive uncorrelated measurement error (11), with \({\rho }_{0}=0.8\), \({\sigma }_{{x}_{0}}^{2}={\sigma }_{{y}_{0}}^{2}=1\) and \(\mu =(100,100)\). The error variance components were set to \({\sigma }_{a{u}_{x}}^{2}={\sigma }_{a{u}_{y}}^{2}=0\) and to \({\sigma }_{a{u}_{x}}^{2}={\sigma }_{a{u}_{y}}^{2}=0.75\) (Panel A).

Simulations in Figure 2

The concentration time profiles \({P}_{1}(t)\), \({P}_{2}(t)\) and \({P}_{3}(t)\) of three hypothetical metabolites P1, P2 and P3 are simulated using the following dynamic model

$$\{\begin{array}{rcl}\frac{d}{dt}{P}_{1}(t) & = & -{k}_{1}{P}_{1}(t)({E}_{T}-{P}_{2}(t))+{k}_{-1}{P}_{2}(t)\\ \frac{d}{dt}{P}_{2}(t) & = & {k}_{1}{P}_{1}(t)({E}_{T}-{P}_{2}(t))-{k}_{-1}{P}_{2}(t)-{k}_{2}{P}_{2}(t)\\ \frac{d}{dt}{P}_{3}(t) & = & {k}_{2}{P}_{2}(t)\end{array}$$
(82)

which is the model of an irreversible enzyme-catalyzed reaction described by Michaelis-Menten kinetics. Using this model, \(N=100\) concentration time profiles for P1, P2 and P3 were generated by solving the system of differential equations after varying the kinetic parameters \({k}_{1}\), \({k}_{-1}\), \({k}_{2}\) and the total enzyme concentration \({E}_{T}\), sampling them from uniform distributions. For the realization of the jth concentration profile

$$\begin{array}{rcl}{k}_{1}^{j} & \sim & U(0.9\times {k}_{1},1.1\times {k}_{1})\\ {k}_{-1}^{j} & \sim & U(0.9\times {k}_{-1},1.1\times {k}_{-1})\\ {k}_{2}^{j} & \sim & U(0.9\times {k}_{2},1.1\times {k}_{2})\\ {E}_{T}^{j} & \sim & U(0.9\times {E}_{T},1.1\times {E}_{T})\end{array}$$
(83)

with population values \({k}_{1}=30,{k}_{-1}=20,{k}_{2}=10\), and \({E}_{T}=1\). Initial conditions were set to \(({P}_{{1}_{0}},{P}_{{2}_{0}},{P}_{{3}_{0}})=({P}_{{1}_{0}}^{j},0,0)\) with \({P}_{{1}_{0}}^{j}\sim U(0.9\times {P}_{{1}_{0}},1.1\times {P}_{{1}_{0}})\) and \({P}_{{1}_{0}}=5\). All quantities are in arbitrary units. Time profiles were sampled at \(t=0.4\) a.u. and collected in a data matrix \({{\bf{X}}}_{0}\) of size 100 × 3. The variability in the data matrix \({{\bf{X}}}_{0}\) is given by biological variation. The concentration time profiles of P1, P2 and P3 shown in Panel A are obtained using the population values for the kinetic parameters and the initial conditions.
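The kinetic model can be integrated numerically as sketched below (Python, plain forward-Euler with the population parameter values; a simplification for illustration, not the solver used by the authors). Note that the three mass balances sum to zero, so total mass is conserved:

```python
def simulate(k1=30.0, km1=20.0, k2=10.0, ET=1.0, P10=5.0,
             t_end=0.4, dt=1e-4):
    """Forward-Euler integration of the enzyme-kinetics model (82).
    Returns the concentrations (P1, P2, P3) at t_end."""
    P1, P2, P3 = P10, 0.0, 0.0
    for _ in range(int(t_end / dt)):
        v_bind = k1 * P1 * (ET - P2)     # substrate + free enzyme -> complex
        dP1 = -v_bind + km1 * P2
        dP2 = v_bind - km1 * P2 - k2 * P2
        dP3 = k2 * P2
        P1, P2, P3 = P1 + dt * dP1, P2 + dt * dP2, P3 + dt * dP3
    return P1, P2, P3

P1, P2, P3 = simulate()
print(round(P1 + P2 + P3, 3))            # total mass stays at P10 = 5.0
```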

Additive uncorrelated and correlated measurement error was added to \({{\bf{X}}}_{0}\) following model (11), where P1, P2 and P3 in \({{\bf{X}}}_{0}\) play the role of \({x}_{0}\), \({y}_{0}\) and of an additional third variable \({z}_{0}\) which follows a similar model. The variance of the error component was varied in 50 steps between 0 and 25% of the sample variances \({s}_{{x}_{0}}^{2},{s}_{{y}_{0}}^{2}\) and \({s}_{{z}_{0}}^{2}\) calculated from \({{\bf{X}}}_{0}\). The variance of the correlated error was set to \({\sigma }_{ac}^{2}=0.05\) in all simulations. Pairwise Pearson correlations \({r}_{i,j}\) with \(i,j=\{P1,P2,P3\}\) were calculated for the error-free case \({{\bf{X}}}_{0}\) and for data with measurement error added. 100 error realizations were simulated for each error level, and the average correlation across the 100 realizations is shown in Panel B.

The “mini” metabolite-metabolite association networks shown in Panel C are defined by first taking the Pearson correlation \({r}_{ij}\) among P1, P2 and P3 and then imposing a threshold on \(r\) to define the connectivity matrix \({A}_{ij}\)

$${A}_{ij}=\{\begin{array}{cc}1 & {\rm{i}}{\rm{f}}\,|{r}_{ij}| > 0.6\\ 0 & {\rm{o}}{\rm{t}}{\rm{h}}{\rm{e}}{\rm{r}}{\rm{w}}{\rm{i}}{\rm{s}}{\rm{e}}.\end{array}$$
(84)

For more details see reference10.
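The thresholding in Eq. (84) takes only a few lines; the helper name `connectivity` and the use of NumPy below are illustrative choices, not the paper's code.

```python
import numpy as np

def connectivity(X, threshold=0.6):
    """Binary association network from pairwise Pearson correlations (Eq. 84)."""
    r = np.corrcoef(X, rowvar=False)          # pairwise correlations r_ij
    A = (np.abs(r) > threshold).astype(int)   # A_ij = 1 if |r_ij| > threshold
    np.fill_diagonal(A, 0)                    # no self-loops in the network
    return A
```

Applied to the three metabolite columns, the resulting symmetric 0/1 matrix is the connectivity matrix drawn in Panel C.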

Simulations in Figure 3

Principal component analysis was performed on a 100 × 133 experimental metabolomic data set (see Section 6.6 for a description). The 15 variables with the highest loadings (in absolute value) and the 45 variables with the smallest loadings (in absolute value) on the first principal component were selected to form a 100 × 60 data set \({{\bf{X}}}_{0}\) (we refer to this as the error-free data, as if it contained only biological variation). On this subset a new principal component analysis was performed. Then multiplicative correlated and uncorrelated measurement error was added to \({{\bf{X}}}_{0}\). The variance of the uncorrelated error was set to \({\sigma }_{m{u}_{j}}^{2}=0.05\times {s}_{{j}_{0}}^{2}\) with \(j=1,2,\ldots ,60\), where \({s}_{{j}_{0}}^{2}\) is the variance calculated for the jth column of \({{\bf{X}}}_{0}\), i.e., the biological variance. The variance of the correlated error was fixed to 5% of the average variance observed in the error-free data \(({\sigma }_{mc}^{2}=0.045)\).
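The loading-based variable selection can be sketched as follows; random stand-in data replaces the experimental 100 × 133 matrix, and the SVD route to the first principal component is an implementation choice (the paper used Matlab).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 133))  # stand-in for the experimental data set

# First principal component via SVD of the column-centred data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt[0]                 # loadings of the 133 variables on PC1

# 15 variables with the largest |loading| plus 45 with the smallest.
order = np.argsort(np.abs(loadings))
idx = np.concatenate([order[-15:], order[:45]])
X0 = X[:, idx]
print(X0.shape)  # (100, 60)
```

The selected 100 × 60 block then plays the role of the error-free data \({{\bf{X}}}_{0}\) to which the multiplicative errors are added.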

Simulations in Figure 4

Let \({x}_{ij}\) and \({y}_{ij}\) denote the intensities of the resonances measured at 3.23 and 4.98 ppm in the randomly drawn replicate \(j\) of sample \({F}_{i}\) (\(i=1,2,\ldots ,5\)), and define the 5 × 1 vectors of means

$${{\bf{x}}}_{J}=\frac{1}{J}(\begin{array}{c}\sum _{j}\,{x}_{1j}\\ \vdots \\ \sum _{j}\,{x}_{5j}\end{array})\,{\rm{and}}\,{{\bf{y}}}_{J}=\frac{1}{J}(\begin{array}{c}\sum _{j}\,{y}_{1j}\\ \vdots \\ \sum _{j}\,{y}_{5j}\end{array}).$$
(85)

The correlation \({r}_{J}={\rm{corr}}({{\bf{x}}}_{J},{{\bf{y}}}_{J})\) is calculated for \(J=1,2,5\) and 10; for each \(J\) the replicates used to calculate \({{\bf{x}}}_{J}\) and \({{\bf{y}}}_{J}\) are randomly and independently sampled, for each sample separately, from the total set of the 12 to 15 replicates available per sample. The procedure is repeated \({10}^{5}\) times to construct the distributions of the correlation coefficient shown in Fig. 4C.
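The replicate-averaging procedure can be sketched as below. The replicate intensities are synthetic stand-ins (the real ones come from the NMR data set), sampling with replacement is an assumption, and far fewer than \({10}^{5}\) repetitions are used to keep the sketch fast.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in data: 5 samples, 12 technical replicates each, for two resonances
# whose sample-level means are perfectly correlated (y = 2x + 0.5).
true_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
true_y = 2.0 * true_x + 0.5
n_rep = 12
x_rep = true_x[:, None] + rng.normal(0.0, 1.0, size=(5, n_rep))
y_rep = true_y[:, None] + rng.normal(0.0, 1.0, size=(5, n_rep))

def r_of_J(J, n_draws=2000):
    """Distribution of corr(x_J, y_J) when J replicates are averaged (Eq. 85)."""
    rs = np.empty(n_draws)
    for i in range(n_draws):
        cols = rng.integers(0, n_rep, size=(5, J))  # independent draws per sample
        xJ = np.take_along_axis(x_rep, cols, axis=1).mean(axis=1)
        yJ = np.take_along_axis(y_rep, cols, axis=1).mean(axis=1)
        rs[i] = np.corrcoef(xJ, yJ)[0, 1]
    return rs
```

Averaging more replicates shrinks the error variance by \(1/J\), so the distribution of \({r}_{J}\) tightens and shifts toward the error-free correlation, which is the behaviour shown in Fig. 4C.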

Simulations in Figure 6

Simulation results presented in Fig. 6 show the sample correlation coefficient as a function of the true correlation \({\rho }_{0}\), of the true means (\({\mu }_{{x}_{0}}\) and \({\mu }_{{y}_{0}}\)) and variances (\({\sigma }_{{x}_{0}}^{2}\) and \({\sigma }_{{y}_{0}}^{2}\)) of the signals \({x}_{0}\) and \({y}_{0}\), and of the measurement error variances as they appear in the definitions of \(\rho \) under the different error models (Eqs. (33), (38) and (40)). The calculations were repeated for varying values of \({\mu }_{{x}_{0}}\) and \({\mu }_{{y}_{0}}\), which were randomly and independently sampled from a uniform distribution \(U(0,{\mu }_{0})\), where \({\mu }_{0}=23.4\), the maximum value observed in Data set 1 (see Section 6.6). Values for \({\sigma }_{{x}_{0}}^{2}\) and \({\sigma }_{{y}_{0}}^{2}\) were randomly and independently sampled from a uniform distribution \(U(0,{\sigma }_{0}^{2})\), where \({\sigma }_{0}^{2}\) was set equal to the average variance observed in experimental Data set 1. The variances of all error components were randomly and independently sampled from \(U(0,\tfrac{1}{4}{\sigma }_{0}^{2})\). The overall procedure was repeated \({10}^{4}\) times for each value of \({\rho }_{0}\) in the range \([\,-\,1,1]\) in steps of 0.1.
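Eqs. (33), (38) and (40) are not reproduced in this section. For the additive uncorrelated case the observable correlation takes the classical attenuation form, and the sketch below evaluates it under the sampling scheme just described, with \({\sigma }_{0}^{2}=1\) as a stand-in for the data-derived value; the function name and the exact correspondence to the paper's equation numbering are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def rho_additive(rho0, s2x, s2y, s2ex, s2ey):
    """Correlation between x and y under additive uncorrelated error.

    Classical attenuation formula; the paper's Eqs. (33), (38) and (40)
    give the corresponding expressions for its three error models.
    """
    return rho0 * np.sqrt(s2x * s2y) / np.sqrt((s2x + s2ex) * (s2y + s2ey))

# Sampling scheme of the Figure 6 procedure (sigma0^2 = 1 as a stand-in).
sigma0_2 = 1.0
n = 10_000
s2x = rng.uniform(0.01, sigma0_2, n)      # signal variances
s2y = rng.uniform(0.01, sigma0_2, n)
s2ex = rng.uniform(0.0, sigma0_2 / 4, n)  # error variances
s2ey = rng.uniform(0.0, sigma0_2 / 4, n)

# For any rho0 in [-1, 1], additive uncorrelated error can only attenuate:
rho = rho_additive(0.8, s2x, s2y, s2ex, s2ey)
```

Since the attenuation factor is at most 1, \(|\rho |\le |{\rho }_{0}|\) for every sampled variance combination, which is the systematic shrinkage visible in Fig. 6.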

Simulations in Figure 8

The first 25 variables from Data set 1 were selected and used to compute the means \({\mu }_{0}\) and the correlation/covariance matrix \({\Sigma }_{0}\) used to generate error-free data \({{\bf{X}}}_{0}\sim N({\mu }_{0},{\Sigma }_{0})\) of size \({10}^{4}\) × 25, to which additive and multiplicative measurement error (correlated and uncorrelated) was added (error model (15)) to obtain \({\bf{X}}\). All error variances were set to 0.1, which is approximately equal to 5% of the average variance observed in \({{\bf{X}}}_{0}\). Pairwise correlations among the 25 metabolites were calculated from \({\bf{X}}\) and corrected using Eq. (52) with the known distributional and error parameters \(({\mu }_{0},{\Sigma }_{0})\) used to generate the data. The data generation was repeated \({10}^{3}\) times and the correlations (uncorrected and corrected) were averaged over the repetitions.
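A reduced sketch of this procedure follows: 5 variables instead of 25, a synthetic \({\Sigma }_{0}\) instead of the data-derived one, and a correction restricted to the additive uncorrelated part using the known error variance, rather than the paper's full Eq. (52).

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in for (mu0, Sigma0); error variance chosen large enough
# for the attenuation to be visible.
p, n = 5, 50_000
A = rng.normal(size=(p, p)) / np.sqrt(p)
Sigma0 = A @ A.T + 0.5 * np.eye(p)   # well-conditioned covariance
mu0 = np.full(p, 10.0)

X0 = rng.multivariate_normal(mu0, Sigma0, size=n)   # error-free data
s2e = 0.5                                           # additive error variance
X = X0 + rng.normal(0.0, np.sqrt(s2e), size=X0.shape)

r_true = np.corrcoef(X0, rowvar=False)
r_noisy = np.corrcoef(X, rowvar=False)

# Disattenuation using the known parameters: each variable's correlation is
# shrunk by the factor sqrt(var0 / (var0 + s2e)), so divide it back out.
d = np.sqrt(np.diag(Sigma0) / (np.diag(Sigma0) + s2e))
r_corrected = r_noisy / np.outer(d, d)
np.fill_diagonal(r_corrected, 1.0)
```

Because the correction reuses the parameters that generated the data, the corrected correlations recover the error-free values up to sampling noise, mirroring the comparison averaged over repetitions in Fig. 8.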

Data sets

Data set 1

A publicly available data set containing measurements of 133 blood metabolites from 2139 subjects was used as a basis for the simulations, to obtain realistic distributional and correlation patterns among measured features. The data come from a designed case-cohort and a matched sub-cohort (controls) stratified on age and sex from the TwinGene project44. The first 100 observations were used in the simulation described in Section 6.5.3 and shown in Fig. 3.

Data were downloaded from the Metabolights public repository45 (www.ebi.ac.uk/metabolights) with accession number MTBLS93. For full details on the study protocol, sample collection, chromatography, GC-MS experiments and metabolites identification and quantification see the original publication46 and the Metabolights accession page.

Data set 2

This data set was acquired in the framework of a study aimed at the “Characterization of the measurement error structure in Nuclear Magnetic Resonance (NMR) data for metabolomic studies”29. Five biological replicates of fish extract (F1–F5) were each pretreated in replicate (12 to 15 replicates per sample) and acquired using 1H NMR. The replicates account for variability in sample preparation and for instrumental variability. For details on the sample preparation and the NMR experiments we refer to the original publication.

Software

All calculations were performed in Matlab (version 9.2, R2017a). Code to generate data under the measurement error models (11), (13) and (15) is available at systemsbiology.nl under the SOFTWARE tab.

References

1. Bravais, A. Analyse mathématique sur les probabilités des erreurs de situation d'un point (Impr. Royale, 1844).

2. Galton, F. Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London 45, 135–145 (1889).

3. Pearson, K. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58, 240–242 (1895).

4. Spearman, C. Demonstration of formulae for true measurement of correlation. The American Journal of Psychology 161–169 (1907).

5. Pearson, K. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572 (1901).

6. Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24, 417 (1933).

7. Jolliffe, I. Principal Component Analysis (Springer, 2011).

8. Härdle, W. & Simar, L. Applied Multivariate Statistical Analysis (Springer, 2007).

9. Müller-Linow, M., Weckwerth, W. & Hütt, M.-T. Consistency analysis of metabolic correlation networks. BMC Systems Biology 1, 44 (2007).

10. Jahagirdar, S., Suarez-Diez, M. & Saccenti, E. Simulation and reconstruction of metabolite-metabolite association networks using a metabolic dynamic model and correlation based-algorithms. Journal of Proteome Research (2019).

11. Dunlop, M. J., Cox, R. S. III, Levine, J. H., Murray, R. M. & Elowitz, M. B. Regulatory activity revealed by dynamic correlations in gene expression noise. Nature Genetics 40, 1493 (2008).

12. Marbach, D. et al. Wisdom of crowds for robust gene network inference. Nature Methods 9, 796–804, https://doi.org/10.1038/nmeth.2016 (2012).

13. Stuart, J. M., Segal, E., Koller, D. & Kim, S. K. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255 (2003).

14. Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 4 (2005).

15. Spearman, C. The proof and measurement of association between two things. The American Journal of Psychology 15, 72–101 (1904).

16. Thouless, R. H. The effects of errors of measurement on correlation coefficients. British Journal of Psychology 29, 383 (1939).

17. Beaton, G. H. et al. Sources of variance in 24-hour dietary recall data: implications for nutrition study design and interpretation. The American Journal of Clinical Nutrition 32, 2546–2559 (1979).

18. Rosner, B. & Willett, W. Interval estimates for correlation coefficients corrected for within-person variation: implications for study design and hypothesis testing. American Journal of Epidemiology 127, 377–386 (1988).

19. Adolph, S. C. & Hardin, J. S. Estimating phenotypic correlations: correcting for bias due to intraindividual variability. Functional Ecology 21, 178–184 (2007).

20. Fuller, W. A. Measurement Error Models, vol. 305 (John Wiley & Sons, 2009).

21. Moseley, H. N. Error analysis and propagation in metabolomics data analysis. Computational and Structural Biotechnology Journal 4, e201301006 (2013).

22. Rosato, A. et al. From correlation to causation: analysis of metabolomics data using systems biology approaches. Metabolomics 14, 37 (2018).

23. Camacho, D., de la Fuente, A. & Mendes, P. The origin of correlations in metabolomics data. Metabolomics 1, 53–63, https://doi.org/10.1007/s11306-005-1107-3 (2005).

24. Werner, M., Brooks, S. H. & Knott, L. B. Additive, multiplicative, and mixed analytical errors. Clinical Chemistry 24, 1895–1898 (1978).

25. Balwierz, P. J. et al. Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data. Genome Biology 10, R79 (2009).

26. Mehlkopf, A., Korbee, D., Tiggelman, T. & Freeman, R. Sources of t1 noise in two-dimensional NMR. Journal of Magnetic Resonance (1969) 58, 315–323 (1984).

27. Van Batenburg, M. F., Coulier, L., van Eeuwijk, F., Smilde, A. K. & Westerhuis, J. A. New figures of merit for comprehensive functional genomics data: the metabolomics case. Analytical Chemistry 83, 3267–3274 (2011).

28. Rocke, D. M. & Lorenzato, S. A two-component model for measurement error in analytical chemistry. Technometrics 37, 176–184 (1995).

29. Karakach, T. K., Wentzell, P. D. & Walter, J. A. Characterization of the measurement error structure in 1D 1H NMR data for metabolomics studies. Analytica Chimica Acta 636, 163–174 (2009).

30. Pearson, K. & Lee, A. On the laws of inheritance in man: I. Inheritance of physical characters. Biometrika 2, 357–462 (1903).

31. Winne, P. H. & Belfry, M. J. Interpretive problems when correcting for attenuation. Journal of Educational Measurement 125–134 (1982).

32. Liu, K., Stamler, J., Dyer, A., McKeever, J. & McKeever, P. Statistical methods to assess and minimize the role of intra-individual variability in obscuring the relationship between dietary lipids and serum cholesterol. Journal of Chronic Diseases 31, 399–418 (1978).

33. McCulloch, C. E. & Neuhaus, J. M. Generalized linear mixed models. Encyclopedia of Biostatistics 4 (2005).

34. Verbeke, G. & Molenberghs, G. Linear Mixed Models for Longitudinal Data (Springer Science & Business Media, 2009).

35. Leger, M. N., Vega-Montoto, L. & Wentzell, P. D. Methods for systematic investigation of measurement error covariance matrices. Chemometrics and Intelligent Laboratory Systems 77, 181–205 (2005).

36. Wentzell, P. D., Cleary, C. S. & Kompany-Zareh, M. Improved modeling of multivariate measurement errors based on the Wishart distribution. Analytica Chimica Acta 959, 1–14 (2017).

37. Comrey, A. L. & Lee, H. B. A First Course in Factor Analysis (Psychology Press, 2013).

38. Day, N. et al. Correlated measurement error—implications for nutritional epidemiology. International Journal of Epidemiology 33, 1373–1381 (2004).

39. Pereira, V., Waxman, D. & Eyre-Walker, A. A problem with the correlation coefficient as a measure of gene expression divergence. Genetics 183, 1597–1600 (2009).

40. Reynier, F. et al. Importance of correlation between gene expression levels: application to the type I interferon signature in rheumatoid arthritis. PLoS ONE 6, e24828 (2011).

41. Springer, M. D. The Algebra of Random Variables (Wiley and Sons, 1979).

42. Bishara, A. J. & Hittner, J. B. Reducing bias and error in the correlation coefficient due to nonnormality. Educational and Psychological Measurement 75, 785–804 (2015).

43. Kowalski, C. J. On the effects of non-normality on the distribution of the sample product-moment correlation coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics) 21, 1–12 (1972).

44. Magnusson, P. K. et al. The Swedish Twin Registry: establishment of a biobank and other recent developments. Twin Research and Human Genetics 16, 317–329 (2013).

45. Haug, K. et al. MetaboLights—an open-access general-purpose repository for metabolomics studies and associated meta-data. Nucleic Acids Research 41, D781–D786 (2012).

46. Ganna, A. et al. Large-scale non-targeted metabolomic profiling in three human population-based studies. Metabolomics 12, 4 (2016).

Acknowledgements

This work has been partially funded by The Netherlands Organization for Health Research and Development (ZonMW) through the PERMIT project (Personalized Medicine in Infections: from Systems Biomedicine and Immunometabolism to Precision Diagnosis and Stratification Permitting Individualized Therapies, project contract number 456008002) under the PerMed Joint Transnational call JTC 2018 (Research projects on personalised medicine - smart combination of pre-clinical and clinical research with data and ICT solutions). The authors acknowledge Peter Wentzell (Halifax, Canada) for kindly making available the NMR data set.

Author information

Contributions

E.S. and A.S. conceived the study and performed theoretical calculations. E.S., M.H. and A.S. analysed and interpreted the results. E.S. and M.H. performed simulations. E.S., M.H. and A.S. wrote, reviewed and approved the manuscript in its final form.

Corresponding author

Correspondence to Edoardo Saccenti.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Saccenti, E., Hendriks, M.H.W.B. & Smilde, A.K. Corruption of the Pearson correlation coefficient by measurement error and its estimation, bias, and correction under different error models. Sci Rep 10, 438 (2020). https://doi.org/10.1038/s41598-019-57247-4
