propr: An R-package for Identifying Proportionally Abundant Features Using Compositional Data Analysis

In the life sciences, many assays measure only the relative abundances of components in each sample. Such data, called compositional data, require special treatment to avoid misleading conclusions. Awareness of the need for caution in analyzing compositional data is growing, including the understanding that correlation is not appropriate for relative data. Recently, researchers have proposed proportionality as a valid alternative to correlation for calculating pairwise association in relative data. Although the question of how to best measure proportionality remains open, we present here a computationally efficient R package that implements three measures of proportionality. In an effort to advance the understanding and application of proportionality analysis, we review the mathematics behind proportionality, demonstrate its application to genomic data, and discuss some ongoing challenges in the analysis of relative abundance data.


S1 Appendix: Examination of proportionality
Here, we set out the relationships between the slope (β) and r 2 values of pairwise log-transformed data, and the proportionality statistics φ, φ s and ρ p implemented in propr.
Lovell et al. proposed the "goodness-of-fit to proportionality" statistic φ to assess the extent to which a pair of random variables (x, y) are proportional [3]. This statistic was motivated by challenges in interpreting logratio variance, var(log(x/y)), as a measure of association for variables that carry only relative information [1]. Logratio variance can be factored into two more interpretable terms: var(log(x/y)) = var(log x − log y) where β is the standardized major axis estimate [4] of slope of random variables log y on log x, and r the correlation between those variables. The first term in Equation 1, var(log x), is solely about the magnitude of variation at play and has nothing to do with y. The second term, φ, describes the degree of proportionality between x and y.
Lovell et al. [3] point out that φ(log x, log y) is not symmetric in its arguments. However, a symmetric alternative can be derived as follows. Using similar algebra to Equation 1, one can show that var(log(x · y)) = var(log x + log y) = var(log x) · (1 + β 2 + 2β|r|) and then define φ s (log x, log y) var(log(x/y)) var(log(x · y)) = var(log x − log y) var(log x + log y) Erb and Notredame refer to the square root of Equation 2 asβ in [2] and also show how this symmetric measure of proportionality relates to ρ p (log x, log y): How Proportional Is Proportional Enough?

Introduction
In this vignette, we use a real dataset to show how correlation analysis yields spurious results when applied to RNA-seq count data, and how proportionality analysis minimizes the discovery of falsely positive associations. We conclude by establishing a well-reasoned "rule of thumb" to answer the question, "How proportional is proportional enough?".

Correlation on relative data yields spurious results
Relative, or compositional, data do not exist in Euclidean space, but rather in a sub-space known as the Aitchison simplex (Aitchison 1986). As a consequence, calculating the Euclidean distance between points in relative space does not reflect the absolute distance between them. Many commonly used metrics implicitly assume that underlying data exist in Euclidean space; such metrics are invalid for relative data. Notably, this includes Pearson's correlation coefficient, a measure of the association between two variables .
What does it mean when we say that Pearson's correlation coefficient is invalid for relative data? When applied to relative data, including many of the biological datasets routinely studied, correlation yields erroneous results. Pearson first described this phenomenon in 1896 as spurious correlation (Pearson 1896). A spurious correlation is a falsely positive measure of association between two variables that are, in fact, not associated.
Below, we re-examine the dataset discussed by Lovell et al.  (as published by Marguerat et al. (Marguerat 2012)) which describes a time series of mRNA transcript abundance for yeast cells after the removal of a nutrient. The source data contain two parts: microarray data as expressed relative to the first time point, and a measure of the total RNA at timepoint 0. From these two parts, it is possible to approximate absolute RNA abundance at each timepoint.
Starting with propr version 2.2.0, the absolute abundances are available within the package itself for testing and exploration. The object, "marg.abs" corresponds to the object "Abs.t" as used in the supplementary materials of Lovell et al. . Note that we can make the absolute data relative by dividing each composition (i.e., time point) by the total sum of that composition. This constrains measurements to a proportion of the total, hence making them compositional.
library(propr) data(marg.abs) marg.rel <-t(apply(marg.abs, 1, function(x) x / sum(x))) We now have two datasets: the first contains the absolute abundances of each transcript over time while the second contains the relative abundances of each transcript over time. Although this is microarray data, RNA-seq data is also compositional in that counts carry meaning only as a proportion of the total number of counts (i.e., library size). The compositional nature of RNA-seq data becomes more obvious when expressed, for example, as FPKM.
This dataset provides us with the opportunity to contrast the analysis of relative data with the analysis of the absolute counter-part. The importance of this evaluation comes from the reality that it is not always feasible to acquire data in absolute form. Note, however, that having access to absolute abundances usually makes the analysis of relative data superfluous. Below, we calculate Pearson's correlation coefficient for all pairs using both datasets.

Histogram of Abs.cor
Abs.cor Perhaps surprisingly, there are fewer correlations when looking at the relative data than the absolute data. We emphasize here that spurious correlations does not mean more correlations, but rather wrong correlations.

Histogram of Rel.cor
Rel.cor We will not discuss the distribution of these correlations further; rather, we focus now on the agreement between them. For this, we plot the relative correlations against the absolute correlations. Since correlation matrices have symmetry, we only need to consider one half (e.g., the lower left triangle) of the square matrix. We use the function propr:::lltRcpp to pull out this lower left triangle. We sample 5,000 pairs to render the figure quickly. When studying a relative dataset, we usually want to know about the true relationships in the corresponding (but unattainable) absolute dataset. In this context, it is maybe helpful to think about relative correlation as a potential proxy for the presence of absolute correlation. As such, we can measure the diagnostic performance of relative correlation as a predictor of absolute correlation. One way to do this is by calculating precision and recall at a number of cutoff values. This results in a contingency table of the agreement between relative and absolute correlation coefficients greater than the cutoff. As an example, we use a cutoff of 0.95 below. Here, we define functions for calculating precision and recall from a contingency table like the one above. Note that because each contingency table is specific to the dataset, the precision and recall of relative correlation depends on the dataset too. We can now calculate the precision and recall of relative correlation across a range of cutoff values. From this, we see that relative correlation is neither a precise nor sensitive indicator of absolute correlation. The preponderance of falsely positive correlations here are known as spurious correlations. The poor agreement between relative correlation and absolute correlation leads us to conclude that correlation is invalid for relative data.

Proportionality is a precise indicator of absolute correlation
Proportionality coefficients, as introduced by Lovell et al.  and expounded in Erb & Notredame (Erb 2016), provide a measure of association that is valid for relative data. Although several measures of proportionality exist, we focus here on the metric ρ. This, like correlation, takes on values from [−1, 1], where a value of 1 indicates perfect proportionality. However, unlike correlation, the proportionality coefficient is the same for both relative data and its absolute counter-part, and advantageously tends to produce fewer spurious results. This allows proportionality to serve as a precise indicator of absolute correlation when analyzing relative data.
First, we use the perb function from the propr package to calculate ρ for all feature pairs. We retrieve the proportionality matrix itself from the @matrix slot of the resultant object.

library(propr) rho.clr <-perb(marg.abs)@matrix
We can now plot the agreement between ρ and absolute correlation. On inspection, we see immediately that this scatter plot is tapered in such a way that the absolute correlation coefficient is almost always larger than the proportionality coefficient. Since ρ rarely exceeds the absolute correlation coefficient, ρ is rarely falsely positive (i.e., spurious). Moreover, when it is falsely positive, it only inflates the absolute correlation coefficient marginally (unlike relative correlation). We can now calculate the precision and recall of proportionality across a range of cutoff values. From this, we can see that ρ is a highly precise, albeit not very sensitive, indicator of absolute correlation. Why is ρ not sensitive (i.e., has low recall)? Well, part of the definition of proportionality requires calculating the variance of the log-ratio transformed data. One can think of log-ratio transformations as an attempt to scale each subject vector (i.e., time point) by some reference value. This is done by taking the logarithm of each measurement as divided by the reference.

plot(llt(Abs.cor)[x], llt(rho.clr)[x],
Ideally, this reference is an unchanged reference: a feature that has fixed abundance in absolute space across all samples. By default, propr uses a centered log-ratio transformation in which the geometric mean of the entire subject vector is used as the reference. The better the geometric mean approximates a true unchanged reference, the better ρ agrees with the absolute correlation. The more that ρ agrees with the absolute correlation, the more sensitivity ρ has as an indicator of absolute correlation. Interestingly, proportionality retains precision even when the geometric mean does not approximate the unchanged reference very well (as in this example).
We support this claim by re-calculating ρ using a different log-ratio transformation. When using the perb function, the 'ivar' argument allows us to define a custom reference set to use during the geometric mean calculation. Here, we use the 5 most invariant features (i.e., in absolute space) for the custom reference set. This should provide a reasonable approximation of an idealized unchanged reference. Since we know the absolute abundances, we can determine the invariant features easily. Note that this a contrived example; we most often do not have the absolute abundances when analyzing relative data, or else we would analyze the absolute data directly! We can now plot the agreement between the reference-based ρ and absolute correlation. Here, we see that using the this invariant reference set improves the agreement between ρ and absolute correlation. This translates to higher recall (and precision too). Importantly, precision remains high regardless of the log-ratio transformation used. The stable precision is what makes proportionality an ideal alternative to correlation in the setting of relative data.

Selecting a proportionality cutoff based on absolute correlation
Above, we have shown that the proportionality coefficient provides a highly precise estimate for the absolute correlation coefficient. But how proportional is proportional enough?
One way to select a proportionality cutoff is to consider what absolute correlation coefficients we would consider significant. We can determine this easily using Fisher's z -transformation: this principle states that although the sampling distribution of Pearson's correlation coefficient is not normal, the hyperbolic arctangent transformation of this distribution is normal, and has standard deviation that depends only on the number of samples studied.
Since we have established that proportionality is a highly precise (albeit, non-sensitive) indicator of absolute correlation, we could simply choose a statistically sound cutoff for r to stand in place as our proportionalilty cutoff. In other words, we threshold ρ based on a value of r that corresponds to a statistically significant correlation (i.e., for a given sample size and p-value). To do this, we must first determine a cutoff for the transformed value z based on its the standard deviation. Here, we assume a sample size of 20. We believe this figure provides a useful "rule of thumb" for selecting a reliable cutoff for the proportionality coefficient, ρ. We acknowledge, however, that using r cutof f might lead to slightly inflated false discovery rates (i.e., above α) due to the fact that ρ does not predict absolute correlation coefficients with exactly 100% precision. We therefore recommend users interpret results cautiously, choosing larger (and therefore more conservative) cutoffs whenever possible.
We also recognize that the ability to calculate an exact p-value for the proportionality coefficients would offer a clear advantage over this method. Extending Lin's formal derivation of the concordance correlation coefficient is one way forward (Lin 1989); an implementation of this is available through the experimental function propr:::prop2prob. However, Lin assumes a bivariate normal distribution for the underlying data.
There is no evidence that log-ratio transformed data (used in the calculation of ρ) necessarily conform to this distribution. Moreover, uncertainty about the distribution of the underlying absolute data presents an additional challenge in calculating exact p-values for ρ.