Detecting heritable phenotypes without a model using fast permutation testing for heritability and set-tests

Testing for association between a set of genetic markers and a phenotype is a fundamental task in genetic studies. Standard approaches for heritability and set testing strongly rely on parametric models that make specific assumptions regarding phenotypic variability. Here, we show that resulting p-values may be inflated by up to 15 orders of magnitude, in a heritability study of methylation measurements, and in a heritability and expression quantitative trait loci analysis of gene expression profiles. We propose FEATHER, a method for fast permutation-based testing of marker sets and of heritability, which properly controls for false-positive results. FEATHER eliminated 47% of methylation sites found to be heritable by the parametric test, suggesting a substantial inflation of false-positive findings by alternative methods. Our approach can rapidly identify heritable phenotypes out of millions of phenotypes acquired via high-throughput technologies, does not suffer from model misspecification and is highly efficient.


Supplementary Figures
Supplementary Figure 1 Discrepancy in p-values in a methylation study. p-values from 10,000 permutations, compared to GCTA p-values assuming asymptotics (in log scale). Evaluated on 431,366 methylation sites on all autosomal chromosomes, from the KORA dataset, with 1,799 individuals. Sites withĥ 2 = 0 (206,436 sites) or with a parametric p < 10 −20 (570 sites) omitted for clarity of presentation, showing a total of 224,360 sites, with 99.995% confidence intervals (CIs). Parametric p-values are often smaller than the exact p-values obtained by the permutation test, frequently by several orders of magnitude, resulting in many false positives.
Supplementary Figure 2 Discrepancy in p-values. p-values from 10,000,000 permutations, compared to GCTA p-values assuming asymptotics (in log scale). Evaluated on 3,489 methylation sites for which the GCTA p-value is p < 10 −7 , on all autosomal chromosomes, from the KORA [18] dataset. Sites with a parametric p < 10 −30 (143 sites) omitted for clarity of presentation, showing a total of 3,346 sites, with 99.5% CIs. Parametric p-values are still often smaller than the exact p-values obtained by the permutation test by several orders of magnitude. Supplementary Figure 6 Illustration of the derivative-based approach. The logarithm of the restricted likelihood function is shown as a function of the heritability value, h 2 . The true REML estimate is shown by h 2 max , while the observed heritability estimate is shown by H 2 . The derivative of the log-restricted-likelihood function at H 2 is shown as a linear slope, and it points to the true maximum. Therefore, we can deduce that h 2 max > H 2 simply by evaluating the derivative at H 2 .
Supplementary Figure 7 Performance of SAMC. p-values from 10,000,000 permutations, compared to SAMC p-values with t 0 =1,000 and 1,000,000 permutations (in log scale). Evaluated on 1,907 methylation sites, as in Supplementary Figure 2, with 99.95% CIs. SAMC is reasonably calibrated, also for small p-values.
Supplementary Figure 8 Sensitivity to outliers. The QQ-plots of parametric p-values are shown for phenotypes generated from the assumed normal distribution, as well as from several heavy-tailed distributions. When deviating from normality, skewness in p-values is observed.
Supplementary Figure 9 Discrepancy in p-values, with outliers removed. pvalues from 10,000 permutations, compared to GCTA p-values assuming asymptotics (in log scale). Removal of outliers does not eliminate discrepancies.
Supplementary Figure 10 Sites with SNPs in probes. The KORA analysis is shown here, with sites with known SNPs in their probes colored in red. Such sites are likely to be multimodal. Removal of such sites does not discard all discrepancies.
Supplementary Figure 11 Sites with a multimodal distribution. The KORA analysis is shown here, with sites with a multimodal distribution, as detected by gaphunter, colored in red. Removal of such sites does not discard all discrepancies.   [1,2]. We conclude that the permutation test is equally powerful under the LMM, and thus that reduced power does not explain the discrepancy.

Supplementary
Normalization. A common practice for heritability estimation is to perform some type of normalization on the phenotypes before analysis, such as quantile normalization (QN), which ensures that the phenotype follows an empirical normal distribution. Performing QN can often increase proper p-value behavior. However, QN may also reduce power, e.g. when the phenotype is bimodal, QN will unify the two modes into a single unimodal distribution, and will the discard the important information of the mode to which each sample belongs. Additionally, we note that QN should be applied to the residuals and not to the phenotype itself, which is problematic in the presence of strong unknown covariates. To this end, we applied QN to our data, and calculated parametric p-value vs. the permutation p-values, as in Supplementary Figure 1. The results, shown in Figure 3, indicate that while QN indeed reduces the number of inflated (small) p-values, many sites remain with substantial discrepancy. Also, as expected, QN introduced many deflated p-values, i.e., sites whose parametric p-values are much larger than their permutation p-value counterpart. We additionally reran the analysis of Table 3, comparing permutation p-values with parameteric p-values, with and without QN, on 10 sites with small permutation p-values (Table 1). Again, QN fails to eliminate the large discrepancies. We conclude that while QN may be useful in many cases, it does not alleviate the discrepancies at hand.
Wrong asymptotic distribution for statistic under the model. Another potential source for such discrepancies is the distribution assumed for the statistic used for the cal-culation of the p-value, even when the null model is correct. The assumptions underlying p-value calculation are often violated, leading to inaccurate p-values. However, our simulation study above shows the FPR is well calibrated, so the asymptotic distribution is expected to be a reasonable approximation. As a method for the correction of the distribution of p-values for mild deviations from the asymptotic distribution, it was suggested [3] to consider a larger parametric family of distributions for the distribution of the LR statistic: where Λ is the LR statistic, and w, d and a are constants to be fitted. A small number of permutations (e.g., 1,000) is used to generate a sample of LR statistics, and they are used to find the best fitting model of a weighted mixture of a constant zero and a chi-square distribution, with an arbitrary number of degrees of freedom. We followed the suggested procedure and obtained adjusted p-values. While it was possible to slightly adjust the bias, the results are qualitatively the same, with large discrepancies remaining unchanged (data not shown). We conclude that this is not the source of discrepancy. We finally considered model mis-specification, where the model is improper for the data at hand, and in particular, the assumption that the phenotype takes a multivariate normal distribution.
Sensitivity to outliers. In line with existing literature [4], we observed large p-value deviations in phenotypes simulated to have heavy-tailed marginal distributions, which cause relatively distant outliers. Specifically, we generated phenotypes by drawing phenotype measurements i.i.d. from a target distribution. When the target distribution is a normal distribution, this amounts to the model null hypothesis. For any target distribution, the permutation test is assured to be well-calibrated due to the exchangeable distribution of the phenotype. However, the parametric p-values might not be calibrated. To this end, we generated 1,000 phenotypes, from several heavy-tailed target distributions, calculated their parametric p-values using the KORA kinship matrix, and constructed a quantile-quantile (QQ) plot to examine their calibration.
The results are shown in Supplementary Figure 8. For a normal distribution, the QQ-plot is aligned, with the smallest p-value reaching about 10 −3 , as expected, while, for other distributions, skewness is observed. In particular, we show the QQ-plots for the Gamma distribution with parameters α = 1/2, β = 1, the standard log-normal distribution, the T distribution with 2 degrees of freedom, and the Bernoulli distribution with p = 0.99.
We further tested the effect of outliers on the data at hand. The GTEx dataset, did not contain any outliers, while the KORA dataset did contain 2-5 outliers, as visible in a PCA plot (not shown). We removed all 5 outliers any re-ran a comparison between permutation and parametric p-values and verified that many large discrepancies remain (Supplementary Figure 9). Therefore, removal of outliers does not solve the problem.
Empirical discrepancies extend beyond multimodality. We have seen that multimodality (e.g. a binary Bernoulli distribution) may be the cause of discrepancy. Indeed, analyzing the KORA dataset, we have seen that many of the problematic sites exhibit a bimodal or trimodal behaviour. Thus, we set out to find if such sites are the ones displaying discrepancy.
First, we removed sites with known multimodality. In particular, having a SNP in the probe of a site causes the hybridization of the probe to depend on the genotype in some cases, which may cause multimodal phenotypes. We removed 57,341 such sites, obtained from [5] from our analysis; however, the remaining sites still exhibit large p-value discrepancies (Supplementary Figure 10), qualitatively similar to Supplementary Figure 1.
Second, we explicitly removed sites empirically seen to be multimodal. Current qualitycontrol (QC) methylation pipelines typically remove potentially problematic probes as those with DNA methylation distributions characterized by two or more distinct clusters separated by gaps. We used the gaphunter program [6] with default parameters to identify and remove 3270 multimodal sites. However, remaining sites still show discrepancies, as seen in Supplementary Figure 11. Further examination shows that those are indeed sites with one mode but several scattered outliers. We note that as these outliers may be different within each site, it is not possible to only remove a small amount of samples to avoid outliers. Indeed, using gaphunter to remove these sites as well (using a flag which includes multimodal sites with a very small cluster size) removed 193,762 of 431,366 total sites (45%), well beyond any reasonable QC.

Supplementary Note 2 Validation of likelihood function behaviour
The following was tested on each one of the 7,989 methylation site phenotypes on chromosome 22 of the KORA dataset, and on 10,000 random permutations of each site, resulting in a total of 79,897,989 ≈ 10 7.9 phenotypes.
First, we evaluated the sign of the derivative of the log-restricted-likelihood function at h 2 = 0, 0.1, . . . , 0.9, 1, as described in the Methods section. We checked whether there was indication of more than one local maximum of the log-restricted-likelihood function, by testing if more than one of the follow cases hold: (i) the derivative at h 2 = 0 is negative (indicating a local maximum at 0); (ii) the derivative at h 2 = 1 is positive (indicating a local maximum at 1); (iii) there are two adjacent grid points a and b such that the derivative is positive at h 2 = a and negative at h 2 = b (indicating a local maximum between them). For all phenotypes tested, there was no indication of more than one local maximum.
Second, we estimated the heritability of each phenotype (using REML, as described in Methods), and verified that the sign of the derivative at the grid points was consistent with the estimate, where (i) if the estimate is 0, we expect the derivative at 0 to be negative; (ii) if the estimate is 1, we expect the derivative at 1 to be positive; (iii) at all other grid points, we expect the derivative to be positive if and only if the estimate is larger than the grid point. Again, for all phenotypes tested, all the derivative signs were consistent with the estimate.

Supplementary Note 3 Using the permutation test for LMM with covariates
We discuss the application of permutation testing in LMM when covariates are involved.

Potential issues
In order for the permutation test to be calibrated under the null hypothesis, the distribution of the permuted quantity needs to be exchangeable, i.e., having the same distribution when its entries are exchanged. In particular, any multivariate distribution with i.i.d. entries is exchangeable. Without covariates, the LMM null hypothesis is y ∼ N (0 n , σ 2 p I n ). With the constant intercept covariate, it is y ∼ N (β · 1 n , σ 2 p I n ). In both those cases, the distribution is exchangeable, resulting in an exact permutation test.
When additional, non-constant, covariates are added, the exchangeable quantity is the vector of residuals, e = y − Xβ. However, it is generally not observed, and thus needs to be estimated from the data. This is a limitation compared to the permutation test without non-constant covariates, due to mainly two reasons.
The first reason is that our test is no longer completely non-parametric, as we now explicitly assume a model used for estimating the residuals given the phenotype and the covariates. This is still an improvement over a completely parametric test, as we still do not assume an explicit form for the distribution of the residuals, which allows for flexible modeling; see also the discussion above regarding non-normal i.i.d. errors. This is in line with other non-parametric approaches in genetics [7].
The second issue is that even if the residual model is correct, our estimate may not be precise due to the need to estimate the unobserved parameters. In the linear example above, we have no access to the true β, and thus we need to estimateβ from y and X. If, for example, the estimateβ was far from its true value β,ê will have a component of X that did not exist in e (asê − e = X(β − β)). If, in addition, X has a covariance structure similar to K (equivalently, X are correlated with the dominant eigenvectors of K), this could induce a false correlation that did not originally exist, inflating the p-value.
See [8][9][10] for a detailed discussion in the similar simpler case of linear regression, whose conclusions carry to this case. Luckily, with even small sample sizes, we can expect the bias in estimating β to be small [10].

Dealing with covariates
With these potential issues in mind, our method supports two approaches to dealing with covariates.

Block permutation for discrete covariates
Consider the case where there are few discrete covariates (e.g. sex and smoking status). Covariates partition the samples into a small (e.g. 4) number of groups, according to their covariate values (e.g. non-smoking males). In this case, we can avoid any potential issues with confounding by covariates by conditioning on them. This is done in the simple manner of permuting individuals only within each group. Our software package thus supports drawing random permutations only from the set of permutations maintaining the group block structure. Our method remains unchanged otherwise.

General covariates
In the case of general covariates, we apply our method, extended to covariates. First, the likelihood derivative can be calculated in linear time also with covariates [11]. The covariates are permuted along with the phenotype, i.e. per permutation π, we compare π(y), π(X) vs. the unchanged K. Note that the implicit underlying residual model here, based on REML, is the linear e = y − Xβ.
It is important to note a distinction between testing for heritability and the commonplace scenario of testing for an association between a phenotype and a single marker. In the former, the tested quantity is h 2 , while in the latter, one of the covariates is the marker of interest, and the tested quantity is its associated fixed effect. A permutation test is often applied in the single-marker scenario, and its properties and pitfalls are discussed elsewhere, e.g., [12]. While several issues are common, e.g. inflation in the presence of confounders, the heritability testing scenario is in a sense simpler -there are only three distinct entities: the phenotype y, the covariates X and the kinship matrix K. In the single-marker scenario, there is an additional entity, the tested marker, and its additional relationships to the others adds complications.

Empirical study
To validate the permutation test with covariates, we performed an empirical study. First, we performed a simulation study. Using the KORA kinship matrix and the age, sex, and smoking status covariates (see Methods), we simulated phenotypes under the null hypothesis of no heritability, as y ∼ N (X1 · β, I n ), with varying effect sizes β = 0, 0.25, . . . , 2. For each effect size, we generated 100 random phenotypes, and performed a permutation test with 10,000 permutations. We used a quantile-quantile plot (QQ-plot) to verify that the p-values are well calibrated (Supplementary Figure 12). The QQ-plots were calibrated for all effect sizes, showing that the bias in estimating β [10] is indeed negligible. A similar simulation study with GTEx gave qualitatively similar results (not shown).
In addition, as discussed in the main text, we compared parametric and permutation p-values for the KORA dataset, both with and without covariates. While the p-values for the same phenotype obviously differ, under both regimes there are significant discrepancies, showing that they are not due to the lack (or addition) of covariates.

Summary
While permutation testing is mathematically exact only with constant covariates and an exchangeable distribution, we and others have shown, in theoretical analyses, data studies and simulations, that under a wide range of scenarios, permutation testing remain calibrated with covariates as well.
In practice, there are endless combinations of error distributions, correlations between samples, covariates, markers and phenotypes, and more. A complete mapping of the robustness of both the parametric and permutation tests remains a subject for a future study [4], and is beyond of the scope of this work. We note that despite these potential issues, in many cases the parametric alternatives suffer from significantly larger practical problems. An informed choice between the two alternatives, comparing their advantage and drawbacks, remains the role of the researcher.

Supplementary Note 4 The SAMC algorithm 1 Algorithm description
The Stochastic Approximation Markov Chain Monte Carlo (SAMC) algorithm can be generally described as an iterative, stochastic approximation method, that, given a partitioning of the sample space into subsets, can estimate the subset sizes. SAMC can be used for efficient p-value calculation [13]. For the sake of completeness, we give here a presentation of the SAMC algorithm in the context of p-values; for a full description and explanation, see [13,14].
Given an estimate H 2 , we want to calculate its p-value, i.e. the probability of the event h 2 (π(y)) ≥ H 2 . We define a partitioning of [0, 1] to D + 1 intervals, where the interval [0, H 2 ] is divided into D equally sized intervals, and [H 2 , 1] is an additional interval. This induces a partitioning of the permutation space namely, the set of permutations of the phenotype for which the estimated heritability value falls in the interval [h 2 d , h 2 d+1 ]. Then, X 1 ∪ . . . , ∪X D+1 is a full partitioning of the space of permutations. Then, the p-value is exactly the probability of the event X D+1 assuming a uniform distribution over permutation space, or |X D+1 |/n!.
SAMC utilizes the following key observation. Let p i = |X i |/n! be the true sizes of the subsets, and letp i be some estimate of their sizes. Let ψ(i) = (1/p i )/( D+1 i=1 (1/p i )), that is, a probability distribution on subsets, proportional to 1/p i . This probability distribution is constructed to have the self-balancing property introduced below. For a permutation π ∈ S n , let J(π) be the index of the interval in whichĥ 2 (π(y)) falls. We are interested in the probability distribution over permutations, defined as: The key property of f is that if the estimatesp i are accurate, then sampling from f will sample uniformly from the subsets. To see that, for a subset X i Ifp i = p i , then Pr π∼f (π ∈ X i ) ∝ 1, and thus the subset sampling is uniform. In order to sample from f , a Metropolis-Hastings (MH) step is used. The target distribution for the MH step is f , which in turn is constructed from the current estimatesp i . The choice of a well performing proposal distribution q is discussed below.
The update rule for subset probability estimates follows the stochastic approximation (SA) algorithm [15], which ensures that the estimates can be improved continuously as the simulation goes on. The SA update rule defines a sequence of weights, called gain factors, designed to ensure convergence.
Therefore, the algorithm alternatingly samples a new random permutation according to the target distribution f defined by current estimatesp i , using the MH sampling algorithm; and then, given the subset in which the new permutation falls, updates the estimates. Informally, given the above partitioning, SAMC will eventually sample from each subset uniformly, while in the process estimating the probability of each subset. When the algo-rithm converges, its estimate for the last subset will be our estimate of the p-value.
In practice, as in [13], we will update an estimate of the log of a value only proportional to the probability. That is, defining θ i to be our updated estimate at iteration t, then our estimated probability for subset i is: .
The proposal distribution q(π t , τ ) defines the probability of choosing a new permutation τ , given that the current permutation is π t . Let e i = (0, . . . 3. For t = 1, . . . , T (or until convergence): (a) Simulate a sample π t+1 by a single Metropolis-Hastings update, as follows: i. Generate τ according to the proposal distribution q(π t , τ ).
(b) Update the estimates: For i = 1, . . . , D + 1, set θ is called the gain factor and is defined as γ (t) = t 0 / max(t 0 , t). , then we know a maximum exists within that interval. Using the derivative allows us to avoid the heritability estimation step, as before.

Return exp(θ
An additional speedup is obtained using the following. The proposal distribution altered only a small proportion of the permutations between consecutive steps (see below). Therefore, the quantity u t+1 = U T π t+1 (y) may be calculated from u t = U T π t (y) by adding a small number of columns of U T .

Tuning the parameters of SAMC
We now describe guidelines for tuning the parameters of SAMC algorithm described above. These are the proposal distribution q, the number of intervals, D; the number of iterations, N ; and an additional parameter, t 0 , that corresponds to the number of iterations after which the estimation will begin converging more rapidly. In more detail, the estimate per subset is updated in the t-th iteration by a value which is multiplied by a weight γ t , the gain factor, which effectively controls the magnitude of the update. The gain factor is 1 for the first t 0 iterations, and then it begins decreasing by defining γ t = t 0 /t for t > t 0 .
Proposal distribution. We follow [13] in defining q(π t , τ ) as the uniform distribution over all permutations that are generated by randomly permuting 5% of the values of π t . Results with 10% were qualitatively similar (not shown).
Number of intervals. SAMC in general prefers a fine partition, so that the transition between intervals will be smooth. On the other hand, a large number of subsets will require a larger number of iterations (permutations) until each subset has been visited. We found that D = 50 or D = 100 are often a reasonable trade-off; a D as low as D = 20 may be suitable as well. We refer the reader to [13] for a further discussion about partitioning.
Number of iterations and t 0 . It is important that t 0 will not be too small; otherwise, estimates will begin to converge before all subsets have been sufficiently visited. Previous recommendations have advised t 0 =1,000 to 5,000 or t 0 = 2D to 100D. On the other hand, γ t should be very close to 0 at the end of simulations. Otherwise, the resulting estimates will have a large variation. This can be controlled by determining the ratio between T and t 0 , which controls the last gain factor as γ N = t 0 /N . Therefore, N should be sufficiently larger than t 0 , e.g. N = 100t 0 or 1000t 0 . An alternative to setting a fixed N is to stop the algorithm when it seems to have sufficiently converged; we did not investigate this criterion in this paper.
Validation of parameter choice. In practice, the above rules of thumb are best validated empirically for the problem at hand. To this end, we recommend the following to check the performance of a given set of parameters. (i) First, run SAMC multiple (e.g. 5-10) times and compare the output p-value estimates. If there are large discrepancies, it might mean that convergence was not achieved. Further information can be gained by looking at the probability estimates for all subsets, not only the last. (ii) Second, as mentioned above, as SAMC converges, it visits all subsets approximately uniformly. The relative sampling error (RSE) is defined as the maximal percentage of deviation of the sampling rate of any subset relative to the required uniform rate. For example, if D = 50 and the RSE is 10%, then each subset was visited by 1.9 ± 0.19% of permutations (there are 51 subsets and 1/51 ≈ 0.019). If the RSE is above some threshold, say 10%, then it is a strong indication that the run has not converged, and an increase of N is advised; (iii) Finally, if phenotypes exist whose heritability estimate has a relatively large p-value (e.g, 10 −3 ), run the simple (non-SAMC) permutation testing to get an estimate of the true permutation test p-value (e.g, with N = 10 5 or 10 6 random permutations), and compare the p-value estimated by SAMC. An additional possible approach we did not pursue is using the Potential Scale Reduction Factor [16], a convergence criterion based on a suitable comparison of the variance of sampled values within multiple runs, compared to the variance between them.
Parameter tuning for the KORA dataset. We chose parameters of SAMC by selecting several sites with a relatively high p-value, as estimated by 10,000 permutations. We examined the performance of SAMC with several parameter settings and compared it with the p-value obtained by the simple permutation test. We examined the effect of various parameter settings on the performance of SAMC, checking the relative sampling error (RSE) as well as the variability in estimates across several runs per site (see Methods, Supplementary Figure 13 We found, in agreement with the suggested parameter tuning guidelines of [13], that when t 0 is too small, convergence is slow -t 0 = 1000 appears to work well in our case. Moreover, when the ratio t 0 /N is too large, estimates do not yet converge. That is, N must be sufficiently larger than t 0 . We found N ≥ 100,000 to give good results for t 0 = 1000. We concluded that selecting the right t 0 is important for a speedy convergence. We chose t 0 =1,000 and N =1,000,000 as suitable parameters. Throughout the paper, we used D = 50 intervals.
Implementation. The permutation test was implemented in C++, using the Eigen [17] and Boost software libraries, and in Python.