Fast and covariate-adaptive method amplifies detection power in large-scale multiple hypothesis testing

Multiple hypothesis testing is an essential component of modern data science. In many settings, in addition to the p-value, additional covariates for each hypothesis are available, e.g., functional annotation of variants in genome-wide association studies. Such information is ignored by popular multiple testing approaches such as the Benjamini-Hochberg procedure (BH). Here we introduce AdaFDR, a fast and flexible method that adaptively learns the optimal p-value threshold from covariates to significantly improve detection power. On eQTL analysis of the GTEx data, AdaFDR discovers 32% more associations than BH at the same false discovery rate. We prove that AdaFDR controls false discovery proportion and show that it makes substantially more discoveries while controlling false discovery rate (FDR) in extensive experiments. AdaFDR is computationally efficient and allows multi-dimensional covariates with both numeric and categorical values, making it broadly useful across many applications.

AdaFDR-only discoveries and SBH-only discoveries over each covariate are shown for the tissues Adipose_Subcutaneous and Colon_Sigmoid, respectively. There is a higher proportion of AdaFDR-only discoveries at locations where 1) the distance from the TSS is small (upper left); 2) the SNP has an active chromatin state (upper right); 3) the SNP AAF is close to 0.5 (lower left); 4) the gene expression level is neither too high nor too low (lower right). All of these match the enrichment pattern of eQTLs (Results), indicating that the AdaFDR-only discoveries are more biologically relevant. To investigate the individual contribution of different covariates, we run AdaFDR using each covariate separately for all tissues in the GTEx experiment. We use a nominal FDR level of 0.01, the same as before. The distance from the TSS is the most informative covariate, while the others have smaller but still notable effects. Interestingly, the combined improvement of using all covariates (31.9%) is similar to the sum of the four individual improvements (33.0%), indicating that the four covariates carry very different information regarding the hypotheses. Source data are provided as a Source Data file.

Figure 7. Algorithm stability. AdaFDR may produce slightly different results in different runs on the same dataset due to its inherent randomness. To showcase its stability, we repeat all 10 experiments in Figure 3a 50 times with different random seeds. As shown in the first column of the table, the numbers of discoveries over the 50 repetitions are highly consistent. Furthermore, for each of the 50 repetitions, we run AdaFDR a second time and report the proportion of reproduced discoveries in the second column of the table (Table S2 in ref. 11). Ten resamplings were done for the RNA-Seq experiments (a, b, e) while twenty were done for the others; 95% confidence intervals are provided.
Panels a and b are two RNA-Seq spike-in resampling experiments with an informative covariate; panel c contains simulated data with the number of tests varying from 500 to 50k, while panel d contains simulated data with the non-null proportion of tests varying from 0.95 to 0.6. In all four experiments, AdaFDR and AdaPT have the highest power (with AdaFDR being slightly better). We note that AdaPT does not achieve such high power in the same experiments in ref. 11. This is probably because we used adapt_gam whereas adapt_glm was used in ref. 11; the former performs better but takes longer to run. Panel e uses the same set of p-values as panel b but with an uninformative covariate.
We can see that the performance of IHW reduces to that of BH while the other methods reduce to SBH, a phenomenon also noted in ref. 11.
AdaFDR maintains high power here, indicating that it does not overfit the uninformative covariate.

Feature preprocessing
We perform feature preprocessing to integrate both numerical and categorical covariates. First, for each categorical covariate, the categories are reordered based on the ratio of the alternative probability to the null probability, estimated on the training set using the same method as above. Then quantile normalization is performed for each covariate separately. Note that after this transformation, all covariates have values between 0 and 1. Also, overfitting is not a concern since the entire preprocessing is done without seeing p-values from the testing set.
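The preprocessing described above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the function name and the way the category ratios are supplied (`alt_over_null`, a dict of estimated alternative-to-null probability ratios from the training fold) are hypothetical.

```python
import numpy as np

def preprocess_covariate(x, alt_over_null=None):
    """Map one covariate to (0, 1] via quantile normalization.

    x: 1-D array, numeric or categorical.
    alt_over_null: optional dict mapping each category to an estimated
        alternative-to-null probability ratio (hypothetical estimates
        from the training fold); used to order categories first.
    """
    x = np.asarray(x)
    if alt_over_null is not None:
        # Reorder categories by the estimated alt/null probability ratio,
        # then replace each category by its rank so it behaves numerically.
        order = sorted(alt_over_null, key=alt_over_null.get)
        rank = {c: i for i, c in enumerate(order)}
        x = np.array([rank[c] for c in x], dtype=float)
    # Quantile-normalize: each value becomes its empirical quantile
    # (ties are broken arbitrarily by the stable sort).
    ranks = np.argsort(np.argsort(x, kind="stable"), kind="stable")
    return (ranks + 1) / len(x)
```

After this step, every covariate (numeric or categorical) lives on a common (0, 1] scale, which is what makes the mixture-model fit on [0, 1]^d well defined.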

Remark on Theorem 1
Theorem 1 is similar to, but stronger than, the corresponding result for NeuralFDR. First, NeuralFDR requires the scale factor to be selected from a finite set of L numbers and has an extra multiplicative factor √(log L) in the error term ε. In contrast, AdaFDR selects the scale factor over all positive numbers and the √(log L) term is no longer needed. This is achieved by using a stochastic-process argument instead of the union bound. Second, NeuralFDR uses an empirical Bayes model where the tuples (P_i, x_i, h_i) are generated i.i.d. following some hierarchical model. AdaFDR, in contrast, makes a less restrictive assumption only on the conditional distribution of the null p-values, whereas the covariates and the alternative p-values can have arbitrary dependence.

Initialization via EM algorithm
Here we present the EM algorithm that is used to fit the mixture model (2) on a set of N points. Recall that due to quantile normalization, the value of x_i is within [0, 1]^d. Therefore, each component in the mixture model is truncated to be within [0, 1]^d, i.e., a truncated GLM or a truncated Gaussian. Since we need to use samples that are each associated with a sample weight, let us consider the general case where each sample x_i receives a positive weight v_i ∈ ℝ₊.
For convenience, let us reparameterize the parameters to obtain a standard probability distribution, where w ∈ [0, 1]^{K+1} with Σ_{k=0}^{K} w_k = 1. It is not hard to see that (1) is equivalent to the mixture threshold (2) up to a scale factor that can be specified by b in (2); knowing one, the parameters for the other can be computed without difficulty. The EM algorithm can be described as follows. For the initialization, the responsibilities are set such that the first component corresponds to the slope component and the rest correspond to the K bump components. Then, the algorithm iterates between the E-step and the M-step as follows until convergence: 1. Expectation (E-step): For each point x_i, update the responsibility

2. Maximization (M-step): Update the component weights w^{new} by
Update the parameters for the slope component and each of the K bump components. The ML estimates of the slope and the bumps, i.e., ML_slope and ML_bump, are described as follows. ML estimate of the slope. The log-likelihood function of a single observation x_i can be written out directly, and so can the weighted average log-likelihood function. We add a regularization term c‖a‖₂² to encourage small values of ‖a‖₂², obtaining the regularized objective.
We found that setting c = 0.005 gives stable results. We solve the ML estimation problem by setting the derivative to zero. Namely, for the j-th element a_j, we rearrange terms on both sides to find that the ML estimate â_j satisfies a fixed-point condition. Since the left-hand-side term is monotonic in â_j, the ML solution â_j can be computed via binary search. ML estimate of the k-th bump. Since the density function factorizes as a product over dimensions, the ML estimation can be done for each dimension separately. Now consider observation x_i. The log-likelihood function corresponding to dimension j can be written down, and so can the weighted average log-likelihood function for dimension j. Since l̄_j(μ_j^k, σ_j^k) is convex, we compute the ML estimates μ̂_j^k and σ̂_j^k via gradient descent, where the derivatives are given as follows.
where the derivatives with respect to Z_j^k are
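The E-step/M-step loop above can be sketched for a one-dimensional mixture of one exponential "slope" component and one Gaussian bump. This is a simplified illustration, not the paper's implementation: the bump update uses an untruncated Gaussian instead of gradient descent on the truncated likelihood, the regularization term is omitted, and all function names are hypothetical. The slope update does follow the text's binary-search idea: setting the derivative of the weighted log-likelihood to zero reduces to matching the weighted sample mean against a function that is monotonic in the slope parameter a.

```python
import numpy as np

def slope_pdf(x, a):
    # Exponential "slope" density on [0, 1]: a*exp(a*x)/(exp(a) - 1),
    # which tends to the uniform density as a -> 0.
    if abs(a) < 1e-8:
        return np.ones_like(x)
    return a * np.exp(a * x) / (np.exp(a) - 1.0)

def fit_slope(x, r, v):
    # Weighted ML for the slope parameter a. Setting the derivative of
    # the weighted log-likelihood to zero gives
    #     weighted_mean(x) = exp(a)/(exp(a) - 1) - 1/a,
    # whose right-hand side (the mean of the slope density) increases
    # monotonically in a, so we solve it by binary search as in the text.
    w = r * v
    m = np.sum(w * x) / np.sum(w)
    lo, hi = -50.0, 50.0
    for _ in range(100):
        a = 0.5 * (lo + hi)
        g = 0.5 if abs(a) < 1e-8 else np.exp(a) / (np.exp(a) - 1.0) - 1.0 / a
        if g < m:
            lo = a
        else:
            hi = a
    return 0.5 * (lo + hi)

def em_slope_bump(x, v, n_iter=50):
    # Initialize: flat slope, bump at the median with a wide width.
    a, mu, sigma = 0.0, np.median(x), 0.2
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of slope (col 0) and bump (col 1).
        dens = np.column_stack([
            w[0] * slope_pdf(x, a),
            w[1] * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
                 / (sigma * np.sqrt(2 * np.pi)),
        ])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: component weights, then component parameters.
        w = (v[:, None] * r).sum(axis=0) / v.sum()
        a = fit_slope(x, r[:, 0], v)
        wb = v * r[:, 1]
        mu = np.sum(wb * x) / np.sum(wb)
        sigma = max(np.sqrt(np.sum(wb * (x - mu) ** 2) / np.sum(wb)), 1e-3)
    return w, a, mu, sigma
```

The sample weights v_i enter every M-step average, matching the weighted formulation in the text.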

Extension to dependent case
Here we describe a simple procedure that extends AdaFDR to allow arbitrary dependency of p-values, borrowing ideas from the extended version of IHW (ref. 8). The procedure can be described as follows: 1. Partition the hypotheses into two folds that are independent of each other. 2. For each fold j = 1, 2, let {t_i}_{i∈D_j} be the thresholds learned from the other fold (up to a scaling factor). Weight the p-values accordingly, where |D_j| is the cardinality of the set D_j.

3. Apply the BH procedure on the set of weighted p-values p̃_i.
We note that in eQTL studies, SNPs from different chromosomes can be regarded as independent of each other. Also, the third step corresponds to the Benjamini-Yekutieli procedure (ref. 1). By Theorem 1 in ref. 8, the above procedure controls FDR under arbitrary dependency of p-values. More specifically, it controls FDR under the following assumptions: 1. The two folds are independent of each other.
2. The null p-values, conditional on the covariates, are independent and stochastically greater than the uniform distribution.
As a side note, in practice, dependence has minimal effect as long as the two folds are independent of each other, i.e., as long as step 1 in the procedure above holds.
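The two-fold procedure can be sketched as follows. Since the exact weighting formula is not reproduced here, the sketch assumes an IHW-style choice in which each hypothesis's weight is its learned threshold normalized to average one within its fold; the function names are hypothetical.

```python
import numpy as np

def bh(p, alpha, n_full=None):
    # Standard Benjamini-Hochberg: reject the k smallest p-values where
    # k = max{i : p_(i) <= i * alpha / n}.
    n = len(p) if n_full is None else n_full
    order = np.argsort(p)
    thresh = alpha * np.arange(1, len(p) + 1) / n
    below = np.asarray(p)[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject

def two_fold_weighted_bh(p, fold, t, alpha):
    """p: p-values; fold: 0/1 fold labels (independent folds);
    t: per-hypothesis thresholds, each learned on the *other* fold.
    Weights are the thresholds normalized to average 1 within each fold
    (an IHW-style choice assumed here; the exact formula is
    implementation-specific)."""
    p, t = np.asarray(p, float), np.asarray(t, float)
    w = np.empty_like(t)
    for j in (0, 1):
        idx = fold == j
        w[idx] = t[idx] / t[idx].mean()   # average weight 1 per fold
    p_tilde = np.minimum(p / w, 1.0)      # weighted p-values
    return bh(p_tilde, alpha)             # step 3: BH on weighted p-values
```

With constant thresholds the weights are all one and the procedure reduces to plain BH, which is a useful sanity check.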

Accelerate AdaFDR by data filtering using p-values
As input, AdaFDR also allows filtered data containing only small p-values close to 0 and large p-values close to 1. This is because the mirror estimator only needs to look at these two parts of the data. In this case, the original number of hypotheses (before filtering) is required as input to control FDR (the argument n_full in the algorithm implementation). For the GTEx data, the data are filtered to keep only data points with p-values P_i < 0.01 or P_i > 0.99, which greatly accelerates the algorithm. Let c_0 be the filtering threshold. Then, for filtered data, only data points with p-values P_i < c_0 or P_i > 1 − c_0 are kept as input to AdaFDR. The filtering threshold c_0 should be much larger than the rejection threshold. For example, in the GTEx data, the rejection threshold can be smaller than 10^−4 while the filtering threshold is chosen to be c_0 = 0.01.
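A minimal sketch of the filtering logic: BH run on the filtered p-values, but with thresholds computed from the original count n_full, yields exactly the same rejection threshold as BH on the full data, provided the rejection threshold stays below c_0. The function names here are hypothetical.

```python
import numpy as np

def filter_pvalues(p, c0=0.01):
    """Keep only p-values below c0 (candidate discoveries) or above
    1 - c0 (used by the mirror estimator), plus the original count
    n_full needed downstream for FDR control."""
    p = np.asarray(p)
    keep = (p < c0) | (p > 1 - c0)
    return p[keep], len(p)

def bh_filtered(p_kept, n_full, alpha):
    # BH on filtered p-values, but with thresholds i*alpha/n_full based
    # on the ORIGINAL number of hypotheses, so filtering does not
    # inflate the FDR.
    p_sorted = np.sort(p_kept)
    below = p_sorted <= alpha * np.arange(1, len(p_sorted) + 1) / n_full
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return p_sorted[k - 1] if k > 0 else 0.0  # rejection threshold
```

The equivalence holds because every p-value below c_0 is kept, so the i-th smallest filtered p-value is also the i-th smallest overall wherever the BH comparison can succeed.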

Implementation of other methods
1. AdaPT: adapt_gam is used with a 5-degree spline for each dimension. This choice is based on a discussion with the authors of AdaPT.
3. BL: First, the null proportion π₀(x) is estimated using lm_pi0 with 5 degrees of freedom. Then BH is used with p-values weighted by 1/π₀(x_i). This is the same as the usage in ref. 11.
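The BL procedure can be sketched in Python (the text uses the R function lm_pi0; here plain least squares stands in for it, and all names are hypothetical):

```python
import numpy as np

def estimate_pi0(p, X, lam=0.5):
    """Boca-Leek-style covariate-dependent null proportion, sketched
    with plain least squares instead of lm_pi0. The indicator
    1{p > lam}/(1 - lam) is an approximately unbiased signal for
    pi0(x); regress it on the covariates and clip to a sane range."""
    y = (np.asarray(p) > lam).astype(float) / (1.0 - lam)
    A = np.column_stack([np.ones(len(p)), X])       # intercept + covariates
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.clip(A @ beta, 0.05, 1.0)             # keep weights bounded

def bl_weighted_bh(p, X, alpha):
    # Weighting p-values by 1/pi0(x_i) in BH is equivalent to comparing
    # p_i * pi0(x_i) against the BH thresholds.
    pw = np.minimum(np.asarray(p) * estimate_pi0(p, X), 1.0)
    order = np.argsort(pw)
    below = pw[order] <= alpha * np.arange(1, len(p) + 1) / len(p)
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject
```

Hypotheses in covariate regions with a smaller estimated π₀(x) effectively receive larger weights, which is the source of the power gain.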

eQTL study
GTEx. For the eQTL study, we used the Genotype-Tissue Expression (GTEx) dataset (ref. 4). This dataset aims at characterizing variation in gene expression levels across individuals and diverse tissues of the human body. We used the V7 release of the GTEx analysis data (dbGaP accession phs000424.v7.p2). The dataset contains 11,688 samples, covering 53 tissues from 714 donors (44 of them, with sample size >70, are used in the GTEx paper). We filtered the tissues based on the following criteria. First, the tissue needs to have an eQTL analysis, i.e., the number of samples with genotype is greater than 70. Second, we set the sample-number threshold to 100 in order to make the p-values more reliable. Third, we would like the tissue to have a corresponding Roadmap (ref. 2) cell type, so that we can leverage the cell-specific chromatin state data from Roadmap. After filtering, we were left with 17 cell types. The meta-information of the filtered GTEx dataset is listed in Table 1. In this filtered dataset, each hypothesis is a gene-variant pair. Nominal p-values for each gene-variant pair were estimated using a two-tailed t-test. Each gene-variant pair is associated with 4 or 5 covariates, listed below:
• gene expression: We used the median expression of the gene in the gene-variant pair as a feature.
• alternative allele frequency: We mapped each SNP to the dbSNP database (ref. 14) and took the alternative allele frequency as a feature. If there were multiple alternative alleles, we took the smallest frequency. For SNPs for which we could not find a mapping, this feature is imputed with the mean alternative allele frequency.
• TSS distance: The distance from the SNP to the transcription start site is used as a feature. It is defined as pos_SNP − pos_TSS.
• cell-specific chromatin state: We took the position of the SNP and mapped it to the Roadmap database (ref. 2). Each SNP falls into one of the states of the 15-state chromatin model. This state is used as a categorical feature.
• p-value from another tissue (optional): Optionally, we used the p-value from another tissue as a covariate. If we could not find the same gene-variant pair in the other tissue, we imputed the mean p-value. This covariate is only used for the "AdaFDR (aug)" and "AdaFDR (ctrl)" experiments.
Due to the large data size, we only provide the p-value-filtered (< 0.01 or > 0.99) curated data for all 17 tissues in our online repository. MuTHER. In the Multiple Tissue Human Expression Resource (MuTHER) project (ref. 5), samples from 850 individuals were collected and 3 tissues, namely adipose, LCL, and skin, were studied. We used only the data for the adipose and LCL tissues, where a nominal p-value is provided for each SNP-gene pair. The curated MuTHER data are available in our online repository.

RNA-Seq data
We used three RNA-Seq datasets to validate our algorithm. The first one, bottomly (ref. 3), is an RNA-Seq dataset used to detect differential gene expression between mouse strains. We used the same data preprocessing pipeline as in IHW (ref. 9): p-values were calculated using DESeq2, and the mean of normalized counts for each gene was chosen as the covariate. The second dataset, airway (ref. 6), is an RNA-Seq dataset used to identify differentially expressed genes in airway smooth muscle cell lines in response to dexamethasone. This dataset is processed with the same pipeline as bottomly. The third dataset, pasilla (ref. 7), is an RNA-Seq dataset for detecting genes that are differentially expressed between the normal and pasilla-knockdown conditions. This dataset is available in the pasilla package and is analyzed in the vignette of the genefilter package using the independent filtering method. The p-values were generated using the DESeq package, and the logarithm of the normalized counts was used as the covariate. All the preprocessing steps can be reproduced using the vignettes of the R package IHWPaper (ref. 10). The data are available in our online repository.

Microbiome data
The two microbiome experiments are from the benchmark paper 11 . The data are available in our online repository.

Proteomics data
The proteomics data is from the IHW paper 9 . The data is available in our online repository.

fMRI data
The two fMRI datasets are from the fMRI paper (ref. 13). The data are available in our online repository.

Simulated data
All simulated data generated (with different random seeds) are available in our online repository as data files. For the weakly dependent data (data 4), the alternative z-scores are generated from N(2, Σ), with a symmetric covariance matrix whose upper-triangular entries are set to 0.25. We note that instead of 0.25, the value 0.4 is used in the original paper (Section 3.2 of ref. 15); however, that choice makes the covariance matrix not positive semi-definite, and we decreased the value until the matrix became positive semi-definite. The number of hypotheses is 20,000 and 10 datasets are generated with different random seeds.

Data 5. Simulated data with strongly dependent p-values. The setting is the same as for the weakly dependent data (data 4) except for the generation of the z-scores. Here, every 5 consecutive null z-scores are generated from N(0, I), while every 5 consecutive alternative z-scores are generated from N(2, I). This perfect correlation is meant to model the linkage disequilibrium (LD) that frequently occurs among SNPs. Due to the reduction of the inherent multiplicity, the number of hypotheses is increased to 50,000. 10 datasets are generated with different random seeds.

Data 6. Simulated data used in AdaPT. The same data as in Figure 6a of ref. 12 are used, where the number of hypotheses is 2,500. 10 datasets are generated with different random seeds.

Data 7. Simulated data used in IHW. The data are generated according to Supplementary Note 4.2.2 in ref. 9, where the number of hypotheses is 20,000. While the original paper varies the effect size (the shift of the z-scores for alternative p-values) from 1 to 2.5, here we use a fixed effect size of 2. 10 datasets are generated with different random seeds.
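The positive semi-definiteness check used to pick the off-diagonal value can be reproduced in a few lines; the off-diagonal pattern B below is a hypothetical stand-in for the paper's actual matrix, chosen only to illustrate the shrink-until-PSD step.

```python
import numpy as np

def is_psd(S, tol=1e-12):
    # A symmetric matrix is positive semi-definite iff its smallest
    # eigenvalue is (numerically) non-negative.
    return np.linalg.eigvalsh(S).min() >= -tol

def shrink_until_psd(pattern, start, step=0.05):
    # Scale the off-diagonal pattern by c, decreasing c from `start`
    # until I + c * pattern becomes PSD (mirroring how the value 0.4
    # was decreased until positive semi-definiteness held).
    c = start
    while not is_psd(np.eye(len(pattern)) + c * pattern):
        c = round(c - step, 10)
    return c
```

This kind of explicit check avoids silently feeding a non-PSD "covariance" matrix to a multivariate normal sampler.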

Proof of Theorem 1
Proof. To avoid ambiguity, we make a few clarifications before the proof. First, the entire analysis is done while conditioning on the hypothesis splitting, all covariates {x_i}_{i∈[N]}, the types of hypotheses {h_i}_{i∈[N]}, and the alternative p-values {P_i}_{i∈H_1}, hence allowing arbitrary dependencies among them. Here we note that the reason for splitting the hypotheses at random is to attain good power. The randomness of the analysis comes from the null p-values, which are assumed to be i.i.d. uniformly distributed for convenience. A discussion on extending to the case where the null p-values, conditional on the covariates, are independently distributed and stochastically greater than the uniform distribution is provided at the end. We also clarify a few notations. We use t*_{D_1} to denote the threshold that is learned on fold 1 and will be applied on fold 2; γ*_1 denotes the scale factor for fold 1. For the testing-related quantities, we use the subscript "1" to denote those evaluated on fold 1, including the number of discoveries D_1(γ*_1 t*_{D_2}), the number of false discoveries FD_1(γ*_1 t*_{D_2}), the mirror-estimated number of false discoveries F̂D_1(γ*_1 t*_{D_2}), and the mirror-estimated false discovery proportion F̂DP_1(γ*_1 t*_{D_2}). Note that here t*_{D_2} is the threshold that is learned on fold 2 and then applied on fold 1. The argument (γ*_1 t*_{D_2}) may be omitted when there is no risk of ambiguity. Quantities for fold 2 are defined in a similar fashion. Now we proceed to the proof.
Step 1: show that in order to prove the result, it suffices to show (13). Indeed, if (13) is true, then by symmetry P(FDP_1 > (1 + ε)α) ≤ δ/2. Further, by the union bound, with probability (w.p.) at least 1 − δ both fold-wise FDPs are at most (1 + ε)α. This further implies that, w.p. at least 1 − δ, the FDP on the whole dataset is at most (1 + ε)α, which gives the desired result. Hence, in the rest of the proof, we devote our effort to proving (13). Also, since we only deal with fold 2, we drop the subscript D_1 for the threshold learned on fold 1 and write t* := t*_{D_1}.
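One way to see why the two fold-wise bounds imply the overall bound is the mediant inequality (valid here since FD_j ≤ D_j, so the ∨1 causes no trouble):

```latex
\mathrm{FDP}
= \frac{\mathrm{FD}_1 + \mathrm{FD}_2}{(D_1 + D_2) \vee 1}
\le \max\left\{ \frac{\mathrm{FD}_1}{D_1 \vee 1},\ \frac{\mathrm{FD}_2}{D_2 \vee 1} \right\}
= \max\{\mathrm{FDP}_1,\ \mathrm{FDP}_2\}.
```

Hence, on the event (of probability at least 1 − δ) where both FDP_1 and FDP_2 are at most (1 + ε)α, the overall FDP is also at most (1 + ε)α.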
Step 2: convert the probability P(FDP_2 > (1 + ε)α) to an analyzable stochastic process. Let E_0 denote the set of random variables that we wish to condition on, including the hypothesis splitting, all covariates {x_i}_{i∈[N]}, the types of hypotheses {h_i}_{i∈[N]}, and the alternative p-values {P_i}_{i∈H_1}. Let us consider the conditional version of (13). Recall that FD_2 and D_2 correspond to the best rescaled threshold γ*_2 t* on fold 2, where the best scale factor γ*_2 is selected from the set {γ : …}. Then η can be upper bounded accordingly. Furthermore, since the indicator function is one only when F̂D_2(γt*) / (D_2(γt*) ∨ 1) ≤ α (with the convention that x/0 = ∞ for any x > 0), we can bound further. Again, since the indicator function is one only when D_2(γt*) ≥ c_0 N, we obtain (15). Finally, we can complete the conversion by noting that the first term in (15), i.e.
can be understood as a stochastic process: as γ grows from 0 to infinity, new elements are added to the numerator and the denominator with equal probability. Hence this term should always be close to 1. We next proceed to prove the result following this intuition.
Step 3: upper bound the probability in (15). We first note which p-values are involved in (15); the relevant indicators behave like Rademacher random variables. In addition, it is easy to verify that B_{i,γ} is independent of R_i, and (15) can be written in terms of the B_{i,γ}'s and the R_i's. Divide the range of γ in the sup from [0, ∞) into [0, γ_0] and (γ_0, ∞), and apply the union bound to P(FDP_2 > (1 + ε)α | E_0). Define the random set B_γ = {i : i ∈ H_{0,2}, B_{i,γ} = 1}. We note that the sequence of sets {B_γ}_{γ≥0} is monotonic in the sense that as γ grows, more elements are incorporated into B_γ. With this definition, the above inequality can be rewritten as (18). Next we upper bound the two terms in (18) separately; for simplicity, let m := αc_0 N. The first term in (18): for some m_0 > 2m to be specified later, the law of total probability splits the first term in (18) into the two terms in (21), which are upper bounded separately. Consider the first term. Recall that {B_γ}_{γ≥0} is a random sequence of monotonic sets; let {B_γ}_{γ≥0} denote any of its realizations. Since taking the expectation over all possible realizations with |B_{γ_0}| ≤ m_0 is no greater than taking the sup over them, the first term in (21) is bounded by a sup over such realizations. Consider the term inside the probability, i.e., the sup over 0 ≤ γ ≤ γ_0. Recall that the sequence {B_γ}_{γ≥0} is monotonic: as γ grows, elements are added to the set but no element is ever removed, and up to the point γ = γ_0 there are altogether |B_{γ_0}| elements. Hence the sup is equivalent to a sup evaluated over a sequence of |B_{γ_0}| + 1 monotonic sets. Setting m_0 = 3m yields the bound (23) on the first term in (18). The second term in (18): for some m_1 ≤ 2m to be specified later, the law of total probability splits the second term in (18) in the same way. Using the same argument as for the first term in (21), and recalling that R̃_1, R̃_2, . . . is a sequence of i.i.d. Rademacher random variables independent of everything else, the second inequality follows from Lemma 1.
Setting m_1 = m yields the bound (27) on the second term in (18). Combining (23) and (27) gives (29). By equating the right-hand side of (29) with δ/2 we solve for ε, which concludes the proof. In order for the proof to hold, it is required that the mirror estimate F̂D_2(γt*) is stochastically no less than the true number of false discoveries FD_2(γt*) for any γ ≥ 0. This remains true when the i.i.d. assumption on the null p-values is relaxed to the assumption that the null p-values, conditional on the covariates, are independently distributed and stochastically greater than the uniform distribution. Hence the result extends directly.

Lemma 1 with proof
Lemma 1. (Some properties of a random walk) Let R_1, R_2, . . . be i.i.d. Rademacher random variables and let S_k = Σ_{i=1}^{k} R_i. Then for any integer n > 1 and any real number t > 0, P(max_{1≤k≤n} S_k ≥ t) ≤ 2P(S_n ≥ t) (30) and P(max_{1≤k≤n} S_k ≥ t) ≤ 2e^{−t²/(2n)}, (31) where for the second and the third inequalities, we require t to be large enough for the probability bound to be positive.
Proof. (30) is proved via a standard reflection argument for random walks. First consider the case where t is an integer: P(max_{1≤k≤n} S_k ≥ t) = P(max_{1≤k≤n} S_k ≥ t, S_n ≥ t) + P(max_{1≤k≤n} S_k ≥ t, S_n < t) = P(S_n ≥ t) + P(max_{1≤k≤n} S_k ≥ t, S_n > t) = P(S_n ≥ t) + P(S_n > t) ≤ 2P(S_n ≥ t).
If t is not an integer, P(max_{1≤k≤n} S_k ≥ t) = P(max_{1≤k≤n} S_k ≥ ⌈t⌉) ≤ 2P(S_n ≥ ⌈t⌉) ≤ 2P(S_n ≥ t).
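The reflection bound (30) can be checked by exact enumeration over all sign sequences of a short walk; the helper names below are ours.

```python
import itertools

def walk_stats(n):
    # Enumerate all 2^n sign sequences of a Rademacher walk exactly,
    # recording (running maximum of S_k, final value S_n) for each.
    stats = []
    for signs in itertools.product((-1, 1), repeat=n):
        prefix, s = [], 0
        for r in signs:
            s += r
            prefix.append(s)
        stats.append((max(prefix), s))
    return stats

def check_reflection(n, t):
    # Verify P(max_k S_k >= t) <= 2 P(S_n >= t) by exact enumeration.
    stats = walk_stats(n)
    p_max = sum(m >= t for m, _ in stats) / 2 ** n
    p_end = sum(s >= t for _, s in stats) / 2 ** n
    return p_max <= 2 * p_end
```

Because the enumeration is exact, this confirms the inequality itself rather than a Monte Carlo approximation of it.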

Lemma 2 with proof
Lemma 2. (Some properties of a non-homogeneous Bernoulli sum) Let B_i ∼ Bern(p_i) be independent Bernoulli random variables. Then