Adaptive Group-combined P-values Test for Two-sample Location Problem with Applications to Microarray Data

The purpose of this article is to propose a test for two-sample location problem in high-dimensional data. In general highdimensional case, the data dimension can be much larger than the sample size and the underlying distribution may be far from normal. Existing tests requiring explicit relationship between the data dimension and sample size or designed for multivariate normal distributions may lose power significantly and even yield type I error rates strayed from nominal levels. To overcome this issue, we propose an adaptive group p-values combination test which is robust against both high dimensionality and normality. Simulation studies show that the proposed test controls type I error rates correctly and outperforms some existing tests in most situations. An Ageing Human Brain Microarray data are used to further exemplify the method.

In recent decades, technological advances have made it possible to collect simultaneously massive amounts of high-throughput data. For example, in biomedical studies, lots of magnetic response images (MRI) and functional MRI data are gleaned for each subject 1 ; various microarray expression patterns of thousands of genes are measured 2 . In addition, examples of these kinds are plentiful in computer science, engineering, climatology, geology, and finance. This type of data, often called high-dimensional data, are characterized with a large number of variables m and a relatively small number of samples n, usually m is considerably large than n (  m n). So developing approaches for high-dimensional data is of great practical importance. In this context, a problem of concern is to test for the equality of location parameters of two samples simultaneously. Assume that =  X X X i { , , , }( 1, 2) i i in 1 2 i are two independent random samples of sizes n 1 and n 2 , from m-variate distributions F 1 (X − μ 1 ) and μ − F X ( ) 2 2 with m-variate location parameters μ 1 and μ 2 , respectively. We consider the following high-dimensional null hypothesis μ μ = H : are the sample means, S n is the sample covariance matrix, and n 1 , n 2 are the sizes of two samples. The Hotelling's T 2 test requires that the data dimension m is fixed and less than + − n n 2 1 2 . It possesses desirable properties for low-dimensional data when m is fixed. However, the situation is changed for high-dimensional data. Bai and Saranadasa 3 studied the performance of the Hotelling's T 2 test for high-dimensional data and found that its powers drop significantly as m/n increases. A reason for this phenomenon is that Hotelling's T 2 test contains the inverse of sample covariance matrix which may not converge to the population covariance matrix when m is close to n or even is undefined when m > n.
To address this issue, under the assumption of equal covariance matrix, Bai and Saranadasa 3 proposed a new test by removing − S n 1 from the Hotelling's T 2 test. They also derived the asymptotic normality of the test statistic when m and n are of the same order. However, this requirement is too restrictive for high-dimensional data, in which m is often far larger than n. Motivated by this, Chen  but makes no contribution in testing, where ||⋅|| is the squared Euclidean distance. Note that these methods are scalar-invariant since the magnitudes of variables' variances which may vary greatly are not taken into account. Neglecting such heterogeneity information could lose power dramatically since the variables with larger variabilities which dominates the results may not be statistically significant. Hence under the assumption of multivariate normality, Srivastava et al. 5 developed a scalar-transformation-invariant test by replacing S n in the Hotelling's T 2 test with its diagnoal matrix. However, the aforementioned tests are essentially parametric in spirit since their performance would be degraded dramatically when the assumption of normality is not met, especially for heavy-tailed distributions. To overcome this limitation, Feng et al. 6 proposed a scalar-invariant test based on multivariate-sign-based procedures which is robust against non-normality. Although many existing methods are available to test for the equality of location parameters of two samples, most of them perform well under certain conditions on the degree of m/n. For example, Bai and Saranadasa 3 reqiures → ∈ p n c / ( 0, 1); Chen and Qin 4 requires Σ = Σ o tr( ) (tr ( )) 4 2 2 , where Σ is the covariance matrix; Srivastava et al. 5 ( ) for 1/2 < δ < 1. Nevertheless, in a range of high-dimensional applications, it is hard to determine the degree of m/n. Sometimes, the data dimension m can be unimaginable large relative to the sample size n. For example, in the microarray data, tens of thousands of genes are observed on tens of hundreds of samples 2,7 .
For two-sample location problem, an alternative solution is to use univariate test which constructs marginal test for each variable first and then employs some kind of p-values combination method to accelerative accumulate the marginal signals. Common p-values combination methods include Fisher's combined method 8 , truncated product method 9 , truncated tail strength method 10 , and adaptive rank truncated product methods 11 . Hu et al. 12 pointed that the performance of these combination procedures depend heavily on the magnitudes of p-values to be combined. When the magnitudes of p-values varies, they may suffer from a substantial loss of power. To tackle this issue, they proposed a group combined p-values method (denoted by GCP) for large-scale genetic association studies. In GCP, p-values are divided into three groups first and constructed into a test statistic within each group. The final test is obtained by combing these intermediated test statistics. To use GCP, one needs to define two thresholds for p-values beforehand. However, when the number of marginal tests is large, the performance of GCP is very sensible to the selection of thresholds. Hu et al. 12 used two self-defined thresholds which may result in power loss when most of the investigated p-values are not included in their pre-defined groups.
In this article, we aim to propose an adaptive group p-values combination test(AGCP) by optimizing the significant evidence of GCP obtained on each pair of a set of candidate thresholds applied to two sample location problem for arbitrary dimensional data since it is only based on marginal test statistics and poses no demands on the dimensionality. Extensive simulations show that the proposed test perform more powerful than some existing methods for two-sample location problem in high-dimensional, while maintaining correct type I error rates. The superiority of the proposed method is further exemplified with the Ageing Human Brain miacroarray data. In the analysis of this data, the proposed method succeeds in detecting the significant difference while other methods failed to do so.
Suppose that there are two independently and identically distributed random samples as follows:  6 . For a more general illustration, the AGCP test used here is assembled with the two-sample wilcoxon test for each marginal hypothesis. Three simulation models including multivariate normal distribution, multivariate t-distribution, and moving average model are considered to generate two-sample data. The specific scenarios are as follows: the first one is for multivariate normal distribution (MVN).
. X ij are sampled from t m,4 with 4 degrees of freedom, the mean vector μ i , and covariance matrix Σ i , i = 1, 2, =  j n 1, 2, , i ; the third one is for moving average model (MA). The k-th entry of X ij are sampled from the following moving average structure: 1, 2, , i and =  k m 1, 2 , , and Z ijk are i.i.d random variables. For the distribution of Z ijk , we let the first m/2 components of 1 be from centralized Gamma (4, 1) so that it has zero mean, and the other m/2 components from the standard normal distribution N(0, 1). Detailed settings of the other parameters will be introduced later.
Without loss of generality, we fix μ 2 = 0 and Σ = × I m m To assess the performance of the tests on controlling type I error rates, we set μ 1 = μ 2 . Clearly the null hypothesis (1) is true under this setting. In addition, under the alternative hypothesis, we let μ  Table 1 reports the empirical type I error rates of CQ, SKK, SS, and AGCP when the two-sample data are generated from m-variate normal distribution with three patterns of covariance matrix including DS1, DS2 and DS3. From this table, it can be seen that SS and AGCP maintain correct type I error rates with the values being close to the nominal level. When the sample size is small (n = 10), SKK yields inflated type I error rates. For example, when n = 10 and m = 10, the type I error rates of SKK under DS1, DS2 and DS3 are 0.084, 0.097 and 0.119, respectively. For CQ, it can control the type I error rates correctly in most cases, while appears to be a little larger than 0.05 when the sample size is small and the covariance matrix belongs to DS1.

Multivariate normal distribution.
The empirical powers of tests for the two-sample data sampled from multivariate normal data with (n, m) = (10, 100) and (n, m) = (25, 200) are presented in Fig. 1. Since SKK has inflated type error rates when the sample size is small, we excluded it from the power comparison when n = 10. Figure 1 shows that the powers of the proposed AGCP test are always larger than those of the other tests. Sometimes, its powers can exceed two times of those of the CQ, SKK and SS test. Multivariate t-distribution. The empirical type I error rates of the compared tests for the two-sample data from m-variate t-distribution are presented in Table 2. Among all settings, AGCP always can maintain the type I error rates correctly. Likewise, the type I error rates of CQ have some slight size distortion (a little larger than 0.05) when the sample size is 10 and the covariance structure is DS1. In other cases, they are close to the nominal significance level. For the multivariate t-distribution data, SKK has totally incorrect type I error rates. It occurs since SKK is exclusively designed for multivariate normal distribution. The performance of SS on type I error rate depend heavily on the covariance structure. Specifically, its type I error rates are a little larger than 0.05 under DS1 and appear to very low under DS2 and DS3, especially when the sample size is small (n = 0.3). This result is consistent with those in Feng et al. 6 . Figure 2 shows the empirical powers of CQ, SKK, SS, and AGCP for two-sample data generated from multivariate t-distribution with the covariance structures of DS1, DS2, and DS3. It can be clearly observed from this figure that AGCP is the most powerful test among all compared test under all considered cases. As the sample size increases, the superiority of AGCP over the other tests becomes large.   Table 1. Type I error rates of CQ, SKK, SS, and AGCP under the significance level of 0.05 when the two-sample data are generated from multivariate normal distribution. DS1-DS3 correspond to three patterns of dependence structures for Σ 1 , respectively. n is the sample size and m is the data dimension.   Moving average model. In addition, we examined the performance of tests for the data from moving average models. We choose a representative distribution for Z ijk , that is, letting the first m/2 components 2 , the specific setting for μ 1 can be subsequently obtained. Table 3 summarizes the type I error rates of CQ, SKK, SS, and AGCP for the data from moving average models. It shows that CQ and AGCP maintain the type I error rates reasonably well, while SKK and SS seem to be somewhat conservative with the type I error rates much smaller than the nominal significance level. Similar to the results for MVN and MVT, the type I error rates of SKK when n = 10 is a little bit inflated. The powers of tests are presented in Fig. 3. From this figure, we can observe that the proposed test always performs the best among all tests when the coefficients are with the "partial dependence" structure for both (n, m) = (10, 100) and (n, m) = (25,200). And such superiority becomes more significant as n increases. Under the "full dependence" structure, CQ, SKK, and AGCP perform similarly when n = 10, but AGCP outperforms the other two when the sample size becomes large.
The type I error rates seems to be slightly inflated when the sample size is 10. This occurs because under such situation, the sample size is too small relative to the dimension of data which maybe need more permutation replications to approximate p-values of marginal tests in AGCP. Applications to Ageing Human Brain Microarray data. To further exemplify the superiority of the proposed test, we apply it to the Ageing Human Brain Microarray (AHBM) data, downloaded from GEO with accession number GSE1572 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1572). Ageing of human brain, accompanied by slower processing speeds and decreasing ability to convert experiences to episodic  Table 3. Type I error rates of CQ, SKK, SS, and AGCP under the significance level of 0.05 when the two-sample data are generated from moving average model (Replicate 200 times). Two configurations including "full dependence" and "partial dependence" (denoted by FD and PD) are used to generate the coefficients ρ l , =  l m 1, 2, , . n is the sample size and m is the data dimension.  (Row 3 and 4). For each combination of (n, m), two allocations (denoted by Equal and Linear allocation) are specified for the nonzeros of μ 1 Two configurations including "full dependence"and "partial dependence" (denoted by FD and PD).
Scientific REPORTS | (2018) 8:8117 | DOI:10.1038/s41598-018-26409-1 memory, is known as a cause of cognitive decline and potentially risk factors of age-related neurodegenerative diseases, such as Alzheimer's disease [13][14][15] . The AHBM data were used by Lu et al. 16 to detect age-dependent gene regulation in human brain. It contains microarray expression patterns of genes from the frontal cortex of 30 neuropathologically normal individuals ranging in age from 26 to 106 years. Lu et al. 16 showed that the gene expression patterns are relatively stable among the group of individuals ≤42 years old. In addition, they performed statistical group comparison of expression levels from individuals ≤42 and ≤42 years old, about 4% of all genes were detected to undergo significant changes. Most of these genes were found to be associated with some functions, such as synaptic function, stem cell function, vesicular/protein transport, stress response, and others. Among the detected genes, we choose those belonged to some aging-associated pathways to form a gene set in the later analysis.
Five pathways including hedgehog signaling, mitogen-activated protein kinases/extracellular signal-regulated kinases (MAPK/ERK), phosphatidy linositol 3-kinase (PI3K), protein kinase C (PKC), and janus kinases/signal transducers and activator of transcription (JAK/STAT) which are reported to be aging-associated, were chosen. Specifically, hedgehog signaling is a major regulator of stem cell function whose reduced functionality is responsible for ageing 17 ; the core components or regulators of MAPK/ERK pathway were identified as the aging-dependent targets 18 ; the PI3K pathway was found to have relevance to cognitive processes in addition to pathological brain aging and neuro degeneration since it is implicated in aging and lifespan regulation, and the proliferation of adult neuronal progenitor cells 19 ; the PKC pathway and its adaptor protein RACK1 have been shown to be interdependent in pathological brain aging 20 ; JAK/STAT were found to be active in the aging and mature brain and play important role in the control of neuronal proliferation, survival and differentiation 21 . The pathway data was downloaded from InnateDB http://www.innatedb.ca/redirect.do?go=searchPws.
A total of 237 genes (listed in the Supplementary Materials) were included in the gene set. Since the individuals ≤42 years old in this data share similar gene expression patterns, all individuals are divided into two sample groups consisting of 10 individuals ≤42 years old and 20 individuals ≥42 years old in the analysis, respectively. Our aim is to detect simultaneously the difference of expression patterns of the gene set between two samples. To provide some insights into each gene's expression pattern, univariate comparison between two groups were first conducted using the wilcoxon rank-sum test and the p-values ranges from 0.00012 to 1. The detailed marginal p-values are presented in the Supplementary Materials. Then we apply the tests of Chen and Qin 4 , Srivastava et al. 5 , Feng et al. 6 , and the proposed test to conduct the overall comparison of expression patterns of the gene set between two groups. The p-values are 0.221 from Chen and Qin's test 4

Discussion
Through simulation studies, we show that the proposed test outperforms some competing multivariate tests with respect to the type I error rate and power in most scenarios. This is expected since the compared tests including CQ, SKK, and SS which are all Hotelling's T 2 -type tests, neglect the correlations among variables to bypass the non-convergence of the sample covariance matrix (Bai and Saranadasa, 1996), while our method takes the correlation of multiple variables into account and calculate the statistical significance level with the permutation method.
In this article, we developed an adaptive group-combined p-values procedures for two-sample location problem in high-dimensional data. The proposed test extends the p-value combining techniques by dividing p-values into several groups and combing them at the group-level. Instead of fixed thresholds, this adaptive procedure use the optimal one among all possible threholds which is able to improve the power of test significantly. The proposed test provides an efficient and flexible way to accumulate differece evidences across variables and has no restriction on the relationship between the data dimension and sample size. Through simulation studies, we showed that the proposed test outperformed some competing multivariate tests in most scenarios. Applications to Ageing Human Brain Microarray data further demonstrate its satisfactory performance.
In the proposed test, all p-values are divided into three groups and two groups with smaller p-values are used. However, the number of groups is sort of self-defined. Intuitively, such procedure can be generalized to J groups, J ≥ 3. Although Hu et al. 12 explained J = 3 is a good choice through simulation studies and the idea of the degrees of freedom, more theoretical results are needed to support this conclusion. Except the two-sample location problem, our proposed test have a variety of additional applications, such as large-scale genetic association studies. With the advance of high-throughput genotyping technology, researchers are able to get access to a large number of genetic variants. However, the signal of association between an individual genetic variant and the trait could be too weak to be detected by single-variant analysis 22,23 . At this time, a benefiting and complementary strategy for genetic association studies is to simultaneously testing the association between the trait and multiple genetic variants within a gene set or a pathway. A specific high-dimensional test problem thus arises. Our proposed method can be applied by conducting marginal association test for each genetic variant first and then use the proposed test to combine obtained p-values. Our method can also be extended to deal with nonparametric population comparisons in genetic association studies, where much work has been done [24][25][26] .
For our proposed methods, we recommend using the thresholds ξ = . . . . . . {0 0001, 0 001, 0 01, 0 05, 0 1, 0 2, 1}. This is mainly due to the following reasons. First, these thresholds are widely used in the context of p-values combination methods (Fisher 8 ; Zaykin et al. 9 ; Jiang et al. 11 ; Yu et al. 11 ). Besides, since the p-values greater than 0.2 generally do not contribute to the significance of test but may increase the variance substantially, the value of 0.2 is commonly used as the upper bound of threshold for the truncated p-value combination methods (Zaykin et al. 9 ). Finally, we also evaluate the performance of the proposed test under some other sets of thresholds containing ξ through simulations and the results turn out to be similar with those in the "simulation" section; we omit the details here for simplicity. Intuitively, such procedure can be generalized to J ≥ 3 groups. As Hu et al. 12 pointed out the AGCP with J = 3 possesses the potential to outperform that that with J > 3 due to the idea of pseudo degrees of freedom (DFs) for test statistics. That is, since the test with the form of −2 ln(X) is known to follow from the Chi-squared distribution, the pseudo DFs of the AGCP with J = 3 and J = 4 are 4 and 6, respectively. As the number of groups increases, the DFs might increase which yields less powerful tests.
It should be pointed out that our test has its drawback. In principal, the proposed test is supposed to be applied to any dimensional data since it is based on marginal p-values. However due to the difficulty of deriving the exact distribution, permutation procedure is adopted to calculate the statistical significance of the proposed AGCP which may suffer intensive computation or even be infeasible when the data dimension is very large.

Methods
In particular, the null hypothesis (1)  In principal, a large value of B is preferred since it can yield accurate results of p-value. However, increasing value of B would result in extensive computational cost. To balance such tradeoff, we use B = 10000 in this article.
The source of program R code used to perform the simulations is available in the supplementary material.