Introduction

Genetic association analysis is widely used in the search for genes contributing to complex diseases. One common study design is to collect genomic data from affected and unaffected unrelated individuals and to contrast their genetic features at the level of gene expression (e.g., Hedenfalk et al. 2001; Haga et al. 2002; Ozaki et al. 2002) or marker distribution frequencies (e.g., Risch and Merikangas 1996; Collins et al. 1999). Such studies tend to encounter the problem of multiple testing because of the use of densely spaced markers (Ohashi and Tokunaga 2001). In that case, the Bonferroni procedure, which aims at preventing the occurrence of even a single false positive, is frequently adopted. However, the traditional Bonferroni procedure is known to be over-conservative, because its critical value becomes extremely large when many hypotheses are examined simultaneously.

Several approaches have been proposed to pursue higher power in multiple-testing settings. One direction is to modify the testing procedures and focus on constructing more powerful tests (see Dudoit et al. 2003 for a review). Improved procedures have been obtained either by considering measures other than the overall type I error rate, such as the false discovery rate (FDR; e.g., Benjamini and Hochberg 1995; Storey and Tibshirani 2003; Tsai et al. 2003), or by empirically determining the significance threshold of multiple tests (e.g., Ge et al. 2003; Becker and Knapp 2004). Another direction adopts the design point of view and accounts for both power and cost. Our proposed approach is motivated by the design perspective.

Given that the cost of a genetic study is largely determined by the number of individuals recruited and the number of markers genotyped, sequential designs that enhance study efficiency are becoming increasingly popular (e.g., Böddeker and Ziegler 2001; Saito and Kamatani 2002; van den Oord and Sullivan 2003a, b; Thomas et al. 2004; Hirschhorn and Daly 2005). Either the sample, the marker density, or both may be increased sequentially. Such strategies have also been proposed to deal with the low power caused by the Bonferroni correction. For example, for array experiments, Miller et al. (2001) suggested first selecting a smaller set of microarray samples and proceeding to a second stage of tests with another data set using only the genes found significant in the first stage. Later, Allison and Coffey (2002) pointed out that this two-stage method can be even more conservative if the two significance levels are not chosen carefully. Several other papers have proposed different two-stage methods from different perspectives. Hoh et al. (2000) and Ott and Hoh (2001) proposed first selecting a smaller subset of single nucleotide polymorphisms (SNPs) and then using them in the modeling stage to reduce the number of coefficients to be estimated. Elston et al. (1996) and Guo and Elston (2000) proposed genotyping at two different spacings to save cost. Several authors (see, for instance, Saito and Kamatani 2002; Satagopan and Elston 2003; van den Oord and Sullivan 2003a, b) advocated two-stage (or multi-stage) procedures that inflate the probability of false positives in the first stage and control this probability in later stages. Others have focused on genotyping cost, allocation of sample sizes, and selection of significance levels in two-stage designs. Satagopan et al. (2002, 2004) considered a two-stage design under the constraint of a fixed total number of genotypings. Saito and Kamatani (2002) performed extensive simulation studies based on various combinations of first-stage type I error, sample sizes at the two stages, and genotype relative risks in search of an optimal design. The tests conducted at their two stages use different, and hence independent, data. Thomas et al. (2004) used a likelihood approach to select tagging SNPs in the second stage with a larger sample that combines the previous data.

Here we introduce a design for a two-stage procedure for multiple testing that includes the data from the first stage. We conduct a formal test in the second stage, conditional on the findings of the first, using both the original and the augmented data. The objective of the first stage is to eliminate, from the total number (M) of markers, those that are very unlikely to be associated with the disease of interest. A large significance level is used, and markers with “large” P values are excluded. In other words, at this stage we intend to include as many true positives as possible by tolerating a larger-than-usual number of false positives. This follows the same idea as that proposed by van den Oord and Sullivan (2003a, b), i.e., relaxing the false positive probability in the first stage. After obtaining a smaller and more promising set of markers, we conduct statistical tests with a stringently controlled overall type I error using a combination of the first-stage sample and the newly genotyped data. van den Oord and Sullivan (2003a) and others (Miller et al. 2001; Saito and Kamatani 2002; Satagopan et al. 2002) considered only new data in their second-stage test, to avoid the complexity in test statistics caused by interdependence between the first and second samples. As argued by van den Oord and Sullivan (2003a), inclusion of first-stage data may elevate false positive discoveries, but it can reduce the genotyping burden. Here, we recommend incorporating both samples, and we show that the increase in false discoveries can be reduced by a larger sample size and a more stringent significance level. A test with combined data saves on overall genotyping costs, as long as the dependence and its effect are handled carefully. This point is similar to one mentioned by Satagopan and Elston (2003), although they did not pursue it further. The proposed procedure is designed to increase the power of each test in the first stage and to reduce the type I error in the second stage. Overall, the procedure is more powerful at a controlled false positive rate (FPR), and it avoids wasting resources on genotyping markers with no association.

In this paper, we use the term true positive rate (TPR) for the probability of rejecting truly associated markers, and FPR for the probability of rejecting non-associated markers. The FPR can be considered a measure of the overall type I error rate. We use TPR and FPR when referring to the overall performance of the M multiple tests, and retain the terms type I and type II error when considering a single test. In the following section, we explain the rationale of the design, discuss its implementation, and derive theoretical success and failure rates. More technical details can be found in Appendices 1 and 2. Simulation studies and a real example are then considered. The performance of the method, in terms of TPR and cost saving, is evaluated and compared with other Bonferroni-type procedures.

Materials and methods

Two-stage selection procedure

In the first stage, we adopt a large significance level, α1, for each test to ensure that even markers with mild effects are detected, which guarantees a large TPR at this stage. In population-based case-control studies, for instance, a chi-square test for contingency tables or a z-test comparing two proportions can be used. Taking α1=0.05 yields large first-stage power for each of the M SNPs (or markers), each tested with N1 allele observations per group. For SNPs, N1 is the total allele count from the case group, so N1/2 is the number of cases. The R markers found significant proceed to the second stage, where the sample for each of these R markers is enlarged by N2 alleles per group; that is, only the R markers are genotyped in the additional individuals. Therefore, the total number of genotypings is MN1 in the first stage and RN2 in the second, and the total cost (MN1+RN2) is much smaller than that of a single-stage design (MN1+MN2). When testing the association of the R markers in the second stage, we adopt a smaller significance level, α2, for each test so that the overall FPR decreases. Some power may be sacrificed, but the inclusion of the additional sample compensates for the loss. The X markers that remain significant can then be used for further studies such as fine mapping.
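To make the flow concrete, the following is a minimal Python sketch of the procedure using the two-proportion z-test mentioned above. The function names and the `genotype_stage2` callback are illustrative assumptions, not from the paper, and the second-stage step here simply retests on the combined sample rather than implementing the formal conditional test derived in Appendix 1.

```python
# Minimal sketch of the two-stage selection procedure (illustrative names;
# the paper's formal second-stage test conditions on the first-stage result,
# whereas this sketch simply retests on the combined sample).
import numpy as np
from scipy.stats import norm

def two_prop_pvalue(x_case, n_case, x_ctrl, n_ctrl):
    """Two-sided z-test comparing allele frequencies between cases and controls."""
    p_pool = (x_case + x_ctrl) / (n_case + n_ctrl)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_case + 1 / n_ctrl))
    z = (x_case / n_case - x_ctrl / n_ctrl) / se
    return 2 * norm.sf(abs(z))

def two_stage_select(stage1_counts, genotype_stage2, alpha1=0.05):
    """stage1_counts[m] = (x_case, N1, x_ctrl, N1) for marker m.
    genotype_stage2(m) genotypes marker m in the new sample and returns
    (x_case, N2, x_ctrl, N2); it is called only for stage-1 survivors."""
    survivors = [m for m, c in enumerate(stage1_counts)
                 if two_prop_pvalue(*c) < alpha1]          # liberal screen
    R = len(survivors)
    alpha2 = alpha1 / max(R, 1)                            # stringent level
    hits = []
    for m in survivors:
        x1c, n1c, x1n, n1n = stage1_counts[m]
        x2c, n2c, x2n, n2n = genotype_stage2(m)            # only R markers genotyped
        p = two_prop_pvalue(x1c + x2c, n1c + n2c, x1n + x2n, n1n + n2n)
        if p < alpha2:                                     # test on combined data
            hits.append(m)
    return hits, R
```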

In effect, the first stage categorizes the M markers into two groups: those with large P values (obviously removable) and the rest (possibly associated). Any error induced by this loose separation can be corrected in the second stage: for instance, one may apply the traditional stringent Bonferroni correction to test the R markers there, or use Benjamini and Hochberg’s (1995) procedure to control the FDR. This combination ensures a high overall success rate (the TPR) and a low error probability (the FPR).

Notation and implementation

Among the M markers in total, let w be the fraction of markers that are truly non-associated with the disease, so that M(1−w) is the number of associated markers we want to identify. The proportion w is usually close to 1 in a large-scale association study (Ozaki et al. 2002). Let N1 be the sample size of allele data considered in the first stage. Without loss of generality, we assume equal sample sizes for the case and control groups, with N1 alleles from cases and N1 from controls. For the purposes of illustration, we take individual biallelic markers as the basic unit of association testing; the results generalize to association tests at the haplotype or genotype level.

For each marker, let α1=0.05 be the significance level in the first stage and 1−β1 the corresponding power. Suppose that, after the M first-stage tests, R (R=R0+Ra) markers are significant, where R0 come from the Mw non-associated markers and Ra from the M(1−w) associated markers. In the second stage, we increase the sample size by N2 for each of the case and control groups. We genotype only the R markers in the additional sample and perform a second test on these markers using the combined N1+N2 data, at significance level α2, where α2<α1. In terms of genotyping, as suggested by Satagopan et al. (2002), the total number of genotypings up to this point is MN1+RN2. The final number of significant markers after the second stage is denoted X (X=X0+Xa, where X0 results from R0 and Xa from Ra).

True positive and false positive rates and numbers

An intuitive measure of a successful testing procedure is the TPR, which can be considered the overall power. The TPR can be expressed as a product of two conditional probabilities, U1 and U2 (see Appendix 1), for successful detection at each stage. These two conditional probabilities depend on the testing procedures and significance levels employed. In the same manner, we express the overall FPR as a product of two conditional probabilities, Q1 and Q2, for incorrect rejection at each stage; their values depend on both α1 and α2. The level α1 can be fixed at a large value, say 0.05, to ensure a high probability of true significance in the first stage. The level α2, however, is made smaller to control the overall false positive results: for instance, we recommend α2=α1/R, where R is determined once the first-stage tests are complete. From Appendix 1, the expectations of TPR and FPR can be approximated by (1−β2) and α2, respectively, under conditions discussed later.

The sample sizes N1 and N1+N2 also affect the FPR. Based on the results in Appendix 1, by setting the proportion of N1 to N1+N2 larger than \((z_{1-{\alpha_{1}}/2}/z_{1-{\alpha_{2}}/2})^{2}\), the overall FPR will be controlled at the level α2. A planned design with properly chosen significance levels (α1, α2) and sample sizes (N1>N2) will ensure a high overall TPR. Alternatively, the approximated TPR and FPR can be fixed together to determine the sample sizes.
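As a quick numerical check of this condition, the sketch below computes the bound \((z_{1-\alpha_{1}/2}/z_{1-\alpha_{2}/2})^{2}\) for the settings used later in the simulations; the expected R plugged into α2 is an assumed value for illustration.

```python
# Checks the Appendix 1 sample-size condition: keep N1/(N1+N2) above
# (z_{1-a1/2} / z_{1-a2/2})^2 so the overall FPR stays near alpha2.
from scipy.stats import norm

alpha1 = 0.05
R_expected = 47                 # assumed E(R0)+E(Ra), e.g., for M=500, w=0.95
alpha2 = alpha1 / R_expected
bound = (norm.ppf(1 - alpha1 / 2) / norm.ppf(1 - alpha2 / 2)) ** 2
print(f"need N1/(N1+N2) > {bound:.3f}")       # about 0.36 in this setting
N1, N2 = 2000, 1000
print(N1 / (N1 + N2) > bound)                 # True: 0.667 > 0.36
```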

Given M and w, and assuming independence among the M tests, R0 is binomially distributed with Mw trials and success probability Q1. Similarly, the number of true positives, Ra, is binomially distributed with size M(1−w) and probability U1. When the first-stage significance level is fixed at α1 for each test, Q1 equals α1 and U1 equals (1−β1). Similarly, when α1 and α2 are fixed, the conditional probability of incorrect rejection, Q2, can be approximated by 1/[E(R0)+E(Ra)], and U2 by the expected value E(1−β2)/(1−β1) (details in Appendix 1), where the expectations E(R0)=MwQ1 and E(Ra)=M(1−w)U1 are taken with respect to the respective binomial distributions at given values of M, w, Q1, and U1.

Therefore, the expected number of correctly identified markers (true positives), E(Xa), can be approximated by M(1−w) × E(1−β2), since E(Xa)=RaU2 when Ra and U2 are given. Similarly, by the argument above, the number of false positives, X0, is binomially distributed as Bin(R0, Q2), so the probabilities of obtaining zero or one falsely significant marker can be estimated. For instance, the sum of these two probabilities is approximately 0.92 when M=500, w=0.95, and α1=0.05, implying that with high probability at most one non-associated marker is declared significant.
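A rough reconstruction of this calculation is sketched below; the first-stage power plugged into E(Ra) is an assumed value, since the exact figure depends on the effect sizes of the associated markers.

```python
# Approximates P(X0 <= 1), the chance of at most one false positive, via
# X0 ~ Bin(R0, Q2) with R0 set to E(R0) = M*w*alpha1 and
# Q2 ~ 1/[E(R0) + E(Ra)].  power1 is an assumed first-stage power.
from scipy.stats import binom

M, w, alpha1 = 500, 0.95, 0.05
power1 = 0.9                          # assumption for illustration
ER0 = M * w * alpha1                  # expected false survivors: 23.75
ERa = M * (1 - w) * power1            # expected true survivors: 22.5
Q2 = 1 / (ER0 + ERa)                  # ~0.022
print(binom.cdf(1, round(ER0), Q2))   # ~0.91, close to the 0.92 quoted above
```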

Results

Example: association with hypertriglyceridemia

The two-stage procedure was applied to a small data set containing only 15 SNPs as markers. These markers are located in the exons and introns of four genes: lipoprotein lipase (LPL), apolipoprotein A1 (APOA1), apolipoprotein C3 (APOC3), and apolipoprotein A5 (APOA5), which contain two, four, four, and five of the markers, respectively. The last five markers, on the APOA5 gene, were previously tested for association with hypertriglyceridemia (Kao et al. 2003). That study showed that, when all samples (290 hypertriglyceridemia cases and 303 controls) are considered, three markers (c.553G > T, c.1259T > C, and IVS3 + 476G > A) are statistically significantly associated with hypertriglyceridemia after Bonferroni correction.

For the purposes of illustration, we randomly selected 198 individuals from each of the case and control groups in the first stage and conducted 15 tests, each with α1=0.05. The significant markers (the R markers) then entered the second stage, where the data were augmented with the genotypes of the remaining 197 (197=593−198−198) individuals at each of the R markers. Each such marker was tested at significance level α2=0.05/R, and the markers found significant were taken as showing evidence of association. This procedure was replicated 100 times to assess its performance and to account for sampling variation.

In each replication, we also computed the total number of genotypings and divided it by 15 to derive the corresponding sample size K \((K=(15 \times 396+R \times 197)/15)\) for a single-stage design. A total of K individuals were then sampled, and their 15 markers were tested for association using the Bonferroni correction. The final results from the Bonferroni correction with K individuals [B(K)], with all individuals [B(all)], and with 396 individuals [B(N1)] were compared with those of the two-stage method.
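The bookkeeping behind K is simple enough to verify directly; the sketch below uses the numbers quoted in this example (396 stage-1 individuals, 197 additional individuals, 15 markers).

```python
# Computes the matched single-stage sample size K for the example:
# K = (15*396 + R*197) / 15, i.e., total genotypings spread over 15 markers.
M_MARKERS, N1_IND, N2_IND = 15, 396, 197

def matched_K(R):
    """Single-stage sample size with the same total genotyping effort."""
    return (M_MARKERS * N1_IND + R * N2_IND) / M_MARKERS

print(matched_K(6.97))   # average R over the 100 replications -> K ~ 487.5
```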

Table 1 lists the percentage of significant results for each marker over the 100 replications. For the purposes of comparison, we treat the results from the Bonferroni correction with all data as the hypothetical standard; in other words, markers 7, 9, 10, 12, 14, and 15, which show a larger difference in minor allele frequency, are considered associated with hypertriglyceridemia. When markers have a strong effect (numbers 12, 14, and 15 in APOA5), the case and control groups are easy to distinguish, and all methods agree. For markers of lesser strength (numbers 7, 9, and 10 in APOC3), the two-stage method reaches the same conclusion as the Bonferroni procedure with all data [B(all)]. However, the average number of genotypings for the two-stage method is only 7,313.09 (7,313.09=15×396+6.97×197, where 6.97 is the average R), i.e., 82% [82%=7,313.09/(15×593)] of the number required under B(all).

Table 1 Numbers in the last four columns are percentages of significant results in 100 replications. B(all) denotes the Bonferroni procedure with all data, B(K) with K individuals, and B(N1) with only N1 (N1=396) individuals. The absolute difference in minor allele frequency is denoted δ

When compared with the Bonferroni procedure using the same number of genotypings [B(K)], the two-stage method correctly identifies the associations a much higher percentage of the time (≥94%), whereas B(K) performs poorly.

In another analysis with N1=298 (149 cases) and N2=295, the two-stage selection procedure still identifies the same signals as Bonferroni’s method (results not shown), and the average relative cost in genotypings is only 72%.

Simulation studies

In this section, we describe simulation studies evaluating the success and error rates of the two-stage selection method. The final number of truly significant markers, Xa, and the TPR, Xa/[M(1−w)], are used to assess success, while the number of falsely significant markers, X0, and the FPR, X0/(Mw), quantify error. The results are compared with those of single-stage methods such as the Bonferroni procedures. As stated in Reich et al. (2003), the minor allele frequency of most discovered SNPs is greater than 10%. Therefore, we assume in the following that the allele frequencies range between 10% and 90%; that is, the allele frequencies pc and pn for the case and control groups, and their weighted average \(\bar{p}\), all fall within this interval. The range of their absolute difference, δ \((\delta=|p_{{\rm c}}-p_{{\rm n}}|)\), can consequently be derived (Appendix 2). The ranges of the two quantities are shown in Fig. 1. If genotyping could be made almost error-free, so that the allele frequency could take values from 0 to 1, the ranges of the two quantities would be larger, as indicated by the dashed lines in Fig. 1.

Fig. 1 The range for \(\bar{p}\) and δ. Solid lines show the range of possible values when frequencies are within 10–90%; dashed lines show the range when frequencies are between 0 and 1
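A small sketch, under the assumption of equal group sizes so that \(\bar{p}=(p_{\rm c}+p_{\rm n})/2\), reproduces the Fig. 1 boundary; Appendix 2 contains the formal derivation.

```python
# Largest attainable delta = |pc - pn| for a given average frequency p_bar,
# when both allele frequencies are confined to [lo, hi] (equal group sizes
# assumed, so p_bar is the plain average of pc and pn).
def delta_max(p_bar, lo=0.10, hi=0.90):
    if not lo <= p_bar <= hi:
        return 0.0
    return 2 * min(p_bar - lo, hi - p_bar)

for p_bar in (0.10, 0.15, 0.50, 0.85, 0.90):
    print(p_bar, delta_max(p_bar))
# delta can reach 0.8 at p_bar = 0.5 and shrinks to 0 at the endpoints;
# widening lo, hi to 0, 1 enlarges the range, matching the dashed lines in Fig. 1
```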

We fixed M at 500 (Table 2) or 500,000 (Table 3), w=0.95, and N1=2,000. Allele counts for the Mw non-associated SNPs were simulated under the null H0: δ=0 with a given \(\bar{p}\), while counts for the other M(1−w) associated SNPs were simulated with a given δ>0. We then tested each marker with α1=0.05. For the markers found significant, an additional sample of size N2=1,000 was generated and tested with α2=0.05/R. For the purposes of comparison, we also considered two Bonferroni procedures: one uses only the first N1 observations [B(N1)], while the other [B(all)] uses all N1+N2 observations for all M markers. The number of replications is 1,000 under each condition. Tables 2 and 3 list the theoretical values (numbers in parentheses) of TPR and FPR alongside the simulation results; the theoretical values are very close to the simulation results.
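The following condensed sketch reproduces one replication of the Table 2 setting. It is illustrative only: for vectorization, stage-2 counts are drawn for all markers, whereas in practice only the R stage-1 survivors would be genotyped.

```python
# One replication of the Table 2 simulation: M markers, fraction w null,
# N1 alleles per group in stage 1, N2 more in stage 2, alpha1 = 0.05,
# alpha2 = 0.05/R.  Only the stage-1 survivors' stage-2 results are used.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def pval(xc, nc, xn, nn):
    """Vectorized two-sided z-test for the difference of two proportions."""
    p = (xc + xn) / (nc + nn)
    se = np.sqrt(p * (1 - p) * (1 / nc + 1 / nn))
    return 2 * norm.sf(np.abs(xc / nc - xn / nn) / se)

def simulate_once(M=500, w=0.95, N1=2000, N2=1000, p_bar=0.15, delta=0.06):
    n_null = int(M * w)
    pc, pn = np.full(M, p_bar), np.full(M, p_bar)
    pc[n_null:] += delta / 2                    # associated markers differ
    pn[n_null:] -= delta / 2
    x1c, x1n = rng.binomial(N1, pc), rng.binomial(N1, pn)
    keep = pval(x1c, N1, x1n, N1) < 0.05        # stage-1 screen
    R = max(int(keep.sum()), 1)
    x2c, x2n = rng.binomial(N2, pc), rng.binomial(N2, pn)
    sig = keep & (pval(x1c + x2c, N1 + N2, x1n + x2n, N1 + N2) < 0.05 / R)
    return sig[n_null:].mean(), sig[:n_null].mean()   # (TPR, FPR)

print(simulate_once())
```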

Table 2 The false positive rate (FPR) and true positive rate (TPR) based on simulations (numbers in parentheses are theoretical values) for the proposed two-stage method and two Bonferroni procedures. The sample sizes are N1=2,000 and N2=1,000. The total number of markers is assumed to be M=500 and the proportion of non-associated markers is w=0.95. FDR False discovery rate
Table 3 The FPR and TPR based on simulations (numbers in parentheses are theoretical values) for the proposed two-stage method and two Bonferroni procedures. The total number of markers is assumed to be M=500,000

TPR and FPR

The average TPR of the two-stage method is substantially greater than that of either Bonferroni procedure. Even when the difference in frequency is small (δ<0.05), the two-stage method remains superior. However, when the difference is extremely small (δ=0.01), all methods fail to perform satisfactorily; in fact, the required sample size in this case would exceed 15,000, i.e., more subjects are needed to achieve reasonable power. A second factor affecting TPR is the average frequency \(\bar{p}\): as \(\bar{p}\) approaches 0.5, all TPRs decrease, but the two-stage method still outperforms the rest. Between the two Bonferroni procedures, the one with the greater sample size performs better. It should be kept in mind, however, that the B(all) method costs more than the others and requires more laboratory work.

When looking at the FPR and FDR, the two-stage method is not as good as the other two; however, these numbers are indeed small. For example, the FPR of \(0.158\times10^{-2}\) (Table 2) implies fewer than one false alarm on average. The FPRs of both Bonferroni procedures are small, indicating a very slim chance of identifying an unassociated marker, which is as conservative as we had expected. The TPR and FPR from the simulation studies also match the theoretical values quite well in both cases (M=500 or 500,000), indicating that the equations derived in Appendix 1 are good approximations.

Figure 2 presents the simulation results in an alternative form. Figure 2a, b compares the TPRs of the three methods under different δ when \(\bar{p}\) is 0.5 or 0.15. The curve of the two-stage method lies well above the rest, indicating greater power; in addition, these curves climb quickly to 1 over a short range of δ. Figure 2c shows the same TPRs with δ fixed at 0.04 (upper three lines) or 0.1 (lower three lines); the power is lower when \(\bar{p}\) is around 0.5. Figure 2d shows the range of FPRs for different values of \(\bar{p}\) and δ. The values all lie within a small range, \((1\times10^{-4}, 20\times10^{-4})\). This finding is consistent with the theoretical derivations in the Materials and methods section.

Fig. 2a–d The curves for true positive rates (TPR) and false positive rates (FPR) under various settings of \(\bar{p}\) and δ. a TPR with \(\bar{p} = 0.5\). b TPR with \(\bar{p} = 0.15\). c TPR with δ fixed at either 0.04 (upper three lines) or 0.1 (lower three lines). d FPR (×10\(^{4}\)) with \(\bar{p}\) fixed at 0.15 or 0.5

Figure 3 displays the TPR, FPR, and FDR for various values of M and different settings of N1 and N2. First, the patterns do not change even when M is as large as 500,000. Second, in Fig. 3a and d, the TPR of the two-stage method with dependent samples exceeds that of both the two-stage method with independent samples and Bonferroni’s method. Third, Fig. 3b, c, e, and f show that the errors, in terms of FPR and FDR, of the two-stage method with dependent samples are larger than those of the other two methods, although the difference is very small. We conclude that when correct detection is the main concern, the test with dependent samples is preferable; when reducing the possibility of false detection is the focus, the test with independent samples should be adopted.

Fig. 3 TPR (a, d), FPR (b, e), and false discovery rate (FDR) (c, f) curves with respect to various M. In all panels, \(\bar{p}=0.15\), w=0.95, and δ=0.06. In a–c, N1=2,000 and N2=1,000; in d–f, N1=1,000 and N2=2,000. Solid line two-stage method with dependent samples, dashed line Bonferroni method with all data, dotted line two-stage method with independent samples

Cost-effectiveness

The two-stage procedure also outperforms the alternatives in terms of cost-effectiveness, measured by the number of genotypings and the TPR. The total numbers of genotypings are MN1 for B(N1), MN1+MN2 for B(all), and MN1+RN2 for the two-stage method. Taking the simulation as an example, the increased cost of the two-stage method relative to B(N1) is only

$$\frac{RN_{2}}{MN_{1}} \approx \frac{({\hat{R}_{0}}+{\hat{R}_{a}})N_{2}}{MN_{1}}\le \frac{(50)(1,000)}{(500)(2,000)}=5\%,$$

while the increase in TPR can be as high as 39% when \(\bar{p}=0.15\) and δ=0.05. In fact, the increase in TPR is quite dramatic and can be larger than 40%, particularly when δ≤0.05. Meanwhile, the magnitude of FPR is still less than 0.001. A similar pattern can be found when comparing B(all) with the two-stage method. The increase in TPR is greater than 30% for δ≤0.05 (except when δ=0.05 and \(\bar{p}=0.15\)), and the two-stage procedure saves about 30% in costs compared with B(all).
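These figures can be verified with back-of-envelope arithmetic; the R below is an assumed stage-1 survivor count consistent with the 5% bound above.

```python
# Cost comparison for the simulation setting (M=500, N1=2000, N2=1000),
# with R ~ 50 surviving markers assumed, as in the bound above.
M, N1, N2, R = 500, 2000, 1000, 50
cost_BN1 = M * N1                    # B(N1): stage-1 genotypings only
cost_two = M * N1 + R * N2           # two-stage design
cost_Ball = M * (N1 + N2)            # B(all): every marker, all subjects
print((cost_two - cost_BN1) / cost_BN1)   # 0.05 -> the 5% extra cost
print(1 - cost_two / cost_Ball)           # 0.30 -> ~30% saving vs B(all)
```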

Discussion

In this paper, we have proposed a two-stage method for multiple hypothesis testing that mitigates the effects of multiplicity. The selection procedure aims to retain markers with even mild association while accounting for the two types of error simultaneously, thus saving on overall genotyping costs. We also derived theoretical approximations for the TPR, the FPR, and the expected counts of true and false significant results. The two-stage method outperforms the two traditional Bonferroni procedures with which it was compared. However, when δ is extremely small, none of the methods tested provides satisfactory results unless a study of much larger size can be conducted.

Although we have used SNPs as an illustration throughout this paper, the method is not restricted to this type of data. If genotypes are considered, say three genotypes per locus, chi-square tests can be conducted on 3×2 contingency tables, and the two-stage method can be implemented in the same way. When pedigree data are of interest, the unit of observation becomes the vector of observations for each family at each locus; the test statistic can then be constructed from the multivariate data, and the two-stage procedure applied as before. The method can also be used with microarray data, provided the number of subjects is not too small. In any case, the scheme does not change with the statistic used.

Through use of the two-stage design, the original power of a test statistic is always increased; the design guarantees greater overall power than is achievable in a single stage of testing. If dependence among the data exists, tests incorporating such relationships may be considered (Benjamini and Yekutieli 2001; Dale 2004). When a multi-stage approach is employed (van den Oord and Sullivan 2003a; Hirschhorn and Daly 2005), this procedure can be generalized, provided care is taken in deriving the TPR and FPR. It is worth noting that pooling the data from all previous stages is more powerful, but the tests must be carefully conditioned on the results of previous stages to handle the dependence, as illustrated here.

Another issue concerns the relative magnitudes of the sample sizes. Once a statistic is chosen to test for significance between the case and control groups, the standard sample-size formula can be applied for given α1 and α2, which determines the magnitudes of N1 and N2. Our findings also indicate that, since power is the primary interest in the first stage, a larger N1 will yield a greater overall TPR. If error rates are of greater concern, a larger N2 should be considered (van den Oord and Sullivan 2003a). Further studies on the optimal choice of the relative magnitudes of N1 and N2, and an investigation into constraints on haplotype size (or the number of SNPs), will be considered in a future manuscript.