## Introduction

Three general statistical paradigms are available to analyze genetic association data: the Frequentist paradigm, the Bayesian paradigm1, 2 (or quasi-Bayesian paradigm3), and the pure likelihood or evidential paradigm (EP).4, 5, 6, 7, 8, 9 In each paradigm, the likelihood ratio, LR=f1(x)/f0(x), has a central role, where f1(x) and f0(x) are the probability functions for the random variable x under H1 and H0, respectively. The Law of Likelihood5, 10, 11 informs us how to interpret the LR, stating that the LR measures the strength of evidence favoring H1 over H0.

Under the Frequentist paradigm, the most powerful Frequentist test of H0 rejects in favor of H1 for sufficiently large values of the LR, using the Neyman–Pearson lemma; thus the LR dictates which test statistic to use. Although this is not a direct use of the LR for interpreting evidence strength, and the appropriateness of using a hypothesis test or P-value to represent evidence strength has been questioned,2, 5, 12 the LR remains integral to the hypothesis-testing framework of the frequentist paradigm.

The Bayes Factor (BF) is the Bayesian paradigm's alternative to the P-value.1 The BF can be interpreted as the factor by which the prior odds of association are changed in light of the data to produce the posterior odds of association. The parameters are integrated out of the likelihood function with a weighting given by the prior distribution on the parameters. When θ1 and θ0, the parameters of the prior distributions, reflect two simple hypotheses, the BF=LR. The BF provides an attractive alternative to the P-value for genetic association studies,1, 2, 3 yet it too has limitations: ‘It is well understood that the priors on the parameters of the model can have a non-negligible impact on the value of the Bayes’ factor even as the amount of data gets large.’1 (Supplementary Methods).

The EP takes the Law of Likelihood literally, and uses the LR itself rather than P-values or BFs to plan/design, analyze, and interpret genetic association studies. For the planning stage, the EP provides error probabilities analogous to type I and type II error rates based on LRs. These can be used to estimate sample size and to ensure that the probability of obtaining weak association evidence is low. For the analysis stage, likelihood functions take the central role, with LRs measuring the strength of evidence vis-à-vis two simple hypotheses, LR=f1(x)/f0(x).

In this study, we will delineate the planning, analysis, and multiple-testing approaches of the EP for use in genetic association studies, and highlight the advantages of using this paradigm. This represents an extension of our previous work, applying the EP to linkage analysis.7, 8 In the subsequent sections we provide definitions and the conceptual framework; show how evidential studies are planned for single tests of association; provide an application using a published fine-mapping study of Rolandic Epilepsy (RE);13 and then address the issue of multiple hypothesis testing. The methodology presented here is also applicable to candidate gene and whole genome association studies.

## Definitions and conceptual framework

### Using the LR as a measure of evidence

For the association studies discussed here we assume an underlying logistic regression model:

$$\log\left(\frac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_i \qquad (1)$$

We define πi=E(yi), where yi is equal to 1 when subject i has the disease and zero otherwise, and xi=1 if the ith subject has the genotype of interest, and zero otherwise. The null hypothesis of no association implies that β1=0, or, equivalently, that the OR is 1 (since β1=log(OR)), whereas under the alternative we will take $e^{\beta_1^{*}}$ equal to some value greater than 1, without loss of generality.

Let L(β*1; x) represent the likelihood function for the data x when the OR is $e^{\beta_1^{*}}$, whereas L(β1=0; x) is the likelihood under the null hypothesis for the OR. Assume further that β0 is a nuisance parameter that has been removed from the likelihood function using conventional methods (see section ‘Calculating error probabilities for a case/control association study: study planning’). Let

$$\mathrm{LR}=\frac{L(\beta_1^{*};x)}{L(\beta_1=0;x)} \qquad (2)$$

The LR in (2) is then the ratio of the two likelihoods, free of the nuisance parameter, and provides a measure of the relative evidence for a specified OR value versus OR=1. Common practice is to plot the likelihood as a function of $e^{\beta_1^{*}}$ (see section ‘Genetic association study of RE’); this will then provide a graphical representation of all possible LRs. Association can be determined by investigating the ratio of any two points on the curve, which correspond to two simple hypotheses.

To plan a study, an investigator needs to specify several values including an alternatively hypothesized OR value, $e^{\beta_1^{*}}$, which represents the minimum important effect size to detect (eg, OR=1.2 in a genome-wide association study); and some value of k>1 that is chosen to represent strong, convincing evidence favoring one hypothesis value over another. Possible choices for k may be 8, 32, 1000, and so on, with k=32 a commonly used benchmark in the evidential literature4, 5 and k=1000 (or even higher), a commonly used critical value in genome-wide linkage studies.14 A discussion on benchmarks can be found in Royall5, 6 and Edwards.15 The choice of k dictates the observed LR value at which one would declare strong evidence favoring one OR value over another. That is

$$\mathrm{LR}\ge k \quad\text{and}\quad \mathrm{LR}\le 1/k \qquad (3)$$

represent strong evidence favoring H1 and H0, respectively. An LR falling between k and 1/k represents weak evidence, indicating that there is insufficient evidence in the data to strongly favor either hypothesis.
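The three-way reading of an observed LR described above can be written as a small helper. This is a minimal sketch; the function name `classify_evidence` and its return strings are illustrative choices, not part of the original study.

```python
def classify_evidence(lr: float, k: float = 32.0) -> str:
    """Classify an observed likelihood ratio LR = L(H1)/L(H0) against benchmark k."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    if lr <= 0:
        raise ValueError("a likelihood ratio must be positive")
    if lr >= k:
        return "strong evidence for H1"
    if lr <= 1.0 / k:
        return "strong evidence for H0"
    # Anything strictly between 1/k and k is weak evidence.
    return "weak evidence"


if __name__ == "__main__":
    for lr in (40.0, 5.0, 0.02):
        print(lr, classify_evidence(lr, k=32))
```

The same observed LR can be strong at k=8 and weak at k=32, which is why the benchmark has to be fixed at the planning stage.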

### Error probabilities and bounds

Failures of the conditions in Equation (3) to occur when H1 and H0 are true, respectively, are considered errors, and their probabilities are defined in detail elsewhere.4, 5, 9 Briefly, two types of errors can occur under each simple hypothesis. The first occurs when the data yield strong evidence supporting the wrong hypothesis; for these we define the probabilities of misleading evidence,6

$$M_0(n,k)=P_0(\mathrm{LR}\ge k) \quad\text{and}\quad M_1(n,k)=P_1(\mathrm{LR}\le 1/k) \qquad (4)$$

under H0 and H1, respectively, where n represents the total sample size in the study (cases and controls). M0(n,k) is analogous to a type I error rate, yet is not fixed by design at α. The Mi(n,k), i=0,1, are allowed to vary but are bounded: there is an absolute but crude upper bound of 1/k that holds for all sample sizes.4, 5, 6, 10 Furthermore, under general regularity conditions a large-sample bound of $\Phi(-\sqrt{2\log k})$ exists,6 where Φ is the standard normal cumulative distribution function. This asymptotic bound holds for fixed-dimensional vector parameters (eg, the two degree of freedom association model) even when one uses profile likelihoods to construct the LR. These bounds ensure small error probabilities (well below 0.05 for reasonable k) in quite general situations.
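To see how the two bounds on the probability of misleading evidence compare, they can be tabulated for the benchmarks mentioned earlier. This is a minimal sketch that assumes the large-sample bound takes the form Φ(−√(2 ln k)) given in the evidential literature; the function names are illustrative.

```python
import math


def universal_bound(k: float) -> float:
    """Crude bound 1/k on the probability of misleading evidence (any n)."""
    return 1.0 / k


def large_sample_bound(k: float) -> float:
    """Asymptotic bound Phi(-sqrt(2 ln k)), with Phi the standard normal CDF."""
    z = -math.sqrt(2.0 * math.log(k))
    # Phi(z) written with the complementary error function.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


if __name__ == "__main__":
    for k in (8, 32, 1000):
        print(f"k={k:5d}  1/k={universal_bound(k):.4f}  "
              f"large-sample={large_sample_bound(k):.5f}")
```

For every k>1 the large-sample bound is tighter than 1/k, which is why the text calls the universal bound crude.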

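The second error type occurs when the data yield only weak evidence. For this the probabilities of weak evidence are defined as

$$W_0(n,k)=P_0(1/k<\mathrm{LR}<k) \quad\text{and}\quad W_1(n,k)=P_1(1/k<\mathrm{LR}<k) \qquad (5)$$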

under H0 and H1, respectively. As n gets large, Mi(n,k) and Wi(n,k) converge to 0. Although the convergence of Wi(n,k) with n is monotonic for continuous response data, the convergence of Mi(n,k) is not:6 Mi(n,k) generally reaches a maximum (although this maximum is itself generally small) at sample sizes where Wi(n,k) is very large, and then converges to 0. By the time Wi(n,k) is reasonably small, Mi(n,k) is well below its maximum.6

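Finally, the probabilities of strong evidence are

$$S_0(n,k)=P_0(\mathrm{LR}\le 1/k) \quad\text{and}\quad S_1(n,k)=P_1(\mathrm{LR}\ge k) \qquad (6)$$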

Minimizing the probabilities of misleading and weak evidence will necessarily maximize the probabilities of strong evidence, since

$$S_i(n,k)=1-M_i(n,k)-W_i(n,k),\quad i=0,1 \qquad (7)$$

S1(n,k) in Equation (6) is analogous to the frequentist concept of power. There is no frequentist analogue to Wi(n,k), outside the context of sequential testing.16

As Mi(n,k) has natural bounds that ensure it remains small, it is Wi(n,k) that must be controlled to ensure Si(n,k) is high. The value of Wi(n,k) varies as a function of three quantities: sample size; the minimum important effect size for the parameter of interest (ie, in our case the OR); and the criterion k.

## Calculating error probabilities for a case/control association study: study planning

Planning an evidential association study entails ensuring that Mi(n,k) and Wi(n,k) are small for i=0,1 and, as a consequence, that the Si(n,k) are high. This is accomplished by determining the required sample size as a function of minor allele frequency (MAF) and effect size, where effect size (eg, OR=1.5) and MAF are generally determined by study design. Error probabilities (Equations (4–5)) can then be calculated using a likelihood free of nuisance parameters. (Note that one is not restricted to these pre-specified parameter values for analysis; the specification is merely for planning.)

The logistic regression model (Equation (1)) contains a nuisance parameter, β0, whereas our interest is in the OR, $e^{\beta_1}$. Two options to eliminate the nuisance parameter are to condition on an appropriate statistic or to profile the nuisance parameter out. In section ‘Conditional likelihood’ we will provide analytical formulas for the error probabilities using the conditioning approach. For profile likelihoods, in contrast, we will use simulation to calculate the error probabilities (section ‘Profile likelihood’). Each option has its advantages: using a profile likelihood we can incorporate many covariates into the model, and these covariates can be coded in any way, allowing for additive, dominant, or any other coding for the genetic model; on the other hand, the conditional approach provides analytical formulas that are easier to interpret, yet allows for only a single dichotomous covariate. The error probabilities between the two approaches may differ slightly for the logistic regression model, but not substantially.

### Conditional likelihood

We can use a conditional likelihood to eliminate the nuisance parameter, β0, in Equation (1), and calculate the planning probabilities. The derivation of the likelihood and the closed-form solutions for the error probabilities are in Appendix S.1 in Supplementary Material. We illustrate some error probabilities and sample sizes resulting for H1: exp(β*1)=1.5 and 2 versus H0: exp(β1)=1, for representative MAFs (or at-risk genotype frequencies, depending on the assumed genetic disease model), and for k=32. Figure 1 shows Mi(n,k) and Wi(n,k) plotted against the sample size in each group (n1=n2) for k=32 and for an at-risk genotype frequency, t0/n, of 0.2, assuming we are in complete linkage disequilibrium with the disease allele. Under a recessive model this would correspond to an MAF=0.45. In Figure 1, the left column of plots gives M0(n,k) and W0(n,k), that is, the probabilities when H0 is true, whereas the right column shows M1(n,k) and W1(n,k), that is, the probabilities when the true OR is 1.5 (or 2, for the dotted lines). Note how small the probabilities of misleading evidence, Mi(n,k), are even for this relatively low criterion of k=32.

Mi(n,k) and Wi(n,k) are smaller for larger alternatively hypothesized ORs (compare OR=1.5 versus OR=2 in Figure 1), indicating that larger sample sizes are required to detect smaller alternatively hypothesized ORs, as one would expect. As the genotype frequency increases, the error rates decrease for a given sample size (data not shown). These observations suggest that sample size estimation be based on the smallest MAF to be analyzed and the smallest OR one wishes to detect. Notice also that for sample sizes where Wi(n,k) is small, Mi(n,k) is very small. This observation highlights that planning should be based on ensuring small Wi(n,k)s. As k increases, the Mi(n,k) decrease slightly, but the Wi(n,k) get disproportionately larger, indicating that it is counterproductive to decrease Mi(n,k) by raising the criterion for strong evidence, k (see Equation (A.1.3) and Supplementary Figure S.1 in Supplementary Methods).

Table 1 provides sample size estimates, through exact calculations, for given weak evidence bounds (ie, the sample size chosen to ensure that both W1(n,k) and W0(n,k) are below the value in column 1) when t0/n=0.2, 0.3, k=32, H0: OR=1 versus H1: OR=2. The maximum probability of misleading evidence over all n, $\max_n M_i(n,k)$, is also presented; despite being quite small, these maxima occur at sample sizes for which the probability of weak evidence would be too large to consider for a study.

In Table 1 misleading evidence is small when H0:OR=1 and H1:OR=2 for any n. Not surprisingly, the smaller the bound on the probabilities of weak evidence or the smaller the at-risk genotype frequency (or alternatively hypothesized OR (Figure 1)), the larger the sample size required. For comparison using frequentist methods, the number of cases (equal to controls) required to achieve 80% power at a nominal type I error rate of 0.05 to detect an OR=2 for genotype frequency of 0.3 would be 310. (See Strug et al9 for more general comparisons of evidential and frequentist sample size estimates, and section S.1 and Supplementary Table S.1 in Supplementary Methods for a power comparison.)

### Profile likelihood

A profile likelihood replaces the nuisance parameter of the likelihood function by its maximum likelihood estimator (MLE) at each fixed value of the parameter of interest. Thus, given the joint likelihood, L(β0,β1), the profile likelihood for β1 is $L_p(\beta_1)=\max_{\beta_0}L(\beta_0,\beta_1)$, where the maximization is conducted at fixed values of β1. Then one can treat the profile likelihood as a regular likelihood function17 under weak regularity conditions. One can profile out a multidimensional nuisance parameter vector to assess the relative support for different genotypic effect sizes, after adjusting for the covariates (assuming minimal collinearity). In this study, we will assume a disease is inherited in an additive manner, and we can calculate Mi(n,k) and Wi(n,k) just as we did in the ‘Conditional likelihood’ section, but using simulation and the profile LR, LRp=Lp(β*1)/Lp(β1=0).

Specifying the MAF (P=0.3), the minimum important effect size to detect (OR=1.5), and the prevalence of disease in those with the wild-type genotype (0.002), we simulated equal numbers of cases and controls (n1=n0=1, …, 800), with the genotypes in controls in Hardy–Weinberg equilibrium. For each combination of input parameters we simulated 1000 data sets assuming there was association with true OR=1.5, and 1000 data sets assuming no association (OR=1). From each data set j=1, …, 1000, of a given size (n=2, …, 1600) we calculated the LRpj for H0: OR=1 versus H1: OR=1.5. In each case, we calculated Mi(n,k) and Wi(n,k) by counting the number of times the LRpj fell in the appropriate range, then dividing by 1000, for example, for the data sets simulated under OR=1,

$$\hat{M}_0(n,k)=\frac{1}{1000}\sum_{j=1}^{1000} I\left(\mathrm{LR}_{pj}\ge k\right),$$

where I(·) is the indicator function.

Figure 2 provides the values for these error probabilities as a function of n.

Note the scale of the Mi(n,k) plots in Figure 2, where for any sample size, even with k=8, the Mi(n,k) remain very small, and are not of concern. However, at sample sizes where the Mi(n,k) are small, the Wi(n,k) may still be very large for all k considered. This again highlights the need to control the Wi(n,k) during planning, rather than the Mi(n,k). It should also be noted that for the scenario in Figure 2, it is not until the study contains 300 cases, that W1(n,k) drops as low as about 10%, even for k=8.
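The simulation procedure described in this section can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the code used for Figure 2: genotypes are drawn directly given case status (controls in Hardy–Weinberg equilibrium, case genotype weights proportional to HWE frequency × OR^g under the additive logistic model), so the baseline prevalence does not enter; the function names and the Newton solver for β0 are our own choices.

```python
import math
import random


def profile_loglik(beta1, cases, controls, iters=50):
    """Profile log-likelihood of beta1 in logit(pi) = beta0 + beta1*g.

    cases/controls hold genotype counts for g = 0, 1, 2 (additive coding);
    beta0 is maximized out by Newton's method at the fixed beta1.
    """
    totals = [a + b for a, b in zip(cases, controls)]
    n_cases = sum(cases)
    beta0 = math.log(n_cases / sum(controls))  # exact maximizer when beta1 = 0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(beta0 + beta1 * g))) for g in range(3)]
        score = n_cases - sum(n * q for n, q in zip(totals, p))
        info = sum(n * q * (1.0 - q) for n, q in zip(totals, p))
        beta0 += score / info
    eta = [beta0 + beta1 * g for g in range(3)]
    return (sum(c * e for c, e in zip(cases, eta))
            - sum(n * math.log1p(math.exp(e)) for n, e in zip(totals, eta)))


def draw_counts(n, weights, rng):
    """Draw n genotypes (0, 1, 2) with the given weights; return the counts."""
    counts = [0, 0, 0]
    for g in rng.choices(range(3), weights=weights, k=n):
        counts[g] += 1
    return counts


def error_probs(n_per_group, true_or, alt_or=1.5, maf=0.3, k=32,
                reps=1000, seed=1):
    """Monte Carlo proportions of profile LRs that are >= k, <= 1/k, or between."""
    rng = random.Random(seed)
    hwe = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    case_w = [p * true_or ** g for g, p in enumerate(hwe)]  # unnormalized
    b_alt, log_k = math.log(alt_or), math.log(k)
    n_h1 = n_h0 = 0
    for _ in range(reps):
        cases = draw_counts(n_per_group, case_w, rng)
        controls = draw_counts(n_per_group, hwe, rng)
        log_lr = (profile_loglik(b_alt, cases, controls)
                  - profile_loglik(0.0, cases, controls))
        if log_lr >= log_k:
            n_h1 += 1
        elif log_lr <= -log_k:
            n_h0 += 1
    return {"favors_H1": n_h1 / reps,
            "favors_H0": n_h0 / reps,
            "weak": 1.0 - (n_h1 + n_h0) / reps}


if __name__ == "__main__":
    # Under OR=1: favors_H1 estimates M0, weak estimates W0, favors_H0 estimates S0.
    print(error_probs(300, true_or=1.0))
    # Under OR=1.5: favors_H1 estimates S1, weak estimates W1, favors_H0 estimates M1.
    print(error_probs(300, true_or=1.5))
```

Looping `error_probs` over a range of `n_per_group` values and several k gives curves of the kind plotted against n.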

## Genetic association study of RE

In this section, we illustrate an evidential analysis as applied to an earlier study of RE.13 In that work we conducted mapping studies of RE to assist in unraveling its complex genetic inheritance. We conducted genome-wide linkage analysis in 38 families, using a subclinical phenotype present in all RE probands and some unaffected relatives; then we fine-mapped the linkage region with 44 SNPs in 68 RE cases and 187 controls; we replicated our association evidence in a sample from Calgary, Canada with 40 cases and 120 controls. See Strug et al13 for clinical descriptions and details of those analyses. In this study, we use the RE study to illustrate how to conduct an evidential association study, both for single SNP (section ‘Single SNP association analysis: using likelihood plots’) and regional SNP (section ‘Extending likelihood plots to a region of typed SNPs’) analysis.

### Single SNP association analysis: using likelihood plots

The likelihood function for the OR parameter at a given SNP graphically represents all the evidence about association in the data set. For a single SNP one can plot the likelihood, as a function of the interest parameter (eg, odds ratio, relative risk, hazard ratio, regression coefficient), under an assumed model (eg, dominant, recessive, additive, etc).

Figure 3 provides a simple example of an evidential analysis of genetic association at three SNPs, separately, and the presence of RE in independent cases (n1=68) and controls (n2=187), assuming an additive model for the genotype.

Figure 3 shows a profile likelihood for the odds ratio, profiling out the baseline odds. The likelihoods are standardized to have maximum value of 1 at the MLE. Each plot in Figure 3 provides objective evidence of what the data tell us about the interest parameter at that SNP. The two likelihood intervals (LIs) on each of the three plots represent values of the ORs that are consistent with the data, at a k=8 (1/8 LI) or k=32 (1/32 LI) level. LIs are analogous to confidence intervals. However, LIs do not have a long-run frequency interpretation; rather, they reflect the evidence about the OR in the given data set.

Figure 3(c) shows an association between SNP SG11S39 and RE at the k=8 level, where there are many alternative values of the OR around 1.79 that are better supported than an OR=1 by a factor of greater than 8 (see the vertical line at OR=1 to the left of the likelihood function), and with plausible OR values of 1.07–3.04 from the LI at the k=8 level. For k=32, the LI includes an OR=1 as a plausible value, and hence there is not strong evidence favoring any OR value over an OR=1 by a factor of 32 or more. The corresponding 95% confidence interval for the OR at this SNP is 1.07–2.94. The LI is relatively narrow, indicating substantial information available in the data.

Figure 3(a) and (b) show likelihood functions for two additional SNPs. The likelihoods provide a useful tool to assess which SNP has the most association evidence, in some sense. Although the LIs are a little wider for SNP SG11S39, the relative support for different ORs versus OR=1 is greater than for the others at and around the maximum, and the OR=1 vertical line is further to the left of the LIs for SG11S39 than for the others. (Supplementary Methods’ section S.2 and Supplementary Table S.2 provide frequentist and Bayesian association measures at these SNPs.)
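The construction of a standardized profile likelihood and its 1/k likelihood intervals can be sketched as follows. The genotype counts below are hypothetical and chosen only to match the group sizes (68 cases, 187 controls); they are not the RE data. The grid search and function names are illustrative assumptions.

```python
import math


def profile_loglik(beta1, cases, controls, iters=50):
    """Profile log-likelihood of beta1 (additive logistic model), beta0 by Newton."""
    totals = [a + b for a, b in zip(cases, controls)]
    n_cases = sum(cases)
    beta0 = math.log(n_cases / sum(controls))
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(beta0 + beta1 * g))) for g in range(3)]
        score = n_cases - sum(n * q for n, q in zip(totals, p))
        info = sum(n * q * (1.0 - q) for n, q in zip(totals, p))
        beta0 += score / info
    eta = [beta0 + beta1 * g for g in range(3)]
    return (sum(c * e for c, e in zip(cases, eta))
            - sum(n * math.log1p(math.exp(e)) for n, e in zip(totals, eta)))


def likelihood_interval(cases, controls, k, or_min=0.2, or_max=10.0, steps=2000):
    """Return (lower, upper, mle) for the 1/k likelihood interval of the OR.

    The likelihood is evaluated on a log-spaced OR grid and standardized to
    have maximum 1; the interval is the set of ORs with value >= 1/k.
    """
    grid = [or_min * (or_max / or_min) ** (i / steps) for i in range(steps + 1)]
    loglik = [profile_loglik(math.log(o), cases, controls) for o in grid]
    top = max(loglik)
    inside = [o for o, l in zip(grid, loglik) if l - top >= -math.log(k)]
    return min(inside), max(inside), grid[loglik.index(top)]


# Hypothetical genotype counts (0, 1, 2 copies of the allele); NOT the RE data.
cases = [20, 33, 15]
controls = [85, 80, 22]

if __name__ == "__main__":
    for k in (8, 32):
        lo, hi, mle = likelihood_interval(cases, controls, k)
        flag = "excludes" if (lo > 1.0 or hi < 1.0) else "includes"
        print(f"1/{k} LI: {lo:.2f} to {hi:.2f} (MLE {mle:.2f}); {flag} OR=1")
```

A SNP is read as showing association at level k when the 1/k LI excludes OR=1, which is the criterion applied in the figures.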

### Extending likelihood plots to a region of typed SNPs

Looking at hundreds or thousands of likelihood functions for individual SNPs, side by side as in Figure 3, is not efficient or helpful when it comes to getting an idea of what is happening across the RE linkage region. Thus, we developed a plot that provides much of the information that is in an individual likelihood function plot, while also providing association evidence for multiple SNPs by base pair position. It does this by plotting the LIs for each SNP, graying out those where an OR=1 is considered a plausible value at some prespecified k, while identifying those that ‘light up’ in a given gene by plotting them in color. For illustrative purposes we reproduce one such figure from the original analysis13 (see Figure 4), to illustrate how the general methodology works.

Figure 4 shows the evidential association plot across the region of 44 SNPs using the original sample of 68 RE cases and 187 controls. In this study, we used an additive disease model, a profile likelihood to eliminate the nuisance parameter from the likelihood function, and evidence strength of k=32 as a criterion to demarcate SNPs of interest (SoIs). To create these evidential figures we plot the SNPs by bp position on the x axis, and provide the OR on the y axis. The OR=1 line is plotted as a solid black horizontal line. Then, for each SNP the LIs for the ORs are plotted. These LIs are exactly the LIs provided in, for example, Figure 3. If association evidence exists at a given SNP (that is, if a SNP is flagged as a SoI because the 1/k LI excludes OR=1), the LI is presented in color, whereas, if no association evidence exists at the k-level specified, the LI is grayed out of the figure. The interpretation of an SoI is that there are alternative OR values that are favored by a factor of k or more over the likelihood at OR=1. Notice that the SoIs have LIs with three separate colors, navy blue, yellow, and turquoise. If the evidence strength is greater than 32 but less than 100 (ie, OR=1 is not in the 1/32 LI but is in the 1/100 LI) then just the navy blue portion of the LI is above the OR=1 horizontal line; if the evidence is greater than 100 but less than 1000, then the blue and yellow portions of the LI are above the OR=1 line; and if the evidence is greater than 1000, then the entire LI is above the OR=1 line, indicating that even at the k=1000 level, an OR=1 is not a plausible value. The small horizontal tick on each LI is the MLE, which provides information about the shape of the likelihood curve, and we can see from Figure 4 that the MLEs for the ORs at these three SoIs are approximately 2. The max LR for each SNP in color is also provided as text in the plot for calibration.

If the vertical LI colored line moves further above the horizontal OR=1 line with additional data rather than lower, then the additional dataset provides corroborating evidence that this SNP, with the same allele, is associated with increased risk of RE. Supplementary Figure S.2 in Supplementary Data provides the results from a joint analysis of the data in Figure 4 and a replication sample from Calgary, Canada of 40 cases and 120 controls, illustrating this principle. Table 2 lists the ORs, the 1/32 LIs, the max LRs, and the unadjusted P-values (for comparison) from the original (discovery sample) and the combined sample with Calgary. As can be seen in Table 2 (and Supplementary Figure S.2 in Supplementary Data), the LIs at all three SNPs of interest have become narrower, and moved further away from including an OR=1 as a plausible value. Interestingly, none of these three SNPs in the replication sample alone would show up as an SoI, highlighting the importance of analyzing samples jointly.

Figure 4 (and Supplementary Figure S.2 in Supplementary Methods) indicate that only SNPs in the elongator protein complex 4 (ELP4) ‘light up,’ pointing to the role that ELP4 might be having in RE susceptibility. Furthermore, the same SNPs are providing corroborating evidence, although the strength of the evidence differs between SNPs and across the two datasets.

## Accounting for multiple hypothesis testing in the EP

Methods to account for multiple hypothesis testing differ between the Bayesian, frequentist, and EPs. Frequentists must adjust their evidence measure, the P-value; Bayesians account for multiple tests by incorporating information into their prior probability;18 and in the EP we adjust our planning probabilities – but not the evidence measure itself – to account for the number of tests to be conducted. We discuss this evidential approach in detail.

### The family-wise error rate and the generalized family-wise error rate

The most common error rate chosen to control for multiple hypothesis tests is the family-wise error rate (FWER). As presented in Table 3, the FWER is defined as P(V≥1).19 It reflects the probability of rejecting at least one true null hypothesis (or observing misleading evidence under the null for at least one SNP), assuming none of m loci is associated.

The EP, unlike the standard frequentist paradigm, decouples error rates from evidence measures.8 This is important for multiple test implications, as delineated in reference 7. Briefly, when one conducts multiple SNP tests, the FWER increases with the number of tests conducted. In the frequentist paradigm, the FWER is always fixed at α (eg, α=0.05); therefore the significance criteria for any given test in a family of tests must be smaller (eg, α/m, m=number of tests). However, in the EP, M0(n,k) is not fixed but rather is allowed to vary and is not tied to the value of the LR at which one declares strong association evidence. The FWER based on M0(n,k) still increases with additional tests, so one must ensure in one's planning that over all tests, the FWER will remain at acceptable levels. However, the increase in the number of tests does not affect how we interpret the strength of the evidence itself, that is, the LR. We provide an upper bound on the FWER for the probability of misleading evidence:7

$$\mathrm{FWER}\le m\times M_0(n,k) \qquad (8)$$

where M0(n,k) is the probability of misleading evidence for one SNP test, as in Equation (4). This M0(n,k) corresponds to the probability calculations before data collection as outlined in section ‘Calculating error probabilities for a case/control association study: study planning.’ Thus, for a fixed number of SNP tests (m), this upper bound can be made smaller by decreasing M0(n,k) through sample size, k, MAF, or the pre-specified effect size. Increasing k is counterproductive, only minimally reducing Mi(n,k) while dramatically increasing Wi(n,k) (Equation (A.1.3) in Supplementary Methods); and the OR was chosen as the minimum important effect size to detect. If the Wi(n,k) based on the minimum important effect size and specified k remain large, then these error calculations suggest we simply do not have a sufficiently large data set; here, increasing the sample size is the most desirable and appropriate course of action, when feasible.

Adding samples to ensure that the bound on the FWER remains small can be accomplished through Scheme (1) single-stage designs and Scheme (2) two-stage designs. In Scheme (1) one would plan a larger total sample size n at the beginning of the study through the simple calculation in Equation (8), varying n such that m × M0(n,k) is sufficiently small. In Scheme (2) one adds the additional samples necessary from the calculation in Scheme (1) in a replication phase, which types only those SNPs or regions with strong evidence for association in the first stage. Scheme (2) results in a smaller bound on the FWER than Scheme (1) and may be more cost-effective, but S1(n,k) may be smaller (see section ‘Probability of detecting true positives’, and Appendix S.3 in Supplementary Methods for a (conservative) lower bound on the two-stage probability of strong evidence). Note here that the increase in sample size (or the replication component) is the ‘adjustment’ for multiple hypothesis testing.

Controlling the FWER may be inappropriate for genome-wide association studies or large-scale fine-mapping endeavors. If one uses Scheme (1) or (2) above, one could relax the requirement that even one type I error is unacceptable. When m is large, we might choose to tolerate up to g−1 false positives. Specifically, consider the generalized FWER,20 which can be expressed as gFWER=P(V≥g). The gFWER ensures a small probability of observing at least g misleading results in m tests if all are null. The value for g would be chosen depending on resources for follow-up. In this case,

when g=1, this quantity is approximately equal to m × M0(n,k) for small M0(n,k). Equation (9) shows that, for a given M0(n,k), as g gets larger, the bound on the gFWER gets smaller. Thus, the larger the g, the smaller the sample size required. Moreover, the method derived to control the FWER in reference 7 may also be used on the gFWER; that is, Equation (9) provides an upper bound on the gFWER, which can be used to plan larger studies or to implement the two-stage replication design to adjust for multiple hypothesis tests.
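The qualitative behavior described here (the probability of at least g misleading results falls quickly as g grows) can be illustrated numerically. The sketch below is not the paper's Equation (9); it assumes, purely for illustration, that the m tests are independent, so that the number of misleading results V is Binomial(m, M0). The function name is hypothetical.

```python
import math


def gfwer_independent(m: int, m0: float, g: int) -> float:
    """P(V >= g) for V ~ Binomial(m, m0); assumes independent tests."""
    below_g = sum(math.comb(m, j) * m0 ** j * (1.0 - m0) ** (m - j)
                  for j in range(g))
    return 1.0 - below_g


if __name__ == "__main__":
    m, m0 = 44, 0.002
    for g in (1, 2, 3):
        print(f"g={g}: P(V >= g) = {gfwer_independent(m, m0, g):.6f}")
    print(f"m x M0 = {m * m0:.3f}")
```

With g=1 the value sits just under m × M0(n,k), consistent with the approximation stated for small M0(n,k).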

### Probability of detecting true positives

Thus far, we have completely ignored the probability of detecting true positives, which should arguably be as important as, if not more important than, controlling false positives. It is straightforward to incorporate S1(n,k) into the planning for multiple tests, ensuring that the probability of getting at least one true positive out of m loci is high. Following the notation of Table 3, suppose that of m marker loci, m1 are truly associated with disease and the remaining m0=m−m1 are not associated. For each of the m1 true markers the probability of being detected is P1(LRi≥k)=S1(n,k), equal to the probability of strong evidence under the alternative hypothesis for one SNP test as in section ‘Calculating error probabilities for a case/control association study: study planning.’ Define PTP(m1) as the probability of detecting at least one of the m1 true positive loci. Several properties of PTP(m1) can be noted regardless of whether the markers are independent (see Appendix S.2 in Supplementary Methods for derivation and calculations): (1) PTP(m1) increases as the number of true positives increases; (2) the value of PTP(m1) is independent of the number of false markers, m0; and (3) PTP(m1) is bounded below by S1(n,k), the probability of strong evidence under the alternative in one SNP test. Thus, for any m1, if S1(n,k) is reasonably high for a single SNP analysis, then there is a good chance of identifying at least one true positive along with the false positives. For a single-stage design, S1(n,k) is calculated as in section ‘Calculating error probabilities for a case/control association study: study planning’ with the expanded data set as the new sample size. For the two-stage design some additional calculation is required. The details are given in Appendix S.3 in Supplementary Methods. There, we see that in the two-stage design,

$$S_1(j_1,k)\times S_1(j_2,1) \qquad (10)$$

provides a lower bound on the probability of strong evidence under the alternative, where j1 and j2 represent the numbers of observations in the first and second stages, respectively, and n=j1+j2. Equation (10) implies that a larger total sample size is required for the two-stage design to achieve equally large strong evidence probabilities.
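The three properties of PTP(m1) listed above are easy to check in the special case of independent markers, where PTP(m1)=1−(1−S1)^m1. The text states that the properties hold more generally; independence is assumed here only to get a closed form, and the S1 values used are hypothetical.

```python
def ptp_independent(s1: float, m1: int) -> float:
    """P(at least one of m1 true loci shows strong evidence), independent loci."""
    return 1.0 - (1.0 - s1) ** m1


def two_stage_lower_bound(s1_stage1: float, s1_stage2_k1: float) -> float:
    """Product-form lower bound S1(j1, k) x S1(j2, 1) for a two-stage design."""
    return s1_stage1 * s1_stage2_k1


if __name__ == "__main__":
    s1 = 0.60  # hypothetical single-SNP probability of strong evidence
    for m1 in (1, 2, 5):
        print(f"m1={m1}: PTP = {ptp_independent(s1, m1):.4f}")
    # Hypothetical stage-wise values; the product cannot exceed either factor.
    print(two_stage_lower_bound(0.60, 0.80))
```

PTP(m1) depends only on S1 and m1, never on the number of null markers, and equals S1 when m1=1.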

In summary, in an association study, one can adjust for multiple hypothesis testing by controlling the FWER or gFWER through a single- or two-stage design, while simultaneously ensuring a high probability of detecting at least one true positive by ensuring W1(n,k) is small (or, equivalently, that S1(n,k) is large; Equation (7)).

### Multiple testing applied to the RE example

We use the RE discovery sample and the Calgary replication sample to illustrate the evidential multiple-testing approach. We use a two-stage design to adjust for multiple hypothesis tests controlling the FWER. With 68 RE cases and 187 controls M0(k=32) equals 0.002 to detect an OR=1.5 with MAF=0.30; thus for 44 SNP tests the FWER≤0.088 (by Equation (8)). Combining the data in a joint analysis with the Calgary sample, the FWER≤0.044 (with the two-stage design bound even smaller, depending on the number of markers chosen for follow-up).

Consequently, adding the Calgary data serves as our adjustment for conducting multiple SNP tests because it ensures that the FWER is controlled at acceptable levels – exactly the point of a multiple test adjustment.
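The arithmetic behind these FWER bounds can be checked directly from the relation FWER ≤ m × M0(n,k). The sketch below uses only the values quoted in this section (m=44, M0=0.002 for the discovery sample, and a combined-sample bound of 0.044); the function names are illustrative.

```python
def fwer_bound(m: int, m0: float) -> float:
    """Upper bound m x M0 on the family-wise error rate, capped at 1."""
    return min(1.0, m * m0)


def implied_per_test_m0(m: int, fwer: float) -> float:
    """Per-test probability of misleading evidence implied by a bound on m tests."""
    return fwer / m


if __name__ == "__main__":
    # Discovery sample: 44 SNP tests with M0 = 0.002 per test.
    print(f"discovery bound: {fwer_bound(44, 0.002):.3f}")
    # A combined-sample bound of 0.044 corresponds to this per-test M0.
    print(f"implied per-test M0: {implied_per_test_m0(44, 0.044):.4f}")
```

Halving the per-test probability of misleading evidence through added samples halves the bound, which is the sense in which the replication sample acts as the multiple-testing adjustment.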

The lower bound on the PTP(m1) using the combined sample is S1(415, 32)=0.04, and under the two-stage approach it equals S1(255, 32)×S1(160, 1)=0.003, for OR=1.5 and MAF=0.30. Although this is only a lower bound, the sample size should be much larger to ensure a reasonable bound on the probability of strong evidence. Section ‘Genetic association study of RE’ and Figure 4 illustrate that there was, however, strong evidence of association in one of the genes under the linkage peak, but at a larger OR value than the error probability calculations pre-specified. The a priori small strong evidence probability bound associated with the study does not detract from the strong conclusions of association we can make between RE and ELP4; we are just unable to unequivocally rule out the other genes in the region. From a planning perspective, it is best to have one's study characterized by a low probability of observing weak evidence and not to rely on good fortune.

## Discussion

We have provided an alternative approach to analyzing genetic association studies, which does not require the use of P-values, Bayes’ factors, or standard multiple test adjustments. These genetic association studies could involve genome-wide analysis, fine-mapping of linkage regions, or candidate genes. In summary, we have shown that case–control genotype data can be analyzed for association using LRs; that when conducting association analyses across multiple SNPs one can adjust for multiple testing by using a replication sample (increasing sample size) and conducting a joint analysis; and that the evidential error probabilities are straightforward to compute and are useful and necessary when planning a study.

A replication study (or the use of additional samples) provides multiple test adjustments in the evidential framework. Replication studies are already required by many journal editors for publication, by funding agencies, and by policy makers. In addition, by planning a genetic association study evidentially through sample size choice and multiple test correction approaches, one can control the probability of obtaining weak association signals.

Evidential analysis evaluates evidence vis-à-vis all possible pairs of simple hypotheses, and chooses SNPs of interest through LI criteria. LIs are more appropriate than confidence intervals for genetic association studies because they reflect what the collected dataset has to say about association rather than requiring a long-run frequency interpretation.
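To make the idea of an LI concrete, the following is a minimal sketch of a 1/k likelihood interval for the OR at a single SNP. It assumes a 2×2 allele-count table, a logistic model with the baseline allele odds profiled out, and a simple grid over the log OR. The counts are hypothetical, the function names are illustrative, and this is not the implementation used in the Evian package.

```python
import math


def log_lik(alpha, beta, a, b, c, d):
    """Two-binomial log-likelihood under a logistic model.

    a, b: case alleles with / without the risk allele; c, d: same for controls.
    logit P(risk allele | control) = alpha; logit P(risk allele | case) = alpha + beta.
    """
    def term(n_with, n_without, eta):
        # n_with*log(p) + n_without*log(1-p) with p = expit(eta), computed stably.
        softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        return n_with * eta - (n_with + n_without) * softplus
    return term(a, b, alpha + beta) + term(c, d, alpha)


def profile_log_lik(beta, a, b, c, d, lo=-20.0, hi=20.0, iters=100):
    """Maximize over the nuisance parameter alpha by ternary search
    (the log-likelihood is concave in alpha for fixed beta)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_lik(m1, beta, a, b, c, d) < log_lik(m2, beta, a, b, c, d):
            lo = m1
        else:
            hi = m2
    return log_lik((lo + hi) / 2.0, beta, a, b, c, d)


def likelihood_interval(a, b, c, d, k=32):
    """Return (lower OR, upper OR, MLE of OR) for the 1/k likelihood interval.

    Searches log OR on a grid from -3 to 3 in steps of 0.01 (an assumption;
    widen the grid if the interval reaches its edges).
    """
    beta_hat = math.log((a * d) / (b * c))  # MLE of the log OR in a 2x2 table
    max_ll = profile_log_lik(beta_hat, a, b, c, d)
    grid = [i / 100.0 for i in range(-300, 301)]
    inside = [bt for bt in grid
              if profile_log_lik(bt, a, b, c, d) - max_ll >= -math.log(k)]
    return math.exp(min(inside)), math.exp(max(inside)), math.exp(beta_hat)


# Hypothetical allele counts, for illustration only.
lo32, hi32, or_hat = likelihood_interval(50, 50, 30, 70, k=32)
lo8, hi8, _ = likelihood_interval(50, 50, 30, 70, k=8)
print(f"OR MLE = {or_hat:.2f}; 1/32 LI = ({lo32:.2f}, {hi32:.2f}); "
      f"1/8 LI = ({lo8:.2f}, {hi8:.2f})")
```

Every OR value inside the 1/k interval is supported by the data to within a factor of k of the best-supported value, so the 1/8 interval is always nested inside the 1/32 interval.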

There is a common misconception concerning the role the simple alternative plays in evaluating the evidence in the EP. To be clear, the values one chooses for the simple hypotheses during planning are irrelevant for analysis; one is not tied to any particular pre-specified values when assessing evidence strength. For more on this topic, as well as a concrete example, see Strug and Hodge.8 Briefly, the specified alternative value of the OR should represent ‘the smallest meaningful difference’ from the null hypothesized value of OR=1. However, an alternative hypothesis is specified for planning purposes only; once the data have been observed, the value of β*1 has no role in interpreting the evidence, and the investigator can and should report the whole likelihood function (or LIs). The MLE never has the role of alternative hypothesis at the planning phase, for many reasons, one of which is that the MLE does not represent a simple hypothesis, and thus the universal and other bounds do not apply to the maximized LR.21

A limitation of the pure likelihood or evidential approach to analysis is its dependence on the correct choice of model. However, recent advances have provided methodology to ‘robustify’ likelihoods to guard against model misspecification,22, 23 and this methodology is also available for use in genetic studies. Another perceived limitation is that, for a given SNP test, evidential analysis requires larger sample sizes and a more stringent significance criterion than standard frequentist methodology.9 On the other hand, standard benchmarks for evidence strength are known to be anticonservative.24

Our RE example highlights the ‘power’ one gains from a joint analysis, similar to results from other paradigms.25 Yet even in a joint analysis using a P-value approach, a different, more stringent significance criterion must be applied because of multiple-testing penalties imposed by the frequentist paradigm. In the EP, we manage to avoid all evidence adjustments regardless of the design; rather, we adjust the error probabilities at the planning phase of the study through the sample size or by replication.

The RE example illustrates several other important differences between the two approaches as well: (1) rs986527 would not have been significant after Bonferroni correction in the original RE discovery sample, and so, depending on the scheme for follow-up, this SNP might not have been typed in a replication scheme; (2) if the Calgary samples had been analyzed separately using a P-value approach, only rs210426 would have been flagged as significant and this SNP did not appear important in the original sample; (3) depending on how one defines replication, many might conclude from a separate analysis that the Calgary sample did not replicate the original findings. In fact, this is not the case. We can see that the LIs at rs986527 and rs964112 favor ORs greater than 1.5 over an OR=1 in both samples, with the difference in strength easily attributed to factors such as differential LD patterns, varying MAFs, different sample sizes, and stochastic factors. Moreover, the fact that only SNPs in ELP4 ‘light up’ in the two analyses strongly suggests replication of ELP4.

Evian, an R package to conduct an EVIdential ANalysis and produce the illustrated evidential genetic association plots, is available at http://strug.ccb.sickkids.ca/evian. In this study, we advocate the use of evidential analysis for genetic association studies, highlight the multiple hypothesis-testing adjustment approaches, and illustrate how to plan evidentially. The multiple test adjustment approaches, that is, the addition of replication samples, are more consistent with the practice of science and with the field's move toward large-scale meta-analyses.