Introduction

Many researchers are now focusing efforts on elucidating the role of rare genetic variants in predisposition to common disease.1 This undertaking usually involves next-generation resequencing at high coverage, in order to identify very rare variants with high accuracy.2 Indeed, the majority of variants identified in large-scale high-depth sequencing studies are rare,3, 4, 5, 6 and as the number of individuals added to a sequencing study increases, there is a disproportionate increase in the number of singletons (variants present in only one individual) identified.5 The value of these singletons and other very rare variants (defined here as those with allele frequencies less than 0.125%) in improving power to identify gene-disease associations is currently unclear. In terms of design planning and resource allocation, it is important to know when singletons and other very rare variants have a significant impact on power, under which assumptions and statistical models, and when their contribution to power is negligible.

To address this question, we undertook Sanger sequencing of 1998 individuals at seven genes and simulated causal very rare variants to understand their effect on the statistical power of prominent rare variant disease-gene association methods.7, 8, 9, 10, 11, 12, 13

Material and Methods

Study samples and Sanger sequencing data

The subjects used are a subset of the CoLaus study,14 and the data consist of Sanger sequences for the exons and flanking regions of seven genes, provided by GlaxoSmithKline, Upper Merion, PA, USA. The sequencing methods have been described previously,15 and a summary of the genes’ characteristics is shown in Table 1.

Table 1 Description of the seven genes and count of rare variants per gene

Simulations

We recently presented a simulation framework using these data to explore parameters potentially influencing rare variant associations with continuous and dichotomous traits.16 Here, we use the same framework to investigate the impact of the inclusion or exclusion of very rare variants on the power of two recently developed statistical methods for testing rare variants: the variable-threshold (VT) approach,9 and a variance component regression method, the Sequence Kernel Association Test (SKAT).10 These two methods have good power when compared to other rare variant methods.16
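To make the collapsing idea behind the variable-threshold (VT) approach concrete, the following Python sketch may help. It is a simplified illustration and not the implementation used in this study: for each candidate MAF threshold, rare alleles below the threshold are collapsed into a per-individual burden, the burden is scored against the centred phenotype, the largest score over thresholds is kept, and significance is assessed by permuting phenotypes. The function names, the two-sided (absolute value) score, and the genotype coding (minor-allele counts 0/1/2) are assumptions made for illustration.

```python
import numpy as np

def vt_statistic(genotypes, phenotype, thresholds):
    """Largest absolute burden z-like score over candidate MAF thresholds.

    genotypes: (n_individuals, n_variants) array of minor-allele counts.
    Assumption: a two-sided score is used here for simplicity.
    """
    n_ind = genotypes.shape[0]
    maf = genotypes.sum(axis=0) / (2 * n_ind)
    centred = phenotype - phenotype.mean()
    best = 0.0
    for t in thresholds:
        # Collapse all polymorphic variants with MAF <= t into one burden score.
        burden = genotypes[:, (maf > 0) & (maf <= t)].sum(axis=1)
        denom = np.sqrt(np.sum(burden ** 2))
        if denom > 0:
            best = max(best, abs(burden @ centred) / denom)
    return best

def vt_permutation_pvalue(genotypes, phenotype, maf_max=0.01, n_perm=1000, rng=None):
    """Permutation p-value for the maximum-over-thresholds statistic."""
    rng = np.random.default_rng(rng)
    n_ind = genotypes.shape[0]
    maf = genotypes.sum(axis=0) / (2 * n_ind)
    # Candidate thresholds: every observed MAF among rare variants.
    thresholds = np.unique(maf[(maf > 0) & (maf <= maf_max)])
    observed = vt_statistic(genotypes, phenotype, thresholds)
    exceed = sum(
        vt_statistic(genotypes, rng.permutation(phenotype), thresholds) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction so the estimated p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)
```

Because the same maximisation over thresholds is repeated in every permutation, the permutation p-value accounts for the multiple thresholds tested. Removing variants below a frequency cutoff shrinks the set of thresholds and the burden itself, which is one way to see why a collapsing test can be sensitive to such exclusions.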

Phenotype simulations

Phenotypes were simulated to depend on genetic variants chosen from the complete list of rare variants in a gene, allowing us to assume that very rare variants, such as singletons, can have a deleterious effect on the trait. The phenotypes were simulated to illustrate power under a variety of commonly held hypotheses about the potential effects of rare variants on traits.

The parameters investigated were the proportion of rare variants (MAF ≤1%) that are causal (P=10, 15, 20, and 30%) and the effect size of the causal rare variants (μ=0.5, 0.75, 1, 1.25, and 1.5 SD), taken as the mean effect on a continuous trait; combining these gave 20 scenarios. Phenotypes among individuals without any causal genetic variants were assumed to follow a standard normal distribution. A proportion, P, of the rare variants in each gene was randomly chosen to be causal, and phenotype values of individuals carrying at least one rare causal allele were drawn from a normal distribution with the mean shifted by μ. Two additional scenarios16 explored the assumption that variants with lower MAF have larger effects. In Scenario 1, causal variants were sampled based on the inverse of their MAF. The proportion of causal variants was 10%. The effect of each variant was also based on its MAF, with the variant having the lowest MAF receiving an effect of −2.5 SD. The remaining effects follow equation 1 in Madsen and Browning.7 Scenario 2 replicates Scenario 1, except that the sampling of the causal variants was uniform. Permutation was used to control type-1 error in all statistical methods, and power is reported for a type-1 error of 0.05.
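The constant-effect phenotype model described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function name is hypothetical, genotypes are assumed to be coded as minor-allele counts (0/1/2), the number of causal variants is obtained by rounding P times the number of rare variants, and carriers are assumed to keep unit variance.

```python
import numpy as np

def simulate_continuous_phenotype(genotypes, prop_causal, effect_sd,
                                  maf_max=0.01, rng=None):
    """Simulate one continuous phenotype with a constant causal effect.

    genotypes: (n_individuals, n_variants) array of minor-allele counts.
    prop_causal: proportion P of rare variants chosen to be causal.
    effect_sd: mean shift (mu, in SD units) for carriers of a causal allele.
    """
    rng = np.random.default_rng(rng)
    n_ind = genotypes.shape[0]
    maf = genotypes.sum(axis=0) / (2 * n_ind)
    rare = np.flatnonzero((maf > 0) & (maf <= maf_max))
    if rare.size == 0:
        raise ValueError("no rare variants available")
    # Assumption: round P * (number of rare variants), with at least one causal.
    n_causal = max(1, int(round(prop_causal * rare.size)))
    causal = rng.choice(rare, size=n_causal, replace=False)
    # An individual is a carrier if it has at least one causal rare allele.
    carrier = genotypes[:, causal].sum(axis=1) > 0
    # Non-carriers follow N(0, 1); carriers follow N(mu, 1).
    phenotype = rng.standard_normal(n_ind)
    phenotype[carrier] += effect_sd
    return phenotype, causal, carrier
```

Under this sketch, power for a given (P, μ) combination would be estimated as the fraction of simulated replicates in which the permutation p-value of a test falls below 0.05.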

Results

We present simulated power results for SKAT and VT. By selectively excluding very rare variants from analysis, eight data sets were created, as detailed in the legends to Figures 1 and 2.
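The eight exclusion criteria listed in the legends to Figures 1 and 2 can be expressed as boolean masks over the variants of a gene. The Python sketch below is illustrative only: the function name and dictionary labels are hypothetical, genotypes are assumed to be coded as minor-allele counts (0/1/2), and doubletons, tripletons and quadrupletons are assumed to be defined by the number of carrier individuals, by analogy with the definition of singletons given in the Introduction.

```python
import numpy as np

def exclusion_masks(genotypes):
    """Return, for each of the eight data sets, a mask of variants kept.

    genotypes: (n_individuals, n_variants) array of minor-allele counts.
    """
    n_ind = genotypes.shape[0]
    # Number of individuals carrying at least one copy of each variant.
    carriers = (genotypes > 0).sum(axis=0)
    maf = genotypes.sum(axis=0) / (2 * n_ind)
    return {
        "all variants": np.ones(carriers.shape, dtype=bool),
        "singletons removed": carriers > 1,
        "doubletons and singletons removed": carriers > 2,
        "tripletons or fewer removed": carriers > 3,
        "quadrupletons or fewer removed": carriers > 4,
        "MAF < 0.00125 removed": maf >= 0.00125,
        "MAF < 0.005 removed": maf >= 0.005,
        "MAF < 0.01 removed": maf >= 0.01,
    }
```

Each mask can be applied as `genotypes[:, mask]` before running an association test, so that the same simulated phenotype is analysed under all eight criteria.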

Figure 1

SKAT and VT Continuous Traits: Relationship between effect size, proportion of causal variants and power as rare variants are removed. All causal variants have a deleterious effect. Each box corresponds to a different proportion of causal variants involved in the relationship between rare variants and continuous traits (from left to right, 10, 15, 20 and 30%). On the x-axis, effect sizes are in SD and correspond to the absolute value of the average effect size. By selectively excluding some variants from analysis, eight data sets were created: (1) All variants are included, (2) Singletons are removed, (3) Doubletons and singletons are removed, (4) Tripletons or fewer are removed, (5) Quadrupletons or fewer are removed, (6) Variants with MAF <0.00125 are removed, (7) Variants with MAF <0.005 are removed and (8) Variants with MAF <0.01 are removed from the analysis. The top panel shows power under the SKAT model; the bottom panel shows power under the VT model.

Figure 2

SKAT and VT Continuous Traits: Power when the sampling of causal variants and their effect sizes are inversely proportional to the MAF. In Scenario 1, left line, causal variants are sampled based on the inverse of their MAF, and the effect of each causal variant is based on the inverse of its MAF. Scenario 2, right line, is identical to Scenario 1, except that the sampling of the causal variants is uniform, that is, does not depend on MAF. By selectively excluding some variants from analysis, eight data sets were created: (1) All variants are included, (2) Singletons are removed, (3) Doubletons and singletons are removed, (4) Tripletons or fewer are removed, (5) Quadrupletons or fewer are removed, (6) Variants with MAF <0.00125 are removed, (7) Variants with MAF <0.005 are removed and (8) Variants with MAF <0.01 are removed from the analysis. The top panel shows power under the SKAT model; the bottom panel shows power under the VT model.

Figure 1a illustrates how the power of SKAT changes across the exclusion criteria. SKAT’s power is altered only modestly by exclusion of very rare variants, especially when the effects of causal rare variants are small to moderate. When the variants’ effects are larger, the small drop in power owing to removing very rare variants attenuates as the proportion of causal rare variants increases (Figure 1a). Similarly, when assessing the dichotomized phenotype, we did not observe any drop in power when removing very rare variants (Supplementary Figure S1). When we removed all rare variants (as defined by a MAF <0.01), that is, when no variant remaining in the model was associated with the trait, power was very poor in all scenarios. This is expected, as all assigned causal variants were excluded from the analysis.

The power of the collapsing method VT is more affected by the exclusion of rare variants than that of SKAT. Nevertheless, the loss in power is negligible when singletons are removed, even though these are the majority of variants available for analysis. However, power decreased as other thresholds were applied (Figure 1b). For dichotomous phenotypes and VT, power decreases with exclusion of any rare variants (Supplementary Figure S2), by about 20% in some scenarios. The decrease in power for the dichotomous trait analysis is explained in part by the simulation design,16 and in part because removing all rare variants below a certain cutoff will have an impact on a model that is designed to collapse variants below a threshold.

If the effect sizes of rare variants and their probability of being causal are inversely proportional to their MAF, power is sensitive to removal of the lowest frequency variants. This sensitivity is more pronounced for VT than for SKAT (Figure 2). In particular, if both the probability of being causal and the effect size increase for rarer variants, there is a large loss in power associated with exclusion of singletons (Figure 2, first scenario). However, we note that even when including such singletons, absolute power remains low.

Discussion

One of the strengths of our study is the use of Sanger sequencing data on seven genes, rather than simulated genotype data. We simulated a variety of phenotype-genotype models on these data. Our choices represent plausible scenarios,7, 9, 16 including both constant and allele-frequency dependent effect sizes.

Although the identification of rare variants associated with complex diseases and traits is now underway, our results demonstrate that the inclusion of very rare variants, and particularly singletons, does not always improve the power of two popular and powerful rare-variant gene-based association methods. This conclusion does not depend on effect size or the proportion of these variants contributing to the phenotype. However, for dichotomous phenotypes and when rarer variants have stronger effects, the power of the VT method may depend substantially on the inclusion of very rare variants of large effect.

In most instances where very rare variants have little effect on power, using lower coverage sequencing on larger sample sizes might be a more suitable allocation of resources. However, this is not always the case. For example, singletons are important when the effect size of causal variants is inversely related to their MAF, but less important if the sampling of causal variants is also inversely related to the MAF. Finally, we note that our study examined seven genes that are drug targets and are therefore not necessarily a representative sample of the entire genome.

We recently demonstrated that rare variants can be identified with good accuracy even using low-coverage sequencing.17 It is worth noting that Nelson et al.6 did not identify any strong associations with rare variants even among 14 000 people in 202 selected genes. We note, however, that a lower depth strategy is not applicable for rare diseases or phenotypically extreme traits, whose etiology may be strongly influenced by very rare or private variants. Our message is not intended to discourage efforts to identify causal rare variants, or to advocate excluding them from analysis, but rather to generate thoughtful discussion about study designs and allocation of resources.