Introduction

The Human Genome Project was expected to advance knowledge of the genetic basis of common complex disease. Unfortunately, identification of disease-causing genetic variants in complex traits has been challenging,1, 2 a difficulty which may be related, in part, to analytical strategies for gene discovery. Linkage and association are analytic methods used in complex trait analysis.3 Although many replicated linkage and association signals exist, except for a few examples,4, 5, 6 peaks do not overlap. This overlap failure may be fundamental to analytic differences in that linkage requires familial segregation whereas association tests for co-occurrence.7 Further, association pinpoints common variants with small effects, whereas linkage identifies large chromosomal regions of moderate or large effect.3 As such, overlap of linkage and association may occur only in special circumstances, such as Mendelian inheritance, but cannot be expected universally.

Various methods have been proposed to test whether a genetic variant can account for an observed linkage signal.8, 9, 10, 11, 12, 13 These methods model linkage and association jointly and thus may fail to identify variants when linkage and association are not co-occurring. Thus, there is a need for methods that test for the effects of variants on the linkage signal regardless of association.

To overcome this barrier, we propose a novel method, the Variant Impact On Linkage Effect Test (VIOLET). VIOLET is unique because it identifies genetic variants that impact the quantitative trait locus’ variance without any assumptions about association. Using simulated and real data, we demonstrate that VIOLET reduces false positives without a corresponding increase in false negatives when compared with measured genotype and combined linkage association. Thus, VIOLET may fill a gap in variant discovery.

Methods

Data sets

To compare VIOLET with standard methods, simulated (Genetics Analysis Workshop: GAW17) and real (Metabolic Risk Complications of Obesity Genes Study: MRC-OB) data sets were used. GAW17 had regions with linkage and association evidence.14, 15 Using MRC-OB, linkage on 7q36 for triglycerides was identified.16 After dense single-nucleotide polymorphism (SNP) genotyping, associations were found, but only a modest portion of the linkage was explained.17, 18

Simulated data set – GAW17

GAW17 used the 1000 Genomes19 exome sequence data14, 15 to generate a data set with 697 individuals in 8 pedigrees. Fully informative markers were used to compute identical-by-descent (IBD) allele sharing.15 A quantitative phenotype (Q1) influenced by 39 SNPs in 9 genes was simulated. GAW17 data providers have given permission for data use.

To identify regions exhibiting consistent evidence of linkage with Q1, all 200 simulations were evaluated. VEGFA, on chromosome 6p21.1, showed linkage (median LOD=3.1). Across chromosome 6, 856 SNPs were polymorphic, including the causal variant, C6S2981.

Real data set – MRC-OB

MRC-OB was established in 1994, when families were recruited from the Take Off Pounds Sensibly Inc. membership.16, 20 Fasting plasma triglycerides were determined spectrophotometrically in triplicate. A total of 2209 individuals from 507 families of Northern European descent formed the cohort.16 All protocols were approved by the Institutional Review Board of the Medical College of Wisconsin.

A genome-wide linkage scan identified a quantitative trait locus (QTL) on chromosome 7q36 linked to triglycerides (LOD=3.7).16 From the initial cohort, 1235 individuals from 258 families contributing to the linkage were selected for dense genotyping (Table 1) of 1048 tag SNPs using an Affymetrix MegAllele custom-designed array (Affymetrix, Santa Clara, CA, USA).17, 18, 21 Additionally, 354 SNPs from chromosome 14 were available for analysis. Chromosome 14 SNPs were used to determine VIOLET’s specificity. SNPs exhibiting Mendelian inconsistencies were blanked.

Table 1 Descriptive statistics for the Take Off Pounds Sensibly (MRC-OB) cohort

Statistical methods

Data preparation

Q1 and triglycerides were examined for normality. Q1 exhibited a normal distribution, so no transformation was applied. Triglyceride levels exhibited right skewing, so the data were natural log (ln) transformed. Data were re-examined and observations exceeding 4 SD units were removed.16
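The preparation steps described above can be sketched in Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the sample standard deviation is assumed, and the 4 SD rule is assumed to be applied once, symmetrically about the mean of the transformed values.

```python
import math
from statistics import mean, stdev


def ln_transform(values):
    """Natural-log transform a right-skewed trait (all values must be > 0)."""
    return [math.log(v) for v in values]


def drop_outliers(values, max_sd=4.0):
    """Keep observations within max_sd sample SDs of the mean (assumed rule)."""
    centre = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - centre) <= max_sd * spread]
```

In this sketch the transformation would be applied first and the outlier filter second, mirroring the order given in the text.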

Measured Genotype Association (MGA)

To test a SNP’s phenotypic effect, we used MGA.22 Briefly, genotypes were assigned as 0, 1, and 2 according to the number of minor alleles.23 To account for phenotypic correlation between family members, variance component analysis in SOLAR was used (Texas Biomedical Research Institute, San Antonio, TX, USA).24 Mixed effects models are applied where fixed effects are covariates. Random effects are defined by genetic and environmental deviations:

$$y = \mu + \beta\,\mathrm{SNP} + g + e$$

where μ is the grand mean, β is the SNP effect, and g and e are the genetic and environmental deviations, respectively. Assuming g and e are uncorrelated random normal variables with expectation 0, the phenotypic covariance of relative pairs (Ω) can be partitioned into additive genetic and environmental components,

$$\Omega = 2\Phi\sigma_g^2 + I\sigma_e^2$$

where Φ is the kinship matrix, I is the identity matrix, and $\sigma_g^2$ and $\sigma_e^2$ are the variances due to additive genetic (g) and residual (e) effects, respectively. To test a SNP effect, the log likelihood of the model estimating the SNP effect is compared with the log likelihood of the model in which the SNP effect is constrained to zero. Assuming that trait values follow a multivariate normal distribution, twice the difference in the log likelihoods of these two models is asymptotically distributed as $\chi^2_1$.
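The likelihood ratio test for a single SNP effect can be illustrated with a short sketch. It assumes the two maximized log likelihoods have already been obtained from a variance components fit (for example in SOLAR) and that the statistic is referred to a chi-square distribution with one degree of freedom, which is the usual choice for one constrained fixed effect. The function name is hypothetical.

```python
import math


def snp_lrt(loglik_snp_estimated, loglik_snp_zero):
    """Return (statistic, P-value) for testing one SNP fixed effect.

    The statistic is twice the log-likelihood difference. For a chi-square
    variable with 1 df, P(X > x) equals erfc(sqrt(x / 2)).
    """
    stat = 2.0 * (loglik_snp_estimated - loglik_snp_zero)
    stat = max(stat, 0.0)  # guard against small negative values from optimizer noise
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Using the closed form for one degree of freedom keeps the sketch free of third-party dependencies; a general chi-square survival function would be needed for more than one constrained parameter.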

Combined Linkage Association (CLA)

To test the impact of variants on the QTL effect, we adapted a variance components CLA (implemented in SOLAR).24 Briefly, the standard linkage model is defined by:24

$$\Omega = \hat{\Pi}\sigma_a^2 + 2\Phi\sigma_g^2 + I\sigma_e^2$$

where $\hat{\Pi}$ provides the predicted proportion of alleles that related individuals share IBD at locus A and $\sigma_a^2$ is the variance due to locus A. The significance of linkage is estimated through a LOD score, which is calculated by comparing models with and without $\sigma_a^2$. In this model, the grand mean accounts for the trait mean but not for the SNP effects. CLA is a linkage model conditional on a SNP fixed effect, such that:

$$y = \mu + \beta\,\mathrm{SNP} + a + g + e$$

where a is the deviation due to locus A. If a SNP accounts for all of the linkage, evidence of linkage should disappear; pragmatically, the linkage is considered fully explained when the LOD score drops below 0.5.25
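As a small illustration of the LOD comparison described above, the sketch below converts two maximized natural-log likelihoods (with and without the locus-specific variance) into a LOD score, using the standard definition of the LOD as a base-10 log likelihood ratio. The function names are hypothetical, and the 0.5 cutoff is the pragmatic threshold quoted in the text.

```python
import math


def lod_score(loglik_with_qtl, loglik_without_qtl):
    """LOD = log10 likelihood ratio = (difference in natural-log likelihoods) / ln(10)."""
    return (loglik_with_qtl - loglik_without_qtl) / math.log(10)


def linkage_fully_explained(lod_conditional_on_snp, threshold=0.5):
    """Apply the pragmatic rule: conditional LOD below the threshold."""
    return lod_conditional_on_snp < threshold
```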

As both the simulated and the real data exhibited differences in the base LOD score (simulation due to differences in replicates; real data due to some missing genotype data), examination of the LOD score from CLA did not provide a complete picture of the magnitude of change. To account for these differences, the percentage of LOD drop, $(\mathrm{LOD}_{\text{no SNP}} - \mathrm{LOD}_{\text{SNP}})/\mathrm{LOD}_{\text{no SNP}}$, was examined. However, it is important to note that the percentage of LOD drop is simply used to provide an assessment of the change in the LOD score while accounting for the baseline LOD.
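The LOD drop calculation is simple enough to state directly in code. The sketch below follows the formula in the text and returns a proportion (multiply by 100 for a percentage); the function name and the input values in the usage note are illustrative only.

```python
def lod_drop(lod_no_snp, lod_snp):
    """Proportion of the baseline LOD removed by conditioning on the SNP."""
    if lod_no_snp <= 0:
        raise ValueError("baseline LOD must be positive to compute a relative drop")
    return (lod_no_snp - lod_snp) / lod_no_snp
```

For example, a hypothetical baseline LOD of 4.0 that falls to 3.0 after SNP inclusion gives `lod_drop(4.0, 3.0) == 0.25`.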

VIOLET

To test the significance of the impact of a variant on the QTL, VIOLET builds upon CLA. However, VIOLET explicitly tests whether the variance explained by the QTL changes with SNP inclusion. This is operationalized by comparing the CLA model with a model that is identical to the CLA model except that $\sigma_a^2$ is constrained to be equal to the variance due to the locus when the SNP effect is constrained to zero ($\hat{\sigma}^2_{a|\beta=0}$). Thus

$$\Omega = \hat{\Pi}\,\hat{\sigma}^2_{a|\beta=0} + 2\Phi\sigma_g^2 + I\sigma_e^2$$

where the estimated IBD sharing matrix $\hat{\Pi}$, the kinship matrix Φ, and the identity matrix I are as in the linkage model.

To test for significance, twice the difference in the log likelihoods of the CLA and VIOLET models is evaluated; this test statistic is named V. This statistic differs from CLA because in CLA the major comparison is between a freely estimated $\sigma_a^2$ and one constrained to zero. Given that the likelihood function of VIOLET’s model is a function of $\hat{\sigma}^2_{a|\beta=0}$, which itself is a maximum likelihood solution, Wilks’ Theorem on the asymptotic approximations of test statistic distributions under the null hypothesis (that there is no difference in the goodness-of-fit) may not hold. As such, V’s distribution under the null was examined empirically to determine appropriate thresholds. For the real data, we evaluated V derived when genotypes were randomly assigned across 1000 permutations; however, microsatellite data for linkage were retained in their original structure.
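The computation of V and of an empirical P-value from permuted genotypes can be sketched as follows. The model fitting itself is assumed to happen elsewhere (for example in SOLAR), so the functions take maximized log likelihoods as inputs. All names are hypothetical, and the add-one estimator in `empirical_p` is a common conservative choice made for this sketch, not necessarily the estimator used by the authors.

```python
import random


def violet_statistic(loglik_cla, loglik_violet):
    """V = 2 * (lnL of CLA model - lnL of the constrained VIOLET model)."""
    return 2.0 * (loglik_cla - loglik_violet)


def permute_genotypes(genotypes, rng):
    """Randomly reassign SNP genotypes across individuals.

    Linkage (IBD) information is not touched, matching the description
    that microsatellite data keep their original structure.
    """
    shuffled = list(genotypes)
    rng.shuffle(shuffled)
    return shuffled


def empirical_p(observed_v, null_vs):
    """Add-one permutation P-value: (exceedances + 1) / (permutations + 1)."""
    exceed = sum(1 for v in null_vs if v >= observed_v)
    return (exceed + 1) / (len(null_vs) + 1)
```

With 1000 permutations and no permuted V reaching the observed value, this estimator returns 1/1001, which is just under 0.001.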

Results

Simulated data from GAW17

To evaluate the performance of VIOLET, measured genotype, and combined linkage association, two thresholds were utilized, 99% power and multiple testing corrected type I error (P=0.0000584). The power threshold was set to the 1% quantile from causal SNPs (V=4.17 for VIOLET, χ²=33.16 for measured genotype, and percentage of LOD drop=90.98 for combined linkage association). The type I error rate threshold was set to the 99.99416% quantile of non-causal variants (V=4.08, χ²=42.15, and percentage of LOD drop=97.61) (Table 2).
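The two thresholds can be illustrated with a short sketch. The reported P=0.0000584 matches 0.05 divided by the 856 polymorphic chromosome 6 SNPs, and it is assumed here that this is how it was derived. The quantile function uses a nearest-rank rule, which is one of several possible definitions and may differ from the one used in the original analysis; both function names are hypothetical.

```python
import math


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def empirical_quantile(values, q):
    """Nearest-rank quantile of observed statistics, with 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]
```

Under these assumptions, the type I error threshold would be `empirical_quantile(noncausal_stats, 1 - bonferroni_alpha(0.05, 856))` and the power threshold `empirical_quantile(causal_stats, 0.01)`, where both input lists are placeholders for statistics pooled across replicates.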

Table 2 False-positive and false-negative rates for MGA, CLA, and VIOLET using GAW 17-simulated data

VIOLET

The V null distribution for non-causal variants is highly skewed with 99.5% of observations falling below 0.13 (Figure 1). Controlling for type I error (V≥4.08), the causal variant, C6S2981 (MAF=0.033), was identified (median V=8.73, range 2.50–16.10) in 198 out of 200 simulations (Figure 2a). In all but two simulations, C6S2981 exhibited the highest V. Controlling for power (V≥4.17), we detected a very low false-positive rate (9/171 000=0.005%) (Table 2). These results demonstrate that VIOLET has a high degree of specificity, with little overlap in the distribution of V between non-causal and causal variants.
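The rates quoted above follow from simple counts, as the sketch below shows. The counts are taken from the text (9 false positives across 200 replicates of 855 non-causal SNPs; the causal variant detected in 198 of 200 replicates); the function names are hypothetical.

```python
def false_positive_rate(n_false_positives, n_replicates, n_noncausal_per_replicate):
    """False positives divided by the total number of non-causal tests."""
    return n_false_positives / (n_replicates * n_noncausal_per_replicate)


def power(n_detected, n_replicates):
    """Proportion of replicates in which the causal variant was detected."""
    return n_detected / n_replicates


fpr = false_positive_rate(9, 200, 855)  # 9 / 171000
detected = power(198, 200)              # 0.99
print(f"false-positive rate = {fpr * 100:.3f}%, power = {detected:.0%}")
```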

Figure 1

Distribution of the VIOLET test statistic (V) in non-causal variants. Results were from 200 simulations, each with 855 non-causal SNPs.

Figure 2

Comparison of results using the Genetics Analysis Workshop 17 (GAW17) simulated data set. Panel a presents VIOLET, panel b presents MGA, and panel c presents CLA. All data are presented as the median results from 200 simulated replicates. The black dot identifies C6S2981, the causal variant.

Comparison of VIOLET with MGA and CLA

Like VIOLET, MGA and CLA identified C6S2981 (median P-value=2.1 × 10⁻¹⁴, LOD=0) (Figure 2). All methods had high power to detect C6S2981 (Table 2) when controlling for type I error. MGA and CLA exhibited 93% and 88% power to detect C6S2981 as compared with 99% power for VIOLET. Further, when evaluating the percentage of LOD drop, there was 93% power.

When controlling for power, VIOLET exhibited fewer false positives than the other methods. Out of 200 replicates, VIOLET identified 9 false positives, whereas MGA identified 58 and CLA identified 106 using the LOD and 37 using LOD drop. Importantly, two non-causal SNPs were identified as associated using MGA after Bonferroni correction in over half of the simulations (median P-values=2.1 × 10⁻⁵ and 3.0 × 10⁻⁷; Figure 2b).

Real data from MRC-OB – analysis of linkage for serum triglycerides

Dense genotyping

Dense genotyping was performed on 1235 individuals. There were 1023 and 352 polymorphic SNPs on chromosomes 7 and 14, respectively. There were no major phenotypic differences between the full cohort and the dense genotyping (Table 1). The chromosome 7 LOD score was 8.2, which is higher than the full cohort as families were selected because they positively contributed to the linkage.

VIOLET

When using VIOLET, one SNP (rs39179) exhibited an increased test statistic (V=9.0; Figure 3a); beyond rs39179 there was no evidence of a variant contributing to the linkage (mean V excluding rs39179=0.000003±0.000081; Figure 3a). When evaluating 1000 permutations of rs39179, mean V=0.0007±0.0019, with no value exceeding 0.02 (empirical P-value<0.001). When 352 SNPs from chromosome 14 were examined, there was no evidence that these variants account for the chromosome 7 linkage (V<0.02), suggesting a high degree of specificity for V.

Figure 3

Comparison of results using data from the Metabolic Risk Complications of Obesity Genes Study (MRC-OB). Panel a presents VIOLET, panel b presents MGA, and panel c presents CLA. The black dot represents the variant, rs39179, accounting for a 24% LOD drop.

Comparison of VIOLET with MGA and CLA

Neither MGA nor CLA identified any significant variant (Bonferroni-corrected P-value (P<0.000048) and adjusted LOD<0.5, respectively). Using MGA, 109 chromosome 7 variants exhibited nominal evidence of association (P≤0.05; Figure 3b); minimum P-value=0.00007. Using CLA, the mean percentage of LOD drop (±SD) was 0.008±0.014 with a range of 0–0.24 (Figure 3c). Interestingly, the SNP identified by VIOLET (rs39179) exhibited the largest percentage of LOD drop; the nominal P-value from the measured genotype approach was 0.04. When MGA P-values were ranked, 88 other SNPs showed stronger association than rs39179. When examining an unlinked chromosome 14 region, both VIOLET and CLA exhibited little evidence of an effect (CLA mean LOD drop=0.001±0.01). However, using MGA, 23 SNPs exhibited nominal association; no SNPs reached Bonferroni-corrected significance (P<0.00014, minimum P-value=0.0015).

Discussion

Identification of causal variants accounting for linkage has been difficult.26, 27, 28, 29, 30, 31, 32 This is a problem because failure to identify causal variants within linkage regions may impede gene discovery. Part of the difficulty may be related to the analytical strategy of using association-based methods to follow up linkage, which may miss variants with little evidence of association but substantial effects on the linkage signal. Thus, we propose a novel method, VIOLET, to examine the impact of a specific variant on linkage without any assumptions about association. VIOLET has a considerable advantage over MGA because only variants contributing to the linkage are identified. Additionally, VIOLET offers an advantage over CLA, as it provides a formal test statistic to evaluate the significance of variants that do not completely explain a linkage peak. This is accomplished by comparing two models whose only difference is the proportion of variation explained at a locus. We demonstrate that VIOLET identifies variants underlying linkage in a highly specific manner. As such, VIOLET may expedite causal variant discovery.

Using simulated data, VIOLET had higher power and lower type I error as compared with MGA and CLA. A major limitation of the simulated data was that a single causal SNP contributed to the linkage for a quantitative phenotype; as such, all methods performed well. Further, it is important to note that for both the simulated and the real data set, the variant identified had MAF<5%. Variants of lower frequency are of concern in association (including MGA) studies due to possible stratification. Thus, future studies should examine VIOLET’s performance in scenarios when multiple variants contribute to the linkage, when variation is non-additive,33 when the outcome is dichotomous,34 and when causal variants differ in frequency.

Using MRC-OB data, VIOLET was applied to a linkage peak on chromosome 7.16 Although this region has been densely genotyped, MGA yielded associations that explained little of the observed linkage.17, 18 Using VIOLET, a single SNP (rs39179) accounting for 24% of the LOD score was identified (but it did not reach the traditional CLA threshold). This SNP was only nominally associated in MGA (P=0.04) and was ranked eighty-ninth in the P-value ranking. However, causal variants do not always result in the highest ranking P-values.35 The problem with this scenario is that based on MGA results there are too many promising candidates to be experimentally validated; thus only the top ranking associations are likely to be examined for biological plausibility. Indeed, our research team had not considered rs39179 (minor allele frequency 2.6% in our cohort; present in 25 of the 258 families and explained 0.7% of the variation in triglycerides) a promising candidate and rather focused on other SNPs.17, 18 Our results suggest that either rs39179 or SNPs in strong linkage disequilibrium (LD) with rs39179 may be causal. Using SNAP,36 a single ungenotyped SNP (rs10276884) in strong LD with rs39179 was identified. This variant is in the promoter of DPP6 and predicted to change a SF2/ASF motif. Clearly, additional studies are required.

It may seem counterintuitive that linkage would not require association as there are examples of association and linkage overlapping.4, 6, 37, 38 However, for complex traits, linkage and association overlap may be the exception. Mouse strain-dependent variability supports such lack of overlap.39, 40, 41, 42, 43 Indeed, fibronectin defects cause cardiovascular malformations;44 but there is substantial phenotypic heterogeneity by strain.42, 45 Thus, even for severe genetic changes such as gene deletion, other loci may contribute to the phenotype. As complex traits are expected to be the combination of multiple genetic factors, lack of strong association is not unexpected. Indeed, most genome-wide association studies exhibit small effects.1, 2 However, VIOLET tests the impact of a variant on linkage in a highly specific manner and thus is optimally positioned to identify variants that contribute to the linkage regardless of association evidence.

Conclusion

In summary, we propose a novel method, VIOLET, to follow up linkage. This method differs from MGA and CLA because VIOLET measures the change in the estimate of the linkage effect when the SNP is included. Using real and simulated data, VIOLET is shown to be highly specific and to reduce false-negative findings when following up linkage.