Introduction

In contrast to Mendelian traits, for which the association with related penetrant mutations is patent, quantitative traits show a continuous range of values and smaller effect sizes of genetic variants. Thus, to identify genetic factors involved in quantitative traits, larger sample sizes and more refined statistical analyses are required.

Population studies with multiple visits and quantitative trait measurements a priori offer the possibility to increase power and determine the trajectory of trait values in relation to disease or other outcomes. Methods that take all the measurements into account can increase statistical power of the genome-wide association studies (GWAS) analyses that dominate current discovery efforts. Similar benefits of using multiple measurements have been shown in analyses of expression profiling on microarrays1, 2, 3, 4 and more recently in studies of blood pressure.5 However, for most existing population cohorts, additional variability is introduced by different numbers of visits for individuals and by possible secular drift. To optimally model this type of data, we propose a shrinkage method that efficiently combines observations from different measurements, even when some visits are missing for some individuals.

The strength of shrinkage estimators compared with frequentist approaches has been clearly described in classical literature6, 7 and more recently in GWAS and related areas such as imputation, fine mapping and meta-analysis.8 Our method implements an empirical Bayes algorithm, ‘Shrunken Average (SHAVE)’, using regression toward the mean for every individual as a function of their number of visits and the correlation between visits. We evaluated the performance of the method by simulations and confirmed the expectations in real data.

We used the SardiNIA cohort (http://sardinia.nia.nih.gov)9 consisting of >6000 individuals and a set of 14 traits that were measured up to three times in each individual, at 3-year intervals. To evaluate the impact of the method, we selected top SNPs from single visits and meta-analysis studies, and compared the significance of the same SNPs for single visits, for the average across visits and for SHAVE. Variable but appreciable improvement in performance was found.

SHAVE is fast and easy to run, and can thus be added to approaches such as principal component and variance analysis. Finally, we suggest a way to estimate the cost-benefit of adding additional visits for GWAS signals and discuss the potential utility of SHAVE for other applications. The R code for SHAVE is available (http://sardinia.nia.nih.gov/Download/).

Methods

We briefly outline how we test for the association of a single measurement of a trait with a given SNP, and then generalize to multiple replicates. Consider a given quantitative trait and a given SNP. Let yij (i=1,..., n; j=1,…, ki) denote individual i’s jth repeated measure of the trait, or his/her residual for that trait after adjusting for one or more variables (eg, sex and age). Let Gi denote the number of minor alleles of the given SNP for individual i. Let G be the vector (G1, G2,..., Gn) containing the number of minor alleles (0, 1 or 2) for all individuals. Similarly, let Y1 be the vector (y11, y21,…, yn1), containing the first measurement of the trait for all individuals. To test the association between Y1 and G using an additive model, Y1 is regressed on G such that yi1=β1Gi+α+ei1, and an estimate for β1 is obtained (β*1). We then divide β*1 by its standard error to obtain a z-statistic, and from each z-statistic we obtain the corresponding P-value. This is done separately for every SNP and every trait.
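The single-visit test described above can be sketched in a few lines. This is a minimal illustration for unrelated individuals, not the variance-component analysis used for the real data; the function name `snp_association` and the normal approximation for the P-value are assumptions made for this sketch.

```python
import math

def snp_association(y, g):
    """Regress trait values y on minor-allele counts g (additive model).

    Returns (slope, z, two-sided P). The P-value uses a normal approximation
    to the z-statistic, which is an assumption of this sketch.
    """
    n = len(y)
    g_bar = sum(g) / n
    y_bar = sum(y) / n
    sxx = sum((gi - g_bar) ** 2 for gi in g)
    sxy = sum((gi - g_bar) * (yi - y_bar) for gi, yi in zip(g, y))
    beta = sxy / sxx                      # least-squares slope
    alpha = y_bar - beta * g_bar          # intercept
    rss = sum((yi - alpha - beta * gi) ** 2 for gi, yi in zip(g, y))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return beta, z, p

# Hypothetical toy data: six individuals with 0, 1 or 2 minor alleles.
beta, z, p = snp_association([0.1, 0.9, 2.2, 0.0, 1.1, 1.8], [0, 1, 2, 0, 1, 2])
```

The same function can be applied unchanged to any per-individual summary of the trait, which is how the multi-visit metrics below are tested.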

SHAVE: the posterior expectation μ*i

Now consider the following random-intercept model:

$$y_{ij} \mid \mu_i \sim N\!\left(\mu_i, \frac{\sigma^2}{w}\right), \qquad \mu_i \sim N\!\left(0, \sigma^2\right), \qquad (1)$$

where w≥0 and σ²>0 are unknown parameters, which can be easily estimated from our dataset. Straightforward algebra, such as that presented in an elementary textbook on Bayesian statistics (eg, Lee10), gives the posterior expectation of μi when σ² and w are known. Thus, E(μi | yi1,…, yiki) = ki w ȳi/(1+ki w), where ki is the number of visits and ȳi is the average across visits for individual i. For the proof, please see Supplementary Materials Section 1. Let μ*i denote the posterior expectation E(μi | yi1,…, yiki), then

$$\mu_i^{*} = \frac{k_i w}{1 + k_i w}\,\bar{y}_i. \qquad (2)$$

We call μ*i the SHAVE estimator for multiple visits. Not all individuals have the same number of visits: if individual i did not have visit j, the measurement yij is set to missing, and ȳi is the average of the non-missing values. The number of visits ki varies between individuals mainly because of missing data, which we assume to be missing completely at random, that is, unrelated to age, sex, …, and the missing value itself.

The term ki w/(1+ki w) will be referred to as the adjustment factor of the average, which is equal to one minus the shrinkage factor. We note that μ*i does not depend on σ² and is a function only of ki, w and ȳi. Next, μ*i is regressed on G such that μ*i=βGi+α+ei, and the statistical significance of β is calculated.
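A minimal sketch of the shrinkage step follows, assuming the SHAVE estimator takes the form μ*i = ki w ȳi/(1+ki w), which is consistent with the constant factor kw/(1+kw) quoted for balanced data. The function name `shave`, the use of `None` for a missing visit and the toy values are assumptions for illustration only.

```python
def shave(visits, w):
    """Shrunken average per individual.

    visits: one list per individual; None marks a missing visit.
    w: weight (assumed known here; in practice replaced by its estimate).
    Assumes every individual has at least one non-missing visit.
    """
    estimates = []
    for row in visits:
        observed = [v for v in row if v is not None]
        k = len(observed)                    # number of visits actually made
        y_bar = sum(observed) / k            # Average over non-missing visits
        adjustment = k * w / (1 + k * w)     # one minus the shrinkage factor
        estimates.append(adjustment * y_bar)
    return estimates

# Hypothetical example: same Average (1.2), different numbers of visits.
values = shave([[1.2, None, None], [1.2, 1.2, 1.2]], w=1.0)
```

With w=1, the individual seen once is shrunk halfway toward zero, whereas the individual seen three times retains three quarters of the Average, so better-measured individuals carry more weight in the subsequent regression.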

Estimating w and σ2

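Let n be the total number of individuals and ki be the number of visits for individual i. Given that equation 1 implies Var(yij|μi)=σ²/w and Var(yij)=σ²/w+σ², both quantities can be respectively estimated by the within-individual sample variance s²within and the total sample variance s²total, and setting s²within=σ²/w and s²total=σ²/w+σ² yields the estimate of w. Although the estimation of σ² is not needed for μ*i, σ² can be estimated by s²total−s²within. Thus, the weight estimate is given by:

$$\hat{w} = \frac{s^2_{\mathrm{total}} - s^2_{\mathrm{within}}}{s^2_{\mathrm{within}}}. \qquad (3)$$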

Therefore, in equation 2 the term w, which is unknown, should be replaced by its estimate ŵ. When all individuals have exactly two visits, ŵ is equal to ρ/(1−ρ), where ρ is the sample correlation between the two visits, and ŵ also minimizes the least squares loss function L2. Another possibility is to use the more robust loss function L1 and to estimate the w that minimizes it. Although estimating w by L1 and L2 gives different results, the two weights were similar for most traits in the SardiNIA data, and their corresponding z-statistics for SHAVE were extremely similar for all traits. For more details see Supplementary Materials Section 2 and Table S2.
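The moment-matching idea can be sketched as follows, assuming ŵ = (s²total − s²within)/s²within. The exact variance estimators used by the authors are not stated in this section, so the pooled within-individual variance, the overall sample variance and the truncation at zero (to respect w≥0) are assumptions of this sketch, as is the function name `estimate_w`.

```python
def estimate_w(visits):
    """Moment estimate of the weight w from repeated measures.

    visits: one list per individual; None marks a missing visit.
    Requires at least one individual with two or more visits.
    """
    observed = [[v for v in row if v is not None] for row in visits]
    values = [v for row in observed for v in row]
    grand_mean = sum(values) / len(values)

    # Total variance: all measurements around the grand mean.
    s2_total = sum((v - grand_mean) ** 2 for v in values) / (len(values) - 1)

    # Within-individual variance: pooled over individuals with >= 2 visits.
    ss_within = sum((v - sum(row) / len(row)) ** 2 for row in observed for v in row)
    df_within = sum(len(row) - 1 for row in observed)
    s2_within = ss_within / df_within

    # s2_within estimates sigma^2/w and s2_total estimates sigma^2/w + sigma^2.
    return max(0.0, (s2_total - s2_within) / s2_within)

# Hypothetical toy data: two individuals with two visits each.
w_hat = estimate_w([[1.0, 3.0], [-1.0, -3.0]])
```

In this toy example the total variance is 20/3 and the within-individual variance is 2, so the estimate is 7/3.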

Comparing different metrics – LOD ratio

Comparisons were done in GWAS using three summary trait values from multiple visits: single visit, Average and SHAVE.

Note: To distinguish between the statistical term ‘average’ and the actual ‘Average’ among visits, we use the latter throughout this paper. To assess performance among different metrics, for a given trait and a given SNP, we run an association test for each visit, the Average and SHAVE, obtaining a corresponding slope and z-statistic and calculating the corresponding z². The LOD score, one of the outputs from the Merlin11 software, is defined as z²/log(100), where log denotes the natural logarithm, and was chosen as a performance measure because the expected LOD score (or equivalently z²) of a true association is approximately proportional to the sample size. For example, assume a true association between a trait and a specific SNP. If a sample of 2000 individuals yields a given z², then doubling the sample size to 4000 individuals would be expected to double z² as well. Next, we describe three common scenarios and provide an expected LOD ratio between different metrics, with z1, zAVG and zSHAVE as the corresponding z-statistics for single visit, Average and SHAVE.
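As a small illustration of the performance measure, the conversion from a z-statistic to a LOD score and the ratio between two metrics can be written as below. The function names are hypothetical; because the constant log(100) cancels, the LOD ratio is simply the ratio of the squared z-statistics.

```python
import math

def lod_score(z):
    """LOD score from a z-statistic: z^2 / ln(100)."""
    return z * z / math.log(100)

def lod_ratio(z_a, z_b):
    """Ratio of the LOD scores of two metrics tested on the same SNP."""
    return lod_score(z_a) / lod_score(z_b)

# Hypothetical z-statistics for the Average and a single visit.
ratio = lod_ratio(6.0, 5.0)
```

A ratio above one can be read as the factor by which the sample size of the second metric would need to grow to match the first.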

Average vs single visit

We start by considering that all individuals have the same number of visits (equation 4) – that is, from a balanced dataset – and we then account for a situation in which there are different numbers of visits for individuals (unbalanced dataset) (equation 5).

SHAVE vs Average

Next we assume an unbalanced dataset for the LOD ratio between SHAVE and Average (equation 6).

Proofs for equations 4, 5 and 6 can be found in Supplementary Materials Section 3. We point out a salient fact: if every individual has the same number of visits k, then by equation 2, SHAVE is the Average multiplied by a constant factor kw/(1+kw), which implies that z²SHAVE is identical to z²AVG and hence that SHAVE has the same power as the Average. This equality in power between SHAVE and Average in balanced datasets is also consistent with equation 6, where replacing ki by k results in a ratio equal to one.
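As a hedged sketch of where such ratios come from, the following is derived here directly from the random-intercept model and is not necessarily in the form given in the Supplementary Materials. It assumes a small SNP effect, so that for any summary trait T the expected z² is approximately n·Cov(T,G)²/(Var(T)·Var(G)), and that missingness is unrelated to genotype. The symbols ρ = w/(1+w) (the correlation between two visits) and ci = ki w/(1+ki w) (the adjustment factor) are shorthand introduced for this sketch.

```latex
% Balanced data, k visits per individual (Average vs single visit):
\frac{E[z^2_{\mathrm{AVG}}]}{E[z^2_{1}]}
  \approx \frac{\operatorname{Var}(y_{ij})}{\operatorname{Var}(\bar{y}_i)}
  = \frac{\sigma^2\,(1 + 1/w)}{\sigma^2\,\bigl(1 + 1/(kw)\bigr)}
  = \frac{k(1+w)}{1+kw}
  = \frac{k}{1+(k-1)\rho}

% Unbalanced data, k_i visits for individual i (Average vs single visit):
\frac{E[z^2_{\mathrm{AVG}}]}{E[z^2_{1}]}
  \approx \frac{1 + 1/w}{1 + \dfrac{1}{nw}\sum_{i=1}^{n} \dfrac{1}{k_i}}

% Unbalanced data (SHAVE vs Average), using
% Var(\bar{y}_i) = \sigma^2 / c_i  and  Var(\mu_i^*) = c_i^2 \sigma^2 / c_i = c_i \sigma^2:
\frac{E[z^2_{\mathrm{SHAVE}}]}{E[z^2_{\mathrm{AVG}}]}
  \approx \Bigl(\frac{1}{n}\sum_{i=1}^{n} c_i\Bigr)
          \Bigl(\frac{1}{n}\sum_{i=1}^{n} \frac{1}{c_i}\Bigr) \;\ge\; 1
```

The last product equals one when all ki are equal and exceeds one otherwise (by the Cauchy–Schwarz inequality), in agreement with the remark that SHAVE and the Average have the same power in balanced datasets.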

Simulation study

Simulated unbalanced datasets were generated with 5000 individuals, with 2500 individuals with three visits and the remaining 2500 with a single visit. We conducted two types of simulations, one to estimate Type I error and the other to estimate power. In each type of simulation we compared SHAVE, Average and single visits.

Simulation models

Because the z-statistics of all the metrics described do not depend on σ², we set σ² equal to one. Next, we describe two simulation models, where model 1 is used to measure Type I error and model 2 is used to measure power.

Model 1, for β=0: yij=μi+eij, where μi~N(0,1), eij~N(0,1/w) and eij is independent from μi. In this model we can see that Var(yij)=1+1/w.

Model 2, for β≠0: yij=μi+eij, where μi=δi√(1−β²Var(G))+β(Gi−Ḡ), δi~N(0,1), eij~N(0,1/w), Gi is randomly generated based on the pre-defined allele frequency, Ḡ is the average number of alleles across all individuals, and eij is independent from μi. Since our original random-intercept model assumes that Var(yij) does not depend on the genotype, the term √(1−β²Var(G)) is introduced in order to have Var(δi√(1−β²Var(G))+β(Gi−Ḡ))=Var(μi)=1, which implies that Var(yij)=1+1/w in both models 1 and 2.
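The variance identity used in model 2 can be checked in one line. The sketch below assumes that δi is independent of Gi and has unit variance, and treats Ḡ as a constant.

```latex
\operatorname{Var}(\mu_i)
  = \operatorname{Var}\!\Bigl(\delta_i\sqrt{1-\beta^2\operatorname{Var}(G)}\Bigr)
    + \operatorname{Var}\!\bigl(\beta\,(G_i-\bar{G})\bigr)
  = \bigl(1-\beta^2\operatorname{Var}(G)\bigr)\cdot 1 + \beta^2\operatorname{Var}(G)
  = 1,
\qquad
\operatorname{Var}(y_{ij}) = \operatorname{Var}(\mu_i) + \operatorname{Var}(e_{ij}) = 1 + \frac{1}{w}.
```

The scaling factor therefore removes from the random part of μi exactly the variance that the SNP contributes, so that the total trait variance is the same under both models.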

Type I error simulations

We set α levels to 1 × 10⁻⁵, 1 × 10⁻⁶, 1 × 10⁻⁷ and 5 × 10⁻⁸. Ten billion simulations were performed to achieve an accurate Type I error estimation. Correlations between visits ρ were set equal to 0.2 or 0.5, and minor allele frequency P was set equal to 0.5. We then simulated yij values for all three visits and all individuals according to model 1 using the ‘true’ weight w=ρ/(1−ρ). Next, we randomly set as missing 50% of the values for visits 2 and 3, and Average was then calculated for every individual based on non-missing values. Next, we estimated the sample weight ŵ using equation 3 and generated SHAVE. Finally, we simulated the vector G based on the minor allele frequency P (0.5). After performing the simple linear regression between each metric and G, P-values were obtained. Next, for each metric we measured Type I error as the proportion (over 10¹⁰ simulations) of P-values smaller than each α level.

Power simulations

Simulations were conducted using an α level of 5 × 10⁻⁸, β values of 0.20, 0.25 and 0.30, minor allele frequencies P equal to 0.1 and 0.5, and correlations between visits ρ equal to 0.2 and 0.5. Simulated values were generated similarly to the Type I error simulations, with the main difference being that model 2 was used instead of model 1. One million simulations were performed for each combination of parameters (β, P and ρ), and for each combination we measured power as the proportion of P-values less than the 5 × 10⁻⁸ cutoff, now considered the standard threshold to declare genome-wide significance.
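A heavily scaled-down sketch of such a simulation is given below. It is not the authors' R code: it assumes Hardy–Weinberg genotype sampling with Var(G)=2P(1−P), uses the true weight w rather than its estimate, gives the first half of the individuals three visits and the rest one visit, and works with a z cutoff supplied by the caller (for example 1.96 instead of the genome-wide threshold) so that it runs in seconds. All function names are hypothetical.

```python
import math
import random

def z_stat(t, g):
    """z-statistic of the slope from regressing summary trait t on genotype g."""
    n = len(t)
    g_bar, t_bar = sum(g) / n, sum(t) / n
    sxx = sum((a - g_bar) ** 2 for a in g)
    sxy = sum((a - g_bar) * (b - t_bar) for a, b in zip(g, t))
    beta = sxy / sxx
    rss = sum((b - t_bar - beta * (a - g_bar)) ** 2 for a, b in zip(g, t))
    return beta / math.sqrt(rss / (n - 2) / sxx)

def one_replicate(n, w, beta, maf, rng):
    """Simulate one dataset (model 2; model 1 when beta=0) and test three metrics."""
    g = [int(rng.random() < maf) + int(rng.random() < maf) for _ in range(n)]
    g_bar = sum(g) / n
    scale = math.sqrt(1 - beta ** 2 * 2 * maf * (1 - maf))  # keeps Var(mu_i) = 1
    sd_e = math.sqrt(1 / w)
    single, average, shrunk = [], [], []
    for i, gi in enumerate(g):
        mu = rng.gauss(0, 1) * scale + beta * (gi - g_bar)
        k = 3 if i < n // 2 else 1                 # unbalanced design
        ys = [mu + rng.gauss(0, sd_e) for _ in range(k)]
        y_bar = sum(ys) / k
        single.append(ys[0])
        average.append(y_bar)
        shrunk.append(k * w / (1 + k * w) * y_bar)  # SHAVE with the true w
    return z_stat(single, g), z_stat(average, g), z_stat(shrunk, g)

def rejection_rates(n, w, beta, maf, z_cutoff, reps, seed=0):
    """Proportion of replicates with |z| > z_cutoff for (single, Average, SHAVE)."""
    rng = random.Random(seed)
    hits = [0, 0, 0]
    for _ in range(reps):
        for j, z in enumerate(one_replicate(n, w, beta, maf, rng)):
            hits[j] += abs(z) > z_cutoff
    return [h / reps for h in hits]

# rho = 0.2 corresponds to w = rho / (1 - rho) = 0.25.
rates = rejection_rates(n=400, w=0.25, beta=0.3, maf=0.5, z_cutoff=1.96, reps=200)
```

Setting `beta=0` turns the same loop into a Type I error estimate, since the rejection rate should then be close to the nominal level implied by the cutoff.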

Applying the method – SardiNIA dataset

The SardiNIA project was designed to investigate the genetics of quantitative traits in the Sardinian founder population.9 Over a 10-year period, from November 2001 to the present, residents of four towns in Sardinia, Italy, aged 14–95 years at enrollment, were invited to participate in the study, and a total of 6320 individuals had up to three visits at 3-year intervals. The total numbers of individuals in visits one, two and three were 6177, 5670 and 1971, respectively, where each individual could be present or absent in any of the visits. Individuals were characterized for >100 quantitative traits,9 and 14 traits were selected for this analysis (bilirubin, total cholesterol, γ-GT, glycemia, HDL, height, LDL, PR-interval, QT-interval, red blood cell counts (RBC), serum iron, transferrin, triglycerides and uric acid). Traits were selected based on previously reported meta-analysis studies (as of October 2011), and also showed top SNPs for visits 1 and 2 with P<5 × 10⁻⁸ in the SardiNIA dataset; the same top SNPs had minor allele frequency >5% and had previously been identified by the HapMap project (SNPs with ‘rs’ as the first two characters). Genotype information was obtained from the Metabochip, a custom Illumina iSELECT genotyping array (http://www.sph.umich.edu/csg/kang/MetaboChip).

To minimize the effect of outliers, we applied an inverse normal transformation for every trait in each visit.9 Transformed traits were used as the dependent variable and modeled using linear regression, with age at the time of visit and sex as covariates for each separate trait and for each visit. As a result, each trait measurement version (a given trait for a given visit) generated standardized residuals (mean equal to zero and SD equal to one) as the output. (This standardization step is needed in order to assume that noise levels are the same for each visit. However, GWAS results without standardization were very similar (not shown)).

Comparing performance of metrics

To measure the performance of metrics, the most significant SNP for each trait was selected based on three criteria: significance of the signal in visit 1, significance of the signal in visit 2, and significance in published meta-analyses.12, 13, 14, 15, 16, 17, 18, 19, 20, 21 Next, for each SNP we ranked the P-values among metrics and then obtained the average rank for each metric across all traits. As SNPs were selected based on reported meta-analyses (Table 4), but not all of those were present in the Metabochip, we used the SNAP algorithm22 to select the proxy SNP in the Metabochip that had the highest R² (≥0.80). As the SardiNIA project is a family-based study, to test for association while accounting for relatedness, we used a variance component method implemented in Merlin.11
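The ranking step can be sketched as follows. The function name, the data layout and the P-values are hypothetical and serve only to show the computation; ties are assumed not to occur.

```python
def average_ranks(pvalues):
    """Average rank of each metric across traits (rank 1 = smallest P-value).

    pvalues: {trait: {metric: P-value}}, with the same metrics for every trait.
    """
    totals = {}
    for per_metric in pvalues.values():
        ordered = sorted(per_metric, key=per_metric.get)  # most significant first
        for rank, metric in enumerate(ordered, start=1):
            totals[metric] = totals.get(metric, 0) + rank
    return {metric: total / len(pvalues) for metric, total in totals.items()}

# Hypothetical P-values for the top SNP of two traits.
example = {
    "trait A": {"visit 1": 1e-8, "visit 2": 1e-7, "Average": 1e-10, "SHAVE": 1e-11},
    "trait B": {"visit 1": 1e-9, "visit 2": 1e-12, "Average": 1e-10, "SHAVE": 1e-11},
}
ranks = average_ranks(example)
```

A lower average rank indicates that a metric tends to give the more significant result for the same SNP.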

Results

Simulation results – power and Type I error

Simulated Type I errors were very similar to the expected values (α), showing that Type I error is well controlled for all three metrics: single visit, Average of up to three visits and SHAVE of up to three visits (Supplementary Materials Table S1). We also observed a clear increase in power for SHAVE relative to the Average and to a single visit (Table 1). With α level, minor allele frequency P and effect size β, respectively, set to 5 × 10⁻⁸, 0.50 and 0.20, simulated power is shown as an increasing function of the correlation between visits (Figure 1 top). Similarly, with α, P and ρ set to 5 × 10⁻⁸, 0.50 and 0.20, simulated power is shown as an increasing function of the effect size β (Figure 1 bottom). In addition, simulated and expected power were very similar for all three methods. A detailed description of the calculation of expected power can be found in Supplementary Materials Section 4.

Table 1 Simulated and expected power for alpha equal to 5 × 10⁻⁸ and different levels of frequency P, slope β and correlation ρ
Figure 1

Simulated power by different levels of correlation between visits ρ (top) with effect size β fixed at 0.20, and simulated power by different levels of effect size β (bottom), with ρ fixed at 0.20. Power was simulated for single visit, Average and SHAVE. In both plots, alpha level was set to 5 × 10⁻⁸ and minor allele frequency at 0.5.

SardiNIA dataset – performance by ranking

To compare metrics, we use the average rank across 14 traits (Tables 2, 3, 4), where a lower average rank indicates higher overall significance. Using data for all three visits in SardiNIA and selecting the top SNP based on visit 1 (Table 2), the average ranks for visit 1, visit 2, Average and SHAVE were 3.36, 3.50, 2.07 and 1.07. Similarly, when selecting the top SNP based on visit 2 (Table 3), the corresponding average ranks were 3.64, 3.00, 2.21 and 1.14. When selecting the top meta-analysis SNPs (Table 4), the corresponding average ranks were 3.29, 3.57, 1.93 and 1.21. On the basis of these findings, the Average was superior to any single visit in all three tables, and SHAVE had the best performance, being less significant than the Average in only 2 of 42 cases (height and QT-interval in Table 4). An alternative way to compare performance by ranking is shown in Supplementary Materials Table S3.

Table 2 Association results between 14 traits and their corresponding top SNPs, where top SNPs were selected based on visit 1 results of SardiNIA GWAS, and where z-statistics for Average and SHAVE are based on three visits
Table 3 Association results between 14 traits and their corresponding top SNPs, where top SNPs were selected based on visit 2 results of SardiNIA GWAS, where z-statistics for Average and SHAVE are based on three visits
Table 4 Association results between 14 traits and their corresponding top SNPs, where top SNPs were selected based on multi-study meta-analyses, and z-statistics for Average and SHAVE are based on three visits and results of SardiNIA GWAS

SardiNIA dataset – performance by LOD ratios using top Meta-analysis SNPs

We computed two types of LOD ratios for every trait, the first between Average and single visit, the second between SHAVE and Average. To compare signals between Average and single visits, we selected a subset of individuals who had both visits 1 and 2 and compared their signals. We first obtained the z-statistics of the Average (zAVG) and the z-statistics corresponding to visits 1 and 2 (z1 and z2). Next, we obtained the LOD ratio between Average (represented by the square of zAVG) and a single visit (represented by the square of (z1+z2)/2). Observed LOD ratios were all above one, indicating an increase in power using the Average vs a single visit (Figure 2). We note that traits with the lowest correlation between visits had the highest LOD ratios, and in the three traits with the lowest correlation (transferrin, serum iron and QT-interval), LOD ratios were above 1.5. Similarly, traits with high correlation between visits, such as RBC and height, had LOD ratios close to one. In general, expected and observed LOD ratios were quite similar, suggesting that our observations match the expectations of the model.

Figure 2

Observed and expected LOD ratio for Average and single visit for top SNPs from meta-analysis, for a subset of individuals that had both visits 1 and 2. The observed LOD ratio is calculated based on the square of the z-statistic of the Average and the square of (z1+z2)/2 (from the z-statistics of visits 1 and 2). Traits on the x-axis are sorted by correlation between visits (in parentheses).

To compare LOD ratios between SHAVE and Average, we generated a subset of the SardiNIA dataset in which all individuals had visit 1, and then, for the same individuals, we randomly selected 50% of them and included their second visits (setting the remaining visit 2 cases as ‘missing values’). The main reason to look at this subset was that differences between SHAVE and Average are only appreciable in unbalanced datasets. Although the observed LOD ratios between SHAVE vs Average were modest when compared with Average vs single visit, the ratios were all greater than one, indicating a consistent increase in power of SHAVE relative to Average (Figure 3). Also, with the exception of transferrin, observed and expected LOD ratios were similar.

Figure 3

Observed and expected LOD ratio for SHAVE and Average for top SNPs from meta-analysis for a subset of individuals in which all individuals had visit 1 and a randomly chosen 50% of visit 2 cases were selected among the same individuals.

Expected LOD ratios in a hypothetical dataset

To get a better estimate of the expected LOD ratio between Average and single visits, and between SHAVE and Average, we generated charts based on a hypothetical dataset with multiple visits (from 2 to 10 visits). When comparing Average vs single visit, we show the expected LOD ratio as a function of the number of visits and the correlation between visits (Figure 4). The expected LOD ratios decrease as the correlation between visits increases. Similarly, expected LOD ratios increase as the number of visits increases, and saturate as the number of visits k becomes large, based on equation 4. When comparing SHAVE vs Average, we assumed a hypothetical dataset in which 50% of the individuals had a single visit and 50% of the individuals had k visits (from 2 to 10) (Supplementary Materials Figure S1). Here LOD ratios are more modest when compared with Figure 4, but still show the same relation to number of visits and correlation.
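Charts of this kind can be approximated with the sketch below. It assumes that, under the random-intercept model, the expected LOD ratio of Average vs single visit for a balanced dataset is k/(1+(k−1)ρ), and that the ratio of SHAVE vs Average is the product of the mean adjustment factor and the mean of its reciprocal. Both expressions are derived for this sketch from the model and may differ in form from the authors' equations 4 and 6; the function names are hypothetical.

```python
def average_vs_single(k, rho):
    """Expected LOD ratio, Average of k visits vs one visit (balanced data).

    rho: correlation between visits, 0 < rho < 1.
    """
    return k / (1 + (k - 1) * rho)

def shave_vs_average(visit_counts, rho):
    """Expected LOD ratio, SHAVE vs Average, for given per-individual visit counts."""
    w = rho / (1 - rho)
    c = [k * w / (1 + k * w) for k in visit_counts]  # adjustment factors
    n = len(c)
    return (sum(c) / n) * (sum(1 / x for x in c) / n)

# Average vs single visit for k = 2..10 at three correlations.
for rho in (0.2, 0.5, 0.8):
    print(rho, [round(average_vs_single(k, rho), 2) for k in range(2, 11)])

# SHAVE vs Average when half of the individuals have 1 visit and half have k.
for k in range(2, 11):
    print(k, round(shave_vs_average([1, k], 0.2), 3))
```

Under these assumptions the first ratio approaches 1/ρ as k grows, which is the saturation described above, and the second ratio is exactly one whenever all individuals have the same number of visits.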

Figure 4

Expected LOD ratio between Average and single visit for hypothetical datasets in which all individuals had k visits ranging from 2 to 10.

Discussion

Increasing the strength of a true genetic signal for a quantitative trait can provide overall benefits for GWAS, and we show here the extent to which measurements from multiple visits can contribute to that goal. In particular, when we compared the performance of SHAVE vs single visit and SHAVE vs Average using the SardiNIA dataset, some traits showed a large LOD ratio for their top SNPs, indicating that the same genome-wide significance can be achieved using a smaller sample with SHAVE. SHAVE increases power relative to the Average when the dataset is unbalanced (ie, individuals have different numbers of trait measurements). However, when a dataset is balanced, SHAVE and Average generate identical results. The increase in power for SHAVE was also supported by simulations, which showed both Type I error and power very close to those expected under the assumptions of the linear model.

Power increases with effect size (absolute value of the slope), number of visits and correlation between visits. Given the goal of maximizing the increase in power, when is SHAVE most useful? If power from a single visit is low — such that the top SNP is far from being genome-wide significant — then even the increase in power by SHAVE will not be sufficient for any SNP to achieve genome-wide significance. On the other hand, when a SNP shows marginal genome-wide significance in a single visit, the power boost from SHAVE may make a SNP genome-wide significant. Moreover, when a SNP is already genome-wide significant in a single visit, an increase in power by SHAVE will further improve genome-wide significance, providing additional confidence in the SNP effect.

A major assumption of the random-intercept model is that Var(yij | μi)=σ²/w (a combination of biological variability and measurement error) is identical for each visit. This might not always be the case if better technology is used to measure a trait in a more recent visit (reducing measurement error), or if better protocols are used (reducing biological variability). However, SHAVE can easily be adapted to such datasets, and one potential improvement could be to estimate a different weight wj for each visit j. In such instances SHAVE and the Average will not be equivalent even in balanced datasets, with SHAVE expected to outperform the Average. Another key assumption is that the true (unknown) variance within each individual is constant. If this assumption is violated, shrinkage distortions result. In our model we assume that this true variance within individual i, denoted by η²i, is equal to σ²/w. However, suppose that η²i is equal to σ²/wi, where wi is the unknown weight for individual i. If η²i>σ²/w, then w will be greater than his/her true weight wi, leading to ‘under-shrinkage’; similarly, if η²i<σ²/w, then w will be smaller than wi, leading to ‘over-shrinkage’. Thus, to minimize the effects of over-shrinkage and under-shrinkage, a potential improvement would be to estimate wi for each individual, in which case SHAVE will likely outperform the Average even in balanced datasets. Although preliminary results showed a very small increase in power (Supplementary Materials Table S4), there is still potential for improvement in datasets with more visits, in which case the estimate of wi will be more precise.

Our method uses a two-step model: in the first step we estimate w and use it to calculate SHAVE, and in the second step SHAVE is regressed on G to obtain the estimate of β and the corresponding z-statistic. One potential improvement would be to use a one-step model, in which w and β are estimated jointly. However, if the genetic variance of the top SNP of a trait, which is equal to β²Var(G), is small relative to σ², the expected increase in power will be negligible. Moreover, preliminary results from the one-step and two-step models were nearly identical (Supplementary Materials Table S5).

The derived relations for LOD ratios, of which the simplest and most practical is that for balanced datasets (equation 4), can be applied in cost-benefit analysis for signal improvement, in the usage of measuring devices and in experimental design. For example, suppose we are considering adding a visit for a trait, and that we have some preliminary GWAS results for a given SNP. Under the assumption that the signal is true, by estimating the sample correlation between visits, one could estimate the potential increase in significance for that SNP if an additional visit were obtained. This can provide guidance in planning research. Moreover, one can estimate the potential increase in significance for epidemiological studies and GWAS.
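As a hedged worked example of such a cost-benefit calculation, suppose that the expected LOD ratio of the Average of k visits vs a single visit in a balanced dataset is R(k)=k/(1+(k−1)ρ). This form follows from the random-intercept model but is an assumption of this sketch, as are the symbol R(k) and the illustrative values of ρ. The expected gain from adding one more visit is then:

```latex
\frac{R(k+1)}{R(k)} = \frac{(k+1)\,\bigl(1+(k-1)\rho\bigr)}{k\,(1+k\rho)}

% Going from k = 2 to k = 3 visits:
\rho = 0.5:\quad \frac{3 \times 1.5}{2 \times 2.0} = 1.125
\qquad
\rho = 0.2:\quad \frac{3 \times 1.2}{2 \times 1.4} \approx 1.29
```

Under these assumptions a third visit would be expected to raise the LOD score of a true signal by about 13% for a trait with correlation 0.5 between visits and by about 29% for a trait with correlation 0.2, which can be weighed against the cost of the additional visit.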

In summary, SHAVE takes advantage of multiple trait measurements to boost statistical power for GWAS of quantitative traits. Although the specific weighting scheme used in this paper is a simple version that is easy to implement even in large-scale GWAS, there are many additional ways to improve the method. The method can also be adapted for more complicated scenarios with unique trait characteristics. For example, traits such as pulse wave velocity23 show a trait variance that increases with age, in which case weights can be estimated as a function of age. Other traits such as systolic and diastolic blood pressure24 show a trait variance that increases with the magnitude of the measurement, in which case weights can be estimated as a function of the trait. Such new weighting schemes could potentially further increase the statistical power of genetic studies of quantitative traits.