A commentary on Pitfalls of predicting complex traits from SNPs

In their recent Opinion article (Pitfalls of predicting complex traits from SNPs. Nature Rev. Genet. 14, 507–515 (2013))1, Wray and co-authors discuss prediction of complex traits using single-nucleotide polymorphisms (SNPs). We would like to further elaborate and qualify some topics.

As stated by Wray and co-authors1, knowing the proportion of variance of a trait that is explained by regression on markers in the population (h2M) is relevant because, in principle, h2M represents the maximum prediction accuracy (R2TST) that would be achievable in testing (TST) data if marker effects were known2. Following one study3, Wray and co-authors1 suggest estimating h2M using a ratio of variance components that are inferred from a G-BLUP analysis (h2G-BLUP). However, the realized proportions of allele sharing at markers and at causal loci can be very different4 owing to, for example, imperfect linkage disequilibrium (LD) between markers and causal loci. Consequently, the marker-based model may largely misrepresent the data-generating process; this is exacerbated with unrelated individuals5. Under these conditions, it is not clear that the finite-sample estimate of h2G-BLUP is an unbiased estimate of h2M (Ref. 5); consequently, it is not obvious that R2TST can achieve values equal to the finite-sample estimate of h2G-BLUP. In a recent article5, we studied the R2TST of G-BLUP and its relationship with h2G-BLUP. We show analytically that mis-specification of the training–testing (TRN–TST) genomic relationships (owing to, for example, imperfect marker–causal loci LD) can impose a large-sample upper bound on R2TST that is considerably lower than the finite-sample estimate of h2G-BLUP. The same study5 also presents simulation scenarios with nominally unrelated individuals, where R2TST can be extremely low in situations with markedly different h2G-BLUP, suggesting a tenuous relationship between h2G-BLUP and R2TST, even with moderately large TRN samples.
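The role of h2M as a ceiling on R2TST when marker effects are known, and the way imperfect marker–causal loci LD lowers that ceiling, can be illustrated with a small simulation. The following is a minimal sketch under strong simplifying assumptions that are not taken from the articles discussed here: loci are independent, genotypes are replaced by standardized Gaussian variables, and every marker has the same correlation `rho` with its causal locus. The function name and all parameter values are hypothetical. Under these assumptions the population value of h2M is `rho**2 * h2`. The sketch does not address the finite-sample behaviour of h2G-BLUP studied in Ref. 5.

```python
import numpy as np

def simulate_r2_known_effects(n=20000, p=50, h2=0.5, rho=0.7, seed=0):
    """Return (R2 in a testing sample, rho^2 * h2) for known marker effects.

    Assumptions (illustrative only): independent loci, Gaussian stand-ins
    for standardized genotypes, one marker per causal locus, common LD rho.
    """
    rng = np.random.default_rng(seed)

    # Causal loci and markers; each marker has correlation rho with its locus.
    Z = rng.standard_normal((n, p))
    W = rng.standard_normal((n, p))
    X = rho * Z + np.sqrt(1.0 - rho**2) * W

    # Causal effects scaled so that the genetic variance equals h2.
    b = rng.standard_normal(p)
    b *= np.sqrt(h2 / np.sum(b**2))

    # Phenotype with unit total variance.
    y = Z @ b + np.sqrt(1.0 - h2) * rng.standard_normal(n)

    # Population regression of y on markers: cov(x_j, y) / var(x_j) = rho * b_j.
    beta = rho * b
    pred = X @ beta

    r2_tst = np.corrcoef(pred, y)[0, 1] ** 2
    return r2_tst, rho**2 * h2

if __name__ == "__main__":
    r2_tst, h2_m = simulate_r2_known_effects()
    print(f"R2 in testing data: {r2_tst:.3f}; rho^2 * h2: {h2_m:.3f}")
```

With perfect LD (`rho=1`) the ceiling equals `h2`; lowering `rho` reduces it even though the marker effects are treated as known.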

Assessment of prediction accuracy

In the models discussed by Wray and co-authors1, R2TST is expected to be zero when TRN and TST samples are statistically independent5. Therefore, we disagree with the statement “problems occur in the validation stage, when data are not fully independent from those in the discovery phase” (Ref. 1).
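One way to see why is through the standard G-BLUP predictor, written here as a sketch in notation that is not used in the original articles: the phenotypes are assumed to be centred, G denotes the genomic relationship matrix partitioned by TRN and TST, and λ is the ratio of residual to genetic variance.

```latex
\hat{\mathbf{g}}_{\mathrm{TST}}
  = \mathbf{G}_{\mathrm{TST,TRN}}
    \left( \mathbf{G}_{\mathrm{TRN,TRN}} + \lambda \mathbf{I} \right)^{-1}
    \mathbf{y}_{\mathrm{TRN}},
\qquad
\lambda = \frac{\sigma^2_e}{\sigma^2_g}
```

In the idealized case in which all TRN–TST relationships are zero, G_{TST,TRN} = 0 and every predicted genetic value in the TST sample is zero, so the predictions carry no information about the TST phenotypes.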

We agree with Wray and co-authors that estimates of R2TST can be biased if the TST sample is not representative of the population in which predictions will be used. But we cannot reconcile this with the general advice of eliminating individuals in the TST sample based on predetermined thresholds for SNP-based relationships. Each prediction problem has its own level of accuracy, and proper representation may or may not involve realized relationships above such thresholds.

Wray and co-authors1 discuss problems due to stratification in TST samples and, as practical advice, suggest including principal component covariates. One study6 shows that inclusion of principal components as fixed effects in a G-BLUP analysis3,7 leads to a procedure with undesirable statistical properties. The same study6 provides statistically sound methods to quantify the relative contribution of each marker-derived principal component to estimates of variances and predictions of genetic values. Because principal components are linear functions of genotypes, removing their effects will, by construction, remove genetic signal that is potentially captured by markers. In general, unless the underlying causes of the signals that are captured by a principal-component analysis can be unambiguously interpreted, it is not clear that 'correcting' for their effects will mitigate the problems emerging from having a non-representative TST sample.
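The point that principal components are linear functions of genotypes, so that adjusting for them removes part of the marker-captured signal, can be checked numerically. The following is a minimal sketch with simulated data; the sample size, number of markers, allele frequencies and the function name are hypothetical choices, and genetic values are assumed to be exactly linear in the markers. It does not implement the methods of Ref. 6.

```python
import numpy as np

def fraction_removed_by_pcs(n=500, p=200, k=5, seed=1):
    """Fraction of the variance of marker-based genetic values that is
    removed when the top-k marker-derived principal components are
    regressed out. All settings are illustrative assumptions."""
    rng = np.random.default_rng(seed)

    # Simulated genotypes coded 0/1/2, then centred by marker.
    freq = rng.uniform(0.1, 0.9, size=p)
    X = rng.binomial(2, freq, size=(n, p)).astype(float)
    X -= X.mean(axis=0)

    # Genetic values that are linear in the markers.
    b = rng.standard_normal(p)
    g = X @ b

    # Principal-component scores are linear functions of genotypes:
    # scores = X @ V_k, with V_k the top-k right singular vectors of X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    pcs = X @ Vt[:k].T

    # Regress the genetic values on the top-k scores and take residuals.
    coef, *_ = np.linalg.lstsq(pcs, g, rcond=None)
    g_resid = g - pcs @ coef

    return 1.0 - g_resid.var() / g.var()

if __name__ == "__main__":
    print(f"Share of genetic variance removed: {fraction_removed_by_pcs():.3f}")
```

The returned share is positive whenever the genetic values are not orthogonal to the leading components, and it grows as more components are included.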

References

1. Wray, N. R. et al. Pitfalls of predicting complex traits from SNPs. Nature Rev. Genet. 14, 507–515 (2013).

2. Goddard, M. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136, 245–257 (2009).

3. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genet. 42, 565–569 (2010).

4. Hill, W. G. & Weir, B. S. Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet. Res. 93, 47–64 (2011).

5. de los Campos, G., Vazquez, A. I., Fernando, R., Klimentidis, Y. C. & Sorensen, D. Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genet. 9, e1003608 (2013).

6. Janss, L., de los Campos, G., Sheehan, N. & Sorensen, D. A. Inferences from genomic models in stratified populations. Genetics 192, 693–704 (2012).

7. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).

Acknowledgements

G.d.l.C. acknowledges financial support from the US National Institutes of Health grants R01GM099992 and R01GM101219.

Author information

Corresponding authors

Correspondence to Gustavo de los Campos or Daniel A. Sorensen.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Cite this article

de los Campos, G., Sorensen, D. A commentary on Pitfalls of predicting complex traits from SNPs. Nat Rev Genet 14, 894 (2013). https://doi.org/10.1038/nrg3457-c1
