A commentary on Pitfalls of predicting complex traits from SNPs

de los Campos, Gustavo; Sorensen, Daniel A.

doi:10.1038/nrg3457-c1

Correspondence
Published: 18 November 2013

Gustavo de los Campos¹ &
Daniel A. Sorensen²

Nature Reviews Genetics volume 14, page 894 (2013)

In their recent Opinion article (Pitfalls of predicting complex traits from SNPs. Nature Rev. Genet. 14, 507–515 (2013))¹, Wray and co-authors discuss prediction of complex traits using single-nucleotide polymorphisms (SNPs). We would like to further elaborate and qualify some topics.

As stated by Wray and co-authors¹, knowing the proportion of variance of a trait that is explained by regression on markers in the population (h²_M) is relevant because, in principle, h²_M represents the maximum prediction accuracy (R²_TST) that is achievable in testing (TST) data if marker effects were known². Following one study³, Wray and co-authors¹ suggest estimating h²_M using a ratio of variance components that are inferred from a G-BLUP analysis (h²_G-BLUP). However, the realized proportions of allele sharing at markers and at causal loci can be very different⁴ owing to, for example, imperfect marker–causal loci linkage disequilibrium (LD). Consequently, the marker-based model may largely misrepresent the data-generating process; this problem is exacerbated in samples of unrelated individuals⁵. Under these conditions, it is not clear that the finite sample estimate of h²_G-BLUP is an unbiased estimate of h²_M (Ref. 5); consequently, it is not obvious that R²_TST can achieve values equal to the finite sample estimate of h²_G-BLUP. In a recent article⁵, we studied the R²_TST of G-BLUP and its relationship with h²_G-BLUP. We showed analytically that mis-specification of the training–testing (TRN–TST) genomic relationships (owing to, for example, imperfect marker–causal loci LD) can impose a large-sample upper bound on R²_TST that is considerably lower than the finite sample estimate of h²_G-BLUP. The same study⁵ also presents simulation scenarios with nominally unrelated individuals, in which R²_TST can be extremely low across situations with markedly different h²_G-BLUP, suggesting a tenuous relationship between h²_G-BLUP and R²_TST, even with moderately large TRN samples.
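The dependence of R²_TST on marker–causal loci LD can be illustrated with a toy simulation. The sketch below is not the simulation design of Ref. 5; it is a minimal illustration under assumptions of our own choosing: 600 unrelated individuals, 20 causal loci with allele frequency 0.5, a trait heritability of 0.8 that is treated as known rather than estimated, and only the two extreme cases of perfect LD (markers are the causal loci) and zero LD (markers are drawn independently). The function names `genomic_relationships` and `gblup_r2` are hypothetical helpers.

```python
import numpy as np

def genomic_relationships(X):
    """Marker-based relationship matrix G = ZZ'/p from centered, scaled genotypes."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z @ Z.T / X.shape[1]

def gblup_r2(G, y, trn, tst, h2):
    """Squared correlation between G-BLUP predictions and TST phenotypes."""
    lam = (1.0 - h2) / h2                          # residual-to-genetic variance ratio
    y_trn = y[trn] - y[trn].mean()                 # remove the TRN mean
    A = G[np.ix_(trn, trn)] + lam * np.eye(len(trn))
    u_tst = G[np.ix_(tst, trn)] @ np.linalg.solve(A, y_trn)
    return np.corrcoef(u_tst, y[tst])[0, 1] ** 2

rng = np.random.default_rng(1)
n, p, h2 = 600, 20, 0.8                            # assumed sizes, for illustration only
trn, tst = np.arange(400), np.arange(400, 600)

causal = rng.binomial(2, 0.5, size=(n, p)).astype(float)    # genotypes at causal loci
unlinked = rng.binomial(2, 0.5, size=(n, p)).astype(float)  # markers in zero LD with them

g = causal @ rng.normal(size=p)                    # genetic values
g = (g - g.mean()) / g.std() * np.sqrt(h2)         # scale to variance h2
y = g + rng.normal(scale=np.sqrt(1.0 - h2), size=n)

r2_perfect_ld = gblup_r2(genomic_relationships(causal), y, trn, tst, h2)
r2_zero_ld = gblup_r2(genomic_relationships(unlinked), y, trn, tst, h2)
print(f"R2_TST, markers are the causal loci:    {r2_perfect_ld:.3f}")
print(f"R2_TST, markers unlinked to causal loci: {r2_zero_ld:.3f}")
```

The trait and its heritability are identical in both runs; only the relationship matrix supplied to the predictor changes. Intermediate LD levels would require a genotype model that this sketch deliberately leaves out.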

Assessment of prediction accuracy

In the models discussed by Wray and co-authors¹, R²_TST is expected to be zero when TRN and TST samples are statistically independent⁵. Therefore, we disagree with the statement “problems occur in the validation stage, when data are not fully independent from those in the discovery phase” (Ref. 1).
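One way to see this is through the usual G-BLUP predictor. The notation below is introduced here for illustration and does not appear in the original exchange: we assume the model y = μ1 + u + e with u ~ N(0, Gσ²_u) and e ~ N(0, Iσ²_e), and partition G into TRN and TST blocks.

```latex
\hat{\mathbf{u}}_{\mathrm{TST}}
  = \mathbf{G}_{\mathrm{TST,TRN}}
    \left(\mathbf{G}_{\mathrm{TRN,TRN}} + \lambda \mathbf{I}\right)^{-1}
    \left(\mathbf{y}_{\mathrm{TRN}} - \hat{\mu}\,\mathbf{1}\right),
\qquad
\lambda = \frac{\sigma^2_e}{\sigma^2_u},
\qquad
\mathbf{G}_{\mathrm{TST,TRN}} = \mathbf{0}
\;\Longrightarrow\;
\hat{\mathbf{u}}_{\mathrm{TST}} = \mathbf{0}.
```

Under this Gaussian model, independence of the TRN and TST samples corresponds to a zero TRN–TST covariance block, in which case the predictions cannot covary with the TST phenotypes. All predictive ability therefore comes from dependence between the two samples.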

We agree with Wray and co-authors that estimates of R²_TST can be biased if the TST sample is not representative of the population in which predictions will be used. But we cannot reconcile this with the general advice of eliminating individuals in the TST sample based on predetermined thresholds for SNP-based relationships. Each prediction problem has its own level of accuracy, and proper representation may or may not involve realized relationships above such thresholds.

Wray and co-authors¹ discuss problems due to stratification in TST samples and, as practical advice, suggest including principal component covariates. One study⁶ shows that inclusion of principal components as fixed effects in a G-BLUP analysis³,⁷ leads to a procedure with undesirable statistical properties. The same study⁶ provides statistically sound methods to quantify the relative contribution of each marker-derived principal component to estimates of variances and predictions of genetic values. Because principal components are linear functions of genotypes, removing their effects will, by construction, remove genetic signal that is potentially captured by markers. In general, unless the underlying causes of the signals that are captured by a principal-component analysis can be unambiguously interpreted, it is not clear that 'correcting' for their effects will mitigate the problems emerging from having a non-representative TST sample.
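The point that principal components are linear functions of genotypes can be checked numerically. The following is a minimal sketch, not the method of Ref. 6; the sample size, number of markers, allele frequency and marker effects are assumptions chosen only for illustration, and `variance_removed_by_pcs` is a hypothetical helper. It computes the share of the variance of marker-based genetic values that disappears once the top k marker-derived principal components are regressed out.

```python
import numpy as np

def variance_removed_by_pcs(X, g, k):
    """Share of var(g) removed by regressing g on the top-k marker PCs."""
    Xc = X - X.mean(axis=0)                        # center each marker
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = U[:, :k] * s[:k]                         # PC scores
    # Each PC score is a linear combination of the centered genotypes.
    assert np.allclose(pcs, Xc @ Vt[:k].T)
    gc = g - g.mean()
    # Regressing on the PC scores equals projecting onto their span,
    # and U[:, :k] is an orthonormal basis of that span.
    resid = gc - U[:, :k] @ (U[:, :k].T @ gc)
    return 1.0 - resid.var() / gc.var()

rng = np.random.default_rng(7)
n, p = 200, 500                                    # assumed sizes, for illustration only
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
g = X @ rng.normal(size=p)                         # genetic values captured by markers

fractions = {k: variance_removed_by_pcs(X, g, k) for k in (5, 50, n - 1)}
for k, frac in fractions.items():
    print(f"top {k:3d} PCs remove {frac:.3f} of the marker-based genetic variance")
```

The removed share can only grow with k, and with all n - 1 components of the centered marker matrix nothing of the marker-based signal remains. In this unstructured toy sample the leading components remove little; how much they remove in a stratified sample depends on how the genetic signal aligns with the structure, which the sketch does not model.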

References

1. Wray, N. R. et al. Pitfalls of predicting complex traits from SNPs. Nature Rev. Genet. 14, 507–515 (2013).
2. Goddard, M. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136, 245–257 (2009).
3. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genet. 42, 565–569 (2010).
4. Hill, W. G. & Weir, B. S. Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet. Res. 93, 47–64 (2011).
5. de los Campos, G., Vazquez, A. I., Fernando, R., Klimentidis, Y. C. & Sorensen, D. Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genet. 9, e1003608 (2013).
6. Janss, L., de los Campos, G., Sheehan, N. & Sorensen, D. A. Inferences from genomic models in stratified populations. Genetics 192, 693–704 (2012).
7. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).

Acknowledgements

G.d.l.C. acknowledges financial support from the US National Institutes of Health grants R01GM099992 and R01GM101219.

Author information

Authors and Affiliations

Biostatistics Department, University of Alabama at Birmingham, Birmingham, 35294, Alabama, USA
Gustavo de los Campos
Department of Molecular Biology and Genetics, Faculty of Science and Technology, Aarhus University, PB 50, Tjele, DK-8830, Denmark
Daniel A. Sorensen

Corresponding authors

Correspondence to Gustavo de los Campos or Daniel A. Sorensen.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

de los Campos, G., Sorensen, D. A commentary on Pitfalls of predicting complex traits from SNPs. Nat Rev Genet 14, 894 (2013). https://doi.org/10.1038/nrg3457-c1

Published: 18 November 2013
Issue Date: December 2013
DOI: https://doi.org/10.1038/nrg3457-c1

This article is cited by

Understanding the potential bias of variance components estimators when using genomic models
- Beatriz C. D. Cuyabano
- A. Christian Sørensen
- Peter Sørensen
Genetics Selection Evolution (2018)
Increased prediction accuracy using a genomic feature model including prior information on quantitative trait locus regions in purebred Danish Duroc pigs
- Pernille Sarup
- Just Jensen
- Peter Sørensen
BMC Genetics (2016)
Author reply to A commentary on Pitfalls of predicting complex traits from SNPs
- Naomi R. Wray
- Jian Yang
- Peter M. Visscher
Nature Reviews Genetics (2013)
