In their recent Opinion article (Pitfalls of predicting complex traits from SNPs. Nature Rev. Genet. 14, 507–515 (2013))1, Wray and co-authors discuss prediction of complex traits using single-nucleotide polymorphisms (SNPs). We would like to elaborate on and qualify some of the topics they raise.

As stated by Wray and co-authors1, knowing the proportion of variance of a trait that is explained by regression on markers in the population (h2M) is relevant because, in principle, h2M represents the maximum prediction accuracy (R2TST) that would be achievable in testing (TST) data if marker effects were known2. Following one study3, Wray and co-authors1 suggest estimating h2M using a ratio of variance components that are inferred from a G-BLUP analysis (h2G-BLUP). However, the realized proportions of allele sharing at markers and at causal loci can be very different4 owing to, for example, imperfect marker–causal loci linkage disequilibrium (LD). Consequently, the marker-based model may largely misrepresent the data-generating process; this is exacerbated with unrelated individuals5. Under these conditions, it is not clear that the finite sample estimate of h2G-BLUP is an unbiased estimate of h2M (Ref. 5); consequently, it is not obvious that R2TST can achieve values equal to the finite sample estimate of h2G-BLUP. In a recent article5, we studied the R2TST of G-BLUP and its relationship with h2G-BLUP. We showed analytically that mis-specification of the training–testing (TRN–TST) genomic relationships (owing to, for example, imperfect marker–causal loci LD) can impose a large-sample upper bound on R2TST that is considerably lower than the finite sample estimate of h2G-BLUP. The same study5 also presents simulation scenarios with nominally unrelated individuals in which R2TST can be extremely low across situations with markedly different h2G-BLUP, suggesting a tenuous relationship between h2G-BLUP and R2TST, even with moderately large TRN samples.
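The point about allele sharing at markers versus causal loci can be illustrated with a small simulation. The sketch below is not the analysis of Ref. 5. It is a toy example under assumptions of our own: unrelated individuals, independent causal loci, and a hypothetical stand-in for LD in which each marker copies its causal locus except for a fraction `flip` of genotypes that are redrawn at random. The function names (`grm`, `markers_in_ld`, `offdiag_corr`) and all parameter values are illustrative.

```python
import numpy as np

def grm(M):
    """Genomic relationship matrix from column-standardized genotypes."""
    W = (M - M.mean(axis=0)) / M.std(axis=0)
    return W @ W.T / M.shape[1]

def markers_in_ld(causal, flip, rng):
    """Toy LD model (an assumption): copy each causal genotype, but
    redraw a fraction `flip` of the entries independently at random."""
    redraw = rng.random(causal.shape) < flip
    noise = rng.binomial(2, 0.5, size=causal.shape)
    return np.where(redraw, noise, causal).astype(float)

def offdiag_corr(A, B):
    """Correlation between the off-diagonal entries of two matrices."""
    iu = np.triu_indices_from(A, k=1)
    return float(np.corrcoef(A[iu], B[iu])[0, 1])

rng = np.random.default_rng(1)
n, q = 300, 200  # individuals, causal loci (one marker per locus)
causal = rng.binomial(2, 0.5, size=(n, q)).astype(float)

G_causal = grm(causal)
# Strong marker-causal LD: marker relationships track causal ones.
corr_tight = offdiag_corr(G_causal, grm(markers_in_ld(causal, 0.1, rng)))
# Weak marker-causal LD: marker relationships are nearly uninformative.
corr_loose = offdiag_corr(G_causal, grm(markers_in_ld(causal, 0.9, rng)))

print(f"strong LD: {corr_tight:.2f}, weak LD: {corr_loose:.2f}")
```

In this toy setting, the agreement between marker-based and causal relationships among unrelated individuals falls off sharply as marker–causal LD weakens, which is the kind of mis-specification of relationships discussed above.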

Assessment of prediction accuracy

In the models discussed by Wray and co-authors1, R2TST is expected to be zero when TRN and TST samples are statistically independent5. Therefore, we disagree with the statement “problems occur in the validation stage, when data are not fully independent from those in the discovery phase” (Ref. 1).

We agree with Wray and co-authors that estimates of R2TST can be biased if the TST sample is not representative of the population in which predictions will be used. However, we cannot reconcile this with the general advice to eliminate individuals from the TST sample on the basis of predetermined thresholds for SNP-based relationships. Each prediction problem has its own level of accuracy, and proper representation may or may not involve realized relationships above such thresholds.

Wray and co-authors1 discuss problems due to stratification in TST samples and, as practical advice, suggest including principal component covariates. One study6 shows that inclusion of principal components as fixed effects in a G-BLUP analysis3,7 leads to a procedure with undesirable statistical properties. The same study6 provides statistically sound methods to quantify the relative contribution of each marker-derived principal component to estimates of variances and predictions of genetic values. Because principal components are linear functions of genotypes, removing their effects will, by construction, remove genetic signal that is potentially captured by markers. In general, unless the underlying causes of the signals that are captured by a principal-component analysis can be unambiguously interpreted, it is not clear that 'correcting' for their effects will mitigate the problems emerging from having a non-representative TST sample.
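The statement that principal components are linear functions of genotypes, so that removing them removes marker-captured genetic signal, can be checked numerically. The following is a minimal sketch under assumptions of our own (simulated independent genotypes, normally distributed marker effects, hypothetical function names), not the method of Ref. 6. It computes principal-component scores from a centered genotype matrix and measures the share of the variance of a marker-based genetic value that is lost when those scores are regressed out.

```python
import numpy as np

def pc_scores(X, k):
    """Top-k principal-component scores of the centered genotype matrix.
    Each score is a linear combination of the centered genotypes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def fraction_removed(g, pcs):
    """Share of the variance of g that is explained by the PCs and is
    therefore removed when they are fitted as covariates."""
    gc = g - g.mean()
    coef, *_ = np.linalg.lstsq(pcs, gc, rcond=None)
    fitted = pcs @ coef
    return float(fitted @ fitted / (gc @ gc))

rng = np.random.default_rng(0)
n, p = 200, 50                        # individuals, markers (illustrative)
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
beta = rng.normal(size=p)             # assumed marker effects
g = X @ beta                          # genetic value captured by markers

removed_5 = fraction_removed(g, pc_scores(X, 5))
removed_all = fraction_removed(g, pc_scores(X, p))

print(f"top 5 PCs remove {removed_5:.2f} of the genetic variance")
print(f"all {p} PCs remove {removed_all:.2f}")
```

Even with no population structure in this simulation, a handful of components removes a non-zero share of the genetic variance, and fitting all of them removes it entirely, because the genetic value lies in the space spanned by the genotypes.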