Although genome-wide association studies have identified markers that are associated with various human traits and diseases, our ability to predict such phenotypes remains limited. A perhaps overlooked explanation lies in the limitations of the genetic models and statistical techniques commonly used in association studies. We propose that alternative approaches, which are largely borrowed from animal breeding, provide potential for advances. We review selected methods and discuss the challenges and opportunities ahead.
Guttmacher, A. E. & Collins, F. S. Genomic medicine — a primer. N. Engl. J. Med. 347, 1512–1520 (2002).
Dominiczak, A. F. & McBride, M. W. Genetics of common polygenic stroke. Nature Genet. 35, 116–117 (2003).
Maher, B. Personal genomes: the case of the missing heritability. Nature 456, 18–21 (2008).
Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
Hill, W. G., Goddard, M. E. & Visscher, P. M. Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet. 4, e1000008 (2008).
Lander, E. S. & Schork, N. J. Genetic dissection of complex traits. Science 265, 2037–2048 (1994).
Goddard, M. E. & Hayes, B. J. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nature Rev. Genet. 10, 381–391 (2009).
Falconer, D. S. & Mackay, T. F. C. Introduction to Quantitative Genetics 4th edn (Longman, Harlow, UK, 1996).
Hill, W. G. Understanding and using quantitative genetic variation. Philos. Trans. R. Soc. Lond. B 365, 73–85 (2010).
Fisher, R. The correlation between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinb. Earth Sci. 52, 399–433 (1918).
Wright, S. Systems of mating. Parts I.–V. Genetics 6, 111–178 (1921).
Henderson, C. R. Estimation of genetic parameters. Ann. Math. Stat. 21, 309–310 (1950).
Henderson, C. R. Best linear unbiased estimation and prediction under a selection model. Biometrics 31, 423–447 (1975).
Meuwissen, T. H., Hayes, B. J. & Goddard, M. E. Prediction of total genetic values using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
Habier, D., Fernando, R. L. & Dekkers, J. C. M. The impact of genetic relationships information on genome-assisted breeding values. Genetics 177, 2389–2397 (2007).
González-Recio, O. et al. Non-parametric methods for incorporating genomic information into genetic evaluations: an application to mortality in broilers. Genetics 178, 2305–2313 (2008).
VanRaden, P. M. et al. Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 92, 16–24 (2009).
Hayes, B. J., Bowman, P. J., Chamberlain, A. J. & Goddard, M. E. Genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92, 433–443 (2009).
de los Campos, G. et al. Predicting quantitative traits with regression models for dense molecular markers and pedigrees. Genetics 182, 375–385 (2009).
Weigel, K. A. et al. Predictive ability of direct genomic values for lifetime net merit of Holstein sires using selected subsets of single nucleotide polymorphism markers. J. Dairy Sci. 92, 5248–5257 (2009).
Vazquez, A. et al. Predictive ability of subsets of SNP with and without parent average in US Holsteins. J. Dairy Sci. 2010 (doi:10.3168/jds.2010-3335).
Hoerl, A. E. & Kennard, R. W. Ridge regression: biased estimation for non-orthogonal problems. Technometrics 12, 55–67 (1970).
Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Series B 58, 267–288 (1996).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B 67, 301–320 (2005).
Park, T. & Casella, G. The Bayesian LASSO. J. Am. Stat. Assoc. 103, 681–686 (2008).
Wahba, G. Spline Models for Observational Data (Society for Industrial and Applied Mathematics, Philadelphia, 1990).
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction 2nd edn (Springer-Verlag, New York, 2009).
Gianola, D., Fernando, R. L. & Stella, A. Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 173, 1761–1776 (2006).
Gianola, D. & van Kaam, J. B. Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 178, 2289–2303 (2008).
Kimeldorf, G. S. & Wahba, G. A correspondence between Bayesian estimation on stochastic process and smoothing by splines. Ann. Math. Stat. 41, 495–502 (1970).
de los Campos, G., Gianola, D. & Rosa, G. J. M. Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. J. Anim. Sci. 87, 1883–1887 (2009).
de los Campos, G., Gianola, D., Rosa, G. J. M., Weigel, K. & Crossa, J. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces regressions. Genetics Res. 92, 295–308 (2010).
Shawe-Taylor, J. & Cristianini, N. Kernel Methods for Pattern Analysis (Cambridge Univ. Press, UK, 2004).
Schaid, D. J. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum. Hered. 70, 109–131 (2010).
Garrick, D. J. The nature, scope and impact of some whole-genome analyses in beef cattle in 9th World Congress on Genetics Applied to Livestock (Leipzig, Germany, 2010).
Long, N. et al. Radial basis function regression methods for predicting quantitative traits using SNP markers. Genetics Res. 92, 209–225 (2010).
Crossa, J. et al. Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 2 Sep 2010 (doi:10.1534/genetics.110.118521).
Piepho, H. P. Ridge regression and extensions for genomewide selection in maize. Crop Sci. 49, 1165–1176 (2009).
Legarra, A., Robert-Granié, C., Manfredi, E. & Elsen, J. M. Performance of genomic selection in mice. Genetics 180, 611–618 (2008).
Jannink, J. L., Lorenz, A. J. & Iwata, H. Genomic selection in plant breeding: from theory to practice. Brief. Funct. Genomics 9, 166–177 (2010).
Goddard, M. E. Genomic selection: prediction of accuracy and maximization of long term response. Genetica 136, 245–257 (2009).
Zhong, S., Dekkers, J. C., Fernando, R. L. & Jannink, J. L. Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: a barley case study. Genetics 182, 355–364 (2009).
Gianola, D. Theory and analysis of threshold characters. J. Anim. Sci. 54, 1079–1096 (1982).
Holzapfel, C. et al. Genes and lifestyle factors in obesity: results from 12462 subjects from MONICA/KORA. Int. J. Obes. 1–8 (2010).
Seshadri, S. et al. Genome-wide analysis of genetic loci associated with Alzheimer disease. JAMA 303, 1832–1840 (2010).
Valenzuela, R. K. et al. Predicting phenotype from genotype: normal pigmentation. J. Forensic Sci. Soc. 55, 315–322 (2010).
Willer, C. J. et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nature Genet. 41, 25–34 (2008).
Zhao, J. et al. The role of obesity-associated loci identified in genome-wide association studies in the determination of pediatric BMI. Obesity 17, 2254–2257 (2009).
van Hoek, M. et al. Predicting type 2 diabetes based on polymorphisms from genome-wide association studies: a population-based study. Diabetes 57, 3122–3128 (2008).
Wray, N. R., Goddard, M. E. & Visscher, P. M. Prediction of individual genetic risk to diseases from genome-wide association studies. Genome Res. 17, 1520–1528 (2007).
Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genet. 42, 565–569 (2010).
Witten, D. M. & Tibshirani, R. Survival analysis with high-dimensional covariates. Stat. Methods Med. Res. 19, 29–51 (2010).
Box, G. E. P. & Draper, N. R. Empirical Model-Building and Response Surfaces (Wiley, New York, 1987).
Cockerham, C. C. An extension of the concept of partitioning hereditary variance for analysis of covariance among relatives when epistasis is present. Genetics 39, 859–882 (1954).
Kempthorne, O. The correlation between relatives in a random mating population. Proc. R. Soc. Lond. B 143, 103–113 (1954).
Lynch, M. & Ritland, K. Estimation of pairwise relatedness with molecular markers. Genetics 152, 1753–1766 (1999).
Eding, J. H. & Meuwissen, T. H. Marker based estimates of between and within population kinships for the conservation of genetic diversity. J. Anim. Breed. Genet. 118, 141–159 (2001).
Visscher, P. M. et al. Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2, e41 (2006).
Hayes, B. J. & Goddard, M. E. Prediction of breeding values using marker-derived relationship matrices. J. Anim. Sci. 86, 2089–2092 (2008).
Feng, R., McClure, L. A., Tiwari, H. K. & Howard, G. A new estimate of family disease history providing improved prediction of disease risks. Stat. Med. 28, 1269–1283 (2009).
We are grateful to K. Grimes, A. Vazquez, Y. Klimentidis and S. Cofield for their helpful comments on this paper.
Gustavo de los Campos has served as a consultant to CIMMYT and Aviagen; these organizations work with genomic-enabled prediction of genetic values for plant and poultry breeding, respectively. Daniel Gianola serves on the International Scientific Advisory Board of Aviagen. David Allison has received numerous grants, consulting fees and donations from non-profit and for-profit entities, some of which may have interests in the genomic prediction of phenotypes.
- Bayesian estimation
Bayesian inferences are based on the posterior distribution of the unknowns given the data. Following Bayes' rule, this distribution is proportional to the product of the distribution of the data given the unknowns and the prior distribution of the unknowns.
- Basis function
In regression analysis, basis functions are functions of predictors used to construct the regression. Polynomial, exponential and logarithmic functions are examples of basis functions commonly used for parametric regressions.
- Censored phenotype
Censoring occurs when, for some individuals, the phenotypic information consists of bounds but the actual phenotypic value is unknown. This is commonly observed in longevity studies when, at the time of analysis, some patients may still be alive.
- Genomic medicine
The use of genome information in the prevention, diagnosis and treatment of disorders.
- Goodness of fit
A measure of how well a model fits the data in a training sample. The log likelihood and R-squared statistic are commonly used measures of goodness of fit. The residual sum of squares is a commonly used measure of lack of fit.
- LASSO
The Least Absolute Shrinkage and Selection Operator (ref. 23) is a penalized estimation method commonly used in regression. The penalty function in LASSO is the sum of the absolute values of the regression coefficients. LASSO performs variable selection and shrinkage simultaneously.
- Objective function
The function whose value is minimized or maximized in an optimization problem.
- Ordinary least squares
The ordinary least squares estimates of parameters in a regression model are obtained by minimizing the residual sum of squares of the regression.
- Over-fitting
A term used to describe the situation in which a model fits the training data well but performs poorly when used to predict outcomes for subjects (testing data) that were not used to fit the model.
- Parametric regression model
A regression model in which the regression function is set to have a known functional form (for example, a polynomial).
- Penalized estimation
Penalized estimates are commonly used in situations in which the number of unknowns is large with respect to the number of records. Penalized estimates are obtained by solving an optimization problem whose objective function embeds a compromise between a goodness-of-fit measure and a measure of model complexity or penalty function.
- Quantitative genetic theory
Genetic, mathematical and statistical models used to study traits that are affected by a large number of genes.
- Regression model
A statistical model used to describe relationships (for example, a conditional mean) between a response variable and a set of predictors through a regression function involving some parameter(s) to be estimated from data.
- Semi-parametric regression model
A regression model in which the regression function is not assumed to be a member of a parametric family.
- Shrinkage estimation
In standard estimation methods (for example, maximum likelihood or ordinary least squares), estimates are obtained by optimizing a goodness-of-fit or lack-of-fit measure. Relative to these estimates, Bayesian and penalized estimates are shrunk towards some values (typically zero). This prevents over-fitting and, under certain conditions, may reduce the mean-squared error of estimates and predictions.
- Training data
The data set used to fit a model.
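Several of the glossary terms above (ordinary least squares, penalized estimation, shrinkage, over-fitting, training and testing data) can be illustrated together in a small simulation. The following is a minimal sketch, not a method taken from the article. It assumes NumPy is available, uses simulated normal predictors in place of real marker genotypes, and uses a ridge penalty (sum of squared coefficients, ref. 22) instead of the LASSO because ridge has a closed-form solution. The sample sizes, the penalty value `lam` and all function names are illustrative assumptions.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares: minimize the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def fit_ridge(X, y, lam):
    """Penalized estimate: minimize RSS + lam * (sum of squared coefficients)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def mse(X, y, beta):
    """Mean-squared prediction error of coefficients beta on the data (X, y)."""
    residuals = y - X @ beta
    return float(np.mean(residuals ** 2))


# Simulated data (assumption): many small effects, nearly as many predictors as records.
rng = np.random.default_rng(0)
n_train, n_test, p = 60, 200, 50
beta_true = rng.normal(0.0, 0.2, size=p)
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = X_train @ beta_true + rng.normal(size=n_train)
y_test = X_test @ beta_true + rng.normal(size=n_test)

lam = 20.0  # illustrative penalty; in practice it would be chosen by cross-validation
beta_ols = fit_ols(X_train, y_train)
beta_ridge = fit_ridge(X_train, y_train, lam)

for name, beta in [("OLS", beta_ols), ("ridge", beta_ridge)]:
    print(
        f"{name:5s} train MSE = {mse(X_train, y_train, beta):.3f}  "
        f"test MSE = {mse(X_test, y_test, beta):.3f}  "
        f"coefficient norm = {np.linalg.norm(beta):.3f}"
    )
```

By construction, OLS always attains the lowest training error, and the ridge coefficients always have a smaller norm than the OLS coefficients (shrinkage towards zero). Whether the shrunken estimates also predict better in the testing data depends on the data and on the penalty, which is why predictive ability is assessed on records that were not used to fit the model.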