Genetic association studies have yielded a wealth of biological discoveries. However, these studies have mostly analyzed one trait and one SNP at a time, thus failing to capture the underlying complexity of the data sets. Joint genotype-phenotype analyses of complex, high-dimensional data sets offer a promising way to move beyond simple genome-wide association studies (GWAS). The move to high-dimensional phenotypes will raise many new statistical problems. Here we address the central issue of missing phenotypes in studies with any level of relatedness between samples. We propose a multiple-phenotype mixed model and use a computationally efficient variational Bayesian algorithm to fit the model. On a variety of simulated and real data sets from a range of organisms and trait types, we show that our method outperforms existing state-of-the-art methods from the statistics and machine learning literature and can boost signals of association.
J.M. acknowledges support from the European Research Council (ERC; grant 617306). A.D. acknowledges support from Wellcome Trust grant 099680/Z/12/Z. This work was supported by Wellcome Trust grant 090532/Z/09/Z. A.K. acknowledges support from the Royal Society under the Industry Fellowship scheme.
A.K. is an employee of Aviagen, Ltd., a poultry breeding company that supplies broiler breeding stock worldwide. A.K. also holds an Industry Fellowship from the Royal Society and is based part time in the Roslin Institute.
Integrated supplementary information
Simulation results measuring imputation accuracy with mean-squared error (MSE) rather than correlation. Model 1: the scenario simulated using an empirical kinship matrix derived from the human NSPHS study. Model 2: the scenario simulated using 75 unrelated families of four siblings. Data sets were simulated at various levels of heritability (x axis) for the traits. Three hundred individuals and 15 traits were simulated. Five percent of phenotype values were set to missing before imputation. Seven different methods (legend) were applied to impute the missing values. The MSE between the imputed values and the true values is plotted on the y axis for each method. Perfect imputation has MSE = 0, and, because phenotypes are centered and standardized, imputing all entries to 0 has MSE = 1. Compared to Figure 1, which uses correlation as an imputation metric, the results do not qualitatively change.
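The two MSE reference points quoted in the caption above can be checked numerically. The following is a minimal sketch rather than the authors' code: the helper names `standardize` and `mse` are hypothetical, and the values are random draws standing in for one phenotype column.

```python
import random
import statistics

def standardize(values):
    """Center to mean 0 and scale to (population) variance 1."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

def mse(true_values, imputed_values):
    """Mean-squared error between true and imputed entries."""
    return statistics.fmean(
        (t - i) ** 2 for t, i in zip(true_values, imputed_values)
    )

rng = random.Random(0)
# Placeholder phenotype column (assumption: any numeric values would do here)
truth = standardize([rng.gauss(5.0, 2.0) for _ in range(1000)])

mse_perfect = mse(truth, truth)             # perfect imputation
mse_zero = mse(truth, [0.0] * len(truth))   # impute every entry to the mean, 0
print(mse_perfect, mse_zero)
```

Over a full standardized column the second value equals the variance, 1, up to rounding; over a random 5% subset of entries it would be close to 1 rather than exactly 1.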
Simulation results with opposing genetic and environmental correlations. Rather than an AR matrix, this plot chooses genetic correlation B to cancel the environmental correlation, B_pq = −E_pq for p ≠ q. Five percent of phenotype values were held out, and the correlation between the true and imputed values is plotted on the y axis for each method. The dotted lines show the results from Figure 1 for reference.
Simulation results using larger data sets. This figure uses (N, P) = (1,000, 50), while the dotted lines use (N, P) = (300, 15). Five percent of phenotype values were held out, and the correlation between the true and imputed values is plotted on the y axis for each method. Increasing the data size nearly always improves imputation accuracy, although this effect is attenuated when using the sibling relatedness matrix, as family sizes are fixed and increasing N does not increase the amount of between-sample correlation. The dotted lines show the results from Figure 1 for reference.
Simulation results varying the amount of genetic correlation. We vary the overall genetic correlation matrix B by changing ρ, the AR parameter. The top row shows simulations with ρ = 0.275, decreasing the average genetic correlation between traits compared to the dotted lines (from Figure 1) that use the baseline choice ρ = 0.45; the bottom row shows simulations with ρ = 0.675. Analogous results were obtained using ρ = –0.275 (data not shown). Five percent of phenotype values were held out, and the correlation between the true and imputed values is plotted on the y axis for each method. The imputation accuracy of multi-trait methods increases with genetic correlation, and this effect increases with h².
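The caption above does not spell out the form of the AR matrix. A common reading, assumed here, is the first-order autoregressive form in which entry (p, q) equals ρ raised to |p − q|. Under that assumption, the sketch below shows how the three values of ρ change the average strength of correlation between 15 traits. The function names are hypothetical.

```python
def ar1_matrix(n_traits, rho):
    """AR(1)-style correlation matrix: entry (p, q) equals rho ** |p - q|.

    Assumption: this is the usual AR(1) form; the caption only says "AR matrix".
    """
    return [[rho ** abs(p - q) for q in range(n_traits)]
            for p in range(n_traits)]

def mean_offdiag_abs(matrix):
    """Average absolute off-diagonal entry of a square matrix."""
    n = len(matrix)
    vals = [abs(matrix[p][q]) for p in range(n) for q in range(n) if p != q]
    return sum(vals) / len(vals)

# The three AR parameters named in the caption, with 15 traits
summary = {rho: mean_offdiag_abs(ar1_matrix(15, rho))
           for rho in (0.275, 0.45, 0.675)}
for rho, avg in summary.items():
    print(f"rho = {rho}: mean |off-diagonal| = {avg:.3f}")
```

Because the entries decay geometrically with |p − q|, raising ρ strengthens every off-diagonal entry, which matches the caption's description of ρ = 0.275 as lowering the average genetic correlation relative to ρ = 0.45.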
Simulation results at a higher level of missingness. Ten percent of phenotype values were set to missing before imputation, rather than 5% as for the dotted lines. The correlation between the imputed and true values is plotted on the y axis for each method. The dotted lines show the results from Figure 1 for reference.
Simulation results with non-ignorable missingness. We hold out 5% of the entries of the phenotype matrix with probability increasing in their values, and the correlation between the true and imputed values is plotted on the y axis for each method. The dotted lines show the results from Figure 1 for reference.
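One way to produce missingness of this kind is sketched below. The caption states only that the hold-out probability increases with the entry's value; the logistic weighting, the function name `hold_out_nonignorable`, and the toy data are assumptions made for illustration.

```python
import math
import random
import statistics

def hold_out_nonignorable(values, fraction=0.05, seed=0):
    """Return indices held out with probability increasing in the entry's value.

    Assumption: the probability is proportional to a logistic function of the
    value, scaled so that the expected held-out fraction equals `fraction`.
    """
    rng = random.Random(seed)
    weights = [1.0 / (1.0 + math.exp(-v)) for v in values]
    mean_w = statistics.fmean(weights)
    probs = [min(1.0, fraction * w / mean_w) for w in weights]
    return [i for i, p in enumerate(probs) if rng.random() < p]

rng = random.Random(1)
# Placeholder standardized phenotype entries
phenotypes = [rng.gauss(0.0, 1.0) for _ in range(20000)]

held_out = hold_out_nonignorable(phenotypes)
frac_missing = len(held_out) / len(phenotypes)
mean_missing = statistics.fmean(phenotypes[i] for i in held_out)
mean_all = statistics.fmean(phenotypes)
print(frac_missing, mean_missing, mean_all)
```

The held-out entries have a higher mean than the full data, so the observed entries are no longer a random sample of the phenotype matrix, which is what makes the missingness non-ignorable.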
Simulation results with confounding cryptic relatedness. The contribution of the additive genetic term U in a typical MPMM is a²; each row increases the contribution of the contaminating shared environment, c², to the overall heritability, here defined as h² = a² + c². The first row uses c² = 0.1a²; the second uses c² = 0.3a²; and the last uses c² = a². Five percent of phenotype values were held out, and the correlation between the true and imputed values is plotted on the y axis for each method. The dotted lines show the results from Figure 1 for reference.
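The definition h² = a² + c² together with a fixed ratio c²/a² determines both components. The short sketch below works through the three ratios in the caption; the total of 0.6 is an arbitrary illustrative value, not one taken from the study, and `split_heritability` is a hypothetical helper.

```python
def split_heritability(h2, ratio):
    """Split h2 = a2 + c2 into its parts, given c2 = ratio * a2."""
    a2 = h2 / (1.0 + ratio)   # from h2 = a2 * (1 + ratio)
    c2 = h2 - a2
    return a2, c2

# Ratios c2/a2 from the caption; h2 = 0.6 is an assumed example value
splits = {ratio: split_heritability(0.6, ratio) for ratio in (0.1, 0.3, 1.0)}
for ratio, (a2, c2) in splits.items():
    print(f"c2/a2 = {ratio}: a2 = {a2:.3f}, c2 = {c2:.3f}")
```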
Simulation results with non-normal noise. We exponentially transform the environmental contribution ɛ to create log-normal noise. The resulting phenotypes are imputed without (top) or with (bottom) quantile normalization. Five percent of phenotype values were held out, and the correlation between the true and imputed values is plotted on the y axis for each method. The dotted lines show the results from Figure 1 for reference.
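The transformation and its remedy can be illustrated as follows. This is a sketch under assumptions: the caption does not specify which form of quantile normalization was used, so a rank-based mapping to standard-normal quantiles is shown, and `quantile_normalize` is a hypothetical name.

```python
import math
import random
from statistics import NormalDist

def quantile_normalize(values):
    """Map values to standard-normal quantiles by rank (ties are not averaged).

    Assumption: the value at 0-based rank k out of n maps to the
    (k + 0.5) / n quantile of N(0, 1).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = normal.inv_cdf((rank + 0.5) / n)
    return out

rng = random.Random(0)
# Exponentiating normal draws gives log-normal noise, as in the caption
noise = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(1001)]
normalized = quantile_normalize(noise)
print(min(noise), max(noise), min(normalized), max(normalized))
```

The mapping preserves the ordering of the values while removing the heavy right tail, so methods that assume normally distributed noise see a better-behaved input.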
QQ plots from performing GWAS on 15 truly unassociated phenotypes with different imputation options (panel titles). Phenotypes are generated from our baseline simulation with the relevant K matrix. Rather than represent each of the 15 GWAS for each panel, we plot the point-wise minimum and maximum (dotted lines) and median (solid line) for the 15 lines. Top row, kinship and genotypes correspond to independent sets of four siblings. Bottom row, kinship and genotypes taken from the NSPHS study.
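The point-wise summary described above can be computed as in this sketch. The uniform random P values stand in for real null GWAS results, and the function name `qq_envelope` and the choice of expected quantiles, (i + 0.5)/m, are assumptions.

```python
import math
import random
import statistics

def qq_envelope(pvalue_sets):
    """Point-wise min, median and max of sorted -log10(p) across several GWAS.

    Every set must contain the same number of P values.
    """
    curves = [sorted(-math.log10(p) for p in pvals) for pvals in pvalue_sets]
    m = len(curves[0])
    # Expected -log10(p) under the null, in ascending order
    expected = sorted(-math.log10((i + 0.5) / m) for i in range(m))
    columns = list(zip(*curves))  # one tuple per rank, across GWAS
    lower = [min(col) for col in columns]
    middle = [statistics.median(col) for col in columns]
    upper = [max(col) for col in columns]
    return expected, lower, middle, upper

rng = random.Random(0)
# 15 null GWAS with 2,000 P values each, drawn from (0, 1] as placeholders
null_sets = [[1.0 - rng.random() for _ in range(2000)] for _ in range(15)]
expected, lower, middle, upper = qq_envelope(null_sets)
```

Plotting `middle` against `expected` as a solid line and `lower` and `upper` as dotted lines gives one panel in the style the caption describes, in place of 15 overlaid curves.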
Power to detect a simulated, causal SNP using a univariate mixed model (LMM). Five thousand samples, comprising independent sets of four siblings, have 15 simulated phenotypes with pleiotropy. Five percent of phenotypes are deleted, and an LMM is then run with GEMMA after dropping missing data (unimputed) or imputing with PHENIX. Power is calculated by averaging over 1,000 independently simulated data sets using the standard GWAS P-value threshold 5 × 10⁻⁷.
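Averaging over simulated data sets amounts to counting how often the causal SNP passes the threshold. A minimal sketch follows, with a hypothetical function name and toy P values rather than results from the study; the Monte Carlo standard error is an addition for illustration.

```python
import math

def empirical_power(pvalues, threshold=5e-7):
    """Fraction of simulated data sets whose causal-SNP P value is below
    the threshold, with the binomial Monte Carlo standard error."""
    n = len(pvalues)
    power = sum(1 for p in pvalues if p < threshold) / n
    std_err = math.sqrt(power * (1.0 - power) / n)
    return power, std_err

# Toy causal-SNP P values from four imaginary simulated data sets
toy_pvalues = [1e-9, 1e-3, 4e-7, 0.2]
power, std_err = empirical_power(toy_pvalues)
print(power, std_err)
```

With 1,000 simulated data sets, the standard error of an estimated power near 0.5 is about 0.016, which gives a sense of how finely two power curves can be distinguished.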
Power to detect a simulated, causal SNP using a multiple-phenotype mixed model (MPMM). Five thousand samples, comprising independent sets of four siblings, have 15 simulated phenotypes with three levels of pleiotropy (legend). Five percent of phenotypes are deleted, and an MPMM is then run with our method by dropping samples with any missing phenotype data (unimputed) or imputing with PHENIX. Power is calculated by averaging over 5,000 independently simulated data sets using the standard GWAS P-value threshold 5 × 10⁻⁷.
Calibration of our r metric for imputation accuracy. Data are from the baseline model, but we now record estimated imputation accuracies, which we call r, as well as the true imputation accuracies. Top row, imputation correlation is plotted against h². The black line is the true imputation accuracy and agrees with the PHENIX line (red) in Figure 1. We estimate r in two ways: by hiding 1% (brown line) or 5% (purple line) of observed entries. Point colors correspond to values of h². Bottom row, estimated r is compared to the true r, with variability created by varying h². Each point corresponds to the point in the above plot with the same color.
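The hide-and-reimpute estimate of r can be sketched as follows. This is an illustration under assumptions, not the PHENIX implementation: `row_mean_impute` is a deliberately simple placeholder imputer, the toy data share a per-sample effect so that rows are informative, and all function names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def row_mean_impute(matrix):
    """Placeholder imputer: fill each None with the mean of the row's observed entries."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = sum(observed) / len(observed) if observed else 0.0
        filled.append([fill if v is None else v for v in row])
    return filled

def estimate_r(matrix, impute, hide_fraction=0.05, seed=0):
    """Hide a fraction of observed entries, re-impute, and correlate with the truth."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v is not None]
    n_hide = max(2, round(hide_fraction * len(observed)))
    hidden = rng.sample(observed, n_hide)
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    imputed = impute(masked)
    truth = [matrix[i][j] for i, j in hidden]
    guess = [imputed[i][j] for i, j in hidden]
    return pearson(truth, guess)

rng = random.Random(1)
# Toy data: 300 samples x 15 traits sharing a per-sample effect, 5% already missing
data = []
for _ in range(300):
    shared = rng.gauss(0.0, 1.0)
    row = [shared + rng.gauss(0.0, 1.0) for _ in range(15)]
    data.append([None if rng.random() < 0.05 else v for v in row])

r_hat = estimate_r(data, row_mean_impute, hide_fraction=0.05)
print(r_hat)
```

Changing `hide_fraction` to 0.01 gives the 1% variant. Hiding fewer entries disturbs the data less but leaves fewer values to correlate, so that estimate is noisier.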
Cite this article
Dahl, A., Iotchkova, V., Baud, A. et al. A multiple-phenotype imputation method for genetic studies. Nat Genet 48, 466–472 (2016). https://doi.org/10.1038/ng.3513