  • Technical Report

A multiple-phenotype imputation method for genetic studies

Abstract

Genetic association studies have yielded a wealth of biological discoveries. However, these studies have mostly analyzed one trait and one SNP at a time, thus failing to capture the underlying complexity of the data sets. Joint genotype-phenotype analyses of complex, high-dimensional data sets have great potential as a way to move beyond simple genome-wide association studies (GWAS). The move to high-dimensional phenotypes will raise many new statistical problems. Here we address the central issue of missing phenotypes in studies with any level of relatedness between samples. We propose a multiple-phenotype mixed model and use a computationally efficient variational Bayesian algorithm to fit the model. On a variety of simulated and real data sets from a range of organisms and trait types, we show that our method outperforms existing state-of-the-art methods from the statistics and machine learning literature and can boost signals of association.

Figure 1: Simulation results.
Figure 2: Imputation performance in real data sets.
Figure 3: Missing phenotype imputation in 140 rat GWAS.
Figure 4: Platelet phenotype associations.
Figure 5: T cell phenotype associations.

Acknowledgements

J.M. acknowledges support from the European Research Council (ERC; grant 617306). A.D. acknowledges support from Wellcome Trust grant 099680/Z/12/Z. This work was supported by Wellcome Trust grant 090532/Z/09/Z. A.K. acknowledges support from the Royal Society under the Industry Fellowship scheme.

Author information

Contributions

A.D., V.I. and J.M. developed the method. A.D. carried out all analysis. J.M. and A.D. wrote the manuscript. A.B. and R.M. provided extensive advice on analysis of the rat GWAS data set. Å.J. and U.G. provided the NSPHS data set. N.S. provided the UKNBS data set. A.K. provided the chicken data set and advice on analysis. All authors critiqued the manuscript.

Corresponding author

Correspondence to Jonathan Marchini.

Ethics declarations

Competing interests

A.K. is an employee of Aviagen, Ltd., a poultry breeding company that supplies broiler breeding stock worldwide. A.K. also holds an Industry Fellowship from the Royal Society and is based part time in the Roslin Institute.

Integrated supplementary information

Supplementary Figure 1 Assessing phenotype imputation on simulated data using mean-squared error.

Simulation results measuring imputation accuracy with mean-squared error (MSE) rather than correlation. Model 1: the scenario simulated using an empirical kinship matrix derived from the human NSPHS study. Model 2: the scenario simulated using 75 unrelated families of four siblings. Data sets were simulated at various levels of heritability (x axis) for the traits. Three hundred individuals and 15 traits were simulated. Five percent of phenotype values were set to missing before imputation. Seven different methods (legend) were applied to impute the missing values. The MSE between the imputed values and the true values is plotted on the y axis for each method. Perfect imputation has MSE = 0, and, because phenotypes are centered and standardized, imputing all entries to 0 has MSE = 1. Compared to Figure 1, which uses correlation as an imputation metric, the results do not qualitatively change.
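The two reference points named in this caption can be checked with a minimal sketch. The toy Gaussian phenotype and the helper names `standardize` and `mse` are assumptions made for illustration and are not taken from the paper.

```python
import random
import statistics

def standardize(values):
    """Center to mean 0 and scale to unit (population) variance."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

def mse(imputed, truth):
    """Mean-squared error between imputed and true values."""
    return statistics.fmean((a - b) ** 2 for a, b in zip(imputed, truth))

rng = random.Random(0)
# Hypothetical toy phenotype for 300 individuals (the simulated sample size).
phenotype = standardize([rng.gauss(10.0, 3.0) for _ in range(300)])

mse_perfect = mse(phenotype, phenotype)            # perfect imputation
mse_zero = mse([0.0] * len(phenotype), phenotype)  # impute every entry as 0
```

Here the MSE is taken over every entry, so the zero-imputation value is 1 up to rounding. Over a random 5% of held-out entries it would only be close to 1.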

Supplementary Figure 2 Cancellation of genetic and environmental covariances.

Simulation results with opposing genetic and environmental correlations. Rather than an AR matrix, this plot chooses genetic correlation B to cancel the environmental correlation, B_pq = −E_pq for p ≠ q. Five percent of phenotype values were held out, and the correlation between the true and imputed values is plotted on the y axis for each method. The dotted lines show the results from Figure 1 for reference.
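The cancellation can be written out under one possible parameterization. The matrix-normal form and the scaling of the two terms by h² and 1 − h² are assumptions made for illustration; the caption itself only names the genetic correlation B, the environmental correlation E and the relatedness between samples (written K here).

```latex
\begin{align*}
Y &= U + \varepsilon, \qquad
U \sim \mathcal{MN}\bigl(0,\; K,\; h^2 B\bigr), \qquad
\varepsilon \sim \mathcal{MN}\bigl(0,\; I,\; (1-h^2) E\bigr), \\
\operatorname{cov}(Y_{ip}, Y_{jq}) &= h^2 K_{ij} B_{pq} + (1-h^2)\,\delta_{ij} E_{pq}, \\
\operatorname{cov}(Y_{ip}, Y_{iq}) &= (1 - 2h^2)\, E_{pq}
\quad \text{when } K_{ii} = 1,\; B_{pq} = -E_{pq},\; p \neq q .
\end{align*}
```

Under these assumptions the within-individual covariance between two traits vanishes at h² = 0.5, while the covariance between related individuals, h² K_ij B_pq, does not. Relatedness can therefore still carry information that the phenotype correlations alone no longer show.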

Supplementary Figure 3 Increasing sample size and number of phenotypes to N = 1,000 and P = 50.

Simulation results using larger data sets. This figure uses (N, P) = (1,000, 50), while the dotted lines use (N, P) = (300, 15). Five percent of phenotype values were held out, and the correlation between the true and imputed values is plotted on the y axis for each method. Increasing the data size nearly always improves imputation accuracy, although this effect is attenuated when using the sibling relatedness matrix, as family sizes are fixed and increasing N does not increase the amount of between-sample correlation. The dotted lines show the results from Figure 1 for reference.

Supplementary Figure 4 Varying levels of genetic correlation between phenotypes.

Simulation results varying the amount of genetic correlation. We vary the overall genetic correlation matrix B by changing ρ, the AR parameter. The top row shows simulations with ρ = 0.275, decreasing the average genetic correlation between traits compared to the dotted lines (from Figure 1) that use the baseline choice ρ = 0.45; the bottom row shows simulations with ρ = 0.675. Analogous results were obtained using ρ = −0.275 (data not shown). Five percent of phenotype values were held out, and the correlation between the true and imputed values is plotted on the y axis for each method. The imputation accuracy of multi-trait methods increases with genetic correlation, and this effect increases with h².
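How ρ controls the average genetic correlation can be seen in a short sketch. It assumes that "AR" means a first-order autoregressive structure with entries ρ^|p − q|, which the caption does not spell out; `ar1_matrix` and `mean_offdiag` are hypothetical helper names.

```python
def ar1_matrix(rho, n_traits):
    """AR(1) correlation matrix with entries rho ** |p - q| (assumed form)."""
    return [[rho ** abs(p - q) for q in range(n_traits)]
            for p in range(n_traits)]

def mean_offdiag(matrix):
    """Average of the off-diagonal entries of a square matrix."""
    n = len(matrix)
    total = sum(matrix[p][q] for p in range(n) for q in range(n) if p != q)
    return total / (n * (n - 1))

P = 15  # number of traits in the baseline simulations
B_baseline = ar1_matrix(0.45, P)

# Average between-trait correlation for the three values of rho in the caption.
avg_corr = {rho: mean_offdiag(ar1_matrix(rho, P))
            for rho in (0.275, 0.45, 0.675)}
```

With this form, neighboring traits are the most strongly correlated and the correlation decays geometrically with the distance between trait indices.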

Supplementary Figure 5 Increasing data missingness to 10%.

Simulation results at a higher level of missingness. Ten percent of phenotype values were set to missing before imputation, rather than 5% as for the dotted lines. The correlation between the imputed and true values is plotted on the y axis for each method. The dotted lines show the results from Figure 1 for reference.

Supplementary Figure 6 Effect of non-random missingness.

Simulation results with non-ignorable missingness. We hold out 5% of the entries of the phenotype matrix with probability increasing in their values, and the correlation between the true and imputed values is plotted on the y axis for each method. The dotted lines show the results from Figure 1 for reference.
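One way to produce missingness of this kind is sketched below. The logistic link between an entry's value and its hold-out probability is an assumption, since the caption only states that the probability increases with the value; `nonrandom_mask` is a hypothetical helper and the phenotype matrix is toy Gaussian data.

```python
import math
import random

def nonrandom_mask(Y, target_frac=0.05, seed=0):
    """Return a same-shaped boolean mask; True marks a held-out entry.

    The hold-out probability of each entry rises with its value through a
    logistic curve (an assumed form), rescaled so that the expected
    fraction of held-out entries equals target_frac.
    """
    rng = random.Random(seed)
    raw = [[1.0 / (1.0 + math.exp(-y)) for y in row] for row in Y]
    mean_raw = sum(map(sum, raw)) / sum(len(row) for row in raw)
    scale = target_frac / mean_raw
    return [[rng.random() < min(1.0, scale * r) for r in row] for row in raw]

rng = random.Random(1)
# Toy standardized phenotypes: 300 individuals by 15 traits.
Y = [[rng.gauss(0.0, 1.0) for _ in range(15)] for _ in range(300)]
mask = nonrandom_mask(Y)

held_out = [y for row, mrow in zip(Y, mask) for y, m in zip(row, mrow) if m]
observed = [y for row, mrow in zip(Y, mask) for y, m in zip(row, mrow) if not m]
frac_missing = len(held_out) / (len(held_out) + len(observed))
```

Because large values are more likely to be hidden, the held-out entries have a higher mean than the observed ones, which is what makes the missingness non-ignorable.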

Supplementary Figure 7 Effect of unmodeled, shared environment.

Simulation results with confounding cryptic relatedness. The contribution of the additive genetic term U in a typical MPMM is a²; each row increases the contribution of the contaminating shared environment, c², to the overall heritability, here defined as h² = a² + c². The first row uses c² = 0.1a²; the second uses c² = 0.3a²; and the last uses c² = a². Five percent of phenotype values were held out, and the correlation between the true and imputed values is plotted on the y axis for each method. The dotted lines show the results from Figure 1 for reference.

Supplementary Figure 8 Non-normally distributed phenotypes.

Simulation results with non-normal noise. We exponentially transform the environmental contribution ɛ to create log-normal noise. The resulting phenotypes are imputed without (top) or with (bottom) quantile normalization. Five percent of phenotype values were held out, and the correlation between the true and imputed values is plotted on the y axis for each method. The dotted lines show the results from Figure 1 for reference.
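The transformation described here can be sketched as follows. The caption does not say which form of quantile normalization was used; the rank-based mapping to standard-normal quantiles with an offset of 0.5 is an assumption, and `quantile_normalize` is a hypothetical helper.

```python
import math
import random
from statistics import NormalDist

def quantile_normalize(values):
    """Map values to standard-normal quantiles by rank (ties not handled)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return out

rng = random.Random(0)
# Log-normal noise: exponentiate Gaussian draws, as described in the caption.
noise = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(300)]
normalized = quantile_normalize(noise)
```

The mapping keeps the ordering of the values but replaces the skewed log-normal shape with a symmetric one.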

Supplementary Figure 9 Type I error calibration after phenotype imputation.

QQ plots from performing GWAS on 15 truly unassociated phenotypes with different imputation options (panel titles). Phenotypes are generated from our baseline simulation with the relevant K matrix. Rather than represent each of the 15 GWAS for each panel, we plot the point-wise minimum and maximum (dotted lines) and median (solid line) for the 15 lines. Top row, kinship and genotypes correspond to independent sets of four siblings. Bottom row, kinship and genotypes taken from the NSPHS study.
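The point-wise summary used in these panels can be sketched as below. The null P values are drawn uniformly as hypothetical stand-ins for GWAS output, the helper names are assumptions, and every GWAS is assumed to test the same number of SNPs.

```python
import math
import random
import statistics

def qq_envelope(pvalue_sets):
    """Point-wise min, median and max of sorted -log10 P across GWAS."""
    curves = [sorted(-math.log10(p) for p in pvals) for pvals in pvalue_sets]
    columns = list(zip(*curves))  # k-th smallest value from every GWAS
    return ([min(c) for c in columns],
            [statistics.median(c) for c in columns],
            [max(c) for c in columns])

def expected_quantiles(n_tests):
    """Expected -log10 P under the null, sorted ascending."""
    return sorted(-math.log10((k + 0.5) / n_tests) for k in range(n_tests))

rng = random.Random(0)
# Hypothetical null P values in (0, 1]: 15 GWAS with 1,000 SNPs each.
null_pvals = [[1.0 - rng.random() for _ in range(1000)] for _ in range(15)]
lower, median, upper = qq_envelope(null_pvals)
expected = expected_quantiles(1000)
```

Plotting `lower` and `upper` as dotted lines and `median` as a solid line against `expected` gives one panel in the style the caption describes.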

Supplementary Figure 10 Power of single-phenotype tests after phenotype imputation.

Power to detect a simulated, causal SNP using a univariate mixed model (LMM). Five thousand samples, comprising independent sets of four siblings, have 15 simulated phenotypes with pleiotropy. Five percent of phenotypes are deleted, and an LMM is then run with GEMMA after dropping missing data (unimputed) or imputing with PHENIX. Power is calculated by averaging over 1,000 independently simulated data sets using the standard GWAS P-value threshold 5 × 10⁻⁷.

Supplementary Figure 11 Power of multiple-phenotype tests after phenotype imputation.

Power to detect a simulated, causal SNP using a multiple-phenotype mixed model (MPMM). Five thousand samples, comprising independent sets of four siblings, have 15 simulated phenotypes with three levels of pleiotropy (legend). Five percent of phenotypes are deleted, and an MPMM is then run with our method by dropping samples with any missing phenotype data (unimputed) or imputing with PHENIX. Power is calculated by averaging over 5,000 independently simulated data sets using the standard GWAS P-value threshold 5 × 10⁻⁷.
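The cost of dropping samples with any missing phenotype can be estimated with a short calculation. It assumes that entries go missing independently of one another, which the caption does not state; `complete_case_fraction` is a hypothetical helper.

```python
def complete_case_fraction(n_traits, missing_rate):
    """Probability that a sample has no missing trait, assuming entries
    are missing independently of one another."""
    return (1.0 - missing_rate) ** n_traits

n_samples, n_traits, missing_rate = 5000, 15, 0.05
kept_fraction = complete_case_fraction(n_traits, missing_rate)
expected_kept = n_samples * kept_fraction
```

Under that assumption about 46% of samples, roughly 2,300 of the 5,000, have all 15 phenotypes observed, so the unimputed analysis discards more than half of the data.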

Supplementary Figure 12 Calibration of the imputation metric r.

Calibration of our r metric for imputation accuracy. Data are from the baseline model, but we now record estimated imputation accuracies, which we call r, as well as the true imputation accuracies. Top row, imputation correlation is plotted against h². The black line is the true imputation accuracy and agrees with the PHENIX line (red) in Figure 1. We estimate r in two ways: by hiding 1% (brown line) or 5% (purple line) of observed entries. Point colors correspond to values of h². Bottom row, estimated r is compared to the true r, with variability created by varying h². Each point corresponds to the point in the above plot with the same color.
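The hide-and-reimpute idea behind r can be sketched as follows. The row-mean imputer is a toy stand-in and is not PHENIX; the helper names, the use of `None` for missing entries and the one-factor toy data are all assumptions made for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def row_mean_impute(Y):
    """Toy imputer: fill each gap with the mean of that row's observed traits."""
    filled = []
    for row in Y:
        obs = [v for v in row if v is not None]
        fill = sum(obs) / len(obs) if obs else 0.0
        filled.append([fill if v is None else v for v in row])
    return filled

def estimate_r(Y, impute, frac, seed=0):
    """Hide a fraction of the observed entries, re-impute them, and
    correlate the imputed values with the hidden true values."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(Y)
                for j, v in enumerate(row) if v is not None]
    hidden = rng.sample(observed, max(2, round(frac * len(observed))))
    masked = [row[:] for row in Y]
    for i, j in hidden:
        masked[i][j] = None
    filled = impute(masked)
    return pearson([filled[i][j] for i, j in hidden],
                   [Y[i][j] for i, j in hidden])

rng = random.Random(1)
# Toy data: 300 individuals, 15 traits that share one latent factor.
Y = []
for _ in range(300):
    shared = rng.gauss(0.0, 1.0)
    Y.append([shared + rng.gauss(0.0, 1.0) for _ in range(15)])

r_1pct = estimate_r(Y, row_mean_impute, 0.01)
r_5pct = estimate_r(Y, row_mean_impute, 0.05)
```

Hiding fewer entries leaves more data for the imputer but gives a noisier estimate of r, which is the trade-off between the 1% and 5% settings.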

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–12 and Supplementary Note. (PDF 2931 kb)

Cite this article

Dahl, A., Iotchkova, V., Baud, A. et al. A multiple-phenotype imputation method for genetic studies. Nat Genet 48, 466–472 (2016). https://doi.org/10.1038/ng.3513

  • DOI: https://doi.org/10.1038/ng.3513
