Technical Report | Published:

Covariate selection for association screening in multiphenotype genetic studies

Nature Genetics volume 49, pages 1789–1795 (2017)


Testing for associations in big data faces the problem of multiple comparisons, wherein true signals are difficult to detect on the background of all associations queried. This difficulty is particularly salient in human genetic association studies, in which phenotypic variation is often driven by numerous variants of small effect. The current strategy to improve power to identify these weak associations consists of applying standard marginal statistical approaches and increasing study sample sizes. Although successful, this approach does not leverage the environmental and genetic factors shared among the multiple phenotypes collected in contemporary cohorts. Here we developed covariates for multiphenotype studies (CMS), an approach that improves power when correlated phenotypes are measured on the same samples. Our analyses of real and simulated data provide direct evidence that correlated phenotypes can be used to achieve increases in power to levels often surpassing the power gained by a twofold increase in sample size.
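The intuition behind the power gain can be sketched with a small simulation. This is not the CMS procedure itself, only a minimal illustration of why adjusting for a correlated phenotype can help: when a second phenotype shares a non-genetic factor with the outcome, including it as a covariate shrinks the residual variance and sharpens the test of the genetic variant. Every name and parameter value below (`estimate_power`, the effect size, the allele frequency, the noise levels) is a hypothetical choice made for illustration, and the covariate is deliberately simulated with no genetic effect of its own.

```python
import numpy as np


def z_stat(y, X):
    """OLS z-statistic for the first column of X (an intercept is appended)."""
    X = np.column_stack([X, np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[0] / np.sqrt(cov[0, 0])


def estimate_power(n=1000, beta=0.1, reps=500, seed=0):
    """Fraction of replicates with |z| > 1.96, without and with the covariate.

    Assumed toy model (not taken from the article):
      g : genotype, Binomial(2, 0.3)
      u : shared non-genetic factor
      y : outcome phenotype   = beta * g + u + noise
      c : correlated phenotype = u + noise   (no genetic effect)
    """
    rng = np.random.default_rng(seed)
    hits_marginal = 0
    hits_adjusted = 0
    for _ in range(reps):
        g = rng.binomial(2, 0.3, size=n).astype(float)
        u = rng.normal(size=n)
        y = beta * g + u + rng.normal(size=n)
        c = u + 0.5 * rng.normal(size=n)
        # Standard marginal test: y ~ g
        hits_marginal += abs(z_stat(y, g)) > 1.96
        # Covariate-adjusted test: y ~ g + c
        hits_adjusted += abs(z_stat(y, np.column_stack([g, c]))) > 1.96
    return hits_marginal / reps, hits_adjusted / reps


if __name__ == "__main__":
    p_marginal, p_adjusted = estimate_power()
    print(f"marginal power: {p_marginal:.2f}, adjusted power: {p_adjusted:.2f}")
```

Calling `estimate_power()` returns the two detection rates as a pair. The toy model makes the covariate independent of the genotype; a covariate that is itself influenced by the variant would not behave this way, which is why the choice of covariates matters.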





Acknowledgements

H.A. and N.Z. were supported by NIH grant R03DE025665. H.A. was also supported by NIH grant R21HG007687, and N.Z. was also supported by NIH career development award K25HL121295 and NIH grant U01HG009080. C.J.P. was supported by NIH grant R00 ES023504.

Author information

Author notes

    • Peter Kraft
    • Noah Zaitlen

    These authors jointly directed this work.


Affiliations

  1. Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), Institut Pasteur, Paris, France.

    • Hugues Aschard
    • Vincent Guillemot

  2. Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

    • Hugues Aschard
    • Peter Kraft

  3. Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

    • Hugues Aschard
    • Peter Kraft

  4. Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark.

    • Bjarni Vilhjalmsson

  5. Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA.

    • Chirag J Patel

  6. Division of Infectious Diseases, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA.

    • David Skurnik

  7. Massachusetts Technology and Analytics, Brookline, Massachusetts, USA.

    • David Skurnik

  8. Department of Microbiology, Necker Hospital, University Paris-Descartes, Paris, France.

    • David Skurnik

  9. Institut Necker–Enfants Malades, INSERM U1151–Equipe 11, Paris, France.

    • David Skurnik

  10. Department of Epidemiology and Biostatistics, Institute of Human Genetics, San Francisco, California, USA.

    • Chun J Ye

  11. Center for Gastrointestinal Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, Massachusetts, USA.

    • Brian Wolpin

  12. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

    • Peter Kraft

  13. Department of Medicine, University of California, San Francisco, San Francisco, California, USA.

    • Noah Zaitlen




Contributions

H.A. conceived the approach and performed all real-data analyses. H.A., N.Z., B.V., C.J.P., D.S., and P.K. contributed substantially to improving the approach and the study design. C.J.Y. contributed to the quality control and analysis of the gEUVADIS data. B.W. collected the metabolite data and contributed to quality control and analysis of the metabolite data. H.A. and N.Z. conceptualized and performed the simulation study. V.G. contributed to the simulation study. H.A. and N.Z. wrote the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Hugues Aschard.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–45, Supplementary Tables 1–8 and Supplementary Note

  2. Life Sciences Reporting Summary

About this article

Publication history