Prioritization of disease genes from GWAS using ensemble-based positive-unlabeled learning

Abstract

A primary challenge in understanding disease biology from genome-wide association studies (GWAS) is that association data alone cannot directly implicate causal genes. Integrating multi-omics data sources can provide important functional links between associated variants and candidate genes. Machine learning is well positioned to take advantage of a variety of such data and to provide a solution for the prioritization of disease genes. Yet classical positive-negative classifiers impose strong limitations on gene prioritization, such as the need for reliable non-causal genes for training, which are lacking. Here, we developed a novel gene prioritization tool, Gene Prioritizer (GPrior). It is an ensemble of five positive-unlabeled bagging classifiers (Logistic Regression, Support Vector Machine, Random Forest, Decision Tree, Adaptive Boosting) that treats all genes of unknown relevance as an unlabeled set. GPrior selects an optimal composition of algorithms to tune the model for each specific phenotype. Altogether, GPrior fills an important niche of methods for GWAS data post-processing, significantly improving the ability to pinpoint disease genes compared to existing solutions.
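To make the idea of positive-unlabeled bagging concrete, the following is a minimal sketch and not the GPrior implementation. It assumes a toy nearest-centroid scorer in place of the five classifiers named in the abstract, and the function names, feature vectors, and parameters (`pu_bagging_scores`, `n_rounds`, `seed`) are hypothetical. In each round, a random subset of unlabeled examples is treated as provisional negatives, a scorer is fitted against the known positives, and only the unlabeled examples left out of that subset are scored; the scores are then averaged across rounds.

```python
import random
from statistics import fmean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=200, seed=0):
    """Average out-of-bag score (0..1) for every unlabeled example.

    Higher scores mean the example looks more like the known positives.
    The base learner here is a toy nearest-centroid scorer (an assumption
    made for brevity); any probabilistic classifier could take its place.
    """
    rng = random.Random(seed)
    n = len(unlabeled)
    # Assumed choice: draw about as many pseudo-negatives as there are positives.
    k = max(1, min(len(positives), n // 2))
    pos_c = centroid(positives)
    totals = [0.0] * n
    counts = [0] * n
    for _ in range(n_rounds):
        # Treat a random subset of the unlabeled set as provisional negatives.
        bag = set(rng.sample(range(n), k))
        neg_c = centroid([unlabeled[i] for i in bag])
        # Score only the examples that were left out of this round's subset.
        for i in range(n):
            if i in bag:
                continue
            d_pos = sq_dist(unlabeled[i], pos_c)
            d_neg = sq_dist(unlabeled[i], neg_c)
            total = d_pos + d_neg
            totals[i] += d_neg / total if total > 0 else 0.5
            counts[i] += 1
    # Examples never left out of a subset fall back to a neutral 0.5.
    return [t / c if c else 0.5 for t, c in zip(totals, counts)]


# Hypothetical two-feature data: known positives cluster near (1, 1).
positives = [[1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
# The first two unlabeled examples are hidden positives; the rest are background.
unlabeled = [
    [1.0, 1.0], [0.95, 1.05],
    [0.0, 0.1], [0.1, 0.0], [-0.1, 0.05],
    [0.05, -0.1], [0.0, 0.0], [0.1, 0.1],
]

scores = pu_bagging_scores(positives, unlabeled)
ranking = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
print(ranking[:2])  # indices of the two highest-ranked unlabeled examples
```

Scoring only out-of-bag examples keeps an example from being evaluated by a model that was trained with it labeled as a negative. An ensemble in the spirit of the abstract would run this procedure with several base learners and combine their rankings; how GPrior selects and weights them is not specified here.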

Fig. 1: GPrior ensemble positive-unlabeled learning framework.
Fig. 2: Gene prioritization for inflammatory bowel disease GWAS.
Fig. 3: Gene prioritization for coronary artery disease GWAS.
Fig. 4: Gene prioritization for schizophrenia GWAS.

Code availability

https://github.com/faramer86/GPrior.

Acknowledgements

The authors would like to thank Dr. Alexey Sergushichev (ITMO University) and Dr. Maxim Artyomov (Washington University in St. Louis) for helpful discussions.

Funding

N.K. was supported by a grant from the Ministry of Science and Higher Education of the Russian Federation (Agreement No. 075-15-2020-901).

Author information

Corresponding authors

Correspondence to Mark J. Daly or Mykyta Artomov.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

The study did not require ethical approval as no human subject data was involved.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Kolosov, N., Daly, M.J. & Artomov, M. Prioritization of disease genes from GWAS using ensemble-based positive-unlabeled learning. Eur J Hum Genet 29, 1527–1535 (2021). https://doi.org/10.1038/s41431-021-00930-w

  • DOI: https://doi.org/10.1038/s41431-021-00930-w
