Abstract
Genome-wide association (GWA) studies have linked thousands of loci to human diseases, but the causal genes and variants at these loci generally remain unknown. Although investigators typically focus on genes closest to the associated polymorphisms, the causal gene is often more distal. Reliance on published work to prioritize candidates is biased toward well-characterized genes. We describe a 'prix fixe' strategy and software that uses genome-scale shared-function networks to identify sets of mutually functionally related genes spanning multiple GWA loci. Using associations from ∼100 GWA studies covering ten cancer types, our approach outperformed the common alternative strategy in ranking known cancer genes. As more GWA loci are discovered, the strategy will have increased power to elucidate the causes of human disease.
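The core idea, choosing one candidate gene per locus so that the chosen genes are densely connected in a shared-function network, can be illustrated with a minimal sketch. The locus names, gene names, and edges below are hypothetical, the edge count is an assumed stand-in for the actual scoring function, and exhaustive enumeration is used only because the toy example is tiny; it is not the search procedure of the published software.

```python
from itertools import combinations, product

# Hypothetical toy input: candidate genes at three GWA loci.
loci = {
    "locus1": ["A1", "A2"],
    "locus2": ["B1", "B2"],
    "locus3": ["C1", "C2"],
}

# Hypothetical shared-function network, stored as undirected edges.
edges = {
    frozenset(pair)
    for pair in [("A1", "B2"), ("B2", "C1"), ("A1", "C1"), ("A2", "B1")]
}


def edge_count(genes):
    """Count network edges among the selected genes (assumed score)."""
    return sum(frozenset(pair) in edges for pair in combinations(genes, 2))


def best_selection(loci):
    """Try every one-gene-per-locus choice and keep the densest one."""
    names = sorted(loci)
    best = max(product(*(loci[name] for name in names)), key=edge_count)
    return dict(zip(names, best)), edge_count(best)


selection, score = best_selection(loci)
print(selection, score)
```

Exhaustive search grows exponentially with the number of loci, so a realistic analysis would need a heuristic search instead.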
Acknowledgements
We thank members of the Roth lab and the Center for Cancer Systems Biology (CCSB) at the Dana-Farber Cancer Institute (DFCI) for helpful comments and discussion; the lab of Q. Morris for assistance with GeneMANIA data; and M. Çokol and J. Mellor for useful conversations and advice during manuscript preparation. This work was primarily supported by Center of Excellence in Genomic Science (CEGS) grant P50 (HG004233) from the NHGRI awarded to M.V. and F.P.R. F.P.R. is additionally supported by US National Institutes of Health (NIH) grants (HG003224 and HL107440), the Krembil and Avon Foundations, a Canadian Ontario Research Fund Research Excellence Award and the Canada Excellence Research Chairs Program. C.A.M. was supported in this work by an NIH grant (HL098938), the Leducq Foundation and the Harvard Stem Cell Institute. M.T. was supported by an NIH grant (HG004098).
Author information
Contributions
M.T., G.M., C.A.M. and F.P.R. conceived of the project. M.T., G.M. and T.H. performed computational analyses. M.T., G.M., C.A.M. and F.P.R. wrote the manuscript. M.V., C.A.M. and F.P.R. oversaw and guided the research effort.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Much of the human proteome is unrepresented in protein-protein interaction databases.
The proteome as defined here consists of all 20484 “protein-coding” genes in the NCBI GENE database. Interactions, both binary and co-complex, are taken from the HINT high-quality amortized protein-protein interaction database.
Supplementary Figure 2 Historical investigation bias in the literature.
Genes characterized at earlier dates continue to appear in publications at higher rates than more-recently characterized genes. Circle positions indicate mean publication rate (y-axis) for genes first characterized during or before each year from 1990–2012 (x-axis). Circle sizes indicate the (cumulative) number of genes characterized during or before each year.
Supplementary Figure 3 Cofunction network (CFN) coverage in terms of genes and gene pairs.
(a,b) Coverage shown for genes (a) and gene pairs (b). “HumanFunc” and “GeneMania” are computed as described in Online Methods. NCBI Gene data downloaded on 2013-07-17.
Supplementary Figure 4 Efficient enrichment for dense prix fixe subnetworks using a prostate cancer case study.
Boxplots show the evolution of candidate prix fixe subnetwork fitness over 20 generations; circles within boxes indicate mean fitness, and whiskers cover the full range of observed fitnesses. The marginal histogram (right) shows the distribution of final-generation mean fitnesses for 1,000 random trials (see Online Methods). The empirical P-value for the final generation’s subnetwork enrichment is computed against this marginal distribution (dashed line).
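An empirical P-value of this kind can be sketched as follows. This is a minimal illustration, assuming the P-value is the fraction of random-trial fitnesses at least as large as the observed one; the +1 correction and the placeholder numbers are assumptions, since the caption does not state the exact estimator.

```python
def empirical_p_value(observed, null_values):
    """Estimate P(null >= observed) from simulated null values.

    The +1 in the numerator and denominator is an assumed convention
    that keeps the estimate above zero.
    """
    if not null_values:
        raise ValueError("need at least one null value")
    at_least_as_large = sum(value >= observed for value in null_values)
    return (at_least_as_large + 1) / (len(null_values) + 1)


# Hypothetical stand-in for the mean fitnesses of 1,000 random trials.
null_fitness = [float(i) for i in range(1000)]
observed_fitness = 990.0

p_value = empirical_p_value(observed_fitness, null_fitness)
print(f"empirical P = {p_value:.4f}")
```

With 1,000 random trials, the smallest P-value this estimator can report is 1/1001.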
Supplementary Figure 5 Prix fixe scores are uncorrelated with LD (r²) values.
Each scatter plot point is a candidate breast cancer gene. Correlation is computed using Kendall’s τ rank coefficient. Blue points indicate genes with significantly differentially expressed mRNA levels in matched case-control TCGA breast cancer (BRCA) samples, while red points indicate genes with no evidence of cancer-dependent differential expression. Flanking boxplots indicate score distributions of differentially and non-differentially expressed genes. Boxplot whiskers extend to 1.5×IQR; outliers not shown. Boxplots are compared by one-sided Wilcoxon rank sum tests.
Supplementary Figure 6 Prix fixe score robustness with respect to varying LD (r²) thresholds.
Each histogram represents a collection of Kendall’s τ rank correlation coefficients. Each single correlation coefficient represents a comparison of prix fixe rank orders for a single analyzed trait when the method is repeated using two different r² thresholds. (a) “Pure” replication (to measure stochastic variance) of the primary analyses using the identical r² ≥ 0.50 threshold. (b) Comparison of scores between primary analyses (r² ≥ 0.50) and a ‘permissive’ (r² ≥ 0.25) threshold. (c) Comparison of scores between primary analyses (r² ≥ 0.50) and a ‘restrictive’ (r² ≥ 0.75) threshold.
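The rank comparison between two runs can be illustrated with a small Kendall’s τ computation. This is a sketch under assumptions: it uses the tie-adjusted τ-b variant (the caption does not say which variant was used), and the two score vectors are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denominator = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / denominator


# Hypothetical scores for the same four genes from two runs of the method.
run_a = [0.9, 0.7, 0.5, 0.3]
run_b = [0.8, 0.9, 0.4, 0.2]

tau = kendall_tau_b(run_a, run_b)
print(f"tau = {tau:.3f}")
```

The pairwise loop is quadratic in the number of genes, which is adequate for per-trait candidate lists; a library routine would be preferable for large inputs.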
Supplementary Figure 7 Rank-based analysis of Sanger cancer gene census (SCGC) prioritization when using a ‘permissive’ LD threshold of r² ≥ 0.25.
Genes are ranked within each cancer-associated locus, and normalized ranks of known cancer (i.e., SCGC) genes are shown as dots for prix fixe-based (“PF”, left) and LD-based (“r²”, right) rankings (100 is highest ranked, 0 is lowest). The average relative rank of SCGC genes (for both methods) within each locus is indicated by horizontal bars; the number of multigenic loci is shown above as “n”. The right-most plot (“Union”) shows pooled results across all cancer-associated loci. PF SCGC ranks significantly outperform LD-based SCGC ranks (P = 0.025, one-sided paired Wilcoxon signed-rank test).
Supplementary Figure 8 Rank-based analysis of Sanger cancer gene census (SCGC) prioritization when using a ‘restrictive’ LD threshold of r² ≥ 0.75.
Genes are ranked within each cancer-associated locus, and normalized ranks of known cancer (i.e., SCGC) genes are shown as dots for prix fixe-based (“PF”, left) and LD-based (“r²”, right) rankings (100 is highest ranked, 0 is lowest). The average relative rank of SCGC genes (for both methods) within each locus is indicated by horizontal bars; the number of multigenic loci is shown above as “n”. The right-most plot (“Union”) shows pooled results across all cancer-associated loci. PF SCGC ranks significantly outperform LD-based SCGC ranks (P = 0.028, one-sided paired Wilcoxon signed-rank test).
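The within-locus normalization (100 for the highest-ranked gene, 0 for the lowest) can be sketched as below. The linear scaling is an assumption, since the caption gives only the endpoints, and the gene names and scores are hypothetical.

```python
def normalized_ranks(scores):
    """Map each gene's score to a 0-100 rank within one locus.

    Assumes linear spacing between the lowest (0) and highest (100)
    ranked gene. Tied scores keep their input order.
    """
    if len(scores) < 2:
        raise ValueError("a multigenic locus needs at least two genes")
    ordered = sorted(scores, key=scores.get)  # lowest score first
    step = 100 / (len(ordered) - 1)
    return {gene: position * step for position, gene in enumerate(ordered)}


# Hypothetical scores for three genes at one locus.
locus_scores = {"G1": 0.9, "G2": 0.4, "G3": 0.1}

ranks = normalized_ranks(locus_scores)
print(ranks)
```

Single-gene loci are rejected because a relative rank is only meaningful when a locus offers more than one candidate.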
Supplementary Figure 9 Prix fixe score robustness with respect to varying cofunction networks (CFNs).
Each histogram represents a collection of Kendall’s τ rank correlation coefficients. Each single correlation coefficient represents a comparison of prix fixe rank orders for a single analyzed trait when the method is repeated using two different CFNs. (a) Comparison of scores between primary analyses’ CFN (HF ⋃ GM) and the HF-alone CFN. (b) Comparison of scores between primary analyses’ CFN (HF ⋃ GM) and the GM-alone CFN. (c) Comparison of scores between primary analyses’ CFN (HF ⋃ GM) and a STRING-augmented CFN (HF ⋃ GM ⋃ STRING).
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–9 (PDF 891 kb)
Supplementary Table 1
GWAS data. This table provides all of the input data originating in published GWA studies. The majority of the data here originate directly from the NHGRI GWAS catalog [2]. Each ‘sheet’ represents a single study in this work. Column data are taken directly from the catalog. Each row represents a single trait-associated tagSNP. (XLSX 159 kb)
Supplementary Table 2
Analysis scores. Run results for all studies examined in this work. Each ‘sheet’ corresponds to a single trait. (XLSX 665 kb)
Supplementary Table 3
Functional enrichment results. GO term functional enrichment results for all traits. (XLSX 373 kb)
Supplementary Table 4
Extended summary table. GO term functional enrichment results for all traits. (XLSX 33 kb)
Supplementary Table 5
T2D replication results. Results for the replication experiment using type-II diabetes (T2D) loci (identified independently from the T2D analysis included in our primary study). Three sheets hold (i) genes and prix fixe scores, (ii) annotations for genes having causal links to diabetes, and (iii) ‘ordered’ functional enrichment results. (XLSX 34 kb)
Prix fixe software
Includes the R package to run a prix fixe analysis (beta version), a reference manual, and an R/Bioconductor vignette for the package. (ZIP 280 kb)
About this article
Cite this article
Taşan, M., Musso, G., Hao, T. et al. Selecting causal genes from genome-wide association studies via functionally coherent subnetworks. Nat Methods 12, 154–159 (2015). https://doi.org/10.1038/nmeth.3215
DOI: https://doi.org/10.1038/nmeth.3215