  • Review Article

Computational tools for prioritizing candidate genes: boosting disease gene discovery

Key Points

  • Gene prioritization aims to integrate complex, heterogeneous data to identify the most promising genes for biological validation among a set of candidates. Its goal is to help biological researchers who face mountains of public and private omics data to maximize the yield of downstream biological validation.

  • Prioritization methods leverage prior knowledge of the phenotype or biological process of interest, in the form of either keywords that describe the phenotype or sets of genes previously associated with the phenotype or process. They then either profile data from candidates against this prior knowledge or diffuse this knowledge across a biological network to identify the most closely associated candidates; methods also exist for the case in which little or no prior knowledge is available.

  • Gene prioritization has contributed to the discovery of many disease-causing genes. High ranking of a candidate gene in prioritization for a phenotype is now accepted as contributing evidence in proving that mutations in this gene cause the phenotype.

  • Numerous prioritization tools are publicly available, often via the Web, and biologists can use them without specific bioinformatics expertise. Although no single tool performs best in all situations, together the tools cover most experimental situations in which gene prioritization is useful.

  • Computational validation of prioritization results — using procedures such as cross-validation, appropriate negative controls and functional enrichment — is essential to guarantee the effectiveness of the prioritization. More complex prioritization strategies are available to increase the effectiveness of prioritization methods further.

  • Although prioritization methods are now firmly established, many refinements that improve their performance and usability by biologists can be expected. Moreover, prioritization of sequencing variants identified by next-generation sequencing is emerging as a major need for the biological community, in which data integration can have an important role and for which new prioritization strategies are needed.
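The network diffusion strategy mentioned in the key points can be sketched as a random walk with restart. This is a minimal illustration under stated assumptions, not the algorithm of any specific tool: the toy network, the gene names, the function names and the restart probability of 0.3 are all hypothetical.

```python
from collections import defaultdict


def random_walk_with_restart(edges, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Diffuse seed-gene knowledge over an undirected network.

    edges: iterable of (gene_a, gene_b) pairs.
    seeds: genes already associated with the phenotype.
    Returns a dict mapping each gene to its steady-state visiting probability.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    genes = sorted(neighbours)
    seeds = [s for s in seeds if s in neighbours]
    if not seeds:
        raise ValueError("no seed gene is present in the network")

    # The walker restarts uniformly at one of the seed genes.
    p0 = {g: (1.0 / len(seeds) if g in seeds else 0.0) for g in genes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {g: restart * p0[g] for g in genes}
        for g in genes:
            # Each gene passes its remaining probability evenly to its neighbours.
            share = (1.0 - restart) * p[g] / len(neighbours[g])
            for n in neighbours[g]:
                nxt[n] += share
        change = sum(abs(nxt[g] - p[g]) for g in genes)
        p = nxt
        if change < tol:
            break
    return p


def rank_candidates(scores, candidates):
    """Order candidates from most to least closely connected to the seeds."""
    return sorted(candidates, key=lambda g: scores.get(g, 0.0), reverse=True)


# Hypothetical toy network: two seed genes and a chain of candidates.
toy_edges = [("SEED1", "A"), ("SEED2", "A"), ("A", "B"), ("B", "C"), ("C", "D")]
toy_scores = random_walk_with_restart(toy_edges, seeds=["SEED1", "SEED2"])
print(rank_candidates(toy_scores, ["D", "C", "A"]))
```

In this toy network the candidate adjacent to both seeds ranks first and candidates further along the chain rank lower, which is the behaviour a diffusion approach is meant to capture.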

Abstract

At different stages of any research project, molecular biologists need to choose (often somewhat arbitrarily, even after careful statistical data analysis) which genes or proteins to investigate further experimentally and which to leave out because of limited resources. Computational methods that integrate complex, heterogeneous data sets (such as expression data, sequence information, functional annotation and the biomedical literature) allow genes to be prioritized for future study in a more informed way. Such methods can substantially increase the yield of downstream studies and are becoming invaluable to researchers.
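One simple way to picture the integration of heterogeneous data sets is to rank the candidates separately in each data source and then combine the ranks. The sketch below averages normalized ranks, which is chosen purely for illustration and is not claimed to be the fusion scheme of any published tool. The source names, gene names, scores and function names are hypothetical.

```python
def rank_ratios(scores):
    """Convert similarity scores (higher is better) into rank ratios in (0, 1].

    The best gene gets 1/n; tied genes share the mean of their ranks.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    ratios = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2.0
        for gene in ordered[i:j + 1]:
            ratios[gene] = mean_rank / n
        i = j + 1
    return ratios


def fuse_rankings(per_source_scores):
    """Combine per-source rankings into one global ranking.

    per_source_scores: dict mapping a data source to a dict of gene scores.
    A gene that is missing from a source is simply not penalized by it.
    Returns genes ordered from most to least promising.
    """
    collected = {}
    for scores in per_source_scores.values():
        for gene, ratio in rank_ratios(scores).items():
            collected.setdefault(gene, []).append(ratio)
    fused = {g: sum(r) / len(r) for g, r in collected.items()}
    return sorted(fused, key=lambda g: (fused[g], g))


# Hypothetical similarity of three candidates to known disease genes.
toy_sources = {
    "expression": {"G1": 0.9, "G2": 0.4, "G3": 0.1},
    "annotation": {"G1": 0.2, "G2": 0.8, "G3": 0.7},
    "literature": {"G1": 0.6, "G3": 0.3},  # no literature data for G2
}
print(fuse_rankings(toy_sources))
```

Averaging only over the sources in which a gene appears is one possible way to handle missing data; it favours poorly characterized genes less harshly than treating missing data as the worst rank, but other choices are equally defensible.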

Figure 1: Computational strategies for prioritization.
Figure 2: Exome sequencing and disease network analysis of a single family implicate a mutation in KIF1A in hereditary spastic paraparesis.
Figure 3: Haploinsufficiency of TAB2 causes congenital heart defects in humans.

References

  1. Aerts, S. et al. Gene prioritization through genomic data fusion. Nature Biotech. 24, 537–544 (2006). This is the original description of the prioritization tool Endeavour, which uses a similarity profiling strategy.

    Article  CAS  Google Scholar 

  2. Franke, L. et al. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78, 1011–1025 (2006). This is the original description of the prioritization tool Prioritizer, which relies on a human functional network.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Perez-Iratxeta, C., Bork, P. & Andrade, M. A. Association of genes to genetically inherited diseases using data mining. Nature Genet. 31, 316–319 (2002).

    Article  CAS  PubMed  Google Scholar 

  4. Thiel, C. T. et al. Severely incapacitating mutations in patients with extreme short stature identify RNA-processing endoribonuclease RMRP as an essential cell growth regulator. Am. J. Hum. Genet. 77, 795–806 (2005).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. van Driel, M. A., Cuelenaere, K., Kemmeren, P. P.C. W., Leunissen, J. A. M. & Brunner, H. G. A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur. J. Hum. Genet. 11, 57–63 (2003).

    Article  CAS  PubMed  Google Scholar 

  6. Sparrow, D. B., Guillén-Navarro, E., Fatkin, D. & Dunwoodie, S. L. Mutation of hairy-and-enhancer-of-split-7 in humans causes spondylocostal dysostosis. Hum. Mol. Genet. 17, 3761–3766 (2008).

    Article  CAS  PubMed  Google Scholar 

  7. Rajab, A. et al. Fatal cardiac arrhythmia and long-QT syndrome in a new form of congenital generalized lipodystrophy with muscle rippling (CGL4) due to PTRF-CAVIN mutations. PLoS Genet. 6, e1000874 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Kaufmann, R . et al. Infantile cerebral and cerebellar atrophy is associated with a mutation in the MED17 subunit of the transcription preinitiation mediator complex. Am. J. Hum. Genet. 87, 667–670 (2010). This study shows that MED17 mutations are associated with infantile cerebral and cerebellar atrophy using GeneDistiller.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Spinazzola, A. et al. MPV17 encodes an inner mitochondrial membrane protein and is mutated in infantile hepatic mitochondrial DNA depletion. Nature Genet. 38, 570–575 (2006).

    Article  CAS  PubMed  Google Scholar 

  10. Seelow, D., Schwarz, J. M. & Schuelke, M. GeneDistiller—distilling candidate genes from linkage intervals. PLoS ONE 3, e3874 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. George, R. A. et al. Analysis of protein sequence and interaction data for candidate disease gene prediction. Nucleic Acids Res. 34, e130 (2006).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nature Genet. 25, 25–29 (2000).

    Article  CAS  PubMed  Google Scholar 

  13. Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–D114 (2012).

    Article  CAS  PubMed  Google Scholar 

  14. Flicek, P. et al. Ensembl 2012. Nucleic Acids Res. 40, D84–D90 (2012).

    Article  CAS  PubMed  Google Scholar 

  15. Dreszer, T. R. et al. The UCSC Genome Browser database: extensions and updates 2011. Nucleic Acids Res. 40, D918–D923 (2012).

    Article  CAS  PubMed  Google Scholar 

  16. Parkinson, H. et al. ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res. 39, D1002–D1004 (2011).

    Article  CAS  PubMed  Google Scholar 

  17. Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Lee, I., Blom, U. M., Wang, P. I., Shim, J. E. & Marcotte, E. M. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 21, 1109–1121 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. van Vliet-Ostaptchouk, J. V. et al. HHEX gene polymorphisms are associated with type 2 diabetes in the Dutch Breda cohort. Eur. J. Hum. Genet. 16, 652–656 (2008). This is a biological validation of Prioritizer, showing that variants near the HHEX gene contribute to the risk of T2D in a Dutch population.

    Article  CAS  PubMed  Google Scholar 

  20. Pers, T. H. et al. Meta-analysis of heterogeneous data sources for genome-scale identification of risk genes in complex phenotypes. Genet. Epidemiol. 35, 318–332 (2011).

    Article  PubMed  Google Scholar 

  21. Cantor, R. M., Lange, K. & Sinsheimer, J. S. Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am. J. Hum. Genet. 86, 6–22 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Perez-Iratxeta, C., Bork, P. & Andrade-Navarro, M. A. Update of the G2D tool for prioritization of gene candidates to inherited diseases. Nucleic Acids Res. 35, W212–W216 (2007).

    Article  PubMed  PubMed Central  Google Scholar 

  23. Tremblay, K. et al. Genes to diseases (G2D) computational method to identify asthma candidate genes. PLoS ONE 3, e2907 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Aerts, S. et al. Integrating computational biology and forward genetics in Drosophila. PLoS Genet. 5, e1000351 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Goh, K.-I. et al. The human disease network. Proc. Natl Acad. Sci. USA 104, 8685–8690 (2007).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Smith, N. G. C. & Eyre-Walker, A. Human disease genes: patterns and predictions. Gene 318, 169–175 (2003).

    Article  CAS  PubMed  Google Scholar 

  27. Oti, M. & Brunner, H. G. The modular nature of genetic diseases. Clin. Genet. 71, 1–11 (2007). This paper provides a motivation to use the guilt by association principle to identify novel disease causing genes.

    Article  CAS  PubMed  Google Scholar 

  28. Rual, J.-F. et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, 1173–1178 (2005).

    Article  CAS  PubMed  Google Scholar 

  29. Lage, K. et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nature Biotech. 25, 309–316 (2007).

    Article  CAS  Google Scholar 

  30. Tiffin, N., Andrade-Navarro, M. A. & Perez-Iratxeta, C. Linking genes to diseases: it's all in the data. Genome Med. 1, 77 (2009). In this paper, a discussion is presented of how disease gene discovery will be facilitated by improved data integration and the use of clinical data.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Lanckriet, G. R. G., De Bie, T., Cristianini, N., Jordan, M. I. & Noble, W. S. A statistical framework for genomic data fusion. Bioinformatics 20, 2626–2635 (2004).

    Article  CAS  PubMed  Google Scholar 

  32. De Bie, T., Tranchevent, L.-C., van Oeffelen, L. M. M. & Moreau, Y. Kernel-based data fusion for gene prioritization. Bioinformatics 23, i125–i132 (2007).

    Article  CAS  PubMed  Google Scholar 

  33. Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B. & Botstein, D. A. Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl Acad. Sci. USA 100, 8348–8353 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Kondor, R. I. & Lafferty, J. Diffusion kernels on graphs and other discrete input spaces. Proc. 19th Int. Conf. Machine Learning 2002, 315–322 (2002).

    Google Scholar 

  35. Tranchevent, L.-C. et al. A guide to web tools to prioritize candidate genes. Brief. Bioinformat. 12, 22–32 (2011). This paper discusses a Web portal describing multiple prioritization tools and supporting the selection of appropriate tools for given requirements.

    Article  CAS  Google Scholar 

  36. Oti, M., Ballouz, S. & Wouters, M. A. Web tools for the prioritization of candidate disease genes. Methods Mol. Biol. 760, 189–206 (2011). This paper provides a detailed description of several Web-based prioritization methods together with their specificities.

    Article  CAS  PubMed  Google Scholar 

  37. Tiffin, N. Conceptual thinking for in silico prioritization of candidate disease genes. Methods Mol. Biol. 760, 175–187 (2011). This is a review on gene prioritization that also describes the development of your own data integration method.

    Article  CAS  PubMed  Google Scholar 

  38. Piro, R. M. & Di Cunto, F. Computational approaches to disease-gene prediction: rationale, classification and successes. FEBS J. 279, 678–696 (2012). This review focuses on the different data sources and the algorithms underlying the prioritization methods.

    Article  CAS  PubMed  Google Scholar 

  39. Kann, M. G. Advances in translational bioinformatics: computational approaches for the hunting of disease genes. Brief. Bioinformat. 11, 96–110 (2010).

    Article  CAS  Google Scholar 

  40. Stuart, J. M., Segal, E., Koller, D. & Kim, S. K. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255 (2003).

    Article  CAS  PubMed  Google Scholar 

  41. Ma, X., Lee, H., Wang, L. & Sun, F. CGI: a new approach for prioritizing genes by combining gene expression and protein-protein interaction data. Bioinformatics 23, 215–221 (2007).

    Article  CAS  PubMed  Google Scholar 

  42. Jenssen, T. K., Laegreid, A., Komorowski, J. & Hovig, E. A literature network of human genes for high-throughput analysis of gene expression. Nature Genet. 28, 21–28 (2001).

    CAS  PubMed  Google Scholar 

  43. Barabási, A.-L., Gulbahce, N. & Loscalzo, J. Network medicine: a network-based approach to human disease. Nature Rev. Genet. 12, 56–68 (2011). This is a review of network-based methods to unravel the molecular mechanisms underlying diseases.

    Article  CAS  PubMed  Google Scholar 

  44. Nitsch, D. et al. PINTA: a web server for network-based gene prioritization from expression data. Nucleic Acids Res. 39, W334–W338 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Keyser, R. J., Oppon, E., Carr, J. A. & Bardien, S. Identification of Parkinson's disease candidate genes using CAESAR and screening of MAPT and SNCAIP in South African Parkinson's disease patients. J. Neural Transm. 118, 889–897 (2011).

    Article  PubMed  Google Scholar 

  46. Oti, M., Huynen, M. A. & Brunner, H. G. The biological coherence of human phenome databases. Am. J. Hum. Genet. 85, 801–808 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Hamosh, A., Scott, A. F., Amberger, J., Valle, D. & McKusick, V. A. Online Mendelian Inheritance in Man (OMIM). Hum. Mutat. 15, 57–61 (2000).

    Article  CAS  PubMed  Google Scholar 

  48. Antonarakis, S. E. & McKusick, V. A. OMIM passes the 1,000-disease-gene mark. Nature Genet. 25, 11 (2000).

    Article  CAS  PubMed  Google Scholar 

  49. Becker, K. G., Barnes, K. C., Bright, T. J. & Wang, S. A. The genetic association database. Nature Genet. 36, 431–432 (2004).

    Article  CAS  PubMed  Google Scholar 

  50. Doms, A. & Schroeder, M. GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res. 33, W783–W786 (2005).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Racine, J. et al. Comparison of genomic and proteomic data in recurrent airway obstruction affected horses using ingenuity pathway analysis®. BMC Vet. Res. 7, 48 (2011).

    Article  PubMed  PubMed Central  Google Scholar 

  52. Thomas, S. & Bonchev, D. A survey of current software for network analysis in molecular biology. Hum. Genom. 4, 353–360 (2010).

    Article  CAS  Google Scholar 

  53. Wickramasinghe, S., Rincon, G., Islas-Trejo, A. & Medrano, J. F. Transcriptional profiling of bovine milk using RNA sequencing. BMC Genom. 13, 45 (2012).

    Article  CAS  Google Scholar 

  54. Ekins, S., Nikolsky, Y., Bugrim, A., Kirillov, E. & Nikolskaya, T. Pathway mapping tools for analysis of high content data. Methods Mol. Biol. 356, 319–350 (2007).

    CAS  PubMed  Google Scholar 

  55. Stenson, P. D. et al. Human Gene Mutation Database (HGMD): 2003 update. Hum. Mutat. 21, 577–581 (2003).

    Article  CAS  PubMed  Google Scholar 

  56. Stenson, P. D. et al. The Human Gene Mutation Database: 2008 update. Genome Med. 1, 13 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. Franke, L. et al. TEAM: a tool for the integration of expression, and linkage and association maps. Eur. J. Hum. Genet. 12, 633–638 (2004).

    Article  CAS  PubMed  Google Scholar 

  58. Bush, W. S., Dudek, S. M. & Ritchie, M. D. Biofilter: a knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac. Symp. Biocomput. 14, 368–379 (2009).

    Google Scholar 

  59. Krallinger, M., Valencia, A. & Hirschman, L. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 9 (Suppl. 2), S8 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Winnenburg, R., Wächter, T., Plake, C., Doms, A. & Schroeder, M. Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies? Brief. Bioinformat. 9, 466–478 (2008).

    Article  CAS  Google Scholar 

  61. Schadt, E. E. Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223 (2009).

    Article  CAS  PubMed  Google Scholar 

  62. Baudot, A., Gómez-López, G. & Valencia, A. Translational disease interpretation with molecular networks. Genome Biol. 10, 221 (2009).

    Article  PubMed  PubMed Central  Google Scholar 

  63. Vidal, M., Cusick, M. E. & Barabási, A.-L . Interactome networks and human disease. Cell 144, 986–998 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Yu, W., Wulf, A., Liu, T., Khoury, M. J. & Gwinn, M. Gene Prospector: an evidence gateway for evaluating potential susceptibility genes and interacting risk factors for human diseases. BMC Bioinformat. 9, 528 (2008).

    Article  CAS  Google Scholar 

  65. Van Vooren, S. et al. Mapping biomedical concepts onto the human genome by mining literature on chromosomal aberrations. Nucleic Acids Res. 35, 2533–2543 (2007).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  66. Firth, H. V. et al. DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources. Am. J. Hum. Genet. 84, 524–533 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  67. Kowald, A. & Schmeier, S. Data Mining in Proteomics. Inform. Retrieval 696, 305–318 (Humana Press, 2011).

    Book  Google Scholar 

  68. Tranchevent, L.-C. et al. ENDEAVOUR update: a web resource for gene prioritization in multiple species. Nucleic Acids Res. 36, W377–W384 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  69. Chen, J., Bardes, E. E., Aronow, B. J. & Jegga, A. G. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 37, W305–W311 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  70. Fontaine, J.-F., Priller, F., Barbosa-Silva, A. & Andrade-Navarro, M. A. Génie: literature-based gene prioritization at multi genomic scale. Nucleic Acids Res. 39, W455–W461 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  71. Britto, R. et al. GPSy: a cross-species gene prioritization system for conserved biological processes—application in male gamete development. Nucleic Acids Res. 8 May 2012 (doi:10.1093/nar/gks380).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  72. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  73. Kann, M. G. Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief. Bioinformat. 8, 333–346 (2007).

    Article  CAS  Google Scholar 

  74. Navlakha, S. & Kingsford, C. The power of protein interaction networks for associating genes with diseases. Bioinformatics 26, 1057–1063 (2010). This is a recent review about predicting disease–gene associations using gene–protein networks and network-based algorithms.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  75. Köhler, S., Bauer, S., Horn, D. & Robinson, P. N. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  76. Chen, J., Xu, H., Aronow, B. J. & Jegga, A. G. Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformat. 8, 392 (2007).

    Article  CAS  Google Scholar 

  77. Breitkreutz, B.-J., Stark, C. & Tyers, M. The GRID: the General Repository for Interaction Datasets. Genome Biol. 4, R23 (2003).

    Article  PubMed  PubMed Central  Google Scholar 

  78. Linghu, B., Snitkin, E. S., Hu, Z., Xia, Y. & Delisi, C. Genome-wide prioritization of disease genes and identification of disease–disease associations from an integrated human functional linkage network. Genome Biol. 10, R91 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  79. Snel, B., Lehmann, G., Bork, P. & Huynen, M. A. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 28, 3442–3444 (2000).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  80. López-Bigas, N. & Ouzounis, C. A. Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Res. 32, 3108–3114 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  81. Adie, E. A., Adams, R. R., Evans, K. L., Porteous, D. J. & Pickard, B. S. Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformat. 6, 55 (2005).

    Article  CAS  Google Scholar 

  82. Thornblad, T. A., Elliott, K. S., Jowett, J. & Visscher, P. M. Prioritization of positional candidate genes using multiple web-based software tools. Twin Res. Hum. Genet. 10, 861–870 (2007).

    Article  Google Scholar 

  83. Perez-Iratxeta, C., Wjst, M., Bork, P. & Andrade, M. A. G2D: a tool for mining genes associated with disease. BMC Genet. 6, 45 (2005).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  84. Hutz, J. E., Kraja, A. T., McLeod, H. L. & Province, M. A. CANDID: a flexible method for prioritizing candidate genes for complex human traits. Genet. Epidemiol. 32, 779–790 (2008).

    Article  PubMed  PubMed Central  Google Scholar 

  85. Cheng, D. et al. PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res. 36, W399–W405 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  86. Tiffin, N. et al. Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes. Nucleic Acids Res. 34, 3067–3081 (2006). This is an example of the application of prioritization to a complex disorder using multiple prediction algorithms to create a consensus.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  87. Teber, E. T., Liu, J. Y., Ballouz, S., Fatkin, D. & Wouters, M. A. Comparison of automated candidate gene prediction systems using genes implicated in type 2 diabetes by genome-wide association studies. BMC Bioinformatics 10 (Suppl. 1), S69 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  88. Elbers, C. C. et al. A strategy to search for common obesity and type 2 diabetes genes. Trends Endocrinol. Metab. 18, 19–26 (2007).

    Article  CAS  PubMed  Google Scholar 

  89. Thienpont, B. et al. Haploinsufficiency of TAB2 causes congenital heart defects in humans. Am. J. Hum. Genet. 86, 839–849 (2010). This is a biological validation of Endeavour that shows a role for TAB2 in human cardiac development.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  90. Qiao, Y. et al. Outcome of array CGH analysis for 255 subjects with intellectual disability and search for candidate genes using bioinformatics. Hum. Genet. 128, 179–194 (2010).

    Article  CAS  PubMed  Google Scholar 

  91. Hwang, S., Rhee, S. Y., Marcotte, E. M. & Lee, I. Systematic prediction of gene function in Arabidopsis thaliana using a probabilistic functional gene network. Nature Protoc. 6, 1429–1442 (2011).

    Article  CAS  Google Scholar 

  92. Hess, D. C. et al. Computationally driven, quantitative experiments discover genes required for mitochondrial biogenesis. PLoS Genet. 5, e1000407 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  93. Huttenhower, C. et al. Exploring the human genome with functional maps. Genome Res. 19, 1093–1106 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  94. Lee, I. et al. Genetic dissection of the biotic stress response using a genome-scale gene network for rice. Proc. Natl Acad. Sci. USA 108, 18548–18553 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  95. Kohavi, R. A. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proc. 15th Int. Joint Comp. Artificial Intelligence 2, 1137–1143 (1995).

    Google Scholar 

  96. Chen, Y. et al. In silico gene prioritization by integrating multiple data sources. PLoS ONE 6, e21137 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  97. Schuierer, S., Tranchevent, L.-C., Dengler, U. & Moreau, Y. Large-scale benchmark of Endeavour using MetaCore maps. Bioinformatics 26, 1922–1923 (2010).

    Article  CAS  PubMed  Google Scholar 

  98. Huttenhower, C. et al. The impact of incomplete knowledge on evaluation: an experimental benchmark for protein function prediction. Bioinformatics 25, 2404–2410 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  99. Erlich, Y. et al. Exome sequencing and disease-network analysis of a single family implicate a mutation in KIF1A in hereditary spastic paraparesis. Genome Res. 21, 658–664 (2011). This is a study in which traditional mapping methods, new sequencing tools and network analysis are combined to identify the causal mutation for a rare monogenic disease.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  100. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13 (2009).

    Article  CAS  Google Scholar 

  101. Szklarczyk, D. et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 39, D561–D568 (2011).

    Article  CAS  PubMed  Google Scholar 

  102. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protoc. 4, 44–57 (2009).

    Article  CAS  Google Scholar 

  103. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  104. Casci, T. Human disease: something old, something new. Nature Rev. Genet. 12, 382–383 (2011).

    Article  CAS  PubMed  Google Scholar 

  105. Gillis, J. & Pavlidis, P. The impact of multifunctional genes on “guilt by association” analysis. PLoS ONE 6, e17258 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  106. Gillis, J. & Pavlidis, P. “Guilt by association” is the exception rather than the rule in gene networks. PLoS Comput. Biol. 8, e1002444 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  107. Moult, J., Hubbard, T., Bryant, S. H., Fidelis, K. & Pedersen, J. T. Critical assessment of methods of protein structure prediction (CASP): round II. Proteins 29 (Suppl. 1), 2–6 (1997).

    Article  Google Scholar 

  108. Moult, J., Fidelis, K., Kryshtafovych, A. & Tramontano, A. Critical assessment of methods of protein structure prediction (CASP)—round IX. Proteins 79 (Suppl. 1), 1–5 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  109. Arighi, C. N. et al. BioCreative III interactive task: an overview. BMC Bioinformatics 12 (Suppl. 8), S4 (2011).

    Article  PubMed  PubMed Central  Google Scholar 

  110. Hirschman, L., Yeh, A., Blaschke, C. & Valencia, A. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 6 (Suppl. 1), S1 (2005).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  111. Tilstone, C. DNA microarrays: vital statistics. Nature 424, 610–612 (2003).

    Article  CAS  PubMed  Google Scholar 

  112. Johnson, K. & Lin, S. Call to work together on microarray data analysis. Nature 411, 885 (2001).

    Article  CAS  PubMed  Google Scholar 

  113. Prill, R. J., Saez-Rodriguez, J., Alexopoulos, L. G., Sorger, P. K. & Stolovitzky, G. Crowdsourcing network inference: the DREAM predictive signaling network challenge. Sci. Signal. 4, mr7 (2011).

    Article  PubMed  PubMed Central  Google Scholar 

  114. Stein, L. D. Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nature Rev. Genet. 9, 678–688 (2008).

    Article  CAS  PubMed  Google Scholar 

  115. Yoshida, Y. et al. PosMed (Positional Medline): prioritizing genes with an artificial neural network comprising medical documents to accelerate positional cloning. Nucleic Acids Res. 37, W147–W152 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  116. Mardis, E. R. et al. Recurring mutations found by sequencing an acute myeloid leukemia genome. N. Engl. J. Med. 361, 1058–1066 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  117. Lupski, J. R. et al. Whole-genome sequencing in a patient with Charcot–Marie–Tooth neuropathy. N. Engl. J. Med. 362, 1181–1191 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  118. Cooper, G. M. & Shendure, J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nature Rev. Genet. 12, 628–640 (2011).

    Article  CAS  PubMed  Google Scholar 

  119. Zhong, Q. et al. Edgetic perturbation models of human inherited disorders. Mol. Syst. Biol. 5, 321 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  120. Kuhn, M., von Mering, C., Campillos, M., Jensen, L. J. & Bork, P. STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res. 36, D684–D688 (2008).

    Article  CAS  PubMed  Google Scholar 

  121. Baron, D. et al. MADGene: retrieval and processing of gene identifier lists for the analysis of heterogeneous microarray datasets. Bioinformatics 27, 725–726 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  122. Chen, R., Li, L. & Butte, A. J. AILUN: reannotating gene expression data automatically. Nature Methods 4, 879 (2007).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  123. Robinson, P. N. et al. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet. 83, 610–615 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  124. Osborne, J. D. et al. Annotating the human genome with Disease Ontology. BMC Genomics 10 (Suppl. 1), S6 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  125. Smedley, D. et al. BioMart—biological queries made easy. BMC Genom. 10, 22 (2009).

    Article  CAS  Google Scholar 

  126. O'Brien, K. P., Remm, M. & Sonnhammer, E. L. L. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 33, D476–D480 (2005).

  127. Yu, H. et al. Annotation transfer between genomes: protein–protein interologs and protein-DNA regulogs. Genome Res. 14, 1107–1118 (2004).

  128. Ebermann, I. et al. A novel gene for Usher syndrome type 2: mutations in the long isoform of whirlin are associated with retinitis pigmentosa and sensorineural hearing loss. Hum. Genet. 121, 203–211 (2007).

  129. Barriot, R. et al. Collaboratively charting the gene-to-phenotype network of human congenital heart defects. Genome Med. 2, 16 (2010). This study describes CHDWiki, the first knowledge portal to annotate and analyse gene–phenotype networks collaboratively.

Acknowledgements

This work was supported in part by the following grants: KUL PFV/10/016 SymBioSys, KUL GOA MaNet, Hercules III PacBio RS and FP7-HEALTH CHeartED.

Author information

Corresponding authors

Correspondence to Yves Moreau or Léon-Charles Tranchevent.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary information S1 (table)

This document represents a tutorial about gene prioritization methods. (XLS 523 kb)

Related links

FURTHER INFORMATION

ArrayExpress

Ensembl Genome Browser

Gene Expression Omnibus

Gene Prioritization Portal

Gene Ontology

Genetic Association Database

GoPubmed

HUGO Gene Nomenclature Committee

Human Gene Mutation Database

Ingenuity Pathway Analysis

KU Leuven Bioinformatics Laboratory

KU Leuven SymBioSys Center for Computational Systems Biology

Kyoto Encyclopedia of Genes and Genomes (KEGG)

MetaCore (from GeneGO)

Online Mendelian Inheritance in Man (OMIM)

STRING

UCSC Genome Browser

Glossary

Homozygosity mapping

A form of recombination mapping that allows the localization of rare recessive traits by identifying unusually long stretches of homozygosity at consecutive markers.
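
As a rough illustration of the idea, the sketch below scans an ordered list of genotypes for long runs of consecutive homozygous markers. The function name `homozygous_runs`, the genotype encoding as allele pairs and the `min_length` threshold are assumptions made for this example; real analyses also account for marker spacing, allele frequencies and genotyping error.

```python
def homozygous_runs(genotypes, min_length):
    """Return (start, end) index pairs of runs of homozygous markers.

    genotypes: list of (allele_1, allele_2) tuples in chromosomal order.
    min_length: minimum number of consecutive homozygous markers to report.
    """
    runs = []
    start = None  # index at which the current homozygous run began
    for i, (a, b) in enumerate(genotypes):
        if a == b:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last marker.
    if start is not None and len(genotypes) - start >= min_length:
        runs.append((start, len(genotypes) - 1))
    return runs


# Hypothetical genotypes at seven consecutive markers.
markers = [("A", "G"), ("A", "A"), ("C", "C"), ("T", "T"),
           ("G", "G"), ("A", "C"), ("T", "T")]
print(homozygous_runs(markers, min_length=3))
```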

Guilt by association

A statistical rule of thumb that asserts that reliable predictions about the function or disease involvement ('guilt') of a gene or protein can generally be made if several of its partners (for example, genes with correlated expression profiles or protein–protein interaction partners) share a corresponding 'guilty' status ('association').
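
One simple way to turn this rule of thumb into a score is to count, for each candidate, the fraction of its partners that are already known to be 'guilty'. The sketch below assumes a toy network stored as a dictionary of partner lists; the gene names and the function `guilt_by_association` are hypothetical, and published tools use considerably richer scoring schemes.

```python
def guilt_by_association(network, known_genes):
    """Rank candidate genes by the fraction of their partners that are known genes.

    network: dict mapping each gene to a list of its partners.
    known_genes: iterable of genes already associated with the phenotype.
    """
    known = set(known_genes)
    scores = {}
    for gene, partners in network.items():
        if gene in known:
            continue  # only candidates are scored
        if partners:
            scores[gene] = sum(p in known for p in partners) / len(partners)
        else:
            scores[gene] = 0.0
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical interaction network: D1 and D2 are known disease genes.
toy_network = {
    "G1": ["D1", "D2", "G2"],
    "G2": ["G1", "G3"],
    "G3": ["G2", "D1"],
    "D1": ["G1", "G3"],
    "D2": ["G1"],
}
for gene, score in guilt_by_association(toy_network, ["D1", "D2"]):
    print(gene, round(score, 2))
```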

Machine learning methods

The design and development of algorithms that allow computers to learn automatically to recognize complex patterns in data and to make intelligent decisions on the basis of such data.

Principal components analysis

A statistical method that is used to simplify a complex data set by transforming a series of correlated variables into a smaller number of uncorrelated variables called principal components.
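
For two correlated variables the transformation can be written out by hand, which may help to make the definition concrete. The sketch below is a minimal two-dimensional case that uses the closed-form eigen-decomposition of a 2x2 covariance matrix; the function `pca_2d` and the sample values are assumptions for illustration, and real data sets would normally be handled with a numerical library.

```python
import math


def pca_2d(points):
    """Return (eigenvalues, first_axis, scores) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix (largest first).
    centre = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    eigenvalues = (centre + spread, centre - spread)
    # Unit vector along the first principal component.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(theta), math.sin(theta))
    # Coordinates of each centred point along the first component.
    scores = [(x - mean_x) * axis[0] + (y - mean_y) * axis[1] for x, y in points]
    return eigenvalues, axis, scores


# Hypothetical expression levels of two correlated genes in five samples.
samples = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
eigenvalues, axis, scores = pca_2d(samples)
explained = eigenvalues[0] / sum(eigenvalues)
print(f"first component explains {explained:.1%} of the variance")
```

Here a single component captures almost all of the variance, so the two correlated variables can be summarized by one uncorrelated score per sample.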

Interologue

A protein–protein interaction that is conserved between orthologous proteins in different species.

Random walk

A mathematical formalization of the path resulting from taking successive random steps. Classical examples of random walks are Brownian motion, the fortune of a gambler flipping a coin or fluctuations of the stock market. In the context of graphs, a random walk typically describes a process in which a 'walker' moves from one node of the graph into another with a probability proportional to the weight of the edge connecting them.
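
The graph case can be sketched in a few lines: edge weights are normalized into step probabilities, and the probability of finding the walker at each node is updated one step at a time. The graph, the node names and the function names below are hypothetical, and the sketch assumes that every node has at least one outgoing edge and that every neighbour is itself a key of the graph.

```python
def transition_probabilities(graph):
    """Convert edge weights into step probabilities for each node."""
    probabilities = {}
    for node, edges in graph.items():
        total = sum(edges.values())
        probabilities[node] = {nbr: w / total for nbr, w in edges.items()}
    return probabilities


def walk_distribution(graph, start, steps):
    """Probability of the walker being at each node after a number of steps."""
    probabilities = transition_probabilities(graph)
    distribution = {node: 0.0 for node in graph}
    distribution[start] = 1.0
    for _ in range(steps):
        updated = {node: 0.0 for node in graph}
        for node, mass in distribution.items():
            for neighbour, p in probabilities[node].items():
                updated[neighbour] += mass * p
        distribution = updated
    return distribution


# Hypothetical weighted graph: the A-B edge is three times heavier than A-C.
toy_graph = {
    "A": {"B": 3.0, "C": 1.0},
    "B": {"A": 3.0},
    "C": {"A": 1.0},
}
print(walk_distribution(toy_graph, "A", steps=1))
```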

Diffusion kernel

A type of kernel similarity matrix that is derived from the notion of a random walk on a graph. Diffusion kernels measure similarity between nodes of a graph (in this case, between genes) — for example, by estimating the average length of a random walk from one node to the other.
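
Several variants of diffusion kernels exist. The sketch below illustrates one common form, the matrix exponential exp(-βL) of the graph Laplacian L = D - A, which is not necessarily the variant used by any particular prioritization tool. The function names, the value of `beta` and the truncated Taylor series are assumptions chosen to keep the example dependency-free; the adjacency matrix is assumed to be symmetric with a zero diagonal.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(adjacency, beta=0.5, terms=30):
    """Approximate exp(-beta * L) for L = D - A with a truncated Taylor series."""
    n = len(adjacency)
    # H = -beta * L: off-diagonal entries are beta * A, diagonal is -beta * degree.
    h = [[beta * (adjacency[i][j] - (sum(adjacency[i]) if i == j else 0.0))
          for j in range(n)] for i in range(n)]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]
    for k in range(1, terms + 1):
        # After this update, term holds H^k / k!.
        term = [[value / k for value in row] for row in mat_mul(term, h)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return kernel


# Hypothetical path graph of three genes: g0 - g1 - g2.
path = [[0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0]]
similarity = diffusion_kernel(path)
print([round(value, 3) for value in similarity[0]])
```

In this toy graph the directly connected genes g0 and g1 receive a higher similarity than g0 and g2, which are linked only through g1.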

Locus heterogeneity

The appearance of phenotypically similar characteristics that result from mutations at different genetic loci. Differences in effect size or in replication between studies and samples are often ascribed to different loci leading to the same disease.

Multiple testing

A statistical problem that arises from carrying out multiple hypothesis tests together. P values obtained from hypothesis tests under the assumption of a single test must be appropriately corrected to reflect multiple testing.
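
Two widely used corrections can be sketched as follows: the Bonferroni adjustment, which controls the family-wise error rate, and the Benjamini-Hochberg adjustment, which controls the false discovery rate. The function names and the example P values are assumptions for illustration; statistical packages provide tested implementations.

```python
def bonferroni(pvalues):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical P values from four candidate-gene tests.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```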

About this article

Cite this article

Moreau, Y., Tranchevent, LC. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat Rev Genet 13, 523–536 (2012). https://doi.org/10.1038/nrg3253

  • DOI: https://doi.org/10.1038/nrg3253
