Phenolyzer: phenotype-based prioritization of candidate genes for human diseases

Journal name: Nature Methods
Volume: 12
Pages: 841–843
DOI: 10.1038/nmeth.3484

Prior biological knowledge and phenotype information may help to identify disease genes from human whole-genome and whole-exome sequencing studies. We developed Phenolyzer (http://phenolyzer.usc.edu), a tool that uses prior information to implicate genes involved in diseases. Phenolyzer exhibits superior performance over competing methods for prioritizing Mendelian and complex disease genes, based on disease or phenotype terms entered as free text.

At a glance

Figures

  1. Figure 1: Workflow of Phenolyzer.

    (1) Disease match: each disease or phenotype query term is separately translated into sets of disease names by word match, offspring search, synonym retrieval and phenotype interpretation in disease name databases. (2) Gene query: each retrieved disease name is queried in the gene-disease databases based on an exact match, to get a list of genes. (3) Gene score system: a score based on the type and confidence of the gene-disease relationship is generated for each gene corresponding to each disease name. Then, for each input term, a weighted sum score is calculated for each reported gene by adding all the scores retrieved in the previous step. The seed gene set is generated by collating all the genes of all input terms, and each gene score is normalized. (4) Seed gene growth: candidate disease genes are expanded beyond the seed gene set based on four types of gene-gene relationships; scores are calculated for all genes that connect with seed genes. (5) Gene ranking: all the information is integrated to generate a score for each gene, with the weights trained from a logistic regression model. The scores are renormalized to produce the final prioritized gene list. HPRD, Human Protein Reference Database; HTRI, Human Transcriptional Regulation Interaction Database.

  2. Figure 2: Comparison between Phenolyzer and other tools to find well-known monogenic disease genes and predict recently published novel disease genes.

    (a) The ranking distribution of genes for 14 monogenic diseases. (b) The ranking distribution of 55 recently published disease genes from four human genetics journals.

  3. Supplementary Fig. 1: An example of the gene-term-disease network automatically generated by the Phenolyzer web server for ‘autism’.

    The largest word represents the user’s input term, ‘Autism’. The pink rounded rectangles represent disease names corresponding to each term. The balls represent genes. The darker the color, the more the node contributes. The longer the rounded rectangle, the more a disease name contributes. The larger the ball, the more significantly a gene is related to the term. A reported gene is blue and a predicted gene is yellow. Four types of edges correspond to four different types of gene-gene relationships as illustrated in the legend. In the web server, the figure can be zoomed with the mouse to allow closer examination of the edge types.

  4. Supplementary Fig. 2: The wordcloud for all the interpreted names related to ‘Cancer’.

    This is the wordcloud generated by Phenolyzer corresponding to the input term ‘cancer’. Some of the most frequently occurring words include ‘cancer’, ‘neoplasm’, ‘cell’, ‘carcinoma’, ‘tumor’ and ‘malignant’.

  5. Supplementary Fig. 3: A snapshot of the output from the Phenolyzer web server.

    (a) ‘Summary’ includes the link to the wordcloud, as well as the input settings and the links to output files for download. (b) ‘Barplot’ is a bar plot of up to 500 of the highest-ranked genes with normalized scores. (c) ‘Network’ is the interactive gene-disease-term network, with control buttons below it. (d) ‘Details’ shows how the score for each gene is calculated, including the links to each publication or database website.

  6. Supplementary Fig. 4: Comparison between Phenolyzer and other tools to find disease genes for cancer, rheumatoid arthritis, autism and anemia.

    AUC and ROC curve plots showing the performance comparison between Phenolyzer and other tools on four gene sets of different complex diseases. (a, c, e and g) For each tool, the AUC is calculated as the area under the ROC curve. (b, d, f and h) The ROC curve is plotted as true positive rate versus false positive rate. ‘Phenolyzer Phenotype’ is the Phenolyzer result with phenotype terms as input (the same input as Phenomizer). ‘Phenolyzer Logistic’ is Phenolyzer with weights trained with a logistic regression model, compared with ‘Phenolyzer no training’. ‘Phenolyzer Seed’ is Phenolyzer’s seed gene result without the seed gene growth step, thus representing only the genes found in Phenolyzer’s disease-gene mapping knowledgebase.

  7. Supplementary Fig. 5: Evaluation of Phenolyzer by phenotype terms as input.

    Phenolyzer with phenotype terms (rather than disease names) as input is able to prioritize most genes as ‘Top 1’ for the 14 monogenic diseases, a performance similar to that of Phenomizer.

  8. Supplementary Fig. 6: Phenolyzer’s results on four case studies.

    (a and b) The candidate gene lists generated from two studies on ‘Craniopharyngiomas’ and ‘SHORT syndrome’ were used as input into Phenolyzer. The network plot shows that BRAF and PIK3R1 are the genes with the highest scores for the two diseases, respectively. (c) For the CNV study of ‘Osteoporosis’, the significant CNV regions generated were used as input, and the Phenolyzer network successfully identified the correct gene, UGT2B17. (d) Combined with wANNOVAR, we first filtered the variants into a small list, then used all the genes in the variant list as the input into Phenolyzer. The correct gene PKLR was identified as the top gene for ‘hemolytic anemia’.

  9. Supplementary Fig. 7: Illustration of the disease or phenotype term interpretation process.

    The term is first processed through a word match against several different data sources: DO (Disease Ontology), the CTD Medic disease ontology vocabulary, HPO (Human Phenotype Ontology), OMIM synonyms, OMIM descriptors and Phenolyzer’s compiled disease vocabulary. After the first match, the disease names are directly returned for Phenolyzer’s compiled disease names and OMIM synonyms. For DO and CTD, an ontology search retrieves all the descendant disease names and synonyms. OMIM descriptors are mapped into OMIM diseases with a conditional probability as reliability. For HPO, an ontology search first finds all the descendant phenotypes, then the phenotypes are mapped into diseases with reliabilities.

  10. Supplementary Fig. 8: Learning curve through the training with Gradient Descent Algorithm.

    The gradient descent algorithm iteratively reduces the cost by moving the weight vector w in the direction opposite to the gradient of the cost function. After 10,000 iterations with a learning rate of 1, the slope of the curve is close to 0, which demonstrates that the number of learning steps is sufficient.
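The training procedure behind step 5 of Figure 1 and the learning curve in Supplementary Fig. 8 can be illustrated with a minimal sketch of logistic regression fitted by batch gradient descent. This is not Phenolyzer's own code: the function names, the three-column feature layout (a bias term plus two hypothetical per-gene scores) and the toy data are all assumptions made for illustration. Only the learning rate of 1 and the 10,000 iterations come from the legend above.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    """Predicted probability that a gene with feature vector x is a disease gene."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def cost(w, X, y):
    """Mean cross-entropy cost of weight vector w on the data set (X, y)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(w, xi)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1.0 - p + eps)
    return total / len(X)


def train(X, y, learning_rate=1.0, iterations=10000):
    """Batch gradient descent; returns the weights and the cost after each step."""
    n = len(X)
    w = [0.0] * len(X[0])
    history = []
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            error = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += error * xj / n
        # Move w in the direction opposite to the gradient of the cost.
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]
        history.append(cost(w, X, y))
    return w, history


# Hypothetical toy data: [bias, seed-gene score, gene-gene relationship score].
X = [
    [1.0, 0.9, 0.8], [1.0, 0.8, 0.3], [1.0, 0.7, 0.9], [1.0, 0.2, 0.6],
    [1.0, 0.1, 0.2], [1.0, 0.3, 0.1], [1.0, 0.6, 0.7], [1.0, 0.0, 0.4],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = known disease gene, 0 = other gene

weights, history = train(X, y)
print("trained weights:", [round(wj, 3) for wj in weights])
print("cost after first and last step:", history[0], history[-1])
```

Plotting `history` against the iteration number gives a learning curve of the kind described in the legend: the cost falls quickly at first and then flattens, and a slope near zero at the end suggests that further iterations would change the weights very little.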

References

  1. Lyon, G.J. & Wang, K. Genome Med. 4, 58 (2012).
  2. Wang, K., Li, M. & Hakonarson, H. Nucleic Acids Res. 38, e164 (2010).
  3. Cingolani, P. et al. Fly (Austin) 6, 80–92 (2012).
  4. McLaren, W. et al. Bioinformatics 26, 2069–2070 (2010).
  5. Jäger, M. et al. Hum. Mutat. 35, 548–555 (2014).
  6. Habegger, L. et al. Bioinformatics 28, 2267–2269 (2012).
  7. Bamshad, M.J. et al. Nat. Rev. Genet. 12, 745–755 (2011).
  8. Aerts, S. et al. Nat. Biotechnol. 24, 537–544 (2006).
  9. Schlicker, A., Lengauer, T. & Albrecht, M. Bioinformatics 26, i561–i567 (2010).
  10. Singleton, M.V. et al. Am. J. Hum. Genet. 94, 599–610 (2014).
  11. Javed, A., Agrawal, S. & Ng, P.C. Nat. Methods 11, 935–937 (2014).
  12. Safran, M. et al. Database 2010, baq020 (2010).
  13. Makita, Y. et al. Nucleic Acids Res. 41, W109–W114 (2013).
  14. Yue, P., Melamud, E. & Moult, J. BMC Bioinformatics 7, 166 (2006).
  15. Köhler, S. et al. Am. J. Hum. Genet. 85, 457–464 (2009).
  16. Robinson, P.N. et al. Genome Res. 24, 340–348 (2014).
  17. Chial, H. Nat. Educ. 1, 192 (2008).
  18. Saunders, C.J. et al. Sci. Translat. Med. 4, 154ra135 (2012).
  19. Forbes, S.A. et al. Nucleic Acids Res. 39, D945–D950 (2011).
  20. De Rubeis, S. et al. Nature 515, 209–215 (2014).
  21. Zhang, R. et al. Database 2014, bau090 (2014).
  22. van der Harst, P. et al. Nature 492, 369–375 (2012).
  23. Chang, X. & Wang, K. J. Med. Genet. 49, 433–436 (2012).
  24. Amberger, J., Bocchini, C. & Hamosh, A. Hum. Mutat. 32, 564–567 (2011).
  25. Rath, A. et al. Hum. Mutat. 33, 803–808 (2012).
  26. Landrum, M.J. et al. Nucleic Acids Res. 42, D980–D985 (2014).
  27. Pagon, R.A. et al. GeneReviews http://www.ncbi.nlm.nih.gov/books/NBK1116/?partid=1250 (1993).
  28. Hindorff, L.A., Junkins, H.A., Mehta, J. & Manolio, T. A Catalog of Published Genome-Wide Association Studies. https://www.genome.gov/26525384 (National Human Genome Research Institute, 2011).
  29. Peri, S. et al. Nucleic Acids Res. 32, D497–D501 (2004).
  30. Geer, L.Y. et al. Nucleic Acids Res. 38, D492–D496 (2010).
  31. Seal, R.L., Gordon, S.M., Lush, M.J., Wright, M.W. & Bruford, E.A. Nucleic Acids Res. 39, D514–D519 (2011).
  32. Bovolenta, L.A., Acencio, M.L. & Lemke, N. BMC Genomics 13, 405 (2012).
  33. Davis, A.P., Wiegers, T.C., Rosenstein, M.C. & Mattingly, C.J. Database 2012, bar065 (2012).
  34. Schriml, L.M. et al. Nucleic Acids Res. 40, D940–D946 (2012).
  35. Robinson, P.N. & Mundlos, S. Clin. Genet. 77, 525–534 (2010).
  36. Burren, O.S. et al. Nucleic Acids Res. 39, D997–D1001 (2011).
  37. Lim, J.E. et al. BMC Med. Inform. Decis. Mak. 10, 76 (2010).
  38. Elding, H., Lau, W., Swallow, D.M. & Maniatis, N. Am. J. Hum. Genet. 92, 107–113 (2013).
  39. Liu, H. et al. Nucleic Acids Res. 39, D991–D996 (2011).

Author information

Affiliations

  1. Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California, USA.

    • Hui Yang &
    • Kai Wang
  2. Neuroscience Graduate Program, University of Southern California, Los Angeles, California, USA.

    • Hui Yang
  3. Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, Berlin, Germany.

    • Peter N Robinson
  4. Max Planck Institute for Molecular Genetics, Berlin, Germany.

    • Peter N Robinson
  5. Berlin Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Berlin, Germany.

    • Peter N Robinson
  6. Institute for Bioinformatics, Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany.

    • Peter N Robinson
  7. Department of Psychiatry, University of Southern California, Los Angeles, California, USA.

    • Kai Wang
  8. Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, California, USA.

    • Kai Wang

Contributions

H.Y. compiled the data, performed the computational experiments, developed software tools and drafted the manuscript. P.N.R. advised on phenotype data analysis and interpretation. K.W. designed the study, supervised its execution and revised the manuscript.

Competing financial interests

K.W. is a board member and stock holder of Tute Genomics, a bioinformatics software company.

Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: An example of the gene-term-disease network automatically generated by the Phenolyzer web server for ‘autism’. (239 KB)

  2. Supplementary Figure 2: The wordcloud for all the interpreted names related to ‘Cancer’. (347 KB)

  3. Supplementary Figure 3: A snapshot of the output from the Phenolyzer web server. (82 KB)

  4. Supplementary Figure 4: Comparison between Phenolyzer and other tools to find disease genes for cancer, rheumatoid arthritis, autism and anemia. (237 KB)

  5. Supplementary Figure 5: Evaluation of Phenolyzer by phenotype terms as input. (42 KB)

  6. Supplementary Figure 6: Phenolyzer’s results on four case studies. (146 KB)

  7. Supplementary Figure 7: Illustration of the disease or phenotype term interpretation process. (75 KB)

  8. Supplementary Figure 8: Learning curve through the training with Gradient Descent Algorithm. (39 KB)

PDF files

  1. Supplementary Text and Figures (2,289 KB)

    Supplementary Figures 1–8, Supplementary Tables 1 and 2, and Supplementary Note 1

Excel files

  1. Supplementary Data 1 (18,782 KB)

    The result dataset of causal genes for the 14 monogenic diseases.

  2. Supplementary Data 2 (29,510 KB)

    The result dataset of the 590 monogenic disease genes.

  3. Supplementary Data 3 (9,220 KB)

    The result dataset of candidate genes for the four complex diseases

  4. Supplementary Data 4 (46,166 KB)

    The result dataset of novel discovered genes from four high-profile human genetics journals.

  5. Supplementary Data 5 (20,161 KB)

    The original prioritized gene lists for ‘craniopharyngiomas’

  6. Supplementary Data 6 (19,266 KB)

    The original prioritized gene lists for ‘SHORT syndrome’

  7. Supplementary Data 7 (10,679 KB)

    The original prioritized gene lists for ‘osteoporosis’

  8. Supplementary Data 8 (47,700 KB)

    The full disease names and HPO identifiers for the term ‘autism’

Zip files

  1. Supplementary Software (216 KB)

    Phenolyzer Software
