Phen-Gen: combining phenotype and genotype to analyze rare disorders


We introduce Phen-Gen, a method that combines patients' disease symptoms and sequencing data with prior domain knowledge to identify the causative genes for rare disorders. Simulations revealed that the causal variant was ranked first in 88% of cases when it was a coding variant—a 52% advantage over a genotype-only approach—and Phen-Gen outperformed other existing prediction methods by 13–58%. If disease etiology was unknown, the causal variant was assigned the top rank in 71% of simulations. Phen-Gen is available at


Figure 1: Comparison with VAAST, eXtasy and VAAST+PHEVOR.




This work was supported by the Agency for Science, Technology and Research (A*STAR), Singapore. We thank Radboud University Nijmegen Medical Centre for sharing the 100 intellectual disability patient data sets, particularly J. de Ligt for his help with this data. We also thank S. Köhler for his help with Phenomizer, N. Jinawath for her help interpreting patient symptoms, and S. Prabhakar and N. Clarke for their comments on the genomic predictor. We thank S. Prabhakar, S. Davila, A. Wilm and R. del Rosario for their comments on the manuscript.

Author information

A.J. conceived of and designed the project, designed and implemented the analysis framework, implemented methods, conducted experiments, interpreted results, wrote the initial manuscript and revised and proofread the paper. S.A. implemented methods, conducted experiments, set up the web server and revised and proofread the paper. P.C.N. conceived of and designed the project, revised and proofread the paper and supervised the project.

Correspondence to Asif Javed or Pauline C Ng.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Overall workflow.

Patient disease symptoms are matched against known disorders, and the probability of a symptomatic match is assigned to the genes implicated in the respective disorder. These probabilities are propagated to known gene associates using a random walk with restart on the interaction network. In parallel, the patient’s sequencing data are analyzed, and the damaging impact of each variant is estimated and pooled within genes. The two predictions are combined to implicate the gene(s) involved.
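The propagation step in this workflow can be illustrated with a small random walk with restart. The following is a minimal sketch, not the Phen-Gen implementation: the toy network, the gene names, the restart probability of 0.5 and the function name `random_walk_with_restart` are all assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seed_scores, restart=0.5,
                             tol=1e-10, max_iter=1000):
    """Propagate seed scores over an undirected gene network.

    adjacency   : dict mapping each gene to the set of its neighbours
    seed_scores : dict mapping seed genes to a phenotype-match score
    restart     : probability of jumping back to the seeds (assumed value)
    """
    genes = sorted(adjacency)
    total = sum(seed_scores.values())
    if total <= 0:
        raise ValueError("at least one seed gene needs a positive score")
    # Normalised restart distribution over all genes in the network.
    p0 = {g: seed_scores.get(g, 0.0) / total for g in genes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {g: restart * p0[g] for g in genes}
        for g in genes:
            neighbours = adjacency[g]
            if not neighbours:
                # An isolated gene keeps its own walking mass.
                nxt[g] += (1 - restart) * p[g]
                continue
            share = (1 - restart) * p[g] / len(neighbours)
            for n in neighbours:
                nxt[n] += share
        delta = sum(abs(nxt[g] - p[g]) for g in genes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical chain A - B - C plus an unconnected gene D; A is the seed.
toy_network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
toy_scores = random_walk_with_restart(toy_network, {"A": 1.0})
```

In this toy example the score decays with network distance from the seed gene, and the unconnected gene receives nothing.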

Supplementary Figure 2 Distribution of SIFT and PolyPhen-2 scores for damaging and benign nonsynonymous mutations.

The distributions of SIFT and PolyPhen-2 scores are shown for HGMD-reported damaging nonsynonymous mutations and for neutral nonsynonymous fixed substitutions inferred from human-chimp alignment. The plots indicate general agreement between the two methods.

Supplementary Figure 3 Deleteriousness predictions around splice site.

The figure depicts the probability of deleteriousness around donor and acceptor sites for splice site mutations.

Supplementary Figure 4 Probability of deleteriousness using the genomic predictor.

The figure illustrates the predicted deleteriousness of different combinations of five annotations: GERP++ (G), PhyloP (P), near-genic (N), transcription factor binding sites (T), and DNase hypersensitive sites (D). The predictions are binned according to the number of annotations (shown on the x-axis). Each bin is further sorted canonically, following the aforementioned order of annotations.
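The binning and ordering described in this caption can be enumerated in a few lines. This is a sketch under assumptions: the canonical sort is read here as lexicographic order over G, P, N, T, D, the empty combination is included as a bin of its own, and the function name `binned_combinations` is hypothetical.

```python
from itertools import combinations

# One-letter codes in the order given in the caption.
ANNOTATIONS = "GPNTD"


def binned_combinations(annotations=ANNOTATIONS):
    """Group every annotation combination by how many annotations it has.

    itertools.combinations yields tuples in the order of the input
    sequence, so each bin comes out sorted by the G, P, N, T, D order.
    """
    return {
        k: ["".join(combo) for combo in combinations(annotations, k)]
        for k in range(len(annotations) + 1)
    }


bins = binned_combinations()
for count, combos in bins.items():
    print(count, combos)
```

With five annotations this gives 32 combinations in total, with the two-annotation bin running from GP to TD.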

Supplementary Figure 5 Confidence intervals for positive and benign mutation set combinations.

The 90% confidence intervals for different combinations of genomic annotations are shown. The order from Supplementary Figure 4 is maintained. The four sub-figures represent combinations of the two positive sets (HGMD and GWAS) and the two neutral sets (common variation in dbSNP and Complete Genomics MAF>0.30), respectively.

Supplementary Figure 6 Histograms of the null distribution of deleteriousness of genes.

The top 1 percentile of damaging variants in each gene is shown. Histograms of this null distribution cutoff are given for all genes under dominant and recessive inheritance patterns, for both the coding and genomic predictors. Most genes do not harbor any putative damaging variants, so the distributions are dominated by the leftmost bar, which has been truncated for better visual representation.
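A per-gene null cutoff of this kind could be computed as sketched below. This is an illustration only: the nearest-rank percentile rule, the example scores and the names `null_cutoff` and `exceeds_null` are assumptions, and the source does not specify how the percentile is calculated.

```python
def null_cutoff(null_scores, top_percent=1):
    """Return the score marking the top `top_percent` of a null sample.

    Uses the nearest-rank rule with integer arithmetic: the cutoff is
    the value at rank ceil((100 - top_percent) * n / 100) in sorted order.
    """
    if not null_scores:
        raise ValueError("null_scores must not be empty")
    ordered = sorted(null_scores)
    n = len(ordered)
    rank = -(-(100 - top_percent) * n // 100)  # ceiling division
    return ordered[max(rank, 1) - 1]


def exceeds_null(patient_score, null_scores, top_percent=1):
    """True when a patient's gene score is above the gene's null cutoff."""
    return patient_score > null_cutoff(null_scores, top_percent)


# Hypothetical gene: most null individuals carry no damaging variant.
example_null = [0.0] * 197 + [0.2, 0.5, 0.9]
example_cutoff = null_cutoff(example_null)
```

A gene whose null sample contains only zeros gets a cutoff of zero, which matches the observation that the leftmost bar dominates the histograms.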

Supplementary Figure 7 Performance of variant predictors.

The distribution of damaging probabilities assigned to different classes of HGMD variants is shown. The top three panels employ the coding predictor. The genomic predictor was used for the bottom panel and applied to noncoding regulatory variants. The histograms depict the distribution of the scored variants. The pie charts on the right show the proportions of omitted and predicted variants in each category. Common variants (white) were observed in 1000 Genomes, ESP, or dbSNP with MAF 0.01. Commonly mutated genes (light green) indicates variants that failed to exceed the null distribution of the respective gene. Missed (dark blue) indicates variants that fell outside our regions of interest.

Supplementary Figure 8 Prediction of heterozygous variants.

The figure depicts how compound heterozygous variants are evaluated. When both damaging variants reside within the coding region, the coding predictor is used to estimate their damaging impact. When one or both variants lie outside the exon boundaries, both variants are evaluated using the genomic predictor.
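The selection rule in this caption amounts to a single conditional. The sketch below assumes a hypothetical `Variant` record and predictor labels; it shows only the choice of predictor, not how either predictor scores a variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record used only for this example."""
    gene: str
    position: int
    in_coding_region: bool


def choose_predictor(first, second):
    """Pick the predictor for a compound heterozygous pair.

    The coding predictor is used only when both variants are coding;
    otherwise both variants are scored with the genomic predictor.
    """
    if first.in_coding_region and second.in_coding_region:
        return "coding"
    return "genomic"


# Illustrative pair: one exonic and one non-exonic variant in the same gene.
exonic = Variant(gene="GENE1", position=1200, in_coding_region=True)
intronic = Variant(gene="GENE1", position=5400, in_coding_region=False)
predictor = choose_predictor(exonic, intronic)
```

Scoring both members of a pair with the same predictor keeps their probabilities on a common scale before they are combined.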

Supplementary Figure 9 Phen-Gen and VAAST comparison for phenotypically heterogeneous disorders.

The comparison of Phen-Gen and VAAST in simulations using 44 phenotypically heterogeneous disorders and nonsynonymous mutations in HGMD is shown. Both panels depict the ability of each method to narrow the search for the true gene to within 1, 5 and 10 genes. For Phen-Gen, the bar is split into the predictive power based on genotypic prediction and the added advantage gained from disease symptoms. VAAST uses only the genomic data and assigns multiple genes the same rank at the top of the order. For a fair comparison, the true gene was assigned the worst, average and best rank among similarly ranked peers. The three components of the bar reflect the performance across these scenarios.
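The worst, average and best ranks of a tied gene can be computed as follows. This is a minimal sketch with made-up scores; the function name `tied_ranks` and the convention that higher scores rank first are assumptions.

```python
def tied_ranks(scores, true_gene):
    """Return (best, average, worst) rank of `true_gene` among its ties.

    scores maps gene names to prediction scores, higher meaning more
    likely causal. Ranks start at 1.
    """
    target = scores[true_gene]
    higher = sum(1 for value in scores.values() if value > target)
    tied = sum(1 for value in scores.values() if value == target)
    best = higher + 1
    worst = higher + tied
    average = (best + worst) / 2
    return best, average, worst


# Hypothetical output in which three genes share the top score.
example_scores = {"GENE_A": 1.0, "GENE_B": 1.0, "GENE_C": 1.0, "GENE_D": 0.5}
example_ranks = tied_ranks(example_scores, "GENE_B")
```

Here the true gene could be reported at rank 1, 2 or 3 depending on how ties are broken, which is why the caption reports all three scenarios.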

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–9, Supplementary Tables 1–7 and Supplementary Note (PDF 5144 kb)


Cite this article

Javed, A., Agrawal, S. & Ng, P. Phen-Gen: combining phenotype and genotype to analyze rare disorders. Nat Methods 11, 935–937 (2014).
