Phen-Gen: combining phenotype and genotype to analyze rare disorders


We introduce Phen-Gen, a method that combines patients' disease symptoms and sequencing data with prior domain knowledge to identify the causative genes for rare disorders. Simulations revealed that the causal variant was ranked first in 88% of cases when it was a coding variant—a 52% advantage over a genotype-only approach—and Phen-Gen outperformed other existing prediction methods by 13–58%. If disease etiology was unknown, the causal variant was assigned the top rank in 71% of simulations. Phen-Gen is available at

Figure 1: Comparison with VAAST, eXtasy and VAAST+PHEVOR.




This work was supported by the Agency for Science, Technology and Research (A*STAR), Singapore. We thank Radboud University Nijmegen Medical Centre for sharing the 100 intellectual disability patient data sets, particularly J. de Ligt for his help with this data. We also thank S. Köhler for his help with Phenomizer, N. Jinawath for her help interpreting patient symptoms, and S. Prabhakar and N. Clarke for their comments on the genomic predictor. We thank S. Prabhakar, S. Davila, A. Wilm and R. del Rosario for their comments on the manuscript.

Author information




A.J. conceived of and designed the project, designed and implemented the analysis framework, implemented methods, conducted experiments, interpreted results, wrote the initial manuscript and revised and proofread the paper. S.A. implemented methods, conducted experiments, set up the web server and revised and proofread the paper. P.C.N. conceived of and designed the project, revised and proofread the paper and supervised the project.

Corresponding authors

Correspondence to Asif Javed or Pauline C Ng.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Overall workflow.

Patient disease symptoms are matched against known disorders, and the probability of a symptomatic match is assigned to the genes implicated in each disorder. These probabilities are propagated to known gene associates using a random walk with restart on the interaction network. In parallel, the patient’s sequencing data are analyzed, and the damaging impact of each variant is estimated and pooled within genes. The two predictions are then combined to implicate the gene(s) involved.
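The propagation step in this workflow can be illustrated with a minimal random walk with restart. This is a sketch under stated assumptions, not Phen-Gen's implementation: the toy network, the gene names, the seed probabilities and the restart probability of 0.5 are all hypothetical, and the function name `random_walk_with_restart` is introduced here only for illustration.

```python
def random_walk_with_restart(adjacency, seed_scores, restart=0.5,
                             tol=1e-9, max_iter=1000):
    """Spread seed scores over an undirected network.

    adjacency: dict mapping each gene to a list of neighbouring genes
               (every neighbour must also be a key).
    seed_scores: dict mapping seed genes to non-negative scores.
    restart: probability of jumping back to the seeds at each step
             (hypothetical value, not taken from the paper).
    """
    genes = sorted(adjacency)
    total = sum(seed_scores.values())
    # Normalised starting distribution concentrated on the seed genes.
    start = {g: seed_scores.get(g, 0.0) / total for g in genes}
    current = dict(start)
    for _ in range(max_iter):
        # Restart component: a fixed share of mass returns to the seeds.
        updated = {g: restart * start[g] for g in genes}
        for g in genes:
            share = (1.0 - restart) * current[g]
            neighbours = adjacency[g]
            if neighbours:
                # Walk component: split the remaining mass among neighbours.
                for n in neighbours:
                    updated[n] += share / len(neighbours)
            else:
                # Isolated gene: the walking mass stays where it is.
                updated[g] += share
        change = max(abs(updated[g] - current[g]) for g in genes)
        current = updated
        if change < tol:
            break
    return current


# Toy chain network A - B - C - D with a single seed gene A.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(network, {"A": 1.0})
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 4))
```

In this toy example the seed gene keeps the largest share of the probability mass, and the score decays with distance from the seed, which is the behaviour the caption describes for genes associated with a symptom-matched gene.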

Supplementary Figure 2 Distribution of SIFT and PolyPhen-2 scores for damaging and benign nonsynonymous mutations.

The distributions of SIFT and PolyPhen-2 scores are shown for HGMD-reported damaging nonsynonymous mutations and for neutral nonsynonymous fixed substitutions inferred from the human-chimp alignment. The plots indicate general agreement between the two methods. Source data

Supplementary Figure 3 Deleteriousness predictions around splice site.

The figure depicts the probability of deleteriousness around donor and acceptor sites for splice site mutations. Source data

Supplementary Figure 4 Probability of deleteriousness using the genomic predictor.

The figure illustrates the predicted deleteriousness of different combinations of five annotations: GERP++ (G), PhyloP (P), near-genic (N), transcription factor binding sites (T) and DNase hypersensitive sites (D). The predictions are binned according to the number of annotations (shown on the x-axis). Within each bin, the combinations are sorted canonically, following the order of annotations listed above. Source data
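The binning and ordering described in this caption can be reproduced with a short enumeration. The sketch below assumes that bins cover every annotation count from zero to five; whether the figure includes the empty combination is not stated, and the label `none` and the function name `binned_combinations` are hypothetical.

```python
from itertools import combinations

# Single-letter annotation codes in the order given in the caption.
ANNOTATIONS = ["G", "P", "N", "T", "D"]


def binned_combinations(annotations):
    """Group all annotation combinations by how many annotations they contain.

    itertools.combinations emits tuples in the order of the input list,
    so each bin follows the G, P, N, T, D order.
    """
    bins = {}
    for size in range(len(annotations) + 1):
        bins[size] = ["".join(combo) or "none"
                      for combo in combinations(annotations, size)]
    return bins


bins = binned_combinations(ANNOTATIONS)
for size, combos in bins.items():
    print(size, combos)
```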

Supplementary Figure 5 Confidence intervals for positive and benign mutation set combinations.

The 90% confidence intervals for different combinations of genomic annotations are shown, in the same order as in Supplementary Figure 4. The four sub-figures represent the combinations of the two positive sets (HGMD and GWAS) and the two neutral sets (common variation in dbSNP and Complete Genomics MAF>0.30). Source data

Supplementary Figure 6 Histograms of the null distribution of deleteriousness of genes.

The top 1 percentile of damaging variants in each gene serves as the null distribution cutoff for that gene. Histograms of this cutoff across all genes are shown for the dominant and recessive inheritance patterns and for the coding and genomic predictors. Most genes do not harbor any putative damaging variants, so the distributions are dominated by the leftmost bar, which has been truncated for better visual representation. Source data
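A per-gene cutoff of this kind can be sketched with a nearest-rank percentile. The code is an illustration only: how the null scores are generated is not described in this caption, so the input list is hypothetical, as are the function names `null_cutoff` and `exceeds_null` and the choice of returning 0.0 for a gene with no damaging variants.

```python
def null_cutoff(null_scores, top_percent=1):
    """Return the score that separates the top `top_percent` of null scores.

    Uses the nearest-rank method with integer arithmetic to avoid
    floating-point rounding at the boundary.
    """
    if not null_scores:
        # Assumption: a gene with no putative damaging variants gets cutoff 0.
        return 0.0
    ordered = sorted(null_scores)
    # Ceiling of n * (100 - top_percent) / 100, computed on integers.
    rank = -(-len(ordered) * (100 - top_percent) // 100)
    return ordered[max(rank, 1) - 1]


def exceeds_null(score, cutoff):
    """A variant is retained only if it scores strictly above the gene's cutoff."""
    return score > cutoff


# Hypothetical null scores for one gene: the integers 0 to 99.
cutoff = null_cutoff(list(range(100)))
print(cutoff, exceeds_null(99, cutoff), exceeds_null(98, cutoff))
```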

Supplementary Figure 7 Performance of variant predictors.

The distribution of damaging probabilities assigned to different classes of HGMD variants is shown. The top three panels use the coding predictor; the bottom panel uses the genomic predictor, applied to noncoding regulatory variants. The histograms depict the distribution of the scored variants. The pie charts on the right show the proportions of omitted and predicted variants in each category. Common variants (white) were observed in 1000 Genomes, ESP, or dbSNP with MAF 0.01. Commonly mutated genes (light green) indicates variants that failed to exceed the null distribution of the respective gene. Missed (dark blue) indicates variants that fell outside our regions of interest. Source data

Supplementary Figure 8 Prediction of heterozygous variants.

The figure depicts how compound heterozygous variants are evaluated. When both damaging variants reside within the coding region, the coding predictor is used to estimate their damaging impact. When one or both variants lie outside the exon boundaries, both variants are evaluated using the genomic predictor.
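The selection rule in this caption reduces to a single condition on the pair of variants. The following sketch assumes each variant carries a precomputed flag saying whether it falls in the coding region; the `Variant` class, its fields and the function `predictor_for_pair` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    position: int
    in_coding_region: bool  # hypothetical flag, assumed to be annotated upstream


def predictor_for_pair(first, second):
    """Pick the predictor used to score a compound heterozygous pair.

    The coding predictor applies only when both variants are coding;
    otherwise both variants are scored with the genomic predictor.
    """
    if first.in_coding_region and second.in_coding_region:
        return "coding"
    return "genomic"


exonic = Variant("GENE1", 1200, True)
intronic = Variant("GENE1", 5400, False)
print(predictor_for_pair(exonic, exonic))    # both coding
print(predictor_for_pair(exonic, intronic))  # one noncoding
```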

Supplementary Figure 9 Phen-Gen and VAAST comparison for phenotypically heterogeneous disorders.

The comparison of Phen-Gen and VAAST in simulations using 44 phenotypically heterogeneous disorders and nonsynonymous mutations in HGMD is shown. Both panels depict the ability of each method to narrow the search for the true gene to within 1, 5 and 10 genes. For Phen-Gen, the bar is split into the predictive power based on genotypic prediction and the added advantage gained from disease symptoms. VAAST uses only the genomic data and assigns multiple genes the same rank at the top of the order. For a fair comparison, the true gene was assigned the worst, average and best rank among its similarly ranked peers; the three components of the bar reflect performance across these scenarios. Source data
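The tie handling described here can be made concrete with a small helper that reports the best, average and worst rank of the true gene among genes with the same score. This is a sketch under the assumption that a higher score means a better rank; the gene names, scores and the function name `tie_rank_bounds` are hypothetical.

```python
def tie_rank_bounds(scores, true_gene):
    """Return (best, average, worst) rank of true_gene, where rank 1 is the top.

    Genes with exactly the same score as the true gene are treated as tied.
    """
    target = scores[true_gene]
    higher = sum(1 for value in scores.values() if value > target)
    tied = sum(1 for value in scores.values() if value == target)  # includes true_gene
    best = higher + 1
    worst = higher + tied
    average = (best + worst) / 2
    return best, average, worst


# Three genes share the top score, so the true gene B can rank 1st to 3rd.
example_scores = {"A": 0.9, "B": 0.9, "C": 0.9, "D": 0.5}
print(tie_rank_bounds(example_scores, "B"))
print(tie_rank_bounds(example_scores, "D"))
```

With three genes tied at the top, the true gene counts as found within the top 1 gene only under the best-rank scenario, which is why the three scenarios can give different bar heights.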

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–9, Supplementary Tables 1–7 and Supplementary Note (PDF 5144 kb)


Cite this article

Javed, A., Agrawal, S. & Ng, P. Phen-Gen: combining phenotype and genotype to analyze rare disorders. Nat Methods 11, 935–937 (2014).
