We introduce Phen-Gen, a method that combines patients' disease symptoms and sequencing data with prior domain knowledge to identify the causative genes for rare disorders. Simulations revealed that the causal variant was ranked first in 88% of cases when it was a coding variant—a 52% advantage over a genotype-only approach—and Phen-Gen outperformed other existing prediction methods by 13–58%. If disease etiology was unknown, the causal variant was assigned the top rank in 71% of simulations. Phen-Gen is available at http://phen-gen.org/.
This work was supported by the Agency for Science, Technology and Research (A*STAR), Singapore. We thank Radboud University Nijmegen Medical Centre for sharing the 100 intellectual disability patient data sets, particularly J. de Ligt for his help with this data. We also thank S. Köhler for his help with Phenomizer, N. Jinawath for her help interpreting patient symptoms, and S. Prabhakar and N. Clarke for their comments on the genomic predictor. We thank S. Prabhakar, S. Davila, A. Wilm and R. del Rosario for their comments on the manuscript.
The authors declare no competing financial interests.
Integrated supplementary information
Patient disease symptoms are matched against known disorders, and the probability of a symptomatic match is assigned to the genes implicated in the respective disorder. These probabilities are propagated to known gene associates using a random walk with restart on the interaction network. In parallel, the patient's sequencing data are analyzed, and the damaging impact of each variant is estimated and pooled within genes. These two predictions are combined to implicate the gene(s) involved.
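The propagation step described above can be illustrated with a minimal sketch of a random walk with restart on a toy network. This is not the Phen-Gen implementation: the function name, the restart probability of 0.5, the convergence tolerance and the four-gene network are all assumptions chosen for illustration, and the sketch expects every seed gene to be present in the network.

```python
def random_walk_with_restart(adjacency, seed_probs, restart=0.5,
                             tol=1e-10, max_iter=1000):
    """Spread seed probabilities over an undirected gene network.

    adjacency  : dict mapping each gene to the set of its neighbours
                 (every neighbour must also be a key).
    seed_probs : dict mapping seed genes to phenotype-match probabilities.
    restart    : probability of jumping back to the seeds at each step
                 (hypothetical value; the source does not state one).
    """
    genes = sorted(adjacency)
    total = sum(seed_probs.get(g, 0.0) for g in genes)
    if total <= 0:
        raise ValueError("at least one seed gene must be in the network")

    # Normalised starting distribution, restricted to genes in the network.
    p0 = {g: seed_probs.get(g, 0.0) / total for g in genes}
    p = dict(p0)

    for _ in range(max_iter):
        # Restart component: a fraction of the mass returns to the seeds.
        nxt = {g: restart * p0[g] for g in genes}
        # Walk component: remaining mass moves uniformly to neighbours.
        for g in genes:
            neighbours = adjacency[g]
            if not neighbours:
                nxt[g] += (1 - restart) * p[g]  # isolated gene keeps its mass
                continue
            share = (1 - restart) * p[g] / len(neighbours)
            for n in neighbours:
                nxt[n] += share
        delta = sum(abs(nxt[g] - p[g]) for g in genes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy network: A - B - C form a chain, D is isolated.
network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
scores = random_walk_with_restart(network, {"A": 1.0})
```

In this toy case the seed gene A keeps the highest score, its direct associate B receives more than the second-degree associate C, and the unconnected gene D receives nothing.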
Supplementary Figure 2 Distribution of SIFT and PolyPhen-2 scores for damaging and benign nonsynonymous mutations.
The distributions of SIFT and PolyPhen-2 scores are shown for HGMD-reported damaging nonsynonymous mutations and for neutral nonsynonymous fixed substitutions inferred from human-chimp alignment. The plots indicate general agreement between the two methods.
The figure depicts the probability of deleteriousness around donor and acceptor sites for splice site mutations.
The figure illustrates the predicted deleteriousness of different combinations of five annotations: GERP++ (G), PhyloP (P), near-genic (N), transcription factor binding sites (T), and DNase hypersensitive sites (D). The predictions are binned according to the number of annotations (shown on the x-axis). Each bin is further sorted canonically based on the aforementioned order of annotations.
The 90% confidence intervals of different combinations of genomic annotations are shown. The order from Supplementary Figure 4 is maintained. The four sub-figures represent combinations of the two positive sets (HGMD and GWAS) and the two neutral sets (common variation in dbSNP and Complete Genomics MAF>0.30), respectively.
The top 1 percentile of damaging variants in each gene is shown. Histograms of this null-distribution cutoff across all genes are shown for dominant and recessive inheritance patterns and for the coding and genomic predictors. Most genes do not harbor any putative damaging variants, so the distributions are dominated by the leftmost bar, which has been truncated for better visual representation.
The distribution of damaging probabilities assigned to different classes of HGMD variants is shown. The top three panels employ the coding predictor. The genomic predictor was used for the bottom panel and applied to noncoding regulatory variants. The histograms depict the distribution of the scored variants. The pie charts on the right show the proportions of omitted and predicted variants in each category. Common variants (white) were observed in 1000 Genomes, ESP, or dbSNP with MAF 0.01. Commonly mutated genes (light green) indicates that the variant failed to exceed the null distribution of the respective gene. Missed (dark blue) indicates that the variant fell outside our regions of interest.
The figure depicts how compound heterozygous variants are evaluated. When both damaging variants reside within the coding region, the coding predictor is used to estimate their damaging impact. When one or both variants lie outside the exon boundaries, both variants are evaluated using the genomic predictor.
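The decision rule in this caption can be written down as a short sketch. The function names, the inclusive coordinate convention and the exon intervals below are hypothetical and serve only to illustrate the rule; they are not taken from the Phen-Gen code.

```python
def in_exon(position, exons):
    """Return True if a position falls inside any (start, end) exon interval.

    Intervals are treated as inclusive on both ends (an assumption).
    """
    return any(start <= position <= end for start, end in exons)


def choose_predictor(position_a, position_b, exons):
    """Pick the predictor used to score a compound heterozygous pair.

    Both variants inside exons -> coding predictor.
    One or both outside exons  -> genomic predictor for both variants.
    """
    if in_exon(position_a, exons) and in_exon(position_b, exons):
        return "coding"
    return "genomic"


# Hypothetical gene with two exons.
exons = [(100, 200), (400, 500)]
both_coding = choose_predictor(150, 450, exons)   # both inside exons
one_intronic = choose_predictor(150, 300, exons)  # second variant is intronic
```

The point of the rule is that the pair is always scored on a single scale, so a coding variant paired with a noncoding one is not compared across two different predictors.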
The comparison of Phen-Gen and VAAST in simulations using 44 phenotypically heterogeneous disorders and nonsynonymous mutations in HGMD is shown. Both panels depict each method's ability to rank the true gene within the top 1, 5 and 10 genes. For Phen-Gen, the bar is split into the predictive power based on genotypic prediction and the added advantage gained from disease symptoms. VAAST uses only the genomic data and assigns multiple genes the same rank at the top of the order. For a fair comparison, the true gene was assigned the worst, average and best rank among similarly ranked peers. The three components of the bar reflect the performance across these scenarios.
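The tie handling described above can be made concrete with a small sketch that computes the best, average and worst rank of the true gene when several genes share its score. The function name and the example scores are assumptions for illustration, and the average is taken here as the midpoint of the tied positions.

```python
def tied_rank_bounds(scores, true_gene):
    """Return (best, average, worst) rank of true_gene among tied peers.

    scores : dict mapping gene names to scores, higher meaning more damaging.
    Ranks start at 1.
    """
    target = scores[true_gene]
    higher = sum(1 for value in scores.values() if value > target)
    tied = sum(1 for value in scores.values() if value == target)

    best = higher + 1          # true gene placed first among its ties
    worst = higher + tied      # true gene placed last among its ties
    average = (best + worst) / 2
    return best, average, worst


# Three genes tie at the top; the fourth scores lower.
example = {"G1": 0.9, "G2": 0.9, "G3": 0.9, "G4": 0.5}
bounds_tied = tied_rank_bounds(example, "G2")
bounds_unique = tied_rank_bounds(example, "G4")
```

With three genes tied at the top, the true gene counts as a top-1 hit only in the best-case scenario, which is why the three scenarios can give different bar heights.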
Cite this article
Javed, A., Agrawal, S. & Ng, P. Phen-Gen: combining phenotype and genotype to analyze rare disorders. Nat Methods 11, 935–937 (2014). https://doi.org/10.1038/nmeth.3046