
A framework for automated gene selection in genomic applications



An efficient framework to identify disease-associated genes is needed to evaluate genomic data both for individuals with an unknown disease etiology and for those undergoing genomic screening. Here, we propose a framework for selecting genes for use in genomic analyses. It covers applications limited to genes with strong or established evidence levels as well as applications that include genes with limited or emerging evidence of disease association.


We extracted genes with evidence for gene–disease association from the Human Gene Mutation Database, OMIM, and ClinVar to build a comprehensive gene list of 6,145 genes. Next, we applied stringent filters in conjunction with computationally curated evidence (DisGeNET) to create a restrictive list limited to 3,929 genes with stronger disease associations.
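The two-step construction described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not state the actual filters, so the rule used here (a minimum number of supporting sources plus a minimum evidence score) is an assumption, as are the function names, the thresholds, and the placeholder gene symbols.

```python
from typing import Dict, Set


def build_comprehensive(sources: Dict[str, Set[str]]) -> Set[str]:
    """Union of gene symbols reported by any source database."""
    return set().union(*sources.values())


def build_restrictive(
    sources: Dict[str, Set[str]],
    scores: Dict[str, float],
    min_sources: int = 2,    # hypothetical threshold, not from the paper
    min_score: float = 0.3,  # hypothetical threshold, not from the paper
) -> Set[str]:
    """Keep genes that pass an illustrative (assumed) stringency filter.

    A gene is kept when it appears in at least `min_sources` databases and
    its computationally curated evidence score is at least `min_score`.
    """
    keep: Set[str] = set()
    for gene in build_comprehensive(sources):
        n_sources = sum(gene in symbols for symbols in sources.values())
        if n_sources >= min_sources and scores.get(gene, 0.0) >= min_score:
            keep.add(gene)
    return keep


# Toy input with placeholder symbols; real inputs would be database exports.
sources = {
    "HGMD": {"GENE_A", "GENE_B", "GENE_C"},
    "OMIM": {"GENE_A", "GENE_B"},
    "ClinVar": {"GENE_A", "GENE_D"},
}
# Stand-in for a per-gene evidence score (for example, one derived from DisGeNET).
scores = {"GENE_A": 0.9, "GENE_B": 0.2, "GENE_C": 0.8, "GENE_D": 0.5}

comprehensive = build_comprehensive(sources)
restrictive = build_restrictive(sources, scores)

print(sorted(comprehensive))  # ['GENE_A', 'GENE_B', 'GENE_C', 'GENE_D']
print(sorted(restrictive))    # ['GENE_A']
```

Because both lists are derived from the source exports, rerunning the functions on refreshed exports regenerates them, which is the sense in which such lists stay updatable.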


When compared to manual gene curation efforts, including the Clinical Genome Resource, genes with strong or definitive disease associations are included in both gene lists at high percentages, while genes with limited evidence are largely removed. We further confirmed the utility of this approach in identifying pathogenic and likely pathogenic variants in 45 genomes.
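A comparison of this kind can be sketched as a per-classification inclusion rate: for each evidence category assigned by a manual curation resource, the fraction of its genes that also appear in a given gene list. The function name, category labels, and placeholder gene symbols below are assumptions for illustration; they are not taken from the paper's data.

```python
from collections import defaultdict
from typing import Dict, Set


def inclusion_by_classification(
    curated: Dict[str, str], gene_list: Set[str]
) -> Dict[str, float]:
    """Fraction of curated genes per classification present in gene_list."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for gene, label in curated.items():
        totals[label] += 1
        if gene in gene_list:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Toy curated classifications (placeholder symbols and labels).
curated = {
    "GENE_A": "Definitive",
    "GENE_B": "Definitive",
    "GENE_C": "Limited",
    "GENE_D": "Limited",
}
gene_list = {"GENE_A", "GENE_B", "GENE_C"}

rates = inclusion_by_classification(curated, gene_list)
print(rates)  # {'Definitive': 1.0, 'Limited': 0.5}
```

A list that behaves as described would show rates near 1.0 for the strong and definitive categories and much lower rates for the limited category.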


Our approach efficiently creates highly sensitive gene lists for genomic applications, while remaining dynamic and updatable, enabling time savings in genomic applications.


Fig. 1: Overview of comprehensive and restrictive gene list.
Fig. 2: Comprehensive and restrictive gene lists compared to the GDA classifications assigned by six resources.

Data availability

The gene lists and data used to develop the lists can be found at




Funding support was partly provided by grant 5R01HL143295 from the National Institutes of Health/National Heart, Lung, and Blood Institute (L.L.d.l.V., C.L.B.Z., R.C.G., H.L.R., M.S.L.). The authors would like to thank the Gene Curation Coalition (GenCC) for generating curated content used in this project. GenCC’s curated content was obtained at (13 March 2021) and includes contributions from the following organizations: Invitae, Illumina, Myriad Women’s Health, Ambry Genetics, and TGMI/G2P.

Author information




Conceptualization: M.S.L., K.M., C.A.A.T., H.M.S. Data curation: W.Y., L.H. Formal analysis: L.L.d.l.V., W.Y. Funding acquisition: R.C.G. Investigation: L.L.d.l.V., K.M. Methodology: M.S.L., H.L.R. Project administration: C.L.B.Z. Software: W.Y., L.H. Supervision: M.S.L. Visualization: W.Y., L.L.d.l.V. Writing—original draft: L.L.d.l.V., M.S.L. Writing—review and editing: H.L.R., C.L.B.Z., R.C.G., C.A.A.T., H.M.S. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to M. S. Lebo.

Ethics declarations

Ethics declaration

This project was reviewed and approved by the Mass General Brigham institutional review board (IRB). All individuals consented to clinical genomic screening, and all individual data were de-identified.

Competing interests

M.S.L., L.L.d.l.V., and C.L.B.Z. report grants from the National Institutes of Health (NIH) during the conduct of the study. H.L.R. reports grants from NIH during the conduct of the study; she also reports personal fees from Genome Medical outside the submitted work. R.C.G. reports grants from NIH during the conduct of the study; he also reports personal fees from AIA, SavvySherpa, Verily, and Wamberg, all outside the submitted work. The other authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Lazo de la Vega, L., Yu, W., Machini, K. et al. A framework for automated gene selection in genomic applications. Genet Med (2021).
