Phenotypic signatures in clinical data enable systematic identification of patients for genetic testing

Morley, Theodore J.; Han, Lide; Castro, Victor M.; Morra, Jonathan; Perlis, Roy H.; Cox, Nancy J.; Bastarache, Lisa; Ruderfer, Douglas M.

doi:10.1038/s41591-021-01356-z

Article
Published: 03 June 2021

Theodore J. Morley^1,2,
Lide Han^1,2 (ORCID: orcid.org/0000-0002-0132-2656),
Victor M. Castro^3,
Jonathan Morra^4,
Roy H. Perlis^3,
Nancy J. Cox^1,2,
Lisa Bastarache^2,5 &
Douglas M. Ruderfer^1,2,5,6 (ORCID: orcid.org/0000-0002-2365-386X)

Nature Medicine volume 27, pages 1097–1104 (2021)

Abstract

Around 5% of the population is affected by a rare genetic disease, yet most endure years of uncertainty before receiving a genetic test. A common feature of genetic diseases is the presence of multiple rare phenotypes that often span organ systems. Here, we use diagnostic billing information from longitudinal clinical data in the electronic health records (EHRs) of 2,286 patients who received a chromosomal microarray test, and 9,144 matched controls, to build a model to predict who should receive a genetic test. The model achieved high prediction accuracies in a held-out test sample (area under the receiver operating characteristic curve (AUROC), 0.97; area under the precision–recall curve (AUPRC), 0.92), in an independent hospital system (AUROC, 0.95; AUPRC, 0.62), and in an independent set of 172,265 patients in which cases were broadly defined as having an interaction with a genetics provider (AUROC, 0.9; AUPRC, 0.63). Patients carrying a putative pathogenic copy number variant were also accurately identified by the model. Compared with current approaches for genetic test determination, our model could identify more patients for testing while also increasing the proportion of those tested who have a genetic disease. We demonstrate that phenotypic patterns representative of a wide range of genetic diseases can be captured from EHRs to systematize decision-making for genetic testing, with the potential to speed up diagnosis, improve care and reduce costs.
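The abstract reports both AUROC and AUPRC, and the two metrics behave differently when the proportion of cases changes. The following is a minimal sketch, not the authors' code, that computes both metrics from scratch on made-up risk scores. The function names, the toy scores and the tenfold dilution of controls are assumptions chosen purely for illustration; the step-wise average precision used here is one common estimator of AUPRC and may differ from the estimator used in the paper.

```python
from typing import Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve.

    Averages the precision observed at the rank of each case. Tied scores
    are ordered by input position because Python's sort is stable.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def evaluate(cases: Sequence[float], controls: Sequence[float]) -> Tuple[float, float]:
    """Return (AUROC, average precision) for lists of case and control scores."""
    labels = [1] * len(cases) + [0] * len(controls)
    scores = list(cases) + list(controls)
    return auroc(labels, scores), average_precision(labels, scores)


# Hypothetical risk scores, invented for this example only.
case_scores = [0.9, 0.8, 0.4]
control_scores = [0.7, 0.3, 0.2, 0.1]

# Same score distributions, but ten times as many controls per case.
matched = evaluate(case_scores, control_scores)
diluted = evaluate(case_scores, control_scores * 10)

print("matched  AUROC=%.3f AP=%.3f" % matched)
print("diluted  AUROC=%.3f AP=%.3f" % diluted)
```

In this toy setting AUROC is identical in both runs while average precision falls once controls outnumber cases more heavily. This is a general property of the metrics and is worth keeping in mind when comparing AUPRC across cohorts with different case proportions; it is not a claim about why the reported values differ between the study's cohorts.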

**Fig. 1: Predictive performance of the model in a held-out CMA test set and a general hospital population.**

**Fig. 2: Identification of patients with CNV syndromes and interpretability.**

**Fig. 3: Proportion of patients with a putative pathogenic CNV identified by the model.**

**Fig. 4: Prediction performance across diverse genetic diseases.**

**Fig. 5: Clinical time period preceding the genetic test.**

Data availability

Summary-level data on the frequency and importance of phecodes in the model are presented in Supplementary Table 3. Summary data on clinical and genetic information are provided throughout the paper. All requests for raw data (for example, CNV and phenotype data) and materials are reviewed by Vanderbilt University Medical Center to determine whether the request is subject to any intellectual property or confidentiality obligations. For example, patient-related data not included in the paper may be subject to patient confidentiality. Any such data and materials that can be shared will be released via a material transfer agreement. ClinGen data were downloaded from the UCSC Genome Browser in June 2019 (https://genome.ucsc.edu/cgi-bin/hgGateway). DECIPHER CNV syndromes were extracted from https://www.deciphergenomics.org/disorders/syndromes/list.

Code availability

All code used to construct and run the model is provided at https://github.com/RuderferLab/chromosomalMicroarray.

Acknowledgements

This work was supported by R01MH111776 (to D.M.R.), R01MH113362 (to N.J.C. and D.M.R.), R01LM010685 (to L.B.) and U01HG009068 (to N.J.C.). This study makes use of data generated by the DECIPHER community. A full list of centers that contributed to the generation of the data is available from https://decipher.sanger.ac.uk and via email from decipher@sanger.ac.uk. Funding for the project was provided by Wellcome. The dataset(s) used for the analyses described were obtained from Vanderbilt University Medical Center’s BioVU, which is supported by institutional funding, private agencies and federal grants. These include the NIH-funded Shared Instrumentation Grant S10RR025141, and CTSA grants UL1TR002243, UL1TR000445 and UL1RR024975. Genomic data are also supported by investigator-led projects that include U01HG004798, R01NS032830, RC2GM092618, P50GM115305, U01HG006378, U19HL065962 and R01HD074711; and additional funding sources listed at https://victr.vanderbilt.edu/pub/biovu/.

Author information

Authors and Affiliations

Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
Theodore J. Morley, Lide Han, Nancy J. Cox & Douglas M. Ruderfer
Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
Theodore J. Morley, Lide Han, Nancy J. Cox, Lisa Bastarache & Douglas M. Ruderfer
Center for Quantitative Health, Division of Clinical Research, Massachusetts General Hospital, Boston, MA, USA
Victor M. Castro & Roy H. Perlis
Zefr, Los Angeles, CA, USA
Jonathan Morra
Center for Precision Medicine, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
Lisa Bastarache & Douglas M. Ruderfer
Department of Psychiatry and Behavioral Sciences, Vanderbilt University Medical Center, Nashville, TN, USA
Douglas M. Ruderfer

Contributions

D.M.R. and T.J.M. designed and conceived the study. L.B. and T.J.M. extracted data from the EHRs for training and validation. L.H. generated the CNV data. T.J.M., D.M.R. and J.M. designed and implemented the prediction model. T.J.M. and D.M.R. performed the analyses. V.M.C. and R.H.P. performed external validation at MGB. D.M.R., T.J.M., L.B. and N.J.C. interpreted the results. T.J.M. and D.M.R. drafted the paper. All authors read the paper, provided feedback and approved the submission.

Corresponding author

Correspondence to Douglas M. Ruderfer.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer reviewer information Nature Medicine thanks Marc Williams and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Michael Basson was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 PheWAS of CMA cases versus matched controls.

PheWAS Manhattan plot showing the significance of associations from logistic regressions of each of 1,620 phecodes on case status (individuals who received a CMA versus matched controls). The orientation of each triangular point represents the direction of effect, and points are colored by phecode category. For clarity, only phecodes with uncorrected p-values below 5 × 10⁻¹⁵⁰ are labeled.
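As an illustration of what a single point in such a plot summarizes, the sketch below tests one binary phecode against case status. It relies on a standard result: for a logistic regression with an intercept and one 0/1 predictor, the fitted slope equals the log odds ratio of the 2 × 2 table, with Wald standard error sqrt(1/a + 1/b + 1/c + 1/d). The caption does not state which covariates, if any, were included, so the unadjusted model here is an assumption, as are the function name and the phecode counts (only the cohort sizes of 2,286 cases and 9,144 controls come from the abstract).

```python
import math
from typing import Tuple


def phecode_association(
    case_with: int, case_without: int, ctrl_with: int, ctrl_without: int
) -> Tuple[float, float, float, str]:
    """Unadjusted association between one binary phecode and case status.

    Returns (log odds ratio, Wald standard error, two-sided p-value, direction).
    The closed form requires every cell of the 2x2 table to be positive.
    """
    a, b, c, d = case_with, case_without, ctrl_with, ctrl_without
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    beta = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = beta / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    direction = "up" if beta > 0 else "down"
    return beta, se, p_value, direction


# Hypothetical counts for one phecode: 400 of 2,286 cases and
# 200 of 9,144 controls carry the code (invented for illustration).
beta, se, p_value, direction = phecode_association(400, 2286 - 400, 200, 9144 - 200)
print("log OR=%.3f  SE=%.3f  p=%.3g  direction=%s" % (beta, se, p_value, direction))
```

Repeating this test for each phecode and plotting the negative log10 of the p-values by phecode category, with marker orientation set by `direction`, would give a plot of the same general form as the one described. Extremely strong associations can underflow to a p-value of 0.0 in double precision, so a production analysis would typically work on the log scale.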

Extended Data Fig. 2 Age of patients at date of CMA testing differs by syndrome.

Age of patients at the time of their CMA report, grouped by the most common syndromic regions, which were assigned by combining the diagnosis with the genomic coordinates of the reported abnormal variant. Independent patient numbers within each category: 15q11.2 syndromes (32), 16p11.2 syndromes (14), 1q21.1 syndromes (9), CMT/HNPP (18), DiGeorge/22q11.2 duplication syndrome (31), Down syndrome (7), Turner/Klinefelter (14), Williams syndrome (9).

Supplementary information

Reporting Summary

Supplementary Tables 1–4.

About this article

Cite this article

Morley, T.J., Han, L., Castro, V.M. et al. Phenotypic signatures in clinical data enable systematic identification of patients for genetic testing. Nat Med 27, 1097–1104 (2021). https://doi.org/10.1038/s41591-021-01356-z

Received: 12 August 2020
Accepted: 16 April 2021
Published: 03 June 2021
Issue Date: June 2021
DOI: https://doi.org/10.1038/s41591-021-01356-z
