Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases

Kobren, Shilpa Nadimpalli; Baldridge, Dustin; Velinder, Matt; Krier, Joel B.; LeBlanc, Kimberly; Esteves, Cecilia; Pusey, Barbara N.; Züchner, Stephan; Blue, Elizabeth; Lee, Hane; Huang, Alden; Bastarache, Lisa; Bican, Anna; Cogan, Joy; Marwaha, Shruti; Alkelai, Anna; Murdock, David R.; Liu, Pengfei; Wegner, Daniel J.; Paul, Alexander J.; Sunyaev, Shamil R.; Kohane, Isaac S.

doi:10.1038/s41436-020-01084-8


Article
Open access
Published: 12 February 2021


Shilpa Nadimpalli Kobren¹,
Dustin Baldridge²,
Matt Velinder³,
Joel B. Krier⁴,
Kimberly LeBlanc¹,
Cecilia Esteves¹,
Barbara N. Pusey⁵,
Stephan Züchner⁶,
Elizabeth Blue⁷,
Hane Lee⁸,⁹,
Alden Huang⁸,
Lisa Bastarache¹⁰,
Anna Bican¹⁰,
Joy Cogan¹⁰,
Shruti Marwaha¹¹,
Anna Alkelai¹²,
David R. Murdock¹³,
Pengfei Liu¹³,¹⁴,
Daniel J. Wegner²,
Alexander J. Paul¹⁵,
Undiagnosed Diseases Network,
Shamil R. Sunyaev¹,⁴ &
…
Isaac S. Kohane ORCID: orcid.org/0000-0003-2192-5160¹

Genetics in Medicine volume 23, pages 1075–1085 (2021)


Abstract

Purpose

Genomic sequencing has become an increasingly powerful and relevant tool to be leveraged for the discovery of genetic aberrations underlying rare, Mendelian conditions. Although the computational tools incorporated into diagnostic workflows for this task are continually evolving and improving, we nevertheless sought to investigate commonalities across sequencing processing workflows to reveal consensus and standard practice tools and highlight exploratory analyses where technical and theoretical method improvements would be most impactful.

Methods

We collected details regarding the computational approaches used by a genetic testing laboratory and 11 clinical research sites in the United States participating in the Undiagnosed Diseases Network via meetings with bioinformaticians, online survey forms, and analyses of internal protocols.

Results

We found that tools for processing genomic sequencing data can be grouped into four distinct categories. Whereas well-established practices exist for initial variant calling and quality control steps, there is substantial divergence across sites in later stages for variant prioritization and multimodal data integration, demonstrating a diversity of approaches for solving the most mysterious undiagnosed cases.

Conclusion

The largest differences across diagnostic workflows suggest that advances in structural variant detection, noncoding variant interpretation, and integration of additional biomedical data may be especially promising for solving chronically undiagnosed cases.


INTRODUCTION

Next-generation exome sequencing (ES) and genome sequencing (GS) have revolutionized the process for diagnosing rare and novel genetic conditions.¹ Traditionally, the diagnostic process has primarily been driven by phenotype, with clinicians comparing patients’ symptoms to others encountered in their prior experience and clinical training and/or to a knowledgebase of known human diseases.² In a typical undiagnosed case, however, either a patient’s phenotype is not indicative of any known disease, or tests to confirm the presence of a suspected genetic condition are inconclusive. In these instances, ES and GS have enabled health-care providers to pursue a genetics-driven diagnostic approach in parallel, where the genetic variation uncovered in a patient can be assessed with respect to not only its known phenotypic associations³ but also to its prevalence in background populations,⁴ predicted pathogenicity,⁵ functional consequences, and mode of inheritance to reveal novel disease-causing loci. Indeed, while traditional clinical case review and directed diagnostic assays continue to solve difficult cases, ~74% of newly diagnosed genetic conditions have been attributed to analyses of ES and GS data.^6,7 However, the diagnosis rate for patients with potentially unique genetic conditions is still ~35%,⁷ suggesting ample opportunity for methodological improvements to advance our understanding of the genetic underpinnings of phenotypic extremes.

With this goal in mind, cross-institutional initiatives such as Care4Rare in Canada (http://care4rare.ca) and Solve-RD in Europe (http://solve-rd.eu) have been established to connect and enable clinical researchers to uncover the genetic origins of disease in undiagnosed patients. In addition to furthering basic genetics research, these efforts have provided scores of patients with an end to diagnostic uncertainty and access to additional services.⁸ The most expansive undiagnosed initiative in the United States is the Undiagnosed Diseases Network (UDN), which encompasses 12 clinical sites and has, since its inception in 2014, cumulatively diagnosed over 400 individuals and described over 30 novel syndromes.⁷ Each UDN clinical site is staffed with specialists who develop and apply complex suites of bioinformatics tools to analyze sequencing data and uncover disease-causing variants.⁹ These sites each underwent a competitive application process and were selected to join the UDN due to their demonstrated track record of diagnosing difficult cases and characterizing novel genetic conditions through ongoing research efforts. The workflows implemented at these sites are thus representative of the state-of-the-art in rare disease diagnostic efforts.

We gathered details about 12 UDN bioinformatics pipelines, determined recurrent steps in a typical diagnostic evaluation, and identified consensus approaches. Moreover, we highlight substantial differences across pipelines regarding overall organization and incorporated tools. The comprehensive snapshot of effective computational workflows presented here can direct clinical teams interested in initiating genomic sequencing usage or re-evaluating patients who have had inconclusive genetic testing.

MATERIALS AND METHODS

Participating sites

Sequence analysis pipeline details were collected from the CLIA-certified sequencing core at Baylor Genetics (BaylorSeq) and 11 UDN clinical sites: Baylor College of Medicine (BCM), Duke University and Columbia University Institute for Genomic Medicine (Duke/Columbia), three Harvard-affiliated hospitals and Brigham Genomic Medicine (Harvard), University of Miami Miller School of Medicine (Miami), National Institutes of Health (NIH), University of Washington School of Medicine and Seattle Children’s Hospital (PacificNW), Stanford Center for Undiagnosed Diseases (Stanford), University of California–Los Angeles (UCLA), University of Utah Health Center for Genetic Discovery (Utah), Vanderbilt University Medical Center (Vanderbilt), and Washington University School of Medicine (WUSTL). The University of Pennsylvania and Children’s Hospital of Philadelphia clinical site had yet to process sequencing data for a UDN case at the time of writing and thus is excluded from this study.

Data collection

We systematically collected details about each UDN site’s computational diagnostic workflows using a combination of in-person and virtual meetings with bioinformaticians and genetic counselors, online survey forms, and inspections of published papers and internal protocols.^10,11,12

RESULTS

Overview of diagnostic workflow components

Before applying to the UDN, a patient has typically endured extensive prior testing by multiple clinicians over the course of a multiyear “diagnostic odyssey.” As part of the application process, UDN clinical sites review patients’ health records to assess whether the UDN evaluation may aid in the identification of a diagnosis. Accepted patients undergo an in-person evaluation at a clinical site (Fig. 1a). In most cases, blood, saliva, and/or fibroblast samples of affected and unaffected individuals in the family are collected during this evaluation or beforehand via mailed-in collection kits. These samples are sequenced at BaylorSeq; all sequencing data are made available to the clinical site within weeks (Fig. 1b). Variants in disease-causing genes related to the clinical phenotype, medically actionable pathogenic variants in disease-causing genes unrelated to the clinical phenotype, and heterozygote status for select recessive Mendelian conditions are listed in a clinical report issued by BaylorSeq in accordance with the UDN protocol and following American College of Medical Genetics and Genomics (ACMG) variant classification guidelines.¹³ At 8 of the 11 clinical sites surveyed, researchers simultaneously perform local analyses of the sequencing data in an attempt to identify “strong candidate” variants that may explain the patient’s symptoms (Fig. 1c, d); three surveyed sites run their local pipelines only when BaylorSeq’s clinical report is inconclusive. Once candidate variants are highlighted via clinical sites’ and BaylorSeq’s analyses, there are three ways by which their causality is established. First, human and animal databases are queried for genotype-matched individuals with symptomatic concordance with the patient.^14,15,16,17 Second, experiments are simultaneously performed to evaluate the in vivo effect of candidate variants in model organisms or cell lines. Third, the presence of secondary phenotypes indicated by genotype-matched individuals or in vivo experiments is confirmed in affected patients (Fig. 1e). Causal variants revealed through these steps are confirmed by Sanger sequencing, broadly shared by the UDN (Extended Data Note 1), and ideally lead to a molecular diagnosis for a patient, which in and of itself represents a turning point in a patient’s diagnostic odyssey, and also can inform positive therapeutic changes (Fig. 1f).¹⁸

**Fig. 1: Representative clinical workflow to uncover disease-causing genetic variants in undiagnosed patients.**

The computational tools used to find explanatory genetic variants change constantly with newly available technologies and newly encountered disease etiologies. Despite these iterative improvements to bioinformatics pipelines, the primary roles that computational tools play in the overall variant prioritization process can be categorized as follows: (1) aligning sequencing reads to a reference human genome (Fig. 1g), (2) identifying genetic variants present in the individual from the sequencing reads (Fig. 1h), (3) annotating those variants with relevant information (Fig. 1i), and finally (4) filtering and prioritizing variants that are likely to cause the patient’s condition (Fig. 1j). In the following sections, we delve into the purpose of and tools used in each of these categories.

Aligning next-generation sequencing reads

Aligning next-generation sequencing reads to a reference human genome is the necessary first step for all sequence analysis pipelines (Fig. 1g); the ubiquity of this step has resulted in community-driven standardization.¹⁹ Eight sites regularly realign reads after BaylorSeq’s initial alignment, whereas three sites realign reads only in specific circumstances, such as during reanalysis of a patient’s prior sequencing data. Realignment is necessary for six sites whose pipelines are configured for the GRCh37/hg19 human genome build, as genetic testing laboratories including BaylorSeq now provide reads aligned to the newer GRCh38/hg38 build. Realignment uses either an open-source implementation of the Burrows–Wheeler Aligner (BWA-MEM) (used regularly by six sites and in specific circumstances, as described above, by two sites) or Illumina/Edico’s DRAGEN aligner (used regularly by BaylorSeq and two clinical sites and in specific circumstances by one clinical site).

Simple variant calling

Calling single-nucleotide variants (SNVs) and short insertions and deletions (indels) from aligned reads is the next step in sequence processing (Fig. 1h) and is often accomplished using the Genome Analysis Toolkit (GATK) best practices workflow,²⁰ though Google’s DeepVariant²¹ and Real Time Genomics’ PolyBayes implementation (https://www.realtimegenomics.com) perform competitively for this task and are used in addition to GATK by two clinical sites. BaylorSeq calls variants using Illumina/Edico’s DRAGEN platform. Six clinical sites and BaylorSeq “jointly” call variants across samples as recommended in GATK to rescue low coverage true variants and accurately model false variants. In practice, variants are jointly called with (1) members of the same family, (2) other UDN patients at the same site, and/or (3) healthy patients internal or external to an institution. The Variant Quality Score Recalibration (VQSR) step recommended by GATK to identify technical artifacts, however, may misclassify real rare variants as false positives; this step is carefully reviewed or omitted in practice.

Structural variant detection

In contrast to calling simple variants, calling structural variants (SVs) from GS data is a relatively divergent step, indicating that best practices have yet to be determined. SVs refer to large (>50 bp) insertions and deletions, duplications and other copy-number variants (CNVs), short tandem repeat (STR) expansions, translocations where genomic regions have moved within or across chromosomes, and inversions where a detached stretch of DNA was reattached in the opposite orientation. Combining the output from many SV calling tools—each optimized for detecting complementary types of SVs and often using distinct information (e.g., read depth, paired-end reads, or split reads)—is necessary for comprehensive SV detection.²² Existing SV detection tools have been reviewed in depth;²³ here we list the subset of tools that are actively used by UDN sites (Table 1, Extended Data Table 1). The most commonly used tool, Manta, has been shown by independent evaluations to have high sensitivity but also a high false positive rate.²⁴ Future development of SV benchmarking data sets for assessing the accuracy of SV detection tools will be essential in directing the current diverse exploration of techniques toward community-established best practices.

Table 1 Structural variant (SV) callers in use at clinical sites.


Quality control of called variants

Confirming the quality of sequencing data and variants is critical to avoid expending downstream analyses on false variants. CLIA-certified genetic testing laboratories check the quality of unaligned and map-aligned sequencing reads prior to variant calling for all clinical grade sequencing (Extended Data Note 2). Four UDN clinical sites regularly confirm the quality of sequencing reads using a combination of FASTQC, FASTP, MultiQC, BEDTools (to check coverage), and bam.iobio. Other clinical sites begin quality control (QC) only after read alignment and variant calling.

QC for Mendelian disease diagnosis encompasses three checks: (1) sequencing reads are high quality, (2) sequenced samples correspond to the correct individuals and have expected relatedness, and (3) inheritance patterns across families are as expected (Table 2, Extended Data Table 1). BaylorSeq performs QC for all clinical genomic sequencing before providing data to UDN clinical sites. However, when patients provide their own sequencing data (as opposed to BaylorSeq providing newly acquired data) or when “research” (as opposed to clinical) sequencing is provided, clinical sites perform QC. Most sites have nearly identical steps for check 1 and similar QC for checks 2 and 3. In practice, QC has identified incorrectly related or labeled samples and poor overall quality of sequencing reads that were remedied via resequencing before subsequent analyses.¹¹ Notably, existing QC tools rarely “flag” anomalous samples; users must accurately interpret results.

Table 2 Quality control (QC) checks of variants for rare disease diagnosis.


Annotation and filtering of genetic variants

Even after removing low quality calls, a single genome can have several thousand unique genetic variants uncovered. Efficient, automated annotation and filtering of these variants is the next step of the variant prioritization process (Fig. 1i, Extended Data Table 2). Annotations fall into four categories: (1) known disease associations, (2) prevalence across healthy human populations, (3) predicted pathogenicity and functional effect, and (4) inheritance. Many scores exist across the first three categories;²⁵ in the following sections we explore those that are used in practice for rare disease diagnosis.

Known disease-associated genes

Many specific genetic variants have previously been determined to cause human disease, and it is useful to first look for the presence of these variants in a patient’s sequencing data. Databases compiling disease-causing variants, the genes they impact, and their phenotypic associations are used by ten clinical sites (Table 3). Genetic testing laboratories, including BaylorSeq, use these in addition to internal databases containing similar information. Disease-relevant variants are listed on clinical reports and are considered during the initial pass of each UDN case at all clinical sites.

Table 3 Human genetic variation data sets and derived tools.


Variant segregation in healthy human populations

Several positions within the human genome naturally vary across healthy individuals, and “common” variants at these positions are unlikely to cause the conditions under investigation by the UDN. Though rare combinations of otherwise common variants may lead to disease,²⁶ clinical sites do not currently consider all common variant combinations. Instead, variants observed more than 1 in 100 times across healthy populations (i.e., minor allele frequency [MAF] > 0.01) are typically excluded during the first pass of the data. The exact MAF threshold used depends on the suspected mode of inheritance. Lower MAF thresholds are used for suspected dominant conditions because the variants causing the extremely rare phenotypes of UDN patients are assumed to be naturally selected against and thus equally rare in the general population and entirely absent in control population databases. Higher MAF thresholds are used for suspected recessive conditions because heterozygous individuals would not be expected to manifest severe disease features.
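The first-pass frequency filter described above can be sketched in a few lines of Python. This is a minimal illustration, not any site's implementation: the `Variant` record, the function name, and the numeric cutoffs in `MAF_CUTOFFS` are assumptions chosen only to show that the dominant cutoff is stricter than the recessive one.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical cutoffs for illustration only; the text states only that
# dominant thresholds are lower than recessive ones, and that MAF > 0.01
# is typically excluded on the first pass.
MAF_CUTOFFS = {"dominant": 0.0001, "recessive": 0.01}


@dataclass(frozen=True)
class Variant:
    variant_id: str
    maf: float  # minor allele frequency in a reference population


def first_pass_filter(variants: Iterable[Variant], mode: str) -> List[Variant]:
    """Keep variants whose MAF does not exceed the cutoff for the suspected mode."""
    cutoff = MAF_CUTOFFS[mode]
    return [v for v in variants if v.maf <= cutoff]


# Made-up example variants.
variants = [
    Variant("chr1:1000:A>G", 0.0),
    Variant("chr2:2000:C>T", 0.005),
    Variant("chr3:3000:G>A", 0.2),
]
dominant_ids = [v.variant_id for v in first_pass_filter(variants, "dominant")]
recessive_ids = [v.variant_id for v in first_pass_filter(variants, "recessive")]
```

In this toy input, only the variant absent from the reference population survives the dominant filter, while the recessive filter also keeps the variant at a frequency of 0.005.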

All UDN sites use data from the Broad Institute’s Genome Aggregation Database (gnomAD) to compute MAFs, and seven sites also compute MAFs from smaller or population-specific data sets on a case-by-case basis (Table 3). Two sites eliminate variants that are homozygous in three or more healthy individuals in these data sets. At the NIH site, rather than thresholding on MAFs computed directly from variant proportions in gnomAD, 95% Wilson confidence score intervals computed from these proportions are used to retain rare variants occurring in low coverage regions. Finally, five sites flag variants that are present in data sets internal to their institutions, because variants present in asymptomatic or differently symptomatic individuals are unlikely to be disease-relevant.
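To illustrate why a confidence interval can retain variants in low coverage regions, the sketch below computes a 95% Wilson score interval from an allele count and allele number. The decision rule in `passes_maf_filter` (keep a variant unless even the lower bound exceeds the cutoff) is one plausible reading of the approach and is an assumption, as are the function names and the 0.01 default cutoff.

```python
import math
from typing import Tuple


def wilson_interval(allele_count: int, allele_number: int, z: float = 1.96) -> Tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for an allele frequency.

    z = 1.96 corresponds to a 95% interval.
    """
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    p = allele_count / allele_number
    denom = 1.0 + z * z / allele_number
    center = (p + z * z / (2 * allele_number)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1.0 - p) / allele_number + z * z / (4 * allele_number ** 2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


def passes_maf_filter(allele_count: int, allele_number: int, cutoff: float = 0.01) -> bool:
    """Assumed rule: keep the variant unless even the lower bound exceeds the cutoff."""
    lower, _ = wilson_interval(allele_count, allele_number)
    return lower <= cutoff
```

Under this rule, a variant seen once among 20 sampled alleles has a point estimate of 0.05 but a lower bound below 0.01, so it is retained; the same 0.05 frequency estimated from 100,000 alleles has a tight interval and is excluded.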

Eight sites consult SV databases to check the existence and/or MAF of detected SVs (Table 3, Extended Data Table 1). Multiple databases are checked in practice because the SV detection tools used across databases differ, so the absence or rarity of an SV in one database may reflect a particular SV detection approach rather than true population rarity.

Simple genetic variation observed across healthy humans tends to be sparsely distributed with varying degrees of impact. These features can be used to capture how regions of the human genome may be intolerant of loss-of-function (LoF) variants, such as frameshift or protein-truncating variants. Nine surveyed sites incorporate selective constraint scores derived from and released with gnomAD data in their diagnostic pipelines, with the probability of heterozygous LoF intolerance scores and missense constraint Z scores used most commonly (Table 3).

Predicted pathogenicity and functional effect of variants

Various tools predict the pathogenicity of uncovered variants.²⁵ Values derived from cross-species comparative genomics contribute heavily to pathogenicity predictors, as positions that are conserved across species tend to be functionally critical. However, since most candidate coding variants are evolutionarily well-conserved, only five sites directly consider conservation in their diagnostic pipelines (Table 4, Extended Data Table 1).

Table 4 Tools for assigning the pathogenic likelihood or functional impact of variants.


The most commonly used pathogenicity predictors for rare disease diagnosis, each used by eight clinical sites, are Combined Annotation Dependent Depletion (CADD) and Rare Exome Variant Ensemble Learner (REVEL). Each considers multiple variant annotations, and scores >25 (CADD) and >0.3 (REVEL) indicate likely pathogenic variants. Nearly all predicted pathogenicity scores used, with the exception of ReMM, indicate disease relevance primarily for coding variants.²⁷
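As a small sketch of how the two cutoffs quoted above might be applied, the function below flags a variant when either available score exceeds its threshold. Combining the scores with a logical OR, and treating a missing score as uninformative, are assumptions made for illustration; sites may weigh these scores differently.

```python
from typing import Optional

CADD_CUTOFF = 25.0   # CADD cutoff quoted in the text (presumably the scaled score)
REVEL_CUTOFF = 0.3   # REVEL cutoff quoted in the text


def flag_likely_pathogenic(cadd: Optional[float], revel: Optional[float]) -> bool:
    """Flag a variant if any available score exceeds its cutoff (assumed rule)."""
    cadd_hit = cadd is not None and cadd > CADD_CUTOFF
    revel_hit = revel is not None and revel > REVEL_CUTOFF
    return cadd_hit or revel_hit
```

A missing REVEL score is expected for variants other than missense changes, which is why the sketch accepts `None` for either input.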

Indeed, predicting and experimentally validating the pathogenic impact of noncoding variants is notoriously difficult. All 12 sites use tools to predict how noncoding variants alter expected gene expression and splicing. Few sites use the same subset of tools for this task, though SpliceAI is the most commonly used tool overall (Table 4).

Mode of inheritance

After variants have been quality checked, MAF filtered, and annotated, Mendelian mode of inheritance is evaluated next by the clinical sites. Some sites simultaneously consider the functional impact of variants, where, for instance, intergenic or perceived synonymous variants are excluded.³ Despite the ubiquity of this step, each site uses different tools for computing inheritance patterns.

For a dominantly inherited genetic condition to manifest, only one defective copy of the relevant gene is required, whereas recessive disease manifestation requires two defective gene copies. GS of unrelated or distantly related affected individuals is desired in suspected dominant cases to find rare, shared variants.

In sporadic cases—caused by a single de novo dominant or two recessive variants—GS of at least the affected individual and both unaffected parents is desired. Selecting heterozygous variants in the affected individual that are absent in both unaffected parents or homozygous variants in the affected individual that are absent in at least one parent via straightforward segregation analysis results in a majority of spurious de novo calls. These false positive calls stem from inadequate sequence coverage or alignment in parents from whom variants were in fact inherited and/or inaccurate modeling of underlying variant frequencies. Four sites regularly use specialized de novo calling tools or databases to offset these issues (Table 2). Fixing de novo calling errors requires analysis of sequencing reads, which many genetic testing centers do not readily provide.
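The straightforward segregation rule, and the way low parental coverage inflates its de novo calls, can be sketched as follows. The `Call` record, the function names, and the `min_depth` value of 20 reads are hypothetical; the depth check is a simple stand-in for the specialized de novo callers mentioned above, not a description of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    alt_copies: int  # 0, 1, or 2 copies of the alternate allele
    depth: int       # number of reads covering the site


def naive_de_novo(child: Call, mother: Call, father: Call) -> bool:
    """Straightforward segregation rule described in the text."""
    if child.alt_copies == 1:
        # Heterozygous in the child, absent in both parents.
        return mother.alt_copies == 0 and father.alt_copies == 0
    if child.alt_copies == 2:
        # Homozygous in the child, absent in at least one parent.
        return mother.alt_copies == 0 or father.alt_copies == 0
    return False


def depth_aware_de_novo(child: Call, mother: Call, father: Call, min_depth: int = 20) -> bool:
    """Same rule, but require adequate coverage in all three samples (assumed cutoff)."""
    well_covered = all(c.depth >= min_depth for c in (child, mother, father))
    return well_covered and naive_de_novo(child, mother, father)
```

A parent genotyped as reference from only a handful of reads may in fact carry the variant, so the naive rule reports a de novo call that the depth-aware version withholds.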

Occasionally in sporadic and/or recessive cases, the same disease-causing variant is inherited from both heterozygous parents and can be easily detected as a homozygous variant. Genomic regions containing only homozygous variants in an affected individual with nonconsanguineous parents can also indicate an inherited deletion from one parent or uniparental isodisomy. These latter phenomena, revealed as Mendelian violations during the QC process (Table 2), can manifest in a recessive disease despite only one parent being heterozygous for the disease-causing variant. Often in undiagnosed recessive cases, two or more different heterozygous variants, each either inherited or occurring de novo, can give rise to the disease phenotype; these variants are referred to as compound heterozygous pairs. The complete set of compound heterozygous variant pairs in any given case is very large, and so filters—such as restricting to rare, LoF, likely pathogenic variants—are applied beforehand. If too few candidate explanatory variants pass these filters, the NIH, WUSTL and Miami sites use internal “second tier” schemes, such as increasing the allowable MAF threshold, to rescue additional compound heterozygous pairs.²⁸
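Enumerating compound heterozygous candidates after filtering can be sketched as below. The `HetVariant` record and the pairing rule are assumptions: two heterozygous variants in the same gene are kept as a candidate pair unless both were inherited from the same parent, and pairs involving a de novo variant are kept because their phase is unknown without further evidence. The input is assumed to have already passed rarity and pathogenicity filters.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class HetVariant:
    variant_id: str
    gene: str
    origin: str  # "mother", "father", or "de_novo"


def compound_het_pairs(variants: Iterable[HetVariant]) -> List[Tuple[str, str, str]]:
    """Return (gene, variant_a, variant_b) for pairs that could lie on opposite gene copies."""
    by_gene: Dict[str, List[HetVariant]] = defaultdict(list)
    for v in variants:
        by_gene[v.gene].append(v)

    pairs = []
    for gene, gene_variants in by_gene.items():
        for a, b in combinations(gene_variants, 2):
            # Two variants from the same parent are assumed to share a gene copy.
            same_parent = a.origin == b.origin and a.origin in ("mother", "father")
            if not same_parent:
                pairs.append((gene, a.variant_id, b.variant_id))
    return pairs
```

Because the number of pairs grows quadratically with the number of heterozygous variants per gene, the prefiltering step described above keeps this enumeration tractable.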

Integration of nonsequencing data

Cases with nondiagnostic genetic testing have eventually been solved by reanalysis approaches that leverage additional data, such as transcriptome sequencing^29,30 (RNA-seq) or “deep phenotyping,”^31,32 to complement ES and GS.

Transcriptome sequencing

RNA-seq is increasingly utilized to (1) confirm suspected expression- or splice-altering variants initially prioritized through genomic sequencing, and/or (2) highlight genes that are aberrantly expressed relative to healthy, tissue-matched samples from databases such as GTEx (https://gtexportal.org/).^29,30 BCM, Stanford, and UCLA regularly use RNA-seq data for variant prioritization, and two other sites are actively working to incorporate RNA-seq data into their workflows as well (Extended Data Table 3). Vanderbilt uses PrediXcan to correlate observed phenotypes with imputed, rather than directly measured, gene expression.³³
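One simple way to highlight aberrantly expressed genes relative to reference samples is a per-gene z-score, sketched below. This is an illustrative assumption rather than the method used by any UDN site: the function name, the dictionary-based inputs, and the cutoff of 3 standard deviations are all hypothetical, and real analyses typically need normalization and tissue matching that this sketch omits.

```python
import statistics
from typing import Dict, List


def expression_outliers(
    patient: Dict[str, float],
    reference: Dict[str, List[float]],
    z_cutoff: float = 3.0,
) -> Dict[str, float]:
    """Return genes whose patient expression deviates strongly from the reference samples."""
    outliers = {}
    for gene, value in patient.items():
        ref_values = reference.get(gene, [])
        if len(ref_values) < 2:
            continue  # not enough reference samples to estimate spread
        mean = statistics.mean(ref_values)
        sd = statistics.stdev(ref_values)
        if sd == 0:
            continue  # no variation in the reference; z-score undefined
        z = (value - mean) / sd
        if abs(z) >= z_cutoff:
            outliers[gene] = z
    return outliers
```

The returned z-scores are signed, so strongly negative values point to reduced expression and strongly positive values to overexpression.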

Structured phenotyping

Deep phenotyping of patients is critical to the overall UDN process (Fig. 1a) and enables clinicians to focus on genes associated with a patient’s symptoms or suspected disease. Symptom terms are standardized via the Human Phenotype Ontology (HPO) and explicitly annotated for each UDN case during the in-person evaluation.³⁴ Computational tools can reason over these terms to generate gene panels that complement manual efforts.³⁵ All clinical sites have access to genes ranked by PhenoTips, a program embedded into the UDN data server. Eight clinical sites and BaylorSeq use additional tools to prioritize genes from patients’ phenotypes (Fig. 1j, Extended Data Table 4).³⁶ Amelie is used by five sites to scour the literature for examples of genes causing patients’ observed phenotypes, a process typically performed manually using the Monarch Initiative’s gene–phenotype browser. Exomiser is used by three sites to integrate genotype–phenotype data and runs in parallel to existing pipelines. Finally, pairwise associations between genes and HPO terms are downloadable from the HPO website; the union of genes associated with all annotated HPO terms per patient can be used directly or intersected with sets of disease-relevant genes from OMIM and HGMD. This approach is used by three sites regularly but has been implemented for various projects at all clinical sites.
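The union-then-intersect approach described at the end of this paragraph amounts to a few set operations. In the sketch below, the function name and the in-memory dictionary are assumptions; in practice the term-to-gene mapping would be loaded from the associations file downloaded from the HPO website, and the disease gene set from resources such as OMIM and HGMD.

```python
from typing import Dict, Iterable, Optional, Set


def phenotype_gene_panel(
    patient_terms: Iterable[str],
    term_to_genes: Dict[str, Set[str]],
    disease_genes: Optional[Set[str]] = None,
) -> Set[str]:
    """Union the genes linked to each HPO term, optionally restricted to known disease genes."""
    panel: Set[str] = set()
    for term in patient_terms:
        panel |= term_to_genes.get(term, set())
    if disease_genes is not None:
        panel &= disease_genes
    return panel


# Identifiers follow the HPO format; the gene sets are made-up placeholders.
term_to_genes = {
    "HP:0001250": {"GENE_A", "GENE_B"},
    "HP:0001252": {"GENE_B", "GENE_C"},
}
full_panel = phenotype_gene_panel(["HP:0001250", "HP:0001252"], term_to_genes)
restricted = phenotype_gene_panel(
    ["HP:0001250", "HP:0001252"], term_to_genes, disease_genes={"GENE_B", "GENE_D"}
)
```

Here `full_panel` contains every gene linked to either term, while `restricted` keeps only those also present in the supplied disease gene set.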

Workflow management and wrapper tools

The complex workflows described here must be well documented and customizable per case, and must provide results in a timely manner and an intuitive format. Case materials should be accessible by collaborative teams of clinicians, bioinformaticians, and genetic counselors. In practice, all sites use automated platforms to call, annotate, and prioritize candidate diagnostic variants (Extended Data Table 5, Extended Data Table 6). Spreadsheets, used by all sites, are the most common tool for storing, sharing, and commenting on variant-level data. Many sites also use commercial solutions for case management, which has enabled secure transition of certain workflow components to the cloud.

DISCUSSION

Pinpointing the genetic variants giving rise to ultrarare, undiagnosed diseases is a challenging and pressing problem being tackled on a case-by-case basis by clinical researchers worldwide. The computational tools utilized during these investigative efforts reflect relevant community standards but can also diverge across institutions and even across cases handled by the same clinical team.

The diverse, exploratory techniques employed by UDN clinical sites can overcome inherent limitations of clinical case review and standard sequencing interpretation provided by genetic testing laboratories—both of which rely on existing disease gene knowledge—by uncovering novel disease loci. For instance, when no compelling variants were found in phenotypically prioritized genes in two patients presenting with muscular and white matter abnormalities, a genetics-driven UDN pipeline uncovered diagnostic de novo missense variants in both individuals in TOMM70, a gene previously unassociated with disease.³⁷ Similarly, sequencing analyses were able to uncover de novo, heterozygous variants in nine individuals with neurodevelopmental delay and other multisystem anomalies in CDH2, a gene previously unassociated with a Mendelian neurodevelopmental condition.³⁸

Indeed, divergent aspects of UDN pipelines reflect promising avenues for case reanalysis and reveal areas where technical developments would be most impactful. Improving SV detection specificity would aid in cases with nondiagnostic microarrays, gene panels, and GS. Experimentally verifiable pathogenicity predictions for noncoding variants may solve cases with nondiagnostic ES. Finally, automated integration of additional data, such as RNA-seq,^29,30 long-read sequencing,³⁹ and epigenetic modifications,⁴⁰ may also increase the diagnostic rate for cases with inconclusive GS.

Consensus tools used across sites by multiple clinical research teams have been convincingly evaluated and are easily incorporated into existing workflows external to their original development environment. Clinical sites strive to incorporate better tools—including those developed in-house—as they emerge over time. Flexible, open-source implementations ease this process and can ultimately shorten the time to and improve the rate of diagnosis. Initiatives like the UDN provide an excellent opportunity to assess and share tools and ideas and jointly develop methods inspired by the most challenging undiagnosed cases.

Data availability

All data used in this analysis are available in the Main and Extended Data Tables.

References

Boycott, K. M., Vanstone, M. R., Bulman, D. E. & MacKenzie, A. E. Rare-disease genetics in the era of next-generation sequencing: discovery to translation. Nat. Rev. Genet. 14, 681–691 (2013).
Article CAS Google Scholar
Online Mendelian Inheritance in Man, OMIM. (McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD). https://omim.org.
Robinson, P. N. et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res. 24, 340–348 (2014).
Article CAS Google Scholar
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Article CAS Google Scholar
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
Article CAS Google Scholar
Posey, J. E. et al. Insights into genetics, human biology and disease gleaned from family based genomic studies. Genet. Med. 21, 798–812 (2019).
Article Google Scholar
Splinter, K. et al. Effect of genetic diagnosis on patients with previously undiagnosed disease. N. Engl. J. Med. 379, 2131–2139 (2018).
Article CAS Google Scholar
Macnamara, E. F. et al. Cases from the Undiagnosed Diseases Network: The continued value of counseling skills in a new genomic era. J. Genet. Couns. 28, 194–201 (2019).
Article Google Scholar
Macnamara, E. F. & D’Souza, P, Undiagnosed Diseases Network & Tifft, C. J. The undiagnosed diseases program: approach to diagnosis. Transl. Sci. Rare Dis. 4, 179–188 (2020).
Wambach, J. A. et al. Functional characterization of biallelic RTTN variants identified in an infant with microcephaly, simplified gyral pattern, pontocerebellar hypoplasia, and seizures. Pediatr. Res. 84, 435–441 (2018).
Lee, H. et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA 312, 1880–1887 (2014).
Haghighi, A. et al. An integrated clinical program and crowdsourcing strategy for genomic sequencing and Mendelian disease gene discovery. NPJ Genom. Med. 3, 21 (2018).
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
Philippakis, A. A. et al. The Matchmaker Exchange: a platform for rare disease gene discovery. Hum. Mutat. 36, 915–921 (2015).
Frost, J. H. & Massagli, M. P. Social uses of personal health information within PatientsLikeMe, an online patient community: what can happen when patients have access to one another’s data. J. Med. Internet Res. 10, e15 (2008).
Wang, J. et al. MARRVEL: integration of human and model organism genetic resources to facilitate functional annotation of the human genome. Am. J. Hum. Genet. 100, 843–853 (2017).
Bimber, B. N., Yan, M. Y., Peterson, S. M. & Ferguson, B. mGAP: the macaque genotype and phenotype resource, a framework for accessing and interpreting macaque variant data, and identifying new models of human disease. BMC Genomics 20, 176 (2019).
Meyer, E. et al. Mutations in the histone methyltransferase gene KMT2B cause complex early-onset dystonia. Nat. Genet. 49, 223–237 (2017).
Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9, 4038 (2018).
Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 43, 11.10.1–11.10.33 (2013).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
Mahmoud, M. et al. Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019).
Kosugi, S. et al. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 20, 117 (2019).
Liu, X., Wu, C., Li, C. & Boerwinkle, E. dbNSFP v3.0: a one-stop database of functional predictions and annotations for human nonsynonymous and splice-site SNVs. Hum. Mutat. 37, 235–241 (2016).
Posey, J. E. Genome sequencing and implications for rare disorders. Orphanet J. Rare Dis. 14, 153 (2019).
Mather, C. A. et al. CADD score has limited clinical validity for the identification of pathogenic variants in noncoding regions in a hereditary cancer panel. Genet. Med. 18, 1269–1275 (2016).
Gu, F. et al. A suite of automated sequence analyses reduces the number of candidate deleterious variants and reveals a difference between probands and unaffected siblings. Genet. Med. 21, 1772–1780 (2019).
Lee, H. et al. Diagnostic utility of transcriptome sequencing for rare Mendelian diseases. Genet. Med. 22, 490–499 (2020).
Frésard, L. et al. Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts. Nat. Med. 25, 911–919 (2019).
Shashi, V. et al. A comprehensive iterative approach is highly effective in diagnosing individuals who are exome negative. Genet. Med. 21, 161–172 (2019).
Pena, L. D. M. et al. Looking beyond the exome: a phenotype-first approach to molecular diagnostic resolution in rare and undiagnosed diseases. Genet. Med. 20, 464–469 (2018).
Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
Köhler, S. et al. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 47, D1018–D1027 (2019).
Smedley, D. & Robinson, P. N. Phenotype-driven strategies for exome prioritization of human Mendelian disease genes. Genome Med. 7, 81 (2015).
Gonzalez, M. et al. Innovative genomic collaboration using the GENESIS (GEM.app) platform. Hum. Mutat. 36, 950–956 (2015).
Dutta, D. et al. De novo mutations in TOMM70, a receptor of the mitochondrial import translocase, cause neurological impairment. Hum. Mol. Genet. 29, 1568–1579 (2020).
Accogli, A. et al. De novo pathogenic variants in N-cadherin cause a syndromic neurodevelopmental disorder with corpus collosum, axon, cardiac, ocular, and genital defects. Am. J. Hum. Genet. 105, 854–868 (2019).
Merker, J. D. et al. Long-read genome sequencing identifies causal structural variation in a Mendelian disease. Genet. Med. 20, 159–163 (2017).
Turro, E. et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature 583, 96–102 (2020).

Acknowledgements

Thank you to the UDN Tool Building Coalition for discussions about tools in use or under development, to Daniel Traviglia for clarifications on UDN data availability, and to Rebecca Reimers for writing feedback. Research reported here was supported by the NIH Common Fund, through the Office of Strategic Coordination/Office of the NIH Director under award numbers U01HG007530, U01HG007942, U01HG007672, U01HG007690, U01HG010218, U01HG007703, U01HG010230, U01HG010217, U01HG010233, U01HG007674, and U01HG010215, and by the Intramural Research Program of the National Human Genome Research Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Shilpa Nadimpalli Kobren, Kimberly LeBlanc, Cecilia Esteves, Jyoti G. Daya, Shamil R. Sunyaev & Isaac S. Kohane
Department of Pediatrics, Washington University School of Medicine, St. Louis, MO, USA
Dustin Baldridge & Daniel J. Wegner
Center for Genomic Discovery, University of Utah, Salt Lake City, UT, USA
Matt Velinder
Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
Joel B. Krier & Shamil R. Sunyaev
National Human Genome Research Institute (NHGRI) at the National Institutes of Health (NIH), Bethesda, MD, USA
Barbara N. Pusey
Department of Human Genetics and Hussman Institute for Human Genomics, University of Miami Health System, Miami, FL, USA
Stephan Züchner
Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA
Elizabeth Blue
Department of Human Genetics, David Geffen School of Medicine at the University of California, Los Angeles, CA, USA
Hane Lee & Alden Huang
Department of Pathology and Laboratory Medicine, David Geffen School of Medicine at the University of California, Los Angeles, CA, USA
Hane Lee
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
Lisa Bastarache, Anna Bican & Joy Cogan
Stanford Center for Undiagnosed Diseases, Stanford, CA, USA
Shruti Marwaha
Institute for Genomic Medicine, Columbia University Medical Center, New York City, NY, USA
Anna Alkelai
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
David R. Murdock & Pengfei Liu
Baylor Genetics, Houston, TX, USA
Pengfei Liu
McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
Alexander J. Paul
National Institutes of Health, Undiagnosed Diseases Program Clinical Site, Bethesda, MD, USA
Maria T. Acosta, David R. Adams, Eva Baker, Carsten Bonnenmann, Elizabeth A. Burke, Heather A. Colley, Precilla D’Souza, Joie Davis, Argenia L. Doss, David D. Draper, David J. Eckstein, Carlos Ferreira, Laurie C. Findley, William A. Gahl, Bernadette Gochuico, Rena A. Godfrey, Madison P. Goldrich, Catherine A. Groden, Laryssa Huryn, Donna M. Krasnewich, Grace L. LaMoure, Lea Latham, John MacDowall, Ellen F. Macnamara, Valerie V. Maduro, May Christine V. Malicdan, Laura A. Mamounas, Teri A. Manolio, Thomas C. Markello, Deborah Mosbrook-Davis, John J. Mulvihill, Avi Nath, Donna Novacic, Bradley Power, Barbara N. Pusey, Francis Rossignol, Ben Solomon, Audrey Thurm, Cynthia J. Tifft, Camilo Toro, Tiina K. Urv, Colleen E. Wahl, Lynne A. Wolfe, John Yang, Muhammad Yousef & Wadih Zein
University of Washington and Seattle Children’s Hospital Clinical Site, Seattle, WA, USA
Margaret Adam, Laura Amendola, Michael Bamshad, Anita Beck, Jimmy Bennett, Beverly Berg-Rood, Elizabeth Blue, Brenna Boyd, Peter Byers, Sirisak Chanprasert, Michael Cunningham, Katrina Dipple, Daniel Doherty, Dawn Earl, Ian Glass, Katie Golden-Grant, Sihoun Hahn, Anne Hing, Fuki M. Hisama, Martha Horike-Pyne, Gail P. Jarvik, Jeffrey Jarvik, Suman Jayadev, Christina Lam, Kenneth Maravilla, Heather Mefford, J. Lawrence Merritt, Ghayda Mirzaa, Deborah Nickerson, Wendy Raskind, Natalie Rosenwasser, C. Ron Scott, Angela Sun, Virginia Sybert, Stephanie Wallace, Mark Wener & Tara Wenger
Harvard-affiliated Boston Children’s Hospital, Massachusetts General Hospital, Brigham and Women’s Hospital, and Brigham Genomic Medicine Clinical Site, Boston, MA, USA
Pankaj B. Agrawal, Alan H. Beggs, Gerard T. Berry, Lauren C. Briere, Laurel A. Cobban, Matthew Coggins, Cynthia M. Cooper, Elizabeth L. Fieg, Frances High, Ingrid A. Holm, Susan Korrick, Joel B. Krier, Sharyn A. Lincoln, Joseph Loscalzo, Richard L. Maas, Calum A. MacRae, J. Carl Pallais, Deepak A. Rao, Lance H. Rodan, Edwin K. Silverman, Joan M. Stoler, David A. Sweetser, Melissa Walker & Chris A. Walsh
Baylor College of Medicine, Clinical Site, Houston, TX, USA
Mercedes E. Alejandro, Mahshid S. Azamian, Carlos A. Bacino, Ashok Balasubramanyam, Lindsay C. Burrage, Hsiao-Tuan Chao, Gary D. Clark, William J. Craigen, Hongzheng Dai, Shweta U. Dhar, Lisa T. Emrick, Alica M. Goldman, Neil A. Hanchard, Fariha Jamal, Lefkothea Karaviti, Seema R. Lalani, Brendan H. Lee, Richard A. Lewis, Ronit Marom, Paolo M. Moretti, David R. Murdock, Sarah K. Nicholas, James P. Orengo, Jennifer E. Posey, Lorraine Potocki, Jill A. Rosenfeld, Susan L. Samson, Daryl A. Scott, Alyssa A. Tran & Tiphanie P. Vogel
University of Utah Clinical Site, Salt Lake City, UT, USA
Justin Alvey, Ashley Andrews, Jim Bale, Pinar Bayrak-Toydemir, John Bohnsack, Lorenzo Botto, John Carey, Nicola Longo, Rong Mao, Gabor Marth, Paolo Moretti, Laura Pace, Aaron Quinlan, Matt Velinder & Dave Viskochil
Stanford University Clinical Site, Stanford, CA, USA
Euan A. Ashley, Gill Bejerano, Jonathan A. Bernstein, Devon Bonner, Terra R. Coakley, Liliana Fernandez, Paul G. Fisher, Laure Fresard, Jason Hom, Yong Huang, Jennefer N. Kohler, Elijah Kravets, Marta M. Majcherska, Beth A. Martin, Shruti Marwaha, Colleen E. McCormack, Archana N. Raja, Chloe M. Reuter, Maura Ruzhnikov, Jacinda B. Sampson, Kevin S. Smith, Shirley Sutton, Holly K. Tabor, Brianna M. Tucker, Matthew T. Wheeler, Diane B. Zastrow & Chunli Zhao
University of Miami Clinical Site, Miami, FL, USA
Guney Bademci, Deborah Barbouth, Stephanie Bivona, Olveen Carrasquillo, Ta Chen Peter Chang, Irman Forghani, Alana Grajewski, Rosario Isasi, Byron Lam, Roy Levitt, Xue Zhong Liu, Jacob McCauley, Ralph Sacco, Mario Saporta, Judy Schaechter, Mustafa Tekin, Fred Telischi, Willa Thorson & Stephan Züchner
Washington University of Saint Louis, Clinical Site, Saint Louis, MO, USA
Dustin Baldridge, F. Sessions Cole, Nichole Hayes, Dana Kiley, Kathy Sisco, Jennifer Wambach & Daniel Wegner
Washington University of Saint Louis, Model Organism Screening Center, Saint Louis, MO, USA
Dustin Baldridge, Stephen Pak, Timothy Schedl, Jimann Shin & Lilianna Solnica-Krezel
Children’s Hospital of Philadelphia or University of Pennsylvania Clinical Site, Philadelphia, PA, USA
Edward Behrens, Matthew Deardorff, Marni Falk, Kelly Hassey, Kathleen Sullivan & Adeline Vanderver
Vanderbilt University Clinical Site, Nashville, TN, USA
Anna Bican, Elly Brokamp, Joy D. Cogan, Laura Duncan, Rizwan Hamid, Jennifer Kennedy, Mary Kozuira, John H. Newman, John A. Phillips III, Lynette Rives, Amy K. Robertson & Emily Solem
University of California, Los Angeles, Clinical Site, Los Angeles, CA, USA
Gabrielle Brown, Manish J. Butte, Esteban C. Dell’Angelica, Naghmeh Dorrani, Emilie D. Douine, Brent L. Fogel, Irma Gutierrez, Alden Huang, Deborah Krakow, Hane Lee, Sandra K. Loo, Bryan C. Mak, Martin G. Martin, Julian A. Martinez-Agosto, Elisabeth McGee, Stanley F. Nelson, Shirley Nieves-Rodriguez, Christina G. S. Palmer, Jeanette C. Papp, Neil H. Parker, Genecee Renteria, Rebecca H. Signer, Janet S. Sinsheimer, Jijun Wan, Lee-kai Wang, Katherine Wesseling Perry & Jeremy D. Woods
University of Alabama Coordinating Center, Birmingham, AL, USA
William E. Byrd, Andrew B. Crouse, Matthew Might, Mariko Nakano-Okuno & Jordan Whitlock
Duke University Clinical Site, Durham, NC, USA
Heidi Cope, Allyn McConkie-Rosell, Kelly Schoch, Vandana Shashi, Edward C. Smith, Rebecca C. Spillmann, Jennifer A. Sullivan, Queenie K.-G. Tan & Nicole M. Walley
Mayo Clinic Metabolomics Core, Rochester, MN, USA
Surendra Dasari, Brendan C. Lanpher, Ian R. Lanza, Eva Morava & Devin Oglesbee
Baylor Genetics Sequencing Core, Houston, TX, USA
Christine M. Eng, Pengfei Liu & Patricia A. Ward
Harvard Medical School Coordinating Center, Boston, MA, USA
Cecilia Esteves, Isaac S. Kohane, Kimberly LeBlanc, Alexa T. McCray, Anna Nagy & Amelia L. M. Tan
Columbia University Clinical Site, New York City, NY, USA
David B. Goldstein
Baylor College of Medicine, Model Organism Screening Center, Houston, TX, USA
Michael F. Wangler & Shinya Yamamoto
University of Oregon, Model Organism Screening Center, Eugene, OR, USA
Monte Westerfield

Consortia

Undiagnosed Diseases Network

Contributions

Conceptualization: S.N.K., S.R.S., I.S.K. Data curation: S.N.K., D.B., M.V., J.B.K., B.N.P., S.Z., E.B., H.L., A.H., L.B., A.B., J.C., S.M., A.A., D.R.M., P.L., D.J.W., A.J.P. Formal analysis: S.N.K. Funding acquisition: I.S.K. Investigation: S.N.K., S.R.S., I.S.K. Methodology: S.N.K. Visualization: S.N.K. Writing – original draft: S.N.K. Writing – review & editing: S.N.K., D.B., M.V., K.L., C.E., S.R.S.

Corresponding author

Correspondence to Isaac S. Kohane.

Ethics declarations

Competing interests

P.L. is an employee of Baylor College of Medicine and derives support through a professional services agreement with Baylor Genetics, which performs clinical genetic testing services. The other authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Kobren, S.N., Baldridge, D., Velinder, M. et al. Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases. Genet Med 23, 1075–1085 (2021). https://doi.org/10.1038/s41436-020-01084-8

Received: 11 August 2020
Revised: 14 December 2020
Accepted: 17 December 2020
Published: 12 February 2021
Issue Date: June 2021
DOI: https://doi.org/10.1038/s41436-020-01084-8
