Abstract
Purpose
Genomic sequencing has become an increasingly powerful and relevant tool to be leveraged for the discovery of genetic aberrations underlying rare, Mendelian conditions. Although the computational tools incorporated into diagnostic workflows for this task are continually evolving and improving, we nevertheless sought to investigate commonalities across sequencing processing workflows to reveal consensus and standard practice tools and highlight exploratory analyses where technical and theoretical method improvements would be most impactful.
Methods
We collected details regarding the computational approaches used by a genetic testing laboratory and 11 clinical research sites in the United States participating in the Undiagnosed Diseases Network via meetings with bioinformaticians, online survey forms, and analyses of internal protocols.
Results
We found that tools for processing genomic sequencing data can be grouped into four distinct categories. Whereas well-established practices exist for initial variant calling and quality control steps, there is substantial divergence across sites in later stages for variant prioritization and multimodal data integration, demonstrating a diversity of approaches for solving the most mysterious undiagnosed cases.
Conclusion
The largest differences across diagnostic workflows suggest that advances in structural variant detection, noncoding variant interpretation, and integration of additional biomedical data may be especially promising for solving chronically undiagnosed cases.
Similar content being viewed by others
INTRODUCTION
Next-generation exome sequencing (ES) and genome sequencing (GS) have revolutionized the process for diagnosing rare and novel genetic conditions.1 Traditionally, the diagnostic process has primarily been driven by phenotype, with clinicians comparing patients’ symptoms to others encountered in their prior experience and clinical training and/or to a knowledgebase of known human diseases.2 In a typical undiagnosed case, however, either a patient’s phenotype is not indicative of any known disease, or tests to confirm the presence of a suspected genetic condition are inconclusive. In these instances, ES and GS have enabled health-care providers to pursue a genetics-driven diagnostic approach in parallel, where the genetic variation uncovered in a patient can be assessed with respect to not only its known phenotypic associations3 but also to its prevalence in background populations,4 predicted pathogenicity,5 functional consequences, and mode of inheritance to reveal novel disease-causing loci. Indeed, while traditional clinical case review and directed diagnostic assays continue to solve difficult cases, ~74% of newly diagnosed genetic conditions have been attributed to analyses of ES and GS data.6,7 However, the diagnosis rate for patients with potentially unique genetic conditions is still ~35%,7 suggesting ample opportunity for methodological improvements to advance our understanding of the genetic underpinnings of phenotypic extremes.
With this goal in mind, cross-institutional initiatives such as Care4Rare in Canada (http://care4rare.ca) and Solve-RD in Europe (http://solve-rd.eu) have been established to connect and enable clinical researchers to uncover the genetic origins of disease in undiagnosed patients. In addition to furthering basic genetics research, these efforts have provided scores of patients with an end to diagnostic uncertainty and access to additional services.8 The most expansive undiagnosed initiative in the United States is the Undiagnosed Diseases Network (UDN), which encompasses 12 clinical sites and has, since its inception in 2014, cumulatively diagnosed over 400 individuals and described over 30 novel syndromes.7 Each UDN clinical site is staffed with specialists who develop and apply complex suites of bioinformatics tools to analyze sequencing data and uncover disease-causing variants.9 These sites each underwent a competitive application process and were selected to join the UDN due to their demonstrated track record of diagnosing difficult cases and characterizing novel genetic conditions through ongoing research efforts. The workflows implemented at these sites are thus representative of the state-of-the-art in rare disease diagnostic efforts.
We gathered details about 12 UDN bioinformatics pipelines, determined recurrent steps in a typical diagnostic evaluation, and identified consensus approaches. Moreover, we highlight substantial differences across pipelines regarding overall organization and incorporated tools. The comprehensive snapshot of effective computational workflows presented here can direct clinical teams interested in initiating genomic sequencing usage or re-evaluating patients who have had inconclusive genetic testing.
MATERIALS AND METHODS
Participating sites
Sequence analysis pipeline details were collected from the CLIA-certified sequencing core at Baylor Genetics (BaylorSeq) and 11 UDN clinical sites: Baylor College of Medicine (BCM), Duke University and Columbia University Institute for Genomic Medicine (Duke/Columbia), three Harvard-affiliated hospitals and Brigham Genomic Medicine (Harvard), University of Miami Miller School of Medicine (Miami), National Institutes of Health (NIH), University of Washington School of Medicine and Seattle Children’s Hospital (PacificNW), Stanford Center for Undiagnosed Diseases (Stanford), University of California–Los Angeles (UCLA), University of Utah Health Center for Genetic Discovery (Utah), Vanderbilt University Medical Center (Vanderbilt), and Washington University School of Medicine (WUSTL). The University of Pennsylvania and Children’s Hospital of Philadelphia clinical site had yet to process sequencing data for a UDN case at the time of writing and thus is excluded from this study.
Data collection
We systematically collected details about each UDN site’s computational diagnostic workflows using a combination of in-person and virtual meetings with bioinformaticians and genetic counselors, online survey forms, and inspections of published papers and internal protocols.10,11,12
RESULTS
Overview of diagnostic workflow components
Before applying to the UDN, a patient has typically endured extensive prior testing by multiple clinicians over the course of a multiyear “diagnostic odyssey.” As part of the application process, UDN clinical sites review patients’ health records to assess whether the UDN evaluation may aid in the identification of a diagnosis. Accepted patients undergo an in-person evaluation at a clinical site (Fig. 1a). In most cases, blood, saliva, and/or fibroblast samples of affected and unaffected individuals in the family are collected during this evaluation or beforehand via mailed-in collection kits. These samples are sequenced at BaylorSeq; all sequencing data are made available to the clinical site within weeks (Fig. 1b). Variants in disease-causing genes related to the clinical phenotype, medically actionable pathogenic variants in disease-causing genes unrelated to the clinical phenotype, and heterozygote status for select recessive Mendelian conditions are listed in a clinical report issued by BaylorSeq in accordance with the UDN protocol and following American College of Medical Genetics and Genomics (ACMG) variant classification guidelines.13 At 8 of the 11 clinical sites surveyed, researchers simultaneously perform local analyses of the sequencing data in an attempt to identify “strong candidate” variants that may explain the patient’s symptoms (Fig. 1c, d); three surveyed sites run their local pipelines only when BaylorSeq’s clinical report is inconclusive. Once candidate variants are highlighted via clinical sites’ and BaylorSeq’s analyses, there are three ways by which their causality is established. First, human and animal databases are queried for genotype-matched individuals with symptomatic concordance with the patient.14,15,16,17 Second, experiments are simultaneously performed to evaluate the in vivo effect of candidate variants in model organisms or cell lines. Third, the presence of secondary phenotypes indicated by genotype-matched individuals or in vivo experiments are confirmed in affected patients (Fig. 1e). Causal variants revealed through these steps are confirmed by Sanger sequencing, broadly shared by the UDN (Extended Data Note 1), and ideally lead to a molecular diagnosis for a patient, which in and of itself represents a turning point in a patient’s diagnostic odyssey, and also can inform positive therapeutic changes (Fig. 1f).18
The computational tools used to find explanatory genetic variants change constantly with newly available technologies and newly encountered disease etiologies. Despite these iterative improvements to bioinformatics pipelines, the primary roles that computational tools play in the overall variant prioritization process can be categorized as follows: (1) aligning sequencing reads to a reference human genome (Fig. 1g), (2) identifying genetic variants present in the individual from the sequencing reads (Fig. 1h), (3) annotating those variants with relevant information (Fig. 1i), and finally (4) filtering and prioritizing variants that are likely to cause the patient’s condition (Fig. 1j). In the following sections, we delve into the purpose of and tools used in each of these categories.
Aligning next-generation sequencing reads
Aligning next-generation sequencing reads to a reference human genome is the necessary first step for all sequence analysis pipelines (Fig. 1g); the ubiquity of this step has resulted in community-driven standardization.19 Eight sites regularly realign reads after BaylorSeq’s initial alignment, whereas three sites realign reads only in specific circumstances, such as during reanalysis of a patient’s prior sequencing data. Realignment is necessary for six sites whose pipelines are configured for the GRCh37/hg19 human genome build, as genetic testing laboratories including BaylorSeq now provide reads aligned to the newer GRCh38/hg38 build. Realignment uses either an open-source implementation of the Burrows–Wheeler Aligner (BWA-MEM) (used regularly by six sites and in specific circumstances, as described above, by two sites) or Illumina/Edico’s DRAGEN aligner (used regularly by BaylorSeq and two clinical sites and in specific circumstances by one clinical site).
Simple variant calling
Calling single-nucleotide variants (SNVs) and short insertions and deletions (indels) from aligned reads is the next step in sequence processing (Fig. 1h) and is often accomplished using the Genome Analysis Toolkit (GATK) best practices workflow,20 though Google’s DeepVariant21 and Real Time Genomics’ PolyBayes implementation (https://www.realtimegenomics.com) perform competitively for this task and are used in addition to GATK by two clinical sites. BaylorSeq calls variants using Illumina/Edico’s DRAGEN platform. Six clinical sites and BaylorSeq “jointly” call variants across samples as recommended in GATK to rescue low coverage true variants and accurately model false variants. In practice, variants are jointly called with (1) members of the same family, (2) other UDN patients at the same site, and/or (3) healthy patients internal or external to an institution. The Variant Quality Score Recalibration (VQSR) step recommended by GATK to identify technical artifacts, however, may misclassify real rare variants as false positives; this step is carefully reviewed or omitted in practice.
Structural variant detection
In contrast to calling simple variants, calling structural variants (SVs) from GS data is a relatively divergent step, indicating that best practices have yet to be determined. SVs refer to large (>50 bp) insertions and deletions, duplications and other copy-number variants (CNVs), short tandem repeat (STR) expansions, translocations where genomic regions have moved within or across chromosomes, and inversions where a detached stretch of DNA was reattached in the opposite orientation. Combining the output from many SV calling tools—each optimized for detecting complementary types of SVs and often using distinct information (e.g., read depth, paired-end reads, or split reads)—is necessary for comprehensive SV detection.22 Existing SV detection tools have been reviewed in depth;23 here we list the subset of tools that are actively used by UDN sites (Table 1, Extended Data Table 1). The most commonly used tool, Manta, has been shown by independent evaluations to have high sensitivity but also a high false positive rate.24 Future development of SV benchmarking data sets for assessing the accuracy of SV detection tools will be essential in directing the current diverse exploration of techniques toward community-established best practices.
Quality control of called variants
Confirming the quality of sequencing data and variants is critical to avoid expending downstream analyses on false variants. CLIA-certified genetic testing laboratories check the quality of unaligned and map-aligned sequencing reads prior to variant calling for all clinical grade sequencing (Extended Data Note 2). Four UDN clinical sites regularly confirm the quality of sequencing reads using a combination of FASTQC, FASTP, MultiQC, BEDTools (to check coverage), and bam.iobio. Other clinical sites begin quality control (QC) only after read alignment and variant calling.
QC for Mendelian disease diagnosis encompasses three checks: (1) sequencing reads are high quality, (2) sequenced samples correspond to the correct individuals and have expected relatedness, and (3) inheritance patterns across families are as expected (Table 2, Extended Data Table 1). BaylorSeq performs QC for all clinical genomic sequencing before providing data to UDN clinical sites. However, when patients provide their own sequencing data (as opposed to BaylorSeq providing newly acquired data) or when “research” (as opposed to clinical) sequencing is provided, clinical sites perform QC. Most sites have nearly identical steps for check 1 and similar QC for checks 2 and 3. In practice, QC has identified incorrectly related or labeled samples and poor overall quality of sequencing reads that were remedied via resequencing before subsequent analyses.11 Notably, existing QC tools rarely “flag” anomalous samples; users must accurately interpret results.
Annotation and filtering of genetic variants
Even after removing low quality calls, a single genome can have several thousand unique genetic variants uncovered. Efficient, automated annotation and filtering of these variants is the next step of the variant prioritization process (Fig. 1i, Extended Data Table 2). Annotations fall into four categories: (1) known disease associations, (2) prevalence across healthy human populations, (3) predicted pathogenicity and functional effect, and (4) inheritance. Many scores exist across the first three categories;25 in the following sections we explore those that are used in practice for rare disease diagnosis.
Known disease-associated genes
Many specific genetic variants have previously been determined to cause human disease, and it is useful to first look for the presence of these variants in a patient’s sequencing data. Databases compiling disease-causing variants, the genes they impact, and their phenotypic associations are used by ten clinical sites (Table 3). Genetic testing laboratories, including BaylorSeq, use these in addition to internal databases containing similar information. Disease-relevant variants are listed on clinical reports and are considered during the initial pass of each UDN case at all clinical sites.
Variant segregation in healthy human populations
Several positions within the human genome naturally vary across healthy individuals, and “common” variants at these positions are unlikely to cause the conditions under investigation by the UDN. Though rare combinations of otherwise common variants may lead to disease,26 clinical sites do not currently consider all common variant combinations. Instead, variants observed more than 1 in 100 times across healthy populations (i.e., minor allele frequency [MAF] > 0.01) are typically excluded during the first pass of the data. The exact MAF threshold used depends on the suspected mode of inheritance. Lower MAF thresholds are used for suspected dominant conditions because the variants causing the extremely rare phenotypes of UDN patients are assumed to be naturally selected against and thus equally rare in the general population and entirely absent in control population databases. Higher MAF thresholds are used for suspected recessive conditions because heterozygous individuals would not be expected to manifest severe disease features.
All UDN sites use data from the Broad Institute’s Genome Aggregation Database (gnomAD) to compute MAFs, and seven sites also compute MAFs from smaller or population-specific data sets on a case-by-case basis (Table 3). Two sites eliminate variants that are homozygous in three or more healthy individuals in these data sets. At the NIH site, rather than thresholding on MAFs computed directly from variant proportions in gnomAD, 95% Wilson confidence score intervals computed from these proportions are used to retain rare variants occurring in low coverage regions. Finally, five sites flag variants that are present in data sets internal to their institutions, because variants present in asymptomatic or differently symptomatic individuals are unlikely to be disease-relevant.
Eight sites consult SV databases to check the existence and/or MAF of detected SVs (Table 3, Extended Data Table 1). Multiple databases are checked in practice because the SV detection tools used across databases differ, so the absence or rarity of an SV in one database may reflect a particular SV detection approach rather than true population rarity.
Simple genetic variation observed across healthy humans tends to be sparsely distributed with varying degrees of impact. These features can be used to capture how regions of the human genome may be intolerant of loss-of-function (LoF) variants, such as frameshift or protein-truncating variants. Nine surveyed sites incorporate selective constraint scores derived from and released with gnomAD data in their diagnostic pipelines, with the probability of heterozygous LoF intolerance scores and missense constraint Z scores used most commonly (Table 3).
Predicted pathogenicity and functional effect of variants
Various tools predict the pathogenicity of uncovered variants.25 Values derived from cross-species comparative genomics contribute heavily to pathogenicity predictors, as positions that are conserved across species tend to be functionally critical. However, since most candidate coding variants are evolutionarily well-conserved, only five sites directly consider conservation in their diagnostic pipelines (Table 4, Extended Data Table 1).
The most commonly used pathogenicity predictors for rare disease diagnosis—used by eight clinical sites each—are Combined Annotation Dependent Depletion (CADD) and Rare Exome Variant Ensemble Learner (REVEL), each of which consider multiple variant annotations and where scores >25 and >0.3 respectively indicate likely pathogenic variants. Nearly all predicted pathogenicity scores used, with the exception of ReMM, indicate disease relevance primarily for coding variants.27
Indeed, predicting and experimentally validating the pathogenic impact of noncoding variants is notoriously difficult. All 12 sites use tools to predict how noncoding variants alter expected gene expression and splicing. Few sites use the same subset of tools for this task, though SpliceAI is the most commonly used tool overall (Table 4).
Mode of inheritance
After variants have been quality checked, MAF filtered, and annotated, Mendelian mode of inheritance is evaluated next by the clinical sites. Some sites simultaneously consider the functional impact of variants, where, for instance, intergenic or perceived synonymous variants are excluded.3 Despite the ubiquity of this step, each site uses different tools for computing inheritance patterns.
For a dominantly inherited genetic condition to manifest, only one defective copy of the relevant gene is required, whereas recessive disease manifestation requires two defective gene copies. GS of unrelated or distantly related affected individuals is desired in suspected dominant cases to find rare, shared variants.
In sporadic cases—caused by a single de novo dominant or two recessive variants—GS of at least the affected individual and both unaffected parents is desired. Selecting heterozygous variants in the affected individual that are absent in both unaffected parents or homozygous variants in the affected individual that are absent in at least one parent via straightforward segregation analysis results in a majority of spurious de novo calls. These false positive calls stem from inadequate sequence coverage or alignment in parents from whom variants were in fact inherited and/or inaccurate modeling of underlying variant frequencies. Four sites regularly use specialized de novo calling tools or databases to offset these issues (Table 2). Fixing de novo calling errors requires analysis of sequencing reads, which many genetic testing centers do not readily provide.
Occasionally in sporadic and/or recessive cases, the same disease-causing variant is inherited from both heterozygous parents and can be easily detected as a homozygous variant. Genomic regions containing only homozygous variants in an affected individual with nonconsanguineous parents can also indicate an inherited deletion from one parent or uniparental isodisomy. These latter phenomena, revealed as Mendelian violations during the QC process (Table 2), can manifest in a recessive disease despite only one parent being heterozygous for the disease-causing variant. Often in undiagnosed recessive cases, two or more different heterozygous variants, each either inherited or occurring de novo, can give rise to the disease phenotype; these variants are referred to as compound heterozygous pairs. The complete set of compound heterozygous variant pairs in any given case is very large, and so filters—such as restricting to rare, LoF, likely pathogenic variants—are applied beforehand. If too few candidate explanatory variants pass these filters, the NIH, WUSTL and Miami sites use internal “second tier” schemes, such as increasing the allowable MAF threshold, to rescue additional compound heterozygous pairs.28
Integration of nonsequencing data
Cases with nondiagnostic genetic testing have eventually been solved by reanalysis approaches that leverage additional data, such as transcriptome sequencing29,30 (RNA-seq) or “deep phenotyping,”31,32 to complement ES and GS.
Transcriptome sequencing
RNA-seq is increasingly utilized to (1) confirm suspected expression- or splice-altering variants initially prioritized through genomic sequencing, and/or (2) highlight genes that are aberrantly expressed relative to healthy, tissue-matched samples from databases such as GTEx (https://gtexportal.org/).29,30 BCM, Stanford, and UCLA regularly use RNA-seq data for variant prioritization, and two other sites are actively working to incorporate RNA-seq data into their workflows as well (Extended Data Table 3). Vanderbilt uses PrediXcan to correlate observed phenotypes with imputed, rather than directly measured, gene expression.33
Structured phenotyping
Deep phenotyping of patients is critical to the overall UDN process (Fig. 1a) and enables clinicians to focus on genes associated with a patient’s symptoms or suspected disease. Symptom terms are standardized via the Human Phenotype Ontology (HPO) and explicitly annotated for each UDN case during the in-person evaluation.34 Computational tools can reason over these terms to generate gene panels that complement manual efforts.35 All clinical sites have access to genes ranked by PhenoTips, a program embedded into the UDN data server. Eight clinical sites and BaylorSeq use additional tools to prioritize genes from patients’ phenotypes (Fig. 1j, Extended Data Table 4).36 Amelie is used by five sites to scour the literature for examples of genes causing patients’ observed phenotypes, a process typically performed manually using the Monarch Initiative’s gene–phenotype browser. Exomiser is used by three sites to integrate genotype–phenotype data and runs in parallel to existing pipelines. Finally, pairwise associations between genes and HPO terms are downloadable from the HPO website; the union of genes associated with all annotated HPO terms per patient can be used directly or intersected with sets of disease-relevant genes from OMIM and HGMD. This approach is used by three sites regularly but has been implemented for various projects at all clinical sites.
Workflow management and wrapper tools
The complex workflows described here must be well-documented, customizable per case, and provide results in a timely manner and intuitive format. Case materials should be accessible by collaborative teams of clinicians, bioinformaticians, and genetic counselors. In practice, all sites use automated platforms to call, annotate, and prioritize candidate diagnostic variants (Extended Data Table 5, Extended Data Table 6). Spreadsheets are the most common tool used by all sites for storing, sharing, and commenting on variant-level data. Many sites also use commercial solutions for case management, which has enabled secure transition of certain workflow components to the cloud.
DISCUSSION
Pinpointing the genetic variants giving rise to ultrarare, undiagnosed diseases is a challenging and pressing problem being tackled on a case-by-case basis by clinical researchers worldwide. The computational tools utilized during these investigative efforts reflect relevant community standards but can also diverge across institutions and even across cases handled by the same clinical team.
The diverse, exploratory techniques employed by UDN clinical sites can overcome inherent limitations of clinical case review and standard sequencing interpretation provided by genetic testing laboratories—both of which rely on existing disease gene knowledge—by uncovering novel disease loci. For instance, when no compelling variants were found in phenotypically prioritized genes in two patients presenting with muscular and white matter abnormalities, a genetics-driven UDN pipeline uncovered diagnostic de novo missense variants in both individuals in TOMM70, a gene previously unassociated with disease.37 Similarly, sequencing analyses were able to uncover de novo, heterozygous variants in nine individuals with neurodevelopmental delay and other multisystem anomalies in CDH2, a gene previously unassociated with a Mendelian neurodevelopmental condition.38
Indeed, divergent aspects of UDN pipelines reflect promising avenues for case reanalysis and reveal areas where technical developments would be most impactful. Improving SV detection specificity would aid in cases with nondiagnostic microarrays, gene panels, and GS. Experimentally verifiable pathogenicity predictions for noncoding variants may solve cases with nondiagnostic ES. Finally, automated integration of additional data, such as RNA-seq,29,30 long-read sequencing,39 and epigenetic modifications,40 may also increase the diagnostic rate for cases with inconclusive GS.
Consensus tools used across sites by multiple clinical research teams have been convincingly evaluated and are easily incorporated into existing workflows external to their original development environment. Clinical sites strive to incorporate better tools—including those developed in-house—as they emerge over time. Flexible, open-source implementations ease this process and can ultimately shorten the time to and improve the rate of diagnosis. Initiatives like the UDN provide an excellent opportunity to assess and share tools and ideas and jointly develop methods inspired by the most challenging undiagnosed cases.
Data availability
All data used in this analysis are available in the Main and Extended Data Tables.
References
Boycott, K. M., Vanstone, M. R., Bulman, D. E. & MacKenzie, A. E. Rare-disease genetics in the era of next-generation sequencing: discovery to translation. Nat. Rev. Genet. 14, 681–691 (2013).
Online Mendelian Inheritance in Man, OMIM. (McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD). https://omim.org.
Robinson, P. N. et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res. 24, 340–348 (2014).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
Posey, J. E. et al. Insights into genetics, human biology and disease gleaned from family based genomic studies. Genet. Med. 21, 798–812 (2019).
Splinter, K. et al. Effect of genetic diagnosis on patients with previously undiagnosed disease. N. Engl. J. Med. 379, 2131–2139 (2018).
Macnamara, E. F. et al. Cases from the Undiagnosed Diseases Network: The continued value of counseling skills in a new genomic era. J. Genet. Couns. 28, 194–201 (2019).
Macnamara, E. F. & D’Souza, P, Undiagnosed Diseases Network & Tifft, C. J. The undiagnosed diseases program: approach to diagnosis. Transl. Sci. Rare Dis. 4, 179–188 (2020).
Wambach, J. A. et al. Functional characterization of biallelic RTTN variants identified in an infant with microcephaly, simplified gyral pattern, pontocerebellar hypoplasia, and seizures. Pediatr. Res. 84, 435–441 (2018).
Lee, H. et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA 312, 1880–1887 (2014).
Haghighi, A. et al. An integrated clinical program and crowdsourcing strategy for genomic sequencing and Mendelian disease gene discovery. NPJ Genom. Med. 3, 21 (2018).
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
Philippakis, A. A. et al. The Matchmaker Exchange: a platform for rare disease gene discovery. Hum. Mutat. 36, 915–921 (2015).
Frost, J. H. & Massagli, M. P. Social uses of personal health information within PatientsLikeMe, an online patient community: what can happen when patients have access to one another’s data. J. Med. Internet Res. 10, e15 (2008).
Wang, J. et al. MARRVEL: integration of human and model organism genetic resources to facilitate functional annotation of the human genome. Am. J. Hum. Genet. 100, 843–853 (2017).
Bimber, B. N., Yan, M. Y., Peterson, S. M. & Ferguson, B. mGAP: the macaque genotype and phenotype resource, a framework for accessing and interpreting macaque variant data, and identifying new models of human disease. BMC Genomics 20, 176 (2019).
Meyer, E. et al. Mutations in the histone methyltransferase gene KMT2B cause complex early-onset dystonia. Nat. Genet. 49, 223–237 (2017).
Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9, 4038 (2018).
Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 43, 11.10.1–11.10.33 (2013).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
Mahmoud, M. et al. Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019).
Kosugi, S. et al. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol 20, 117 (2019).
Liu, X., Wu, C., Li, C. & Boerwinkle, E. dbNSFP v3.0: a one-stop database of functional predictions and annotations for human nonsynonymous and splice-site SNVs. Hum. Mutat. 37, 235–241 (2016).
Posey, J. E. Genome sequencing and implications for rare disorders. Orphanet J. Rare Dis. 14, 153 (2019).
Mather, C. A. et al. CADD score has limited clinical validity for the identification of pathogenic variants in noncoding regions in a hereditary cancer panel. Genet. Med. 18, 1269–1275 (2016).
Gu, F. et al. A suite of automated sequence analyses reduces the number of candidate deleterious variants and reveals a difference between probands and unaffected siblings. Genet. Med. 21, 1772–1780 (2019).
Lee, H. et al. Diagnostic utility of transcriptome sequencing for rare Mendelian diseases. Genet. Med. 22, 490–499 (2020).
Frésard, L. et al. Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts. Nat. Med. 25, 911–919 (2019).
Shashi, V. et al. A comprehensive iterative approach is highly effective in diagnosing individuals who are exome negative. Genet. Med. 21, 161–172 (2019).
Pena, L. D. M. et al. Looking beyond the exome: a phenotype-first approach to molecular diagnostic resolution in rare and undiagnosed diseases. Genet. Med. 20, 464–469 (2018).
Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
Köhler, S. et al. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 47, D1018–D1027 (2019).
Smedley, D. & Robinson, P. N. Phenotype-driven strategies for exome prioritization of human Mendelian disease genes. Genome Med. 7, 81 (2015).
Gonzalez, M. et al. Innovative genomic collaboration using the GENESIS (GEM.app) platform. Hum. Mutat. 36, 950–956 (2015).
Dutta, D. et al. De novo mutations in TOMM70, a receptor of the mitochondrial import translocase, cause neurological impairment. Hum. Mol. Genet 29, 1568–1579 (2020).
Accogli, A. et al. De novo pathogenic variants in N-cadherin cause a syndromic neurodevelopmental disorder with corpus collosum, axon, cardiac, ocular, and genital defects. Am. J. Hum. Genet. 105, 854–868 (2019).
Merker, J. D. et al. Long-read genome sequencing identifies causal structural variation in a Mendelian disease. Genet. Med. 20, 159–163 (2017).
Turro, E. et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature 583, 96–102 (2020).
Acknowledgements
Thank you to the UDN Tool Building Coalition for discussions about tools in use or under development, to Daniel Traviglia for clarifications on UDN data availability, and to Rebecca Reimers for writing feedback. Research reported here was supported by the NIH Common Fund, through the Office of Strategic Coordination/Office of the NIH Director under award numbers U01HG007530, U01HG007942, U01HG007672, U01HG007690, U01HG010218, U01HG007703, U01HG010230, U01HG010217, U01HG010233, U01HG007674, and U01HG010215, and by the Intramural Research Program of the National Human Genome Research Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author information
Authors and Affiliations
Consortia
Contributions
Conceptualization: S.N.K., S.R.S., I.S.K. Data curation: S.N.K., D.B., M.V., J.B.K., B.N.P., S.Z., E.B., H.L., A.H., L.B., A.B., J.C., S.M., A.A., D.R.M., P.L., D.J.W., A.J.P. Formal analysis: S.N.K. Funding acquisition: I.S.K. Investigation: S.N.K., S.R.S., I.S.K. Methodology: S.N.K. Visualization: S.N.K.; Writing—original draft: S.N.K. Writing—review & editing: S.N.K., D.B., M.V., K.L., C.E., S.R.S.
Corresponding author
Ethics declarations
Competing interests
P.L. is an employee of Baylor College of Medicine and derives support through a professional services agreement with Baylor Genetics, which performs clinical genetic testing services. The other authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kobren, S.N., Baldridge, D., Velinder, M. et al. Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases. Genet Med 23, 1075–1085 (2021). https://doi.org/10.1038/s41436-020-01084-8
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41436-020-01084-8
This article is cited by
-
International Undiagnosed Diseases Programs (UDPs): components and outcomes
Orphanet Journal of Rare Diseases (2023)
-
Simulation of undiagnosed patients with novel genetic conditions
Nature Communications (2023)
-
Genetic pain loss disorders
Nature Reviews Disease Primers (2022)