Application of full-genome analysis to diagnose rare monogenic disorders

Shieh, Joseph T.; Penon-Portmann, Monica; Wong, Karen H. Y.; Levy-Sakin, Michal; Verghese, Michelle; Slavotinek, Anne; Gallagher, Renata C.; Mendelsohn, Bryce A.; Tenney, Jessica; Beleford, Daniah; Perry, Hazel; Chow, Stephen K.; Sharo, Andrew G.; Brenner, Steven E.; Qi, Zhongxia; Yu, Jingwei; Klein, Ophir D.; Martin, David; Kwok, Pui-Yan; Boffelli, Dario

doi:10.1038/s41525-021-00241-5


Published: 23 September 2021

npj Genomic Medicine volume 6, Article number: 77 (2021)

A Publisher Correction to this article was published on 12 October 2021

This article has been updated

Abstract

Current genetic tests for rare diseases provide a diagnosis in only a modest proportion of cases. The Full-Genome Analysis method, FGA, combines long-range assembly and whole-genome sequencing to detect small variants, structural variants with breakpoint resolution, and phasing. We built a variant prioritization pipeline and tested FGA’s utility for diagnosis of rare diseases in a clinical setting. FGA identified structural variants and small variants with an overall diagnostic yield of 40% (20 of 50 cases) and 35% in exome-negative cases (8 of 23 cases); 4 of these 8 diagnoses were structural variants. FGA detected and mapped structural variants that are missed by short reads, including a non-coding duplication, and phased variants across long distances of more than 180 kb. With the prioritization algorithm, longer DNA technologies could replace multiple tests for monogenic disorders and expand the range of variants detected. Our study suggests that genomes produced from technologies like FGA can improve variant detection and provide higher resolution genome maps for future application.

Introduction

Current approaches to diagnosis of monogenic conditions include short-read sequencing of exomes or genomes^1,2,3,4,5,6. Although the diagnostic yield from these methods is promising, ranging from 26 to 40%^1,5, they leave many cases unresolved^2,7. The yield can be augmented by reanalysis against recently discovered disease-associated variants and genes^6,8, or by using family-based analysis to identify de novo variants⁴, but improvements are modest.

Two factors are principally responsible for diagnostic failures using short-read sequencing. First, short-read sequencing does not give a complete representation of the genome. For example, exome sequencing does not detect the majority of structural variants (SVs), cannot create chromosomal maps, misses variants in exons that are not captured efficiently, rarely detects repeats, and misses non-exonic variants^{9,10,11,12,13}. Even whole-genome sequencing (WGS), which provides up to 9% additional diagnostic yield compared to exome sequencing^5,6,7,14, cannot detect all structural variants (especially duplications, inversions, and translocations), create chromosomal maps, or provide phasing information. In addition, detection of structural variants by short-read WGS requires additional analytical processes that are often not fully implemented in clinical settings^15,16. Second, genetic diagnosis of rare disorders often entails “experiments of one,” where many sequence variants found in the proband must be vetted against current knowledge (gene/variants and genome reference) to decide if variants meet the diagnostic criteria¹⁷. Our incomplete biological knowledge limits our ability to identify the “causal variant” for any particular patient. Until we understand the functional consequences of more variants, or more patients with the same phenotypes are found, many candidate variants remain variants of uncertain significance.

Newly developed long DNA sequencing and mapping technologies can help solve the problem of incomplete genome analysis^9,10,12. Genome sequencing methods that produce long contigs (sets of adjacent DNA segments that together represent a consensus region of DNA) promise several advantages over short-read sequencing alone^{9,10,11,13,18}. First, long-read methods can resolve more easily large SVs (including large deletions, large insertions, translocations, and inversions), eliminating the need for additional genetic tests^9,19. Second, long-read sequencing detects insertions/deletions (indels) of intermediate size (500 bp to 50 kb)¹¹ more readily. These variants escape detection by clinical microarray because they are too small and by short-read sequencing because of clinical pipeline limitations or challenging filtering. Third, the methodology of genome reconstruction provides opportunities to detect rearrangement variants that have evaded detection¹¹. Fourth, single-basepair resolution of rearrangement breakpoints allows for the determination of the precise location of each insertion or duplication, making it possible to see if structural variants disrupt genes or other sequences of functional significance. Finally, long contigs provide unique phasing information to determine the haplotype on which a variant occurs (e.g. cis or trans) and can help resolve recessive disease-associated alleles¹⁸.

Here we test the diagnostic capabilities of an approach we call “Full-Genome Analysis” (FGA), which combines linked-read sequencing technology and optical mapping to produce contigs with a median length of ~100 Mb. To analyze the data in an unbiased and comprehensive fashion, an automated genetic variant interpretation pipeline was built to select and prioritize variants based on the individual patient phenotype. We show that FGA, when applied to patients with rare disorders in a clinical setting, leads to new diagnoses for patients with a variety of variant types. We further show that FGA is capable of detecting translocations, intermediate-sized copy number variants, and phased biallelic variants, all of which can be responsible for disease but are usually missed by analysis of short-read sequencing data alone. Based on its efficiency in diagnosis, FGA opens the prospect of resolving more complex parts of the genome and identifying a more comprehensive set of genetic variants in rare disorder diagnosis^9,10,12.

Results

Automated genetic variant interpretation pipeline performance

Using an automated genetic variant interpretation pipeline, we performed FGA on 50 undiagnosed cases to determine diagnostic yield and asked if it could help solve cases that had not been diagnosed with previous testing. The test was carried out in a clinical environment, allowing clinicians to propose unsolved cases for further genomic sequencing; cases were included only if prior testing was negative and characteristics of the case did not suggest any further specific test. Of the 50 cases, 23 had negative prior commercial trio whole-exome sequencing and 42 had a prior negative microarray. FGA was performed and also compared to WGS structural variant (SV) calling.

An automated pathogenicity assessment pipeline was built in-house to filter and prioritize variants based on the reported clinical features of recruited patients. In addition to analyzing single-nucleotide variants (SNVs) and small indels, this pipeline was designed specifically to handle structural variations that often evade detection by short-read technologies. Our SV modules integrate the two complementary datasets (linked-read sequencing and optical mapping) and are capable of reporting clinically relevant translocations, inversions, deletions, duplications, insertions and other types of complex SVs of virtually all sizes. This pipeline ranks variants for every case and considers each potential inheritance pattern (details in methods and supplementary methods).

Overall, our automated pipeline identified 20 diagnostic cases (14 SNPs/indels and 6 SVs, Table 1). Most diagnostic cases (n = 14) were ranked as the top variant in their respective inheritance/SV groups (Supplementary Table 1). By detecting additional variation beyond standard clinical testing modalities, FGA yielded novel genomic information for discovery in undiagnosed patients. The total diagnostic yield was 40% from the 50 cases tested (20 of 50 cases), with FGA detecting both new structural variants and SNVs that were missed by previous sequencing and short-read annotation. FGA diagnosed 35% of exome-negative cases (8 of 23 cases) (Table 1). Four of these were structural variants missed by exome sequencing, three were SNVs/indels missed due to lack of annotation, and one was a suspected mosaic indel, pending further validation. The diagnostic yield for cases that did not have prior exome sequencing was 44% (12 of 27 cases). We also identified candidate variants in 60% of the remaining undiagnosed cases (18 of 30 cases) for future follow-up (Supplementary Table 2).
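The percentages in this paragraph follow directly from the case counts. As a quick check, the short sketch below recomputes them; the function name and dictionary labels are illustrative only and are not part of the published pipeline.

```python
from fractions import Fraction


def yield_percent(diagnosed: int, tested: int) -> int:
    """Diagnostic yield as a whole-number percentage, rounded half up."""
    return int(Fraction(diagnosed * 100, tested) + Fraction(1, 2))


# (diagnosed or flagged, total) pairs quoted in the text.
cohorts = {
    "all cases": (20, 50),
    "exome-negative": (8, 23),
    "no prior exome": (12, 27),
    "undiagnosed with candidates": (18, 30),
}

yields = {name: yield_percent(d, n) for name, (d, n) in cohorts.items()}

# The two exome subgroups partition the whole cohort.
assert 8 + 12 == 20 and 23 + 27 == 50
```

The 30 cases in the last row are the 50 tested cases minus the 20 diagnosed ones.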

Table 1 Clinical and molecular features of diagnostic cases tested with FGA (n = 20).

Full-genome analysis

FGA solved three classes of cases where short-read sequencing or microarray analyses had previously failed to detect the causal variants: (1) cryptic heterozygous structural variants (e.g. NHEJ1-IHH, WAC), particularly variants of intermediate size; (2) translocations; and (3) missed phased heterozygous variants, for example in trans for recessive disorders (e.g. TSPEAR) (Table 1). Here we describe examples of diagnostic findings and the performance of the automated pipeline.

Non-coding structural variation

In a 9-month-old female with craniosynostosis and syndactyly, we found a rare 32 kb heterozygous de novo intronic duplication within the NHEJ1 gene (case 1703). FGA identified the breakpoints of a duplication at chr2: 219,102,933–219,134,970 (genome version GRCh38) (Fig. 1). Only by familial mapping studies have similar duplications been described in cases of craniosynostosis and syndactyly^20,21 (named Chromosome 2q35 Duplication Syndrome, OMIM #185900). The breakpoints identified here narrow the critical region of the NHEJ1 intron that is important for the condition^20,21. This heterozygous de novo duplication was detected by both optical mapping (Bionano) and linked-read sequencing (10x Genomics) technologies. The duplication occurred adjacent to the original segment in tandem, information readily identified using optical mapping (Fig. 1a). This mid-sized structural variant (~32 kb) was not detected by standard microarray analysis because it is small and intronic. It also escaped detection by our short-read WGS copy number variant calling, but was easily identified by FGA (Table 2).

**Fig. 1: Heterozygous, intronic tandem duplication (32 kb) in *NHEJ1*.**

Table 2 Comparison of duplication calls between short-read WGS CNV and genome assembly technologies. Calls for the 32 kb intronic NHEJ1 duplication case.

The duplication affects an enhancer for the Indian Hedgehog (IHH) gene, located upstream in the third intron of the neighboring NHEJ1²⁰. ENCODE data support the enhancer function of this intronic region (Supplementary Fig. 1). The structural variant breakpoints defined by FGA narrow the intronic region responsible for this condition.

Genomic rearrangements

It can be challenging to readily detect translocations with current clinical sequencing pipelines without specific additional informatic analyses. In contrast, we were able to detect translocations readily using FGA with our automated pipeline. For example, we found a germline translocation between chromosomes 1 and 9 in a 2-year-old male with a history of neuroblastoma and developmental delay who had negative microarray and exome sequencing (case 0703; Fig. 2, in which both linked-read genome sequencing and optical mapping support the translocation). Trio analysis indicated that the translocation occurred de novo; it was subsequently verified by cytogenetic chromosome analysis (Supplementary Fig. 2, karyotype: 46,XY,t(1;9)(p32.3;p21)). FGA identified the precise breakpoint locations on chromosome 1 and chromosome 9 (chr1: 49,553,194 and chr9: 29,096,674, respectively, genome version GRCh38). The breakpoints were non-exonic, occurring in an intronic region of AGBL4 and an upstream/untranslated region near LINGO2. FGA also revealed that the translocation occurred on the paternal allele, with a small breakpoint deletion suggesting non-homologous end joining, and that an additional maternally inherited intronic deletion was present on the other allele. AGBL4, encoding a cytosolic carboxypeptidase, has a potential role in neuroblastoma, autism and developmental delay. Copy number alteration has also been reported in LINGO2 in neuroblastoma cell lines^22,23. Although several other genes in this case had de novo variants (MYH11, GABRA2, TFE3), none of these were clearly etiologic. Neuroblastoma predisposition genes ALK and PHOX2B were also negative²⁴. The de novo translocation suggested a new etiology for this condition, which could be explored in future studies^22,25. Interestingly, the short-read WGS copy number calling showed hundreds of potential breakpoint junctions that needed further analysis for diagnostic use. In contrast, FGA had at least 8-fold fewer candidates (Supplementary Table 3). Furthermore, de novo assembly from FGA promptly identified the event as a translocation.

**Fig. 2: Structural rearrangement detection with de novo assembly and linked reads; t(1:9)(p33,p21).**

Translocations have implications for future reproductive risks. A diagnostic strategy that encompasses what chromosome analysis and microarray do in a single diagnostic test could also serve to detect balanced events which are important for family planning in carriers. More complex rearrangements were also promptly detected among significantly fewer candidates and localized with FGA (e.g. unbalanced insertional translocation, case 4603, Supplementary Fig. 3, Supplementary Table 4) providing novel variants for future characterization.

Deletions

FGA was also capable of pinpointing genomic breakpoints of clinically significant deletion copy number variants. FGA identified 36 kb deletions disrupting TANGO2 (OMIM #616878, case 5103) in siblings with a history of episodic rhabdomyolysis, metabolic acidosis, and ketosis (Fig. 3). A 1480 bp de novo deletion in WAC (DeSanto-Shinawi syndrome, OMIM #616708, case 4203) was found in a male patient with seizures, hypotonia, developmental delay and nonfamilial features like low-set ears and brachycephaly (Supplementary Fig. 4). We also identified a 5000 bp de novo deletion in 2p15 in a female with seizures and developmental delay (2p16.1-p15 deletion syndrome, OMIM #612513, case 4803), which implicates USP34. FGA simultaneously identified and verified such variants and breakpoints without the additional copy number prediction tools or external validation required by current short-read sequencing pipelines. Short-read sequencing copy number calling was able to detect these deletions, albeit sometimes with incorrect zygosity (Supplementary Tables 5–7). FGA had an advantage compared to short-read sequencing in identifying deletions in challenging regions of the genome near segmental duplications, in agreement with our previous studies⁹.

**Fig. 3: Deletion disrupting *TANGO2* (chr22: 20,039,637–20,075,714 and chr22: 20,041,469–20,075,432, genome version GRCh38).**

Small variant detection and biallelic phased variants

FGA also yielded coding variants, similar to short-read WGS; however, phasing was now uniquely possible given the longer DNA segments. Discerning that variants reside on separate chromosomes is important for diagnoses involving compound heterozygous recessive variants; FGA is capable of making this determination in a single proband test. We identified two TSPEAR variants in a female with oligo/hypodontia, missing 15 adult teeth, but no previous family history. The two TSPEAR NM_144991 variants were found 180 kb away from each other. FGA phasing clearly showed that the variants occurred in trans, suggesting that both parents are heterozygous carriers, which was confirmed. The first variant is a 10-base insertion that leads to a frameshift, c.51_52insGGCCCCCGGC, p.His18fs, while the second variant is nonsense, c.1281G>A, p.Trp427Ter; together the variants confirm the biallelic etiology (Fig. 4a). The large phasing block (chr21: 29,801,272–44,927,448, genome version GRCh38) created by FGA was able to discern the two haplotypes and determine that the variants are in trans, even with the affected individual’s sequence only. TSPEAR was only recently associated with tooth agenesis, which is why it was missed by prior sequencing; loss-of-function variants in TSPEAR are associated with ectodermal dysplasia 14, hair/tooth type, with or without hypohidrosis (OMIM #618180)^26,27. WGS may have identified these variants, but only phasing using FGA is sufficient to make a diagnosis on the proband alone. In detecting compound heterozygous variants, phasing information is valuable since one of the variants might be de novo and data from parents are not always available to exclude the possibility that variants are in cis. FGA also localized additional diagnostic de novo SNVs, similar to whole-genome or exome sequencing, but also determined which parental allele was affected by the mutation. For example, a 14-year-old girl with short stature, neurodevelopmental disability, and cardiac valvular disease who had not had prior exome testing had a de novo missense variant by FGA, occurring on the paternal allele in SMAD4, NM_005359:c.1498A>G, p.Ile500Val (Fig. 4b), diagnostic of Myhre syndrome (OMIM #139210), which is associated with increased risk of pericardial, pulmonary, and tracheal fibrosis, as well as skeletal and vascular complications. The diagnosis was useful for patient management. De novo mutations become more common with advancing paternal age²⁸. For some medical conditions, the allele affected (maternal or paternal) could also determine whether mutations are clinically significant or not (e.g. imprinted regions).
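To make the cis/trans distinction concrete, the following minimal sketch classifies two heterozygous variants from phased genotype strings. It assumes the VCF convention in which `|` separates phased alleles and phase is only comparable within one phase set; the function names and toy inputs are illustrative and do not reproduce the study's software.

```python
def alt_haplotype(genotype: str):
    """Return 0 or 1 for the haplotype carrying the alternate allele.

    Returns None for unphased, non-diploid, or non-heterozygous calls.
    """
    if "|" not in genotype:
        return None  # '/' means the call is unphased
    alleles = genotype.split("|")
    if len(alleles) != 2:
        return None
    carriers = [i for i, allele in enumerate(alleles) if allele not in ("0", ".")]
    return carriers[0] if len(carriers) == 1 else None


def phase_relation(gt_a: str, phase_set_a: str, gt_b: str, phase_set_b: str) -> str:
    """Classify two heterozygous variants as 'cis', 'trans', or 'unknown'."""
    hap_a = alt_haplotype(gt_a)
    hap_b = alt_haplotype(gt_b)
    if hap_a is None or hap_b is None:
        return "unknown"
    if phase_set_a != phase_set_b:
        return "unknown"  # different phase blocks cannot be compared
    return "cis" if hap_a == hap_b else "trans"


# Toy example: two variants in one phase block, on opposite haplotypes.
relation = phase_relation("0|1", "block1", "1|0", "block1")
```

A `trans` result for two loss-of-function variants in the same gene is the configuration expected for a compound heterozygous recessive diagnosis; an `unknown` result is what a proband-only short-read test would typically leave for variants 180 kb apart.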

**Fig. 4: Variant haplotype distinction.**

Discussion

In genomic medicine, rare disease diagnostics has traditionally been challenged by the rarity of the disorders and testing limitations. Here, we described the FGA approach with automated analysis using linked-read sequencing and optical mapping to evaluate a full spectrum of genetic variants implicated in rare genetic diseases. The automated pipeline integrates the longer DNA technologies into the diagnostic realm by enabling a streamlined variant detection protocol and minimizing biases introduced during the analysis process. This data-driven approach results in a drastic decrease in human intervention and ensures that every case is evaluated thoroughly. We find that genome assemblies can be used in clinical testing strategies detecting all types of genetic variants concurrently. FGA detects and localizes SVs such as duplications that are missed by WGS and can easily identify translocations and phase variants across long distances. With variant detection from longer DNA technologies, we can improve the detection of diagnostic variants and provide higher resolution genome maps for future studies.

For individuals with undiagnosed conditions, these technologies encompass what is currently provided by the combination of chromosome analysis (karyotyping), microarray testing, and short-read WGS⁵. By identifying novel SVs and phasing variants, they provide diagnostic information beyond current clinical tests. The automated pipeline also provides internal validation of SVs, bypassing the need for additional time and blood for testing¹⁴. By constructing de novo genome assemblies and identifying variants that do not map to the genome reference, these technologies can also provide additional information for future analysis. These strengths make the technologies highly suitable for early implementation in diagnostic evaluations, particularly if a specific genetic condition or type of variant is not immediately suspected⁵.

As expected, genome assemblies are able to detect duplications and translocations more efficiently than short-read sequencing. The longer DNA techniques also have practical advantages over traditional genetic testing strategies because they can detect phased variants for recessive conditions, as well as the full spectrum of structural variants. Therefore, FGA makes it possible to effectively test probands even when parents/family members are not available for testing. This is useful in intensive care units or in other settings where rapid diagnosis is vital to clinical care^29,30. Variant phasing, that is, the cis or trans configuration, can be critical in the rapid evaluation for clinical significance. FGA also returns a high-quality genome reconstruction, which is useful for resolving complex or novel regions of the genome. Such de novo assemblies are not reference-dependent and SV calling can be achieved without making inferences that are necessary in short-read sequencing^9,10,11,19.

The number of diagnostic cases attributable to SVs was striking in our study, as 50% of the diagnosed exome-negative cases (4 out of 8 cases) were solved by identifying an SV or rearrangement. We also identified at least one highly probable SV or SNV candidate in more than half of the remaining undiagnosed patients. These cases do not meet diagnostic criteria for several reasons. SVs overlapping similar regions do not always produce the same phenotype. This is particularly limiting since most SVs are not recurrent and thus do not share identical breakpoints. Furthermore, unless a critical region can be established or a syndrome is associated with a very distinctive phenotype, it is unclear whether an SV or SNV can be diagnostic even if it is de novo. Most importantly, SV/CNV databases are strikingly sparse and inconsistent, in contrast to SNV databases. Further genotype correlation and functional testing are needed in the future.

The application of hybrid technologies with long-range sequencing, like FGA, in genomic medicine is not without limitations. First, our automated variant interpretation pipeline is based on existing annotation databases. Genetic variations cannot be ranked or annotated well if they are not found in these resources¹⁴. Second, even with the use of long molecules averaging 200–300 kb in our optical mapping experiments, they are not long enough to resolve the large, near-identical segmental duplications in some of the most complex regions of the human genome. A small number of these complex regions remain inaccessible despite using long-range sequencing and mapping technologies⁹. Truly whole or complete sequencing of genomes depends on the technical platform, analytical pipelines and thorough annotation⁵. Third, the current human reference genome is a set of composite haplotypes generated from 8 anonymous DNA donors³¹. As such, there are functionally important sequences found in many people around the world but that are missing from the reference genome^10,32. Since the reference genome serves as the benchmark for all analyses, missing sequences are never assessed, thus making variants in these regions undiagnosable.

FGA can be implemented for diagnostic purposes with minor modification of workflow in the clinical laboratory. Sample handling must be adapted to protect DNA integrity, which is required to obtain longer DNA fragments. The bioinformatic workflow is easily implemented in a clinical setting with phased haplotypes and structural variants as direct output. This is in sharp contrast to the workflow required for the detection of structural variants from short-read data.

WGS is expected to become the method of choice for genetic diagnosis, given the greater number of variants it detects relative to exome sequencing or microarray analysis^5,6,14. In choosing technology for the acquisition of whole-genome data, one should consider costs and the complexity of analysis, as well as the completeness of the data and the continuing value of the data for future reanalysis. The inherent amount of missing data in genomes generated by short-read sequencing reduces their ability to complete clinical diagnoses in challenging cases. Data reanalysis is becoming a successful strategy to identify variants that underlie disease in a patient’s genome¹⁴; as our understanding of deleterious variants grows, it is possible to revisit previously acquired data and assign significance to previously detected variants. FGA’s ability to acquire a more extensive set of variants increases the likelihood that future reanalysis will be productive. More importantly, by identifying previously unknown variants, FGA makes it possible to explore their functional significance.

The increase in diagnostic yield produced by FGA in this study, attributable to advances such as structural variant detection, has made it possible to solve cases that were negative by short-read sequencing. Full realization of FGA’s potential to provide comprehensive detection of clinical variants will require a combination of automated capture of phenotypic terms with expanded expertise in variant interpretation¹⁴. Comprehensive assessment of the genome in every undiagnosed patient would rapidly produce both genome maps of annotated functional variants and new diagnostic possibilities. The result would be a better understanding of population variation, and improved diagnostics for direct clinical care.

Methods

DNA extraction and preparation

High molecular-weight DNA was extracted and isolated using the Bionano Prep Blood Isolation Kit following the manufacturer protocol (Bionano Genomics). Bionano optical mapping libraries were prepared following the manufacturer protocol (Bionano Genomics). 10x Genomics linked-read sequencing libraries were built as published⁹ using the GemCode platform (10x Genomics).

Optical mapping and linked-read data generation and processing

Optical mapping on the Bionano Irys and Saphyr platforms was used to produce de novo assemblies and identify structural variants and rearrangements. DNA was labeled using Nick, Label, Repair and Stain (NLRS) and/or Direct Label and Staining Technologies (DLS). The first uses a nicking endonuclease that recognizes a specific 6-7 basepair sequence and creates a single-strand nick, filled with fluorescent nucleotides. The second uses a single direct-labeling enzymatic reaction to attach a fluorophore to a specific 6-basepair DNA sequence motif. Labeled DNA libraries were loaded onto the Bionano Genomics Irys^TM Chip or Saphyr^TM Chip, linearized and visualized using the Irys^TM or Saphyr^TM system, which detects the fluorescent labels along each molecule. Single molecule maps were assembled de novo into genome maps using Bionano Solve with the default settings¹². Genome assembly and alignment was performed using IrysView/IrysSolve software. For optical mapping, we performed embedding of cells, long DNA extraction and Chip run over a total 3.25 days.

Sequencing data were obtained from 10x Genomics linked-read libraries sequenced to ~60X coverage using an Illumina sequencer. Reads were aligned to GRCh38 using LongRanger and variants were identified using the callers integrated in the 10x pipeline, including GATK HaplotypeCaller for SNPs and indels. SNPs and indels were kept for analysis if the minor allele frequency was ≤5% as reported in the gnomAD database.
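The allele-frequency cut-off can be expressed as a one-line predicate. The sketch below is illustrative only: the dictionary layout and the `gnomad_af` key are hypothetical, and treating variants absent from gnomAD as rare is an assumption that the text does not state.

```python
MAX_ALLELE_FREQUENCY = 0.05  # the ≤5% threshold quoted in the text


def is_rare(gnomad_af) -> bool:
    """Keep a variant if its gnomAD allele frequency is at most 5%.

    Assumption: None means the variant is absent from gnomAD and is kept.
    """
    return gnomad_af is None or gnomad_af <= MAX_ALLELE_FREQUENCY


# Toy variants; identifiers and frequencies are made up for illustration.
variants = [
    {"id": "var1", "gnomad_af": 0.0002},
    {"id": "var2", "gnomad_af": 0.31},
    {"id": "var3", "gnomad_af": None},
    {"id": "var4", "gnomad_af": 0.05},
]

kept_ids = [v["id"] for v in variants if is_rare(v["gnomad_af"])]
```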

Automated variant interpretation pipeline

An automated clinical interpretation and prioritization pipeline was built in-house using custom and publicly available software (Supplementary Fig. 5). Electronic health records were exported into JSON format for parsing with the clinical natural language processing (NLP) algorithm ClinPhen. Since some HPO terms were already manually curated from previous sequencing studies, these terms were combined with the non-redundant ones generated from NLP. HPO hierarchical terms separated by 1 degree were also included as part of the clinical phenome. Every HPO term (h) is assigned a weight, which is defined as the inverse of the total number of disease genes associated with it.

$$\text{weight}(h) = \frac{1}{\text{total number of genes associated with } h} \qquad (1)$$

Next, we overlapped the clinical phenome of the proband with a list of known phenotypic features associated with mutations in a given gene (G). The overlapping terms were used to calculate a gene sum score to identify and rank clinically relevant genes.

$$\text{Gene sum score of gene } G = \sum_{i=1}^{n} \text{weight}(h_i) \qquad (2)$$

where $h_1, \dots, h_n$ are the HPO terms shared by the proband's clinical phenome and gene $G$.
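Equations (1) and (2) can be sketched in a few lines. The example below is a minimal illustration, not the in-house pipeline: the function names, the gene-to-term mapping, and the placeholder term and gene identifiers are all hypothetical.

```python
def hpo_weights(gene_to_terms):
    """Equation (1): weight of a term = 1 / number of genes associated with it."""
    gene_counts = {}
    for terms in gene_to_terms.values():
        for term in set(terms):
            gene_counts[term] = gene_counts.get(term, 0) + 1
    return {term: 1.0 / count for term, count in gene_counts.items()}


def gene_sum_scores(patient_terms, gene_to_terms):
    """Equation (2): sum the weights of terms shared by the patient and each gene.

    Returns (gene, score) pairs sorted from most to least relevant.
    """
    weights = hpo_weights(gene_to_terms)
    patient = set(patient_terms)
    scores = {}
    for gene, terms in gene_to_terms.items():
        shared = patient & set(terms)
        if shared:
            scores[gene] = sum(weights[term] for term in shared)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy knowledge base with placeholder identifiers.
toy_gene_to_terms = {
    "GENE_A": ["TERM_1", "TERM_2"],
    "GENE_B": ["TERM_1"],
    "GENE_C": ["TERM_3"],
}
ranked = gene_sum_scores(["TERM_1", "TERM_2"], toy_gene_to_terms)
```

In this toy case `TERM_1` is shared by two genes and so carries a weight of 0.5, while `TERM_2` is specific to one gene and carries a weight of 1.0. The inverse weighting therefore rewards genes that match the patient's more specific features.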

SNP and indel exonic variants identified from the proband were overlapped with the ranked gene list. The same strategy was applied to all SVs overlapping genic exons. Scores were normalized for comparing and calculating confidence scores. All SVs were vetted against a set of regions known to be associated with deletion and duplication syndromes. Structural variants and copy number variants were screened first using the entire cohort, for single or double occurrences. These were then compared to the gnomAD SV database. Of note, three out of the four diagnostic SVs are absent from gnomAD and were also rare in our platform. Additionally, all translocations and inversions were included by default. Prioritized variants were reviewed manually to determine which one was diagnostic. We initially focused on structural variants and did not systematically annotate deep intronic variation or short tandem repeats. Mitochondrial DNA candidates were annotated and manually verified.

This tool parses SNPs, indels, and structural variations (SVs) from 10x Genomics linked-read and Bionano optical mapping data based on trio sequencing (singleton is allowed). SNP/indel analysis can be done alone or in combination with SV analysis. In general, this tool parses a patient’s electronic health record in JSON format and outputs a clinically relevant gene list. This gene list is then used to inform how genetic variants are prioritized. Genetic variants (SNPs, indels, and SVs) are vetted against a set of controls and parents. For SNPs and indels, variants are filtered based on allele frequencies reported by gnomAD. Small variants reported as likely benign or benign by either ClinVar or InterVar are discarded from the pipeline. For SVs, the prevalence of these variants is compared against a set of 1KGP + CIAPM controls sequenced previously by the Kwok lab. See below for more details.

A pre-processing step (for SNPs and indels) is required to run this software. This pre-processing step takes the 10xG GATK output and applies filters based on GQ, DP, and PASS. This step removes the bulk of the variants that are likely to be artifacts. Remaining variants are annotated using InterVar, a wrapper for ANNOVAR that assigns ACMG pathogenicity to all variants. Variants are additionally filtered to retain frameshift, nonframeshift, nonsynonymous, stopgain, stoploss, and splicing variants. They are overlapped with the ranked gene list generated previously. All remaining variants are ranked by the reported pathogenicity based on ClinVar/InterVar and then by the gene sum score (see manuscript for details).
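The filtering and ranking order described above can be sketched as follows. This is an assumption-laden illustration: the GQ and DP thresholds are hypothetical (the text does not give them), the record layout is invented, and the pathogenicity labels are the standard ACMG terms rather than confirmed pipeline values.

```python
KEPT_EFFECTS = {"frameshift", "nonframeshift", "nonsynonymous",
                "stopgain", "stoploss", "splicing"}
DISCARDED_CLASSES = {"Benign", "Likely benign"}
# Lower number = ranked earlier.
PATHOGENICITY_RANK = {"Pathogenic": 0, "Likely pathogenic": 1,
                      "Uncertain significance": 2}


def prioritize(variants, gene_scores, min_gq=20, min_dp=10):
    """Filter small variants, then rank by pathogenicity and gene sum score.

    min_gq and min_dp are hypothetical defaults chosen for illustration.
    """
    kept = []
    for v in variants:
        if v["filter"] != "PASS" or v["gq"] < min_gq or v["dp"] < min_dp:
            continue  # likely artifact
        if v["effect"] not in KEPT_EFFECTS:
            continue  # not a retained functional class
        if v["acmg"] in DISCARDED_CLASSES:
            continue  # benign or likely benign
        if v["gene"] not in gene_scores:
            continue  # gene not on the clinically relevant list
        kept.append(v)
    unknown_rank = len(PATHOGENICITY_RANK)
    return sorted(
        kept,
        key=lambda v: (PATHOGENICITY_RANK.get(v["acmg"], unknown_rank),
                       -gene_scores[v["gene"]]),
    )


def _v(vid, gene, effect, acmg, flt="PASS", gq=99, dp=30):
    return {"id": vid, "gene": gene, "effect": effect, "acmg": acmg,
            "filter": flt, "gq": gq, "dp": dp}


toy_scores = {"GENE_A": 1.5, "GENE_B": 0.5, "GENE_C": 2.0}
toy_variants = [
    _v("v1", "GENE_A", "nonsynonymous", "Uncertain significance"),
    _v("v2", "GENE_B", "stopgain", "Pathogenic"),
    _v("v3", "GENE_A", "synonymous", "Uncertain significance"),
    _v("v4", "GENE_B", "stopgain", "Pathogenic", flt="LowQual"),
    _v("v5", "GENE_A", "frameshift", "Likely benign"),
    _v("v6", "GENE_C", "frameshift", "Uncertain significance"),
]
ranked_ids = [v["id"] for v in prioritize(toy_variants, toy_scores)]
```

Sorting on a tuple keeps pathogenicity as the primary key, so the gene sum score only breaks ties within a pathogenicity class.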

For SVs, insertions, deletions, and duplications are annotated with known exons (exon-level, not gene-level). Duplications and deletions are additionally used to search for known microdeletion and microduplication syndromes. Inversions and translocations are annotated with known genes (gene-level), and every call in these two categories is always reported.

SV scripts have been designed to analyze Bionano optical mapping data (.smap file format) and 10x linked-read data (.vcf file format). All SV scripts read a proband file, a mother file, a father file, and a reference file, which consists of SV calls from the 1000 Genomes Project cohort as well as SV calls from all parents in the study other than the parents of the proband being analyzed. The scripts output filtered proband calls with additional descriptor columns as a tab-delimited txt file.

BioNanoDeletions, BioNanoInsertions, and BioNanoDuplications select calls of the SV type and eliminate calls below the inputted confidence threshold (default: 0.5). They perform a 50% reciprocal overlap with the reference file and remove calls that overlap. They perform a 50% reciprocal overlap with the inputted mother and father file separately and append columns (Found_in_Mother, Found_in_Father) to describe the overlap (True/False). They overlap with exons and phenotypes and append columns (Gene, Phenotype) with the gene name and phenotype if found.
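A 50% reciprocal overlap requires the shared region to cover at least half of each of the two calls. The sketch below shows this test and how it could drive the control filter and the Found_in_Mother / Found_in_Father columns. The `SV` type and the function names are hypothetical, and zero-length calls are treated as non-overlapping, which is a simplification.

```python
from typing import Iterable, List, NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int


def reciprocal_overlap(a: SV, b: SV, min_frac: float = 0.5) -> bool:
    """True if the shared span covers >= min_frac of BOTH calls."""
    if a.chrom != b.chrom:
        return False
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return False
    return (shared >= min_frac * (a.end - a.start)
            and shared >= min_frac * (b.end - b.start))


def found_in(call: SV, others: Iterable[SV]) -> bool:
    return any(reciprocal_overlap(call, other) for other in others)


def filter_and_annotate(proband: List[SV], reference: List[SV],
                        mother: List[SV], father: List[SV]) -> List[dict]:
    """Drop proband calls seen in controls; flag parental overlap."""
    rows = []
    for call in proband:
        if found_in(call, reference):
            continue  # present in the control set, so removed
        rows.append({
            "call": call,
            "Found_in_Mother": found_in(call, mother),
            "Found_in_Father": found_in(call, father),
        })
    return rows


# Made-up coordinates, for illustration only.
proband = [SV("chr1", 1000, 2000), SV("chr1", 5000, 6000), SV("chr2", 100, 1100)]
reference = [SV("chr1", 5100, 6100)]
mother = [SV("chr1", 1200, 2100)]
father = [SV("chr1", 1900, 10000)]
rows = filter_and_annotate(proband, reference, mother, father)
```

The reciprocal requirement is what keeps a small proband call from being matched to a much larger control call that merely contains it.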

BioNanoInversions and BioNanoTranslocations select calls of the SV type and do not filter for confidence. They create 20 kb intervals around the start point and end point of the call. They overlap the start and end intervals with the reference file and remove calls that overlap. They overlap the start and end intervals with the mother and father file separately and append columns (Found_in_Mother, Found_in_Father) to describe the overlap (True/False). They overlap start and end intervals with genes and phenotypes and append columns (Gene, Phenotype for start point; Gene2, Phenotype2 for end point) with the gene name and phenotype if found.
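For inversions and translocations, matching is done on breakpoint windows instead of reciprocal overlap. The following sketch shows one reading of that step, under two assumptions not confirmed by the text: the 20 kb interval is centered on each breakpoint (10 kb on either side), and two calls are treated as the same event only when both breakpoints match. All names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Breakpoints(NamedTuple):
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def window(pos: int, size: int = 20_000) -> Tuple[int, int]:
    """Interval of total length `size` centered on a breakpoint."""
    half = size // 2
    return max(0, pos - half), pos + half


def breakpoint_matches(chrom: str, pos: int, other_chrom: str, other_pos: int,
                       size: int = 20_000) -> bool:
    """True if the other call's breakpoint falls inside this call's window."""
    if chrom != other_chrom:
        return False
    start, end = window(pos, size)
    return start <= other_pos <= end


def same_event(call: Breakpoints, other: Breakpoints, size: int = 20_000) -> bool:
    """Assumption: both the start and end breakpoints must match."""
    return (breakpoint_matches(call.chrom1, call.pos1, other.chrom1, other.pos1, size)
            and breakpoint_matches(call.chrom2, call.pos2, other.chrom2, other.pos2, size))


# Made-up coordinates, for illustration only.
proband_call = Breakpoints("chr1", 1_000_000, "chr5", 2_000_000)
near = Breakpoints("chr1", 1_008_000, "chr5", 1_995_000)
far = Breakpoints("chr1", 1_015_000, "chr5", 2_000_000)
```

Passing `size=10_000` would give a narrower window, so `near` would no longer match `proband_call`.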

tenxDeletions reads the 10x deletion calls (“dels.vcf”), performs a 50% reciprocal overlap with the reference file, and removes calls that overlap. It performs a 50% reciprocal overlap with the inputted mother and father files separately and appends columns (Found_in_Mother, Found_in_Father) to describe the overlap (True/False). It overlaps with exons and phenotypes and appends columns (Gene, Phenotype) with the gene name and phenotype if found.

tenxLargeSvDeletions and tenxLargeSvDuplications read the 10x large SV calls (“large_svs.vcf”) and select calls of the SV type. They perform a 50% reciprocal overlap with the reference file and remove calls that overlap. They perform a 50% reciprocal overlap with the inputted mother and father files separately and append columns (Found_in_Mother, Found_in_Father) to describe the overlap (True/False). They overlap with exons and phenotypes and append columns (Gene, Phenotype) with the gene name and phenotype if found.

tenxLargeSvInversions, tenxLargeSvUnknown, and tenxLargeSvBreakends read the 10x large SV calls (“large_svs.vcf”) and select calls of the SV type. They create 10 kb intervals around the start point and end point of the call. They overlap the start and end intervals with the reference file and remove calls that overlap. They overlap the start and end intervals with the mother and father files separately and append columns (Found_in_Mother, Found_in_Father) to describe the overlap (True/False). They overlap the start and end intervals with genes and phenotypes and append columns (Gene, Phenotype for start point; Gene2, Phenotype2 for end point) with the gene name and phenotype if one is found. For unknown and breakend types, only variants with a quality score more than 1 standard deviation above the mean are reported. All reported coordinates are based on hg38.
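The quality cutoff for unknown and breakend calls can be written as a small filter. This sketch assumes that the mean and standard deviation are computed over all calls of that type within one sample and that the population standard deviation is used; the text does not say which. The `qual` key is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import List


def high_quality(calls: List[dict], n_sd: float = 1.0) -> List[dict]:
    """Keep calls whose quality exceeds mean + n_sd * SD of the input set."""
    quals = [c["qual"] for c in calls]
    if not quals:
        return []
    cutoff = mean(quals) + n_sd * pstdev(quals)
    return [c for c in calls if c["qual"] > cutoff]


# Made-up scores: mean is 20 and population SD is 20, so the cutoff is 40.
calls = [{"id": i, "qual": q} for i, q in enumerate([10, 10, 10, 10, 60])]
reported = high_quality(calls)
```

Because the cutoff is relative to each call set, the number of reported variants depends on the spread of scores in that sample and not on a fixed quality value.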

Comparison of short-read WGS to the genome assembly technologies

To assess structural variant calls, we removed linked-read barcodes from the sequencing reads to generate short reads and performed short-read whole-genome structural variant and copy-number calling using Manta with default settings^33,34. From the Manta output, we assessed deletions, duplications, and breakends called from the short-read data and compared these to the output from 10x linked reads and Bionano optical mapping, with attention to variant size, zygosity, and type of variant called. We also determined whether calls passed more stringent high-quality filters in each of the three platforms.

Approvals and phenotypic assessment

The study was approved by the Institutional Review Board of Children’s Hospital Oakland and the University of California, San Francisco (UCSF), Committee for Human Subjects Research. Recruitment was from UCSF Benioff Children’s Hospital Medical Genetics and Genomics clinics. We focused on cases of two types, chosen to demonstrate the capability of FGA: (1) cases in which whole-exome sequencing had not returned a causal variant, and (2) sporadic cases from the pediatric population that are suspected to have a genetic basis but fall into no clear syndrome and have no clear candidate target for conventional genetic diagnosis. Individuals with undiagnosed conditions and unaffected parents were offered testing and underwent an informed consent process prior to blood draw. The nature and possible risks of the study were explained in the consent process. Phenotypic evaluation was performed by clinical review by at least two genetics professionals, and Human Phenotype Ontology terms were curated for each case.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Sequence data have been deposited at the European Genome-phenome Archive (EGA), under accession number EGAS00001005553. Variants are available in ClinVar under accession numbers SCV001441628–SCV001441650. All other data are available from the corresponding author on reasonable request.

Code availability

The automated variant interpretation pipeline is hosted on GitHub (https://github.com/wongkarenhy/Full-Genome-Analysis-Pipeline).

Change history

12 October 2021
A Correction to this paper has been published: https://doi.org/10.1038/s41525-021-00251-3

References

Clark, M. M. et al. Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases. npj Genom. Med. 3, 1–10 (2018).
Biesecker, L. G. & Green, R. C. Diagnostic clinical genome and exome sequencing. N. Engl. J. Med. 370, 2418–2425 (2014).
Levy, S. E. & Myers, R. M. Advancements in next-generation sequencing. Annu. Rev. Genomics Hum. Genet. 17, 95–115 (2016).
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
Stavropoulos, D. J. et al. Whole-genome sequencing expands diagnostic utility and improves clinical management in paediatric medicine. npj Genomic Med. 1, 15012 (2016).
Ostrander, B. E. P. et al. Whole-genome analysis for effective clinical diagnosis and gene discovery in early infantile epileptic encephalopathy. npj Genomic Med. 3, 22 (2018).
Schwarze, K., Buchanan, J., Taylor, J. C. & Wordsworth, S. Are whole-exome and whole-genome sequencing approaches cost-effective? A systematic review of the literature. Genet. Med. 20, 1122–1130 (2018).
Nambot, S. et al. Clinical whole-exome sequencing for the diagnosis of rare disorders with congenital anomalies and/or intellectual disability: substantial interest of prospective annual reanalysis. Genet. Med. 20, 645–654 (2018).
Levy-Sakin, M. et al. Genome maps across 26 human populations reveal population-specific patterns of structural variation. Nat. Commun. 10, 1–14 (2019).
Wong, K. H. Y., Levy-Sakin, M. & Kwok, P. Y. De novo human genome assemblies reveal spectrum of alternative haplotypes in diverse populations. Nat. Commun. 9, 1–9 (2018).
Marks, P. et al. Resolving the full spectrum of human genome variation using Linked-Reads. Genome Res. 29, 635–645 (2019).
Demaerel, W. et al. The 22q11 low copy repeats are characterized by unprecedented size and structure variability. bioRxiv https://doi.org/10.1101/403873 (2018).
Seo, J. S. et al. De novo assembly and phasing of a Korean human genome. Nature 538, 243–247 (2016).
James, K. N. et al. Partially automated whole-genome sequencing reanalysis of previously undiagnosed pediatric patients can efficiently yield new diagnoses. npj Genom. Med. 5, 1–8 (2020).
Kosugi, S. et al. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 20, 8–11 (2019).
Gross, A. M. et al. Copy-number variants in clinical genome sequencing: deployment and interpretation for rare and undiagnosed disease. Genet. Med. 21, 1121–1130 (2019).
Penon, M., Zahed, H., Berger, V., Su, I. & Shieh, J. T. Using exome sequencing to decipher family history in a healthy individual: comparison of pathogenic and population MTM1 variants. Mol. Genet. Genom. Med. 6, 722–727 (2018).
Mostovoy, Y. et al. A hybrid approach for de novo human genome sequence assembly and phasing. Nat. Methods 13, 587–590 (2016).
Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1–16 (2019).
Klopocki, E. et al. Copy-number variations involving the IHH locus are associated with syndactyly and craniosynostosis. Am. J. Hum. Genet. 88, 70–75 (2011).
Will, A. J. et al. Composition and dosage of a multipartite enhancer cluster control developmental expression of Ihh (Indian hedgehog). Nat. Genet. 49, 1539–1545 (2017).
Gambin, T. et al. Identification of novel candidate disease genes from de novo exonic copy number variants. Genome Med. 9, 1–15 (2017).
Sarah, Z. H. Common Fragile Site Genes, CNTLN and LINGO2, are Associated with Increased Genome Instability in Different Tumors. (University of Heidelberg, 2010).
Barr, E. & Applebaum, M. Genetic predisposition to neuroblastoma. Children 5, 119 (2018).
Bonaglia, M. C. et al. De novo unbalanced translocations have a complex history/aetiology. Hum. Genet. 137, 817–829 (2018).
Du, R. et al. Identification of likely pathogenic and known variants in TSPEAR, LAMB3, BCOR, and WNT10A in four Turkish families with tooth agenesis. Hum. Genet. 137, 689–703 (2018).
Peled, A. et al. Mutations in TSPEAR, encoding a regulator of notch signaling, affect tooth and hair follicle morphogenesis. PLoS Genet. 12, 1–17 (2016).
Goldmann, J. M. et al. Parent-of-origin-specific signatures of de novo mutations. Nat. Genet. 48, 935–939 (2016).
Wang, H. et al. Clinical utility of 24-h rapid trio-exome sequencing for critically ill infants. npj Genom. Med. 5, 1–6 (2020).
Clark, M. M. et al. Diagnosis of genetic diseases in seriously ill children by rapid whole-genome sequencing and automated phenotyping and interpretation. Sci. Transl. Med. 11, eaat6177 (2019).
Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
Kehr, B. et al. Diversity in non-repetitive human sequences not found in the reference genome. Nat. Genet. 49, 588–593 (2017).
Cameron, D. L., Di Stefano, L. & Papenfuss, A. T. Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat. Commun. 10, 1–11 (2019).
Chen, X. et al. Manta: Rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).

Acknowledgements

This work was funded by the California Initiative to Advance Precision Medicine to D.M., the Marcus Program in Precision Medicine and Established Investigator Grants to J.T.S. and P.-Y.K., and the National Human Genome Research Institute of the National Institutes of Health under award R01 HG005946 to P.-Y.K. A.G.S. was supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE 1752814. We thank the families for their participation. We thank the California Initiative to Advance Precision Medicine and the California Governor’s Precision Medicine Advisory Committee.

Author information

These authors contributed equally: Joseph T. Shieh, Monica Penon-Portmann, Karen H. Y. Wong.

Authors and Affiliations

Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
Joseph T. Shieh, Monica Penon-Portmann, Anne Slavotinek, Renata C. Gallagher, Ophir D. Klein & Pui-Yan Kwok
Division of Medical Genetics, Pediatrics, Benioff Children’s Hospital, University of California San Francisco, San Francisco, CA, USA
Joseph T. Shieh, Monica Penon-Portmann, Anne Slavotinek, Renata C. Gallagher, Bryce A. Mendelsohn, Jessica Tenney, Daniah Beleford, Hazel Perry & Ophir D. Klein
Cardiovascular Research Institute, University of California San Francisco, San Francisco, CA, USA
Karen H. Y. Wong, Michal Levy-Sakin, Michelle Verghese, Stephen K. Chow & Pui-Yan Kwok
Biophysics Graduate Group, University of California Berkeley, Berkeley, CA, USA
Andrew G. Sharo
Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, USA
Steven E. Brenner
Department of Laboratory Medicine, University of California San Francisco, San Francisco, CA, USA
Zhongxia Qi & Jingwei Yu
Craniofacial Biology and Department of Orofacial Sciences, University of California San Francisco, San Francisco, CA, USA
Ophir D. Klein
Children’s Hospital Oakland Research Institute, Benioff Children’s Hospital Oakland, University of California San Francisco, Oakland, CA, USA
David Martin & Dario Boffelli
Department of Dermatology, University of California San Francisco, San Francisco, CA, USA
Pui-Yan Kwok

Authors

Joseph T. Shieh
Monica Penon-Portmann
Karen H. Y. Wong
Michal Levy-Sakin
Michelle Verghese
Anne Slavotinek
Renata C. Gallagher
Bryce A. Mendelsohn
Jessica Tenney
Daniah Beleford
Hazel Perry
Stephen K. Chow
Andrew G. Sharo
Steven E. Brenner
Zhongxia Qi
Jingwei Yu
Ophir D. Klein
David Martin
Pui-Yan Kwok
Dario Boffelli

Contributions

J.T.S., D.M., P.-Y.K., D.B. conceived and designed the study. J.T.S., A.S., R.C.G., B.A.M., J.T., D.B., H.P., Z.Q., J.Y., O.K., D.M., P.-Y.K., D.B. acquired data. J.T.S., M.P.-P., K.H.Y.W., M.L.-S., M.V., H.P., S.K.C. analyzed data. J.T.S., M.P.-P., K.H.Y.W., M.L.-S., A.S., R.C.G., B.A.M., J.T., D.B., H.P., A.G.S., S.E.B., Z.Q., J.Y., O.K., D.M., P.-Y.K., D.B. interpreted data. M.P.-P., K.H.Y.W., M.V., S.K.C. created software used in the work. J.T.S., M.P.-P., K.H.Y.W., D.M., P.-Y.K., D.B. drafted the manuscript. J.T.S., M.P.-P., K.H.Y.W., M.L.-S., A.S., R.C.G., B.A.M., J.T., D.B., H.P., A.G.S., Z.Q., J.Y., O.K., D.M., P.-Y.K., D.B. provided critical revisions. S.E.B. was unable to review the final manuscript due to injury. All authors contributed to the manuscript. J.T.S., M.P.-P., K.H.Y.W. contributed equally to this work.

Corresponding author

Correspondence to Joseph T. Shieh.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information pdf

Reporting Summary Checklist

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Shieh, J.T., Penon-Portmann, M., Wong, K.H.Y. et al. Application of full-genome analysis to diagnose rare monogenic disorders. npj Genom. Med. 6, 77 (2021). https://doi.org/10.1038/s41525-021-00241-5

Received: 29 May 2020
Accepted: 21 October 2020
Published: 23 September 2021
DOI: https://doi.org/10.1038/s41525-021-00241-5
