Diagnostic implications of pitfalls in causal variant identification based on 4577 molecularly characterized families

Despite large sequencing and data sharing efforts, previously characterized pathogenic variants account for only a fraction of Mendelian disease patients, which highlights the need for accurate identification and interpretation of novel variants. In a large Mendelian cohort of 4577 molecularly characterized families, we encountered numerous scenarios in which variant identification and interpretation can be challenging. We describe categories of challenges that cover the phenotype (e.g. novel allelic disorders), pedigree structure (e.g. imprinting disorders masquerading as autosomal recessive phenotypes), positional mapping (e.g. double recombination events abrogating candidate autozygous intervals), gene (e.g. novel gene-disease assertion) and variant (e.g. complex compound inheritance). Overall, we estimate a probability of 34.3% for encountering at least one of these challenges. Importantly, our data show that an increase of around 71% in the diagnostic yield can be expected from addressing non-sequencing-based challenges alone. Indeed, by applying these lessons to a cohort of 314 cases with negative clinical exome or genome reports, we could identify the likely causal variant in 54.5%. Our work highlights the need for a thorough approach to undiagnosed diseases that considers a wide range of challenges rather than focusing narrowly on sequencing technologies. We hope that sharing this experience will improve the yield of undiagnosed disease programs globally.

[...] empowering individuals (patients and unaffected carriers) to make reproductive choices as well as to understand disease risk in family members and future generations. This underpins the desire to maximize the availability of these tools to end the lengthy diagnostic odyssey and ensure that patients and families achieve their right to an accurate and timely diagnosis 1.
Despite the remarkable advances in Mendelian disease genetics, current technology fails to identify the underlying causal variant in a significant fraction of patients. Large diagnostic exome sequencing (ES) cohorts typically report a <50% diagnostic rate 2. Even whole-genome sequencing (WGS) falls short of attaining the much-anticipated full capture of all Mendelian variants. A recent real-world study on the clinical implementation of WGS in the diagnosis of suspected Mendelian diseases reported a 35% diagnostic rate 3. It is clear, therefore, that there are persistent challenges in the diagnosis of Mendelian diseases beyond the coverage issue. Identifying these factors will require large-scale, deep analysis of Mendelian diseases and sharing of results to facilitate the development of robust tools that learn from these pitfalls.
Efforts to characterize the challenges in Mendelian variant identification have been limited and tend to deal with a single challenge at a time, e.g. cryptic transcript-deleterious variants 4,5. The Undiagnosed Diseases Network (UDN) in the US recently published its experience involving 791 evaluated individuals: 231 received 240 diagnoses, including 35% that were "straightforward" 6. The very nature of the UDN makes it enriched for challenging clinical scenarios, so its finding of 90 diagnoses that occurred after prior nondiagnostic exome sequencing and 45 diagnoses that are non-genetic may not be reflective of the overall landscape of Mendelian diseases. A recent overview by the Centers for Genomic Medicine only very briefly listed some of the pitfalls of standard analysis 7. The examples listed by the authors include WGS to identify a homozygous inversion in QDPR missed by exome, RNA-seq to identify an intronic variant in trans with a missense in DES, bisulfite sequencing to identify aberrant hypermethylation associated with a pathogenic repeat expansion in the XYLT1 promoter region, and long-read sequencing to identify an inverted triplication flanked by duplications in a proband with Temple syndrome 7. Thus, there remains a need for a detailed analysis of a large and unbiased Mendelian cohort to describe the encountered pitfalls both quantitatively and qualitatively and to inform similar efforts.
Here, we describe the challenges encountered in a large Mendelian genomics program involving 4577 molecularly characterized families. We identify categories of challenges that cover the phenotype, pedigree structure, positional mapping, gene, and variant, and quantify their relative contribution. Our results can inform current and future efforts to improve the diagnostic yield of Mendelian diseases globally.

Representativeness of the study cohort
Our cohort comprised 4577 families in which a likely causal variant was identified (out of 8024 families in total). The total number of these variants is 2681 (2131 recessive, 455 dominant, 88 X-linked, 6 Y-linked, and one mitochondrial) and the total number of implicated genes is 1604 (400 lacked an OMIM listing of the gene-phenotype assertion at the time of analysis). The overwhelming majority of the included cases came through the research lab (94.5%, 4324 / 4577), while 5.5% came through the clinical lab. Similarly, the overwhelming majority of cases came from Saudi Arabia (~96%), with the remaining ~4% coming as international referrals to our program. Consanguinity (defined as parental relatedness equivalent to third cousin or closer) was documented in 81% of families, lack of consanguinity was documented in 10.5%, and consanguinity status was unknown in 8.5%. There was broad coverage of disease pathologies typical of large Mendelian genomics programs, including neurodevelopmental, dysmorphic/congenital malformation syndromes, inborn errors of metabolism, hematological, immunological, ophthalmological, audiological, pulmonary, gastrointestinal, connective tissue-related, cardiovascular, skeletal, reproductive, and renal. The age distribution was also broad, ranging from the zygote stage to 80 years of age. The sex distribution of our cohort was almost equal (51.8% males and 47.1% females), while the remaining 1.1% were cases of undetermined sex (typically fetuses).
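The cohort figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts stated in the text; the variable and function names are illustrative, and the 57.0% solved fraction is derived here rather than reported by the authors.

```python
# Counts quoted in the cohort description (names are illustrative).
TOTAL_FAMILIES = 8024          # all families in the program
SOLVED_FAMILIES = 4577         # families with a likely causal variant
RESEARCH_LAB_FAMILIES = 4324   # solved families recruited via the research lab

variants_by_inheritance = {
    "recessive": 2131,
    "dominant": 455,
    "X-linked": 88,
    "Y-linked": 6,
    "mitochondrial": 1,
}


def percent(part: int, whole: int, digits: int = 1) -> float:
    """Return part/whole as a percentage rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


total_variants = sum(variants_by_inheritance.values())
research_share = percent(RESEARCH_LAB_FAMILIES, SOLVED_FAMILIES)
clinical_share = round(100 - research_share, 1)
# Derived value, not stated in the text: share of all families that were solved.
solved_share = percent(SOLVED_FAMILIES, TOTAL_FAMILIES)

print(total_variants)   # 2681
print(research_share)   # 94.5
print(clinical_share)   # 5.5
print(solved_share)     # 57.0
```

The inheritance categories sum to the reported 2681 variants, and the research/clinical split reproduces the quoted 94.5% and 5.5%.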

Genetic diagnostic challenges
We identified 1570 families (34.3%) in which one or more of the following challenges was observed (Fig. 1 and Supplementary Data 1). The majority of the identified variants were classified as pathogenic or likely pathogenic, while the remaining 15% were variants of uncertain significance (based on known genes, since variants in novel genes are automatically classified as variants of uncertain significance) (Supplementary Data 1). The variants spanned 861 genes, including candidate genes as well as novel allelic disorders in known morbid genes, the majority of which (87.7%, 221 / 252) achieved at least a moderate level gene-disease assertion (Supplementary Data 1). All variants are submitted to ClinVar and all novel gene-disease assertions to GenCC.
1-Phenotype-related:
i. Phenotypic heterogeneity: Supplementary Data 2 lists the families where the phenotype was sufficiently heterogeneous (intrafamilial or interfamilial) to complicate the original molecular diagnosis (~3% of families). For example, the identification of the causal variant of cleft lip and palate, IRF6:NM_006147.4:c.179G>C;p.(Trp60Ser), heterozygous in family F750, was challenging because the phenotype varied widely from frank cleft lip and palate to lip pits that were not always apparent clinically due to the use of cosmetic fillers. On the other hand, interfamilial phenotypic heterogeneity significantly delayed the identification of some founder variants shared by families, e.g. we identified the same pathogenic founder INSR variant NM_000208.4:c.433C>T;p.(Arg145Cys) in families where the phenotype ranged from classical hyperinsulinism to asymptomatic.
ii. Phenotypic expansion: in 79 families (5%), the phenotype provided by the referring physician was sufficiently different from the typical phenotypic expression of the implicated gene to make the molecular diagnosis challenging (see Supplementary Data 3). For example, the causal variant in CDK10 was not considered initially in the interpretation of F5780, a simplex case recruited due to hydrocephalus, because this is an atypical presentation of Al Kaissi syndrome (one case of Al Kaissi-related hydrocephalus was published after the initial submission 8). Another example is family F7829, where one affected fetus was found to have bilateral renal agenesis and a homozygous LOF variant in CD151 (Fig. 2d), which is typically linked to nephropathy rather than renal agenesis (one report of CD151-related renal agenesis appeared after the initial submission 9). Dual molecular diagnosis was specifically investigated and excluded in families under this category.
iii. Allelism: in 83 families (5.3%), the phenotype is sufficiently different from the one described in the literature that it justifies labeling as a distinct allelic disorder (52 allelic disorders in total). Indeed, the phenotype was considered a distinct allelic disorder that later acquired a distinct OMIM entry in 37 families or remains a candidate for OMIM listing in 46 families. Supplementary Data 4 lists these cases, which include 19 unpublished (Table 1). Interestingly, most instances of allelism can be attributed to the recessive nature of the identified variant compared to the dominant ones in the literature, and these will be discussed later under gene-specific challenges. Several exceptions are worth noting: The index patient in family F8629 presented with primary amenorrhea and infertility and was found to have severe ovarian insufficiency with normal karyotype and FMR1 repeat number. The novel homozygous variant TALDO1:NM_006755.2:c.486_500dup;p.(His163_Thr167dup) in her exome was initially dismissed as irrelevant due to the lack of clinical features of transaldolase deficiency 10. However, its association with elevated urinary excretion of erythritol, arabitol, and ribitol in the index, which is consistent with transaldolase deficiency, and its full segregation with the phenotype in the family prompted us to upgrade it to likely pathogenic and to propose a
TALDO1-related ovarian insufficiency as a novel allelic disorder. Families F6581 and F6582 presented with fetal akinesia, and each had a different homozygous LOF allele in COL25A1. This strongly supports fetal akinesia as an allelic disorder distinct from COL25A1-related congenital fibrosis of extraocular muscles 11 (Fig. 2e, f). Family F4367 was clinically diagnosed with HADHB-related non-syndromic peripheral neuropathy, which is a distinct allelic disorder from the OMIM-listed HADHB-related trifunctional protein deficiency. Family F4607 is a simplex Bardet-Biedl syndrome case likely caused by a homozygous variant in SCLT1, a gene originally described in connection to oral-facial-digital syndrome 12. We also highlight an apparently novel recessive allelic disorder that is caused by homozygous LOF in VPS50 and comprises severe congenital hydrocephalus in family F9792 (no biallelic LOF variants have been reported before, which may explain the severe nature of this phenotype). Although SCYL1-related CALFAN syndrome (cholestasis, acute liver failure, and neurodegeneration) has been published 13, it is worth highlighting family F7600 with this disease because this distinct allelic disorder remains unlisted in OMIM (only spinocerebellar ataxia is listed under SCYL1).
[...] 2g). It is worth highlighting that the blended phenotype need not be due to independently inherited variants in multiple genes. For example, the phenotype of hyperinsulinism and inherited retinal degeneration in families F4296, F6457, and F8752 was found to be caused by a founder deletion involving ABCC8 and USH1C. Remarkably, one patient (F8752) presented with this phenotype and tested negative for this deletion. Instead, he was found to have independent homozygous pathogenic variants in ABCC8 and LCA5. Perhaps the most striking examples are families F900 and F8348, each molecularly and clinically diagnosed with three different diseases in the same individual. Family F900 was referred to us with albinism and was found to be homozygous for Hermansky-Pudlak syndrome-related variant HPS4:ENST00000398145. [...]
[Figure caption] In a cohort of 4577 molecularly characterized families, we encountered 5 main scenarios that we found to be challenging in 1570 families. First, there are phenotype-related challenges that can be further subcategorized into phenotypic heterogeneity, phenotypic expansion, novel allelic disorders, blended phenotype, and misleading diagnoses. Gene-related challenges comprise the discovery of novel disease genes, novel mutation mechanisms, and cases where the animal model did not corroborate the phenotype observed in human patients. Pedigree-related challenges comprise gonadal mosaicism, allelic and genetic heterogeneity, and pseudo-dominance. Variant-related challenges comprise interpretation and technical level challenges. Pitfalls of autozygosity include the lack of detectable ROH at the disease locus and apparent sharing of the candidate ROH with an unaffected member of the family. We then dissect the prevalence of these challenges in 314 families referred to us with negative clinical exome or genome sequencing. We observed that most of the challenges encountered are either because the causal gene was novel at the time of analysis or because of non-technical variant-related challenges. Logos for gene-related, phenotype-related, and variant-related challenges were created using BioRender.com.
[...] is an important consideration. Nonetheless, there are instances where the reported phenotype in the animal model is sufficiently incompatible to delay the establishment of the gene-disease assertion. For example, MPDZ-related congenital hydrocephalus seen in families F2268, F2699, and F5606 was initially dismissed because a chick model lacked hydrocephalus 14. It was only after the original publication of MPDZ-related hydrocephalus 15 that a compatible mouse model recapitulating the human phenotype was published 16. Additional examples include the animal models of MICU2 and AGBL5 deficiency 17,18, which we identified, respectively, in family F5563 with severe neurodevelopmental disorder and family F2707 with non-syndromic retinal dystrophy. Supplementary Data 8 details this category that potentially affected 7 families (0.45%).
iii. Known gene, novel mutation mechanism: the likely causal variant was dismissed in 71 families (4.5%) because of the perceived incompatibility of the identified homozygous variant with the established dominant inheritance pattern of the implicated gene. Supplementary Data 9 lists these cases, including 23 unpublished (Table 2). As compared to the typical dominant phenotype, the recessive phenotype ranged from similar, e.g. SLC20A2-related basal ganglia calcification in family F4159 (Fig. 2h), TCOF1-related Treacher Collins syndrome in family F732 (Fig. 2i) and MAPRE2-related circumferential skin creases in family F531, to more severe, e.g. ADSS1-related embryonic lethality in family F8399, to distinct allelic disorders. One remarkable example of the latter is family F6666, in which two siblings presented with a dysmorphic syndrome and associated global developmental delay (Fig. 3a).
The biallelic loss of function variant in ABL1 that these two siblings share is very different from the gain of function de novo variants identified in ABL1-related congenital heart defects and skeletal malformations syndrome (CHDSKM). The dysmorphology profile can be viewed as the opposite of CHDSKM, which has been likened to Van den Ende-Gupta syndrome 19 (Fig. 3b-h). A similarly remarkable example is family F141, where an OTX2-related inherited retinal degeneration with female infertility caused by a homozygous cryptic splicing variant is in stark contrast to the dominant OTX2-related anophthalmia phenotype (Supplementary Figure 1A and B). A third example is F454 and F2924, two families in which a non-syndromic retinitis pigmentosa phenotype fully segregated with a founder recessive NOTCH2 variant even though this gene is only linked to very distinct autosomal dominant conditions in OMIM. Family F3151 with dysmorphia was molecularly diagnosed with a missense (rather than LOF) variant in VPS13B (Fig. 2j). Another example is two families, F7810 and F9597, where we identified a homozygous founder LOF variant in RHOBTB2 leading to a RHOBTB2-related neurodevelopmental disorder characterized by global developmental delay, mild facial dysmorphism, normal brain MRI and no epilepsy, in stark contrast to the RHOBTB2-related developmental and epileptic encephalopathy caused by de novo gain of function variants.
[...] challenge until it was later realized that this was a penetrance issue. Age-related penetrance was most common and was observed in 12 families, typically related to progressive retinal degeneration. Pathogen-specific penetrance was observed in families F8448 and F5452 with C8B and MSN variants, respectively. We also observed sex-related penetrance, e.g. in the case of F8606 with testicular regression syndrome, the identified variant in DHX37 was also initially dismissed due to its presence in unaffected members before we realized this was a sex-limited phenotype. However, no explanation could be found for the segregation results in some cases. For example, the index patient in family F9182 was found to have a CNV (1:145390101_145786290 heterozygous deletion) inherited from a healthy father even though it was reported as pathogenic in 7 individuals with overlapping neurodevelopmental disorders in DECIPHER 26. Reduced penetrance is well established for CNVs 27 but unusual in autosomal recessive SNVs. Thus, family F1028 was particularly surprising given the apparent non-penetrance of a homozygous LOF variant in a typically fully penetrant morbid gene. Specifically, a homozygous truncating variant, DENND5A:NM_015213.4:c.3387+3G>T;r.3305_3387del;p.(Lys1102Thrfs*27), was initially ignored in three affected siblings with the typical DENND5A-related developmental and epileptic encephalopathy phenotype because it was also homozygous in their unaffected sibling (Supplementary Fig. 3). The reason for incomplete penetrance in this family remains obscure and we suspect this may be a rare form of resilience (see Discussion). On the other hand, apparent non-penetrance can be explained by inadequate phenotyping. This can be seen when the patient is difficult to phenotype due to a complex presentation. For example, the classical congenital symmetric circumferential skin creases syndrome was overlooked in a child (family F531) with a pathogenic variant in MAPRE2 because he also suffered from CPLANE1-related Joubert syndrome, but it could readily be observed in the sister, who is homozygous only for the MAPRE2 variant. In some cases, the phenotype is observed but not perceived as an extension of the phenotype under study in the index. An example is F5409, where NIHF (nonimmune hydrops fetalis) was investigated. One sister was initially recruited as unaffected because she only had club nails, which were later found to be an extension of the FZD6-related NIHF phenotype (Supplementary Fig. 3). We suspect inadequate phenotyping may also play a role in explaining the presence of homozygotes for some pathogenic variants in gnomAD, e.g. there is a homozygote in gnomAD for the pathogenic stopgain variant we identified in KATNIP in family F6141 with classical Joubert syndrome.
e. Distraction by other variants: In 240 families (15.3%, Supplementary Data 14), the search for the causal variant was derailed by another variant. These "red herring" variants can be grouped as follows:
i. Presumptive loss of function variants in known disease-related genes: These were obviously considered compelling candidates but were later found to be non-disease causing. We have encountered this phenomenon in 58 families (3.7%, Table 4 and Supplementary Data 14). The reasons for these variants not being disease-causing include: 1.
LOF is not a disease mechanism: For example, the stopgain variant COL8A2:NM_005202.4:c.1815C>A;p.(Tyr605*) was identified in three families that lacked corneal dystrophy. All COL8A2-related corneal dystrophy variants to date are amino acid substitutions and pLI is low (0.12). 2. The variant is not true LOF: For example, EYS:NM_001142800.1:c.2137+1G>A was never shown to cause LOF on RT-PCR. Its high AF in gnomAD suggests it is not disease-causing, at least not in the homozygous state we encountered in 6 families. It is possible that this variant is involved in the complex compound inheritance phenomenon (see above). [Table footnote: †Variants involved in complex compound heterozygosity are listed in Table 3.] Another example is RPGR:NM_000328.3:c.1905+1G>A, which we found hemizygous in three families with no evidence of retinal dystrophy. The reason for the non-pathogenicity of this variant is unclear but it is worth highlighting that its predicted impact on splicing was not experimentally confirmed and that it represents a missense variant in a different isoform. In family F2516, a homozygous IQSEC1:NM_001134382.3:c.2978del;p.(Pro993Hisfs*127) variant was identified in the index patient but also in the normal father and another individual in our database who lacks the phenotype. The deleted bp is the last bp in the exon, followed by a single bp intron, raising the possibility of a resulting in-frame deletion rather than a frameshift. Another example is NPHP4:NM_015102.5:c.2818-2T>A, with an extremely high AF in gnomAD including homozygotes. The lack of phenotype could be related to the in-frame nature of the splicing aberration 28. Similarly, OFD1:NM_003611.3:c.2600-1G>C, which we identified in hemizygosity in 3 families, was found to cause an in-frame aberration when we tested it on RT-PCR (Supplementary Figure 3). EVC:NM_001306090.2:c.2731C>T;p.(Arg911*) is a variant that has a very high frequency in our population with several homozygotes that lack the expected ciliopathy phenotype. We think it is not LOF because it only truncates 8% of the protein. Similarly, the RP1L1:NM_178857.6:c.5959C>T;p.(Gln1987*) variant, which has a very high frequency, is probably non-pathogenic because it only truncates a small part of the C-terminus that has not been reported to harbor any pathogenic variants. Furthermore, the part truncated by GNAT1:NM_144499.3:c.858C>G;p.(Tyr286*), which we identified in family F5037 with horizontal gaze palsy and in a normal parent, is devoid of clinically proven pathogenic variants. 3. The reported gene-disease assertion is refuted: LIPN:NM_001102469.2:c.302delG;p.(Gly101Glufs*7) was identified in homozygosity in F2178 in the absence of ichthyosis. The OMIM listing of LIPN-related ichthyosis is based on a single family.
ii. Presumptive LOF variants in genes with no OMIM phenotypes: Although priority was always given to known disease-related genes, the finding of homozygous truncating variants in novel genes can distract from the actual disease-causing variant, especially if the novel gene is compelling. Supplementary Data 15 lists all the "knockout" events that turned out to be unrelated to the phenotype in question, and these were observed in 60 families (3.8%).
iii. The variant was in linkage disequilibrium with the causal variant, which was therefore overshadowed by the other variant. For example, family F168 was referred to us with epidermolysis bullosa. We initially made a genetic diagnosis based on LAMB3:NM_000228.3:c.2723C>T;p.(Thr908Ile). However, the variant was subsequently found to have a high AF, which prompted us to reclassify this variant and revisit the case. This revealed a pathogenic NM_000228.3:c.958_1034dup;p.(Asn345Lysfs*77) variant in the same gene. Similarly, in family F4429 with congenital adrenal hyperplasia we made a molecular diagnosis based on CYP21A2:NM_000500.9:c.92C>T;p.(Pro31Leu) until we later discovered the causal variant to be a genomic rearrangement involving exons 1-3 of the same gene. In family F5113, WGS reported a very deep intronic variant NM_003477.2:c.1023+2267A>G in PDHX as the likely cause. However, the causal variant was a genomic rearrangement resulting in exon 1 deletion of the same gene, as confirmed by targeted analysis.
i. Technical:
a. Deep intronic variants: Supplementary Data 16 for tentative TDV (see above) includes variants >50 bp from the exon-intron junction, which is typically not covered by exome capture.
b. Regulatory elements (Supplementary Data 17): We highlight a remarkable example of this challenging class. Family F3029 comprises four children with severe syndactyly (Fig. 4a-f). We initially reported a large homozygous genomic deletion NC_000010.10:g.54337730_54933961del as the likely causal variant and attributed the pathogenesis to the resulting total loss of MBL2 within the deleted region even though MBL2 is not linked to any disease in humans 29. Interestingly, a mouse model homozygous for a null allele 30 did not corroborate the phenotype observed in the four siblings, prompting us to revisit our interpretation. Upon further investigation, we found that DKK1 lies ~260 kb upstream of the deleted region (Fig. 4g). DKK1 is crucial for normal limb development and null mice display syndactyly 31 (Fig. 4h). We hypothesized that the deletion may have impacted DKK1 transcription indirectly by affecting a putative enhancer region. Indeed, we observed several H3K27Ac peaks in the deleted genomic region when inspecting publicly available databases (Fig. 4g). Consistent with this hypothesis, RT-qPCR experiments using fibroblasts isolated from two affected siblings showed a marked reduction (60-80%) of DKK1 transcript levels compared to controls (Fig. 4i).
c. Repeat expansion (Supplementary Data 18): As expected, all families with Fragile X syndrome and other expansion disorders were missed by exome except for one case caused by a de novo indel in FMR1 32.
d. Genomic rearrangements: Variants >50 bp in size but below the limit of detection of chromosomal microarray were challenging to call on exome sequencing. These accounted for the molecular diagnosis in 42 families (2.7%, Supplementary Data 19). Supplementary Fig. 4 shows examples of how optical genome mapping was very helpful in this class of variants.
e. Pseudogenes: The known limitation related to the SMN1/SMN2 and CYP21A2 loci was encountered in 13 families (0.8%).
f. Platform and bioinformatic limitations: We have encountered 68 families (4.3%) in which the causal variant was missed or miscalled (sometimes because of discrepant performance of Ion Proton vs NovaSeq, especially for indels). We have instances where the variant was misannotated as intronic or as ncRNA, but it was actually in an exon of a protein-coding gene.
g. Epigenetic: These epigenetic changes are missed by current short read sequencing technology, e.g. hypomethylation of the maternal GNAS allele-related pseudohypoparathyroidism Ia in family F79, and DMR2 hypomethylation-related Beckwith-Wiedemann syndrome in family F8188.

Pedigree-related
Pseudodominance was encountered in 15 families (~1%, Supplementary Data 20). Examples of more challenging pedigrees are discussed here. The index in family F2640 was suspected to have an autosomal recessive syndromic form of intellectual disability because of the consanguineous parents and history of a deceased affected sibling, so a novel candidate gene (FAM120AOS) was proposed 33. However, reanalysis revealed a causal 179 kb deletion that had been dismissed because it was inherited from a normal mother. This deletion (chr14:101178072-101457155) spans an imprinted locus with the paternally expressed RTL1 and maternally expressed MEG3 and MEG8, and is linked to Temple and Kagami-Ogata syndrome 34 (Fig. 2). Family F6386 was referred to us with two affected sisters diagnosed with hearing impairment, mild microcephaly, developmental delay, and mild strabismus. Despite the absence of consanguinity between the two parents, the hearing impairment phenotype was solved with a homozygous truncating variant in GJB2:NM_004004.6:c.35delG;p.(Gly12Valfs*2). To our surprise, we found a heterozygous variant ITPR1:NM_001378452.1:c.7660G>A;p.(Gly2554Arg) in both affected sisters (explaining the global developmental delay phenotype) which was absent in the parents, strongly suggesting a gonadal mosaic mode of inheritance. Another dramatic example is family F4923 with three affected children from two different half second-degree unions with split hand/foot malformation syndrome (Fig. 5a, b). Molecular karyotyping was initially negative, as were exome and RNA-seq. Optical genome mapping revealed a heterozygous duplication in chromosome 10 (chr10:102909908-103459101) in all three affected while the mothers and father were wild type, indicating a case of paternal gonadal mosaicism (Fig. 5c-e). Gonadal mosaicism is even more challenging in the context of autosomal recessive diseases. For example, in family F8654 with two children with the classical ALG3-related congenital disorder of glycosylation, type Id, we identified a paternally inherited variant NM_005787.5:c.512G>A;p.(Arg171Gln) and a "de novo" variant in trans, indicating maternal gonadal mosaicism. Family F5162 exemplifies how the challenge of intrafamilial genetic heterogeneity is amplified when the phenotype of the two conditions is similar (Fig. 5f). The three affected siblings were referred to us with intellectual disability and limited jaw opening (trismus) of variable severity. The family was solved with a homozygous LOF variant in THUMPD1:NM_017736.5:c.706C>T;p.(Gln236*) identified in two of the three siblings. The third sibling was found to have a de novo variant in HIST1H1E:NM_005321.3:c.265delA;p.(Ser89Alafs*140) (Fig. 5g-j). Overall, recurrence caused by parental balanced rearrangement or gonadal mosaicism was erroneously assumed to indicate autosomal recessive variants, thus delaying the molecular diagnosis, and these were observed in 2 (0.13%) and 7 families (0.45%), respectively. We also note the delay in identifying compound heterozygous variants in 16 consanguineous families (1%) because homozygosity was assumed during analysis. Similarly, blended phenotypes are often assumed to be autosomal recessive in consanguineous populations, but we note that even in individuals with an autosomal recessive allele, the additional pathogenic variant was autosomal dominant in 17.7% or X-linked in 16.1%. Intrafamilial genetic and allelic heterogeneity were major challenges. The default assumption of genetic and allelic homogeneity for a given phenotype within families proved erroneous in 140 families (8.9%), including 123 nuclear families. Supplementary Data 20 summarizes these families, including 70 unpublished ones. This list also includes 87 families in which at least one individual had a blended phenotype (see above). Supplementary Fig. 5 showcases a few remarkable families with genetic and allelic heterogeneity.
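The text does not state which denominator the pedigree-related percentages use. The short check below is a sketch under the assumption that each quoted percentage is either a share of the 1570 families with at least one challenge or a share of all 4577 solved families; the dictionary keys, the helper name `consistent`, and the 0.05 percentage-point tolerance are illustrative choices, not the authors'.

```python
CHALLENGE_FAMILIES = 1570  # families with at least one challenge
SOLVED_FAMILIES = 4577     # all molecularly characterized families

# (family count, percentage quoted in the text); "~1%" is entered as 1.0.
reported = {
    "pseudodominance": (15, 1.0),
    "parental balanced rearrangement": (2, 0.13),
    "gonadal mosaicism": (7, 0.45),
    "compound heterozygosity in consanguineous families": (16, 1.0),
    "intrafamilial genetic/allelic heterogeneity": (140, 8.9),
}


def consistent(count: int, quoted_pct: float, denominator: int,
               tolerance: float = 0.05) -> bool:
    """True if count/denominator matches the quoted percentage within tolerance."""
    return abs(100 * count / denominator - quoted_pct) <= tolerance


fits_challenge = all(
    consistent(n, pct, CHALLENGE_FAMILIES) for n, pct in reported.values()
)
fits_solved = any(
    consistent(n, pct, SOLVED_FAMILIES) for n, pct in reported.values()
)
print(fits_challenge, fits_solved)  # True False
```

Under these assumptions, every quoted figure agrees with the 1570-family denominator and none agrees with 4577, which suggests (but does not prove) that the percentages are expressed relative to the families in which a challenge was encountered.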

Positional mapping-related limitations
There were several instances where we excluded a causal variant because it was present within a region of homozygosity (ROH) that was apparently shared with unaffected family members. Notwithstanding the possibility of incomplete penetrance (see above), this phenomenon can be explained by IBS (identical by state) rather than IBD (identical by descent). In consanguineous settings, ROH is the surrogate of autozygosity (IBD). However, ROH may also represent IBS. Thus, the apparent sharing of ROH with the unaffected can be due to the region being IBS in the unaffected while IBD in the affected. IBS segments are typically short, so interpretation is particularly challenging when an IBS segment is long. The largest IBS segment we have encountered so far is 19.479 Mb long (family F873). This mechanism was encountered in 5 families (0.32%). Conversely, a variant may be dismissed because it is not within an ROH that is shared by all affected. Notwithstanding the possibility of genetic heterogeneity, this phenomenon can be due to double recombination. This is very rare, and we have encountered it in only 2 families. An example is family F849, where the two siblings with cutis laxa are homozygous for ATP6V1E1:NM_001696.4:c.634C>T;p.(Arg212Trp). The variant was present at the edge of an ROH in one affected child but there was no ROH in the affected sibling. We hypothesize that the ROH was abrogated by a second recombination event (Supplementary Fig. 6). The RP1 locus seems to be particularly susceptible to this phenomenon as we observed it in 3 families where at least one affected member did not have a detectable ROH around the causal variant. Finally, the variant may not be within an ROH at all (29 families, 1.9%), and this can be due to low SNP coverage near the locus or because the parents are sufficiently removed from the common ancestor that the disease haplotype was reduced below the conventional limit of 2 Mb. Supplementary Data 21 and Supplementary Fig. 6 summarize positional mapping-related challenges.
7. Sample mix-up: Despite having multiple checks to avoid human errors, sample mix-up was responsible for a delay in identifying the likely causal variant in 6 families (0.38%).

Reanalysis of negative cases
In our cohort, 314 families were referred to us after a negative exome, genome, or both. Reanalysis identified a likely causal variant in 54.5% of these families. This offered us an opportunity to explore the relative contribution of the above-described challenges in this special cohort. The single most common challenge was the novelty of the gene-disease assertion (48%; this includes novel disease genes as well as known disease genes with a novel mutation mechanism). This was followed by variant-related challenges (genomic rearrangements and non-canonical tentative TDVs) (37.4%), phenotype-related issues (11.7%), and pedigree-related challenges (1.8%). Only 15.2% of the variants identified on reanalysis could not have been captured at the technical level by exome sequencing. Figure 1 and Supplementary Data 22 summarize the results.

Discussion
NHGRI (National Human Genome Research Institute) has made "bold predictions" for the state of human genomics by the year 2030 35. One such prediction is that "The regular use of genomic information will have transitioned from boutique to mainstream in all clinical settings, making genomic testing as routine as complete blood counts (CBCs)". Key to this prediction is our ability to interpret genome sequence data. Nowhere in the field of genomics does this interpretation have the potential to be more accurate and attainable than in Mendelian diseases, and yet patients suffering from these diseases have at least a 50% chance of remaining undiagnosed after clinical genome sequencing. Clearly, this must change to fulfill the above vision and deliver to these patients their right to an accurate diagnosis.
This study is a step towards shedding light on the factors that render genome sequencing non-diagnostic through deep analysis of real-world data from a large Mendelian program with excellent sampling representation from one of the largest countries in the Middle East. Contrary to the common belief that the missing diagnostic yield of exome is mostly related to technical limitations 36,37, we have previously shown using an unbiased positional mapping approach that, at least in the setting of autosomal recessive phenotypes in consanguineous populations, more than 90% of causal variants should in theory be detectable by exome sequencing 5. Indeed, we and others have shown that reanalysis of "negative" exome sequencing uncovers causal variants that were missed at the interpretation rather than capture stages [38][39][40][41][42]. While our study offers limited insight into the added value of newer technologies such as optical genome mapping, it provides unprecedented details about the interpretation challenges.
There is a growing interest in the use of artificial intelligence (AI) to improve accuracy and increase the throughput of interpreting clinical genome sequencing 42. We believe that efforts such as ours to share challenges in analyzing genomes and how such challenges were overcome will be very helpful in training the next generation of AI-based tools and enable genome sequencing and its interpretation and reporting to be performed at scale. Additionally, our work makes several key contributions to clinical genomics. First, we describe 357 gene-disease assertions that were novel at the time of analysis and provide additional support to 120 previously published tentative gene-disease assertions. This was largely enabled by the power of autozygosity to produce compelling homozygous loss of function variants (human knockouts) as described before 20,[43][44][45]. This is readily seen in the case of the ABL1 homozygous loss of function variant that we suggest produces a mirror-image phenotype to ABL1-related Van den Ende Gupta syndrome-like facial and digit dysmorphism caused by de novo gain of function variants. Similarly, the homozygous splicing variant in OTX2 was shown in our study to result in a human phenotype that recapitulates the retinal degeneration and infertility observed in homozygous mice rather than the haploinsufficiency-related microphthalmia phenotype 46,47. Beyond these novel disease-gene assertions, we have also added novel aspects to the phenotype of 73 established gene-disease links, which will aid in the molecular diagnosis of these diseases. Second, we publish recessive forms of 23 genes that have hitherto been linked to autosomal dominant phenotypes only. The importance of this finding in interpreting heterozygous pathogenic variants in these genes in individuals who lack the dominant phenotype cannot be overemphasized. This makes the difference between counseling based on a 50% risk of an affected child assuming nonpenetrance of a dominant variant versus a nearly 0% risk of an affected child assuming recessiveness of the variant and a non-carrier spouse 48. Third, we highlight 85 variants as important founder variants in our population, including 48 variants with local allele frequency >0.001 (Supplementary Data 23). Importantly, ~50% of these variants have a gnomAD frequency of 0, which highlights their specific importance to the local and potentially other Middle Eastern populations as shown recently 49. We believe this is an important step towards realizing the NHGRI prediction that by the year 2030 "individuals from ancestrally diverse backgrounds will benefit equitably from advances in human genomics". Fourth, we provide an interpretation framework for variants in 58 genes that we note harbor homozygous loss of function class variants with no apparent consequences phenotypically. None of these genes had been linked to any Mendelian disease in humans, so our study reduces their relevance as candidate disease genes in future studies, especially considering that many have corresponding mouse knockout lines with no major developmental consequences. On the other hand, the examples of "human knockouts" we encountered for genes with established links to otherwise completely penetrant Mendelian diseases raise interesting questions about the concept of "resilience". Such cases are extremely valuable to analyze to gain insights into how they remain disease-free and what that can teach us about human genomics-inspired therapies for patients who suffer from the corresponding diseases 50. Another advantage offered by the enhanced autozygosity in our study population is the ability to observe in the homozygous state known pathogenic recessive variants and the potential discordance of their phenotypic expression in the homozygous vs. compound heterozygous state. The frequency of many of these variants is low enough that it is virtually impossible to encounter them in the homozygous state in outbred populations, where the required cohort size is based on q². A good example of this is DHCR7:NM_001360.3:c.1A>G;p.? with AF of 0.00001 in gnomAD (0 homozygotes), whereas 2 homozygotes were encountered in our much smaller database of <14,000 local exomes. Neither of the two homozygotes displayed features of Smith-Lemli-Opitz syndrome, which lends support to the hypothesis that this variant is only pathogenic in trans with a more severe allele, i.e. complex compound heterozygosity.
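The dependence of the required cohort size on q² can be made concrete with a short sketch. It uses the standard Hardy-Weinberg expectation, optionally extended with an inbreeding coefficient F (homozygote frequency q² + Fq(1 − q)). The function names and the illustrative value F = 1/16 (offspring of first cousins) are assumptions for this example and are not parameters reported in the study.

```python
def homozygote_frequency(q: float, inbreeding_f: float = 0.0) -> float:
    """Expected frequency of homozygotes for an allele of frequency q.

    With inbreeding_f = 0 this is the Hardy-Weinberg value q**2;
    a positive inbreeding coefficient adds the term F * q * (1 - q).
    """
    return q * q + inbreeding_f * q * (1.0 - q)


def cohort_size_for_one_expected_homozygote(
    q: float, inbreeding_f: float = 0.0
) -> float:
    """Number of individuals needed to expect a single homozygote."""
    return 1.0 / homozygote_frequency(q, inbreeding_f)


# gnomAD allele frequency quoted in the text for DHCR7 c.1A>G.
q_gnomad = 0.00001
outbred_size = cohort_size_for_one_expected_homozygote(q_gnomad)
# Illustrative only: the same allele frequency with F = 1/16.
inbred_size = cohort_size_for_one_expected_homozygote(q_gnomad, 1.0 / 16.0)
```

Under random mating the required cohort is on the order of 10^10 individuals, whereas even modest inbreeding reduces it by several orders of magnitude, which is the intuition behind observing such homozygotes in a consanguineous population.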
The small contribution of "technical" challenges we identified in this large cohort is consistent with prior work demonstrating, using an unbiased positional mapping approach, that the overwhelming majority of disease-causing variants, at least in the context of autosomal recessive diseases in consanguineous families, are identifiable by exome sequencing 5,38. Our work highlights the need to have a thorough approach to cases that remain undiagnosed after clinical exome/genome sequencing by considering all classes of challenges and not focusing solely on improved sequencing technologies. Indeed, many studies on the reanalysis of existing exome and even genome data have demonstrated an improved diagnostic rate 39,51,52. This study suggests that addressing non-sequencing-based challenges alone could boost the diagnostic yield by ~71%. Another important aspect worth highlighting is the value of data sharing and collaboration, which is the motive behind this work. We note that the majority of our novel gene-disease assertions were corroborated by international collaborations we participated in through data sharing (Supplementary Data 24).
In conclusion, we report the largest comprehensive analysis to date on challenges in the identification of causal variants in patients undergoing genome sequencing. Our data argue that investment into new sequencing technologies should be accompanied by similar investment into improved interpretation pipelines if we are to reap the full benefits of clinical genomics. We hope the lessons learned from our analysis will assist the development of such tools and improve the interpretation of genome sequencing both at the variant and the gene levels.

Methods
We confirm that our research complies with all relevant ethical regulations at KFSHRC.

Human subjects
Only patients with a suspected Mendelian disease are included in the Mendelian genomics cohort. Informed consent is obtained prior to enrollment. Different IRB-approved projects were used to enroll subjects depending on their phenotype (KFSHRC RAC# 2070023, 2080006, 2121053, 2170028, 2200030, 2080033, 2210029, 2140016, 2230016, and 2080051). The consent covers the use of human fetal material and the generation and use of patient-derived cell lines (LCLs and fibroblasts) whenever needed. The authors affirm that human research participants provided written informed consent for publication of the images in Figs. 2, 3, 4, 5 and Supplementary Fig. 3. The authors also affirm that human research participants provided written informed consent for the publication of identifiable data.

Testing strategies
Cases recruited prior to 2012 were analyzed using next-generation-based multi-gene panels relevant to their clinical phenotype. Negative cases as well as cases recruited after 2012 were submitted for ES. Simplex cases as well as cases with likely dominant inheritance were additionally subjected to chromosomal microarray. Select negative cases after ES were analyzed using optical genome mapping. All families were submitted for genotyping, and positional mapping was used as appropriate. Miscellaneous testing strategies (methylation analysis, MLPA, repeat expansion, and WGS) were requested clinically as appropriate by the ordering physicians from various CAP-accredited laboratories, and their results were recorded. Supplemental Methods explain the technical details of the testing platforms.

Cell culture and RT-qPCR
Sub-confluent patient- and control-derived fibroblast cell lines were maintained in Minimum Essential Medium Eagle (MEM) supplemented with 10% fetal bovine serum (FBS, Gibco), 2 mM L-glutamine (Gibco) and penicillin G (200 U ml⁻¹, Gibco). Lymphoblastoid cell lines were maintained using RPMI media supplemented with 15% fetal bovine serum (FBS, Gibco), 1% L-glutamine (Gibco), and 1% penicillin G (200 U ml⁻¹, Gibco). Cell lines were maintained in a humidified 37 °C incubator at 5% CO2. For RT-PCR and RT-qPCR, total RNA from either fibroblast or lymphoblastoid cell lines was extracted with the QIAamp RNA Mini Kit (QIAGEN) according to the manufacturer's recommendations. Preparation of the cDNA was carried out with the iScript™ cDNA synthesis kit and Poly T oligonucleotide primers (Applied Biosystems). GAPDH was used as an internal control. Relative quantitative RT-PCR for expression analysis was performed with SYBR Green and the Applied Biosystems StepOnePlus Real-Time PCR System (StepOne™ Software).
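Relative quantification against an internal control such as GAPDH is commonly computed with the 2^(−ΔΔCt) method. The text does not state which calculation was used, so the sketch below is an assumption for illustration only; the function name and the Ct values are hypothetical.

```python
def relative_expression(
    ct_target_sample: float,
    ct_reference_sample: float,
    ct_target_control: float,
    ct_reference_control: float,
) -> float:
    """Fold change of a target transcript by the 2^(-ddCt) method.

    Each Ct of the target gene is first normalized to the reference
    gene (e.g. GAPDH) measured in the same sample, then the patient
    sample is compared with the control sample. Assumes roughly 100%
    amplification efficiency for both primer pairs.
    """
    delta_ct_sample = ct_target_sample - ct_reference_sample
    delta_ct_control = ct_target_control - ct_reference_control
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: the target crosses threshold two cycles later
# in the patient than in the control, with identical GAPDH Ct values.
fold_change = relative_expression(26.0, 18.0, 24.0, 18.0)
percent_reduction = (1.0 - fold_change) * 100.0
```

With these made-up values the target transcript is at one quarter of the control level, that is, a 75% reduction; in practice the calculation would be repeated across replicate experiments before reporting a mean and standard deviation.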

Variant annotation and classification
Candidate variants and their familial segregation were confirmed using Sanger sequencing. All variants are listed using MANE Select or MANE Plus Clinical when applicable and classified according to ACMG guidelines 53 with the help of VARSOME 54, which was subsequently verified manually. Segregation level support of variant pathogenicity ranged from supporting to strong as detailed in 55. The tool ConsCal was used to increase the throughput of segregation level analysis 56. Novel genes and known genes with novel allelic disorders were classified using ClinGen gene-disease assertion guidelines 57. Local allele frequency (AF) was calculated based on an in-house dataset of 13,473 local exomes.
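As a minimal sketch of how a local AF can be derived from genotype counts, the snippet below divides the number of alternate alleles by the total number of alleles. It assumes an autosomal site with a called genotype in every individual of the dataset; the function names and the example counts are hypothetical, and the 0.001 cut-off is the default threshold for rare recessive diseases discussed elsewhere in the text.

```python
def local_allele_frequency(
    n_heterozygous: int, n_homozygous: int, n_individuals: int = 13_473
) -> float:
    """Alternate allele frequency at an autosomal site.

    Each heterozygote contributes one alternate allele and each
    homozygote two; every individual contributes two alleles in total.
    """
    alt_alleles = n_heterozygous + 2 * n_homozygous
    return alt_alleles / (2 * n_individuals)


def exceeds_cutoff(allele_frequency: float, cutoff: float = 0.001) -> bool:
    """True when a variant would be discarded by a simple AF filter."""
    return allele_frequency > cutoff


# Hypothetical counts: 40 heterozygotes and 2 homozygotes in the dataset.
af = local_allele_frequency(n_heterozygous=40, n_homozygous=2)
filtered_out = exceeds_cutoff(af)
```

A founder variant with these counts would sit above the 0.001 threshold and be dropped by a default filter, which is why variants above the cut-off deserve a second look in populations with strong founder effects.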

Review of challenging scenarios
A detailed review of every single molecularly characterized family was undertaken to identify instances that are deemed challenging by a standard approach 58. Categorization of the challenges followed a consensus approach among the scientific team. All families that were susceptible to a given challenge were counted.

Fig. 1 | Overview of the challenges encountered based on a cohort of 4577 molecularly characterized families. In a cohort of 4577 molecularly characterized families, we encountered 5 main scenarios we found to be challenging in 1570 families. First, there are phenotype-related challenges that can be further subcategorized into phenotypic heterogeneity, phenotypic expansion, novel allelic disorders, blended phenotype, and misleading diagnoses. Gene-related challenges comprise the discovery of novel disease genes, novel mutation mechanisms, and cases where the animal model did not corroborate the phenotype observed in human patients. Pedigree-related challenges comprise gonadal mosaicism, allelic

Fig. 2 | Clinical images of select challenges. a-g Phenotype-related challenges: a Pedigree of family F2640 with intrafamilial phenotypic heterogeneity molecularly diagnosed with Temple and Kagami-Ogata syndrome. b Chest X-ray of deceased individual IV:4 displaying "coat hanger" appearance. c Chest X-ray of individual IV:1 with the same pathogenic deletion on Chr 14 showing normal chest shape. d Phenotypic expansion: ultrasound imaging of fetus from family F7829 showing bilateral renal agenesis caused by a homozygous variant in CD151. e, f Novel allelic disorder: ultrasound imaging of fetuses from families F6581 and F6582, respectively, highlighting edema and COL25A1-related fetal akinesia. g Blended phenotype: ultrasound imaging of the right kidney of a child from family F6917 showing polycystic kidneys caused by two heterozygous variants in HNF1B and PKD1. h-j Gene-related challenges: h MRI images of a brain of an affected individual from family F4159 molecularly diagnosed with biallelic variants in SLC20A2 showing calcification. i Clinical images of an affected member of family F732 with a homozygous variant in TCOF1. j Clinical image of an affected child from family F3151 showing typical Cohen syndrome facies with the typical "grimace" upon smiling, molecularly diagnosed with a missense (rather than LOF) variant in VPS13B. k-m Variant-related challenge: k Pedigree of family F6211 with one child affected with microcephaly, short stature, and limb abnormalities (l, m) caused by a compound heterozygous variant in DONSON. Logos for gene-related, phenotype-related, and variant-related challenges are created using BioRender.com.

Fig. 3 | A novel allelic disorder caused by biallelic LOF in ABL1. a Pedigree of family F6666 with two similarly affected children and one child affected with Down syndrome. b Schematic representation contrasting the phenotypic differences between monoallelic gain of function and biallelic loss of function variants in ABL1. c-e Clinical photographs showcasing facial dysmorphia in affected individual IV:1. f Photograph showing short fingers in individual IV:1. g Clinical photograph of the similarly affected sister IV:3. h Sanger sequencing showing the homozygous 46 bp duplication in ABL1 in the two affected siblings, which is heterozygous in the parents and brother affected with Down syndrome. Panel (b) was created with BioRender.com.

NM_015102 NM_016953
c.1909+22G>A | Homozygous in individuals who lack the phenotype | Leukodystrophy, hypomyelinating, 7, with or without oligodontia and/or hypogonadotropic hypogonadism | 607694
ABCA4 NM_000350.3:c.5882G>A;p.(Gly1961Glu) | Homozygous in individuals who lack
c.82A>C;p.(Ser28Arg) | Homozygous in individuals who lack the phenotype | Microcephaly, short stature, and limb abnormalities | 617604
CLCN7 NM_001287.6:c.1208G>A;p.(Arg403Gln) | Homozygous in individuals who lack the phenotype | Osteopetrosis, autosomal recessive 4 | 611490

Article https://doi.org/10.1038/s41467-023-40909-3
Nature Communications | (2023) 14:5269

Table 4 | Presumptive loss of function variants in known disease-related genes that are not disease-causing †
c.3387+3G>T;r.3305_3387del;p.(Lys1102Thrfs*27) | Homozygous | Developmental and epileptic ence-
deleted bp is the last bp in the exon-intron boundary and the intron is only one base pair long
NPHP4 | Startloss in some transcript and UTR variant in others. GTEX database reported that the shorter transcripts (where the variant is located in the UTR regions) are the ones abundantly expressed in most tissues especially the brain and reproductive organs
PDE11A | Truncates a small part of the C-terminus where no deleterious variants are reported
CDKL5 NM_003159.2:c.2854C>T;p.(Arg952*) | Heterozygous | Developmental and epileptic encephalopathy 2 | 300672

Fig. 4 | Identification of a homozygous deletion affecting the regulatory elements of DKK1 in four siblings with complex syndactyly. a Clinical image of index individual of family F3029 showing syndactyly in feet. b X-ray images of the feet. c Photograph of the hands showing syndactyly. d X-ray of the hands. e Clinical image of the hands of a similarly affected sister with syndactyly of the middle finger. f Clinical image of the feet of the affected sister with syndactyly. g Representative illustration of DKK1 gene in mouse and human genomes. In mouse, the highlighted regions A, B, C, and D correspond to the four conserved non-coding elements (CNE25, CNE114, CNE190, and CNE195, respectively) identified downstream of DKK1 and are shown to drive its expression 62. In humans, peaks corresponding to H3K27Ac and DNase-seq experiments are also highlighted in the deleted region identified in the affected members of this family. h Schematic representation of WT and Dkk1 knockout mice showing syndactyly in hands and feet. i RT-qPCR experiment measuring the transcript levels of DKK1 in the index individual and his affected sister compared to the control. Data show 60-80% reduction in DKK1 transcript levels. Data are presented as mean values +/− standard deviation (SD). Error bars represent the SD of three experiments. Panels (h) and (g) are created with BioRender.com.

Table 1 | Families with novel allelic disorders described in this manuscript for the first time

Table 2 | Cases with known gene, novel mutation mechanisms described in this manuscript for the

variant was particularly challenging because it is synonymous and does not appear to be splicing in nature (deep exonic). However, RT-PCR experiments revealed aberrant splicing resulting in frameshift and early truncation of the protein (BCS1L:NM_001079866.2:c.441C>T;r.436_460delGTTTTCTTCAACATCCTGGAGGAAG;p.(Val146Leufs*4)).

b. Allele frequency above cut-off: a default AF of <0.001 is often used as a threshold for rare recessive diseases 20. However, variants with AF above cut-off accounted for the molecular diagnosis in 255 families (16.2%, Supplementary Data 11 and Supplementary Fig. 2).

Weaver syndrome was difficult to interpret because it corresponds to the normal allele in marmosets. Poor in silico predictions at the protein level of 8 missense variants belied their deleterious nature at the RNA level (see tentative TDVs). Non-coding genes represented a distinct class of in silico prediction challenges, e.g. RNU4ATAC-related primordial dwarfism.

b. Complex compound inheritance (Table 3): This phenomenon refers to recessive variants that only cause disease when inherited in trans with a different variant (see Allele frequency above cut-off category above). Family F1865 with Stargardt disease is a good example where the ABCA4 variant NM_000350.3:c.5882G>A;p.(Gly1961Glu) was initially dismissed because of the high frequency in gnomAD in addition to the presence of unaffected homozygous individuals. Of note, functional studies have demonstrated a deleterious nature of this variant 22,23. We suggest this is a mild variant only pathogenic in trans with a more severe variant, as in this family where it was inherited in trans with a strong loss of function allele NM_000350.3:c.1937+1G>A.
3-Variant-related: i. Interpretation challenges: a. Tentative transcript-deleterious variants (TDV): With the exception of canonical ±1/2 splicing donors and acceptors, all other tentative TDV were challenging to interpret even though they were mostly compatible with exome capture. These accounted for 177 families (11.3%, Supplementary Data 10 includes 85 families described in Maddirevula et al., and 92 described here for the first time). Notable examples include families F2204, F7801, and F8083 with severe lactic acidosis that defied analysis until the synonymous variant BCS1L:NM_001079866.2:c.441C>T;p.(=) was found to be a TDV.

*6C>G (AF of 0.014411), SBDS:NM_016038.4:c.258+2T>C (AF of 0.001082), DHCR7:NM_001360.3:c.1A>G;p.? (AF of 0.003652), POLR3A:NM_007055.4:c.1909+22G>A (AF of 0.005418) and EYS:NM_001142800.2:c.2137+1G>A (AF of 0.007348) associated with thrombocytopenia-absent radius syndrome, Shwachman-Diamond syndrome 1, Smith-Lemli-Opitz syndrome, Hypomyelinating leukodystrophy, and Retinitis pigmentosa 25, respectively (see complex compound heterozygous inheritance below).

3. The variant leads to a highly variable phenotype: an example worth highlighting is ALDOB:NM_000035.4:c.448G>C;p.(Ala150Pro) with local AF of 0.002705, which was identified in a 63-year-old male who presented to the clinic for a familial cancer phenotype. ES analysis revealed that the index individual and, subsequently by Sanger analysis, several other family members have a homozygous variant in ALDOB known to cause hereditary fructose intolerance (Supplementary Fig. 1c). This finding is highly unusual because the index patient and other homozygous adults did not have any

report, our patient has the classical clefting between stratum corneum and stratum granulosum (Supplementary Fig. 3). The highly variable nature of this condition, especially in the local hot environment that induces occasional acral blistering during excessive sweating even among normal individuals, likely accounts for the cryptic nature of this condition. Another example is C8B:NM_000066.4:c.1282C>T;p.(Arg428*), which causes an immunodeficiency that only manifests under certain

achromatopsia. However, we later identified another affected family that is only homozygous for the c.101+1G>A on the same haplotype background, indicating that this is the ancestral disease haplotype and is sufficient to cause the disease.

d. Incomplete penetrance: In 65 families (4.1%, Supplementary Data 13), the causal variant was difficult to interpret due to a perceived lack of segregation (including the presence of homozygous state in public or local databases), which posed a

Table 2 (continued) | Cases with known gene, novel mutation mechanisms described in this manuscript for the

Table 3 | Variants displaying complex compound heterozygosity