Genomic profiling of 553 uncharacterized neurodevelopment patients reveals a high proportion of recessive pathogenic variant carriers in an outbred population

A substantial portion of Mendelian disease patients suffers from genetic variants that are inherited in a recessive manner. A precise understanding of pathogenic recessive variants in a population would assist in pre-screening births of such patients. However, a systematic understanding of the contribution of recessive variants to Mendelian diseases is still lacking. Therefore, genetic diagnosis and variant discovery of 553 undiagnosed Korean patients with complex neurodevelopmental problems (KND for Korean NeuroDevelopmental cohort) were performed using whole exome sequencing of patients and their parents. Disease-causing variants, including newly discovered variants, were identified in 57.5% of the probands of the KND cohort. Among the patients with the previous reported pathogenic variants, 35.1% inherited these variants in a recessive manner. Genes that cause recessive disorders in our cohort tend to be less constrained by loss-of-function variants and were enriched in lipid metabolism and mitochondrial functions. This observation was applied to an estimation that approximately 1 in 17 healthy Korean individuals carry at least one of these pathogenic variants that develop severe neurodevelopmental problems in a recessive manner. Furthermore, the feasibility of these genes for carrier screening was evaluated. Our results will serve as a foundation for recessive variant screening to reduce occurrences of rare Mendelian disease patients. Additionally, our results highlight the utility and necessity of whole exome sequencing-based diagnostics for improving patient care in a country with a centralized medical system.


Results
Diagnostic success rate of WES analyses. The symptoms experienced by KND patients were mostly of pediatric onset (mean 1.4 years of age). The patients harbored neurodevelopmental problems and were soon referred to tertiary hospitals (mean 1.8 years of age). The majority of the patients visited multiple tertiary hospitals for diagnosis (88.8% visited more than one hospital, mean of 2.3 hospitals), required a mean of 2.3 specialists (31.6% required more than two) and a mean of 5.6 years elapsed before WES analysis at SNUCH (Table 1, Supplementary Fig. S2). The distribution of straight-line distances from home to the clinic strongly correlates with the original population distribution of Korea, suggesting that our cohort covers the entire population (Table 1, Supplementary Fig. S3).
The majority of the patients is sporadic origin (504/553 = 91.1%; Fig. 1a), making them suitable for trio-based WES analysis. Major clinical feature of the KND cohort was neurodevelopmental disease (84.1%; Fig. 1b). Integrative assessments of genetic variants, their clinical impacts, and patient symptoms allowed us to diagnose 40.3% (223/553) of the cohort with high confidence. This group of patients included carriers of CNVs (23/553 = 4.2%; 16 heterozygous deletions and 7 duplications; Supplementary Table S1), in which 20 CNVs originated de novo (3.6% of the entire cohort), which is slightly lower but comparable to a previous observation [20][21][22][23] . Three inherited pathogenic CNVs were identified: a 165.5 kb deletion in a large family containing multiple affected individuals ( Supplementary Fig. S4), and 7.2 Mb and 203.2 kb hemizygous duplications on the X chromosome that were transmitted from healthy moms to their affected sons (Supplementary Table S1). In addition to the high confidence group, additional 7.1% of the cohort (39 patients) harbored previously reported variants that were assumed to be pathogenic but displayed distinct phenotypes, potentially expanding the phenotypic spectrum associated with these genes. For example, two patients that carried a pathogenic heterozygous nonsense or missense variant in COL1A1, known to cause osteogenesis imperfecta 24 , were initially diagnosed with muscle hypotonia. These two patients did not display skeletal problems, but showed blue sclera 25 . Adding this group to the high confidence group yielded an instant diagnostic rate of 47.4% ("known genes"; Supplementary  Table S2). Finally, an additional 10.1% of the cohort (56 patients, 53 genes) harbored variants that are highly likely to be pathogenic but their disease associations are elusive ("novel genes"; Fig. 1c). Among the patients with definite diagnosis, 35.1% are recessive, and 29.9% harbored loss-of-function (LoF) variants (Fig. 1d, Supplementary  Fig. S5). As expected, the known genes showed strong enrichment in disease categories and gene ontologies such as intellectual disability and central nervous system (CNS) development (Fig. 1e).
Novel genes display potential enrichment in neuronal differentiation. We assessed whether the 53 novel genes possess a neurologic disease-causing function. The novel gene set was simulated against the BrainSpan data (Materials and Methods) to evaluate if the expression of these novel genes as a group was strongly correlated with known disease-associated genes during brain development. After 10 5 permutations, we found that the observed involvement of the novel genes was significantly stronger than a randomly selected gene sets across eight developmental windows (P adj < 0.05 from Z-score for all periods; Fig. 1f,g). Furthermore, this test was expanded to the four anatomical domains in each period, yielding 32 spatio-temporal windows 26 (Fig. 1g). It is notable that the most highly enriched windows are concentrated in the frontal cortex area (R1 × P1-4; Fig. 1g). These results suggest that expression of the novel genes is concordant with known disease-causing genes in developing brains and this phenomenon is most prominent in the frontal cortex.
Profile of recessive variants that predispose neurodevelopmental disorders. Using our set of defined pathogenic variants, we explored the genetic properties of the variants that caused disorders. Both dominant and recessive variants displayed similar proportions by functionality ( Supplementary Fig. S5) and carriers of dominant or recessive variants experienced similar ages of onset (data not shown). Next, to test if recessive variants (i.e., compound heterozygous (CH), rare homozygous (RHo) and rare hemizygous (RHe) variants) are more frequently found in patients as compared to healthy individuals, we counted the number of recessive variants in our cohort and compared these values between patients and healthy parents as controls. Counting all recessive variants from patients and controls, we observed that there is no substantial difference in the number of variants for CH, RHo and RHe (Fig. 2a). Extracting LoF variants, variants in OMIM-listed genes or variants in neurodevelopment-related genes did not reveal any difference in burden (Fig. 2a, Supplementary Fig. S6), implying the presence of overwhelming non-pathogenic or non-functional recessive variant calls in the patients. The majority of genes that were found in our patients with definite diagnosis has been previously documented in OMIM, and has good concordance with previously known recessive or dominant inheritance patterns (Fig. 2b). There were two exceptional cases in which the genes are listed as recessive in OMIM but were dominantly inherited in our patients. First, only the recessive ACOX1 phenotype has been recognized to date 27 , but we are currently working on a report that describes this dominant ACOX1 variant. Second, a previous report suggested that the C19orf12 variant has dominant effect, similar to our observations. But this report is not yet listed in OMIM 28 . Next, to test if the genetic properties of recessive variants are different from those of dominant variants, several parameters were compared. Dominant variants (mean allele frequency = 6.2 × 10 −7 ) were found less frequently in gnomAD than recessive variants (mean allele frequency = 1.6 × 10 −5 ; Mann-Whitney U test P = 1.7 × 10 −13 ), since most of the dominant variants originated de novo whereas recessive variants were inherited from healthy parents (Fig. 2c). Recessive variants were slightly less conserved as compared to the dominant variants, based on PhyloP or amino acid conservation among vertebrate species (Mann-Whitney U test P = 0.034 and 0.048, respectively; Fig. 2d). Other functionality test values did not differ significantly between the two groups (CADD P = 0.50, GERP P = 0. 15  Age at WES (years) 7.4(0-37)

Primary clinical diagnosis (n (%))
Rett  Fig. 2e). On the contrary, the genes that contain the recessive variants displayed more lenient constraint as compared to the dominant genes or known haploinsufficiency genes, as documented by observed/expected ratio (o/e) and pLI score in gnomAD 29 . But the recessive genes still display a similar or a slightly more constrained pattern compared to the genes in OMIM (Fig. 2f, Supplementary Fig. S7). Functional characterization of recessive neurodevelopmental disease genes revealed an enrichment for genes involved in lipid metabolism and mitochondrial processes (Fig. 2g), in addition to the expected enrichment in CNS development. The relative position of LoF variants in recessive genes demonstrated a similar lenient pattern, more enriched in the C-terminal portion, as compared to the dominant genes (Fig. 2h). There was no significant difference in basic clinical parameters (displayed in Table 1) between the recessive and dominant patient groups (data not shown).

Profile of pathogenic recessive variants carried in healthy individuals. Unlike de novo variants
that originate largely at random, assembly of recessive variants can be pre-screened and avoided if such variants can be identified in parents. Taking advantage of the extensive coverage of the patient pool maintained by Korea's  disorder clinic every year, (iii) these patients encompass the majority of the Korean population, as exemplified by the sizes of our DMD and Rett cohorts 17,18 and reflected in the geographical distribution of the KND patients (Table 1, Supplementary Fig. S3), (iv) our result from 553 patients revealed a recessive genetic origin in approximately one-third of the patients (Fig. 1d) and (v) Koreans typically marry an individual with minimal genetic similarity. These observations lead to a 1/1,200 incidence rate for developing a severe neurodevelopmental disease in a recessive manner, which can be explained by the existence of one carrier for every 17 healthy individuals (1/1,156; Fig. 3a). One can point out limited evidence for one of our assumptions that we cover the majority of the Korean population. But applying a partial coverage in the estimation will result in increased incidence of neurodevelopment disorder patients and unintentionally inflate the carrier frequency. Therefore, the assumption ensures conservative estimation. Next, we sought to understand the properties of pathogenic recessive variants as compared to the variants found from gnomAD on the same 69 genes that contain these variants. As expected, KND recessive variants were found less frequently (P = 4.2 × 10 −10 , Mann-Whitney U test; Fig. 3b), were strongly conserved during evolution (P = 3.0 × 10 −5 for PhyloP and P = 3.2 × 10 −7 for amino acid conservation, Mann-Whitney U tests; Fig. 3c), and displayed stronger functionality scores (CADD P = 2.5 × 10 −5 , GERP P = 4.0 × 10 −3 and SIFT P = 3.2 × 10 −6 , Mann-Whitney U tests; Fig. 3d) compared to all gnomAD variants found in the same genes. To test the feasibility of accurate pre-screening rare pathogenic variants in a healthy population, we considered gnomAD-originated heterozygous LoF and ClinVar variants found in our recessive genes as a first-tier culprit for pathogenic recessive variants among many variants of obscure functional significances. And we observed that the portion that is attributable to LoF and ClinVar variants by healthy carriers was variable among the genes, and this portion is correlated with the o/e LoF value (Pearson's correlation r = 0.33; Fig. 3e).

Discussion
This study demonstrates the clinical utility of applying WES to children with various and complex neurodevelopmental disorders. We identified genetic causes in 47.4% of the patients and evaluated the characteristics of the variants that caused the disorders in a recessive manner. Consistent with previous studies, we were able to diagnose approximately half of the KND patients (Fig. 1c) [10][11][12]14,16 . Our pathogenic genes and variants were in good correlation with OMIM inheritance pattern and ACMG guideline (Fig. 2b,e). The novel genes formed significantly robust co-expression networks during neurodevelopment processes, which was most prominent in frontal cortex regions (Fig. 1f,g). There was no significant difference in the number of recessive calls between patients and controls, even after stratifying the calls into disease-related gene sets (Fig. 2a, Supplementary Fig. S6). This result suggests following: (i) although categorized as "neurodevelopmental", our patient set is heterogeneous in nature, diluting contribution of single functional entity, (ii) patients carry more non-pathogenic or non-functional recessive calls than expected. To surmount these issues, we need more power to extract biologically relevant signals. The pathogenic recessive variants that cause rare neurodevelopmental disorders displayed moderately increased allele frequency values and marginally increased evolutionary conservation as compared to dominant variants, suggesting that the qualitative differences between these two groups of variants were not dramatically different (Fig. 2c,d). Compared to the relatively mild differences in variant characteristics, the genes that caused such disorders either in recessive or dominant manner displayed more discernible differences based on several parameters. First, the recessive genes harbor increased o/e LoF values, implying that there is no constraint applied to the recessive genes, whereas dominant genes in displayed a highly biased pattern toward known haploinsufficiency genes (Fig. 2f). Similarly, the distribution of the relative locations of LoF variants in genes suggested that recessive genes were less constrained compared to dominant genes (Fig. 2h). Gene ontology (GO) analysis further supports that -while the two groups are predominantly composed of neurodevelopment-related genes -recessive genes contain an enriched number of genes that are involved in lipid metabolism and mitochondrial function (Fig. 2g). This is a plausible result considering that these pathways are essential for normal brain development 31,32 . To summarize, these observations suggest that gene properties are a stronger determinant of whether a disease adopts a recessive or dominant inheritance pattern than variant properties.
Predicting and avoiding the occurrence of recessive disorders is critical. Carrier estimation has been traditionally performed mostly for single-gene diseases such as β-thalassaemia, Tay-Sachs disease and cystic fibrosis, and efficiently reduced the incidence of these patients [33][34][35] . However, even after introducing aggressive analysis of genetic disorders using whole exome or whole genome sequencing (WGS), precise estimation of the incidence and contribution of rare Mendelian disorders in a recessive manner remains variable for study populations and disorders 2 . For example, analysis of the Deciphering Developmental Disorders (DDD) data revealed a small contribution of recessive disorders (3.6%) to patients of European ancestry, whereas this value was 30.9% for patients of Pakistani ancestry 4 . Systematic analysis of schizophrenia data did not detect a substantial contribution of recessive variants 5,6 . These observations differ from ours, where 35.1% of the definitely diagnosed patients emerged in a recessive mode (Fig. 1d), which is in good agreement with previous diagnostic WES studies 10,36,37 . Notably, a recent study using a large autism cohort revealed that a substantial portion of the patients are attributable to the recessive LoF variants 38 .
Since the majority of these pathogenic recessive variants were inherited from healthy parents, and ethnic Koreans comprise a relatively isolated population with a centralized medical system, it was feasible to derive an estimate that one out of 17 individuals are healthy carriers of pathogenic recessive variants for severe neurodevelopmental disorders (Fig. 3a). Accumulating this estimated carrier portion across different rare severe disease entities will certainly increase this ratio. The contribution of known LoF and ClinVar variants varies by genes and is positively correlated with o/e LoF values (Fig. 3e), and pathogenic recessive variants display systematic differences that differentiate them from gnomAD variants (Fig. 3b-d). Thus, it would be feasible to predict potential rare recessive variants from genomic data of healthy parents with the help of large patient and control genomic data in the near future.
Our approach expanded the phenotypic spectrum of known genes (39 cases, 7.1%), and suggested novel genes that may allow us to better understand neurodevelopmental disease mechanisms (56 patients, 10.1%). Nevertheless, 42.5% of the cases (235/533) remained undiagnosed even after our WES effort, suggesting a substantial opportunity for further improvement (Fig. 1c). Related to this, a systematic re-analysis effort with additional bioinformatics pipelines increased the diagnostic rate by 4.2% 39 . Also, searching for functional non-coding variants through WGS and evaluating multiple rare functional variants that may increase disease predisposition may be beneficial 40 , although a recent meta-analysis study claimed only a minimal improvement in the WGS diagnostic rate, presumably due to our limited understanding of the function of noncoding variant 41 . An alternative approach would be to integrate genome data with transcriptome data in order to identify functionally cryptic variants that directly influence expression of critical genes 42,43 , although preparing patient-derived tissue still remains as a practical challenge.
Our study also addresses the clinical challenges of an evolving phenotype over time in growing children and how this can be overcome, which facilitates identification of treatable or actionable cases (  Supplementary Fig. S8). Our patient cohort included a successful drug repositioning case for a rare neurogenetic disease 44 ( Table 2). All of these cases are expected to increase as more genotype-phenotype relationships are discovered and more drugs become available. This study demonstrates that applying WES and subsequent in-depth analysis provides clinical and practical benefits to existing patients and their families and reducing emergence of such patients. Finally, we demonstrated the successful establishment of this approach in Korea, and the necessity of this approach for patients with various undiagnosed neurodevelopmental disorders in countries of similar status.

Methods
Subjects. Blood  www.nature.com/scientificreports www.nature.com/scientificreports/ encephalopathy, Rett syndrome-like encephalopathy, ataxia, neuropathy, myopathies, and multiple congenital anomalies/dysmorphic syndromes with developmental problems ( Table 1). The patients can be categorized into two groups: (i) clinically diagnosable but with substantial genetic heterogeneity (270/553, 48.8%) or (ii) heterogeneous and nonspecific clinical presentations without definite diagnosis (283/553, 51.2%; Supplementary Fig. S1). Prior to the WES analysis, thorough clinical and laboratory evaluations and extensive patient examinations have been conducted to identify possible genetic causes. These included genetic tests with candidate gene sequencing, targeted gene sequencing panel, trinucleotide repeat analysis, microarray, metabolic work-up, brain/spine MRI, or muscle biopsy if applicable. All subjects were evaluated by three pediatric neurologists, two pediatric neuroradiologists, and a pathologist.
Whole exome sequencing. WES was performed at Theragen Etex Bio Institute (Suwon, Korea) following the standard protocol and the data were analyzed as described previously 18 . Depending on the genetic analysis result, each patient was categorized as one of the following: category 1: known disease-causing genes were found; category 2: causative gene for other diseases were found; category 3: potentially pathogenic gene, but without prior disease association, was found; category 4: no disease-causing candidates were found; and category 5: known pathogenic copy number variation (CNV) was found.
Our variant assessment procedures were as follows: firstly, patient-specific CNVs were checked and samples with CNVs were classified as category 5. Then, patient-specific nucleotide variants such as de novo, compound heterozygous (CH; autosomal), and rare homozygous (RHo; autosomal) and hemizygous (RHe; X-linked) variants were selected by comparing against parents and prioritized based on the inheritance pattern (Fig. 1a). Variants with low in-house quality score (<60) or low coverage depths (<10) were excluded. Variants in intron regions and pseudogene were also filtered out.
In correlating the patient's clinical features with genetic variants, if patients carried a known pathogenic variant in OMIM or ClinVar, they were categorized as category 1 or 2, depending on similarity with reported and observed clinical manifestation. For previously unreported variants, if they were never seen in normal individuals (Genome Aggregation Database (gnomAD) 29 , Korean Variant Archive (KOVA) 45 and in-house variants), harbored LoF, or were evolutionarily well-conserved at the amino acid level, they were categorized as potentially pathogenic. ACMG codes for the variants were annotated using a script provided by InterVar 46 . For CNV calling, the normalized coverage depths of each captured intervals were compared to the depths from related individuals. Human brain transcriptome data. The BrainSpan transcriptome database (http://www.brainspan.org) was used to build developing human brain networks 47 . Data from eight post-conceptual weeks to 40 years of age were analyzed. A total of 385 samples were used for the analysis after combining the multiplicates by taking mean values. Probes with TPM (transcripts per million) >5 in at least one sample were used, yielding 23,943 probes as "brain-expressed transcripts".
Brain transcriptome network analysis. Using the above brain-expressed transcripts, we created eight known gene co-expression networks by selecting genes that are highly correlated to our set of genes (n = 164; Pearson's correlation r > 0.7), in which their disease associations were previously reported, from each

Significance of WES-based patient evaluation (treatment) References
Developmental regression with Rett syndrome-like phenotype ST3GAL5 Salt and pepper developmental regression syndrome (#609056) Identified the molecular defect and established an accurate diagnosis 50,51 Hypotonia and motor delay followed by lower extremity weakness

ASAH1
Farber lipogranulomatosis (#228000) Corrected a misdiagnosis 53 Ataxia followed by generalized dystonia ANO3 Expanded spectrum of dystonia 24 (#615034) Suggested a treatment strategy that resulted in gradual improvement within one year (deep brain stimulation) 54 Focal lower leg dystonia, dystonic gait SLC2A1 GLUT1 deficiency syndrome 2 (#612126) Identified disease-specific treatment that resulted in near-elimination of dystonia (ketogenic diet) 55 Leigh syndrome SLC19A3 Thiamine metabolism dysfunction syndrome 2 (#606152) Identified disease-specific treatment that resulted in clinical improvements in dystonia, spasticity, and cognitive function (supplements of thiamine and biotin) 56 Recurrent infections, telangiectatic skin mottling, and brain infarctions TMEM173 STING-associated vasculopathy, infantile-onset (#615934) Provided a rationale for a new treatment strategy that improved the skin lesions (tofacitinib treatment) 44 Severe global developmental delay, seizures, and acanthotic skin lesions

RAB11B
Neurodevelopmental disorder with ataxic gait, absent speech, and decreased cortical white matter (#617807) Identified a new disease gene leading to a neurodevelopmental syndrome 57 Table 2. Notable cases where WES-based analysis conferred correct diagnoses or changed medical treatment strategies.