Powerful use of automated prioritization of candidate variants in genetic hearing loss with extreme etiologic heterogeneity

Variant prioritization of exome sequencing (ES) data for molecular diagnosis of sensorineural hearing loss (SNHL) with extreme etiologic heterogeneity poses a significant challenge. This study used an automated variant prioritization system (“EVIDENCE”) to analyze SNHL patient data and assess its diagnostic accuracy. We performed ES of 263 probands manifesting mild to moderate or higher degrees of SNHL. Candidate variants were classified according to the 2015 American College of Medical Genetics guidelines, and we compared the accuracy, call rates, and efficiency of variant prioritizations performed manually by humans or using EVIDENCE. In our in silico panel, 21 synthetic cases were successfully analyzed by EVIDENCE. In our cohort, the ES diagnostic yield for SNHL by manual analysis was 50.19% (132/263) and 50.95% (134/263) by EVIDENCE. EVIDENCE processed ES data 24-fold faster than humans, and the concordant call rate between humans and EVIDENCE was 97.72% (257/263). Additionally, EVIDENCE outperformed human accuracy, especially at discovering causative variants of rare syndromic deafness, whereas flexible interpretations that required predefined specific genotype–phenotype correlations were possible only by manual prioritization. The automated variant prioritization system remarkably facilitated the molecular diagnosis of hearing loss with high accuracy and efficiency, fostering the popularization of molecular genetic diagnosis of SNHL.

Variant filtering and prioritization. Automated variant prioritization using EVIDENCE. EVIDENCE (https://3billion.io/) is a software package developed to prioritize and interpret variants based on patient phenotype and to perform variant classification 23. This system involves three major steps: variant filtration, classification, and similarity scoring according to patient phenotype (Fig. 1).
First, we used gnomAD v3.1.1 (http://gnomad.broadinstitute.org/) as a population genome database and the 3billion genome database (https://3billion.io/) to estimate allele frequency. Common variants with minor allele frequencies of > 5% in any subpopulation, except for founder populations such as Finnish and Jewish, were filtered out in accordance with the BA1 criterion of the ACMG guidelines 21. In addition, the exceptional cases reported as BA1 or BS1 variants were also excluded 24.
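The frequency filter described above can be sketched as follows. This is a minimal illustration rather than the EVIDENCE implementation: the population labels (`fin`, `asj`, `eas`, `nfe`), the function names, and the example variants are assumptions introduced here; only the 5% threshold and the exemption of founder populations come from the text.

```python
from typing import Dict, List

# Assumption: gnomAD-style population labels, with "fin" (Finnish) and
# "asj" (Ashkenazi Jewish) standing in for the exempted founder populations.
FOUNDER_POPULATIONS = {"fin", "asj"}
BA1_THRESHOLD = 0.05  # minor allele frequency > 5%, as stated in the text


def is_common(subpop_maf: Dict[str, float],
              threshold: float = BA1_THRESHOLD) -> bool:
    """Return True if any non-founder subpopulation exceeds the threshold."""
    return any(
        maf > threshold
        for pop, maf in subpop_maf.items()
        if pop not in FOUNDER_POPULATIONS
    )


def filter_rare(variants: Dict[str, Dict[str, float]]) -> List[str]:
    """Keep the identifiers of variants that are not filtered out as common."""
    return [vid for vid, freqs in variants.items() if not is_common(freqs)]


# Hypothetical variants, for illustration only.
example = {
    "var_A": {"eas": 0.08, "nfe": 0.01},   # common in a non-founder population
    "var_B": {"fin": 0.09, "eas": 0.001},  # frequent only in a founder population
    "var_C": {"eas": 0.0004, "nfe": 0.0},  # rare everywhere
}
kept = filter_rare(example)
```

In this sketch `var_A` is removed, whereas `var_B` survives because its only high frequency is in a founder population.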
Second, we extracted evidence on the pathogenicity of variants, including gene function, domain of interest, disease mechanism, inheritance pattern, and clinical relevance, from the scientific literature and disease databases, including OMIM (access date: August 2020, www.omim.org), ClinVar (access date: August 2020, https://www.ncbi.nlm.nih.gov/clinvar/), and UniProt (access date: August 2020, https://www.uniprot.org/). Predicted functional or splicing effects and the degree of evolutionary conservation of the identified variants were evaluated with several in silico tools, including REVEL, ada_score (using AdaBoost), and rf score (using the random forest algorithm) 25,26. Reference articles containing variant information, including de novo occurrence, functional studies, and segregation data, were reviewed daily by clinical geneticists affiliated with 3billion, and EVIDENCE was updated accordingly. A score > 0.5 in a given in silico tool was taken to predict a detrimental effect of the variant. Variant pathogenicity was classified and prioritized according to ACMG guidelines 21. EVIDENCE was used to prioritize variants classified as pathogenic, likely pathogenic, or VUS according to ACMG guidelines, with these variants categorized into three tiers according to their Bayesian score 27. The first tier includes variants scoring > 0.9, the second > 0.499, and the third > 0.1.
www.nature.com/scientificreports/
Third, the clinical phenotype(s) of the proband was translated into the corresponding standardized human phenotype ontology (HPO) terms, and the similarity to rare genetic diseases was measured 28,29. We calculated the similarity score between the patient's phenotype and the symptoms of the diseases associated with the variants prioritized according to ACMG guidelines.
The processes associated with genetic diagnosis, including processing of raw genomic data, variant prioritization, and phenotype-to-disease similarity measurement, were integrated and automated in a single computational framework. Within each tier, variants were ranked in descending order of the similarity score between the patient's phenotype and the associated disease. Variants with the highest similarity score within the highest tier were ultimately selected.
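The tiering and ranking logic described above can be illustrated with a short sketch. The tier cut-offs (> 0.9, > 0.499, > 0.1) are taken from the text; the `Candidate` class, the function names, and the example scores are hypothetical, and the way EVIDENCE actually computes the Bayesian and similarity scores is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    variant: str
    bayesian_score: float  # ACMG-based Bayesian score (assumed to be given)
    similarity: float      # phenotype-to-disease similarity (assumed to be given)


def tier(score: float) -> Optional[int]:
    """Map a Bayesian score to tier 1, 2, or 3; None if below every cut-off."""
    if score > 0.9:
        return 1
    if score > 0.499:
        return 2
    if score > 0.1:
        return 3
    return None


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order by tier first, then by descending similarity within each tier."""
    tiered = [(tier(c.bayesian_score), c) for c in candidates]
    kept = [(t, c) for t, c in tiered if t is not None]
    kept.sort(key=lambda tc: (tc[0], -tc[1].similarity))
    return [c for _, c in kept]


# Hypothetical scores, for illustration only.
ranked = rank([
    Candidate("A", 0.95, 0.40),
    Candidate("B", 0.95, 0.80),
    Candidate("C", 0.60, 0.99),
    Candidate("D", 0.05, 1.00),  # below the third-tier cut-off, dropped
])
top = ranked[0].variant
```

Here `B` is selected: it shares the first tier with `A` but has the higher similarity score, while `C` stays in the second tier despite its higher similarity.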

Figure 1. Human and EVIDENCE variant prioritization. A total of 263 unrelated probands from the SNUH and SNUBH sensorineural hearing loss cohort were evaluated using exome sequencing (ES). The ES data were analyzed by human bioinformaticians and by an automated variant prioritization system (EVIDENCE), and the prioritized variants were compared. The concordant call rate between humans and EVIDENCE across the entire cohort, counting either the same prioritized variants or the absence of candidate variants, was 97.72% (257/263).

In silico synthetic cases. To assess the EVIDENCE diagnostic yield, we generated 21 synthetic exomes. About 60,000-90,000 common variations, with a minor allele frequency (MAF) > 10% in any subpopulation, were sampled from the GRCh37 phase-3 exomes from the 1000 Genomes Project. Twenty-one GRCh37 phase-3 exome VCF files were synthesized using these common variants. Deafness variants were inserted into each synthesized exome VCF file. The deafness variants were selected from previously identified pilot variants, which were classified as pathogenic or likely pathogenic variants in ClinVar (Supp.

Multiplex ligation-dependent probe amplification (MLPA) of stereocilin (STRC).
Probands with mild-to-moderate hearing impairment and only VUS or no possible pathogenic variant were further subjected to MLPA to detect copy number variations (CNVs) encompassing STRC 31. Single heterozygous STRC variants were confirmed using long-range nested polymerase chain reaction (PCR) to avoid contamination by a pseudogene 31.

Results
Variant prioritization by humans. We Table 2). The addition of molecular genetic testing that enabled the identification of pathogenic CNVs revealed variants in an additional 19 probands (19/263, 7.22%) among the 131 undiagnosed probands (Supp. Table S3), leading to a total diagnostic yield of 57.41%. Of these 19 probands, 10 (10/263, 3.8%) carried one copy of a CNV in a trans configuration with a single heterozygous point mutation detected by ES. For these 10 patients, molecular genetic diagnosis could be completed only after MLPA encompassing STRC, which ultimately led to the diagnosis of compound heterozygosity for a CNV and a point mutation in STRC. These point mutations in STRC were further confirmed by long-range nested PCR. In the other nine probands that had remained undiagnosed using ES data (9/263, 3.4%), SNHL was explained exclusively by CNVs, found within the DFNB16 locus (n = 6), the DFNX2 locus (n = 2; SB332-653 and SB430-834), and from chr3q13.11 to chr3q13.31 (n = 1; SB318-627).
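The yields quoted above follow from simple proportions over the 263 probands. The short check below is only a worked illustration; the variable and function names are introduced here, and the count of 132 probands diagnosed by manual ES analysis is taken from the abstract.

```python
COHORT = 263

diagnosed_by_es = 132     # manual analysis of ES data (from the abstract)
diagnosed_by_cnv = 19     # additional probands resolved by CNV testing
cnv_with_point_mutation = 10  # CNV in trans with a heterozygous STRC point mutation
cnv_only = 6 + 2 + 1      # DFNB16, DFNX2, and chr3q13.11-q13.31


def pct(n: int, total: int = COHORT) -> float:
    """Percentage of the cohort, rounded to two decimals."""
    return round(100 * n / total, 2)


undiagnosed_after_es = COHORT - diagnosed_by_es
es_yield = pct(diagnosed_by_es)
cnv_gain = pct(diagnosed_by_cnv)
total_yield = pct(diagnosed_by_es + diagnosed_by_cnv)
```

The two CNV subgroups (10 and 9 probands) add up to the 19 additional diagnoses, and 132 + 19 = 151 probands give the total yield of 57.41%.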
Variant prioritization by EVIDENCE. All the deafness variants from the 21 in silico cases were correctly prioritized using EVIDENCE (Supp. Table S2). However, Exomiser did not prioritize the pathogenic variants in 3 of the 21 in silico cases. Three in silico cases had the variants GJB2 c.101T>C and GJB2 c.109G>A. For clinical patients, EVIDENCE prioritized 190 candidate variants from the 134 SNHL probands (134/263, 50.95%) (Tables 1, 2) at least 24-fold faster than humans (< 5 min vs. 2 h, respectively) and provided a diagnostic yield equivalent to that of humans (50.19%) (P = 0.931, chi-squared test).
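The comparison of the two diagnostic yields can be reproduced with a 2 x 2 chi-squared test. The sketch below is an assumption about how the test was run: the text only says "chi-squared test", and it is the Yates continuity correction (used here) that gives a P value close to the reported 0.931. The function name is hypothetical, and the P value is obtained from the closed form for one degree of freedom.

```python
import math
from typing import Tuple


def chi2_yates_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Chi-squared test with Yates correction for the table [[a, b], [c, d]].

    Returns (chi2, p). With one degree of freedom the survival function of
    the chi-squared distribution is erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            # max(0, ...) keeps the correction from overshooting small deviations
            deviation = max(0.0, abs(observed[i][j] - expected) - 0.5)
            chi2 += deviation ** 2 / expected
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Rows: EVIDENCE and humans; columns: diagnosed and undiagnosed probands.
chi2, p_value = chi2_yates_2x2(134, 263 - 134, 132, 263 - 132)
```

Without the continuity correction the same table gives a somewhat smaller P value, so the choice of correction is worth stating when reporting such a comparison.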
Two AD variants from two SNHL probands (SB316-522 and SB422-823) prioritized by EVIDENCE were subsequently rejected based on genotype-phenotype correlations (Table 3). Specifically, gap junction protein β3 (GJB3) c.538C>T was prioritized by EVIDENCE for SB316-522; however, SB316-522 showed enlarged vestibular aqueduct (EVA; unilateral) with Mondini deformity (bilateral), which could not be explained by GJB3 variants. Similarly, protein tyrosine phosphatase non-receptor type 11 (PTPN11) c.1001T>A was prioritized by EVIDENCE, but this was incompatible with the phenotype of auditory neuropathy spectrum disorder (ANSD) in SB422-823. EVIDENCE selected GJB3 c.538C>T for SB316-522 because this variant met the PVS1, PM2, and PP5 criteria based on multiple lines of data and was thus classified as a pathogenic variant according to the 2015 ACMG-AMP guidelines.

Pathogenic variants identified only by EVIDENCE.
Four pathogenic variants were exclusively identified by EVIDENCE (Table 4). In addition to its speed, EVIDENCE showed efficacy in the molecular diagnosis of rare syndromic deafness. For example, two PTPN11 variants, c.922A>G and c.836A>G, were identified by EVIDENCE in three probands (SH 271-631, SH 250-590, and SB308-611), none of whom showed the abnormal facial features or skeletal malformations associated with Noonan syndrome; they presented only with severe SNHL. Their other features were not sufficient to raise clinical suspicion of Noonan syndrome without molecular genetic confirmation. SH 271-631 and SB308-611 did not manifest any syndromic features other than congenital pulmonary artery stenosis, and SH 250-590 did not demonstrate any syndromic features other than multiple dark spots (lentigines) throughout the body. All three probands underwent cochlear implantation (CI) and demonstrated favorable hearing outcomes. SH 271-631 and SB308-611 underwent CI at 11 months, with a Categories of Auditory Performance (CAP) score of 5 at 1 year post-operation. SH 250-590 underwent CI at 13 months, with a CAP score of 5 at 15 months post-operation. One EFTUD2 variant, c.271+1G>A, was identified by EVIDENCE 32. The proband (SB542-1014) carrying the EFTUD2 variant showed mixed hearing loss, mandibulofacial anomaly, and congenital heart defect, and the pathogenicity of c.271+1G>A was validated by a minigene assay 32. Humans were unable to prioritize any variants related to rare syndromic hearing loss in these four SNHL probands. Thus, four SNHL probands, in whom humans had not reported any candidate variant, were identified by EVIDENCE as carrying a pathogenic variant. Therefore, the proportion of SNHL probands who remained "undiagnosed" after ES analysis by humans was reduced from 49.81% (131/263) to 48.29% (127/263) with the assistance of EVIDENCE.

Discussion
This study validated the application of automated phenotype-driven analysis software using clinical data from a large-scale hearing loss cohort comprising 263 real patients rather than hypothetical subjects. Although candidate variant prioritization by humans is not a gold standard, it is the conventional method for diagnosing genetic hearing loss. To improve the diagnostic accuracy of manual curation, twelve experts in clinical genetics and genetic hearing loss took part in the manual curation process and held consensus discussions more than three times. Moreover, in silico analyses were conducted, and the results were compared with those of another program, Exomiser. In addition to the definitively diagnosed cases carrying exclusively pathogenic or likely pathogenic variants, complex cases harboring single or multiple VUS could also be analyzed by EVIDENCE. Given the increasing number of these complex cases, the findings of the present study support the clinical use of automated phenotype-driven analysis software in the genetic testing and diagnosis of SNHL patients. EVIDENCE was able to prioritize candidate variants associated with SNHL with a 97.72% (257/263) concordance rate with variants identified by experienced human bioinformaticians. In terms of molecular diagnostic yield for SNHL using ES data, EVIDENCE narrowly outperformed human bioinformaticians [...] not have been identified by human bioinformaticians. However, human bioinformaticians managed to identify most of the convincing candidate variants from three SNHL probands after referring to predefined, specific genotype-phenotype correlations, which was not possible using EVIDENCE. Moreover, combining the results of humans and EVIDENCE gave an ES diagnostic yield of 51.71% (136/263). We found that EVIDENCE processed variant prioritization from ES data about 24-fold faster than human bioinformaticians (~5 min vs. 2 h).
Indeed, excessive time would have been required for manual analyses conducted by unskilled bioinformaticians. The time spent curating candidate disease-causing variants in ES data was estimated as ~54 min (range 5-223 min) per variant, and ~81 h was predicted as the time required for manual prioritization of variants in ES data, based on an estimated 90-127 genetic variants curated per individual 33. To expedite the analysis of ES data, multiple programs, including the Exomiser and Genomiser tools 34,35 and Phevor 36,37, have been developed. The diagnostic yield of these automated methods is considered comparable with that of manual analyses, although automated software can fail to curate a candidate variant because of inappropriate thresholds in phenotypic cut-off filters 37. Given that the diagnostic yield of ES for hearing loss has been superior to that for other disorders (55% vs. 28.8% for overall disorders) 6, automated phenotype-driven analysis of ES data could be clinically applicable to patients with hearing loss, presumably with the potential for a relatively higher diagnostic yield than in other diseases. Although previous studies validated phenotype-driven analysis software in comparison with conventional manual analysis 17,37, no previous studies analyzed patients with SNHL in this context. The syndromic features of SNHL, including facial dysmorphisms and developmental delay, often do not become obvious until later stages; thus, genetic diagnosis of neonatal SNHL could predate manifestation of the syndromic features, as demonstrated by our four cases exclusively diagnosed by EVIDENCE.
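The time figures quoted above can be checked with a few lines of arithmetic. This is only a worked illustration with names introduced here; it suggests that the ~81 h estimate corresponds to the lower bound of 90 curated variants, and that the 24-fold speed-up follows from about 2 h of manual work against about 5 min for EVIDENCE.

```python
MINUTES_PER_VARIANT = 54              # mean manual curation time per variant
VARIANTS_LOW, VARIANTS_HIGH = 90, 127  # variants curated per individual

# Predicted manual curation time per exome, in hours.
hours_low = MINUTES_PER_VARIANT * VARIANTS_LOW / 60
hours_high = MINUTES_PER_VARIANT * VARIANTS_HIGH / 60

# Speed-up reported in this study: about 2 h manually vs. about 5 min.
MANUAL_MINUTES = 2 * 60
EVIDENCE_MINUTES = 5
speedup = MANUAL_MINUTES / EVIDENCE_MINUTES
```

With 127 variants the same per-variant estimate would exceed 114 h, which illustrates how strongly the manual workload depends on the number of variants left after filtering.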
Focusing on the pathogenic or likely pathogenic variants, the concordance rate of EVIDENCE with analysis by human bioinformaticians was 95.12% (117/123) (Table 2). Notably, EVIDENCE outperformed manual variant prioritization, especially in cases of syndromic deafness. This might be due to the absence of a syndromic phenotype, or its subclinical status, at the time of genetic diagnosis in these patients, which is usually no later than the age of 1 year. Thus, clinicians frequently do not suspect syndromic SNHL, and variants in the causative genes of syndromic SNHL can be discarded. Additionally, the wide spectrum of phenotypes related to syndromic deafness hampers identification of specific candidate causative genes. As a classic example, Noonan syndrome presents with a wide spectrum of clinical features 38,39. In the present study, the three probands carrying PTPN11 variants, missed by humans, did not exhibit definite syndromic facial features. Furthermore, genes associated with syndromic hearing loss can be detected even in patients with non-syndromic hearing loss and with no or subclinical syndromic phenotypes 40, precluding prediction of a causative gene solely based on a syndromic phenotype. For example, our previous study reported an ANSD patient carrying an ATP1A3 variant without pathognomonic features and presenting a cerebellar ataxia, areflexia, pes cavus, optic atrophy, and sensorineural hearing loss (CAPOS) phenotype 41. EVIDENCE could potentially facilitate early diagnosis of such syndromic diseases before patients manifest the definite clinical features. Another proband, carrying an EFTUD2 splice-site variant, was also diagnosed exclusively by EVIDENCE; this case was retrospectively reviewed by humans and published in another article 32. Although this proband (SB542-1014) did show syndromic mandibulofacial anomaly and congenital cardiac defect, molecular diagnosis of the EFTUD2 variant was not made by humans, likely due to the rarity and wide phenotypic spectrum of mandibulofacial dysostosis, Guion-Almeida type.
The other two discordant calls between EVIDENCE and humans regarding pathogenic or likely pathogenic variants arose from different interpretations of single heterozygous, likely pathogenic variants in AR genes, which were prioritized as causative variants only by humans. Human bioinformaticians can consider these monoallelic recessive alleles causative by relying on a very specific radiological or audiological phenotype. Specifically, unilateral EVA accompanied by bilateral incomplete partition type II (referred to as "Mondini malformation") in SB316-522 and prelingual ANSD in SB422-823 were so distinctive that the monoallelic variant detected in the corresponding signature gene was considered causative. We speculate that yet-to-be identified noncoding region variants or CNVs in or encompassing SLC26A4 and OTOF might contribute to these specific phenotypes in a trans configuration with the single heterozygous allele. SLC26A4 c.2168A>G is a well-known recurring pathogenic variant with null function previously demonstrated in an in vitro study 42. Although SLC26A4 variants that cause hearing loss have AR inheritance, a number of previous studies demonstrated EVA with monoallelic SLC26A4 variants 43,44. These monoallelic SLC26A4 variants are proposed to cause EVA in combination with either yet-to-be identified pathogenic variants in noncoding regulatory regions of SLC26A4, as supported by analysis of EVA-recurrence rates 43-45, or regulatory genes of SLC26A4, such as EPHA2 46. On the other hand, EVIDENCE prioritized GJB3 c.538C>T as a candidate variant for SB316-522. GJB3 was first reported as a causative gene for bilateral high-frequency hearing loss 47, and three additional studies suggested, with uncertain significance, a pathogenic potential of GJB3 for hearing loss 48-50. However, although GJB3 c.538C>T co-segregated with hearing loss in two Chinese families in an AD pattern, one unaffected family member also harbored a monoallelic GJB3 c.538C>T variant 47, precluding confirmation of the pathogenicity of GJB3 c.538C>T. Additionally, the MAF in the KRGDB was reported as 0.09% (3/1722 individuals), suggesting that this variant is likely benign.
Another monoallelic, likely pathogenic variant in the AR gene OTOF (c.2521G>A) was prioritized by humans in a proband (SB422-823) with prelingual ANSD. This variant was estimated to be the second-most common OTOF variant (as high as 13.6%) in OTOF-related ANSD (DFNB9) in Koreans 51. The pathogenicity of single heterozygous OTOF variants has been reported in clinical studies 52,53. Given the etiologic homogeneity of prelingual ANSD, the single heterozygous OTOF variant likely contributes to prelingual ANSD in combination with yet-to-be identified variants in the noncoding region of OTOF or CNVs encompassing OTOF 54. In the present study, EVIDENCE could not interpret these monoallelic variants in the absence of detailed genotype-phenotype information and data showing the possible presence of variants in a trans configuration. Therefore, second-tier analyses following variant prioritization by EVIDENCE, such as a segregation study (Fig. 2), are mandatory. Additionally, in this study, 19 probands required further molecular genetic studies beyond ES, such as chromosomal and CNV analyses (Fig. 2). To identify pathogenic genomic deletions, understanding the clinical phenotype of these 19 probands was crucial. Although hearing loss may be a single phenotype in HPO terms, the types and degrees of hearing loss vary according to the causal gene. Mild-to-moderate SNHL without any detectable causal variants in known deafness genes could be caused by CNVs in STRC 31. Given this knowledge, 16 DFNB16 probands were identified as carrying large STRC deletions using MLPA. Although ES alone did not enable us to reach a conclusive genetic diagnosis, a single heterozygous STRC variant could be a clue prompting further molecular genetic studies to evaluate the presence of CNVs, in addition to providing information that helps exclude causal variants in known deafness genes. Indeed, in our cohort, 62.5% (10/16) of DFNB16 probands harbored a single heterozygous STRC variant, which was detected by ES. Genomic deletions in the POU3F4 upstream region in two probands could not be detected by ES. Although no causal variant was selected in ES, the cochlear anomaly of incomplete partition type III in these two probands (SB332-653 and SB430-834) provided clues for the diagnosis of DFNX2 55.

Conclusion
EVIDENCE facilitated the exploration of candidate variants from ES; its application saved significant time and effort during variant prioritization and improved the detection rate for pathogenic and likely pathogenic variants of hearing loss. Although EVIDENCE was estimated overall to perform variant prioritization about 24-fold faster than humans, the time required for manual variant prioritization varied significantly between exomes, so the difference in time and efficiency between humans and EVIDENCE cannot be reduced to a single number. In addition, because the detection rate of hearing loss candidate variants in ES is relatively high compared with other disorders, the diagnostic yield of EVIDENCE reported here cannot be directly applied to other genetic disorders. However, this is the largest cohort study that validated the diagnostic yield of a phenotype-driven ES analysis software. Moreover, we performed additional downstream genetic studies beyond ES for patients in whom a CNV was suspected, allowing subsequent causative genetic diagnoses. Furthermore, cases with discordant calls between EVIDENCE and humans highlighted the strengths of automated prioritization of candidate variants and also indicated the direction in which EVIDENCE should evolve and how manual prioritization should improve. The cooperation of EVIDENCE with clinical geneticists could yield higher diagnostic accuracy and efficiency in analyzing and filtering ES data.

Data availability
The raw experimental data used to support the findings of this study are available from the corresponding author upon request. Variant prioritization using EVIDENCE (https://3billion.io/) is available after registration, for a fee.
Received: 4 May 2021; Accepted: 17 August 2021

Figure 2. Proposed workflow to reach the molecular diagnosis of genetic hearing loss cases with available exome sequencing (ES) data. Automated variant prioritization using EVIDENCE is the first-tier analysis, which is followed by second-tier analyses including a segregation study and Sanger sequencing. Additional molecular genetic studies are also required for cases undiagnosed by ES.