Improving the diagnostic yield of exome-sequencing, by predicting gene-phenotype associations using large-scale gene expression analysis

Clinical interpretation of exome and genome sequencing data remains challenging and time consuming, with many variants with unknown effects found in genes with unknown functions. Automated prioritization of these variants can improve the speed of current diagnostics and identify previously unknown disease genes. Here, we used 31,499 RNA-seq samples to predict the phenotypic consequences of variants in genes. We developed GeneNetwork Assisted Diagnostic Optimization (GADO), a tool that uses these predictions in combination with a patient’s phenotype, denoted using HPO terms, to prioritize identified variants and ease interpretation. GADO is unique because it does not rely on existing knowledge of a gene and can therefore prioritize variants missed by tools that rely on existing annotations or pathway membership. In a validation trial on patients with a known genetic diagnosis, GADO prioritized the causative gene within the top 3 for 41% of the cases. Applying GADO to a cohort of 38 patients without genetic diagnosis, yielded new candidate genes for seven cases. Our results highlight the added value of GADO (www.genenetwork.nl) for increasing diagnostic yield and for implicating previously unknown disease-causing genes.


Introduction 41
With the increasing use of whole-exome sequencing (WES) and whole-genome sequencing 42 (WGS) to diagnose patients with a suspected genetic disorder, diagnostic yield is steadily 43 increasing [1]. Although our knowledge of the genetic basis of Mendelian diseases has 44 improved considerably, the underlying cause remains elusive for a substantial proportion of 45 cases. The diagnostic yield of genome sequencing varies from 8% to 70% depending on the 46 patient's phenotype and the extent of genetic testing [2]. Sequencing all ~20,000 protein-47

165
Prioritization of known disease genes using the annotated HPO terms 166 Once we had calculated the prediction scores of HPO disease phenotypes, we leveraged 167 these scores to prioritize genes found by sequencing the DNA of a patient. For each 168 individual HPO term-gene combination, we calculated a prediction z-score that can be used 169 to rank genes. In practice, however, patients often present with not one feature but a 170 combination of multiple features. Therefore, we combined the z-scores for each HPO term 171 [24] to generate an overall z-score that explains the full spectrum of features in a patient. 172 GADO uses these combined z-scores to prioritize the candidate genes: the higher the 173 combined z-score for a gene, the more likely it explains the patient's phenotype. 174 Because many HPO terms have fewer than 10 genes annotated, and since we were unable 175 to make significant predictions for some HPO terms, certain HPO terms are not suitable to 176 use for gene prioritization. We solved this problem by taking advantage of the way HPO 177 terms are structured. Each term has at least one parent HPO term that describes a more 178 generic phenotype and thus has also more genes assigned to it. Therefore, if an HPO term 179 cannot be used, GADO will make suggestions for suitable parental terms (supplementary 180 figure 1). 181 To benchmark our prioritization method, we used the OMIM database [5]. We tested how 182 well our method was able to retrospectively rank disease-causing genes listed in OMIM 183 based on the annotated symptoms of these diseases. We took each OMIM disease gene (n 184 = 3,382) and used the associated disease features (15 per gene on average) as input for 185 GADO. What we found was that for 49% of the diseases GADO ranks the causative gene in 186 the top 5% (Figure 3a, b). Moreover, we observed a statistically significant difference 187 11 between the performance of GADO on true gene-phenotype combinations and its 188 performance using a random permutation of gene-phenotype combinations (p-value = 2.16 189 × 10 -532 ). the annotated HPO terms related to the Parkinsonism-dystonia for this gene. This may, 206 however, be due to very low expression levels of SLC6A3 in most tissues except specific 207 brain regions [28]. 208 To better understand why we can't predict HPO terms for all genes, we used the Reactome, 209 GO and KEGG prediction scores. Jointly these databases comprise thousands of gene sets. 210 Since these databases describe such a wide range of biology, we assumed that if a gene 211 does not show any prediction signal for any gene set in these databases, gene co-212 expression is probably not informative for this gene. To quantify this, we calculated, per 213 gene, the average skewness of the z-score distribution of the Reactome, GO and KEGG gene 214 sets. From this we were able to derive a 'gene predictability score' for every gene that is 215 independent of whether this gene is already known to play a role in any a disease or 216 pathway (Figure 3c, d, e). We then ascertained whether these 'gene predictability scores' 217 are correlated with the prediction z-score of the OMIM diseases, and found a strong 218 correlation (Pearson r = 0.54, p-value = 1.14 × 10 -332 ) between the gene predictability 219 scores and GADO's ability to identify a known disease gene (Figure 3c). 220 To investigate why some genes have a high 'gene predictability score' but low prediction 221 performance, we scored a set of genes known to cause cardiomyopathy (CM) for the 222 amount of literature evidence that these genes cause CM. We found several genes for which 223 the prediction score for the CM phenotype is lower than expected based on the gene 224 However, this gene is primarily expressed in the liver. Therefore, its disease mechanism is 228 different from other mechanisms resulting in CM, as many inherited CMs are caused by 229 deleterious variants in genes highly expressed in the heart and directly affecting the 230 function of the cardiac sarcomere. Therefore, the phenotypic function prediction for this 231 gene may be worse than we would expect based on the predictability score. We performed a 232 13 similar analysis using the HPO term 'dilated cardiomyopathy' and observed a low prediction 233 performance for the TMPO gene, despite a high gene predictability score (supplementary 234 figure 2b). Previously, this gene was reported to be related to dilated cardiomyopathy 235 (DCM) and listed as such by OMIM. However, recent reclassification of the reported variants 236 using the ExAC data revealed that the reported variant was far too common to be causative 237 for DCM [30]. 238

Benchmarking GADO using solved cases with realistic phenotyping 239
Although in silico benchmarking demonstrated the potential of GADO, it used all annotated 240 HPO terms for a disease. In practice, however, patients may only present with a limited 241 number of the annotated features. To perform a validation that was a more realistic 242 reflection of clinical practice, we used exome sequencing data of 83 patients with a known 243 genetic diagnosis. We used their phenotypic features as listed in their medical records prior 244 to the genetic diagnosis (supplementary table 2). On average, per patient, GADO yielded 56 245 possible disease-causing genes with variants that are rare and predicted to be deleterious. 246 In 41% of the patients the actual causative gene was ranked in the top 3 and in 50% of the 247 cases it was in the top 5 (mean rank 10) (Figure 4a). 248

Clustering of HPO terms 249
In addition to ranking potentially causative genes based on a patient's phenotype, we 250 observed that GADO can be used to cluster HPO terms based on the genes that are predicted 251 to be associated to these HPO terms. This can help identify pairs of symptoms that often occur 252 together, as well as symptoms that rarely co-occur, and we actually observed this for a patient 253

Reanalysis of previously unsolved cases 268
To assess GADO's ability to discover new disease genes, we applied it to data from 38 269 patients who are suspected to have a Mendelian disease but who have not had a genetic 270 diagnosis. All patients had undergone prior genetic testing (WES with analysis of a gene 271 panel according to their phenotype, supplementary table 3). On average three genes had a 272 z-score ≥ 5 (which we used as an arbitrary cut-off and that correspond to a p-value of 5.7 X 273 10 -7 ) and were further assessed. In seven cases, we identified variants in genes not 274 associated to a disease in OMIM or other databases, but for which we could find literature or 275 15 for which we gained functional evidence implicating their disease relevance ( Table 1)  All analyses described in this paper can be performed using our online toolbox at 290 www.genenetwork.nl. Users can perform gene prioritizations using GADO by providing a set 291 of HPO terms and a list of candidate genes (Figure 5a). Per gene, it is also possible to 292 download all prediction scores for the HPO terms and pathways. Our co-regulation scores 293 between genes can be used for clustering. Furthermore, the predicted pathway and HPO 294 annotations of genes can be used to perform function enrichment analysis (Figure 5b). We 295 also support automated queries to our database. showed that for 49% of the diseases the causative gene ranks in the top 5%. We also 323 investigated the OMIM "provisional" category of genes for which there is limited evidence. 324 Both the OMIM disease-gene annotation and the provisional annotations perform 325 significantly better than a random permutation. While we do find a small but significant 326 difference in prediction performance between the provisionally annotated genes and the 327 more established disease associated genes, we conclude, based on our findings, that these 328 20 provisional OMIM annotations are generally of similar reliability to the other OMIM disease 329

annotations. 330
Benchmarking on sequence data of patients with a known genetic diagnosis revealed that 331 GADO returned the real causative variant within the top 3 results for 41% of the samples, 332 indicating the potential power of GADO for a large number of diseases. Finally, in seven 333 patients, GADO was able to identify potential novel disease genes that are strong candidates 334 based on literature or functional evidence. For other cases we have identified genes with a 335 strong prediction score harboring variants that might explain the phenotype. However, since 336 very little is known about these genes it is not yet possible to draw firm conclusions. 337 Hopefully this will become possible in the near future through initiatives like Genematcher 338 [40]. is also able to identify potential new genes in human disease. However, Exomiser has a 351 limitation in that only a subset of the protein-coding genes has orthologous genes in other 352 species for which a knockout model also exists. Additionally, the used STRING interactions 353 are biased towards well studied genes and rely heavily on existing annotations to biological 354 21 pathways (supplementary figure 4). There are however, still 3,922 protein-coding genes 355 that are not currently annotated in any of the databases we used, and there are even more 356 non-coding genes for which the biological function or role in disease is unknown. Since 357 GADO does not rely on prior knowledge, it can be used to prioritize variants in both coding 358 and non-coding genes (for which no or limited information is available). GADO thus enables 359 the discovery of novel human disease genes and can complement existing tools in analyzing 360 the genomic data of patients who have a broad spectrum of phenotypic abnormalities. 361

Limitations 362
The gene predictability score indicates for which genes we can reliably predict phenotypic 363 associations and for which genes we cannot based on gene co-regulation. This score gives 364 insight into which genes are expected to perform poorly in our prioritization. We found 365 strong correlation between these gene predictability scores and the gene prioritization z-366 scores. Thus, genes with a high predictability score have more accurate HPO term 367 predictions. However, since our predictions primarily rely on co-activation patterns that we 368 identified from RNA-seq data, our method does not perform well for genes where gene-369 expression patterns are not informative of their function. This could, for instance, be the 370 case for proteins relying heavily on post-translation modifications for regulation or genes for 371 which different transcripts have distinct functions. This last limitation can potentially be 372 overcome by predicting HPO-isoform associations by using transcript-based expression 373 quantification. 374 Insufficient statistical power to obtain accurate predictions may be another explanation for 375 the low predictability scores of certain genes. This may be true for genes that are poorly 376 expressed or expressed in only a few of the available RNA-seq samples. The latter issue we 377 expect to overcome in the near future as the availability of RNA-seq data in public 378 repositories is rapidly increasing. Initiatives such as Recount enable easy analysis on these 379 22 samples [42], allowing us to update our predictions in the future, thereby increasing our 380 prediction accuracy. 381 For some genes we are unable to predict annotated disease associations despite having a 382 high gene predictability scores. Some genes, such as TTR, simply act in a manner unique to 383 a specific phenotype. Other genes, such as TMPO, turned out to be false positive disease 384 associations. These examples show that our gene predictability score has the potential to 385 flag genes acting in a unique manner as well as genes that might be incorrectly assigned to 386 a certain disease or phenotype. 387 We noted that the median prediction performance of HPO terms is lower compared to the 388 other gene sets databases used in our study, such as Reactome. This may be due to the 389 fact that phenotypes can arise by disrupting multiple distinct biological pathways. For 390 instance, DCMs can be caused by variants in sarcomeric protein genes, but also by variants 391 in calcium/sodium handling genes or by transcription factor genes [43]. As our methodology 392 makes guilt-by-association predictions based on whether genes are showing similar 393 expression levels, the fact that multiple separately working processes are related to the 394 same phenotype can reduce the accuracy of the predictions (although it is often still 395 possible to use these predictions as the DCM HPO phenotype prediction performance AUC = 396 Connecting variants to disease is a complex multistep process. The early steps are usually 412 highly automated, but the final most critical interpretations still rely on expert review and 413 human interpretation. GADO is a novel approach that can aid users in prioritizing genes 414 using patient-specific HPO terms, thereby speeding-up the diagnostic process. It prioritizes 415 variants in coding and non-coding genes, including genes for which there is no current 416 knowledge about their function and those that have not been annotated in any ontology 417 database. This gene prioritization is based on co-regulation of genes identified by analyzing 418 31,499 publicly available RNA-seq samples. Therefore, in contrast to many other existing 419 prioritization tools, GADO has the capacity to identify novel genes involved in human 420 disease. By providing a statistical measure of the significance of the ranked candidate 421 variants, GADO can provide an indication for which genes its predictions are reliable. GADO 422 can also detect phenotypes that do not cluster together, which can alert users to the 423 possible presence of a second genetic disorder and facilitate the diagnostic process in 424 patients with multiple non-specific phenotypic features. GADO can easily be combined with 425 any filtering tool to prioritize variants within WES or WGS data and can also be used in gene 426 panels such as PanelApp [45]. GADO is freely available at www.genenetwork.nl to help 427 guide the differential diagnostic process in medical genetics. 428 files (Supplementary methods 3). For 11 of the previously solved cases, GAVIN did not flag 506 the causative variant as a candidate. To be able to include these samples in our GADO 507 benchmark, we added the causative genes for these cases manually to the candidate list. 508 The phenotypic features of a patient were translated into HPO terms, which were used as 509 input to GADO. Here we only used features reported in the medical records prior to the 510 molecular diagnosis. If any of the HPO terms did not have significant predictive power, the 511 parent terms were used. From the resulting list of ranked genes, the known disease genes 512 harboring a potentially causative variant were selected. Next, we determined the rank of the 513 gene with the known causative variant among the selected genes. If a patient harbored 514 multiple causative variants in different genes, in case of di-genic inheritance or two 515 inherited conditions, the median rank of these genes was reported (supplementary table 2). 516

Unsolved cases cohorts 517
In addition to the patients with a known genetic diagnosis, we tested 38 unsolved cases 518 (supplementary table 3). These are patients with mainly cardiomyopathies or developmental 519 delay. All patients were previously investigated using exome sequencing, by analyzing a 520 gene panel appropriate for their phenotype. To allow discovery of potential novel disease 521 genes, we used GADO to rank genes with candidate variants (Supplementary methods 3). 522 For genes with a prediction z-score ≥ 5, a literature search for supporting evidence was 523 performed to assess whether these genes are likely candidate genes. 524 Website 525 To make our method and data available we have developed a website available at 526 www.genenetwork.nl that can be used to run GADO, lookup gene functions predictions, 527 visualize networks using co-regulations scores and perform function enrichments of sets of 528 genes (Supplementary methods 4).