Phenotype-loci associations in networks of patients with rare disorders: application to assist in the diagnosis of novel clinical cases

Copy number variations (CNVs) are genomic structural variations (deletions, duplications, or translocations) that account for 4.8–9.5% of human genome variation in healthy individuals. In some cases, CNVs can also lead to disease, being the etiology of many known rare genetic/genomic disorders. Despite recent advances in genomic sequencing and diagnosis, the pathological effects of many rare genetic variations remain unresolved, largely because few patients are available for these cases, which makes it difficult to identify consistent patterns of genotype–phenotype relationships. We aimed to improve the identification of statistically consistent genotype–phenotype relationships by integrating the genetic and clinical data of thousands of patients with rare genomic disorders (obtained from the DECIPHER database) into a phenotype–patient–genotype tripartite network. We then assessed how our network approach could help in the characterization and diagnosis of novel cases in clinical genetics. The systematic approach implemented in this work defines the relationships between phenotypes and specific loci more precisely by exploiting large-scale association networks of phenotypes and genotypes in thousands of rare disease patients. Applying this methodology facilitated the diagnosis of novel clinical cases by ranking phenotypes by locus specificity and reporting putative new clinical features that may suggest additional clinical follow-ups. The proof of concept developed over a set of novel clinical cases indicates that this network-based methodology might help improve the precision of patient clinical records and the characterization of rare syndromes.

(n_y) is the total number of nodes in the intermediate layer (patients).
Table S1 lists the main association indices tested in this work. The Jaccard metric is the simplest: it is computationally light and easy to interpret, but it offers limited discrimination, because it considers only the proportion of shared nodes and ignores the total number of nodes in the network. The Pearson Correlation Coefficient (PCC), in contrast, measures the linear relationship between two interaction profiles, taking into account whether each interaction is present or absent. It returns a value between 1 (perfect correlation of profiles) and -1 (perfect anticorrelation), with 0 corresponding to a random comparison. This matters because, in biological networks, shared non-partners can be as informative as shared ones. Finally, the Hypergeometric test requires more computational resources but gives finer resolution and, more importantly, measures statistical significance rather than raw magnitude: it determines the likelihood of observing a given overlap between the interaction profiles of two nodes. This makes it the most powerful metric for our analysis. Figure S1 shows the behavior of these three metrics in different situations.
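The three indices described above can be sketched for a pair of nodes whose interaction profiles are sets of patients. This is a minimal illustration and not the authors' released scripts: the function names are hypothetical, the PCC is computed as the correlation of two binary presence/absence vectors, and the hypergeometric function returns the raw upper-tail probability, since the exact transformation of that probability into the HyI score is not restated in this section.

```python
from math import comb, sqrt

def jaccard(a, b):
    """Shared partners divided by the union of partners (ignores network size)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pearson_binary(a, b, n_y):
    """PCC between two 0/1 interaction profiles of length n_y.

    n_y is the total number of nodes in the intermediate layer (patients).
    """
    shared = len(a & b)
    num = n_y * shared - len(a) * len(b)
    den = sqrt(len(a) * (n_y - len(a)) * len(b) * (n_y - len(b)))
    return num / den if den else 0.0

def hypergeom_pvalue(a, b, n_y):
    """Probability of an overlap at least as large as the observed one
    when |b| patients are drawn at random from n_y, |a| of which are marked."""
    shared = len(a & b)
    upper = min(len(a), len(b))
    tail = sum(
        comb(len(a), k) * comb(n_y - len(a), len(b) - k)
        for k in range(shared, upper + 1)
    )
    return tail / comb(n_y, len(b))

# Toy profiles: two nodes with 10 patients each, 5 of them shared.
a = set(range(0, 10))
b = set(range(5, 15))
for n_y in (20, 1000):
    print(n_y, jaccard(a, b), pearson_binary(a, b, n_y), hypergeom_pvalue(a, b, n_y))
```

In this toy case the Jaccard value is the same for both network sizes, whereas the PCC and the hypergeometric probability change with n_y, which is the dependence on the total number of nodes noted in the text.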
We have made available the main scripts of the methodology developed in this work, together with documentation and instructions for their use. The scripts and guides are available at: https://github.com/bio267lab/HyI

Figure S1. Metrics behavior examples. The Jaccard, PCC and Hypergeometric measures all give the best score to the third of the three examples, but they differ in other respects. The Jaccard metric cannot distinguish between example 1 and example 2, even though the two differ considerably in the similarity of the profiles of A and B. PCC penalizes the second example because 50% of the connections of A and B are not shared, and the metric detects an anticorrelation. The Hypergeometric Index, on the other hand, returns values between 0 and 1, assigning the lowest value to example 2 and the highest to example 3, which shows a high level of discrimination.
We would expect these metrics to behave in a reasonably similar way, since they all measure network connectivity and profile similarity. To check that assumption, we produced a set of comparative plots of the results of the different methods (see Fig. S2 and Section 3). The results showed that, although the metrics are correlated, HyI shows the best performance. Finally, taking into account that correlation, the validation benchmarking and the intrinsic nature of each metric, we decided to use the Hypergeometric index to build the genotype-phenotype association method, because it is more accurate and includes an intrinsic measure of statistical significance.

Figure S2. Comparison of the distributions of the HyI (Hypergeometric) and Jaccard metrics for the same DECIPHER tripartite networks of de novo duplications (upper plot) and de novo deletions (lower plot).

SECTION 2. Validation and benchmarking of the association metrics.
To validate our method, we performed a 10-fold cross-validation on our data, randomly splitting the DECIPHER patient dataset into 10 sub-samples: each sub-sample was used in turn as the positive dataset of phenotype-CNV associations, and the remaining 90% of patients were used to build the tripartite network (training set).
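The splitting step of this cross-validation can be sketched as follows. The function name, the fixed random seed and the use of integer patient identifiers are assumptions made for illustration; building the tripartite network from each training set is not shown.

```python
import random

def ten_fold_splits(patient_ids, k=10, seed=0):
    """Yield (training, held_out) patient lists for a k-fold cross-validation.

    Patients are shuffled once, dealt into k folds, and each fold is used
    in turn as the held-out (positive) set while the rest form the training set.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield training, held_out

# Toy usage with 100 hypothetical patient identifiers.
for training, held_out in ten_fold_splits(range(100)):
    print(len(training), len(held_out))
```

Shuffling once and then dealing patients into folds guarantees that every patient is held out exactly once across the 10 iterations.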
The phenotype-loci association values were calculated on every tripartite network using three methods: Hypergeometric Index (HyI), Pearson Correlation Coefficient (PCC) and Jaccard. The precision/recall curve for each association measure was calculated using the methodology described in Pandey et al. 2007, and the three curves were plotted together (see Fig. S3).

Figure S3. Precision vs. recall curve for the HyI (red), PCC (blue) and Jaccard (green) association indices using DECIPHER data. Precision (prec) and recall (rec) values range from 0 to 1.
Comparing the precision vs. recall curves of the three indices shows a significantly better performance of the HyI than of the Jaccard and PCC measures. These results support the use of the HyI to measure the phenotype-loci associations in this work.
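As a rough illustration of how such a curve can be obtained from scored associations, the sketch below ranks phenotype-locus pairs by score and accumulates precision and recall against a held-out positive set. It is a generic ranking-based calculation and is not claimed to reproduce the procedure of Pandey et al. 2007; the function name, the pair labels and the scores are hypothetical.

```python
def precision_recall_points(scored_pairs, positives):
    """Return (recall, precision) after each rank of the score-sorted pairs.

    scored_pairs: dict mapping (phenotype, locus) -> association score
    positives:    non-empty set of held-out (phenotype, locus) pairs
    """
    ranked = sorted(scored_pairs, key=scored_pairs.get, reverse=True)
    points, hits = [], 0
    for rank, pair in enumerate(ranked, start=1):
        if pair in positives:
            hits += 1
        points.append((hits / len(positives), hits / rank))
    return points

# Toy example with made-up labels and scores.
scores = {
    ("phen_A", "locus_1"): 0.9,
    ("phen_B", "locus_1"): 0.8,
    ("phen_C", "locus_2"): 0.3,
    ("phen_D", "locus_3"): 0.1,
}
held_out = {("phen_A", "locus_1"), ("phen_C", "locus_2")}
for recall, precision in precision_recall_points(scores, held_out):
    print(recall, precision)
```

Running the same function on the scores produced by each index over the same held-out set gives curves that can be overlaid for comparison.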

SECTION 3. Comparative statistics amongst the Hypergeometric Index (HyI) values, HPO frequencies, and the number of patients/CNVs.
To study possible dependencies, we carried out several analyses relating the frequencies of HPO phenotypes and of patients/CNVs in the whole network to the distribution of HyI values.
Phenotype prevalence, associated loci and HyI values.
The distribution plot of HPO frequency vs. HyI values (Fig. S4) shows a negative relationship: the higher the frequency of an HPO term, the lower the range of its HyI values. In addition, a positive relationship is observed between phenotype prevalence (HPO frequency) and the number of associated loci when these two variables are plotted against each other (Fig. S5). The negative relationship with HyI values and the positive correlation with the number of loci indicate that highly prevalent phenotypes in the DECIPHER patient dataset tend to be associated with more loci than less prevalent ones, which increases the probability of being associated with a locus by chance (HyI null hypothesis). This in turn reduces the overall HyI values of phenotype-loci associations, as illustrated in scenario 1 of Figure 3 of the main manuscript.
Conversely, less frequent phenotypes show distributions with a wider and higher range of HyI values than more prevalent ones (e.g. phenotype frequency < 0.001, see Fig. S4). These results suggest that less prevalent phenotypes are less likely than more prevalent ones to be associated with the same locus by chance (HyI null hypothesis), which allows us to identify more significant (high HyI values) phenotype-locus associations, as illustrated in scenario 2 of Figure 3 of the main manuscript.

The vast majority of patients are annotated with a single CNV, so the number of patients and the number of CNVs can be treated as practically identical variables. We performed two analyses to study how the number of patients/CNVs overlapping the same locus influences the HyI. In the first analysis we studied the relationship between loci and patients/CNVs by plotting the number of loci as a function of the number of patients per locus (see Fig. S6). A small peak is observed around 60 patients per locus, corresponding to an unusually high accumulation of patients in specific regions of chromosomes 22 and 16, but the mode of the remaining main distribution is about 10 patients per locus. This can be explained by the fact that DECIPHER is composed of patients with rare genomic variants, which makes a large accumulation of patients with CNVs overlapping the same locus unlikely. In the second analysis we studied the effect of the number of patients per locus on the HyI values (see Fig. S7). This plot shows that the distributions of HyI values remain almost constant across all values of the number of patients per locus, indicating that a higher number of patients per locus is not correlated with higher HyI association values.

Figure S6. Distribution of the number of loci shared by a given number of patients.

Figure S7. Distribution of Hypergeometric Index values (HyI) vs. number of patients per locus.
These results indicate that HPO prevalence influences (is correlated with) the HyI values, while the number of patients/CNVs per locus does not. This could be explained by the particular topology of the DECIPHER tripartite network, in which a group of prevalent HPOs connects many patients and the mode is about 10 patients connecting each locus.
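The effect of phenotype prevalence under the hypergeometric null hypothesis can be illustrated with a toy calculation. The numbers below (1000 patients, a locus covered by 10 patients, 5 of whom carry the phenotype) are invented for illustration and are not taken from DECIPHER. The function name is hypothetical, and the sketch compares raw upper-tail probabilities rather than HyI scores, because the mapping from probability to HyI is not restated in this section.

```python
from math import comb

def overlap_pvalue(n_phenotype, n_locus, shared, n_patients):
    """Probability that at least `shared` of the n_locus patients carry the
    phenotype by chance, given n_phenotype carriers among n_patients."""
    upper = min(n_phenotype, n_locus)
    tail = sum(
        comb(n_phenotype, k) * comb(n_patients - n_phenotype, n_locus - k)
        for k in range(shared, upper + 1)
    )
    return tail / comb(n_patients, n_locus)

# Same observed overlap (5 of 10 patients at the locus), different prevalence.
rare = overlap_pvalue(n_phenotype=8, n_locus=10, shared=5, n_patients=1000)
prevalent = overlap_pvalue(n_phenotype=300, n_locus=10, shared=5, n_patients=1000)
print("rare phenotype:     ", rare)
print("prevalent phenotype:", prevalent)
```

With the same observed overlap, the rare phenotype yields a far smaller chance probability than the prevalent one, which is consistent with the wider and higher range of association values reported for low-frequency HPO terms.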