Leveraging multiple genomic data to prioritize disease-causing indels from exome sequencing data

The emergence of exome sequencing in recent years has enabled rapid and cost-effective detection of genetic variants in coding regions and offers a great opportunity to combine sequencing experiments with subsequent computational analysis for dissecting genetic basis of human inherited diseases. However, this strategy, though successful in practice, still faces such challenges as limited sample size and substantial number or diversity of candidate variants. To overcome these obstacles, researchers have been concentrated in the development of advanced computational methods and have recently achieved great progress for analysing single nucleotide variant. Nevertheless, it still remains unclear on how to analyse indels, another type of genetic variant that accounts for substantial proportion of known disease-causing variants. In this paper, we proposed an integrative method to effectively identify disease-causing indels from exome sequencing data. Specifically, we put forward a statistical method to combine five functional prediction scores, four genic association scores and a genic intolerance score to produce an integrated p-value, which could then be used for prioritizing candidate indels. We performed extensive simulation studies and demonstrated that our method achieved high accuracy in uncovering disease-causing indels. Our software is available at http://bioinfo.au.tsinghua.edu.cn/jianglab/IndelPrioritizer/.

. Schematic overview of our method. Our method takes a list of candidate indels and a disease of interest as input and outputs a prioritized list according to the likelihood of causing the disease. For each indel, we extract five kinds of functional scores, four kinds of association scores and RVIS score as genomic features. Each genomic feature is transformed into a p-value and we use Fisher's method with independence correction to integrate these p-values into an integrated p-value, which is used to prioritize candidate indels.
Performance in simulation studies. To validate our method for prioritizing disease-causing indels, we conducted the following simulation studies. We first obtained positive samples by extracting from the HGMD database coding indels with genomic length ranging from 1 bp to 20 bp, also called micro-deletions and micro-insertions. For each of these causal indels, we mapped the text description of the annotated disease to an OMIM identifier. To ensure high data quality, we require that such a mapping must satisfy at least one of the following criterions: (1) the disease descriptive text in HGMD exactly matched that of the OMIM record; (2) the same causal variants, either in DNA sequence format or rsid, were shared by the HGMD disease and the OMIM record; (3) the same pubmed ID was shared by the HGMD disease and the OMIM record. Although such rules removed many low quality mapping, one-to-many or many-to-many mapping could still occur due to pleiotropic effects of some variants. We therefore focused on only diseases with one-to-one mapping in the HGMD and OMIM databases and discarded indels with multiple mapped OMIM identifiers. With the above rules, we obtained a set of indels with high quality and uniquely mapped OMIM identifier, and these indels could serve as the gold-standard for validating our method. We then retrieved negative samples from two public sequencing project data, including 1000 Genomes Project (KG for short) and Exome Sequencing Project (ESP for short). Here, we followed the strategy used in VEST and retained indels with minor allele frequency (MAF) > 1% in the African American subpopulation. The collected indels were categorized into four subtypes: (1) nonframeshift deletion (ND for short), (2) nonframeshift insertion (NI for short), (3) frameshift deletion (FD for short), and (4) frameshift insertion (FI for short). The detailed summary statistics about data used in the simulation studies were presented in Table 1, which suggested obvious enrichment of frameshift indels in the HGMD dataset, when compared with the KG and ESP datasets (Fisher's exact test, p-values < 2.2 × 10 −16 ). This may partly be attributed to severe consequence induced by frameshift indels since they can disrupt all amino acids after indels while nonframeshift indels only alter several amino acids. The coverage of each functional scores for data used in simulation studies was presented in Table 2, which showed that coverage varied across different datasets and the integration of multiple scores in deed helped to improve the overall coverage.
For each disease-causing indel, we spiked it into the corresponding control dataset of the same subtype and applied our method to prioritize the resulting simulated data set. Here, we conducted simulations for these four subtypes of indels separately, in order to avoid the bias of imbalance due to the obvious difference of the number of each subtype. With integrated p-values supplied by our method, we obtained a rank for each disease-causing indel from the final prioritization list, and derived two metrics for overall performance evaluation. First, we counted the number of disease-causing indels ranked among top 20 and referred to this criterion as TOP. Second, we defined rank ratio of each disease-causing indel by dividing its rank by the number of all indels in the  Table 2. Coverage of each genomic data on the datasets used in the simulation studies. Each entry denotes the coverage rate of each genomic data for each indel subtype in the three datasets.
dataset. The rank ratios of all disease-causing indels were then averaged to get an overall measure, called as MRR. Typically, high TOP and low MRR together indicate good performance. The rank histograms of causal indels were presented in Fig. 2, and the overall ranks of causal indels were obviously skewed towards top position. For example, over 80% of disease-causing indels were ranked in top 5 for all indel subtypes, while the corresponding numbers for random guess would be 0.9%, 1.6%, 1.3% and 2.3% respectively, significantly less than that of our method. We also observed the same trend for causal indels ranked in top 10 and top 20. This demonstrated that our method could effectively uncover the real disease-causing indels in top positions. On the other hand, the MRRs for these four indel subtypes were 1.67%, 1.06%, 3.62% and 2.19%, while MRRs of random guess would be 50%, significantly worse than our method. Both MRR and TOP metrics supported the effectiveness of our method. We also performed a prospective simulation study, in which we extracted HGMD disease-causing indels that were discovered after 2015 from the HGMD database (professional version 2016. (4) We used these indels as cases and performed the same simulation study as before, and the results were shown in Supplementary Table 1. We observed similar performance in terms of MRR and TOP, suggesting our evaluation was unbiased.
Integration helps performance improvement. We next investigated whether integration of multiple genomic data could significantly improve prioritization performance. With the same simulation studies mentioned above, we evaluated the performance of individual genomic data for prioritization of disease-causing indels. The results were shown in Table 3, in which both MRR and TOP were evaluated for each genomic data. Here, we also considered the influence of missing data, and last position was assigned to the causal indel if the corresponding genomic data was missing. Therefore, the final performance of each genomic data relied on both its predictive power and coverage, and genomic data with high coverage and strong predictive power would have good performance.
From Table 3, we clearly observed that our integration method outperformed every individual genomic data alone. Specifically, the MRRs of individual genomic data ranged from 5.55% to 62.01%, 3.44% to 75.12%, 8.69% to 56.43% and 7.08% to 42.38%, while our integration method achieved MRR of 1.67%, 1.06%, 3.62% and 2.19% for the four indel subtypes, respectively. The TOPs of individual genomic data ranged from 7 to 949, 0 to 210, 1 to 6,843 and 0 to 2,956, while our integration method achieved TOP of 1,082, 231, 7,614 and 3,445, respectively. Besides MRR and TOP, we also evaluated rank ROC for performance comparison. Given a threshold for rank ratio, ranging from 0 to 1, we defined the true positive rate (TPR, also called sensitivity) as the proportion of disease-causing indels with rank ratios below the threshold and the false positive rate (FPR, also called 1-specificity) as the proportion of neutral indels with rank ratios below the threshold. By varying the threshold, we obtained a series of TPR and FPR values and plotted TPR against FPR to obtain rank ROC. From Fig. 3, we found that our method had better rank ROC than any individual genomic data.
The gained improvement in prioritization performance could be largely attributed to integration of diverse genomic data. These genomic data could be categorized into three groups in general, the first group of functional prediction scores, including SIFT, CADD, PinPor, DDIG and VEST, the second group of genic association scores, including gobp, gexp, strg and tsfc, and third group of the genic intolerance score RVIS. The functional prediction scores could discriminate between damaging indels and neutral indels, while it could not discriminate causal indels responsible for different diseases, because all damaging indels manifested similar effect, such as altering protein structures and functions. However, given a disease of interest, we only interested in these damaging indels responsible for the given disease. The second group of genic association scores could prioritize genes which may be associated with the disease of interest, and these genes are more likely to harbour disease-causing indels responsible for the given disease. In addition, RVIS provides genic intolerance, and genes with low RVIS scores are more likely to be disease relevant genes. We draw the correlations between these genomic data for the four different indel subtypes (Fig. 4). From this figure, we found that different correlation patterns existed for different indel subtypes, manifested as different hierarchical clustering results. In general, obvious correlations between genic association scores were observed and no correlations between these association scores and the other functional prediction scores. This reflected that association scores and functional prediction scores measured different genomic activities. Thus, these three types of genomic data sources complement with each other, and integration of them can help us prioritize real disease-causing indels.
Comparison of different data fusion strategies. We adopted Fisher's method (Fisher for short) for integration of p-values, which was an old but well-established and widely-used method. We also investigated the other three methods for integration of p-values, named minimal p-value (minP for short), Stouffer's Z-score method (Stouffer for short) and Robust Rank Aggregation 35 (RRA for short). With the same simulation studies mentioned above, we applied these three methods for disease-causing indel prioritization and evaluated the corresponding MRRs and TOPs for comparison. The results were shown in Supplementary Table 2, and we found that Fisher's method was similar with Stouffer's method in terms of MRRs and TOPs, which was not surprising because the two methods had close connections and nearly same statistical power asymptotically. Minimal p-value followed the two methods and RRA exhibited worst performance. It was worth noting that RRA was original designed for rank aggregation and its algorithm was not optimized to perform p-values integration. Another interesting phenomenon was that these methods have similar TOP performance in spite of different MRRs. We also compared rank ROCs of these four methods, as shown in Fig. 5, from which the same conclusion could be draw. Thus, one can pick anyone of them if only TOP performance is cared about and minimal p-value method shows great advantage over the others for its simplicity.
We also compared the robustness of these methods in terms of missing data, and presented the results in Supplementary Figure 1. We used the results generated from above simulation studies and categorized these indels by the number of missing functional scores, ranging from 0 to 6. We then evaluated the rank ratios of each indels in each subgroup and investigated the influence of the number of missing data on performance for each method. We observed increasing rank ratios (decreasing performance) with increasing the number of missing data, for all these methods. This was reasonable because missing data indicated missing information and posed challenge for prioritization. Stouffer's method assigned nearly 1 (meaning last position) to several indels in all subgroups, while the other three methods avoided that. The median rank ratios for all method were near zero  Table 3. Performance of each individual genomic data and the integrated score for the simulation studies. Each entry denotes MRR or TOP of each individual score, and the integrated score achieves the best performance.
when the number of missing data was less than 4 and began to increase obviously after that. Stouffer's method had the best control when the number of missing data was over four, followed by Fisher's method and the other two methods performed worse. Generally speaking, Fisher's method had both best performance and relatively robust controls over missing data, thus it was considered to be the most suitable method for our problem.
Performance for personal exome. The simulation studies presented above only focused on uncovering disease-causing indels from common indels (MAF > 1%), while realistic personal exome harboured both common indels and rare indels (MAF < =1%). We randomly selected an individual (HG00096) exome from KG dataset, and draw the MAF histogram of indels from this individual in Supplementary Figure 2, from which we observed a roughly two-component mixture distribution, including common and rare components. We also investigated the functional prediction scores for the two different components, as shown in Supplementary  Figure 3. There existed some difference between the distribution of functional scores of common and rare indels and rare indels tended to have high pathogenic possibility. Besides individual HG00096, we also randomly selected another nine individuals from different populations as listed in Supplementary Table 3. We assumed no disease-causing indels existed in anyone of these ten individuals owing to the healthy status of them. We then spiked each of HGMD disease-causing indels into each individual exome and prioritized the simulated exomes, which was a more difficult task than before due to the introduction of neutral rare indels. From Supplementary Table 3, we found that the MRRs of causal indels tended to be larger than before and the TOPs were also fewer than before. We compared the rank ratios of each causal indels between the two simulation studies and found it was significantly (Wilcoxon rank sum test, p-value < 2.2 × 10 −16 ) worse in these individual exome data. Despite of difficulties presented in individual exome data, our method achieved relatively satisfactory results, as shown in Supplementary Table 3. Specifically, the average MRRs for the four indel subtypes were 2.71%, 1.99%, 4.88% and 3.35% respectively and the average TOPs were 509, 116, 3,220 and 1,434. The results on these ten individuals were similar, suggesting the robustness of our method against different genetic backgrounds. These results collectively supported our method for effectively prioritizing disease-causing indels in personal exome.

Conclusions and Discussion
In this paper, we proposed an integrative method to prioritize disease-causing indels from exome sequencing data. Our method integrates five functional prediction scores, four genic association scores and a genic intolerance score with a statistical method. Our integration procedure mainly consists of two steps, transforming various genomic data scores into p-values and integrating these p-values with Fisher's method. Our method enjoys several advantages, including simplicity, robustness in terms of missing data, effectiveness in terms of prioritization power and flexibility for further development. We believe that our integration methodology not only be useful for variant analysis, but also can benefit the other genomic studies involving data integration. Despite of being effective, our method can be further improved in several aspects. First, we focus on coding indels and ignore noncoding indels. How to extend our method to be capable of analysing noncoding indels remains a topic of future research. Second, more genomic data are generated and some of them may be proved to be useful for variant prioritization, such as transcriptomic data, epigenomic data, and other phenotypic information etc. It is not difficult to incorporate other genomic data into our method, in which additional genomic data are transformed into p-values with appropriate procedures and these new p-values are integrated with existing p-values to obtain the integrated p-values. Third, we assume the correctness of variant calling in present study and ignore the quality information supplied in VCF files. It is worth noting that correct calling of indels still remains

Materials and Methods
Data sources. We downloaded the HGMD database (professional version 2014.3), KG and ESP databases from corresponding official websites and extracted indels satisfying criterion mentioned in simulation studies. We then extracted five types of functional prediction scores for these indels from corresponding websites: SIFT (http://sift.bii.a-star.edu.sg/), PinPor (http://watson.compbio.iupui.edu/pinpor), CADD (http://cadd.gs.washington.edu/), DDIG: (http://sparks-lab.org/ddig/), VEST (http://www.cravat.us/CRAVAT/). We downloaded 7,719 diseases with text descriptions from the OMIM database (accessed in February 2014) and 20,327 genes from Ensemble database (accessed in March 2014). OMIM ID and Ensemble ID were used to represent disease and gene respectively in our study. With BioMart tool, we also obtained 4,368 associations between 3,709 diseases and 2,870 genes, and mappings between gene Ensemble ID and gene HGNC symbol. We downloaded the RVIS scores from its website (http://genic-intolerance.org/index.jsp).

Transform functional scores into p-values.
For each type of functional scores, we first build an empirical null distribution from either public or private databases, which puts equal probability on each data point. For a query functional score, we calculate the p-value as the proportion of null distribution data points with extremer (e.g. larger) scores than query score  Construction of disease network. To construct disease network, we first collected 10,346 concepts from the HPO database 36 and annotations for 6,346 human diseases, which were used to construct a 10,346 dimensional numeric vector (concept vector) for each disease. Each entry of the concept vector characterizes the information of the disease relating to the corresponding HPO concept, calculated as −log(h i ), where h i is the frequency of concept i occurred in text description of the disease. We calculated the similarity score between every pair of disease as the cosine of the angle between concept vectors of the two diseases. We then obtained a similarity matrix for all human diseases by applying the above method to each pair of disease. In order to remove low confident edges in this network, we only retained 10 neighbouring diseases with highest similarity scores for each disease, leading to a nearest neighbour network, which we referred to as the disease network.
Construction of gene network. We constructed four gene networks with different genomic data, including gene expression (gexp), gene ontology (gobp), protein-protein interaction (strg) and transcriptional regulation (tsfc).
gexp. We obtained a whole-genome gene expression profile 25 , which measured expression levels of 44,775 transcripts across 79 human tissues. Each gene was represented by a 79 dimensional numeric vector (expression vector) with each dimension denoting the expression level in the corresponding human tissue. We obtained the raw similarity scores between pairs of genes by calculating the Pearson's correlation coefficient of their expression vectors and an exponential transformation was applied to remove noise as: where ϕ ab was the transformed similarity score for gene a and b, w ab the raw similarity score and σ the standard deviation of raw similarity scores for all gene pairs. We obtained a gene similarity matrix by applying the above method to each pair of genes and we only retained 100 neighbouring genes of highest similarity for each gene in order to remove low confident edges, leading to a final gene network (gexp).
gobp. We downloaded the gene ontology database 27 (November 22, 2014) and extracted 25,616 concepts associated with biological process domain and used a 25,616 dimensional numeric vector (gene ontology vector) to characterize each gene with each dimension denoting whether the gene had the corresponding concept. We obtained the raw similarity score between each pair of genes by calculating the cosine of the angle between their gene ontology vectors and transformed these raw similarity score with the exponential transformation mentioned above. We finally obtained the gene network (gobp) with the same strategy mentioned above.
strg. We downloaded the STRING database 26 (version 9.1) and extracted 403,514 interactions between 13,747 proteins, leading to a binary gene network. For each pair of proteins (e.g. a and b), we calculated the shortest path distance between them in the PPI network as δ ab and rescaled this score with We also applied the exponential transformation mentioned above to this score to obtain the final similarity score, and constructed the gene network (strg) with the same strategy as above.
tsfc. We extracted 218 vertebrate transcription factors from the TRANSFAC database 28 with high confident position specific scoring matrices and used the program MATCH to identify potential binding sites for each transcription factor within the 1,000 basepairs upstream of each human gene. Based on the matching results, we constructed a 218 numeric vector (binding vector) for each human gene with each dimension denoting the number of potential binding sites of the corresponding transcription factor on the gene. We calculated raw similarity score between each pair of genes as the cosine of the angle between their binding vectors. We then applied the exponential transformation on the raw similarity scores and constructed a gene network (tsfc) using the same strategy mentioned above.
Random walk on two-layered network for disease-gene association inference. We constructed a disease-gene heterogeneous network, consisting of a disease network, a gene network, and known associations between diseases and genes, and simulated the random walk process on this network to infer the association strength of a given disease-gene pair 37 . The disease network was the one constructed as above, which encoded the similarities between diseases, the gene network was one of the four gene networks constructed as above, encoding Scientific RepoRts | 7: 1804 | DOI:10.1038/s41598-017-01834-w similarities between genes, and the known links between diseases and genes were collected from the OMIM databases as mentioned above. For a query disease of interest, a random walker started a series of moves on the disease-gene network with some initial probability p (0) , which encoded the interested disease. At each move, the walker either restarted with probability π or move on with probability 1 − π. If moves on, the walker may switch between disease network and gene network with probability τ or just wander within either disease or gene network with probability 1 − τ. When wandering about within disease or gene network, the walker moved to one of its direct neighbours with probabilities proportional to the similarities between current node and its direct neighbours, e.g. higher probabilities moving to more similar neighbours. Over a series of iterations, the probability distribution of the walker on the network will converge to a steady probability vector p (∞) , providing a measure of the association strength between the query disease and genes.
Mathematically, we used a triple H = (D, G, A) to denote the disease-gene heterogeneous network, where D = {d ij } m×m collected edge weights of the disease network, G = {g ij } n×n collected edge weights of the gene weights, A = {a ij } m×n collected binary indications of disease-gene associations, and m and n the number of diseases and genes, respectively. We normalized each row of D to obtain a transition matrix denoted the transition probability from disease i to disease j. Similarly, we derived the other three transition matrices: ij j n ij 1 denoting the transition probability from gene i to gene ) denoting the transition probability from disease i to gene j, and ) denoting the transition probability from gene i to disease j. We then defined matrix T as and performed row-normalization to obtain overall transition matrix for the heterogeneous network as = ij j m n ij 1 denoting transition probability within the heterogeneous network.
1 denote the initial probabilities for the diseases and genes, respectively. We assigned equal probabilities to the neighbours of the query disease to u (0) and all zeros to v (0) , assuming a completely unknown genetic basis.
denotes the initial probabilities for the heterogeneous network and p (t) denotes the probability after t moves, the interaction can be formulated as Repeating the interaction until p (t) becomes stable (e.g., , where ε is a small positive number), and we obtained the steady-state probability p (∞) , consisting of a disease probability u (∞) and a gene probability v (∞) . The gene probability can be used to infer the association strength between the query disease and genes. We set default values for parameters τ = 0.5, π = 0.5 and ε = 10 −4 according to the literature 37 .
We then transformed the steady-state probabilities to p-values in order to facilitate subsequent integration. We simulated the random walk process for all disease-gene pairs that were not associated and obtained the distribution of steady-state probability under no association. Then we assigned a p-value for a query disease-gene pair as = ≥ -p Pr(scores of non associated disease gene pairs the query score) (6) Here, score referred to the steady-state probability and this p-value quantified the probability of observing stronger association scores under no association (null hypothesis). We excluded the known link between the query disease and the query gene when inferring their association.

Fisher's method for p-values integration.
We have ten different genomic data, leading to ten p-values through computations described as above, and Fisher's method is used to integrate these p-values into an integrated p-value, which quantify the statistical significance of candidate indels causing the query disease. In detail, let p 1 , …, p K denote the p-values to be integrated, where K = 10 in our study, we then define the Fisher's statistic as where V i = −2 log p i . If independence between different sources is assumed, it is obvious to verify that χ U  We estimated the covariance cov(Vk, V j ) with a normal model 38  . In the case of missing data, it is simple to ignore that data source and reduce the degree of freedom.