An analysis of disease-gene relationship from Medline abstracts by DigSee

Diseases develop through abnormal gene behavior in biological events such as gene regulation, mutation, phosphorylation, epigenetic changes, and post-translational modification. Many text mining studies have attempted to identify relationships between genes and diseases by mining the literature, but they did not consider the biological events in which genes show abnormal behavior in response to diseases. In this study, we propose to identify, from Medline abstracts, disease-related genes that are involved in the development of disease through biological events. We identified associations between 13,054 genes and 4,494 disease types, covering more disease-related genes than manually curated databases for all disease types (e.g., Online Mendelian Inheritance in Man) and than those for specific diseases (e.g., Alzheimer’s disease and hypertension). We show that the text mining findings are reliable at PubMed scale, in that the disease-disease relationships inferred from the literature-wide findings are similar to those inferred from manually curated databases in a well-known study. In addition, the literature-wide distribution of biological events across disease types reveals different characteristics of disease types.


Re-evaluation of text mining tools used in DigSee
DigSee used several state-of-the-art text mining tools to identify disease-gene relations. To confirm that their performance holds on Medline abstracts relevant to our study, we re-evaluated TEES and the other text mining tools on 100 sample Medline abstracts from our experiments and found similar results, as follows.

A. Turku event extraction system
DigSee utilizes the Turku event extraction system (TEES) 1 to locate biological events in Medline abstracts. It was reported to achieve a precision of 53.98%, a recall of 52.69%, and an F-measure of 53.33% on the EPI task in BioNLP-ST 2011. 2 We tested TEES on 100 randomly selected Medline abstracts. In the sample abstracts, TEES extracted a total of 1177 event relations, 55.14% of which were identified as correct (679 relations) in both named-entity recognition (NER) and event relation. Among the incorrect relations, 83 were attributed to ABNER errors.

B. ABNER
ABNER, 3 the NER tool for identifying gene and protein mentions, showed a precision of 77.93% on the sample Medline abstracts. ABNER was previously reported to achieve an F-measure of 69.9%. 3 We also compared BANNER 4 and Gimli 5 with ABNER in the DigSee pipeline. On the 100 abstracts, ABNER and BANNER extracted 1,278 and 1,265 gene names, respectively, with similar precision (ABNER: 77.93%; BANNER: 76.44%). Although Gimli's precision was higher (82.57%), it identified fewer gene names (1,033). Based on these empirical results, we used ABNER as the NER tool.
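
One way to read the trade-off between the three NER tools is to estimate how many correct gene names each one returned on the sample. The sketch below assumes that each precision was computed over the gene names the tool extracted, so that correct mentions are approximately extracted × precision. These estimated counts are not reported in the study, and the helper name `estimated_correct` is ours.

```python
# Figures reported for the 100-abstract sample: (gene names extracted, precision in %)
ner_results = {
    "ABNER": (1278, 77.93),
    "BANNER": (1265, 76.44),
    "Gimli": (1033, 82.57),
}

def estimated_correct(extracted, precision_pct):
    """Approximate correct mentions, assuming precision = correct / extracted."""
    return round(extracted * precision_pct / 100)

# Estimated number of correct gene names per tool (an illustration, not a reported result)
correct = {tool: estimated_correct(n, p) for tool, (n, p) in ner_results.items()}

# Tool with the largest estimated number of correct gene names
best = max(correct, key=correct.get)

for tool, n_correct in correct.items():
    print(f"{tool}: about {n_correct} correct gene names")
print("Most correct mentions (estimated):", best)
```

Under this assumption, the higher precision of Gimli does not offset its smaller number of extracted names, which is consistent with the choice of ABNER.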

C. DNorm
We use DNorm 6 for extracting disease mentions and normalizing diverse disease names into standard terms.
In 100 randomly selected sample Medline abstracts, DNorm recognized 778 disease mentions, 76.34% of which were identified as correct in both NER and normalization. DNorm was previously reported to achieve an F-measure of 80.9% on a test set of the NCBI disease corpus. 6

D. tmVar
To locate mutation events, DigSee utilizes tmVar. 7 tmVar was reported to achieve an F-measure of 91.39% (a precision of 91.38% and a recall of 91.40%). On 100 randomly selected sample Medline abstracts, tmVar recognized a total of 726 mutation mentions and achieved a precision of 99.58%.

E. Moara
Moara 8 is a flexible and trainable text mining system for gene/protein tagging and normalization. The system has been trained for several model organisms and corpora, and it can be extended to support new organisms and documents. DigSee uses Moara with its human model to normalize recognized gene mentions, but not to recognize gene and protein mentions. Moara was reported to achieve a precision of 55.00%, a recall of 83.31%, and an F-measure of 66.26%. 8 On the 100 randomly selected Medline abstracts, Moara achieved a precision of 75.10% in the normalization step.
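
The previously reported F-measures quoted in this section for TEES, tmVar, and Moara can be checked against their precision and recall. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall); the function name `f_measure` is ours.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Previously reported (precision, recall, F-measure) in %, as quoted in the text
reported = {
    "TEES": (53.98, 52.69, 53.33),
    "tmVar": (91.38, 91.40, 91.39),
    "Moara": (55.00, 83.31, 66.26),
}

# Recompute each F-measure from its precision and recall
recomputed = {tool: round(f_measure(p, r), 2) for tool, (p, r, _) in reported.items()}

for tool, (_, _, f_reported) in reported.items():
    print(f"{tool}: recomputed {recomputed[tool]:.2f}, reported {f_reported:.2f}")
```

The sample-based re-evaluations report precision only, so no F-measure can be recomputed for them this way.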

A ranking method for evidence sentences
To distinguish positive sentences, which support the triplet relationship of gene, disease, and biological event, from negative sentences, which do not describe the relationship, we previously developed a Bayesian model based on ten linguistically motivated features constructed from the feature selection sentences: 9 event score, edge score, gene-event distance, event-regulation distance, event-disease distance, event depth, cancer keywords count, hallmark keywords count, negative score, and agent. These features were obtained from ABNER 3 and the Turku event extraction system, 1 dependency parse trees generated by the Stanford parser, hand-crafted disease-related terms, and terms related to negative sentences.
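
The paragraph that follows describes a likelihood ratio over these features under equal priors, first with full conditional independence and then with two feature pairs treated as dependent. A general form of such a ratio can be sketched as below. The notation is ours, not the study's: s is a sentence, + and − are the positive and negative classes, and the subscripts ck, ed, ag, and hk stand for cancer keywords count, event-disease (event-cancer) distance, agent, and hallmark keywords count. How the probabilities were estimated is not specified here.

```latex
% Naive Bayes form: the ten features f_1, ..., f_{10} are conditionally independent
\mathrm{LR}_{\mathrm{naive}}(s)
  = \frac{P(f_1, \dots, f_{10} \mid +)}{P(f_1, \dots, f_{10} \mid -)}
  = \prod_{i=1}^{10} \frac{P(f_i \mid +)}{P(f_i \mid -)}

% Two dependent pairs modeled jointly; R is the set of the six remaining features
\mathrm{LR}_{\mathrm{dep}}(s)
  = \frac{P(f_{\mathrm{ck}}, f_{\mathrm{ed}} \mid +)}{P(f_{\mathrm{ck}}, f_{\mathrm{ed}} \mid -)}
    \cdot \frac{P(f_{\mathrm{ag}}, f_{\mathrm{hk}} \mid +)}{P(f_{\mathrm{ag}}, f_{\mathrm{hk}} \mid -)}
    \cdot \prod_{i \in R} \frac{P(f_i \mid +)}{P(f_i \mid -)}
```

With equal priors for the two classes, the posterior odds of a sentence being positive equal this likelihood ratio, so sentences can be ranked by it directly.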
The Bayesian classifier built on these features was designed to separate positive evidence sentences from negative ones. Assigning the same prior to positive and negative evidence sentences, we calculated a likelihood ratio of the features. A naïve Bayesian classifier assumes conditional independence among features. However, after analyzing the feature selection data set, we empirically chose to model two dependencies: one between cancer keywords count and event-cancer distance, and one between agent and hallmark keywords count. The likelihood ratio is therefore rewritten with each of these two feature pairs modeled jointly rather than independently.

Among the features, "cancer keywords count" and "hallmark keywords count" were based on hand-crafted disease-related terms, which makes them disease-dependent. To apply these two features to all diseases, we developed a method of collecting disease-related terms using Word2Vec. 10 Word2Vec computes continuous vector representations of words with neural networks, and the resulting word vectors can be used for certain kinds of inference. First, we computed word vectors using all sentences in disease-related Medline abstracts. The ten words most similar to each disease name in the vector space were selected as disease keywords. In addition, we identified hallmarks of a given disease (e.g., hypertension) by comparison with known pairs of cancers and their hallmarks (e.g., proliferation). In particular, we used vector operations such as vector("cancer") − vector("proliferation") + vector("hypertension") to predict terms related to the given disease. For example, the term "blood pressure" was found to be the top hallmark of hypertension using this vector operation.
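
The two Word2Vec steps, nearest-neighbor keyword selection and the offset query for hallmarks, can be illustrated with a small sketch. Everything in it is hypothetical: the three-dimensional vectors are made up for illustration and do not come from a trained model, and the helper names `cosine` and `nearest` are ours. The offset query is written in the conventional analogy form, vector(hallmark) − vector(cancer) + vector(disease), which is an assumption about operand order on our part.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(query, vectors, exclude=(), k=10):
    """Return up to k vocabulary words ranked by cosine similarity to the query vector."""
    scored = [(cosine(query, vec), word)
              for word, vec in vectors.items() if word not in exclude]
    return [word for _, word in sorted(scored, reverse=True)[:k]]

# Hypothetical toy vectors; real ones would be learned from Medline sentences.
vectors = {
    "cancer":         [1.0, 1.0, 0.0],
    "proliferation":  [0.0, 1.0, 0.0],
    "hypertension":   [1.0, 0.0, 1.0],
    "blood_pressure": [0.0, 0.0, 1.0],
    "tumor":          [0.9, 0.9, 0.1],
}

# Step 1: disease keywords are the words most similar to the disease name
# (k=2 here because the toy vocabulary is tiny; the study used the top ten).
keywords = nearest(vectors["hypertension"], vectors, exclude={"hypertension"}, k=2)

# Step 2: hallmark candidates from the offset query (assumed operand order).
query = [h - c + d for h, c, d in zip(vectors["proliferation"],
                                      vectors["cancer"],
                                      vectors["hypertension"])]
hallmarks = nearest(query, vectors,
                    exclude={"proliferation", "cancer", "hypertension"}, k=1)

print("keywords:", keywords)
print("top hallmark candidate:", hallmarks[0])
```

Excluding the query words from the candidates is a common convention for analogy queries, since the inputs themselves tend to lie closest to the resulting vector.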

Additional methods for disease-related gene ranking
In addition to the five measures introduced in "Improving disease-related gene ranking" in the Method section, we augmented the fourth and fifth measures with biological events to see whether particular biological events affect gene ranking, generating four additional rankings. First, for each disease, the ratios of biological events were calculated, and the summation and the average of normalized scores were multiplied by the event ratio (6th and 7th rankings). Second, for each gene, event ratios were calculated, and the summation and the average of normalized scores were multiplied by the event ratio of the gene (8th and 9th rankings) (Supplementary Figure 1).
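
One possible reading of these event-weighted rankings is sketched below. It rests on assumptions that are not stated in this section: each evidence item carries a gene, an event type, and a normalized score; an event ratio is the fraction of evidence items with that event type; and each score is multiplied by the ratio of its own event type before summing or averaging per gene. The function names and the toy evidence are hypothetical.

```python
from collections import Counter, defaultdict

def event_ratios(event_types):
    """Fraction of each event type among the given event mentions."""
    counts = Counter(event_types)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}

def event_weighted_scores(evidence, per_gene=False):
    """evidence: list of (gene, event_type, normalized_score) for one disease.

    Returns {gene: (weighted_sum, weighted_average)}. Ratios are computed over
    the whole disease by default, or over each gene's own evidence if per_gene.
    """
    by_gene = defaultdict(list)
    for gene, event, score in evidence:
        by_gene[gene].append((event, score))
    disease_ratios = event_ratios([event for _, event, _ in evidence])
    result = {}
    for gene, items in by_gene.items():
        ratios = event_ratios([e for e, _ in items]) if per_gene else disease_ratios
        weighted = [score * ratios[event] for event, score in items]
        result[gene] = (sum(weighted), sum(weighted) / len(weighted))
    return result

# Hypothetical evidence for a single disease
evidence = [
    ("GENE_A", "mutation", 0.8),
    ("GENE_A", "mutation", 0.5),
    ("GENE_B", "phosphorylation", 0.75),
    ("GENE_B", "mutation", 0.5),
]

disease_level = event_weighted_scores(evidence)               # cf. 6th and 7th rankings
gene_level = event_weighted_scores(evidence, per_gene=True)   # cf. 8th and 9th rankings

for gene in sorted(disease_level, key=lambda g: disease_level[g][0], reverse=True):
    print(gene, disease_level[gene], gene_level[gene])
```

Under this reading, disease-level ratios favor genes whose evidence involves the events most common for that disease, while gene-level ratios favor genes whose evidence concentrates on a single event type.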