PALM-IST: Pathway Assembly from Literature Mining - an Information Search Tool

Manual curation of biomedical literature has become an extremely tedious process due to its exponential growth in recent years. To extract meaningful information from such large and unstructured text, newer and more efficient mining tools are required. Here, we introduce PALM-IST, a computational platform that not only allows users to explore biomedical abstracts using keyword-based text mining but also extracts biological entity (e.g., gene/protein, drug, disease, biological process, cellular component, etc.) information from the retrieved text and subsequently mines various databases to provide their comprehensive inter-relations (e.g., interaction, expression, etc.). PALM-IST constructs protein interaction network and pathway information relevant to the text search using multiple data mining tools and assembles them to create a meta-interaction network. It also analyzes scientific collaboration by extracting and creating a “co-authorship network” for a given search context. Hence, this combination of literature and data mining provided in PALM-IST can be used to extract novel protein-protein interactions (PPIs), to generate meta-pathways and, further, to identify key crosstalk and bottleneck proteins. PALM-IST is available at www.hpppi.iicb.res.in/ctm.


Results
Input and output features of PALM-IST server. Input option. A two-tier keyword (primary and secondary) based search engine is introduced in the PALM-IST server. Topic (e.g., glioblastoma or haemophilia) and author (e.g., Weinberg RA) based searches can be performed simultaneously and/or separately. All primary keywords, along with their synonyms and acronyms, are searched in NCBI PubMed 4 through NCBI Eutils, combined with the Boolean operators AND (specified by the ", " symbol) or OR (specified by the "|" symbol). Abstracts retrieved from the primary keyword based search are further sorted for all possible combinations of secondary keywords (separated by new lines) along with their synonyms. Certain keywords can also be excluded through the PALM-IST input option.
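The input syntax described above can be illustrated with a minimal Python sketch. It is not the PALM-IST implementation (the server is CGI-Perl based): the function names are hypothetical, synonym expansion and keyword exclusion are omitted, and the assumption that OR binds more tightly than AND is ours. Only the public NCBI E-utilities ESearch endpoint and its `db`, `term`, `retmax` and `retmode` parameters are taken from the NCBI service itself.

```python
from typing import List
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def primary_to_pubmed_term(primary: str) -> str:
    """Translate 'a|b, c' into '("a" OR "b") AND ("c")'.

    Assumption: ',' separates AND groups and '|' separates OR alternatives
    inside a group (OR evaluated before AND).
    """
    and_groups = []
    for group in primary.split(","):
        alternatives = [word.strip() for word in group.split("|") if word.strip()]
        if alternatives:
            joined = " OR ".join('"{}"'.format(word) for word in alternatives)
            and_groups.append("(" + joined + ")")
    return " AND ".join(and_groups)


def parse_secondary(secondary: str) -> List[List[str]]:
    """Split newline-separated secondary keywords; '|' gives OR alternatives."""
    parsed = []
    for line in secondary.splitlines():
        alternatives = [word.strip() for word in line.split("|") if word.strip()]
        if alternatives:
            parsed.append(alternatives)
    return parsed


def build_esearch_url(primary: str, retmax: int = 100) -> str:
    """Build (but do not send) an ESearch request for the primary keywords."""
    params = {
        "db": "pubmed",
        "term": primary_to_pubmed_term(primary),
        "retmax": retmax,
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)
```

Keeping URL construction separate from the network call makes the query translation easy to check offline before any request is sent to NCBI.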
Output option. The output of the PALM-IST server can be subdivided into five parts. The following section briefly describes each output using an example keyword (primary and secondary) based search. Fig. 1 provides a general overview of the output options of the PALM-IST server. All the output numbers provided in this manuscript are derived from a PubMed search performed in December 2014.
Abstract result. Abstracts retrieved for the primary keyword based search are displayed with the bio-entity words (genes/proteins, drugs, diseases and biological processes) and relation terms (e.g., modulate, elevate, etc.) highlighted. PALM-IST provides a unique option of simultaneous literature mining using multiple secondary keywords, where abstracts are sorted for all possible combinations of secondary query words. In an example literature search, 63992 articles containing abstracts are retrieved with primary keywords "Glioblastoma|Glioma|Brain tumor|Brain cancer" ("|" denotes the OR gate) and secondary keywords such as "EGFR<new line>TP53<new line>Erlotinib|Gefitinib". Results for single secondary keywords and for their combinations are mutually exclusive. Synonyms of genes for secondary keywords (e.g., p53, TP53, tumor protein p53) are automatically used in the abstract search. For example, out of 63992 abstracts, only 2 abstracts were found in which all three secondary keywords or their synonyms are present. Although this could easily be done using the advanced keyword search in PubMed, retrieving abstracts for the other combinations, such as EGFR and Erlotinib|Gefitinib or TP53 and EGFR, would require separate PubMed searches. In PALM-IST, abstracts for all the combinations can be retrieved in a single search. This becomes a really useful option when a large number of secondary keywords need to be searched. Results for secondary keywords can be used to write a summary of the abstracts, which in principle can be utilized for refined and curated text search.
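The mutually exclusive sorting of abstracts by secondary keyword combination can be sketched as follows. This is a hedged illustration only: `sort_by_secondary` is a hypothetical name, matching is done by naive case-insensitive substring search, and the synonym lists are supplied by the caller, whereas PALM-IST relies on its own indexed synonym tables.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def sort_by_secondary(
    abstracts: Dict[int, str],
    secondary_groups: Dict[str, List[str]],
) -> Dict[Tuple[str, ...], List[int]]:
    """Assign each abstract to exactly one combination of secondary keywords.

    abstracts: abstract id -> abstract text.
    secondary_groups: keyword label -> list of synonyms/alternatives.
    The key of the result is the tuple of all labels found in the abstract,
    so the buckets are mutually exclusive by construction.
    """
    buckets = defaultdict(list)
    for abstract_id, text in abstracts.items():
        lowered = text.lower()
        found = tuple(
            label
            for label, synonyms in secondary_groups.items()
            if any(synonym.lower() in lowered for synonym in synonyms)
        )
        if found:  # abstracts matching no secondary keyword are left out
            buckets[found].append(abstract_id)
    return dict(buckets)
```

Because every abstract is keyed by the full set of labels it contains, one pass over the primary result yields all single-keyword and combined buckets at once, which is the behaviour that would otherwise need one PubMed search per combination.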
Gene result. This section of the PALM-IST server deals with entity recognition of genes/proteins and tagging of abstracts containing the gene/protein names or their synonyms. It provides a list of human genes/proteins that are frequently found in the abstracts yielded by the primary and secondary keyword based search. Protein-protein interactions (PPIs) with two tiers of interaction (1st level and 2nd level interacting proteins) for each observed protein are displayed with subcellular compartmentalization. Biological pathways in which each observed protein is involved are shown in a network display where crosstalk proteins (proteins that connect multiple pathways) are identified and emphasized. Biological pathways (signaling and/or metabolic) of the observed proteins and their PPI network are overlaid to provide a meta-interaction network of cellular systems. Secondary information including gene summary, gene loci, Swiss-Prot/Ensembl code, three-dimensional structure, and single nucleotide polymorphisms (SNPs) for each observed gene is provided on mouse click. Molecular expression data of the listed genes are provided, where users can find up- and down-regulation patterns of those genes in numerous experimental conditions 32. The molecular expression profile of the identified genes/proteins is also overlaid onto the assembled pathway network using user-defined expression context and datasets. Figure S1 provides an example of expression mapping onto the p53 signaling pathway overlaid with protein-protein interaction information.
Scientific Reports | 5:10021 | DOI: 10.1038/srep10021
In this section, options are also provided for users to merge the PPIs and pathways of multiple proteins (maximum 5) to create a meta-interaction network. This broadens the scope of visualizing large interaction and pathway information data within a single visualization window. Co-occurrence based connections between multiple genes can be extracted within this section of the PALM-IST server. Author and co-authorship networks extracted from the publications in which the particular protein and the primary keywords co-occur are also generated.
Drugs and Disease result. Abstracts yielded by primary and secondary keywords based search are scanned and sorted based on the presence of 925 approved drugs and 3813 disease names. Biological pathways (signaling and/or metabolic) related to these drugs and diseases are presented. Similar to the gene/protein section, co-occurrence based connections between multiple drugs and diseases can be extracted within this section of the PALM-IST server. Names of the experts or authors who are frequently publishing scientific papers related to the drugs and diseases observed within the searched abstracts are presented in tabular and network display.

Co-occurrence based interaction result.
In this section PALM-IST offers text-based co-occurrence of genes, drugs, diseases, and biological processes. Triad combinations of gene-drug-disease and gene-disease-process are extracted, and the most frequent combinations are provided in tabular and network display options. For example, for the abstracts obtained with Glioblastoma OR Glioma OR Brain tumor OR Brain cancer as primary keywords, the most frequent gene-drug-disease triads are found to be MGMT-Temozolomide-Glioma (534 abstracts), VEGF-Bevacizumab-Glioma (156 abstracts), EGFR-Erlotinib-Glioma (99 abstracts), etc. Similarly, the most frequent gene-disease-process triad combinations are EGFR-Glioma|Glioblastoma-Growth (1301 abstracts), VEGFA-Glioma|Glioblastoma-Growth (800 abstracts), MGMT-Glioma|Glioblastoma-Methylation (609 abstracts), AKT1-Glioma|Glioblastoma-Signaling (598 abstracts), VEGFA-Neoplasms-Angiogenesis (377 abstracts), etc. In addition to the triads, various pairwise combinations of genes, drugs, diseases, and biological processes are also extracted from the searched abstracts based on their co-occurrence. The observation of MGMT (O-6-methylguanine-DNA methyltransferase), Temozolomide and methylation as the most frequently observed (515 abstracts) gene-drug-biological process triad is noteworthy and can act as a proof of concept for the discovery of new associations between genes, drugs, diseases, and biological processes. The strong co-occurrence of MGMT, Temozolomide and methylation extracted by PALM-IST indicates their crucial association with Glioma. This is indeed the case, as some tumors become sensitive to Temozolomide via epigenetic silencing of the MGMT/AGT gene 33. Similarly, brain tumors with MGMT protein show little response to Temozolomide 34.
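Triad extraction of this kind can be sketched in a few lines of Python. The sketch assumes that each abstract has already been tagged with its bio-entities (the tagging itself is done by the NER tools named in Methods); the input layout, the function name `count_triads` and the toy data in the usage note are illustrative assumptions, not PALM-IST internals.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, List, Tuple


def count_triads(
    tagged_abstracts: Iterable[Dict[str, List[str]]],
    first: str = "gene",
    second: str = "drug",
    third: str = "disease",
) -> Counter:
    """Count in how many abstracts each (first, second, third) triad co-occurs.

    Each abstract is a dict mapping an entity type to the entities tagged in
    it. Every triad is counted at most once per abstract.
    """
    counts: Counter = Counter()
    for tags in tagged_abstracts:
        firsts = sorted(set(tags.get(first, [])))
        seconds = sorted(set(tags.get(second, [])))
        thirds = sorted(set(tags.get(third, [])))
        # product() yields nothing if any of the three entity types is absent
        for triad in product(firsts, seconds, thirds):
            counts[triad] += 1
    return counts
```

Calling `count_triads(tagged).most_common(10)` would then give a ranked triad table, and passing `second="disease", third="process"` reuses the same routine for gene-disease-process triads.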
Author's network results. Author statistics and network is an interesting feature of the PALM-IST server. It provides detailed country-wise publication statistics represented in tabular and interactive global map formats. Similarly, the most frequent authors and their co-authoring relationships for a given literature search are provided in a network-based display using the Cytoscape Web 35 applet. For the abstracts obtained with Glioblastoma OR Glioma OR Brain tumor OR Brain cancer as primary keywords, most papers are published from the United States of America (20945 papers), while the most frequent authors and co-authors are Darell D Bigner from Duke University and Henry S Friedman from Duke University Neurosurgery Division, who have been renowned experts in the Glioma field for the last few decades. These author and co-author networks not only provide an idea about the experts of the field but are also quite useful in revealing many interesting features of academic communities 36 and are helpful in generating new and valuable information relevant to the strategic planning, implementation and monitoring of scientific policies and programs 37,38. Disambiguation of author names is an important but challenging pre-processing step in literature mining. However, pre-processing and disambiguating all author names is beyond the scope of this paper. We have used the author names and initials provided by PubMed 4.
PALM-IST statistics for multiple types of diseases as query keywords. In addition to the above-mentioned example primary keyword (i.e., "Glioblastoma|Glioma|Brain tumor|Brain cancer"), various other disease names were used as query keywords. These diseases were grouped into four categories: a) metabolic disease, b) cancer, c) infectious disease and d) other diseases. Table S1 outlines the total number of abstracts, genes/proteins, drugs, PPIs, pathways, crosstalk proteins, signaling-metabolic common proteins, and co-author statistics extracted for these diseases when used as query keywords in the PALM-IST server.
PALM-IST report. In addition to the web-based interactive display, a summarized report containing associated genes, drugs, disease and authors is generated and sent through email on user's request for their respective input query. This report contains number and list of the extracted abstracts, protein-protein interaction, cross-talk proteins, frequent authors, associated drugs, diseases, pathways and genes/proteins for a given keyword search.
Validation and Benchmark. Table 1 outlines a qualitative comparison highlighting various features of the PALM-IST server with respect to other freely available tools.
We validated the performance of the bio-entity recognition component of PALM-IST using various gold standard corpuses 39 (GSC) such as the BioCreative corpus 39, NCBI Disease corpus 40, CHEMDNER corpus 41 (BioCreative task IV), Arizona disease corpus (AZDC) 42, etc. Table 2 provides the performance measures for the bio-entity validations (see Methods and supplementary file 1 for details). Programs shaded in grey in Table 2 are used in PALM-IST.
Performance of the gene name recognition component of the PALM-IST server, aided by GeneTUKit 43, was compared with that of two other programs, namely BANNER 44 and Abgene 11, using the standard BioCreative task II 45 gene mention corpus. The F-measure of the PALM-IST gene name recognition component, calculated from the precision and recall values, was found to be higher than those of the two above-mentioned programs. Similarly, the gene normalization component of the PALM-IST server, aided by the GenNorm 46 software, was benchmarked against the BioCreative task III 47 and task II 48 corpuses. The performance of the PALM-IST gene normalization component was observed to be higher than that of GNAT 49 for all-species gene normalization (BioCreative task III) and higher than that of Moara 50 for human gene normalization (BioCreative task II). However, it must be noted that for human gene normalization, GNAT 49 outperforms the PALM-IST component.
Disease name recognition, aided by the DNorm 51 software, was benchmarked against the NCBI and Arizona disease corpuses 40,42. In both cases the PALM-IST embedded component (i.e., DNorm) outperformed the MetaMap 52 package. Similarly, the Pubtator 53 based chemical/drug name recognition component also provides better performance than that achieved by the Whatizit 9 package (Table 2).
The Comparative Toxicogenomics Database 54 (CTD) includes curated data describing associations between genes/drugs/pathways and various environmentally influenced diseases. Here, we have validated the accuracy of the gene/drug/pathway associations suggested by the PALM-IST text and data mining components using the CTD-enlisted disease MeSH terms as query keywords. The top 10 and top 20 genes/drugs/pathways, ranked by occurrence, for each disease keyword search were compared against the CTD-enlisted associations. Table 3 provides the percentage of identical genes/drugs/pathways between the associations yielded by the PALM-IST keyword search and the CTD-enlisted disease-gene/drug/pathway associations.
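A top-k comparison of this kind can be sketched as below. The function name `percent_overlap` is hypothetical, and two choices are assumptions rather than details given in the text: names are compared case-insensitively, and the percentage is taken relative to the number of ranked items actually examined.

```python
from typing import Iterable, Sequence


def percent_overlap(ranked_predictions: Sequence[str], curated: Iterable[str], k: int) -> float:
    """Percentage of the top-k ranked entities that also appear in a curated set.

    ranked_predictions: entities ordered by decreasing occurrence.
    curated: reference associations for the same disease (e.g., from CTD).
    """
    top_k = list(ranked_predictions[:k])
    if not top_k:
        return 0.0
    curated_names = {name.lower() for name in curated}
    matches = sum(1 for name in top_k if name.lower() in curated_names)
    return 100.0 * matches / len(top_k)
```

Running it once with `k=10` and once with `k=20` for every disease keyword would give a table of the same shape as the comparison described above.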
The co-authorship network and network-derived features were verified against published networks provided in the PubNet 55 server. PubNet co-author networks were re-created by PALM-IST using the same query keyword/author based searches (Figure S2). Further, the networks and their features were compared (Table S2 in supplementary file 1) to show their similarities, which indirectly supports the reliability of the PALM-IST co-author networks.

Conclusion
Biomedical literature scanning is critical to understand the large amount of data generated in experiments and to retrieve novel information from them. PALM-IST constructs and assembles protein network and pathway information relevant to the genes/proteins frequently observed within the searched text. Hence, PALM-IST can become an important platform to aid large-scale systems biology based research where multiple genes/proteins and pathways are required to be examined simultaneously for better understanding of the cellular complexity. A key challenge in cell biology is to understand the interconnectivity between its biochemical pathways with respect to extracellular signals. Hence, the assembled/interconnected network (supra-network) constructed via PALM-IST applications can help in generating new hypotheses and in discovering emergent properties of biological systems.
Table 1. Qualitative comparison of the server/database features. A: PPI with sub-cellular localization, B: PPI with network analysis, C: PPI with second level of interactors.

Methods
Methodology and Architecture of the server. PALM-IST is developed on CGI-PERL based web architecture. Figure 2 shows a schematic representation of the workflow of the PALM-IST methodology and architecture. The following section briefly describes various features of the server. Input/Query. Multiple keywords of varied nature, such as genes, diseases, drugs, author names or any other word(s), can be provided as primary keyword input. Similarly, secondary keywords can be mined in all possible combinations on abstracts retrieved from the primary keyword based search.
Collection of bio-entity information. Information regarding genes, diseases, drugs, pathways, interactions, and expression data was collected from various well-established resources and is utilized within the PALM-IST server (the complete list of resources and the MySQL indexed table sizes with relevant details are provided in Tables S3 and S4 of supplementary file 1). Information regarding 15.5 million gene and 11 million taxonomic entries was collected from the NCBI resources and further processed for indexing. Indexed genes/proteins were mapped onto 1.23 million cellular pathways collected from the Kyoto Encyclopedia of Genes and Genomes 29 (KEGG) database and almost 40,000 Gene Ontologies collected from the GO database 56. Protein-protein interaction information was collected from the STRING database 57 and additional information regarding the 23184 human genes was extracted from the Genecards 58 resources. Gene expression data were collected from the Gene Expression Omnibus 59 (GEO), while drug-gene/drug-disease association information was extracted from the Comparative Toxicogenomics Database 54 (CTD) and the DrugBank 60 database.
Indexing and Scoring. Till November 2014, 14361661 PubMed abstracts were indexed and processed in PALM-IST. Newer abstracts are downloaded and added at regular intervals. The following section briefly describes the hypergeometric and co-occurrence scoring between two bio-entities.
A hypergeometric test 61 was used to estimate the likelihood of observing a bio-entity by chance within a given text. Genes/proteins, drugs, and diseases observed for a given text search are ranked based on the number of publications retrieved with the gene/drug/disease among the total number of publications linked to the gene/drug/disease. The hypergeometric test and score for the bio-entity were calculated using the contingency table (Table 4) and the following formula:

$$P = \frac{\binom{X}{A}\,\binom{Y}{C}}{\binom{N}{Z}} \qquad (1)$$

For a given primary query term(s) (e.g., Glioblastoma or Glioma) based text search, A is the number of publications that involve an observed bio-entity (e.g., TP53 or Gefitinib), whereas B is the number of publications that do not contain that particular bio-entity term. Similarly, C is the number of publications containing the bio-entity term (e.g., TP53) but not the query term(s), while D is the number of publications that contain neither the particular bio-entity (e.g., TP53) nor the query term(s), but contain another bio-entity (e.g., protein) name. Y denotes the number of publications in which at least one bio-entity term is found (for example, articles containing at least one gene or one drug) but not the query term(s), i.e., Y = C + D. Correspondingly, the remaining marginal totals are X = A + B, Z = A + C and N = X + Y. The terms $\binom{X}{A}$, $\binom{Y}{C}$, and $\binom{N}{Z}$ are binomial coefficients [which can also be represented as Eq. (2)]:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!} \qquad (2)$$

To handle very small P values, we have utilized a log 2 conversion followed by division by a constant number (100):

$$\mathrm{Score} = \frac{-\log_2 P}{100} \qquad (3)$$

The higher the Score, the better the significance of the association between the bio-entity and the query term(s).
The score in Eq. (4) for the co-occurrence based relation between two entities is calculated using the mutual information 62 (MI). MI relates the joint probability of two items occurring [p(X,Y)] to the probability of their independent occurrence [p(X) . p(Y)]:

$$\mathrm{MI}(X,Y) = \log \frac{p(X,Y)}{p(X)\,p(Y)} \qquad (4)$$

The higher the MI value, the greater the confidence in hypothesizing the co-occurrence. Here the probabilities are estimated from abstract counts, and N is the number of abstracts in the query result (for query context) or in the complete PubMed database (for global context).
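The two scores can be made concrete with a short Python sketch. It rests on stated assumptions: the surviving binomial terms are read as the hypergeometric point probability C(X,A)·C(Y,C)/C(N,Z) with X = A + B, Y = C + D, Z = A + C and N = X + Y; the score is taken as −log2(P)/100 so that a higher score means a stronger association; and a base-2 logarithm is used for MI, which the text does not specify. The function names are hypothetical, and the computation is done in log space with `math.lgamma` so that PubMed-scale counts do not overflow or underflow.

```python
from math import exp, lgamma, log, log2


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_hypergeometric_p(a: int, b: int, c: int, d: int) -> float:
    """Natural log of P = C(X,A) * C(Y,C) / C(N,Z) for a 2x2 contingency table.

    a: query term and bio-entity both present; b: query term only;
    c: bio-entity only; d: neither (but some other bio-entity present).
    """
    x, y, z = a + b, c + d, a + c
    n = x + y
    return log_comb(x, a) + log_comb(y, c) - log_comb(n, z)


def entity_score(a: int, b: int, c: int, d: int) -> float:
    """Assumed score: -log2(P) / 100 (higher means more significant)."""
    return -(log_hypergeometric_p(a, b, c, d) / log(2)) / 100


def mutual_information(n_x: int, n_y: int, n_xy: int, n: int) -> float:
    """Pointwise mutual information from abstract counts (log base 2 assumed).

    n_x, n_y: abstracts containing X and Y; n_xy: abstracts containing both;
    n: abstracts in the query result or in the whole database.
    """
    return log2((n_xy / n) / ((n_x / n) * (n_y / n)))
```

Passing the size of the query result as `n` would correspond to the query context score, and passing the size of the full indexed database to the global context score.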

Named Entity Recognition (NER).
Biological entity recognition (BER) is a part of named entity recognition (NER) where textual data is mined to identify relevant biological entities 63-65 (e.g., genes, proteins, drugs, diseases, etc.) to facilitate their functional classification 66. In PALM-IST we have used two open-source, widely used programs, GeneTUKit 43 and GenNorm 46, for gene name recognition and normalization, respectively. Similarly, DNorm 51 was used for disease recognition and a dictionary-based lookup approach implemented in Pubtator 53 was utilized for chemical/drug name recognition. Short descriptions of the methodology of these programs are provided in supplementary file 1.

Information Extraction (IE). PALM-IST extracts relations based on co-occurrence of bio-entities and presents them in tabular and interactive network visualizations, aiding the understanding of relationships between gene-gene, protein-protein, gene-disease-drug, disease-drug, gene-drug, gene-process-disease and gene-process. Indexing of the association table is performed based on common publications containing two/three bio-entities. These associations are solely based on co-occurrence at the abstract level. This approach is based on the assumption that co-occurrence of multiple biomedical concepts in the same abstract is an indication of a functional link between those bio-entities.
Network construction and visualization. Protein-protein interaction (PPI) data were collected from the STRING 57 database. For each protein, up to two levels of interaction (interactors of the direct interactors) were considered. The interactions were further divided into two classes, high (≥ 0.7) and medium (≥ 0.4) confidence, based on the STRING confidence score. Each protein is tagged with its subcellular localization, and the protein-protein interaction networks (PPIN) are displayed with subcellular compartmentalization to aid the visualization of interactions (Figure S3). PALM-IST combines biological pathway information with protein-protein interaction data by overlaying the pathway (directed) with the PPI (undirected) network using the Cytoscape web 35 application. Similarly, PPIs and biological pathways associated with multiple proteins can be merged and assembled in PALM-IST. The molecular expression profile of the identified genes/proteins can also be overlaid onto the assembled pathway network using user-defined expression context and datasets. Details regarding the workflow of expression mapping onto pathways and differential expression calculation can be found in supplementary file 1 (Figure S4). In addition to these pathway assembly features, the co-occurrence network of multiple genes, drugs, diseases and biological processes can be visualized in the PALM-IST server via the Cytoscape web 35 application. Associations between pairs of genes, drugs, diseases and processes are identified based on their co-occurrence in the abstract. Two types of pairwise co-occurrence scores are calculated: a query context score and a global context score. The query context score depicts the significance of the co-occurrence within the abstracts containing the query term, whereas the global context score expresses it with respect to the complete database size (see Indexing and Scoring section for details).
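The confidence classes and the two-level neighbourhood can be sketched as follows. Several points are assumptions: scores are taken on a 0 to 1 scale as quoted in the text (scores distributed on a 0 to 1000 scale would first need dividing by 1000), the medium class is read as 0.4 ≤ score < 0.7, interactions below 0.4 are discarded, and the function names and edge-list layout are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def confidence_class(score: float) -> Optional[str]:
    """Map a confidence score in [0, 1] to 'high', 'medium' or None."""
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return None


def two_level_interactors(
    edges: Iterable[Tuple[str, str, float]],
    protein: str,
) -> Tuple[Set[str], Set[str]]:
    """Return (first-level, second-level) interactors of a protein.

    edges: undirected interactions as (protein_a, protein_b, score).
    Second-level interactors are partners of the direct interactors that are
    neither the query protein nor direct interactors themselves.
    """
    neighbours: Dict[str, Set[str]] = {}
    for a, b, score in edges:
        if confidence_class(score) is None:
            continue  # drop interactions below medium confidence
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    first = set(neighbours.get(protein, set()))
    second: Set[str] = set()
    for partner in first:
        second |= neighbours.get(partner, set())
    second -= first | {protein}
    return first, second
```

Keeping the two levels as separate sets makes it straightforward to style them differently in a network display, for example by node shape or colour.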
Triad combinations of gene-drug-disease and gene-disease-process are extracted based on abstracts containing three bio-entities, and the most frequent combinations are provided in tabular and network display options. The triad combination score is calculated based on the hypergeometric test (see Indexing and Scoring section for details). The co-occurrence network can be visualized using the Cytoscape web display, where nodes are bio-entities and edges represent the abstracts connecting the corresponding bio-entities. Bio-entities are color coded according to their types, and edge width is set on the basis of the number of abstracts in which those bio-entities co-occur.
Author's statistics and co-authorship network. MySQL-based indexing is used to extract country statistics and author and co-author names. The Google Maps API is used to display country-wise publication information on a world map. Further, the most frequent authors for a given text search are also extracted, and co-authorship networks of the most frequent authors are provided within a network-based display using the Cytoscape web 35 applet.
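A co-authorship network of the most frequent authors can be derived from per-paper author lists roughly as follows. This is an illustrative Python sketch, not the MySQL-based indexing used by the server; the function name, the `top_n` cutoff and the placeholder author names in the usage note are assumptions, and author names are taken as given, without disambiguation.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def coauthor_network(
    author_lists: Iterable[List[str]],
    top_n: int = 10,
) -> Tuple[Counter, Dict[Tuple[str, str], int]]:
    """Count papers per author and co-authored papers per author pair.

    Returns (paper_counts, edges), where edges only links authors that are
    among the top_n most frequent ones; the edge weight is the number of
    papers the two authors share.
    """
    paper_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for authors in author_lists:
        unique = sorted(set(authors))  # sorted so each pair has one canonical order
        paper_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    frequent = {name for name, _ in paper_counts.most_common(top_n)}
    edges = {
        pair: weight
        for pair, weight in pair_counts.items()
        if pair[0] in frequent and pair[1] in frequent
    }
    return paper_counts, edges
```

For instance, three papers authored by `["Author A", "Author B"]`, `["Author A", "Author B", "Author C"]` and `["Author A"]` with `top_n=2` would keep a single edge between Author A and Author B with weight 2.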
Evaluation of named entity recognition (NER). The BioCreative task II 45 gene mention (BC2GM) corpus is concerned with the named entity extraction of genes and gene products mentioned in text. The BC2GM test set containing 5000 sentences was utilized for evaluation of the gene mention programs. BioCreative task II 48 gene normalization (BC2GN) is meant to link genes or gene products mentioned in the literature to standard database identifiers. The BC2GN test set containing 252 articles was utilized for evaluation of the gene normalization programs. BioCreative task III 47 gene normalization (BC3GN), containing 50 gold standard articles, is meant to link gene or gene product mentions in full-text literature. The NCBI disease corpus 40 is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Its test set contains 100 articles, which were utilized for evaluation of the disease normalization programs. AZDC corpus 42  True and false positives are the numbers of bio-entities that were identified correctly and incorrectly, respectively.
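The standard definitions of precision, recall and F-measure that such evaluations rely on can be written out as a short Python sketch. The balanced F1 form is assumed here, since the exact variant used is not stated in the surviving text; the function names and the set-based comparison of predicted and gold annotations are illustrative.

```python
from typing import Iterable, Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure from raw counts.

    tp: entities identified correctly; fp: entities identified incorrectly;
    fn: gold-standard entities that were missed.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def evaluate(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Score a set of predicted annotations against a gold-standard set."""
    predicted_set, gold_set = set(predicted), set(gold)
    tp = len(predicted_set & gold_set)
    fp = len(predicted_set - gold_set)
    fn = len(gold_set - predicted_set)
    return precision_recall_f(tp, fp, fn)
```

In a corpus evaluation, each annotation would typically be identified by document, position and normalized identifier rather than by name alone, so that the same entity mentioned in two documents is counted twice.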