Pedican: an online gene resource for pediatric cancers with literature evidence

Pediatric cancer (PC), that is, cancer occurring in children, is the leading cause of death among children worldwide, with an incidence of 175,000 per year. Elucidating the genetic abnormalities and underlying cellular mechanisms may lead to less toxic curative treatments. Therefore, it is important to understand the pathology of pediatric cancer at the genetic, genomic and epigenetic levels. To unveil the cellular complexity of PC, we have developed a database of pediatric cancers (Pedican), the first literature-based pediatric cancer gene resource, built through comprehensive literature curation and data integration. In the current release, Pedican contains 735 human genes, 88 gene fusion events and 24 chromosomal abnormality events curated from 2245 PubMed abstracts. Pedican provides detailed annotations for each gene, such as Entrez gene information, involved pathways, protein–protein interactions, mutations, gene expression, methylation sites, TF regulation, and post-translational modification. Additionally, Pedican has a user-friendly web interface that supports sophisticated text queries, sequence searches, and browsing by highlighted literature evidence and hundreds of cancer types. Overall, our curated pediatric cancer-related gene list maps the genomic and cellular landscape for various pediatric cancers, providing a valuable resource for further experimental design. Pedican is available at http://pedican.bioinfo-minzhao.org/.

Scientific Reports | 5:11435 | DOI: 10.1038/srep11435
pond4kids, is made up of hospital-based cancer registration and clinical information, not including patient genetic data. The genetic abnormalities relating to other harmful PCs are scattered in the literature without systematic collection and comparison. In this study, we integrated known genetic predisposition information from thousands of cases in the literature to complement the population-based study from PCGP. To this end, 2245 PC-related PubMed abstracts were collected and manually curated, which resulted in 735 PC-related human genes, 88 gene fusion events, and 24 chromosome-level events being recorded. Moreover, we provide comprehensive biological annotation for biological pathways, gene regulation, interaction and expression in a user-friendly way, which may help the PC community to obtain a better understanding of pathogenesis for various PCs, and even facilitate gene prioritization and prediction for PCs. In addition, this data resource also makes it feasible to compare the genetic differences between cancers in children and adults.

Results
Functional enrichment analyses pinpoint development-related NOTCH1, FGFR and GAB1 signaling transduction in PC. To explore the relevant biological processes of our collected genes, gene-set enrichment analysis was adopted to characterize whether the 735 PC-related genes had any significant annotations compared to all the human protein-coding genes. Using a strict cutoff (corrected p-value less than 0.01 and annotated genes accounting for more than 30% of all PC-related genes), we identified 35 statistically significant enriched pathways (Table S1) and 170 gene ontology terms (Table S2). Those enriched functional pathways are mainly related to cancers, such as transcriptional mis-regulation, constitutive PI3K/AKT signaling, proteoglycans and the P53 signaling pathway (Table 1). Notably, the top enriched gene ontology terms are all related to developmental processes, such as cell fate commitment, gland development, regulation of organ morphogenesis, stem cell proliferation, mesenchyme development, and morphogenesis of a branching epithelium.
The pathway analysis result also confirmed the gene ontology result. The PC-related genes were highly enriched in development-signaling pathways such as Notch1 intracellular domain regulates transcription, constitutive signaling by Notch PEST domain mutants, downstream signaling of activated FGFR, and the GAB1 signalosome. The Notch signaling pathway has a dual role in cancer (oncogenic and tumor suppressor functions) 8. It is hypothesized that Notch tends to modulate the epithelial mesenchymal transition (EMT) during cancer metastasis 9. However, the role of Notch signaling in PCs has only been studied in childhood T cell acute lymphoblastic leukemia (T-ALL) 10. More extensive studies of Notch signaling in other PCs will provide a rationale for Notch-based therapeutic strategies. FGFR is the receptor for fibroblast growth factors (FGFs), which are often relevant to cell stemness, proliferation, anti-apoptosis, drug resistance, and angiogenesis 11. In Pedican, four FGFRs (FGFR1, FGFR2, FGFR3, FGFR4) were recorded as related to PCs. For example, FGFR1 was reported to be associated with tumorigenesis of Ewing's sarcoma 12 and rhabdomyosarcoma 13. FGFR inhibitors have been shown to help overcome drug resistance, so FGFR-based therapeutic strategies are promising. More systematic studies using a targeted-sequencing approach will be useful to detect more candidate mutations in other PCs. GAB1 is a docking protein that transduces cellular signals from tyrosine kinases, such as Met (the hepatocyte growth factor receptor) and EGFR (the epidermal growth factor receptor). The role of the GAB1 signalosome in cancer has only been reported in breast 14 and colorectal cancers 15. Though GAB1 is not included in Pedican, as there is no direct link of GAB1 to any PCs, other components of the GAB1 signalosome are enriched in our 735 PC-related genes, such as PDGFB, PDGFA, EGFR, MDM2, CDK4, and PDGFRA.
In summary, our results highlight that multiple cellular signaling events are related to PCs, especially NOTCH, FGFR and GAB1 signaling. GAB1 is a good candidate gene for functional testing in PCs and in adult cancers.

PC-related genes are enriched in adult cancers, preterm birth and high birth weight.
Though previous studies show that PCs are different from their corresponding adult cancers 4, our disease-based enrichment analysis still shows connections between the PC-related genes and a broad spectrum of human adult cancers (Table S3). Although the enrichment analysis of PC-related genes cannot measure how much commonality exists in the underlying molecular mechanisms between PCs and adult cancers, it may imply that the overall signaling pathways of PCs are similar to those of adult cancers. The cancers involved are mainly breast, colorectal, lung, stomach, esophageal, bladder, prostate, pancreatic, cervical, liver and ovarian cancers, as well as leukemia, melanoma and glioma. Systematic comparison of PCs with adult cancers may provide a more comprehensive picture of the common molecular mechanisms underlying PCs and adult cancers.
Most interestingly, the 735 PC-related genes are also over-represented in endometriosis, type 1 diabetes (T1D), benzene toxicity, primary biliary cirrhosis, preterm birth, and high birth weight. The positive association of high birth weight with both childhood and adult cancers is shown by several studies [16][17][18][19][20][21]. Although an association between preterm birth and an increased incidence of breast cancer in the mother has been discussed previously 22,23, there is no direct evidence linking preterm birth to PCs. Our enrichment analysis may provide a clue for further exploration of the potential role of preterm birth in PCs. Further data mining on Pedican may therefore shed light on a potential role of birth weight and preterm birth in both PCs and adult cancers, including changes in hormone signaling during cancer development.
Prioritize the key genes in PC and their mutational landscape in pan-cancer genomic data. To systematically evaluate the importance of PC-related genes, we conducted a gene ranking with Endeavour, using 47 reliable genes as a training set (see Methods). The top ten ranked genes included CDK4, CCND2, IGF1R, PDGFRB, CHEK2, CASP10, ERBB3, ATR, E2F1, and RBL2. Not surprisingly, the majority of these top ranked genes are involved in key cancer pathways such as the cell cycle and the P53 signaling pathway.
Although our collected genes have been demonstrated to have abnormal gene expression or other functional relevance to PCs, a systematic examination of their genetic variants in pan-cancer data has not yet been conducted. These mutational patterns are useful for comparing PCs with their counterpart adult cancers. As shown in Fig. 1, the top 100 ranking PC-related genes (including 47 genes from the training set and 53 top ranked genes from Endeavour) are frequently mutated in adult cancers. Interestingly, the 100-gene set shows a mutation rate above 90% in several cancers and cell lines, including colorectal cancer, lung small cell cancer, bladder cancer, uterine cancer, ovarian cancer, squamous cell lung cancer, glioblastoma multiforme, pancreatic cancer, prostate cancer and melanoma. This result may highlight that PCs share substantial molecular mechanisms with adult cancers. Further comparison between a specific PC and its corresponding adult cancer may provide more clues.
The PC-related protein-protein interaction network is highly modularized. Using the integrative protein-protein interaction data from the Pathway Commons database 24, we performed a pathway reconstruction to present a cellular map related to PC. The reconstructed PC-related protein-protein interaction network contains 819 genes and 7720 gene-gene interactions with existing evidence from known biological pathways (Fig. 2). Among the 819 nodes, 725 are from our curated 735 PC-related genes. The remaining 94 are linker genes that bridge the PC-related genes to form a fully connected map. Therefore, the majority of curated PC-related genes are organized in a highly modular structure. This not only supports the precision of our data curation, but also reveals that the PC-related genes act in a high-density cellular module.
The common cancer genes across multiple PC types. On the basis of information from the literature, we annotated all the genes in Pedican with a specific cancer type. We classified all the PC types into 17 major groups according to anatomic and biological functions, including bone, cardiovascular, connective tissue, dermatological, developmental, ear/nose/throat, endocrine, gastrointestinal, genitourinary, hematological, immunological, muscular, neurological, ophthalmology, related syndrome, renal, and unclassified. The majority of PC-related genes are related to neurological (357) and hematological (220) cancers. Based on the common genes in the 17 PC groups, the overlapping relationships were plotted in Fig. 3. The plot revealed that multiple cancer groups potentially share molecular mechanisms. For instance, 58 common genes are found between neurological-related cancers and hematological-related cancers.
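The overlap counts described above can be illustrated with a minimal sketch. It assumes that each cancer group is represented as a set of gene symbols; the group contents and gene names below are hypothetical placeholders, not the curated Pedican assignments, and the function name is introduced here for illustration only.

```python
from itertools import combinations


def pairwise_overlap(groups):
    """Return {(group_a, group_b): shared_genes} for every pair with a non-empty overlap.

    `groups` maps a cancer-group name to a set of gene symbols.
    Group names in each key are in alphabetical order.
    """
    overlaps = {}
    for a, b in combinations(sorted(groups), 2):
        shared = groups[a] & groups[b]  # set intersection = common genes
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Hypothetical toy annotation (placeholder gene names).
toy_groups = {
    "neurological": {"GENE_A", "GENE_B", "GENE_C"},
    "hematological": {"GENE_B", "GENE_C", "GENE_D"},
    "renal": {"GENE_E"},
}
toy_overlaps = pairwise_overlap(toy_groups)
for (a, b), shared in toy_overlaps.items():
    print(a, b, len(shared), sorted(shared))
```

A table of such pairwise counts is the kind of input that a chord-style plot such as Fig. 3 summarizes.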

Conclusion
Pedican is constructed as a free database and analysis server to enable users to rapidly search and retrieve summarized PC-related genes. The functional enrichment analyses reveal that multiple developmental processes are related to PC-related genes involved in various cancer types. Our curated gene list provides clues for discovering common driver genes across multiple PCs and for exploring the differences between adult cancers and their counterpart PCs. Pedican is freely accessible at http://pedican.bioinfo-minzhao.org/.
Limitations and future work. This study aims to integrate literature and genomic data to explore the common mechanisms of different pediatric cancers. Compared with other public databases, our pediatric cancer database provides a curated, organized, and annotated gene list for pediatric cancer in an easily accessible way. From our web interface, users can not only find the reported genes related to pediatric cancer with their original references, but also obtain more comprehensive knowledge about these collected genes. Such a sustained, high-quality resource may support genetics-informed improvements in pediatric cancer treatment. For example, our reconstructed PC protein-protein interaction network may help researchers to connect novel candidates to the known pediatric cancer genes and to sum up small genetic effects at the biological pathway and network level. Additionally, our systematic comparison of the genes common to multiple PCs may be useful for the experimental design of genetic screens for different PCs. However, further data integration in multiple dimensions may provide deeper insight into the commonalities and unique features of pathogenesis in pediatric and adult cancers. Compared to the adult cancer project from TCGA, the PCGP project does not provide comprehensive genomic features such as methylation, microRNA regulation, and lncRNA expression. This may limit a precise understanding of the differences between pediatric cancers and their corresponding adult cancers. Armed with the knowledge in pediatric cancer databases and other sources of pediatric cancer data, we will establish genomics-informed programs that will improve our understanding of pediatric cancers. In the current study, we built the database on literature curation, which may slow down our database update cycle. Because pediatric cancer samples are difficult to collect, the number of small-scale studies related to PCs is not increasing dramatically.
To obtain updates of the relevant literature, we set up automatic literature searches using the My NCBI tool, which returns matching published articles every two weeks. According to statistics from the past half year, our search expression returns only about 0-10 abstracts from PubMed. We may consider using literature similarity to cluster newly available literature and accelerate our curation. Another limitation of our database may arise from the bioinformatic annotation, given the fast development of the cancer genomics field. To address this potential problem, we have implemented an automatic system to import functional information from a variety of data sources, which can help us integrate more genes with relevant annotations. Once the data content is updated, the web interface will be updated accordingly, on an annual basis.
retrieving all the resulting 2245 abstracts and grouping the literature using the "Related Articles" function in Entrez system; (iii) extracting PC-related text from the grouped articles. Those sentences related to PC were manually curated to obtain the correct gene names and cancer types; (iv) the extracted candidate gene names and aliases were mapped to the NCBI Entrez gene database manually. As a result, 735 Entrez human Gene IDs with high confidence were collected as PC-related genes. In addition to collecting the mutated genes, the gene fusion events and other chromosome events were also curated. Based on the curated cancer types, we grouped all the PC types into 17 major groups according to anatomic and biological functions. The overlapping cancer genes across cancer types were visualized using Circos 25.

Biological functional annotations and database construction.
To characterize meaningful biological functions, we retrieved comprehensive functional information from public resources, including crosslinks to NCBI Entrez gene 26 36 and PID Reactome 37,38. The relevant disease information was collected from GAD (Genetic Association Database) 39, KEGG Disease 40, FunDO 41,42, NHGRI 43, as well as OMIM 26. Comprehensive mRNA expression profiling data from both normal and tumor tissues were incorporated from the BioGPS database 44. Moreover, the original PC-related articles in the NCBI PubMed database are hyperlinked to each gene. Additionally, we collected mutation information from the PCGP 7 and COSMIC 45 databases. The protein-protein interaction data were integrated from the Pathway Commons database. To help construct regulatory networks, we also obtained various upstream and downstream regulators in humans, with emphasis on their regulatory transcription factors.
Gene ranking using Endeavour and cancer mutational pattern in multiple cancer types. Though we collected hundreds of genes related to various PC types, the common driver genes are still unclear because of the high genetic heterogeneity of PCs. The PC-related genes are scattered across individual studies, which often focus on verifying specific genes/variants predisposing to PCs. Thus, data integration and evaluation across all the cancer types may help to highlight important commonly mutated driver genes. To this end, we adopted the Endeavour gene ranking tool 46 to prioritize all the 735 genes in Pedican. Briefly, Endeavour extracts features from the training gene list using a multi-dimensional dataset, including gene expression, protein-protein interaction information, biological annotations, sequence features, and literature evidence. Here, we extracted 47 well-known PC-related genes (TP53, WT1, MYCN, RB1, SMARCB1, CDKN2A, MYC, RET, MDM2, ABCB1, IGF2, NF1, PMS2, CTNNB1, BRAF, ALK, PTCH1, MLH1, CDKN1A, BCL2, APC, MSH2, MLL, EGFR, CDKN1B, NTRK1, RASSF1, PTEN, MSH6, CDKN2B, CCND1, WNT1, SHH, PDGFRA, VHL, KRAS, CD99, CASP8, ATM, TNF, MTHFR, ETV6, CDKN1C, TFE3, NBN, KIT, and ERBB2), each supported by at least 10 abstracts, to build a ranking model. Using all the remaining 688 genes as input, Endeavour prioritizes the genes using the multiple extracted features. The Endeavour system then integrates all the outputs from the training models to form a global ranking of all the candidate PC-related genes by order statistics. In total, 643 human genes were ranked (Table S4). The top ten ranked genes include CDK4, CCND2, IGF1R, PDGFRB, CHEK2, CASP10, ERBB3, ATR, E2F1, and RBL2 (Table 2). Not surprisingly, the majority of these top ranked genes are involved in key pathways of cancers such as "TGF-beta signaling pathway".
Although these candidate genes have been demonstrated to have abnormal gene expression or other functional relevance to some PCs, most of them have not been detected as carrying genetic variants across all the reported PCs; the ranked list (Table S4) may therefore be useful for screening potential genes in new PCs and other diseases. The top 100 ranking PC-related genes were submitted to the cBio portal to obtain a mutation pattern across multiple cancers 47.
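To make the phrase "global ranking by order statistics" more concrete, the following sketch fuses per-source rank ratios into a single score. It assumes the recursive Q-statistic formulation commonly described for order-statistics rank fusion; the text above does not spell out the exact computation performed by Endeavour, so this is an illustration under that assumption rather than the tool's implementation. Function names, gene names, and rank ratios are hypothetical.

```python
from math import factorial


def q_statistic(rank_ratios):
    """Joint cumulative order statistic for a list of rank ratios in (0, 1].

    Assumed recursion: Q = N! * V_N, with V_0 = 1 and
    V_k = sum_{i=1..k} (-1)^(i-1) * V_(k-i) / i! * r_(N-k+1)^i,
    where r_1 <= ... <= r_N are the sorted rank ratios.
    Smaller Q means the gene ranks consistently near the top.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0
    for k in range(1, n + 1):
        x = r[n - k]  # r_(N-k+1) in 1-based notation
        v_k = sum(
            (-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
            for i in range(1, k + 1)
        )
        v.append(v_k)
    return factorial(n) * v[n]


def global_ranking(rank_ratios_by_gene):
    """Order genes from best to worst by their fused Q statistic."""
    return sorted(rank_ratios_by_gene, key=lambda g: q_statistic(rank_ratios_by_gene[g]))


# Hypothetical rank ratios (rank / number of candidates) from two data sources.
toy = {
    "GENE_A": [0.1, 0.2],
    "GENE_B": [0.5, 0.9],
    "GENE_C": [0.3, 0.1],
}
toy_order = global_ranking(toy)
print(toy_order)
```

For two sources the recursion reduces to Q = 2·r1·r2 − r1², which is the probability that two independent uniform ranks are at least as good as the observed ones; this makes the statistic easy to check by hand.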
Reconstructing a protein-protein interaction network related to PC genes. To explore the frequently mutated genes in PCs and determine the underlying biological mechanisms, we built a protein-protein interaction map based on all the 735 PC-related genes. To this end, we used a non-redundant human interactome from Pathway Commons, containing 3629 nodes and 36,034 protein-protein links. Notably, these protein-protein interactions are derived from pathway databases such as Reactome 37 and are therefore biologically meaningful. The final interactome contains pathway-based gene-gene interaction links. To build a sub-network related to the 735 PC-related genes of interest, we used a strategy similar to that implemented in our previous study 48. In this approach, all the input seed genes were mapped to the human interactome, from which a sub-network was extracted that connects the seed genes, where possible, through shortest paths. The final network visualization and topological properties were generated using Cytoscape (version 2.8) 49.
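The seed-based extraction can be sketched as follows. This is a simplified all-pairs breadth-first-search version written for illustration; it is not the algorithm of the cited previous study, whose details are not given here. The interactome is assumed to be an undirected adjacency dictionary, and all node names are hypothetical.

```python
from collections import deque


def shortest_path(adj, source, target):
    """Breadth-first search in an unweighted graph; returns a node list or None."""
    if source == target:
        return [source]
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb in parent:
                continue
            parent[nb] = node
            if nb == target:
                # Walk back from the target to the source.
                path = [nb]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nb)
    return None


def seed_subnetwork(adj, seeds):
    """Connect seed genes through shortest paths.

    Returns (all_nodes, linker_nodes); linkers are non-seed genes that
    lie on a shortest path between two seeds. Seeds absent from the
    interactome are ignored.
    """
    mapped = [s for s in seeds if s in adj]
    nodes = set(mapped)
    for i, a in enumerate(mapped):
        for b in mapped[i + 1:]:
            path = shortest_path(adj, a, b)
            if path:
                nodes.update(path)
    return nodes, nodes - set(mapped)


# Hypothetical toy interactome: S1 - L1 - S2, plus an unrelated edge X - Y.
toy_adj = {
    "S1": {"L1"},
    "L1": {"S1", "S2"},
    "S2": {"L1"},
    "X": {"Y"},
    "Y": {"X"},
}
toy_nodes, toy_linkers = seed_subnetwork(toy_adj, ["S1", "S2", "NOT_IN_NETWORK"])
print(sorted(toy_nodes), sorted(toy_linkers))
```

In this toy case the non-seed node L1 plays the role that the linker genes play in the reconstructed PC network.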
The structure of a biological network is often related to its functions 50. Generally, such networks follow a few topological rules, which are useful for characterizing their potential function. To explore the function of our reconstructed interactome based on PC-related genes, topological analyses were conducted using the NetworkAnalyzer plugin in Cytoscape (Fig. 3b-d) 49. In this study, the degree of a node was defined as its number of connections in the network 50. To capture how few steps a node needs to reach the other nodes, we computed the closeness centrality of each node in the network 50. The visualization of the whole network was performed using Cytoscape 49.
Pathway enrichment analysis. Throughout the paper, the representative pathways from KEGG and Reactome for each gene set were identified by KOBAS 51. In these pathway analyses, all the human protein-coding genes were set as background to calculate statistical significance. P-values for enriched pathways were computed with the hypergeometric test in KOBAS and corrected for multiple testing using the Benjamini-Hochberg procedure. Finally, the enriched human pathways with corrected P-values less than 0.01 were identified as over-represented pathways for each gene set.
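The two topological measures can be sketched in a few lines. The sketch assumes an undirected, unweighted adjacency dictionary and takes closeness centrality to be the reciprocal of the average shortest-path length from a node to the nodes it can reach, which is our reading of the usual definition; function names and the toy graph are hypothetical.

```python
from collections import deque


def degree(adj):
    """Number of connections of each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def closeness_centrality(adj, node):
    """Reciprocal of the mean shortest-path length from `node` to reachable nodes."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest distances
        cur = queue.popleft()
        for nb in adj[cur]:
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    total = sum(dist.values())
    reachable = len(dist) - 1  # exclude the node itself
    return reachable / total if total else 0.0


# Hypothetical toy network: a path A - B - C.
toy_path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
toy_degree = degree(toy_path)
toy_closeness = {n: closeness_centrality(toy_path, n) for n in toy_path}
print(toy_degree, toy_closeness)
```

The central node B has both the highest degree and the highest closeness, the pattern expected for hub genes in a dense module.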
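The statistical procedure above can be illustrated with a minimal standard-library sketch of a hypergeometric over-representation test followed by Benjamini-Hochberg correction. It is not KOBAS code; the function names, background size, and pathway counts below are hypothetical and chosen only to keep the example small.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric.

    N: background genes, K: background genes in the pathway,
    n: genes in the query set, k: query genes in the pathway.
    """
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical toy input: background of 100 genes, query set of 10 genes.
# Each pathway maps to (genes in pathway, query genes in pathway).
toy_pathways = {"pathway_1": (10, 6), "pathway_2": (20, 2), "pathway_3": (5, 0)}
names = list(toy_pathways)
raw = [hypergeom_sf(k, 100, K, 10) for K, k in toy_pathways.values()]
corrected = benjamini_hochberg(raw)
enriched = [name for name, p in zip(names, corrected) if p < 0.01]
print(enriched)
```

With a realistic background of all human protein-coding genes the same functions apply unchanged, since Python integers handle the large binomial coefficients exactly.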

Web interface development.
To provide a web interface for the public to access our collected information, we stored all the data and annotations in the relational database management system MySQL, which is open source and reliable. Using the Perl CGI module and JavaScript technology, a web interface was implemented to read and browse the database. The Apache web server on a Linux server was used to publish the web pages. In Pedican, all the human genes are mapped to NCBI Entrez gene IDs, which allows comprehensive hyperlinking to various bioinformatics data resources (Fig. 5A). For all the genes in Pedican, we provide five sub-pages to characterize five annotation categories: the general gene sequence information, the literature evidence for PC, the upstream regulators and other regulation events, the mutational pattern in PCGP data and the COSMIC database, and the protein-protein interaction data. For example, by incorporating the gene expression profile from BioGPS, we generated an expression bar chart to present an overview across various normal tissues and cancer samples (Fig. 5A). For the curated literature related to each gene, we highlighted keywords related to cancer or pediatric disorders in the abstracts so that users can review them conveniently.
To help users run text queries against the Pedican data, we developed six query forms covering pathway information, disease information, genomic location, literature evidence, gene expression range in normal/cancer samples, and mutation information (Fig. 5B). Notably, a quick text search for GeneID, gene symbol, and gene alias is available at the top right of each page, which lets the user retrieve any record in the database quickly. In addition, users can run a sequence similarity search (BLAST) against the nucleotide and protein sequences in Pedican (Fig. 5C). Users can also browse the data in Pedican by PC type, organ/tissue classification, significantly enriched pathway, related disease, reported linkage region, and chromosome number (Fig. 5D). For each related KEGG pathway, a marked chart is provided to highlight all the PC-related genes in that pathway. Finally, for offline data usage, we provide a downloadable plain-text gene list covering all the PC types for all 735 collected PC-related genes.
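The quick text search by GeneID, gene symbol, or gene alias can be sketched as a single relational query. The example below uses Python's built-in sqlite3 module as a stand-in for the MySQL and Perl CGI stack described above, and the two-table schema is a hypothetical simplification, not the actual Pedican schema. The two rows are included for illustration only.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical gene/alias layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT NOT NULL);
        CREATE TABLE alias (
            gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
            alias TEXT NOT NULL
        );
        """
    )
    conn.executemany("INSERT INTO gene VALUES (?, ?)", [(7157, "TP53"), (4613, "MYCN")])
    conn.executemany("INSERT INTO alias VALUES (?, ?)", [(7157, "P53"), (4613, "NMYC")])
    return conn


def quick_search(conn, term):
    """Match a term against GeneID, symbol, or alias (case-insensitive)."""
    query = """
        SELECT DISTINCT g.gene_id, g.symbol
        FROM gene AS g
        LEFT JOIN alias AS a ON a.gene_id = g.gene_id
        WHERE CAST(g.gene_id AS TEXT) = :term
           OR UPPER(g.symbol) = UPPER(:term)
           OR UPPER(a.alias) = UPPER(:term)
        ORDER BY g.gene_id
    """
    # A bound parameter keeps user input out of the SQL text.
    return conn.execute(query, {"term": term}).fetchall()


demo = build_demo_db()
print(quick_search(demo, "p53"))
print(quick_search(demo, "4613"))
```

Keeping aliases in a separate table is one simple way to let a single search box resolve several names to the same Entrez Gene ID.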