Introduction

Animal regeneration refers to the regrowth of damaged or diseased body parts to completely restore function1,2. It involves stem cells that have the capacity to differentiate and mature into a variety of cell types depending on the potency of the stem cell and the organism. The ability to regenerate differs vastly across the animal kingdom. Among metazoans, animals such as hydra, planaria, starfish and several worms can regenerate their entire body from a small body fragment3, whereas birds, nematodes and leeches have lost all capacity for self-renewal2.

The majority of human tissues and organs possess limited self-renewal and true-regeneration abilities, which is not to be confused with compensatory growth, the mechanism by which tissues such as the liver recover from trauma. Regenerative medicine is an area that promises to repair damage following traumatic injury or disease, by direct stimulation of a wound-site, or by introduction of exogenous, man-made tissue4. Multiple therapeutic strategies are being explored including: small molecules, gene delivery and stem cells. Recent advances in tissue engineering provide more practical approaches to achieving regeneration; tissue engineering can enhance the regenerative cascade and stimulate production of the body’s own complex tissues by replacing lost or damaged material5. However, progress with transplantations has been hampered due to the complexity of the interactions and regulatory systems involved, as well as the sheer diversity of tissues and organs these cells differentiate into.

The molecular mechanisms of regeneration are well studied in several model organisms. For example, the SmedGD and Planform databases were developed to browse the genomes of regenerative free-living species, including Schmidtea mediterranea, a freshwater planarian with a capacity to regenerate from small body fragments into a complete body6,7. Additionally, numerous studies have focused on limb regeneration, and these have been systematically combined into the Limbform resource8. However, these resources are focused on a limited number of species or on limb regeneration only, so a broader view of multi-species organ/tissue regeneration is still lacking. Moreover, the differences and similarities between regenerative processes are unclear. To elucidate the commonalities, the data must be mined systematically for all kinds of regeneration and integrated into one resource that provides the knowledge needed to eventually understand, manipulate and control regenerative properties. The majority of regeneration studies to date have not looked beyond the gene level. With the development of affordable high-throughput sequencing technology, however, a few studies have characterized the change in gene expression during limb regeneration in salamanders9, fin, heart and retinal regeneration in zebrafish10 and fin regeneration in medaka11. Furthermore, numerous microRNAs have now been identified as regeneration genes12,13,14, which adds to the complexity of the regenerative cellular signaling map. Importantly, these studies lack cross-species data integration and thus fail to provide the whole picture of regenerative cellular processes. In addition, the relationship between regenerative processes and common diseases such as cancer has not been explored systematically, although some clues have been documented15.

In this study, we curated genes with identified links to regeneration, from an array of tissue types and species listed in 1293 PubMed abstracts. Additionally, well annotated regeneration genes from the gene ontology annotation (GOA) database16 were integrated, producing a total of 948 human regeneration-related genes; 8445 homologs from another 11 species were also obtained. Moreover, we provide high quality annotations detailing biological pathways, gene expression, regulation and interaction, to aid regeneration researchers in obtaining a rapid understanding of the known molecular mechanisms of regeneration in various tissues/organs. This data resource also makes it feasible to prioritize genes by their regeneration-associated importance and to identify both the common and unique cellular events involved in different regenerative processes.

Results and Discussion

Data integration and literature search

The primary aim for the REGene database was to collect and maintain a high quality animal regeneration gene resource, which serves as a comprehensive, classified and accurately annotated regeneration gene knowledgebase. The database provides extensive cross-references and querying functionality. It is in the public domain and freely accessible to support the animal regeneration and regenerative medicine research community in the design of systematic regeneration and regenerative medicine studies. In order to provide a comprehensive resource, we collected known regeneration genes from the gene ontology annotation database (GOA)16 and the GeneRIF literature database17 (Fig. 1). To retrieve a comprehensive list of annotated genes from GOA, we curated 20 GO terms related to regeneration and extracted 549 genes associated with these GO terms from the GOA database (see Methods for GO terms). Due to the pace of research in this field and the volume of data generated, GOA annotation does not always provide the most up-to-date literature to support regeneration gene roles; data curation is, by its nature, always a step behind regenerative biology research.

Figure 1

Pipeline for collection and annotation of regeneration-related genes.

To provide a detailed and precise regeneration gene resource with literature evidence, we performed an extensive literature query of the GeneRIF (Gene Reference Into Function) database (17/12/14) using the keyword “regeneration”, which returned 2245 PubMed abstracts. GeneRIF is a collection of short gene function descriptions for entries in the Entrez Gene database17. To ensure the precision of the collected regeneration information, much care was taken regarding species information and the regenerative organ/tissue. For example, in the sentence “ACF regulates liver regeneration following partial hepatectomy at least in part by controlling the stability of IL-6 mRNA”18, the gene ACF was listed as a synonym for mouse A1cf in the current Entrez Gene database. Following careful manual inspection, the list was refined to 1417 Entrez genes from various species, obtained from 1293 PubMed abstracts. To provide a more comprehensive overview, we mapped all 1417 genes to 936 homologous groups using the NCBI HomoloGene database, as implemented in previous analyses19,20,21,22. By assimilating the regeneration-related genes from GOA, we consolidated our list for further annotation and database construction to 948 human genes, including 929 protein-coding and 19 non-coding genes (Table S1). Using these human genes, we were able to retrieve 8445 homologs from 17 experimental model organisms using the HomoloGene database.

Representative entry in REGene

To provide data access for the regeneration community, we constructed a web-based platform, REGene, to store all the information for regeneration-related genes (REGs). As shown in Fig. 2, a typical REGene gene entry contains six categories of information, accessible by clicking the labels “General information,” “literature,” “Expression,” “Regulation,” “Homolog,” and “Interaction” displayed at the top of the page. The basic information, including gene name, pathway, disease association, nucleotide sequence and protein sequence, can be found in a tabular view on the “General information” page (Fig. 2A). Highlighted summaries of supporting literature and gene ontology annotation sources are provided on the “literature” page (Fig. 2A). On the “Expression” page, gene expression values from 84 normal tissues and 184 tumor samples are displayed as a bar plot with the sample name and normalized expression scores (Fig. 2A), which is useful for exploring the tissue specificity of each regeneration gene among normal and tumor samples. Take the gene WNT10B as an example: the expression bar view indicates that it is relatively highly expressed in certain brain regions: the temporal lobe and the superior cervical ganglion (Figure S1). The “Homolog” page allows the user to map human genes to 17 model species, including a filamentous fungus (Ashbya gossypii), baker’s yeast (Saccharomyces cerevisiae), cattle, chicken, chimpanzee, dog, fission yeast, frog, fruit fly, milk yeast, mosquito, mouse, Neurospora, rat, rhesus monkey, worm and zebrafish. Additionally, the sequences listed on the page can easily be retrieved for phylogenetic relationship analysis (Figure S2). The “Regulation” page classifies regulatory information, including interactions with transcription factors, post-translational modification information and methylation features for each REG.
For those interested in systems biology, the interaction partners of each REG are presented on the “Interaction” page, which covers different interaction categories including physical interactions from high-throughput experiments, as well as metabolic and signaling interactions from known pathway databases23.

Figure 2

Web interface of REGene.

(A) The basic information in each regeneration-related gene page. The expression values in the bar represent the relative expression scores from BioGPS database. (B) Query interface for text search; (C) Quick search button for gene symbol-based search. (D) BLAST search interface for comparing query against all sequences in REGene.

To accommodate a broad range of user queries against the REGene data, we developed six query options: pathway information, disease information, genomic location, literature evidence, gene expression range in human samples and homology information (Fig. 2B). Notably, a quick text search for the GeneID, gene symbol and gene alias is available at the top right of each page, allowing the user to retrieve any desired information from the database quickly (Fig. 2C). Users can run a sequence similarity search (BLAST) against the nucleotide and protein sequences in REGene (Fig. 2D), or explore other features of the data, including the organ/tissue type, significantly enriched pathway, related disease, reported linkage region and chromosome number. For each related KEGG pathway, a marked chart is provided to highlight all the known regeneration-related genes (Figure S3). Finally, for offline data usage, we provide a downloadable plain text gene list corresponding to all the organ/tissue types for all 948 regeneration-related genes collected.

Functional analysis of human REGs revealed an enrichment of cell proliferation and developmental processes

To explore the biological processes associated with our collected genes, gene-set enrichment analysis was adopted to test whether the 929 human protein-coding REGs carried any significantly over-represented annotations compared with all human protein-coding genes. Using a strict cutoff (corrected P-value less than 0.01 and annotated genes accounting for more than 30% of all 929 REGs), we identified 30 gene ontology (GO) terms (Table 1) and 17 significantly enriched phenotypes (Table S2). The enriched GO terms are chiefly related to cell proliferation and development; specific examples include regulation of developmental processes, tissue development and regulation of cell proliferation (Table 1). Interestingly, the enriched GOs also include cellular processes in response to wounding, oxygen-containing compounds and endogenous stimuli. This finding aligns with studies in zebrafish demonstrating that low oxygen (hypoxia) can adversely affect heart regeneration24. The other GO clusters are associated with cell apoptosis, metabolism and locomotion. For the 17 enriched phenotypes, the majority relate to abnormal organ morphology and physiology, such as abnormal cardiovascular system and immune system morphology/physiology. Moreover, at least 437 REGs represent essential genes related to “prenatal lethality” or “lethality during fetal growth through weaning” in mouse models. This large number of essential genes among the human REGs also highlights their critical roles in organism development.

Table 1 Summary of statistically significant enriched gene ontology annotations of regeneration-related genes.

Enriched REGs encode proteins involved in cancer-related processes and contain domains highly affiliated with cancer

Further gene set enrichment analyses for diseases, pathways and protein domains revealed that human REGs are enriched in cancer-related signaling pathways and domains, such as the PI3K-Akt signaling pathway and EGF domains (Tables S3 and S4). To explore the role of REGs in specific cancers, all REGs were mapped to the KEGG colorectal cancer and pancreatic cancer pathways; as shown in Figure S3, over 90% of genes associated with the colorectal and pancreatic cancer pathways are REGs. The specific connections between REGs and a broad spectrum of human adult cancers (Table S3) may provide a better understanding of common mechanisms utilized by both processes. To date, few studies in the scientific literature have linked tissue regeneration with cancer15,25,26,27,28. Importantly, the enrichment analysis of REGs does not quantitatively measure the degree of commonality between the molecular mechanisms that underpin regeneration and cancers; rather, it implies that the relevant signaling pathways of the two are very similar. This link not only provides insight into the cellular process, but also suggests a cancer-like regulation of regenerating tissue. For instance, 12 intestine REGs are enriched in colorectal cancer gene sets (corrected P-value = 0.00042). By the same token, systematic comparison of regeneration with specific diseases using the REGene database may provide a more comprehensive picture of the underlying molecular mechanisms of the two processes, both in terms of the particular tissue inspected and more holistically. For example, 22 heart REGs are associated with coronary artery disease (corrected P-value = 0.0086), which suggests that certain signaling components/pathways are shared by these two vastly different processes.

A key finding in our analysis was identifying 54 REGs that contain epidermal growth factor (EGF)-related domains. These over-represented EGF domains are EGF-1, EGF-2, EGF-3 and EGF-like domains. EGF proteins have profound roles in various regenerative processes, including liver regeneration29 and regulation of hematopoietic regeneration after radiation-damage30. At the same time, the EGF-related family has been implicated in carcinoma cell growth and survival, through multiple ligands to induce cell transformation31. Shared EGF-related proteins and relevant downstream pathways further solidify the link between regenerative processes and complex diseases like cancer. Consequently, further research regarding EGF-related REGs has the potential to not only deepen our insight in the regenerative biology field, but may direct the development of potential anti-cancer therapeutics targeting EGF pathways.

Common REGs across multiple regenerative tissue types are shared with cancers

Information derived from the existing regeneration literature facilitated gene annotation for all REGene entries with a specific tissue/organ type. Tissue/organ types were collected into 17 major groups of regenerative tissue: bone, cartilage, endothelia, epithelia, hair cell, intestine, kidney, liver, muscle, nervous system, pancreas, retina, salivary gland, skin, spinal cord, stem cells and miscellaneous. The majority of human REGs were identified from nerve (284 genes, 29.95% of total 948 REGs), liver (246 genes, 25.95%) and muscle (197 genes, 20.78%) tissues. The relationships among genes identified in multiple regenerative tissues/organs were plotted (Fig. 3). The plot suggests that the molecular machinery adopted by regenerative processes in different tissues possesses common components, a feature that could logically be attributed to either evolutionary expediency or functional importance. In total, 149 human REGs were involved in the regeneration of 2 tissue types. In addition, 85 human REGs were shared by 3 or more regenerative tissues. This large number adds further weight to the conjecture that the regenerative processes in multiple tissue types share molecular mechanisms. Further functional enrichment analysis of these 85 REGs not only confirmed their roles in regeneration (P-value = 6.75e-12, Table S5), but also linked the REGs to a multitude of cancer types, including bladder cancer, breast cancer, colorectal cancer, endometrial cancer, kidney cancer, oral cancer, pancreatic cancer, prostate cancer and stomach cancer (all corrected P-values are less than 0.05, Table S5). In conclusion, the large overlap observed between common REGs and cancer pathways points to shared molecular mechanisms for tissue regeneration and cancer progression.
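The counts of genes shared by two, or by three or more, tissue groups can be derived from a simple tissue-to-gene mapping. The following is a minimal sketch of that bookkeeping, not the authors' code; the function name and the toy tissue assignments are illustrative assumptions and do not reproduce REGene content.

```python
from collections import Counter

def tissue_sharing_counts(tissue_to_genes):
    """Count, for each gene, how many tissue groups list it."""
    counts = Counter()
    for genes in tissue_to_genes.values():
        counts.update(set(genes))  # set() guards against duplicates within a group
    return counts

# Toy mapping for illustration only (assumed, not taken from the database).
toy_tissues = {
    "liver": {"HGF", "IL6", "EGFR"},
    "muscle": {"IGF1", "IL6"},
    "nervous system": {"BDNF", "IL6", "EGFR"},
}

counts = tissue_sharing_counts(toy_tissues)
in_two = sorted(g for g, c in counts.items() if c == 2)
in_three_plus = sorted(g for g, c in counts.items() if c >= 3)
```

Applied to the full annotation table, the lengths of `in_two` and `in_three_plus` would correspond to the two totals reported in the paragraph above.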

Figure 3

Shared regeneration-related genes across multiple regenerative processes.

The length of each circularly arranged segment is proportional to the total number of genes in that regenerative process group. The ribbons connecting different segments represent the number of shared genes between regenerative process groups. The outer ring consists of stacked bar plots that represent the relative contribution of the other regenerative process groups to each group's total. The 17 regenerative tissues/organs are bone, cartilage, endothelia, epithelia, hair cell, intestine, kidney, liver, muscle, nerves, pancreas, retina, salivary gland, skin, spinal cord, stem cells and miscellaneous (abbreviated as MISC).

Prioritization of key genes in animal regeneration reveals abundant mutations across multiple cancer types

To systematically evaluate the importance of regeneration-related genes, we conducted a gene ranking analysis, using ToppGene (see methods) with a training set of 19 reliable genes supported at least 10 times within the literature. The resultant top ten ranked genes consisted of: APC, ERBB2, MTPN, PTEN, CDH1, CDKN2A, MCAM, FGL1, MIR204, MIRLET7A1 (Table S6). Not surprisingly, the majority of these genes are components of pathways regulating cell proliferation and tumorigenesis such as the cell cycle control and DNA damage pathway.

Although these REGs are over-represented in a number of cancers, the systematic examination of genetic variants in multiple cancers requires further investigation. Such mutation patterns could vastly augment comparisons of REGs with their anatomically corresponding cancers. With comprehensive cancer genomics datasets available via The Cancer Genome Atlas (TCGA) project, there exists an unprecedented opportunity to explore the global genetic mutation of REGs in multiple cancer types. As shown in Fig. 4, the top 100 ranked REGs (comprising 81 top ranked genes and the 19 genes used as the ToppGene training set) carry an overwhelming number of mutations in cancers (Table S7); these 100 genes are mutated in over 90% of patients across 30 different cancer types. A striking example is the lung squamous cell carcinoma cohort of 178 patients, all of whom presented with mutations in the top 100 REGs. Likewise, the top 100 REGs are mutated in 99.60% of a 239-strong cohort of Uterine Corpus Endometrioid Carcinoma patients, over half of which were single nucleotide mutations. The same pattern can be seen in a host of other cancers, such as colorectal cancer from the TCGA dataset (98.10% of 208 individuals) and ovarian cancer (99% of 308 individuals). As summarized in Table S7, the top 100 REGs have mutations in over 50% of patients across 67 major cancer types. This result strongly suggests that REGs may have important roles in cancer progression, roles that are shared among various cancers. Further comparison between the regenerative process of specific tissues and the corresponding cancer types may clarify the nature of these cancer connections.

Figure 4

The mutational landscape for the top 100 ranked regeneration-related genes in multiple cancers.

CNA represents copy number alteration. The presence of any mutation in a cancer type is indicated with “+”; the absence of a specific mutation is indicated with “−”. The same cancer types are marked with the same color.

Mutation frequency on the protein domain level was further explored within the 19 REGs implemented earlier as a training set for ToppGene (AKT1, BDNF, BMP2, CTNNB1, CXCL12, EGFR, FGF2, GAP43, HGF, IGF1, IL6, MET, RTN4, RTN4R, SOCS3, STAT3, TGFB1, TP53, VEGFA) supported at least 10 times within the scientific literature. As shown in Table S8 and Figure S4: AKT1 has variations in 108 samples from 16 adult cancers (bladder, breast, cervical, colorectal, glioblastoma, head and neck, liver, lung adenocarcinoma, lung squamous cell carcinoma, melanoma, pancreas, papillary renal cell carcinoma, prostate, stomach, thyroid, uterine cancer). In total, the 19 REGs possess 8221 mutation events in multiple cancer types, mainly concentrated in regions encoding protein functional domains. To put it succinctly, a great many well-known cancer genes, such as TP53, EGFR and AKT1, have prominent roles in the regeneration processes; this striking overlap of genes and pathways is indicative of an, as yet, unexplored connection between cancer and regeneration.

Reconstructed REG protein-protein interaction network exhibits a highly modular structure

To develop a thorough picture of the regenerative processes and construct a comprehensive cellular map of regeneration, the connections among top ranked REGs, as recorded in reliable public data sources, were explored. The top ranked 100 REGs were incorporated into an interactome from the Pathway Commons database, which combines all prevailing pathway databases to provide functional gene-gene interaction pairs23. The extracted sub-network of REGs contains 97 genes and 534 gene-gene interactions (Fig. 5A). It is worth noting that all interactions are based on current evidence from known biological pathways with biological meaning, not physical interactions from high-throughput experiments (Fig. 5A). Of the 97 nodes, 90 are among our top 100 ranked REGs; the remaining 7 are linker genes that connect REGs, facilitating their cellular function. The vast majority of top ranked REGs are linked to each other in such a way as to form highly modular structures. This serves to further verify our earlier deductions and reveals that REGs are highly connected to each other, assuming a high-density modular structure.

Figure 5

Reconstructed regenerative cellular map using pathway-based protein-protein interaction data.

(A) The 90 genes in orange are genes in our REGene; the remaining 7 genes in blue are linker genes that connect the 90 genes; the size of the node represents the number of connections in the network; (B) Degree distribution; (C) Short path length frequency.

Further topological analysis of the REG network reveals a high degree of interconnectivity among the nodes. Only 14 nodes were limited to one connection (Fig. 5B), which implies that the majority of nodes are capable of communicating with each other rapidly across short paths. The degrees of all nodes in our regeneration map follow a power law distribution $P(k) \sim k^{-b}$, where P(k) represents the probability that a gene has links with k other genes and b represents an exponent with an estimated value of 0.622. The resultant map of the REG network is quite different from other human PPI (protein-protein interaction) networks, where most nodes are sparsely connected, with an exponent b of 2.9 (ref. 32). This topological feature indicates a high degree of connectivity: the most frequent shortest path lengths in the network are the relatively small values 2 and 3, meaning that ~76.9% of node pairs can be reached in only two or three steps (Fig. 5C). With high modularity, the hub nodes in this network may have prominent roles, acting as common connections to mediate rapid and efficient information transfer. In total, there are 6 genes with ≥30 connections: UBC (58 connections), CTNNB1 (40), STAT3 (34), TP53 (30), GSK3B (30) and EGFR (30). With the exception of UBC, all these hub genes are from our literature-based gene set. In summary, network analysis of the REGs identified novel linker gene hubs that are likely to be crucial to regenerative processes, in addition to revealing a highly modular structure in the network of all analyzed regenerative genes.

Conclusion and future plan

REGene is the first literature-based gene resource dedicated to furthering animal research by integrating multi-dimensional bioinformatics data consisting of: gene expression, regulation, homology and interactions. It should prove a valuable tool to probe the molecular mechanisms underpinning animal regeneration and thus expedite the development of regenerative medicine therapies. The REGene database is in the public domain and freely accessible at http://regene.bioinfo-minzhao.org/.

The high heterogeneity of cellular processes presents an enormous challenge to understanding animal regeneration. Classical approaches for identifying candidate genes that relate to specific regenerative phenotypes have been applied; however, these studies seldom incorporate multiple-species comparisons. Building on REGene, we plan to integrate homologous genes from other species with regenerative capacities, including salamanders, axolotls and other species of hydra and planarian, as well as starfish, for which gene resources are steadily becoming available. This information will further enable a comparative systems biology approach to summarize the commonality and uniqueness of animal regeneration, removing bias resulting from any single species study or technology platform. For cancer-related studies, it will also be interesting to compare the REGs with other cancer-related processes33,34,35 or genes in specific cancer types36,37,38. We will continue to maintain and update the REGene database as new research references appear, particularly data from large-scale genomics studies such as ChIP-seq. Since our study indicated that many regeneration-related genes are involved in cancer progression, we also plan to integrate high-throughput cancer genomics data.

Methods

Data collection

To collect the regeneration-related genes, the gene ontology annotation database was downloaded on Dec 8th, 2014 (ref. 16). Twenty gene ontology (GO) terms related to regenerative processes were collected as follows: axon extension involved in regeneration (GO:0048677), axon regeneration (GO:0031103), cardiac muscle tissue regeneration (GO:0061026), collateral sprouting of injured axon (GO:0048674), dendrite regeneration (GO:0031104), fin regeneration (GO:0031101), formation of growth cone in injured axon (GO:0048689), liver regeneration (GO:0097421), MAPK cascade involved in axon regeneration (GO:1903616), myoblast differentiation involved in skeletal muscle regeneration (GO:0014835), myoblast migration involved in skeletal muscle regeneration (GO:0014839), myotube differentiation involved in skeletal muscle regeneration (GO:0014908), neuron projection regeneration (GO:0031102), organ regeneration (GO:0031100), peripheral nervous system axon regeneration (GO:0014012), regeneration (GO:0031099), sensory epithelium regeneration (GO:0070654), skeletal muscle satellite cell maintenance involved in skeletal muscle regeneration (GO:0014834), skeletal muscle tissue regeneration (GO:0043403), tissue regeneration (GO:0042246).
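Extracting genes annotated with a fixed set of GO terms from an annotation file can be sketched as follows. This is an illustrative sketch rather than the authors' pipeline: it assumes the annotations are in the tab-separated GAF 2.x layout (comment lines start with "!", column 3 is the gene symbol, column 4 the qualifier, column 5 the GO ID), it assumes that "NOT"-qualified annotations should be skipped, and the function name and toy rows are hypothetical.

```python
import io

# Four of the 20 GO IDs listed above; the full set would be used in practice.
REGENERATION_GO_IDS = {"GO:0031099", "GO:0042246", "GO:0097421", "GO:0031103"}

def regeneration_genes_from_gaf(handle, go_ids):
    """Return the set of gene symbols annotated with any of the given GO IDs."""
    genes = set()
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # malformed row
        symbol, qualifier, go_id = cols[2], cols[3], cols[4]
        if "NOT" in qualifier.split("|"):
            continue  # assumption: negated annotations are excluded
        if go_id in go_ids:
            genes.add(symbol)
    return genes

# Toy rows for illustration only; accessions and references are placeholders.
TOY_GAF = "\n".join([
    "!gaf-version: 2.0",
    "UniProtKB\tTOY1\tIL6\t\tGO:0097421\tPMID:0\tIEA",
    "UniProtKB\tTOY2\tTP53\t\tGO:0006915\tPMID:0\tIEA",
    "UniProtKB\tTOY3\tGAP43\tNOT\tGO:0031103\tPMID:0\tIEA",
    "UniProtKB\tTOY4\tTGFB1\t\tGO:0042246\tPMID:0\tIEA",
])

toy_genes = regeneration_genes_from_gaf(io.StringIO(TOY_GAF), REGENERATION_GO_IDS)
```

For a real download, the `io.StringIO` object would be replaced by an open file handle on the annotation file.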

To further curate matched literature, 2245 PubMed abstracts associated with regeneration were downloaded for manual review. Curation of regeneration genes from literature included three major steps: (1) grouping all 2245 PubMed abstracts by topic, using the “Related Articles” function in Entrez; (2) extracting descriptions of regeneration genes from grouped abstracts; (3) manually collecting gene names from the descriptions of the regeneration genes and mapping the gene names to Entrez gene IDs. These three steps allowed us to quickly and easily evaluate if and how, the curated abstracts were related to regeneration genes while allowing for cross validation between multiple literature sources. Here, Entrez gene IDs for regeneration genes served as the REGene database’s crosslink between the same genes from different public databases.

Database interface

The REGene database interface is written in Perl CGI and JavaScript. The database management system is MySQL, which stores the relational data model. The website is served with Apache on a server running Red Hat 4.4.7-11. The dynamic pages are mainly implemented using two primary components: web applications for browsing and searching, and CSS that controls the general page style.

Biological functional annotations and database construction

To better understand the function of the regeneration genes collected into the REGene database, associated functional information for each gene was collected. Representative annotations in the REGene database are summarized in Table 1. Basic gene information is included, such as gene names from the Entrez Gene database39 and crosslinks to the rate-limiting enzyme database RLEdb40 and the text mining server iHOP41. For functional annotations, the pathways involving the genes were retrieved from BioCyc42, KEGG Pathway43, PID Curated44, PathLocdb45, PANTHER46 and PID Reactome47,48; possible associations with diseases were also extracted from KEGG Disease49, FunDO50,51, GAD52, NHGRI53 and OMIM54 using the functional annotation server KOBAS55,56. Additionally, possible post-translational modifications and transcription factor regulation information were collected from dbPTM57 and the TRANSFAC database58, respectively. Digital gene expression information for 184 tumor samples and 84 normal tissues was integrated from BioGPS59, while information about methylation sites and protein-protein interactions was integrated from the DiseaseMeth60 and Pathway Commons23 databases, respectively.

Gene ranking using ToppGene and cancer mutational pattern analysis

Hundreds of genes were collected from various organ/tissue types, but the REGs common to them were still unclear. The REGs were scattered across individual studies, which often focus on verifying highly specific tissue/organ regeneration. Thus, data integration and evaluation across all the regenerative types may help to highlight important common REGs and their global involvement in regenerative processes. To this end, the ToppGene gene ranking tool61 was used to prioritize all 948 genes in the REGene database. Essentially, the ToppGene tool extracts features from a training gene list using a multi-dimensional dataset, including biological annotations, gene expression, sequence features, protein-protein interaction and literature evidence. In this analysis, the training set comprised 19 well-known regeneration-related genes (AKT1, BDNF, BMP2, CTNNB1, CXCL12, EGFR, FGF2, GAP43, HGF, IGF1, IL6, MET, RTN4, RTN4R, SOCS3, STAT3, TGFB1, TP53, VEGFA) that were supported within the scientific literature by a minimum of 10 studies. The resultant gene-prioritizing model took the remaining 921 genes as input and integrated all the outputs from the training models to form a global ranking for all the candidate REGs.

Based on the curated gene information, all organ/tissue types were collected into 17 major groups according to anatomical and biological functions. The overlapping cancer genes across cancer types were visualized using Circos62, while the mutational landscape in multiple cancer types for the top 100 ranked REGs was generated using the cBio portal63.

Functional enrichment analysis

Throughout this research, the representative pathways from KEGG and Reactome for each gene set were identified by KOBAS41. In these pathway analyses, all human protein-coding genes were set as the background in order to calculate statistical significance. P-values for enriched pathways were calculated with the hypergeometric test and corrected for multiple testing with the Benjamini-Hochberg procedure in KOBAS. Finally, human pathways with corrected P-values less than 0.01 were identified as over-represented pathways for each gene set.
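The two statistical ingredients named here, a hypergeometric over-representation test followed by Benjamini-Hochberg correction, can be illustrated with a short standard-library sketch. It is not the KOBAS implementation; the function names and the toy counts are assumptions chosen for illustration.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    upper = min(K, n)
    total = comb(N, n)
    # Sum the exact probabilities of observing k, k+1, ..., upper annotated genes.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy example (assumed numbers): background of 100 genes, gene set of 10,
# three pathways of 20 annotated genes each with 6, 3 and 2 hits.
toy_pvalues = [hypergeom_sf(k, 100, 20, 10) for k in (6, 3, 2)]
toy_adjusted = benjamini_hochberg(toy_pvalues)
toy_enriched = [q < 0.01 for q in toy_adjusted]  # same cutoff as in the text
```

Exact integer binomials keep the sketch dependency-free; for genome-scale backgrounds a statistics library would usually be preferred for speed.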

Reconstructing a protein-protein interaction network related to REGs

To explore the relevant biological mechanisms related to REGs, all protein-protein interactions associated with the 948 REGs were extracted. To this end, we used a non-redundant human interactome from the Pathway Commons database23, containing 3629 proteins and 36,034 protein-protein interactions. It is of note that the collected protein-protein interactions are from pathway databases (HumanCyc42, Reactome36,37 and KEGG pathway43) and therefore have biological meaning, rather than representing physical interaction alone. Thus, the final interactome is comprised of pathway-based interactions. To extract a sub-network related to the top 100 ranked REGs, we used an approach similar to that implemented in our previous study64. In this algorithm, all 100 REGs were mapped to the human pathway-based interactome, which was used to produce a sub-network with as many input genes as possible connected by their shortest paths.
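One simple way to connect seed genes through shortest paths is sketched below. It is a generic breadth-first-search illustration and should not be read as the algorithm of ref. 64; the function names, the `max_len` cutoff and the toy node labels are all hypothetical.

```python
from collections import deque
from itertools import combinations

def shortest_path(adj, source, target):
    """Breadth-first search; returns one shortest path as a list, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(adj.get(node, ())):  # sorted for reproducible output
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def seed_subnetwork(edges, seeds, max_len=2):
    """Join seed genes via shortest paths of at most `max_len` edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    present = [s for s in seeds if s in adj]  # seeds missing from the interactome are dropped
    sub_edges = set()
    for s, t in combinations(present, 2):
        path = shortest_path(adj, s, t)
        if path and len(path) - 1 <= max_len:
            sub_edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    nodes = set().union(*sub_edges) if sub_edges else set()
    linkers = nodes - set(seeds)  # non-seed nodes that bridge seed genes
    return sub_edges, linkers

# Toy interactome with hypothetical node names.
toy_edges = [("SEED_A", "LINK_1"), ("LINK_1", "SEED_B"), ("SEED_B", "N1"),
             ("N1", "N2"), ("N2", "SEED_C")]
toy_sub_edges, toy_linkers = seed_subnetwork(toy_edges, ["SEED_A", "SEED_B", "SEED_C"])
```

In this toy case only the two-step path between SEED_A and SEED_B falls under the cutoff, so LINK_1 plays the role of a linker gene while SEED_C stays unconnected.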

Generally speaking, biological networks are extremely complex, but often follow a few simple rules that may relate to their function65. Essentially, the topological properties of networks can yield clues that reveal elements of their function. To explore the REG interactome, the NetworkAnalyzer plugin in Cytoscape 2.8 was used to calculate the topological properties of the REG network (Fig. 5B,C)66. The number of connections at each node was represented as its degree in the network65. Finally, the path length distribution was calculated to reveal the shortest route for any one node to reach another65. The final network visualization was generated using Cytoscape66.
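The degree distribution and a rough power-law exponent can be computed from an edge list as sketched below. This is an illustration under stated assumptions, not a reproduction of NetworkAnalyzer: it assumes the form P(k) ~ k^(-b) and estimates b by ordinary least squares on log-transformed values, and the function names and toy inputs are hypothetical.

```python
from collections import Counter
from math import log

def degree_distribution(edges):
    """Return {k: fraction of nodes with degree k} for an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(degree)
    hist = Counter(degree.values())
    return {k: c / n for k, c in sorted(hist.items())}

def fit_power_law_exponent(pk):
    """Least-squares slope of log P(k) against log k; returns b in P(k) ~ k^(-b).

    Needs at least two distinct degrees, otherwise the slope is undefined.
    """
    xs = [log(k) for k in pk]
    ys = [log(p) for p in pk.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Toy star network (hypothetical labels): one hub with three neighbours.
toy_pk = degree_distribution([("HUB", "a"), ("HUB", "b"), ("HUB", "c")])

# Synthetic distribution that follows P(k) = 0.5 / k exactly, so b should be close to 1.
toy_b = fit_power_law_exponent({1: 0.5, 2: 0.25, 4: 0.125})
```

A smaller fitted b means the distribution decays more slowly with k, which is consistent with the interpretation of a densely connected network given in the Results.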

Additional Information

How to cite this article: Zhao, M. et al. REGene: a literature-based knowledgebase of animal regeneration that bridges tissue regeneration and cancer. Sci. Rep. 6, 23167; doi: 10.1038/srep23167 (2016).