ARN: Analysis and Visualization System for Adipogenic Regulation Network Information

Adipogenesis is the process of cell differentiation through which preadipocytes become adipocytes. Considerable research is under way to identify genes, including their gene products and microRNAs, that correlate with fat cell development. However, information fragmentation hampers the identification of key regulatory genes and pathways. Here, we present a database of literature-curated adipogenesis-related regulatory interactions, designated the Adipogenesis Regulation Network (ARN, http://210.27.80.93/arn/), which currently contains 3101 nodes (genes and microRNAs), 1863 regulatory interactions, and 33,969 expression records associated with adipogenesis, based on 1619 papers. A sentence-based text-mining approach was employed for efficient manual curation of regulatory interactions from approximately 37,000 PubMed abstracts. Additionally, we determined 13,103 possible node relationships by searching miRGate, BioGRID, PAZAR and TRRUST. ARN also has several useful features: i) regulatory map information; ii) tests to examine the impact of a query node on adipogenesis; iii) tests for the interactions and modes of a query node; iv) prediction of interactions of a query node; and v) analysis of experimental data or the construction of hypotheses related to adipogenesis. In summary, ARN can store, retrieve and analyze adipogenesis-related information as well as support ongoing adipogenesis research and contribute to the discovery of key regulatory genes and pathways.

Scientific Reports | 6:39347 | DOI: 10.1038/srep39347
ARN provides an online tool with filtering and analysis functions, suggesting that it will be a useful benchmark for the development of hypotheses regarding adipogenesis.

Results
Database Description. The homepage of the database provides a visualization of the adipogenesis regulation network, which consists of the 50 nodes with the largest numbers of connections. Users can choose the number of nodes they wish to view. The color and shape of a node are determined by its classification and function, and the color and shape of a link are determined by its impact and mode of action (Fig. 1).
The node page consists of six sections (Fig. 2). The first section lists general information for the requested gene or microRNA. The second section contains a list of sentences describing the gene or microRNA in the context of adipogenesis and the corresponding PMID. The third section contains a table that shows the expression of the node under different conditions. The fourth section contains a table showing SNPs associated with the node. The fifth section consists of a relationship chart and a visualization network, and the color and shape of the node and the link are identical to those of the homepage. The final section is a relationship table that can be filtered according to potential impact (e.g., activation or inhibition), mode of action (e.g., DNA binding or epigenetic modification) and test method (e.g., ChIP or siRNA). Users can also order the results by the impact factor (IF) of the source and target nodes. Moreover, possible relationships (TFs and miR targets) associated with the node are accessible in the sixth section of the page, which contains a visualization network in addition to the table. New predictions are shown with bold black links.
On the maps page (Fig. 3①), we provide images collected from review papers on adipogenesis. These images were divided into six categories: epigenetic modification, transcription regulation, signal transduction, miR, cell growth and others. Below every picture, a table lists all of the nodes in the picture. By clicking on a gene symbol, the user is directed to the node page for the specific gene. We also provide a network of these nodes based on our database.
The literature page (Fig. 3②) provides basic information about the articles. All papers were divided into four categories (review, article, SNPs and high-throughput) according to their contents and results. We then manually extracted the materials and methods used in each paper.
Moreover, the expression of genes involved in adipogenic differentiation progression is available on the expression page (Fig. 3③). Users can view a line chart by clicking on the button following it. Expression data were collected from many different papers. Comparisons of these data facilitate access to different perspectives to understand gene functions.
Figure 1. Visualization of 50 hub nodes in adipogenesis. This database screenshot shows the main home page. ① Visualization network of the top 50 highly connected nodes. ② Here, the user chooses the number of the nodes that he or she wants to see. ③ The color of the node is determined by its classification. ④ The shape of the node is determined by its function during the process of adipogenesis. ⑤ The relationship type determines the color of the links. ⑥ The relationship name determines the shape of the link. (If the user selects the check box in the map, the display content can be customized.)
We also provide a download page (Fig. 3④). Users can choose one class of genes (e.g., "transcription factors" under "Classification" and "promoters of adipogenesis" under "Differentiation Direction") and then download the GeneIDs, symbols and PMIDs for related papers. This information can also be directly used to search other databases.
If we have missed specific genes or publications regarding adipogenesis, users are welcome to send suggestions via the ARN message board, and we are pleased to add them to the database. A graphical guide of the ARN database is available for download on the database website at http://210. 27 .
For gene searches, Entrez GeneID and official gene symbols are accepted. MicroRNAs require names of mature microRNA sequences (e.g., mirn143). Literature searching requires a PubMed PMID. Users can select their requested entry, and the results page is displayed. In practice, the most important contents are the four following types of information.
i) Regulatory map information. The ARN Map page provides graphics summarized by experts in the field of adipogenesis.
ii) Impact of a query node on adipogenesis. The "IF" value measures the degree of influence, while "differentiation direction" represents the nature of the impact; for example, circular nodes indicate that the node promotes adipogenesis, whereas triangular nodes indicate that the node inhibits fat formation.
iii) Interactions and their mode for a query node. In the relationship chart for an ARN Node page, the shapes and colors of the links represent information on interactions and their modes.
iv) Prediction of interactions of a query node. The Prediction Chart contains the predicted relationships for a query node based on four external databases (miRGate, BioGRID, PAZAR and TRRUST); bold links show prediction relationships that are new, whereas gray links indicate that these prediction relationships have been verified to be involved in the regulation of adipogenesis.
When we searched "PPARg" in NCBI PubMed, we obtained more than 900 results, which users would then have to read through one by one. When we searched "PPARg" in the ARN database, the results page included six sections, as shown in Fig. 2 and Table 1, with data collected from seven websites (NCBI-Gene, miRBase, NCBI-PubMed, miRGate, PAZAR, TRRUST, and BioGRID) and 109 papers. Among the sections on the results page, "NCBI gene" and "Literature summary" provide an overall summary of the PPARg gene and a summary from a professional point of view, respectively; "Node Expression" and "Relation Chart" show what is already known about PPARg; and finally, "Prediction Chart" lets us identify potential new lines of study on the regulation of adipogenesis. Furthermore, users can sort by the "IF" value of the nodes in the "Prediction Table" to select the most important predictions. Table 2 provides examples of prediction results. For example, the results indicate that Pan et al. 13 demonstrated that both E2F1 and CEBPd are involved in the transcriptional regulation of PPARg in cancer cells in the process of apoptosis. Thus, researchers can design experiments to verify the effects of E2F1 and CEBPd on adipogenic differentiation via PPARg.
Analysis of experimental data and construction of hypotheses. Currently, the database contains over 53,000 records. Such a large amount of information represents a solid foundation for analysis and prediction. The ARN database provides two useful analytical tools for the user: (1) the 'IF' value of each node allows us to gauge the extent of the impact of the node on adipogenesis, whereas (2) the ARN Analysis page allows users to perform analyses based on a node, a class of nodes, an article or a specific node set (see Fig. 4). For example, Chartoumpekis et al. 14 (PMID: 22496873) analyzed the miRNA expression profile of adipose tissue after long-term high-fat diet-induced obesity in mice using microarray analysis and identified 25 differentially expressed microRNAs. First, we need to rapidly screen miRNAs to identify those that are highly correlated with adipogenesis. The 'IF' value is very useful in this case, as a greater 'IF' of a node corresponds to a greater effect on adipogenesis. Table 3 shows detailed information. Four out of 10 up-regulated microRNAs have been confirmed to promote or inhibit adipogenesis, whereas 10 out of 15 down-regulated microRNAs have been confirmed to promote or inhibit adipogenesis. Once we have identified the object of study, the 'ARN Analysis' page is useful. Next, we need to identify the intersection between their target genes and pro-osteogenesis genes or the intersection between their target genes and anti-adipogenesis genes (see Fig. 5). 'ARN Analysis' is helpful for identifying these intersections, and we can obtain the results shown in Table 4 (Analysis steps: see Supplementary materials "ARN Handbook", Example 4).
Table fragment: [...] 130; 6, Prediction Chart and Table, 80; Total, 297.

Discussion
There is ongoing research to detect genes or pathways that are frequently altered in adipogenesis. Identification of such genes and pathways becomes more complicated due to the ever increasing body of literature containing adipogenesis studies, making literature searches highly time-consuming. Therefore, it is necessary to structure the existing knowledge of genes and microRNAs associated with adipogenesis. To this end, we developed the ARN database to provide a review of the current state of adipogenesis research, and we have made this information easily accessible to researchers.
Hub nodes in adipogenesis. The ultimate aim of adipogenesis research is to understand the molecular mechanisms underlying the biology of obesity to discover innovative prognostic and/or predictive biomarkers. Table 5 lists the top 50 genes or microRNAs and the corresponding number of relationship records. This table is ranked according to the possible impact of the genes or microRNAs. Until now, prognostic predictions or therapeutic stratification of obesity have not been based on biomarkers. However, the table suggests many promising candidates that should be further investigated, potentially in clinical studies.
Target control of adipogenesis genes. Target control refers to the control of a subset of target nodes (or a subsystem) that are essential for a system's mission pertaining to a selected task 15 . If we know all the relationships for a given node, then we may understand how to control it. The ARN database provides an overall view of each node in the adipogenesis regulation network. As shown in Fig. 2 for the node PPARg, there is a map comprising the full life cycle of this protein, from epigenetic modification of its chromatin 16-20 , transcriptional regulation of its promoters 21-27 , post-transcriptional regulation by microRNAs 28-31 and phosphorylation of the protein by signal factors 32,33 , through transcription initiation, to final degradation. Such detailed knowledge of PPARg may help us design an ideal path for its control.
Table 3. Twenty-five differentially expressed microRNAs obtained by Chartoumpekis DV. MicroRNAs displayed in bold were selected for further analysis.

Future directions. Mesenchymal stem cells (MSCs), the precursors of adipocytes, can also differentiate into osteoblasts, chondrocytes and myoblasts. Understanding the factors that govern MSC differentiation has significant implications in diverse areas of human health, from obesity to osteoporosis to regenerative medicine 34 . Thus, we would like to add these MSC differentiation factors into our network in the future. Moreover, it was recently shown that long non-coding RNA (lncRNA) is involved in the regulation of adipogenic differentiation 35,36 ; thus, lncRNA data will be added as soon as they are available. In addition, information regarding the institutions involved in the papers included in the database will soon be available for visualization, and we expect that this will promote the exchange of ideas, project cooperation and resource sharing between institutions. We plan to update the database monthly to provide state-of-the-art knowledge and track improvements in the field. All recently added data will be displayed separately on a corresponding page. We hope that the ARN database will serve as a platform for information and hypothesis generation for the research community and will aid in elucidating the complexity of adipogenesis-related mechanisms, pathways and processes.

Methods
The ARN database aims to provide a high-quality collection of genes, microRNAs and relationships implicated in the regulation of adipogenesis, as reviewed by experts in the field. The data collection and processing steps are illustrated in Fig. 6. The workflow comprised four major steps as follows.
Step one: construction of a text-mining association network using the Agilent Literature Search plugin 37 .
Step two: manual review, annotation and extension.
Step three: information storage and visualization.
Step four: design of the analysis tool.
The retrieved abstracts were parsed into sentences and analyzed for known interaction terms, such as 'binding' or 'activate'. Agilent Literature Search uses a lexicon set for defining gene names (concepts) and aliases, drawn from Entrez Gene, and interaction terms (verbs) of interest. An association was extracted for every sentence containing at least two concepts and one verb. Associations were then converted into interactions with corresponding sentences and source hyperlinks and added to a Cytoscape network. To choose key gene sets, we conducted a two-step procedure. In the first step, we established 47 key genes via a literature review 5 . This candidate set was updated by incoming nodes from post-manual curation. In the second step, we prioritized the remaining "candidate nodes" by scoring them based on the frequency of each node in all regulatory interactions; 53 new "candidate nodes" were used to search for candidate sentences for the next round of manual curation. The final download of abstracts was executed on 29 October 2015. In total, 9908 PubMed abstracts were obtained and served as the initial corpus for further processing. False positives in the text-mining results would not affect the quality of our database because molecule-molecule interactions were confirmed by manual curation.
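The association rule described above (a sentence yields an association when it contains at least two concepts and one interaction verb) and the frequency-based scoring of candidate nodes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Agilent Literature Search implementation: the lexicon, the naive sentence splitter and the function names `extract_associations` and `node_frequency` are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicon; the real tool draws names and aliases from Entrez Gene.
CONCEPTS = {"PPARg", "CEBPa", "E2F1", "miR-143"}
VERBS = {"binding", "binds", "activate", "activates", "inhibits"}


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def extract_associations(abstract):
    """Return (concept, concept, verb, sentence) tuples for qualifying sentences."""
    associations = []
    for sentence in split_sentences(abstract):
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence))
        concepts = sorted(tokens & CONCEPTS)
        verbs = sorted(tokens & VERBS)
        # Rule from the text: at least two concepts and one interaction verb.
        if len(concepts) >= 2 and verbs:
            for a, b in combinations(concepts, 2):
                associations.append((a, b, verbs[0], sentence))
    return associations


def node_frequency(associations):
    """Count how often each node appears across all extracted associations."""
    counts = Counter()
    for a, b, _verb, _sentence in associations:
        counts[a] += 1
        counts[b] += 1
    return counts


abstract = (
    "PPARg activates CEBPa during differentiation. "
    "E2F1 is expressed early. "
    "E2F1 binds PPARg."
)
associations = extract_associations(abstract)
frequency = node_frequency(associations)
```

In this toy run only the first and third sentences qualify, and PPARg receives the highest frequency. Every extracted association keeps its source sentence, so a curator can confirm or reject it by hand.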
Information processing and analysis. During the manual review, annotation and extension step, the reviewers verified the specific genes, microRNAs and their relationships recognized in the abstracts. Additionally, information regarding experimental settings, node classification, function and adipogenic impact was marked. For each paper in the ARN database, the experimental settings comprised the experimental procedure, the names of cell lines and types of samples. Occasionally, the hidden value of a dataset can only be revealed by combining it with another, potentially very different, dataset. We screened data from 4 external databases and obtained more than 10,000 prediction results from among over 1 million interaction records (Table 6). Using "miRGate" as an example, the screening process was as follows (Fig. 7). The workflow comprised five major steps as follows.
Step one: We obtained 385 miRNAs and 2671 associated genes from the ARN database.
Step two: These genes were submitted to the miRGate database.
Step three: The predicted results were downloaded.
Step four: To obtain high-efficacy targets, we excluded target predictions with fewer than 3 computational predictions 9 .
Step five: We used the 385 miRNAs recorded in the ARN database to screen the predictions.
Finally, we obtained 8171 miRNA-Target prediction records, and after manual data cleaning, these were uploaded to the ARN database. The other three databases underwent similar screenings. In the future, when a new database appears, we will be able to add data associated with adipogenic differentiation to the ARN database within a short period of time utilizing this method.
Table 6. Four external databases.
Figure 7. Screening the data from miRGate.
Step ①: Screen all of the genes and miRs in the ARN database.
Step ②: Submit these genes to the miRGate database.
Step ③: Download the set of results.
Step ④: Exclude target predictions with fewer than 3 computational predictions.
Step ⑤: Screen the predictions with miRs in the ARN database.
Design analysis tool. Based on Swanson's discovery process, Weeber et al. 38 defined two types of knowledge discovery approaches: open discovery and closed discovery. An open discovery process is used to generate a hypothesis (Fig. 8a). For a given starting concept C, concepts that co-occur with C in the literature (called linking concepts B) are found. Concepts that co-occur with linking concepts B (called target concepts A) are then similarly found, bearing in mind that concepts A should not co-occur with starting concept C. This process can be described as C → B → A. A closed discovery process is used to test a hypothesis (Fig. 9). For two given concepts C and A, a researcher would like to determine whether or not hidden links exist between them. The more links that are found between A and C, the more likely it is that the tested hypothesis is correct. This process can be described as C → B ← A.
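The two discovery processes can be made concrete with a small Python sketch over a co-occurrence map. It is a minimal illustration, assuming that co-occurrence is stored as a symmetric dictionary of sets; the concept names and the functions `open_discovery` and `closed_discovery` are hypothetical and are not part of ARN.

```python
def open_discovery(cooccur, c):
    """C -> B -> A: map each target concept A to the linking concepts B that reach it.

    Concepts that already co-occur with C are excluded from the targets.
    """
    linking = cooccur.get(c, set())
    targets = {}
    for b in linking:
        for a in cooccur.get(b, set()):
            if a != c and a not in linking:
                targets.setdefault(a, set()).add(b)
    return targets


def closed_discovery(cooccur, c, a):
    """C -> B <- A: linking concepts B that co-occur with both C and A."""
    return (cooccur.get(c, set()) & cooccur.get(a, set())) - {c, a}


# Hypothetical symmetric co-occurrence map: C is linked to A1 and A2 only via B1 and B2.
COOCCUR = {
    "C": {"B1", "B2"},
    "B1": {"C", "A1", "B2"},
    "B2": {"C", "A1", "A2", "B1"},
    "A1": {"B1", "B2"},
    "A2": {"B2"},
}

hypotheses = open_discovery(COOCCUR, "C")
support_a1 = closed_discovery(COOCCUR, "C", "A1")
support_a2 = closed_discovery(COOCCUR, "C", "A2")
```

Here A1 is reached through two linking concepts and A2 through one, so under the reasoning above the hypothesis linking C and A1 has more support.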
We adopted the open discovery process to design a two-step discovery approach (Fig. 8b). Here, concept C is adipogenesis.
Step 1 selects the nodes (called linking concepts B) that have specific effects on C. In Step 2, the second round of screening identifies concepts that co-occur with linking concepts B (called target concepts A).
Figure 9. Closed discovery process 38 . The process is a two-way discovery process starting from A and C simultaneously, followed by the discovery of intersection B.
We adopted the closed discovery process to design a discovery approach that identifies the intersections of two or more result sets. As shown in Fig. 5, we can obtain multiple result sets via the open discovery approach, and their intersections can be identified by ARN Analysis.
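A minimal sketch of this workflow, assuming that nodes carry a differentiation direction and that relationships are stored as (source, target) pairs, is given below. The data, field names and functions are hypothetical placeholders rather than the ARN Analysis implementation.

```python
def select_nodes(nodes, direction):
    """Step 1: nodes with a given effect on adipogenesis (linking concepts B)."""
    return {name for name, info in nodes.items() if info["direction"] == direction}


def targets_of(relations, sources):
    """Step 2: targets reachable from any of the given source nodes."""
    return {target for source, target in relations if source in sources}


def intersect_result_sets(*result_sets):
    """Intersection of two or more result sets; empty when no sets are given."""
    if not result_sets:
        return set()
    return set.intersection(*result_sets)


# Hypothetical placeholder data.
NODES = {
    "geneA": {"direction": "promote"},
    "geneB": {"direction": "inhibit"},
    "geneC": {"direction": "promote"},
}
RELATIONS = [("miR-x", "geneA"), ("miR-x", "geneB"), ("miR-y", "geneC")]

promoters = select_nodes(NODES, "promote")
mir_x_targets = targets_of(RELATIONS, {"miR-x"})
shared = intersect_result_sets(promoters, mir_x_targets)
```

With these placeholder data, the intersection contains only geneA, the single promoter of adipogenesis that is also a target of miR-x.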
In the field of literature-based hidden knowledge discovery, popular methods based on co-occurrence produce too many target concepts, leading to the decreased ranking of potentially relevant target concepts. In this paper, we propose a new method for choosing useful and promising linking concepts. This method calculates an "IF" value for each node from its relationship, expression and prediction records. In the formula, IF(i) denotes the effect of node i on adipogenic differentiation. Ri indicates the number of relationships of node i, and Rmax indicates the number of relationships of node r-max, which has the greatest number of relationships. Ei indicates the number of expression records of node i, and Emax indicates the number of expression records of node e-max, which has the greatest number of expression records. Pi indicates the number of prediction records of node i, and Pmax indicates the number of prediction records of node p-max, which has the greatest number of prediction records. All values have been updated within the database, meaning that the information it contains is comprehensive and timely.
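As a rough illustration of how such a score could be computed from these quantities, the sketch below assumes, purely for illustration, an unweighted sum of the three ratios Ri/Rmax, Ei/Emax and Pi/Pmax. The way ARN actually combines or weights these terms may differ, and the function name `if_scores` and the example counts are hypothetical.

```python
def if_scores(records):
    """Compute an illustrative score per node.

    records maps node -> (relationships, expression records, prediction records).
    ASSUMPTION: the three normalized ratios are simply summed with equal weight.
    """
    r_max = max(r for r, _, _ in records.values())
    e_max = max(e for _, e, _ in records.values())
    p_max = max(p for _, _, p in records.values())

    def ratio(value, maximum):
        # Guard against a maximum of zero (no records of that type at all).
        return value / maximum if maximum else 0.0

    return {
        node: ratio(r, r_max) + ratio(e, e_max) + ratio(p, p_max)
        for node, (r, e, p) in records.items()
    }


# Hypothetical counts for two nodes.
RECORDS = {"node1": (10, 4, 0), "node2": (5, 8, 2)}
scores = if_scores(RECORDS)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Normalizing each count by its maximum keeps the three record types on a comparable scale, so a node with many expression records does not dominate the ranking solely because expression records are the most numerous.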