Introduction

Advances in computing power are accelerating the acquisition and processing of scientific data from wet and dry lab experiments. In biology and chemistry, a huge amount of literature is published every day and uploaded to databases in real time. Several databases are currently available that apply manual curation or in silico approaches to manage data in arbitrary forms1,2,3,4,5. Hence, finding meaningful information in large databases is almost like ‘finding a needle in a haystack’. Automation techniques, such as text-mining and natural language processing (NLP) methods, were therefore developed to help convert raw scientific texts into well-structured scientific data6,7. Recently, many machine learning techniques for NLP have been applied to significantly improve the text-mining performance of models8,9,10,11,12,13.

PubTator Central (PTC) is a state-of-the-art text-mining service for automated annotation of bioentities including genes/proteins, genetic variants, diseases, chemicals, species, and cell lines in about 30 million abstracts and 3 million full-text articles available in PubMed (https://www.ncbi.nlm.nih.gov/pubmed)14,15. The bioentities co-occurring in the body of literature and extracted by PTC can be re-organized as a complex network and used for literature-based discovery.

Herein, we introduce the bioentity extraction tool ChexMix and apply it to extract inter-relationships between medical subject headings (MeSH, https://www.ncbi.nlm.nih.gov/mesh) terms and taxonomy identifiers (TaxIDs) in the National Center for Biotechnology Information (NCBI) Taxonomy from the co-occurrence relationships in PTC16. ChexMix is an open-source Python module for accessing and processing various forms of data from multiple biomedical databases. It collects the bioentities, such as species, chemicals, and diseases, that were annotated by PTC in abstracts and/or free full-text biomedical articles stored in the PubMed database. ChexMix converts and links the bioentities found in the literature to the unique identifiers of the species (TaxID), chemicals (MeSH), and diseases (MeSH). The associations between these bioentities can be modeled as bipartite or multipartite networks, or as hierarchical trees, which helps users inspect and readily understand the overall structure of the information associated with the topics queried as keywords.

Each bioentity has its own hierarchical organization system according to the bioentity type. For taxonomy, species is located at the lowest rank of the taxonomic hierarchy and is nested within the higher ranks, such as genus and family. The hierarchical location of species bioentities and the distance between them in the taxonomic tree provide clues for discovering other hidden bioentities. Species under the same genus share more similar features with each other than with species under another genus. For chemicals, since similar structures can interact with proteins holding comparable binding pockets, the types of backbones or derivatives help inspect their physicochemical properties and/or biological roles. Moreover, annotation methods have been introduced to classify the structural types of chemicals in chemical profiling studies17,18,19. Even in the case of protein targets, the expression of genes helps to understand the physiological function of proteins, as well as to identify related physiological disorders and pathologies20,21.

ChexMix was developed to extract the bioentities based on the literature keywords, as well as the keywords entered by researchers (Fig. 1). ChexMix was therefore designed to help organize biomedical data into hierarchical knowledge based on topological similarities between bioentities. The use of co-occurring biomedical terms rests on the assumption that the bioentities in the same abstract or full text can be considered biologically or chemically related to each other2. These associations can then be visualized as network graphs or hierarchical trees and be more easily analyzed to uncover hidden insights from already existing knowledge.

Figure 1

Network and hierarchical tree of bioentities generated using ChexMix.

Results and discussion

ChexMix was designed for the extraction of hierarchical and topological information related to bioentities. To this end, ChexMix extracts the bioentities that co-occur with the keywords queried in PubMed and encodes them as unique identifiers that index their related information. The combination of a hierarchical representation with a mapping of bioentities to identifiers at each level allows the relationships between them to be organized and cross-referenced. For example, species resulting from keywords of interest, such as chemicals or diseases, can be hierarchically represented from the highest rank, ‘cellular organisms’, according to the phylogenetic taxonomic system of the NCBI taxonomy database16. The search results are arranged according to the hierarchical characteristics of each bioentity and can be displayed as plots for hierarchical data visualization or as nested lists (Fig. 1); the information can therefore be useful for inspecting related information among keywords of interest. Herein, ChexMix was applied to discover the biomedical sources of natural products that produce the bioactive compound amentoflavone, which has a wide range of biological activities, including antioxidative, anti-inflammatory, anticancer, antiviral, and antifungal properties22. This compound also shows potent antisenescence activity against ultraviolet B irradiation-induced skin aging, preventing nuclear aberrations23; thus, it can be used for the prevention of skin aging in the cosmetic industry.

First, 319 bioentities were extracted by ChexMix using the keyword ‘amentoflavone’ under the highest taxonomic rank, ‘cellular organisms’ (Fig. 2). Among them, 223 species belonging to the Viridiplantae (literally ‘green plants’) clade were targeted. It was possible to verify that those species co-occurred with amentoflavone in the same study and to investigate whether a plant species could produce amentoflavone (Supplementary Table S1).

Figure 2

(A) The recommendation process of Korean native plants related to the query keyword using ChexMix. (B) Network obtained by entering ‘amentoflavone’ as input keyword in ChexMix. The unique identifiers (TaxID, pale green nodes) for species co-existing with the input keyword in the literature are linked to their own taxonomic higher rank (genus, sky blue color). Orange nodes represent species names that only existed in the list of Korean medicinal plants of the KPEB and are linked to the nodes for genus to which each species belongs. (C) Detailed subnetwork under the Viburnum genus. Each node was displayed as ‘ID: name’ for TaxID and genus or species name. The networks were drawn by Gephi software (ver. 0.9.2, https://gephi.org/)30.

To avoid duplicating previous studies and to find novel bioactive sources, the analysis focused on the allied species belonging to the Viburnum genus, retrieving 19 samples of different parts of eight species native to Korea that had not previously been studied in relation to amentoflavone (Fig. 3, Supplementary Table S2). Next, the presence of amentoflavone in samples of these plants was evaluated and quantified by HPLC. The presence of amentoflavone was confirmed by its isotopic peak at 537.4 m/z [M + H] detected by liquid chromatography–mass spectrometry. Among them, the leaves of V. erosum contained the highest amount of amentoflavone (7.39 mg/g) compared with Selaginella tamariscina, which is the representative natural ingredient for anti-wrinkle effect and the major source of amentoflavone in the cosmetic industry24. Overall, summarizing hierarchical bioentity information with ChexMix is expected to help inspect the massive, sparse bioentity data in databases in future investigations.

Figure 3

(A) Chemical structure of amentoflavone. (B) Chromatograms of the five samples with the highest amentoflavone content determined as described in the “Methods” section. AMEN, amentoflavone; VCL, leaves of Viburnum carlesii; VDF, fruits of V. furcatum; VDSt, leaves of V. dilatatum; VEL, leaves of V. erosum; VESt, stems of V. erosum.

The performance of ChexMix was quantitatively evaluated based on the bioentities extracted using a set of keywords associated with the original keyword ‘amentoflavone’. Using ChexMix, 243 taxonomy networks were obtained from the MeSH terms of chemicals co-occurring with ‘amentoflavone’ in the literature, and they were analyzed using basic network properties and similarity metrics (Supplementary Table S3). The similarity metrics compared each of the 243 networks with the network of ‘amentoflavone’, where the number of true positives was calculated as the number of nodes common to both networks (Supplementary Table S3).
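The node-based comparison described above can be illustrated with a minimal Python sketch. Only the definition of true positives as common nodes comes from the text; the choice of precision, recall, and the Jaccard index as similarity metrics, the function name node_overlap_metrics, and the placeholder node labels are assumptions made for illustration and are not taken from ChexMix.

```python
def node_overlap_metrics(reference_nodes, candidate_nodes):
    """Compare two networks by their node sets (illustrative metrics only)."""
    reference = set(reference_nodes)
    candidate = set(candidate_nodes)

    # True positives: nodes present in both networks, as described in the text.
    true_positives = len(reference & candidate)
    union_size = len(reference | candidate)

    # Assumed metrics; the study does not state which ones were used.
    precision = true_positives / len(candidate) if candidate else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    jaccard = true_positives / union_size if union_size else 0.0

    return {
        "true_positives": true_positives,
        "precision": precision,
        "recall": recall,
        "jaccard": jaccard,
    }


# Placeholder node labels, not real TaxIDs.
reference_network = {"tax_1", "tax_2", "tax_3", "tax_4"}
candidate_network = {"tax_3", "tax_4", "tax_5"}
metrics = node_overlap_metrics(reference_network, candidate_network)
print(metrics)
```

In this toy case the two networks share two nodes, so the recall against the reference network is 0.5.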

ChexMix can also integrate the results from multiple keywords. The MeSH identifiers for bioentities co-occurring with the keywords of interest can be used to connect the results of two different queries (Fig. 4). For instance, two species names, Taxus cuspidata and Podophyllum peltatum, were queried in ChexMix, generating two small networks consisting of bioentities with MeSH identifiers extracted from PubTator. It was possible to inspect the bioentities co-occurring among the MeSH identifiers in the integrated network. The network of each species showed a different MeSH identifier profile, and MeSH identifiers related to ‘cancer’, in particular ‘ovarian neoplasms’, co-occurred. This agrees with the fact that paclitaxel of T. cuspidata and podophyllotoxin of P. peltatum are well-known potent anticancer drugs for ovarian cancer25,26,27.

Figure 4

(A) Acquired network using ‘taxus cuspidata’ and ‘Podophyllum peltatum’ as input keywords in ChexMix. MeSH terms co-occurring in the literature with the input keywords were reorganized according to the hierarchy rules of the MeSH Tree Structures in the MeSH browser (https://meshb-prev.nlm.nih.gov/treeView). The nodes of the co-occurred bioentities in both keywords are colored in orange. (B) Details of the subnetwork of the co-occurred bioentities in both keywords. Each node displays as ‘Tree Number: MeSH Heading’ for MeSH identifiers and a MeSH term. The networks were drawn by Gephi software (ver. 0.9.2, https://gephi.org/)30.

Here, we described a usage scenario in which ChexMix alleviates the complex task of compiling large data by narrowing down the scope of bioentities or grouping similar bioentities using their hierarchical relationships. First, to obtain the appearance counts of bioentities in the literature queried by keywords of interest, ChexMix collects PubMed and PMC literature, fetches the annotations within that data from PTC, and converts them into unique identifiers according to the respective bioentity class. ChexMix allows Boolean operators (‘AND’, ‘OR’, ‘NOT’), double quotes for phrases, and asterisks for truncated terms in PubMed literature searches. Each bioentity extracted by ChexMix is classified within more general categories of bioentity and arranged in a hierarchical structure.
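As an illustration of how such a query can be expressed, the following Python sketch builds a request URL for the public NCBI E-utilities ESearch endpoint. It does not show ChexMix's own search function, whose interface is not described here; the helper name build_esearch_url and the example query string are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Return an ESearch URL for a PubMed-style query (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


# Example query combining a quoted phrase, Boolean operators, and truncation.
# The query itself is made up for illustration.
query = '"amentoflavone" AND Viburnum* NOT review'
url = build_esearch_url(query)
print(url)

# Decoding the URL recovers the original query string unchanged.
decoded_term = parse_qs(urlparse(url).query)["term"][0]
print(decoded_term)
```

Building the URL separately from sending it keeps the query syntax easy to inspect before any literature is retrieved.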

When single or multiple keywords of interest are entered in ChexMix, the bioentities in all citations that contain the keywords are retrieved and automatically mapped to unique identifiers. The search results indicate the co-occurrence of bioentities in the available literature, which allows them to be linked into a co-occurrence network. ChexMix makes the process straightforward by managing data access from multiple sources and providing functions to manipulate the network data structure.
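The step from per-citation annotations to a co-occurrence network can be sketched as follows. This is a minimal illustration under stated assumptions, not the ChexMix implementation: the function name cooccurrence_edges, the input layout (a mapping from citation to annotated identifiers), and the placeholder identifiers are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(annotations_by_article):
    """Count how many articles mention each pair of identifiers together."""
    edge_counts = Counter()
    for identifiers in annotations_by_article.values():
        # Deduplicate within an article and sort so each pair has one key.
        unique_ids = sorted(set(identifiers))
        for pair in combinations(unique_ids, 2):
            edge_counts[pair] += 1
    return edge_counts


# Placeholder citations and identifiers, not real PubMed or MeSH records.
articles = {
    "article_1": ["TAX:A", "MESH:X", "MESH:Y"],
    "article_2": ["TAX:A", "MESH:X"],
    "article_3": ["TAX:B", "MESH:Y", "MESH:Y"],
}
edges = cooccurrence_edges(articles)
for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

Each edge weight is the number of citations in which the two bioentities appear together, so repeated mentions within one citation are counted once.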

The analysis is mainly focused on taxonomy terms to inspect the species that biologically affect physiological disorders or diseases within the network. Each taxonomy name in the search results is listed in a hierarchical form. Trivial bioentities are located at the higher ranks of the list. Other species located nearby in the obtained taxonomic tree are expected to have similar biological effects, representing potential alternative biomedical options. ChexMix can also generate the connections between taxonomic terms and MeSH identifiers, which are located under ‘Diseases [C]’ and ‘Chemicals and Drugs [D]’, in the same literature. MeSH identifiers co-occurring with a taxonomic term in the literature are expected to have a close relationship.

In Fig. 4, the intersection set of MeSH terms co-occurring with each taxonomy keyword is highlighted on the whole network resulting from the union set of two networks. Networks generated from a single keyword in ChexMix can be simply reprocessed by the combination of set operations, such as union, difference, and intersection with other networks. The re-organization of complex networks from single or multi keywords provides new insights or clues for bioentities in PubMed, the biggest biomedical database.
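The set operations mentioned above can be illustrated with plain Python sets. This sketch assumes each single-keyword network is a simple set of (keyword, term) edges; ChexMix's internal network data structure is not described here, and the helper build_star_network and all labels are placeholders.

```python
def build_star_network(keyword, cooccurring_terms):
    """Represent a single-keyword network as a set of (keyword, term) edges."""
    return {(keyword, term) for term in cooccurring_terms}


def terms_of(network):
    """Collect the term nodes linked to the keyword in a network."""
    return {term for _, term in network}


# Placeholder keywords and terms, not real MeSH identifiers.
network_a = build_star_network("keyword_A", {"term_1", "term_2", "term_3"})
network_b = build_star_network("keyword_B", {"term_2", "term_3", "term_4"})

# Union: the whole integrated network.
union_edges = network_a | network_b

# Intersection: terms co-occurring with both keywords (highlighted nodes).
shared_terms = terms_of(network_a) & terms_of(network_b)

# Difference: terms specific to the first keyword.
only_a_terms = terms_of(network_a) - terms_of(network_b)

print(sorted(shared_terms))
print(sorted(only_a_terms))
```

Keeping the union for display and the intersection for highlighting mirrors the layout described for Fig. 4.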

In the present study, we have focused on how to use ChexMix to construct a taxonomic tree or a co-occurrence network from multiple keywords, and to analyze the networks of bioentities identified by PTC. We designed ChexMix to adapt easily to diverse types of bioentities and to integrate other existing databases as well as recently introduced state-of-the-art text mining systems28. We hope ChexMix will be utilized by other researchers to integrate other datasets, and to manipulate and visualize the relationships between bioentities.

Methods

Data processing

ChexMix currently obtains biomedical data from multiple databases using their web application programming interfaces (APIs) or bulk data files. For example, the Entrez API allows users to query Entrez databases, such as PubMed and PubMed Central (PMC), using combinations of keywords. PTC provides a web API to fetch annotations of biomedical concepts, such as taxonomy and MeSH identifiers, in a publication. ChexMix also downloads and parses bulk data files from biomedical databases29. For example, ChexMix loads the data from PTC, including NCBI taxonomy and MeSH that inherently have relationships between entities therein, and transforms it into internal network data structures. ChexMix also makes it possible to construct, manipulate, and simplify the network data structures.
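To make the annotation-parsing step concrete, the sketch below reads annotations in the tab-separated PubTator text format, in which each annotation line holds the PMID, start offset, end offset, mention, entity type, and identifier. Whether ChexMix consumes this exact format is an assumption; the parser name parse_pubtator is hypothetical, and the sample record, including its PMID and the MeSH placeholder, is invented for illustration.

```python
from collections import namedtuple

Annotation = namedtuple(
    "Annotation", "pmid start end mention entity_type identifier"
)


def parse_pubtator(text):
    """Extract annotation rows from PubTator-format text (illustrative)."""
    annotations = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            # Skip title (|t|), abstract (|a|), blank, and other lines.
            continue
        pmid, start, end, mention, entity_type, identifier = fields
        annotations.append(
            Annotation(pmid, int(start), int(end), mention, entity_type, identifier)
        )
    return annotations


# Invented toy record; the PMID and MESH:D000000 are placeholders.
SAMPLE = (
    "0000001|t|Toy title about human cells\n"
    "0000001|a|Toy abstract mentioning a compound.\n"
    "0000001\t16\t21\thuman\tSpecies\t9606\n"
    "0000001\t54\t62\tcompound\tChemical\tMESH:D000000\n"
)
annotations = parse_pubtator(SAMPLE)
species_ids = [a.identifier for a in annotations if a.entity_type == "Species"]
print(species_ids)
```

Filtering by entity type, as in the last step, is one simple way to separate taxonomy identifiers from MeSH identifiers before building a network.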

Bioentities extraction and visualization

The keywords of interest can be input as single words or phrases. The results are output in hierarchical tree format according to the taxonomic or hierarchical organization rules of each type of bioentity. In the case of taxonomy information, species names in the literature are encoded into unique identifiers, TaxIDs, and hierarchically re-organized according to the classification rules of NCBI taxonomy. In the present study, hierarchical results were applied to discover relevant species with lower taxonomic ranks (family and genus levels) using the list of the Korean medicinal plants of the Korea Plant Extract Bank (KPEB). The results were visualized in the network format using the Gephi software (ver. 0.9.2).
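The hierarchical re-organization described above can be sketched by merging lineages into a nested tree. This is an illustrative example only: the functions build_tree and render are hypothetical, the lineages are abbreviated (intermediate ranks are omitted), and names are used in place of TaxIDs for readability.

```python
def build_tree(lineages):
    """Merge root-to-species lineages into one nested dictionary."""
    tree = {}
    for lineage in lineages:
        node = tree
        for name in lineage:
            # Reuse an existing branch or create a new one.
            node = node.setdefault(name, {})
    return tree


def render(tree, depth=0):
    """Return the tree as indented lines, one node per line."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * depth + name)
        lines.extend(render(tree[name], depth + 1))
    return lines


# Abbreviated lineages; intermediate taxonomic ranks are left out.
lineages = [
    ["cellular organisms", "Viridiplantae", "Viburnum", "Viburnum erosum"],
    ["cellular organisms", "Viridiplantae", "Viburnum", "Viburnum dilatatum"],
    ["cellular organisms", "Viridiplantae", "Selaginella", "Selaginella tamariscina"],
]
tree = build_tree(lineages)
print("\n".join(render(tree)))
```

Species that share a genus end up under the same branch, which is what makes neighboring candidates easy to spot in the nested list.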

Sample preparation

To prove the usefulness of ChexMix, 18 Viburnum samples, including V. carlesii, V. dilatatum, V. wrightii, V. sargentii, V. opulus, V. furcatum, and V. awabuki, were purchased from the KPEB of the Korea Research Institute of Bioscience and Biotechnology, Korea. V. erosum was collected from the Medicinal Herb Garden of the College of Pharmacy, Seoul National University, Korea and deposited in the Medicinal Herbarium of Kangwon National University with the accession number KNUVE-1. The use of plants in the present study complies with the guidelines of the Medicinal Herbarium of Kangwon National University. Each powdered sample (1 g) was extracted using 80% methanol for 1 h using an ultrasonic apparatus and the filtrate was dried using a vacuum rotary evaporator. Next, the samples were dissolved in 100% methanol at a concentration of 10 mg/mL and filtered through a 0.45 μm polytetrafluoroethylene membrane before analysis.

High-performance liquid chromatography (HPLC) analysis

The samples were analyzed on an HPLC system equipped with a 1260 quaternary pump, an autosampler, and a multiple wavelength detector (Agilent Technologies, Santa Clara, CA, USA). Chromatographic separation was performed using a Hector‐M C18 column (250 × 4.6 mm I.D.; 5 μm, RSTech, Daejeon, Korea). The ultraviolet detector was set at a wavelength of 260 nm. The mobile phase was a gradient solvent system consisting of solvent A (0.1% formic acid in water) and solvent B (MeOH) as follows: isocratic 95% A (0–10 min), linear gradient 95–80–30% A (10–20–30 min) and isocratic 30% A (30–40 min). The flow rate was 1.0 mL/min, and aliquots of 10 μL were injected using an autosampler.
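The gradient program can be written as a piecewise-linear function of time, as in the short Python sketch below. It assumes that ‘95–80–30% A (10–20–30 min)’ means a linear change from 95% to 80% A between 10 and 20 min followed by a linear change from 80% to 30% A between 20 and 30 min; the names GRADIENT and percent_a are introduced here for illustration only.

```python
# (time in min, % solvent A) breakpoints, under the reading stated above.
GRADIENT = [(0, 95), (10, 95), (20, 80), (30, 30), (40, 30)]


def percent_a(t):
    """Return the percentage of solvent A at time t (min) by interpolation."""
    if not GRADIENT[0][0] <= t <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-40 min program")
    for (t0, a0), (t1, a1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t <= t1:
            # Linear interpolation within the current segment.
            return a0 + (a1 - a0) * (t - t0) / (t1 - t0)


for minute in (5, 15, 25, 35):
    a = percent_a(minute)
    print(minute, a, 100 - a)  # time, % A, % B
```

Under this reading, the mobile phase at 15 min is 87.5% A and 12.5% B, and at 25 min it is 55% A and 45% B.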