Introduction

Biodiversity is the variety of all life forms, including the morphological diversity of individuals and populations within a species, the taxonomic diversity of species within a community or ecosystem, the functional diversity of groups of species within an ecosystem, and the diversity of ecosystems themselves1. The total number of species across all kingdoms of life on Earth has been estimated at approximately 8.7 million2,3, and it is remarkable that this vast number of species is highly concentrated in specific areas. These regions are particularly important for biodiversity conservation and are called biodiversity hotspots4. In addition, Bolivia, Brazil, China, Colombia, Costa Rica, the Democratic Republic of Congo, Ecuador, India, Indonesia, Kenya, Madagascar, Malaysia, Mexico, Peru, the Philippines, South Africa, and Venezuela are considered megadiverse countries5. Peru occupies seventh place in this group, as it possesses 28 of the 32 climates existing in the world and 84 of the 103 life zones known on Earth. This is reflected in the country's 25,000 plant species, or 10% of the total number of species worldwide, of which 30% are endemic, and in its endemic animal species, which include 115 birds, 109 mammals, and 185 amphibians, representing 6, 27.5, and 48.5% of the total number worldwide, respectively6,7.

Biodiversity conservation is important because plants, animals, and other life forms such as bacteria, archaea, protozoa, and fungi are used directly or indirectly to produce pharmaceuticals and are valued for their scientific interest, among other resources8. Drugs derived from natural products (NPs) that were introduced to the market over a period of forty years have represented a significant source of new pharmacological entities9. The Peruvian population uses approximately 5000 Peruvian plants for 49 purposes or applications, and about 1400 of these species are described as medicinal10,11,12,13. The contribution of traditional Peruvian medicine is embodied by quinine, a component of the bark of the cinchona tree (Cinchona officinalis) employed in the treatment of malaria14. Two other valuable contributions to modern pharmacopeias can be mentioned: the coca plant (Erythroxylum coca), from which cocaine was first isolated and which later led to local anesthetics15, and the balsam of Peru (Myroxylon balsamum), which was widely used for the treatment of wounds16. However, the potential of Peruvian NPs remains underexploited, even though most of these useful native species can be domesticated or semi-domesticated17. In addition, the amount and nature of the experimental evidence published on active NPs are still limited18: most current studies report the medicinal activity of crude extracts, while potentially active NPs have been isolated from only a small number of plants19.

Computer-aided drug design (CADD), one of the key approaches in modern pre-clinical drug discovery, can be defined as the set of computational methods applied to discover, develop, and analyze drugs and active molecules20. Among the approaches that comprise CADD, virtual screening is one of the major contributors, since it stands as a complementary approach to experimental in vitro high-throughput screening (HTS) for hit identification and optimization21. Integrating CADD approaches with curated databases, described as well-organized collections of data in any field, may speed up the drug development process and reduce its cost22. Accordingly, large databases containing NPs from various data sources have been released, such as the COlleCtion of Open Natural prodUcTs (COCONUT), which contains 406,076 unique “flat” NPs and a total of 730,441 NPs where stereochemistry has been preserved23, and the LOTUS initiative, which has 750,000 referenced structure-organism pairs24. Several NP compound databases from particular geographical locations have also been assembled, such as the Traditional Chinese Medicine (TCM) Database@Taiwan, containing approximately 58,000 molecules25; the Indian Medicinal Plants, Phytochemistry and Therapeutics 2.0 (IMPPAT 2.0) database, which contains more than 10,000 phytochemicals26; and AfroDB, which is composed of around 1000 NPs27. Likewise, some countries in Latin America have published their own public NP databases, such as NuBBEDB, which contains more than 2000 NPs28, and SistematX, which contains more than 2500 NPs29, both from Brazil, and BIOFACQUIM from Mexico, which contains a total of 531 molecules30. Furthermore, NP databases have been used as repositories to identify several promising candidates for further development for the treatment of diseases31, such as Chagas disease32,33, tuberculosis34, leishmaniasis35,36, schistosomiasis37, and COVID-1938.
The present work introduces the first version of the Peruvian Natural Products Database (PeruNPDB) and describes its assembly, curation, and chemoinformatic characterization in terms of molecular diversity and coverage of chemical space. The database is freely available through the web interface PeruNPDB Explorer (https://perunpdb.com.pe/). We anticipate that the PeruNPDB will support further virtual screening studies aimed at discovering innovative pharmacological entities, enable other biotechnological applications, and serve as a source of information for conservation guidelines.

Methods

Search strategy, study selection, and data extraction

A systematic review search strategy to examine the literature for studies describing NPs from Peruvian sources was adapted from a previous study39. PubMed, the main database for the health sciences, is maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) and contains about 32 million citations from more than 5300 journals currently indexed in MEDLINE40. It provides uniform indexing of the biomedical literature through the Medical Subject Headings (MeSH terms), a controlled vocabulary that describes the topic of a paper consistently and uniformly41. First, to find terms associated in the literature with Peruvian NPs, the MeSH terms “Peru” AND “Natural Products” were used in a search of the PubMed database (https://pubmed.ncbi.nlm.nih.gov/; last searched on 10 June 2022). The results were plotted as a network map of the co-occurrence of MeSH terms in the VOSviewer software (version 1.6.17)42, which employs a modularity-based clustering algorithm to measure the strength of clusters43. The resulting cluster content was analyzed to select relevant studies associated with Peruvian NPs. The studies were selected in three phases. First, papers written in languages other than English, duplicate articles, reviews, and meta-analyses were excluded. Next, the titles and abstracts of the publications identified through the search were visually evaluated. Finally, the highly relevant full studies were retrieved and separated from the papers whose title or abstract did not provide enough information to be included. The data extracted from each study comprised the characterization of the NPs as well as the genus and species of the sources from which the NPs were isolated.
Additionally, the bibliographic reference information was extracted for every study considered that described compounds derived from Peruvian natural sources.
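The co-occurrence counting that underlies a keyword network map can be illustrated with a minimal Python sketch. This is not the VOSviewer implementation; the function name `cooccurrence`, the `min_occurrences` parameter, and the toy records are hypothetical and serve only to show how keywords below an occurrence threshold are dropped before links between the remaining keywords are counted.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(records, min_occurrences=5):
    """Count keyword co-occurrences across records (hypothetical helper).

    records: iterable of keyword lists, one list per paper.
    Keywords found in fewer than `min_occurrences` records are discarded.
    Returns the set of kept keywords and a Counter of keyword-pair links.
    """
    # Count each keyword once per record.
    term_counts = Counter(term for terms in records for term in set(terms))
    kept = {term for term, n in term_counts.items() if n >= min_occurrences}

    links = Counter()
    for terms in records:
        present = sorted(set(terms) & kept)
        # Every unordered pair of kept keywords in one record adds one link.
        for a, b in combinations(present, 2):
            links[(a, b)] += 1
    return kept, links


# Toy data (assumed for illustration only, not taken from the PubMed search).
toy_records = [
    ["Peru", "Plant Extracts", "Humans"],
    ["Peru", "Plant Extracts"],
    ["Peru", "Flavonoids"],
]
kept_terms, link_counts = cooccurrence(toy_records, min_occurrences=2)
```

The link counts obtained this way are the kind of input a clustering step would operate on; the clustering itself is outside the scope of this sketch.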

PeruNPDB assembly and molecular property calculation

The simplified molecular-input line-entry system (SMILES)44 notations of the compounds described in the studies selected in the previous step were searched for and retrieved from the PubChem45, DrugBank46, or ChEMBL47 servers, while for NPs that were unavailable in these repositories the ChemDraw tool48 was employed to generate the SMILES notation. The Osiris DataWarrior v05.02.01 software49 was then employed to generate the dataset’s structure data files (SDFs). The files were uploaded to the Konstanz Information Miner (KNIME) Analytics Platform50, where the “Molecular Type Cast” and “RDKit Structure Normalizer” KNIME nodes were employed to curate the chemical structures in the dataset. Moreover, every compound in the dataset was classified with either NPClassifier51, which employs a biosynthetic ontology specific to natural products, or ClassyFire52, a general classification system for small molecules based on the ChemOnt ontology. The KNIME “RDKit Descriptor Calculator” node was employed to calculate six physicochemical properties of therapeutic interest for the PeruNPDB compounds, namely molecular weight (MW), octanol/water partition coefficient (clogP), topological polar surface area (TPSA), aqueous solubility (clogS), number of H-bond donor atoms (HBD), and number of H-bond acceptor atoms (HBA). The statistical analysis was performed in GraphPad Prism version 9.4.0 for Windows (GraphPad Software, San Diego, California, USA; http://www.graphpad.com) by calculating the mean, median, standard deviation, and coefficient of variation of the calculated properties. Box-and-whisker plots showing the maximum and minimum values were generated for visualization, and one-way ANOVA followed by Dunnett’s multiple comparisons test was employed to evaluate the differences between the datasets. The results were considered statistically significant when p<0.05.

Visual representation of chemical space

To generate a visual representation of the chemical space of the PeruNPDB, two visualization methods were applied to the six auto-scaled properties of pharmaceutical interest (MW, clogP, TPSA, clogS, HBD, and HBA). The first was principal component analysis (PCA), which reduces data dimensions by geometrically projecting them onto lower dimensions called principal components (PCs)53, calculated with the “PCA” KNIME node. The second was t-distributed stochastic neighbor embedding (t-SNE), a nonlinear dimension reduction technique in which Gaussian probability distributions over the high-dimensional space are constructed and used to optimize a Student t-distribution in the low-dimensional space54, calculated with the t-SNE (L. Jonsson) KNIME node. Three-dimensional and two-dimensional scatter plots were generated for PCA and t-SNE, respectively, with the Plotly KNIME node. Additionally, atom-pair-based fingerprints of the NPs were obtained using the “ChemmineR” package55 in the R programming environment (version 4.0.3)56, the Tanimoto similarity score was calculated for clustering the compounds, and a heatmap was generated for visualization. The same procedure was applied to the reference datasets AfroDB27, BIOFACQUIM30, and NuBBEDB28, which were retrieved from the ZINC20 database57.
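Two of the operations described above, auto-scaling of a property and the Tanimoto similarity between fingerprints, can be sketched in a few lines of Python. This is an illustrative sketch and not the KNIME or ChemmineR implementation: the function names are hypothetical, fingerprints are represented as plain sets of "on" features, and the use of the population standard deviation for auto-scaling is an assumption.

```python
from statistics import mean, pstdev


def autoscale(values):
    """Auto-scale (z-score) one property: subtract the mean, divide by the SD.

    Assumption: the population standard deviation is used; a constant
    property is mapped to zeros to avoid division by zero.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A and B| / |A or B| for two feature sets."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similarity_matrix(fingerprints):
    """Pairwise Tanimoto matrix, the kind of table a heatmap is drawn from."""
    return [[tanimoto(x, y) for y in fingerprints] for x in fingerprints]


# Toy fingerprints (assumed values, not real atom-pair descriptors).
toy_fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
toy_matrix = similarity_matrix(toy_fps)
scaled_mw = autoscale([150.0, 300.0, 450.0])
```

In this representation, a value of 1 on the diagonal of the matrix reflects the identity of each compound with itself, and off-diagonal values closer to 1 indicate more shared features.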

Global diversity: consensus diversity analysis

Since chemical diversity strongly depends on the structure representation, it is reasonable to consider multiple representations for a complete global assessment. Consensus diversity (CD) plots have been proposed as simple two-dimensional graphs that enable the diversity of compound data sets to be compared using four criteria: molecular fingerprints, scaffolds, molecular properties, and the number of NPs58. The multiple-variable plot was generated with GraphPad Prism version 9.4.0, where the y-axis represents the area under the cyclic system recovery curve59, the x-axis represents the median of the fingerprint-based diversity computed with Molecular Access System (MACCS) keys (166 bits) and the Tanimoto coefficient60, the bubble color represents the diversity of the molecular properties of pharmaceutical interest, and the bubble size represents the number of NPs in each database.
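The scaffold axis of a CD plot can be illustrated with a minimal Python sketch of the area under a cyclic system recovery (CSR) curve. The sketch assumes the usual construction, in which scaffolds are ranked from most to least populated, the x-axis is the fraction of scaffolds, and the y-axis is the cumulative fraction of compounds recovered; the function name `csr_auc` and the toy scaffold labels are hypothetical, and scaffold extraction itself is not shown.

```python
from collections import Counter


def csr_auc(scaffolds):
    """Area under the cyclic system recovery curve (trapezoidal rule).

    scaffolds: one scaffold identifier per compound.
    Under the assumed construction, values near 0.5 indicate compounds
    spread evenly over scaffolds (high scaffold diversity), and values
    approaching 1 indicate that few scaffolds cover most compounds.
    Note: a data set with a single scaffold is a degenerate case and
    also yields 0.5 with this discretization.
    """
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    if not counts:
        raise ValueError("at least one compound is required")
    n_compounds = sum(counts)
    n_scaffolds = len(counts)

    auc = 0.0
    previous_y = 0.0
    cumulative = 0
    for count in counts:
        cumulative += count
        y = cumulative / n_compounds
        # Each scaffold advances x by 1 / n_scaffolds.
        auc += (previous_y + y) / 2.0 / n_scaffolds
        previous_y = y
    return auc


# Toy scaffold assignments (assumed labels for illustration only).
even_set = ["a", "b", "c", "d"]      # every compound has its own scaffold
skewed_set = ["a", "a", "a", "b"]    # one scaffold dominates
```

Comparing `csr_auc(even_set)` with `csr_auc(skewed_set)` shows how a dominant scaffold raises the area and therefore lowers the scaffold diversity read from the plot.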

Drug-likeness

The Osiris DataWarrior v05.02.01 software61 was employed to calculate the drug-likeness score of the compounds from the PeruNPDB; the calculation is based on a library of 5300 substructure fragments and their associated drug-likeness scores. This library was prepared by fragmenting 3300 commercial drugs as well as 15,000 commercial non-drug-like Fluka compounds61. The frequency distribution of the obtained scores was computed in GraphPad Prism version 9.4.0 for Windows (GraphPad Software, San Diego, California, USA; http://www.graphpad.com) and plotted as stacked bar plots. Furthermore, the Lipinski Rule-of-5 (Ro5) is a set of four rules (logP, MW, and H-bond donor and acceptor cut-offs) for drug-likeness and oral bioavailability derived from a subset of 2245 drugs62. The Lipinski’s Ro5 KNIME node was employed to count the number of violations of the rule for each compound in the PeruNPDB, and the results were plotted as pie charts. The US Food and Drug Administration (FDA)-approved drugs dataset57 was employed as a reference, and the same procedures were applied to its compounds. The chemical space representation was also analyzed, following the procedures described earlier.
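Counting Ro5 violations amounts to four threshold checks per compound, as the following Python sketch shows. It is not the KNIME node used in this work; it assumes the commonly cited cut-offs (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10), and the function names and descriptor values are hypothetical.

```python
from collections import Counter


def ro5_violations(mw, logp, hbd, hba):
    """Return the number of Lipinski Rule-of-5 violations (0 to 4).

    Assumed cut-offs: MW <= 500, logP <= 5, HBD <= 5, HBA <= 10.
    """
    checks = (mw > 500, logp > 5, hbd > 5, hba > 10)
    return sum(checks)


def violation_profile(compounds):
    """Tally compounds by number of violations, e.g. as input for a pie chart.

    compounds: iterable of (mw, logp, hbd, hba) tuples.
    """
    return Counter(ro5_violations(*c) for c in compounds)


# Toy descriptor values (assumed for illustration, not PeruNPDB entries).
toy_compounds = [
    (180.2, 1.2, 1, 4),     # small, polar molecule: no violations expected
    (1202.6, 3.0, 5, 12),   # large macrocycle: MW and HBA exceed the cut-offs
    (650.0, 6.1, 2, 8),     # heavy and lipophilic: MW and logP exceed them
]
profile = violation_profile(toy_compounds)
```

The resulting tally maps each violation count to the number of compounds with that count, which is the quantity summarized per dataset.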

Results

PeruNPDB assembly

In the present study, the PeruNPDB was assembled and then characterized chemoinformatically in terms of molecular diversity and coverage of chemical space, following the workflow proposed in Fig. 1. To select the studies from which the NPs were subsequently retrieved, a search with the MeSH terms “Peru” AND “Natural Products” was performed in the PubMed database, followed by the construction of a network map of the co-occurrence of MeSH terms. The search returned 399 papers published between 1950 and 2021, and setting five as the minimum number of keyword occurrences yielded a map with 194 keywords that reached the threshold (Fig. 2A). Analysis of the map shows that six main clusters were formed, and that terms associated with NPs, such as “Plant Extracts”, “Plants, medicinal”, “Phytotherapy”, “Ethnopharmacology”, “Ethnobotany”, “Plants stems”, “Plants bark”, and “Seeds”, were found in the first cluster (red color). Terms such as “Peru”, “Humans”, “Animals”, and “Male” were also recurrent. Using the established eligibility criteria, 47 articles published between 2000 and 2021 were selected, in which terms such as “Flavonoids”, “Sesquiterpenes”, and “Anthocyanins” were recurrent (Fig. 2B). The bibliographic data extracted from the selected articles showed that the “Journal of Agricultural and Food Chemistry”, the “Journal of Ethnopharmacology”, “Phytochemistry”, and “Planta Medica” were the main peer-reviewed journals in which the studies describing compounds extracted from Peruvian NPs were published (Fig. 2C). Furthermore, while retrieving the SMILES of the compounds from PubChem, DrugBank, and ChEMBL, it was observed that 242 structures were found in the repositories and that 38 needed to be generated with the ChemDraw tool. Ninety-five percent of the compounds were retrieved from plant sources and five percent from animal sources (Fig. 3A).
The genera from which most of the compounds were extracted were Uncaria and Lepidium, with 11 and 10 percent, respectively (Fig. 3B). When the structures of the compounds were analyzed with a classification system for small molecules, 76 classes of NPs were found among the 280 NPs of the PeruNPDB, with anthocyanidins (N=25), aporphine alkaloids (N=11), cinnamic acids and derivatives (N=17), germacrane sesquiterpenoids (N=13), stigmastane steroids (N=10), and unsaturated fatty acids (N=22) being the most frequently predicted classes (Fig. 4).

Figure 1

General workflow to generate and curate the first version of the Peruvian natural product database. The graph was edited in SmartDraw 2023 Software, LLC (Accessed April 14, 2023).

Figure 2

Bibliographic search for studies describing the characterization of Peruvian natural products. (A) Network map of the co-occurrence of MeSH terms. (B) Network map of the articles selected from 2000 to 2021. (C) Bibliographic data extracted from the selected articles that describe compounds extracted from Peruvian NPs.

Figure 3

Dot plots showing the kingdom and genus of the species studied. (A) Compounds of Peruvian NPs found in PubChem, DrugBank, and ChEMBL databases. (B) Dot plot of the genus of the Peruvian NPs compounds obtained from the databases.

Figure 4

Dot plots showing the natural products classification.

Molecular properties

Six physicochemical properties were calculated for all compounds in PeruNPDB and plotted as box plots, which also include the distribution of the same properties for the three reference datasets retrieved from the ZINC20 database (Fig. 5). To compare the datasets, the coefficient of variation (CV) was calculated; it represents the ratio of the standard deviation to the mean and is considered a useful tool to statistically compare the degree of variation from one dataset to another54. Except for HBA, for which NuBBEDB showed the highest CV (123.2%), the PeruNPDB showed the highest CV in MW, clogP, TPSA, clogS, and HBD, with 46.58%, 84.49%, 112.8%, 50.08%, and 83.84%, respectively. Moreover, the TPSA, clogP, clogS, and HBD results of the PeruNPDB were significantly different from those of AfroDB, BIOFACQUIM, and NuBBEDB, while its HBA results showed no statistical difference compared to the AfroDB database (Fig. 5).
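As a worked illustration of the CV defined above, the following Python sketch computes it as a percentage from a list of property values. The values reported in this work were obtained with GraphPad Prism; this sketch is only an illustration, and the use of the sample standard deviation and of the absolute value of the mean (so that properties with a negative mean still give a positive CV) are assumptions, as are the function name and the toy values.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Coefficient of variation in percent: 100 * SD / |mean|.

    Assumptions: sample standard deviation (n - 1 denominator) and the
    absolute value of the mean in the denominator.
    """
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return stdev(values) / abs(mu) * 100.0


# Toy property values (assumed for illustration only).
narrow = [290.0, 300.0, 310.0]
wide = [100.0, 300.0, 500.0]
cv_narrow = coefficient_of_variation(narrow)
cv_wide = coefficient_of_variation(wide)
```

Both toy lists have the same mean, so the larger CV of the second list reflects only its wider spread, which is the sense in which CV is used to compare datasets here.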

Figure 5

Box plots for the physicochemical properties of PeruNPDB and reference datasets.

Visualization of the chemical space

The chemical space of PeruNPDB was visualized using PCA and t-SNE. Visual analysis of the 3D-PCA shows that the molecules in PeruNPDB roughly share the chemical space with NuBBEDB, whereas in some regions the molecules of PeruNPDB are predominant (Fig. 6A). The percentages of variance explained by PC1, PC2, and PC3 were 50.24, 39.94, and 6.72, respectively. According to the 2D-t-SNE visual analysis, the PeruNPDB, BIOFACQUIM, and NuBBEDB compounds overlap in most of the chemical space represented (Fig. 6B).

Diversity analysis

The heatmap generated using the Tanimoto score matrix and the atom-pair-based fingerprints shows that there is similarity between the structures of the compounds of the PeruNPDB, AfroDB, BIOFACQUIM, and NuBBEDB (Fig. 6C). A consensus diversity plot was also used to evaluate the diversity of the PeruNPDB dataset based on molecular fingerprints, scaffolds, and physicochemical properties. The Euclidean distance of the scaled properties was used to compute the property-based diversity of the PeruNPDB, AfroDB, BIOFACQUIM, and NuBBEDB databases. In the CD plot, the property-based diversity values are represented on a continuous color scale, in which darker colors signify less diversity and brighter colors signify more diversity. Finally, different point sizes illustrate the relative size of the databases, with smaller data points indicating databases with fewer molecules. The results showed that the diversity of the compounds in the PeruNPDB was the largest, since the database was located in the area of the plot where the highest scaffold and fingerprint diversity is found (Fig. 7), which is consistent with the results shown in the box plots (Fig. 5).

Figure 6

Visual representation of the chemical space of the PeruNPDB and reference datasets. (A) PeruNPDB 3D-PCA chemical space. (B) 2D-t-SNE visual analysis of the compounds of PeruNPDB, AfroDB, BIOFACQUIM, and NuBBEDB. (C) Heatmap generated with the Tanimoto scoring matrix, showing structural similarity between the compounds of PeruNPDB and the reference data sets.

Drug-likeness

Drug-likeness qualitatively assesses the chance of a molecule becoming an oral drug with respect to bioavailability and is established from structural or physicochemical inspection of development compounds advanced enough to be considered oral drug candidates63. To assess the “drug-like” profile of the compounds from the PeruNPDB, two approaches were used. First, the frequency distribution of the drug-likeness score was analyzed, and the results showed that, despite the difference in the number of compounds in the two datasets, a similar distribution of scores was observed (Fig. 8A). Second, the number of violations of Lipinski’s Ro5 was analyzed, and the results showed that compounds with at least one violation represent 85.82 and 76.35% of the FDA and PeruNPDB datasets, respectively (Fig. 8B). In addition, the visual representation of the chemical space by PCA (Fig. 8C) and t-SNE (Fig. 8D) indicates that some of the NPs are distributed in the same space as the approved drugs. The percentages of variance explained by PC1, PC2, and PC3 were 52.38, 37.64, and 5.54, respectively. These findings suggest that, because some compounds in PeruNPDB have chemical structures similar to those of approved drugs, the database can be used in virtual screening to find possible lead compounds or starting points for further optimization.

Figure 7

Consensus diversity plot comparing the global diversity of PeruNPDB with the reference data sets.

Figure 8

Drug-likeness analysis of the PeruNPDB and the reference dataset. (A) Frequency distribution of the drug-likeness score for the FDA and PeruNPDB data sets. (B) Violations of Lipinski’s Ro5 in the FDA and PeruNPDB data sets. (C) 3D-PCA representation of the chemical space of the FDA and PeruNPDB data sets. (D) t-SNE representation of the FDA and PeruNPDB data sets.

Discussion

Peru has exceptionally high biodiversity, with numerous endemic species of mammals, reptiles, amphibians, flowering plants, and ferns, which is why it has been described as a “megadiverse” country64,65. However, a worldwide hotspot analysis of potential conflict between food security and biodiversity conservation points to Peru as a region that is especially at risk of biodiversity loss due to agricultural expansion66. The conservation of biodiversity is therefore important, since historically NPs have played a key role in drug discovery, especially for illnesses such as cancer and cardiovascular and infectious diseases67, and the growing interest in NPs and their applications is evidenced by the growing number of published NP databases and collections of structures from various organisms, geographical locations, targeted diseases, and traditional applications68. Currently, several NPs or NP-derived molecules are employed in the treatment of distinct diseases. Examples include the antibiotic penicillin, originally obtained from the fungi Penicillium spp.69; the analgesic aspirin, the most used drug in the world, derived from salicin extracted from the bark of the willow tree Salix alba70; and the immunosuppressant tacrolimus, employed to prevent organ rejection after transplants and obtained from the bacterium Streptomyces tsukubaensis71. In addition, NPs and their derivatives have been considered promising options to improve treatment efficiency in cancer patients and decrease adverse reactions72: vinca alkaloids73, taxane diterpenoids74, camptothecin derivatives75, and epipodophyllotoxin76 are NP-derived anticancer compounds used clinically as chemotherapeutics. The importance of biodiversity conservation is exemplified by the tree Taxus brevifolia, from which the chemotherapeutic drug paclitaxel was originally extracted and which was placed on the list of endangered species77,78.
According to the data, fewer compounds have been identified in the PeruNPDB than in AfroDB, BIOFACQUIM, and NuBBEDB, but its chemical diversity is higher. Of the 280 compounds characterized, 95% came from plant sources and 5% from animal sources. In the BIOFACQUIM and NuBBE databases, however, compounds derived from fungi, propolis, bacteria, and marine organisms are described in addition to those from plant sources. This partially explains the difference in the TPSA results of the PeruNPDB, since it has been reported that natural products from the animal kingdom have the highest TPSA due to their number of hydrogen bond donors and acceptors79. Furthermore, the Peruvian marine biodiversity hotspot located on the northern coast has been predicted to hold 501 species, 270 genera, and 193 families80. This is relevant because marine natural products have shown an interesting array of diverse and novel chemical structures with potent biological activities81, including cephalosporin C, an antibiotic derived from the marine fungus Cephalosporium82; eribulin, an anticancer drug derived from halichondrin B from the Japanese marine sponge Halichondria okadai83; and the antiviral nucleoside Ara-A, isolated from the sponge Tethya crypta84. Peru is also considered to have a very rich microbial diversity, which nevertheless remains little studied and exploited85,86. Fungi, which are eukaryotic microorganisms, produce a tremendous number of NPs with diverse chemical structures and biological activities87, such as lovastatin, the first statin approved by the FDA as a cholesterol-lowering medication, most frequently produced by Aspergillus terreus88, and cyclosporine A, a potent immunosuppressant that was initially used to prevent organ rejection, isolated from the fungal species Tolypocladium inflatum Gams89.
Although no current drug has been developed from propolis, it is considered to have a very rich and complex chemical composition, with about 300 different chemical components isolated from it; this composition fluctuates according to parameters such as plant source, harvesting season, geography, type of bee flora, climate changes, and honeybee species90,91. A notable example is artepillin C, extracted from Brazilian green propolis, which showed anti-inflammatory potential in vitro92 and in vivo93. These examples emphasize the urgency of promoting and enhancing the study of Peruvian NPs, both quantitatively and qualitatively. Compounds from Peruvian medicinal plants have been evaluated for their antidiabetic94, anticancer95, antiviral96, antibiotic97, and antiparasitic activities98; however, most of the studies in the literature were performed in vitro on plant extracts, and little information is available on the potential of single compounds for these activities. The promising results obtained with extracts may be explained by synergistic interactions or multi-factorial effects between the compounds present in the plant extracts studied99. Pharmacodynamic synergy involves multiple substances acting on various receptor targets to enhance the overall therapeutic effect, whereas pharmacokinetic synergy involves substances with little to no activity helping the main active principle to reach the target by improving bioavailability or by reducing metabolism and excretion; because of such interactions between the different constituents of plant extracts, this type of assay can hide the true potential of single molecules100. For this reason, the concerted effort of experimental NP research with CADD is continuously increasing. Recently, NPs from the Peruvian native plants Smallanthus sonchifolius, Lepidium meyenii (40 compounds)39, and Uncaria tomentosa (26 compounds)101 were analyzed in silico for their antiviral activity against SARS-CoV-2. Also, the in silico polypharmacological potential of 84 NPs from S. sonchifolius, L. meyenii, Croton lechleri, U. tomentosa, Minthostachys mollis, and Physalis peruviana was analyzed against Alzheimer’s disease102.

Conclusion

Here we present the first version of PeruNPDB, a compound database of NPs from Peru that includes 280 compounds from plant and animal sources. PeruNPDB was constructed, curated, and is maintained by the Computational Biology and Chemistry Research Group from the Universidad Catolica de Santa Maria, and it is freely accessible through the website https://perunpdb.com.pe/. The PeruNPDB was envisioned as a tool for virtual screening, for identifying promising compounds, for serving as a springboard for further biotechnological products, and for informing conservation policies. The chemoinformatic characterization and the analysis of the coverage and diversity of PeruNPDB in chemical space suggest broad coverage, overlapping with regions of the drug-like chemical space. For each compound, the database contains an identification code (ID), the chemical name, the bibliographic reference (name of the journal, year of publication, and DOI number), the kingdom, genus, and species of the natural source, the SMILES notation, and the classification of the natural product. In the future, we intend to launch version 2 of the PeruNPDB with new computed molecular descriptors, NP stereochemical data, and the possibility of downloading several structures at once. The web-based user interface will also be improved and maintained, and new NPs from various taxonomic ranks that are not included in the current edition will be added. Additionally, as the number of NPs increases, we anticipate comparing the PeruNPDB with larger, more varied free datasets that are available in the literature. The complete PeruNPDB dataset is available for research purposes upon request; requests should be directed to, and will be fulfilled by, the lead contact Miguel Angel Chavez Fumagalli (mchavezf@ucsm.edu.pe).