Introduction

Over the past few decades, drug discovery has shifted mainly to high-throughput screening of large chemical libraries, while research on natural products has diminished1,2. Reasons for this include: (i) legitimate concerns about the United Nations Convention on Biological Diversity and the Nagoya protocol, which regulates sovereign rights over genetic resources, resulting in many years of legal uncertainty regarding derived rights3,4; (ii) the difficulty of patent protection for natural products, especially since the publication of the “Guidance for Determining Subject Matter Eligibility of Claims Reciting or Involving Laws of Nature, Natural Phenomena, & Natural Products”5,6; (iii) the belief that natural products are somehow incompatible with high-throughput screening; and (iv) the belief that they are difficult to isolate or synthesize7. At the same time, some compound collections are designed to mimic natural products, because these also offer clear advantages; for example, they exhibit a wide range of pharmacophores, show a high degree of stereochemistry and are metabolite-like2,4,8,9,10,11,12,13. However, recent technical developments and increased legal certainty have led to a renewed interest in natural product drug discovery1,2,4,7,12,14,15,16. This is further strengthened by the 2015 Nobel Prize in Physiology or Medicine for natural product research on artemisinin, a sesquiterpene lactone with anti-malarial properties from the plant Artemisia annua, and on avermectins, macrocyclic lactones with potent anthelmintic and insecticidal properties derived from the soil bacterium Streptomyces avermitilis7,17.

Companies have made great efforts to increase the chemical diversity of their compound collections used for drug discovery, but at the same time to reduce especially late-stage attrition; as a result the composition of these collections is increasingly shaped, or even dictated, by requirements for lead- and drug-likeness4,18,19,20. To evaluate the lead- and drug-likeness of candidate molecules in silico, drug discovery filters (DDFs) are used21,22,23. DDFs are essentially sets of simple rules, determining whether a molecule meets criteria for one or more specific (derived) physicochemical parameters, further referred to as drug discovery parameters (DDPs). To qualify, a molecule must meet all or most criteria of the DDPs included in a specific DDF21,23. These DDFs were initially derived from collections of marketed medicines, but later more specific DDFs have been developed for specific applications e.g. fragment-based drug discovery24,25. The use of DDFs has led to lower attrition rates, i.e. fewer false positives, in drug development, although recent results suggest that further control of physicochemical properties is unlikely to affect attrition rates significantly26,27. DDFs, however, render a large area of chemical space “off-limits” for medicinal chemists, so that otherwise interesting molecules are not even considered4,28. For instance, many natural products would be excluded by Lipinski’s well-known Rule of Five (Ro5). This rule of thumb, consisting of a subset of the studied DDPs with specific criteria (Table 1), is used to evaluate whether a molecule is likely to be orally bioavailable, but without predicting whether the molecule is pharmacologically active21,29. Lipinski himself excluded natural products from his Ro5 when establishing the criteria30. Nonetheless, a substantial number of natural products meet the Ro5 DDP criteria31. 
Moreover, some natural products that do not meet the Ro5 DDP criteria nonetheless exhibit good oral bioavailability, probably through co-evolution32,33.

Table 1 Drug Discovery Parameter criteria used in Drug Discovery Filters available in JChem21,22,23,56,57,58. #: number of; —: not applicable.

Natural product-based drug discovery is considered intrinsically complex and requires a highly integrated interdisciplinary approach1,2,4,7,34,35. Essential oils (EOs) are plant-based natural products that have been used therapeutically for millennia for a broad range of biological activities36,37,38. Nowadays, EOs, their components and derivatives thereof are used in a wide variety of commercial applications. They are found in numerous products, including regular medicines39,40,41,42,43, and for some of these, randomized controlled (clinical) trials have been performed44,45,46,47. An EO is defined by the International Organization for Standardization as a product obtained from raw plant material by water, steam or dry distillation, or by mechanical processing of the epicarp of citrus fruits, after possible separation of the aqueous phase by physical processes. In addition, an EO may undergo physical treatments, e.g. filtration, decantation or centrifugation, provided this does not significantly change its composition48. Although other EO definitions are used, this study adheres to the International Organization for Standardization definition, because it permits a clear distinction between EOs and EO-like products, such as (supercritical fluid) extracts36,48,49,50. The plant species, chemotype and plant part(s), as well as the extraction method, have a major influence on the composition of an EO. EOs are typically composed of many essential oil components (EOCs), most of which are synthesized via the methylerythritol phosphate, mevalonic acid or shikimate pathway36,51.

Natural products such as EOs are often avoided in drug discovery, partly because of some undesirable characteristics1,4. EOs are complex mixtures of relatively hydrophobic, volatile EOCs, which can cause interference during screening52,53,54. However, can the exclusion of EO(C)s in the search for new drugs be justified because of these properties or because they are (derivatives of) natural products? To answer this question, DDPs of EOCs obtained from a set of commercially available EOs were calculated, analyzed and summarized. Then the lead- and drug-likeness of these EOCs were evaluated using all the DDFs available in JChem (for Office) from ChemAxon, a widely accepted cheminformatics software package in drug discovery (Table 1). Additionally, the Rule of Three (Ro3) DDF was used to test if some EOCs would be potential candidates for fragment-based drug discovery24,25,55. Finally, the results of the DDFs obtained from (i) the EOCs in our EO set and (ii) the approved drugs in DrugBank were compared.

Results

Selection of an EO set

A sample of 188 chemotypically defined EOs, representing a cross-section of what is currently commercially available, was obtained from Pranarôm International S.A. (Belgium). Eliminating duplicates led to our final set of 175 EOs (SI 1), on which all further analyses were performed54,56. These EOs were produced by distillation (93.7%) or by mechanical pressing (6.3%)48.

Analysis of EOCs: introducing the (unique) Core Molecular Constitution of EOCs: (u-)cmcEOCs

EO analysis by Gas Chromatography with Mass Spectrometry and Flame-Ionization Detection (GC-MS-FID) identified a total of 6,142 EOCs (≥0.10%; nEO = 175), at least at the level of their Core Molecular Constitution (CMC; Box 1); they are further referred to as cmcEOCs and were retained for further analysis. A total of 764 EOCs (≥0.10%; nEO = 175) could not be identified at least at their CMC level, resulting in incomplete or no data on the DDPs being studied; therefore, they were not included in further analyses.

Because there is an overlap in composition between EOs, many of the 6,142 cmcEOCs are identical, and the whole set can be described with only 627 different InChIKeys-14; these are further referred to as unique-cmcEOCs (u-cmcEOCs; Box 2; SI 2). Approximately 35% (n = 218) of the u-cmcEOCs appear in only one EO. This is in strong contrast with the five most frequently found u-cmcEOCs in our EO set (Fig. 1), i.e. limonene (identified in 153 EOs: nEO = 153; see also SI 2 no 551), alpha-pinene (nEO = 149; SI 2 no 137), beta-myrcene (nEO = 141; SI 2 no 457), beta-caryophyllene (nEO = 139; SI 2 no 319) and beta-pinene (nEO = 133; SI 2 no 523). Together, these results largely correspond with previous findings57. As a robustness check, the EOC composition of three subsets (A-C) of our EO set was analyzed, i.e. (A) EOs of conventional cultivation (n = 101), (B) EOs of certified-organic cultivation (n = 74), and (C) all EOs of conventional cultivation, complemented with those EOs of certified-organic cultivation that originated from other plant species, the same plant species but from other plant parts, or the same plant species but a different chemotype (n = 141; SI 1). No differences in rank order were found for the five most frequent u-cmcEOCs for the defined EO (sub)sets (SI 3). This indicates that a typical EO consists of a combination of common and rare EOCs, and these findings suggest that the most common EOCs are present in a majority of EOs.
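The frequency figures above (nEO per u-cmcEOC and the share of u-cmcEOCs found in a single EO) can be reproduced with a simple count. The sketch below is only an illustration under stated assumptions: the EO names and component keys are hypothetical placeholders, not data from SI 1 or SI 2, and the function names are ours.

```python
from collections import Counter

# Hypothetical toy data: each EO maps to the keys of its identified components.
# These placeholder strings stand in for real InChIKeys-14.
eo_components = {
    "EO_1": ["KEY_LIMONENE", "KEY_A_PINENE", "KEY_RARE_1"],
    "EO_2": ["KEY_LIMONENE", "KEY_A_PINENE"],
    "EO_3": ["KEY_LIMONENE", "KEY_RARE_2"],
}


def eo_frequency(components_by_eo):
    """For each unique component key, count the number of EOs containing it (nEO)."""
    counts = Counter()
    for keys in components_by_eo.values():
        counts.update(set(keys))  # set(): count a component at most once per EO
    return counts


def singleton_share(counts):
    """Fraction of unique components that occur in exactly one EO."""
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


freq = eo_frequency(eo_components)
print(freq.most_common(2))    # the two most widespread components with their nEO
print(singleton_share(freq))  # share of components seen in a single EO
```

Applied to the real data, `len(freq)` would correspond to the number of u-cmcEOCs and `singleton_share` to the roughly 35% reported above.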

Figure 1

Chemical structures of the five most frequently found u-cmcEOCs in our EO set (in descending order of frequency from left to right).

Suitability of EOCs for drug discovery and development: DDPs

DDPs are (derivatives of) common physicochemical parameters. DDFs may use the same DDPs, but possibly with different value ranges. To evaluate the potential lead- and drug-likeness of the u-cmcEOCs present in our EO set, we calculated the values of 13 DDPs, and determined how many times these satisfy the criteria for one or more of the six standard DDFs in our study (Fig. 2 and Table 2). We also demonstrate that some DDPs, e.g. log P and log D (pH 7.4) or Muegge’s atoms and polar surface area, are unsurprisingly very highly correlated and may be interchangeable (SI 4).

Figure 2

Summary of the values for the u-cmcEOCs (n = 627; Box 2) of the DDPs in the DDFs we used. Tukey boxplots of the DDP values for the u-cmcEOCs [minimum; 25th percentile; median; 75th percentile; maximum] (unit of measurement): (a) molecular mass [46.1; 152.2; 194.2; 212.4; 352.7] (Da), (b) log P [−0.16; 2.46; 3.18; 4.06; 11.58], (c) number of H-donor atoms [0; 0; 0; 1; 2], (d) number of H-acceptor atoms [0; 0; 1; 1; 4], (e) log D at pH 7.4 [−1.71; 2.44; 3.18; 4.06; 11.58], (f) number of molecular rings [0; 0; 1; 2; 4], (g) number of rotatable bonds [0; 1; 2; 4; 22], (h) number of atoms [9; 27; 32; 39; 77], (i) molecular refractivity [13.01; 46.50; 56.01; 67.90; 116.80] (m3 mol−1), (j) number of C-atoms [2; 10; 12; 15; 25], (k) Muegge’s atoms [0; 0; 1; 2; 4], (l) polar surface area [0.00; 0.00; 20.23; 26.30; 55.76] (10−10 m2). Values for the DDP fused aromatic rings were not summarized as for all but four u-cmcEOCs the values were zero. For more details on the Drug Discovery Parameters described in b, e, i, k, l, see the materials and methods section.

Table 2 Number (percentage) of u-cmcEOCs (Box 1) that pass each criterion separately (Drug Discovery Parameters; top rows). Number (percentage) of u-cmcEOCs and u-cmcADs (Box 2) that pass the combined criteria of a specific Drug Discovery Filter (bottom rows) available in JChem (for Office)21,22,56,57,58. —: not applicable.

More than 90% of the u-cmcEOCs had values within the criteria limits for at least nine out of 13 DDPs, irrespective of the DDF. For only two DDP criteria of one DDF, i.e. the Muegge filter, more than half of the values of the u-cmcEOCs DDPs were outside the limits for these criteria (Table 2).

Suitability of EOCs for drug discovery and development: DDFs

All u-cmcEOCs were evaluated with the DDFs available in JChem (for Office), i.e. (i) Ro5, (ii) Lead Likeness, (iii) Ghose, (iv) Muegge, (v) Veber, and (vi) Bioavailability, thus combining all the DDP criteria of each DDF, including three variants (Table 1), and the Ro3 DDF. A u-cmcEOC passes a DDF if the u-cmcEOC DDP values for all criteria of that DDF are within the DDP limits.

DDFs for bioavailability

For the Ro5 benchmark DDF, all (n = 627) u-cmcEOCs passed when it is assumed that an orally active drug usually has no more than one criterion violation21, and more than 94% (n = 591) of the u-cmcEOCs passed all four criteria (Tables 1 and 2). According to Lipinski, candidate drugs that meet the Ro5 criteria tend to have a lower attrition rate during clinical development, and therefore have an increased chance of reaching the market29,58. Approximately 98% (n = 607) of the u-cmcEOCs passed the Veber DDF, which again implies that u-cmcEOCs generally should have good oral bioavailability and intestinal absorption22,59. Veber et al.22 reported that the DDPs polar surface area and rotatable bonds probably discriminated better than the DDP molecular mass between compounds that are orally bioavailable and those that are not. For our sample, we found no significant correlation between the DDPs polar surface area and molecular mass (ρ = 0.07, p > 0.05). We found that the DDPs rotatable bonds and molecular mass are very weakly correlated (ρ = 0.10; p < 0.05), indicating that the Veber DDF captures mostly different information compared to the Ro5 DDF (SI 4). Furthermore, because the DDPs polar surface area and rotatable bonds are only moderately correlated (ρ = 0.46; p < 0.0001), these DDPs capture partially different information (SI 4). Therefore, the DDFs Ro5 and Veber can be considered complementary DDFs, at least for our EO set. Also, approximately 98% (n = 607) of the u-cmcEOCs passed the Bioavailability DDF (Table 2). This is not entirely surprising, because the Bioavailability DDF is essentially a merger of the Ro5 and Veber DDFs, complemented with the fused aromatic rings DDP, for which only six of the seven criteria must be met (Table 1).

DDF for lead-likeness

DDFs are most often applied to hits from high throughput screens. However, in order to improve affinity and selectivity of a drug candidate, additional chemical groups are usually added, so that molecular mass and lipophilicity often increase during lead optimization. The Lead Likeness DDF, for example, is biased towards lower lipophilicity and molecular mass, so that interesting lead candidates can be further optimized towards drug-like candidates (Table 1). The standard Lead Likeness DDF uses the DDPs Log D at pH 7.4 or alternatively Log P; approximately 73% (Table 2) or 87% (n = 544; not shown in Table 2), respectively, of the u-cmcEOCs pass this DDF.

DDF for fragment-based drug discovery

Furthermore, a DDF derived from Ro5 appears useful for efficient lead discovery in a fragment-based drug discovery approach, i.e. the Ro3 DDF24. About 32% (n = 202) of the u-cmcEOCs pass all 5 DDP criteria of the Ro3 (for criteria see Materials and Methods section). Most u-cmcEOCs that do not pass this DDF fail because the criteria limit(s) of DDP log P and/or DDP rotatable bonds were exceeded in 55.5% (n = 348) and 32.7% (n = 205) of the cases, respectively.

DDFs for drug-likeness

The DDP criteria of the high drug-likeness Ghose DDF are based on an analysis of known drugs from the Comprehensive Medicinal Chemistry database60, and approximately 60% (n = 377) of the u-cmcEOCs passed this DDF (Table 2). In contrast, less than 10% (n = 59) of the u-cmcEOCs passed the Muegge filter, which tries to differentiate between drug-like compounds and non-drugs based on the observation that non-drugs are often under-functionalized61. The reasons for failing the Ghose or Muegge DDFs were the lower limit of DDP molecular mass for both DDFs, combined with the DDP Muegge’s atoms (i.e. the total number of atoms of a molecule minus the total number of carbon and hydrogen atoms) for the latter (Table 1). Only about 32% of the u-cmcEOCs passed Muegge’s atoms criterion (Table 2).

In general, only nine out of 627 u-cmcEOCs (SI 2; nos. 33, 99, 159, 165, 226, 233, 251, 571, 584) did not pass any of the DDFs under study, including variants, with the exception of the DDF Ro5 variant, where only three out of four criteria suffice, so that all u-cmcEOCs pass this latter DDF variant. In contrast, eight u-cmcEOCs (SI 2; nos. 8, 73, 371, 407, 490, 510, 543 and 560) passed all ten DDFs (variants) in our study, implying that these u-cmcEOCs passed the most stringent criteria of each DDP in our study.

Comparing EOCs in our sample with approved drugs in DrugBank

The results of the standard DDF analyses of the u-cmcEOCs in our sample and the u-cmcADs (Box 2) in DrugBank were compared (Tables 2 and 3). Except for the Muegge DDF, proportionally more u-cmcEOCs than u-cmcADs passed the individual standard DDFs (Table 2), indicating that overall, EOCs meet the combined criteria of most DDFs at least as well as the drugs on the market. However, proportionally about six times more approved drugs (35.4%) than u-cmcEOCs (6.2%) passed all the DDFs. Nonetheless, about 94% of all u-cmcEOCs passed at least four out of six DDFs compared with only 67.5% of the approved drugs. It should be noted that a relatively large proportion of the approved drugs (13.8%) did not pass any of the DDFs (Table 3).

Table 3 Number (percentage) of u-cmcEOCs and u-cmcADs (Box 2) that pass exactly X standard Drug Discovery Filters (with 0 ≤ X ≤ 6).
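The tally behind Table 3 (how many molecules pass exactly X of the six standard DDFs) and the derived "at least four out of six" share can be sketched as follows. This is a minimal illustration, assuming the per-filter outcomes are available as one list of six booleans per molecule; the toy flags are hypothetical and do not reproduce the values in Table 3.

```python
from collections import Counter

N_FILTERS = 6  # the six standard DDFs


def pass_count_distribution(results):
    """Map X -> number of molecules passing exactly X filters, for X = 0..6.

    results: one list of N_FILTERS booleans per molecule (True = filter passed).
    """
    counts = Counter(sum(flags) for flags in results)
    return {x: counts.get(x, 0) for x in range(N_FILTERS + 1)}


def share_at_least(distribution, k):
    """Fraction of molecules passing at least k filters."""
    total = sum(distribution.values())
    return sum(n for x, n in distribution.items() if x >= k) / total


# Hypothetical outcomes for four molecules (not data from the study).
toy_results = [
    [True, True, True, True, True, True],    # passes all six
    [True, True, True, False, True, True],   # passes five
    [True, False, True, False, True, True],  # passes four
    [False] * N_FILTERS,                     # passes none
]

distribution = pass_count_distribution(toy_results)
print(distribution)
print(share_at_least(distribution, 4))
```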

Some EO(C)s made it into DrugBank

DrugBank lists a few u-cmcEOCs approved as drugs in at least one jurisdiction (SI 5). Eugenol, for instance, has found routine use as a topical antiseptic in dentistry, as a counter-irritant and for pain control; it is the major EOC of the EO of Syzygium aromaticum, also known as Eugenia caryophyllus. Menthol is used as a local anesthetic, has counter-irritant qualities, and relieves minor throat irritation; it is a major EOC of the EO of Mentha x piperita62. Moreover, none of the EOCs approved as drugs have ever been withdrawn (SI 5), whereas a limited number of u-cmcADs (n = 138; 5.8%) have been withdrawn to date. However, it is difficult to draw conclusions from this because the sample of EOCs approved as drugs is too small. We can, nevertheless, conclude that, once approved as a drug, EOCs have stood the test of time. In addition, of the 180 approved drugs in DrugBank for which no InChIKey could be defined, e.g. because they were complex mixtures of components, we identified at least seven EOs that are approved as drugs: the EOs of Eucalyptus, turpentine, sage, tea tree, Pinus mugo (needle), Curcuma aromatica (root) and Atractylodes japonica (root). None of these EOs have been withdrawn to date, although this does not imply their efficacy or lack of toxicity.

Discussion

A diverse set of 175 commercially available and chemically defined EOs from a multinational company specialized in scientific aromatherapy was selected for analysis. Possible advantages are that (i) they must meet a minimum number of (quality) requirements, and (ii) when they are purchased from the same reliable source, analysis procedures and handling of the EOs have been standardized where possible, hence minimizing variation. A possible disadvantage is selection bias i.e. the exclusion of non-commercial EOs, or bias against some EOs because they are e.g. toxic, insufficiently marketed, locally regulated, not available in sufficient quantities, or considered insufficiently interesting from the company’s viewpoint. The quality control parameters of our set of commercial EOs (SI 6) coincide with previous findings where no distinction was made between commercial and non-commercial EOs36,49,63,64,65. From a drug discovery perspective, however, the investigation of non-commercial, uncommon and toxic EOs merits attention as they are potentially an interesting source of possibly unknown and rare lead- and drug-like EOCs.

We found it useful to define the CMC of an EOC as its InChIKey-14 (Box 1); this permitted e.g. rapid and effective deduplication. This CMC contains in a coded manner essential information about the structure and composition of a molecule, but without data on isomerism (which are encoded by the remainder of the InChIKey). We are not aware of earlier uses of the CMC as we defined it, and believe that it may be useful for the chemoinformatic analysis of other compounds.

In this study, each component with an InChIKey-14 (n = 627) was considered an EOC. Two of these InChIKeys-14, however, belong to molecules that are not considered EOCs, but are commonly found in EOs due to procedural contamination or fermentation, i.e. acetone (SI 2; no. 58) and ethanol (SI 2; no. 254). Eliminating these two molecules would not have influenced our conclusions, as their number is small (0.32%) compared to the entire sample.

The Muegge DDF has two DDPs that u-cmcEOCs rarely satisfy: the atom criterion and the lower limit on molecular mass. (i) Muegge’s atoms criterion requires that the total number of atoms of a molecule minus the total number of carbon and hydrogen atoms be equal to or greater than 2 (Table 1). Because all u-cmcEOCs except for three (<0.5%) nitrogen-containing ones (SI 2; nos. 139, 455, 481) consist of only carbon, hydrogen and oxygen atoms, u-cmcEOCs with no or only one oxygen atom do not meet this Muegge criterion (67.5%; Table 2). This includes all but four (SI 2; nos. 29, 346, 415 and 428) of the above-mentioned EOCs that are approved as drugs (SI 5). (ii) The lower limit of the DDP molecular mass is 200 (≥200; Table 1). Almost 55% of the u-cmcEOCs in our sample do not meet this criterion (Table 2), including all but two (SI 2; nos. 403 and 428) of the above-mentioned EOCs that are approved as drugs (SI 5). None of the EOCs in our sample that were ever approved as drugs (n = 12), according to DrugBank, were ever withdrawn, and only one, i.e. benzyl benzoate (SI 2; no. 428), met all criteria of the Muegge DDF62. Moreover, the Muegge atoms criterion is based on the observation that drugs have on average more “pharmacophore points” (i.e. functional groups such as amine, amide, alcohol, ketone, sulfone, sulfonamide, carboxylic acid, carbamate, guanidine, amidine, urea, and ester) than non-drugs61. But Muegge admits that 30–40% of compounds in different drug databases do not meet this criterion, whereas over one-third of non-drug chemicals do. In isolation, this criterion therefore does not discriminate drugs from non-drugs very well. The requirement for multiple pharmacophore points fits with the idea that these groups confer binding specificity. Although the clinical effects of EO(C)s often suggest sufficient therapeutic specificity, functional groups could be added to lead-like EOCs to increase their specificity for the desired molecular target.
As these groups will also increase the molecular mass, this will also help to satisfy the second of Muegge’s DDPs. Since most EOCs tend to be small, they provide ample room for adding functional groups before falling foul of other DDP criteria based on molecule size. Therefore, we think that the Muegge DDF should not be a limiting filter for the evaluation of EOCs.

The fact that relatively few EO(C)s made it into approved drugs could be due to their unusual properties. However, when these properties are benchmarked against various measures of drug-likeness, most EOCs pass with flying colors. For example, all u-cmcEOCs passed the Ro5 when only three of the four criteria had to be met. In addition, almost 94% of the EOCs passed at least any four out of six standard DDFs. Because DDFs are based on marketed drugs, it was expected that many approved drugs in DrugBank would pass most, but not necessarily all DDFs. The Lead Likeness and Ro3 DDFs, for example, were developed to search for lead molecules, and therefore not necessarily for marketed drugs that have already passed this development stage. Conversely, it was expected that most u-cmcEOCs would not pass all six DDFs.

One of the major drawbacks, however, in the transition of a natural compound from a hit to a drug is the increased amount of compound required, which often cannot be met by re-isolation from the relevant plant sources34,66. EO production and distribution, however, is a mature industry, and EOs were the 446th most-traded product in the world in 2017, with a total export value of $5.44 billion (SI 7)67. Therefore, if necessary, EO production can be relatively easily scaled up, with or without the use of biotechnology68, while medicinal chemists69 can find a way to synthesize the EOC and derivatives thereof70.

In the end, this suggests that EOCs are promising (sources of) new drugs and deserve more attention in the future. EOCs also have unique properties that might be useful for some therapeutic applications, e.g. for lung or airway diseases71,72,73,74, for transdermal administration75,76 and for diseases of the central nervous system77,78,79.

Materials and Methods

Essential oils (EOs)

A set of 175 EOs, representing a cross-section of what is currently commercially available, was retained for further data analysis from a sample of 188 chemotypically defined EOs obtained from Pranarôm International S.A. (Belgium). Chemical composition, quality and origin of the EOs were certified by the company. The reduction from 188 EOs to our final set of 175 is essentially due to deduplication54. EOs were considered different when originating from (i) other plant species, (ii) the same plant species but from other plant parts, (iii) the same plant species but a different chemotype, and (iv) certified-organic versus conventional cultivation. Chemical analyses of the EOs were performed by GC-MS-FID using the NF ISO 11024-1/2 standard (Pranarôm International S.A., personal communication). The detected residue levels of organophosphorus and organochlorine pesticides complied with the relevant EU legislation, and maximum permitted levels were never exceeded56. The chemical composition (≥10%) and metadata of the EOs used in this study were reported previously; see www.nature.com/articles/s41598-018-22395-6 under the heading electronic supplementary material54. More detailed analysis certificates and the methodology used (in French) can be consulted at www.inula-group.com/fr/pranaquality (see also SI 1).

Data preparation, calculations and visualisation of the EO set

Initially, all GC-FID peaks ≥ 0.01% from the EO set (n = 175) were considered. After a preliminary evaluation, only peaks ≥ 0.10% were retained for further analysis because many GC peaks < 0.10% were not or only partially identifiable. Subsequently, any EOC that was identifiable at least at its core molecular constitution (CMC; see also Box 1) was retained for further analysis, and a standard International Chemical Identifier (InChI) along with the corresponding hashed 27-character counterpart, i.e. InChIKey, was assigned using publicly accessible databases, e.g. ChemSpider, PubChem or Chemistry WebBook80,81. Only the first 14 characters of the InChIKey (InChIKey-14) of each EOC were retained, thereby removing additional layers of information other than the CMC of the EOC (cmcEOC). After deduplication of the cmcEOCs (u-cmcEOCs), the unique InChIKeys-14 of the u-cmcEOCs were retained for further analysis. To display the molecular structures of the u-cmcEOCs with Marvin (SI 2), a structure-data file was created with ChemMine Tools using the Simplified Molecular-Input Line-Entry System (SMILES) notation of the u-cmcEOCs. To this end, each unique InChIKey-14 was first complemented with an information-neutral second InChIKey block, i.e. UHFFFAOYSA, to re-establish a full InChIKey that was then translated with JChem (for Office) into SMILES81,82. RapidMinerStudio was used for data preparation and data blending.
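The truncation, deduplication and re-completion of InChIKeys described above amount to simple string handling, sketched below. This is an illustration under assumptions, not the RapidMiner or JChem workflow used in the study: the function names are ours, the trailing `-N` flag (standard InChI, neutral form) is an assumption because the text names only the second block, and the three example keys are the limonene entries as we recall them from public databases, so they should be treated as illustrative.

```python
NEUTRAL_SECOND_BLOCK = "UHFFFAOYSA"  # information-neutral second block named in the text
PROTONATION_FLAG = "N"               # assumption: neutral form; not stated in the text


def core_key(inchikey):
    """Return the first 14-character block of an InChIKey (the InChIKey-14)."""
    first_block = inchikey.strip().split("-")[0]
    if len(first_block) != 14 or not (first_block.isalpha() and first_block.isupper()):
        raise ValueError(f"not a valid InChIKey first block: {inchikey!r}")
    return first_block


def unique_core_keys(inchikeys):
    """Deduplicate on the InChIKey-14, keeping first-seen order."""
    return list(dict.fromkeys(core_key(key) for key in inchikeys))


def rebuild_full_key(key14):
    """Re-establish a 27-character key from an InChIKey-14 and the neutral second block."""
    return f"{key14}-{NEUTRAL_SECOND_BLOCK}-{PROTONATION_FLAG}"


# Illustrative keys: two stereoisomers and the stereo-unspecified form of limonene.
example_keys = [
    "XMGQYEVFXBZJKV-SSDOTTSWSA-N",
    "XMGQYEVFXBZJKV-ZETCQYMHSA-N",
    "XMGQYEVFXBZJKV-UHFFFAOYSA-N",
]

unique_keys = unique_core_keys(example_keys)
rebuilt = [rebuild_full_key(key) for key in unique_keys]
print(unique_keys)
print(rebuilt)
```

All three example keys share one first block and therefore collapse into a single entry, which is the behaviour the CMC definition relies on.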

Data preparation of, and calculations on, the DrugBank sample

The CSV file of all drugs marked ‘approved’ in the ‘drug group’ column (nentries = 2,594) was downloaded from DrugBank; it contains the names of all drugs that were once approved in any jurisdiction at any given time and, for most of them (nentries = 2,414), structure information in the form of, e.g., InChI, InChIKey or SMILES. All approved drugs with no structure information (nentries = 180) were initially not considered and therefore removed from the sample. Subsequently, the second block of the InChIKeys was removed, resulting in an InChIKey-14 for each drug. After deduplication, a total of 2,359 unique InChIKeys-14, corresponding to the unique CMCs (Box 1), of all approved drugs (u-cmcADs) in DrugBank were retained for further analysis.

Drug Discovery Parameters (DDPs)

JChem (for Office) and Excel were used for (i) chemical database access, (ii) structure-based property calculations (Fig. 2 and Table 2) and (iii) for searching and reporting the chemical structures, i.e. u-cmcEOCs (SI 2). Briefly, to estimate the octanol/water partition and distribution coefficients of the EOCs, the consensus model of ChemAxon, based on the Viswanadhan et al.83 and Klopman et al.84 models, and the PhysProp database85,86, were used by JChem (for Office)87. For calculating the octanol/water partition coefficient, P, only the un-ionized form was considered, whereas the distribution coefficient, D, also considers, if applicable, all charged forms of the molecule for a given pH; thus we obtained DDPs (i) Log P = log10(octanol/water partition coefficient) and (ii) Log D = log10(octanol/water distribution coefficient). (iii) The molar refractivity was calculated based on the atomic method described by Viswanadhan et al.83, and to estimate (iv) the polar surface area of the EOCs, the topological polar surface area method as described by Ertl et al.88 was used by JChem (for Office). (v) Muegge’s atoms DDP is equal to the total number of atoms of a molecule minus the total number of carbon and hydrogen atoms.
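The definition of the Muegge’s atoms DDP in (v) can be illustrated with a few lines of code. The sketch below is an assumption-laden stand-in for the JChem calculation: it works from a plain molecular formula string rather than a structure, handles only simple formulas without parentheses, charges or isotopes, and uses function names of our own choosing. The threshold of 2 is the Muegge criterion as stated in the Discussion.

```python
import re

MUEGGE_ATOMS_MINIMUM = 2  # criterion as described in the text (Table 1)


def atom_counts(formula):
    """Parse a simple molecular formula such as 'C10H12O2' into {element: count}.

    Assumes no parentheses, charges or isotope labels.
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def muegge_atoms(formula):
    """Total number of atoms minus the number of carbon and hydrogen atoms."""
    counts = atom_counts(formula)
    return sum(n for element, n in counts.items() if element not in ("C", "H"))


# Molecular formulas of three components mentioned in the text.
formulas = {"limonene": "C10H16", "menthol": "C10H20O", "eugenol": "C10H12O2"}

for name, formula in formulas.items():
    value = muegge_atoms(formula)
    print(name, value, value >= MUEGGE_ATOMS_MINIMUM)
```

A hydrocarbon such as limonene scores 0 and a mono-oxygenated component such as menthol scores 1, which is why components with no or only one oxygen atom fall short of this criterion.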

Drug Discovery Filters (DDFs)

JChem (for Office) and Excel were used for calculating the number of u-cmcEOCs and u-cmcADs that passed the different DDFs. The six DDFs supported by JChem (for Office) are referred to as standard DDFs (Tables 1 and 2). Three of the six standard DDFs each have two variants: (i) for the Ro5 DDF, three or four out of four criteria have to be met to pass this filter, and the latter, more conservative variant was considered standard87. For the (ii) Lead Likeness and (iii) Veber DDFs, the combinations of DDPs supported by JChem (for Office) were considered standard, whereas the alternative combinations of DDPs mentioned in the respective publications were considered non-standard variants (see also Tables 1 and 2)22,87,89. We added one DDF not included in JChem (for Office); it is derived from the Ro5 (i.e. the Ro3 DDF with the following DDPs: (i) log P ≤ 3, (ii) molecular mass ≤ 300, (iii) hydrogen bond donors ≤ 3, (iv) hydrogen bond acceptors ≤ 3 and (v) rotatable bonds ≤ 3)24. In all, we used 10 DDFs (including variants), i.e. Ro5 (2 variants), Lead Likeness (2 variants), Ghose, Muegge, Veber (2 variants), Bioavailability, and Ro3.
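How a DDF and its "n out of m criteria" variants are applied can be sketched in a few lines. The example below is a hedged illustration, not the JChem implementation: the Ro3 limits are those listed above, the Ro5 limits are the widely cited Lipinski values (the exact values used by JChem are given in Table 1 and are not reproduced here), and the class, function names and descriptor values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Pre-computed DDP values for one molecule (hypothetical container)."""
    log_p: float
    mass: float
    h_donors: int
    h_acceptors: int
    rotatable_bonds: int


# Upper limits per DDP. Ro3: as listed in the text.
RO3 = {"log_p": 3, "mass": 300, "h_donors": 3, "h_acceptors": 3, "rotatable_bonds": 3}
# Ro5: widely cited Lipinski limits (assumption; see Table 1 for the values used).
RO5 = {"log_p": 5, "mass": 500, "h_donors": 5, "h_acceptors": 10}


def criteria_met(mol, upper_limits):
    """Number of criteria for which the DDP value is within its upper limit."""
    return sum(getattr(mol, name) <= limit for name, limit in upper_limits.items())


def passes(mol, upper_limits, min_met=None):
    """True if at least min_met criteria are met (default: all of them)."""
    required = len(upper_limits) if min_met is None else min_met
    return criteria_met(mol, upper_limits) >= required


# Hypothetical descriptor values, chosen only to exercise the variants.
small_polar = Descriptors(log_p=2.5, mass=150.0, h_donors=1, h_acceptors=1, rotatable_bonds=1)
lipophilic = Descriptors(log_p=6.2, mass=204.4, h_donors=0, h_acceptors=0, rotatable_bonds=4)

for mol in (small_polar, lipophilic):
    print(
        passes(mol, RO3),             # Ro3: all five criteria
        passes(mol, RO5),             # Ro5, standard variant: four out of four
        passes(mol, RO5, min_met=3),  # Ro5, relaxed variant: three out of four
    )
```

The `min_met` parameter captures the difference between the standard and the relaxed Ro5 variant; filters with lower limits (e.g. a minimum molecular mass) would need a range check instead of the upper-limit comparison used here.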

Statistical analyses

Statistical analyses were performed using GraphPad Prism. For correlation analyses, the Spearman rank correlation coefficient (ρ) was calculated.
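For readers without access to GraphPad Prism, the Spearman coefficient can be computed as the Pearson correlation of the ranks, with tied values receiving their average rank. The sketch below is a plain-Python illustration of that definition, not the Prism implementation; it returns ρ only and does not compute a p-value, and the input values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical DDP values for five molecules (not data from the study).
molecular_mass = [136.2, 154.3, 164.2, 204.4, 212.2]
rotatable_bonds = [1, 1, 3, 0, 4]
print(spearman_rho(molecular_mass, rotatable_bonds))
```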

Software versions & databases

We used GraphPad Prism versions 7.0.5–8.1.2 (www.graphpad.com), JChem for Office versions 17.22–19.14 and Marvin versions 18.16–19.14 (ChemAxon), Office 365 ProPlus (Microsoft), RapidMinerStudio versions 9.0–9.5 (RapidMiner) and ChemMine Tools (Girke Lab)82. The databases ChemSpider (Royal Society of Chemistry), PubChem (National Institutes of Health), Chemistry WebBook (National Institute of Standards and Technology) and DrugBank version 5.1.4 (www.drugbank.ca)62 were accessed between August 2017 and July 2019.