Cheminformatics Analysis and Modeling with MacrolactoneDB

Macrolactones, macrocyclic lactones with at least twelve atoms within the core ring, include diverse natural products such as macrolides with potent bioactivities (e.g. antibiotics) and useful drug-like characteristics. We have developed MacrolactoneDB, which integrates nearly 14,000 existing macrolactones and their bioactivity information from different public databases, and new molecular descriptors to better characterize macrolide structures. The chemical distribution of MacrolactoneDB was analyzed in terms of important molecular properties and we have utilized three targets of interest (Plasmodium falciparum, Hepatitis C virus and T-cells) to demonstrate the value of compiling this data. Regression machine learning models were generated to predict biological endpoints using seven molecular descriptor sets and eight machine learning algorithms. Our results show that merging descriptors yields the best predictive power with Random Forest models, often boosted by consensus or hybrid modeling approaches. Our study provides cheminformatics insights into this privileged, underexplored structural class of compounds with high therapeutic potential.


Supplemental Methods 1 MacrolactoneDB
The web interface of MacrolactoneDB (http://macrolact.collabchem.com/), provides options to subset macrolactones based on multiple criteria for which users can provide a range or a single minimal, maximal threshold value. For many filters, users can fill in both entries to provide a range wherein the first entry is designated for minimal values of the range and the later for the maximal. Providing only the first or the last entry designates a threshold value for minimal and maximal margins respectively. For example, if a numerical value "1" is provided in the first entry and the second entry is left blank for the filter e.g "Number of Sugars", the compounds with at least one sugar moiety will be filtered. If, however, "1' is filled only in the second entry and the first entry is left empty, the compounds with at most one sugar moiety will be selected.
For structures with exactly one sugar, users can submit "1" in both the first and second entries for the "Number of Sugars" filter.
Below is a list of filters currently employed in MacrolactoneDB: Ring Filters 1. Ring size: Ring size is defined as the number of atoms in the core ring structure with ring size larger than 12. MacrolactoneDB contains macrolactones which are at least 12membered ring structures. MacrolactoneDB contain two or more esters within the core rings. Classic macrolides normally contain one ester within the core rings. Users can filter classic macrolides using this filter by submitting "1" in both the first and second entries.
Molecular Property Based Filters 8. Molecular weight: the mass of one mole of a given molecule; users can restrict the chemical space of macrolactones by their mass. It is useful for users who are interested in studying only small molecules (i.e. <900 Daltons) or larger macrolactones. 9. SlogP: hydrophobicity as assessed by the fragment-based octanol/water coefficient partition 10. Meet Lipinski Rules? Filter compounds on whether they are within druglike or oral bioavailable region according to Lipinski's rule of five; backend module by Mordred descriptor calculator 1 . Users can select "Whatever" for no preference; which will provide a chemical dataset of compounds whether they are within the druglike region or not.

Activity Reported? Filter compounds by whether their bioactivities have been reported in
ChEMBL. Users can select "Whatever" for no preference. "Whatever" option will provide a chemical dataset of compounds whether their bioactivities are reported or not. The scripts for data cleaning, manipulation, computing descriptors / fingerprints and developing QSAR models were written in python 3.6.3. The following modules were used: pandas 13 , numpy 14 , scipy 15 , RDKit 8 and mordred 1 . SMILES for macrolactone ligands were standardized with molvs module 16 and used as input for descriptor / fingerprint generation.

Machine learning Methods
Random Forest (RF) -an ensemble learning method that uses consensus of decision trees to predict the endpoint for regression problems 17,18 . RFs are robust to hyperparameter tuning and work well with different types of molecular descriptors. Known as the "gold standard", they have been shown to perform reasonably well for small and large chemical datasets.
Naïve Bayes (NB) -a probabilistic algorithm based on Bayes' theorem that assumes features are independent and distributed differently. Despite its simplicity, NB predictions are more accurate than other complex ML methods, especially for classification tasks. It is widely used in cheminformatics due to easy implementation and extremely fast computation. Support Vector Regression (SVR) -a kernel regression function that finds a hyperplane or set of hyperplanes in a high-dimensional space. It is one of the most popular methods in cheminformatics commonly used to predict toxicity related endpoints [19][20][21][22][23][24] .
K nearest neighbors (KNN) -one of the simplest algorithms which predicts a numerical target based on K closest or most similar neighbors 25 . KNN is widely used in a plethora of cheminformatics studies to predict physicochemical properties (melting point 26,27 , boiling Hybrid takes the average of two best performing ML methods. In our case studies, we applied two hybrids which are RF_KNN (average predictions from RF and KNN) and RF_DNN (average predictions from RF and DNN).

Supplemental Results 1 Cluster Analysis of Macrolactone Ligands
In the Hepatitis C dataset, we found an activity cliff (two 19-membered macrolactones indicated with red arrows in Figure S6B) by ECFP6, fragment-based approach. Their structures were shown in Figure S6C. The only difference between these two structures was a N2 (tertiary amine in the piperidine component, highlighted yellow in Figure S6C). Despite this minor structural difference, CHEMBL601195 and CHEMBL591648 afforded pIC50s of 7.22 and 5.84 respectively. They were however clustered relatively far apart and identified to be less similar via mordred_mrc. This demonstrates that despite high structural similarity, they had different molecular properties accounted for by mordred_mrc which contributed to various pIC50s and an activity cliff. Applying mordred_mrc to QSAR models may establish better chemical relationship between molecular properties and biological endpoints for these types of chemical analogues, while ECFP6 may have some challenges in this regard.
Additionally, we found a chemically unique macrolactone (CHEMBL2407590) based on mordred_mrc ( Figure S6A, blue arrow), yet not as unique based on ECFP6 ( Figure S6B, blue arrow). Its structure was shown in Figure S6D. It was a 33-membered peptide macrolactone, an attractive example to predict with our machine learning models to assess the relevance of structural fragment-based approaches like ECFP6 (10-fold CV predictions shown in Figure S19B and in Results S4).
In the T-Cells Dataset, an activity cliff was identified between CHEMBL422367 and CHEMBL347589 (blue cluster) based on ECFP6 ( Figure S7). The former afforded experimental pIC50s of 8.67 and the later 5.88. With mordred_mrc, they were, however, located in different clusters occupying similar shades of pIC50s. In other words, the activity cliff identified via ECFP6 was resolved in mordred_mrc. The structures were shown in Figure S7D. Herein, mordred_mrc captured the different molecular properties derived from the structural differences at C15 (highlighted yellow in Figure S7D). In this scenario, reducing the polarity of the side chain by replacing the ethanone component to a methyl significantly improved the activity of the analog.
Another structure which stood out in mordred_mrc based on relatively low pIC50 yet clustered with chemicals having higher pIC50s was shown in Figure S7E. It was however clustered together with other structures sharing similar pIC50s by mordred_mrc. It contained the same core scaffold as all the other chemical analogues in the previous purple and blue clusters with variations in the structural motif at C15.

Supplemental Results 2 Distribution Analysis of MRC Properties
We additionally looked at the distribution of ring sizes present in these case studies along with the distribution of pIC50 and number of compounds associated with corresponding ring sizes ( Figure S8). 15-membered macrolactones (the most common ring size with 204 compounds) has a relatively lower variation in pIC50 values which range from 4.89 to 8.61. It might be ideal to model just 15-membered ring structures since the applicability domain would be limited and the performance of the model could be improved, however it could also introduce bias into our models. We can learn a lot from such large-scale modeling with all the machine learning algorithms and descriptor sets in this study. We do not know with certainty the level of influence of ring sizes or how critical they are towards their inhibition. Additionally, if implicit fingerprints like ECFP6 can sufficiently encode important features of these ring structures and predict biological endpoints, fragment-based descriptors may be an appropriate approach for building predictive models for structures with varying ring sizes, even for very large complex structures.
Thus, restricting the dataset based on the most prevalent ring size would limit our exploration and understanding of this structural class. We used the entire dataset of pIC50 and the modeling results to better understand and gain insights into macrolactones. Additionally, due to the unbalanced distribution and limited variation of ring size ( Figure S8A), sugars ( Figure S9) and core esters (not reported; 221 macrolactones have 1 core ester and 2 macrolactones have 2 core esters) in the dataset of case study I (Plasmodium falciparum), we do not expect the prediction boost from adding mrc descriptors to mordred. We reported 10-fold CV predictions from our QSAR modeling workflow on CHEMBL2303629 and CHEMBL487366, two structures with highly unique and complex ring sizes (the only 27-, 36-membered macrolactones) from the Plasmodium falciparum macrolactone ligand dataset.
In the second case study with Hepatitis C Virus, 18-membered macrolactones were the most dominant structures (65 compounds) with pIC50 ranging from 6.33 to 8.7. 19-membered macrolactones (12 compounds) covered a similar range of pIC50 from 6.22 to 8.07, whereas 20membered macrolactones covered a larger pIC50 range from 4.5 to 8.01. From this dataset, it would be interesting to observe our machine learning predictions on ring size-based outliers such as 14, 21, and 33-membered macrolactones. Similar to the first case with Plasmodium falciparum, we do not expect to see the prediction boost from our mrc descriptors to mordred in the second case study with Hepatitis C due to limited variation and heavily uneven distribution in ring sizes ( Figure S8B), sugars and core esters (not reported). In Results S4, we show model predictions on CHEMBL591604 (the only 21-membered) and CHEMBL2407590 (the only 33-membered macrolactones), the two largest ring structures from Hepatitis C macrolactone ligand dataset.
The T-cells target, our third case study, has the least variation in terms of ring sizes which account for 20, 21 and 29-membered macrolactones ( Figure S8C), sugars and core esters (not reported). 21-membered ring size was the most dominant among all others with a total number of 87 compounds and covered larger pIC50 distribution which range from 4.93 to 9.74. Due to the small variation and uneven distribution of these properties, we do not expect to see a noticeable impact of mrc descriptors by complementing mordred descriptors.

Supplemental Results 3 Tuning parameters and feature extraction techniques to improve performance
For all targets, we implemented a) base ML models with no parameter tuning, b) models with more than 80% PCA feature extraction and tuned parameters, c) models with tuned parameters (Figure S1). For base models, only default parameters were used; no parameter tuning / selection was performed. For pca_tuned and tuned models, the best hyperparameters were selected by performing randomized search on 100 sample parameter settings with 10-fold cross-validations on the associated full datasets. The hyperparameters used in grid search areas across 5 MLs for pca_tuned and tuned models have been provided in Table S4. Additionally, DNN models use 300 training epochs with early stopping method to prevent overfitting. The hyperparameters for DNN covered in this study are provided in Table S5.
For all case studies of Plasmodium falciparum, Hepatitis C and T-cells datasets, we first generated heatmaps for eight machine learning (ML) models and six descriptors (excluding mrc descriptors) for base, pca_tuned and tuned models in the evaluation metrics of R 2 -coefficient of determination, and MAE -mean absolute error. All the modeling results in this study were consistently from 10-fold cross-validation. The results from mrc descriptor set were excluded due to consistently low performance in comparison to the rest; thereby affecting the heatmap. In our heatmaps, green is associated with "good" performance, while red is associated with "poor" performance; thus "green" indicates high R 2 and low MAE, and "red" indicates low R 2 and high MAE. The heatmaps of R 2 and MAE across base, pca_tuned and tuned models for Malaria, Figure S10, Figure S11 and Figure S12 respectively. In all three case studies, pca_tuned or tuned models had better performance than base models (greener in Figure S10, Figure S11, Figure S12) though it varies from one ML algorithm to another. DNN and RF methods which can discriminate and select features on their own without needing much feature pruning or reduction work better with just parameter tuning. In fact, their performance worsened in all three cases when the PCA feature extraction technique was applied.

Hepatitis C, T-cells have been provided in
It can be seen in the RF ( Figure S13A) and DNN ( Figure S13B) boxplot analysis of R 2 distribution across six descriptor sets for pca_tuned and tuned methods in three case studies. The SVR algorithm, which was observed to perform consistently well across the descriptor sets with the PCA-tuned method, performed worse in the tuned method in which ≥80% PCA feature extraction method was removed ( Figure S14). In this scenario, applying PCA that explains 80% of the variables to the SVR algorithm boosted the performance because feature selection and/or extraction is an important part of developing good SVR models. It was perhaps due to removing less relevant features along with the noise that could be involved in 20% least explainable variables by PCA.
To interpret more easily across all case studies, we generated an average R 2 and MAE from 10-fold cross-validation of eight ML models and six descriptors (excluding mrc descriptors) for base, pca_tuned and tuned methods separately. It allowed us to compare the general performance of these three methods and see the impact of ≥80% PCA feature extraction and parameter optimizations. The resulting clustered columns for R 2 and MAE for all three targets have been provided in Figure S15. The tuned method yielded the best performance in terms of R 2 and MAE across the case studies. Except for the first case study of Plasmodium falciparum (CHEMBL364), PCA-tuned models also performed better than base models which only use default parameters.

Performance of ML Workflow with Macrolactones
In this section, we took a few unique macrolactone ligand samples from each case study (Plasmodium falciparum, Hepatitis C and T-cells targets) and assessed the previously conducted 10-fold cross validation predictions from machine learning workflow on these individual compounds. We previously analyzed the distribution of ring sizes in each case study and sampled those with unique ring structures or complex, challenging structures based on circular dendrogram analysis. These are chemicals of interest because they stand out from the rest of the dataset due to either their ring size, structural complexity or molecular properties. The goal of this analysis is to assess the predictive power of QSAR models on these "misfit" or "unique" macrolactones even though the training data did not include many structures within their chemical proximity. This is because larger, more complex macrolactone rings are harder to organically synthesize, and are thus more difficult to test and retrieve experimental values.
Consequently, they cover a sparse chemical space even though they are still of high research interest due to favorable biological activities reported. The importance of QSAR modeling could certainly be emphasized if good predictions can be made on such complex, uniquely ring-sized molecules especially if the training set barely covers the chemical scope of these structures.
In the case study I with Plasmodium falciparum, we picked two chemicals with unique ring sizes: CHEMBL2303629 (27-membered), CHEMBL487366 (36-membered). The predictions on CHEMBL2303629 (27-membered) and CHEMBL487366 (36-membered) are shown in Figure   S18. These predictions within 0.1 of experimental biological endpoints are bolded and will be referred from now on as "top" predictions. CHEMBL2303629 is, in fact, a highly complex structure with two large rings (26-and 27-membered) fused together with five 5-membered rings and three 6-membered rings. The only other structure containing two large structures in the Malaria dataset was CHEMBL1682585, which contained 12-and 15-membered rings. Regardless of whether that analogue, CHEMBL1682585, was part of the training data in the cross-validation process, the predictions from machine learning workflow on CHEMBL2303629 was quite impressive ( Figure S18A). In fact, 8 combinations of ML and descriptors afforded "top" prediction values. Another compound of interest, CHEMBL487366, was the only 36-membered macrolactone in the Malaria dataset affording experimental pIC50 of 6.1 (Figure S18B). It is relatively less complex than the previous structure (CHEMBL2303629); nonetheless, the biggest macrolactone in the Plasmodium falciparum dataset with only two 31-membered macrolactones as the second biggest ( Figure S8A). The machine learning workflow had highly successful predictions on this compound; 11 combinations of ML and descriptors afforded "top" predictions, which fell within 0.1 of its experimental pIC50.
It is important to note the high success rate of our models affording eight and eleven "top" predictions on these two uniquely ring-sized and complex molecules of CHEMBL2303629 (27-membered) and CHEMBL487366 (36-membered) respectively though the training set was dominated heavily by 15-membered ring sizes (Results S2, Figure S8A).
In the second case study with Hepatitis C, we picked out two macrolactones with 21-and 33-membered macrolactones which were unique ring size-wise. The predictions from machine learning workflow on 21-membered (CHEMBL591604) and 33-membered (CHEMBL2407590) were shown in Figure S19. The former, the second largest ring structure in the Hepatitis C dataset, had experimental pEC50 of 8.08 (Figure S19A). There was only one "top" predicted pIC50 of 8.01 afforded by KNN algorithm and "all" descriptors. The second chemical, CHEMBL2407590 (33membered) was sampled out in the circular dendrogram analysis as a unique chemical based on mordred_mrc. It was the single largest ring structure among the dataset of Hepatitis C Virus macrolactone ligands (Figure S8B). The machine learning workflow had more success with this compound in achieving six "top" predictions ( Figure S19B). Overall in this second case study, the QSAR models still performed well and afforded "top" predictions on these two challenging and large ring size macrolactones: one ML and descriptor combination on the 21-membered CHEMBL591604 and six on 33-membered CHEMBL2407590.
In the third case study, there were three ring size variations in total, and no singled-out macrolactone with unique ring size ( Figure S8C). Thus, we looked at "misfit" compounds from the activity cliffs and reported the findings in the Results S4. The important takeaways from a close look at "misfit" macrolactones from these studies was the models' high predictability on uniquely different (or large) ring structures especially with mordred_mrc, ECFP6 and RF algorithm. Though they were one-of-a-kind in terms of ring size and structural complexity, our models afforded top predictions on these macrolactones as shown in the case studies of Malaria and Hepatitis C targets.
In the third case study with the T-cell dataset, we observed several activity cliffs by mordred_mrc ( Figure S7A) and ECFP6 (Figure S7B). The first activity cliff (purple cluster containing three compounds -CHEMBL269732, CHEMBL351140, CHEMBL158921) was identified by mordred_mrc ( Figure S7A). The activity cliff was however resolved by ECFP6 since they were clustered apart with other analogues sharing similar shades of pIC50s (purple arrows in Figure   S7B). Their structures were shown in Figure S7C. In that cluster, the compounds shared the same core scaffolds, but are differentiated in the side chains at C15 (highlight yellow) which contributed to widely varying pIC50s of 9.41, 8.3 and 5.68. Even though they were classified to be similar by mordred_mrc, ECFP6 distinguished them based on minor structural differences.
These structural variations may play an important role in the interactions with the protein target though they are similar in terms of chemical properties. In this example, the shorter side chain at C15 afforded a higher, more favorable pIC50.
Another activity cliff was observed between two chemicals (CHEMBL162354 and CHEMBL352172) based on both mordred_mrc and ECFP6 (orange oval indicated with orange arrows). Their structures were shown in Figure S20A and Figure S20B along with experimental pIC50s and QSAR 10-fold CV predictions. These analogues were highly similar structure-wise and property-wise with the only difference being a single methyl extension at C15 position in the core ring structure. However, they afforded vastly different experimental pIC50s of 5.37 and 8.58. It is surprising that reducing one methyl from the side chain significantly improved pIC50s. Perhaps, this small structural change played an important role in accommodating the compound to fit well within the binding site of the target or it induced different poses or conformations of the ligand that favors interaction with the amino acids in the binding pocket. It would be useful to couple with docking studies in another project to further analyze these sets of analogues to better understand the contributions of such structural differences.
An activity cliff was found between CHEMBL162354 and CHEMBL352172 (highlighted blue oval) from circular dendrogram analysis based on both mordred_mrc and ECFP6. It is important to note again that these two structure2 differ slightly by a single methyl at C6 of the core ring structure yet yielded moderately different experimental pIC50s. Understandably, the machine learning workflow did not perform as well on these analogues as in the previous case studies. No combination of ML and descriptor set afforded "top" predictions; thus, the single closest prediction value to the experimental pIC50 were underlined to assess its proximity of the best prediction value. CHEMBL162354 afforded experimental pIC50 of 5.37. A closer look at the circular dendrogram based on ECFP6 (Figure S7B, orange oval indicated with orange arrows) revealed that it was the only one among four compounds in that cluster of CHEMBL352172, CHEMBL162354, CHEMBL347139, CHEMBL351140 affording the lowest pIC50. Consequently, our models predicted very high pIC50s for this compound, especially ECFP6 across MLs (Figure S20A).
None yielded a predicted value lower than pIC50 of 6; the closest prediction came from DNN with 2Drdkit which predicted pIC50 of 6.24 which was 0.87 apart from experimental pIC50.
However, the machine learning workflow had better success at predicting the other analogue CHEMBL352172 (Figure S20B) with an experimental pIC50 of 8.58 even though no combination afforded "top" predictions. Again, it is important to note that in the recently mentioned "orange oval and the closest cluster", it had a similar range of pIC50s with other chemical analogues. Perhaps, that contributed to our models predicting closer values for this compound. The closest prediction came from DNN with "all" descriptors which yielded an pIC50 of 8.68 which was exactly 0.1 away from the experimental pIC50, on the borderline of "top" prediction. In fact, our workflow overwhelmingly predicted pIC50s above 7.0.
We also sampled another compound CHEMBL347657 with an experimental pIC50 of 5.78 ( Figure S20C), based on cluster analysis of T-cells dataset with mordred_mrc in Cluster Analysis of Macrolactone Ligands. It was clustered along with other analogues sharing very different shades of pIC50s in mordred_mrc. However, it was clustered with analogues sharing similar range of experimental pIC50 based on ECFP6. Thus, we initially assumed that ECFP6 may perhaps be better at predicting the bioactivity of CHEMBL347657 than mordred_mrc. However, our workflow did not afford any "top" prediction for this compound as well, not even with ECFP6.
The closest prediction with pIC50 of 6.45 came from DNN with "all" descriptors, 0.67 higher than the actual value. It should be noted that it would be unreasonable to expect our models to make highly accurate predictions for these challenging and "misfit" macrolactones, especially with a small training dataset in the first place.
Interestingly in this case study with T-cells, DNN was able to generate the closet predictions either with "all" or 2Drdkit descriptors on these challenging macrolactones. In general, the workflow had the least success on these complex macrolactones with T-cell dataset among all three cases. Even for compounds from uneven clusters with various endpoints, DNN was doing relatively better than other ML algorithms including combined approaches such as CSS, RF_DNN and RF_KNN. It is hopeful that we can increase the predictive power of DNN models with more training data addressing these slight structural variations and associated biological endpoints. We expect including molecular dynamics-based fingerprints or descriptors will enhance the predictability of our models especially in the case study of T-cells with minor structural differences causing activity cliffs.

Supplemental Data 1.
Curated structure activity data -data supplied as a separate file.       Note: Outside this study, we attempted modeling with only pIC50s as endpoints for the Hepatitis C target; however, our models yielded poor performance as expected due to poor data distribution (not reported).