A curated binary pattern multitarget dataset of focused ATP-binding cassette transporter inhibitors

Multitarget datasets that correlate the bioactivity landscapes of small molecules toward different related or unrelated pharmacological targets are crucial for novel drug design and discovery. ATP-binding cassette (ABC) transporters are critical membrane-bound transport proteins that impact drug and metabolite distribution in human disease as well as disease diagnosis and therapy. Molecular-structural patterns are of the highest importance for the drug discovery process, as demonstrated by the novel drug discovery tool ‘computer-aided pattern analysis’ (‘C@PA’). Here, we report a multitarget dataset of 1,167 ABC transporter inhibitors analyzed for 604 molecular substructures in a statistical binary pattern distribution scheme. This binary pattern multitarget dataset (ABC_BPMDS) can be utilized for the intended design of (i) polypharmacological agents, (ii) highly potent and selective ABC transporter-targeting agents, and (iii) agents that avoid clearance by the focused ABC transporters [e.g., at the blood-brain barrier (BBB)]. The information provided will facilitate not only novel drug prediction and discovery of ABC transporter-targeting agents, but also drug design in general in terms of pharmacokinetics and pharmacodynamics.

These determinants include descriptors that conserve certain physicochemical features of the small molecules of interest, such as the calculated octanol-water partition coefficient (CLogP), molecular weight
www.nature.com/scientificdata
Data Curation - Bioactivity Data.
Dataset Update and Complementation. New reports, particularly from 2021 and 2022, were taken into consideration to update the dataset with compounds that were evaluated against the three transporters ABCB1, ABCC1, and ABCG2. In total, 22 new compounds were added to the list of qualified compounds 7,40-42. In addition, we conducted an extended literature search, focused particularly on known standard inhibitors of ABCB1, ABCC1, and ABCG2, to obtain bioactivities with less mathematical uncertainty that also align well with our empirical experience in the laboratory. These compounds included verapamil (ABCB1 43), cyclosporine A (ABCB1 41,43-46 and ABCC1 31,44-46), verlukast (ABCC1 31-36), and Ko143 (ABCG2 41,45). As a side note, the additional literature search also resulted in an update of the bioactivity data of the natural compound piperine 47. In the curation process to complement bioactivity values, we found that two compounds had been erroneously included in the dataset (apatinib 48 and ceritinib 49). Neither was evaluated against ABCC1; both therefore did not qualify for this dataset and were removed.
Complementary Data Analysis. The bioactivity of several inhibitors could only be described as an estimation (either given as a span, marked as 'active', or annotated with '>', '≥', '<', or '~' in the previous dataset 7,27). However, to allow for the use of the entire dataset in mathematical and computational operations, we sought to allocate defined bioactivity values to these compounds. Hence, the individual reports were analyzed, and the given indications of bioactivity [e.g., screening figures, flow-cytometry histograms, or tables with bioactivity values other than IC50 values (e.g., percentages)] were taken into consideration for further data analysis. The specific bioactivity value (e.g., percentage inhibition) was extracted and correlated to the compound concentration used. IC50 values were calculated with GraphPad Prism version 8.4.0 by applying the three-parameter logistic equation with a fixed Hill slope (= 1.0) and were listed in the new multitarget dataset. A detailed curation protocol is provided on https://www.zenodo.org 50 as well as on the http://www.panabc.info website, and the related GraphPad Prism file containing the concentration-effect curves can be accessed without restrictions. In total, the bioactivity data of 104, 77, and 73 ABCB1, ABCC1, and ABCG2 inhibitors, respectively, have been calculated and complemented.
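As a rough illustration of how a single percentage-inhibition reading relates to an IC50 estimate, the following sketch inverts a logistic curve with the Hill slope fixed at 1.0. It assumes, beyond what is stated in the text, that the bottom and top plateaus are fixed at 0% and 100% inhibition. The reported values were obtained by curve fitting in GraphPad Prism, so this closed-form shortcut is not the procedure actually used, and the function name and example numbers are hypothetical.

```python
def ic50_from_single_point(conc_um: float, inhibition_pct: float,
                           bottom: float = 0.0, top: float = 100.0) -> float:
    """Estimate an IC50 (in µM) from one concentration/inhibition pair.

    Assumed model (Hill slope = 1.0):
        inhibition = bottom + (top - bottom) / (1 + IC50 / conc)
    The plateaus `bottom` and `top` are assumptions, not values from the paper.
    """
    if conc_um <= 0:
        raise ValueError("concentration must be positive")
    if not bottom < inhibition_pct < top:
        raise ValueError("inhibition must lie strictly between the plateaus")
    # Fraction of the maximal effect reached at this concentration.
    fraction = (inhibition_pct - bottom) / (top - bottom)
    # Solve the model above for IC50.
    return conc_um * (1.0 - fraction) / fraction


# Hypothetical reading: 20% inhibition at 10 µM.
estimate = ic50_from_single_point(10.0, 20.0)
print(f"estimated IC50: {estimate:.3g} µM")
```

Under these assumptions, a reading of exactly 50% inhibition returns the tested concentration itself, which serves as a quick sanity check on the inversion.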
Therefore, we decided to allocate an arbitrary value of 100 µM to these compounds to acknowledge their minor inhibitory potential against ABCC1. Dihydrodibenzoazepine derivative 4i 53, dregamine derivative 2 54, and tabernaemontanine derivative 22 54, on the other hand, reached over 100% inhibition at concentrations of 2.50 µM, 20.0 µM, and 20.0 µM, respectively. Unfortunately, these were the only indications of bioactivity given by the authors of the original reports 53,54. Hence, we decided to allocate arbitrary values of 0.999 µM 53, 4.99 µM 54, and 4.99 µM 54, respectively, to acknowledge their potentially (very) high inhibitory power against ABCB1 as well as ABCG2, considering the effect-concentrations used in the original reports. These arbitrary IC50 values were chosen because sub-classifications of bioactivity classes according to bioactivity thresholds (e.g., 1 and 5 µM) provided better predictions in our previous work 7.
Data Unification. Several compounds were evaluated in multiple assays, e.g., the mentioned standard inhibitors of ABCB1, ABCC1, and ABCG2. However, to allocate one bioactivity value to each compound, a unification process was necessary. As IC50 values do not follow a normal distribution, the multiple IC50 values associated with one compound were subjected to a three-step mathematical operation: (i) logarithmization of the IC50 values; (ii) calculation of the mean; and (iii) delogarithmization of the log(IC50)-mean value. The resultant mean value was allocated to the respective compound. It should be noted that the bioactivities of the compounds curcumin I-III (ABCC1) 55 and gefitinib (ABCB1 and ABCC1) 56 were only given as a span in the original reports 55,56, and hence, the mean of the respective span was taken for further operations. In total, 60, 48, and 209 ABCB1, ABCC1, and ABCG2 inhibitors, respectively, have been given a new bioactivity value by these operations compared to the previous multitarget dataset 7,27.
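The three-step unification described above is equivalent to a geometric mean and can be sketched as follows. The function name and the example values are hypothetical, and base-10 logarithms are an assumption (the result does not depend on the base).

```python
import math


def unify_ic50(values_um):
    """Combine several IC50 values (µM) for one compound into one value.

    Steps: (i) take log10 of each value, (ii) average the logs,
    (iii) raise 10 to the mean, i.e. compute the geometric mean.
    """
    if not values_um:
        raise ValueError("at least one IC50 value is required")
    if any(v <= 0 for v in values_um):
        raise ValueError("IC50 values must be positive")
    logs = [math.log10(v) for v in values_um]   # (i) logarithmization
    mean_log = sum(logs) / len(logs)            # (ii) mean of the logs
    return 10 ** mean_log                       # (iii) delogarithmization


# Hypothetical example: two assays reporting 1 µM and 100 µM.
unified = unify_ic50([1.0, 100.0])
print(f"unified IC50: {unified:.3g} µM")
```

For these two hypothetical values the arithmetic mean would be 50.5 µM, whereas the log-based mean sits at the midpoint on the logarithmic scale, which is why the latter suits data spread over orders of magnitude.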
Data Correction and Harmonization. Through the complementary analysis process, several bioactivity values were corrected. This applied to compounds that were falsely marked as 'inactive' in the previous multitarget dataset (ABCB1: 22 compounds; ABCC1: 26 compounds; ABCG2: 19 compounds) 7,27. Lastly, all bioactivity values of the ABC_BPMDS were harmonized to three significant digits. This harmonization resulted in a standardized format of presentation: (i) 'XXX0 µM'; (ii) 'XXX µM'; (iii) 'XX.X µM'; (iv) 'X.XX µM'; (v) '0.XXX µM'; and (vi) '0.0XX µM' (X = any numeric value between 1-9). Here, 11, 8, and 9 ABCB1, ABCC1, and ABCG2 values, respectively, have been changed compared to the previous multitarget dataset 7,27.
hydrogens with critical discriminatory potential in the virtual screening process 7,26,27. However, the original C@PA worked with a very preliminary and limited dataset of 308 substructures compiled after multitarget dataset visualization and literature consideration 65, of which only 162 substructures were active in the multitarget dataset, which comprised 1,049 compounds at the time of the study 27.
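The harmonization of bioactivity values to three significant digits, described under Data Correction and Harmonization, can be sketched as below. The helper name is hypothetical, and the rounding behavior at exact ties follows Python's built-in `round`, which may differ from the rounding used for the dataset.

```python
from math import floor, log10


def round_sig(value: float, sig: int = 3) -> float:
    """Round `value` to `sig` significant digits."""
    if value == 0:
        return 0.0
    # Position of the leading digit relative to the decimal point.
    exponent = floor(log10(abs(value)))
    # Number of decimal places to keep (negative rounds left of the point).
    decimals = sig - 1 - exponent
    return round(value, decimals)


# Hypothetical IC50 values (µM) spanning several orders of magnitude.
for raw in (1234.0, 123.4, 12.34, 1.234, 0.1234):
    print(raw, "->", round_sig(raw))
```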
Substructure Visualization, Identification, and Extension. For the development of a complete, detailed, and novel (multitarget) fingerprint, which may also be used universally in (multitarget) virtual screening approaches, the 1,167 compounds of the updated multitarget dataset were visualized using ChemDraw Pro version 20.1.1.125, and substructures were identified and extracted. The extracted substructures [e.g., single-standing/centered (hetero-)aromatic rings, condensed (hetero-)aromatic rings, (un)saturated side chains, extremities, and non-aromatic (hetero-)cycles] were derivatized by applying a heavy atom substitution scheme as reported earlier 26 (scaffold fragmentation and substructure hopping). Particular emphasis was placed on the presence and positioning of (non-polar) hydrogens in the sense of a proton/non-proton pattern scheme. These measures increased the quantity of substructural properties covered by the intended fingerprint. In addition, alternative datasets of ABC transporter modulators 5 and modes of action (particularly ABC transporter activators) 6
Individual Pattern Analysis 7. In a final step, the multitarget dataset of 1,167 compounds was statistically analyzed for the listed 604 substructures of the substructure catalog. Here, the resultant list of hit molecules per substructure, derived from the query search function of InstantJChem version 21.13.0, was saved and compared to the original list, translating the entry differences into a binary code [1 = substructure present (active substructure); 0 = substructure not present (inactive substructure)]. The result was a binary pattern distribution scheme, which constitutes the final ABC_BPMDS. It should be noted that the number of occurrences of the same substructure within the same compound was irrelevant; the presence (numeric value = 1) of the substructure does not express how often the respective substructure appears within the compound.
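The translation of per-substructure hit lists into a binary pattern can be sketched as follows. This is only an illustration of the comparison step under simplified assumptions: the compound identifiers, the hit lists, and the function name are hypothetical, and the actual hit lists were produced by the InstantJChem query search rather than by this code.

```python
def binary_pattern(compound_ids, hits_by_substructure):
    """Return {compound_id: {substructure: 0 or 1}}.

    A value of 1 means the compound appears in the hit list of that
    substructure; 0 means it does not. Counts are deliberately ignored.
    """
    # Sets make membership tests cheap and collapse repeated entries.
    hit_sets = {name: set(hits) for name, hits in hits_by_substructure.items()}
    return {
        cid: {name: int(cid in hits) for name, hits in hit_sets.items()}
        for cid in compound_ids
    }


# Hypothetical compound list and hit lists for two substructures.
compounds = ["cpd_1", "cpd_2", "cpd_3"]
hits = {
    "piperazine": ["cpd_1", "cpd_3", "cpd_3"],  # repeated entry still yields 1
    "quinazoline": ["cpd_2"],
}
pattern = binary_pattern(compounds, hits)
for cid in compounds:
    print(cid, pattern[cid])
```

Each row of the resulting mapping corresponds to one compound and each key to one substructure, so a repeated entry in a hit list still yields a single 1, in line with the presence-only encoding described above.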
Physicochemistry Space Validation. Physicochemical properties not only shape the pharmacological profile of ABC transporter inhibitors 66-69, but are also very often used as additional discriminators in virtual screening processes 7,26,27,38. To verify that the 1,167 compounds of the ABC_BPMDS have a balanced distribution of physicochemical attributes, the ABC_BPMDS was analyzed for CLogP, MW, MR, and TPSA using MOE version 2019.01. Figure 3 demonstrates that these physicochemical properties are normally distributed within the ABC_BPMDS, comparable to other reported datasets 23,70. Table 4 Table 5 summarizes the median and mean values of H-bond donors, H-bond acceptors, and rotatable bonds of the entire ABC_BPMDS as well as important sub-classes. Hence, the ABC_BPMDS contains suitable templates for future drug design and therapeutic development purposes, but also leaves enough molecular-structural and physicochemical space for explorational analyses beyond the 'Lipinski rule of five' for the creation of inhomogeneous high-quality compound collections and compound libraries.
Usage Notes
Status Quo.
Practical Use. An easy-to-use sort function allows the user to discriminate the compounds by their bioactivities toward the targets, physicochemical properties, or molecular-structural features, and also by the 604 different substructures. Hence, the user can retrieve the necessary binary pattern information for subsequent virtual screening and rational drug design approaches.
Special Considerations. The majority of the compounds were evaluated in full concentration-effect curves within the original report, providing either a single IC50 value or two IC50 values from different assays for biological validation, resulting mostly in minor standard deviations or standard errors. However, for established reference compounds, many IC50 values have been reported that are not fully covered by the deep literature search. Moreover, these drugs and drug-like compounds were tested in various assays, and thus, their IC50 values vary over a wider span than those of other compounds. In addition, data processing prior to the original publication varied from laboratory to laboratory [e.g., number of concentrations tested, manner of assay performance (non-standardized procedures), manner of data analysis (e.g., three- vs four-parameter logistic equation, relative vs absolute inhibition), data presentation (single-point screening graphic vs full concentration-effect curve, number of significant digits, inclusion or exclusion of standard deviation and/or standard error)], contributing to a greater uncertainty of these particular data. Furthermore, the assays that were considered for the ABC_BPMDS were themselves diverse [e.g., influx vs efflux assay, fluorescence labeling vs radionuclide detection, type of substrate (e.g., calcein AM vs mitoxantrone), selected cells vs transfected cells vs membrane vesicles], contributing to a general variation in data that is hidden because most compounds were only evaluated in one particular assessment system. These aspects should be considered when using the ABC_BPMDS; at the same time, it should be noted that our previous work demonstrated the strength of substructural patterns based on the previous version of the ABC_BPMDS 7,26,27.
A list of compounds affected by these variations in assessment systems can be found in the curation protocol at https://www.zenodo.org 50.
5,7,25-27, and verlukast 5,7,25-27,33. These 'truly multitarget pan-ABC transporter inhibitors' 25 are the primary focus for extension of the ABC_BPMDS, particularly with respect to their substructural elements that promote multitargeting. On the other hand, the addition of multitarget agents that are not part of the ABC_BPMDS will contribute valuable input to the polypharmacological space as charted by the future ABC_BPMDS_1.2. The substructural elements of the mentioned truly multitarget pan-ABC transporter inhibitors include 4-anilinopyrimidine 7,27, benzyl 7, cyano 7,27, 3,4-dimethoxyphenyl 7, fluorine 7,27, furan 7,26, ethylene diamine 7, ethylene hydroxy 7, hydroxy 7, isopropyl 7,27, methylene hydroxy 7, phenethyl 7, piperazine 7,27, pyrimidine 7,26, quinazoline 7,27, thiazole 7,26, and thioether 7. These and other substructures will be re-evaluated with respect to true multitargeting, and thus receive a differential value dependent on the purpose of the subsequent studies. Furthermore, the addition of multitarget agents that are not part of the ABC_BPMDS will contribute valuable input to the substructure catalog, extending the substructural output of