Structure-based design and classifications of small molecules regulating the circadian rhythm period

Circadian rhythm is an important mechanism that controls behavior and biochemical events based on 24 h rhythmicity. Ample evidence indicates disturbance of this mechanism is associated with different diseases such as cancer, mood disorders, and familial delayed phase sleep disorder. Therefore, drug discovery studies have been initiated using high throughput screening. Recently the crystal structures of core clock proteins (CLOCK/BMAL1, Cryptochromes (CRY), Periods), responsible for generating circadian rhythm, have been solved. Availability of structures makes amenable core clock proteins to design molecules regulating their activity by using in silico approaches. In addition to that, the implementation of classification features of molecules based on their toxicity and activity will improve the accuracy of the drug discovery process. Here, we identified 171 molecules that target functional domains of a core clock protein, CRY1, using structure-based drug design methods. We experimentally determined that 115 molecules were nontoxic, and 21 molecules significantly lengthened the period of circadian rhythm in U2OS cells. We then performed a machine learning study to classify these molecules for identifying features that make them toxic and lengthen the circadian period. Decision tree classifiers (DTC) identified 13 molecular descriptors, which predict the toxicity of molecules with a mean accuracy of 79.53% using tenfold cross-validation. Gradient boosting classifiers (XGBC) identified 10 molecular descriptors that predict and increase in the circadian period length with a mean accuracy of 86.56% with tenfold cross-validation. Our results suggested that these features can be used in QSAR studies to design novel nontoxic molecules that exhibit period lengthening activity.


Material and methods
Molecular dynamics simulation. Mouse-CRY1 (mCRY1) (PDB ID: 4K0R) which is 93% identical to human CRY1 protein was retrieved from the protein databank. The structure was solvated in a rectangular box with TIP3P water molecules with the size of 7.25 × 10 5 Å 3 and neutralized with counterions using the NAMD (v. 2.6) 41 program packages. Then the system was minimized using the conjugate gradient method and kept the backbone atoms of the protein frozen. Then further minimization steps with relaxed backbone atoms were carried out. The system was heated up to physiological temperature with 10 K increments by running 10 ps simulation at each temperature. Constraints were applied during 1.4 ns equilibration simulation where the initial force constant on the C α atoms of the protein was 2 kcal/mol/Å 2 and reduced by 0.5 kcal/mol/Å 2 for each 0.4 ns equilibration run. CHARMM-PARAM22 force field 42 was used for the molecular dynamics (MD) simulations. After the equilibration of the system, MD simulation was run at 310 0 K for 10 ns. The pressure was controlled by the Langevin piston method during the simulations. The timestep was set to 2 fs and the bonded interactions, the van der Waals interactions (12 Å cutoff), long-range electrostatic interactions with particle-mesh Ewald (PME) were included for calculating the total force acting on the system. The last frame of the simulation was used as the "receptor" for the docking simulations. RMSD values were obtained using the RMSD trajectory tool of VMD. Backbone atoms (C, CA, N, and O) of each residue were used for RMSD calculation by excluding the translational motions. Molecular docking simulations. More than 8 million small molecules with non-identified functions were used as ligands for the docking. Molecules having the following criteria were filtered to eliminate non-relevant molecules: molecules having more than 7 H-bond donors, more than 12 H-bond acceptors, more than 600 Da molecular weight, logP > 7, more than 8 rotatable bonds, less than 3 aromatic rings 43 , and less than total of 4 rings. Openbabel, Autodock4.2, Autodock Tools4 44 and Autodock Vina 45 programs were utilized to prepare ligands (small molecules) for the docking. Finally, more than 1million compounds were docked to target pockets by using the Autodock Vina program. The target pocket for FAD and FBXL3 binding site was determined based on the CRY-FBXL3 crystal structure 38 . The target pocket on CRY1 was constructed via Autodock Tools. The Center of the box was located on the side chain of Phe296 amino acid residue, and the grid box size was determined as 1.9 × 10 4 Å 3 . Another target pocket was the secondary pocket of CRY1. The Center of the box was located on the side chain of Lys11 amino acid residue, and the grid box size was determined as 2.7 × 10 4 Å 3 .
The binding energy of molecules to CRY1 was calculated by Autodock Vina which uses a novel scoring function combining the knowledge-based and empirical approaches.
Establishment of CRY1-knockout U2OS cell line. CRY1 knockout U2OS cell line was generated using the LentiCRISPRv2 system 48 . In this study, we used the LentiCRISPRv2-CRY1-T1 construct which was described previously 49 . This construct was generated using the following oligos: CRY1 Sense: 5′ CAC CGC CTT CAG G GCG GGG TTG TCG 3′; CRY1 Antisense: 5′ AAA CCG ACA ACC CCG CCC TGA AGG C 3' .
The lentivirus preparation, transduction of U2OS cells and selection of the knockout candidates with puromycin (at 0.5 mg/mL concentration) were performed as described previously 49 . CRY1 knockout candidates were screened with immunoblotting using anti-CRY1. To show the specificity of targeting CRY1, we also analyzed CRY2 protein level and actin level, which was probed as the loading control. The antibodies used for this were as follow: anti-CRY1 (A302-614A, Bethyl Labs Inc. Montgomery, TX., USA), anti-CRY2 (A302-615A, Bethyl Labs), and anti-Actin (CST-4967S, Cell Signaling Technology, Boston, MA, USA). HRP-labeled anti-rabbit antibody (Thermo Fisher Scientific, Waltham, MA, USA cat: 31460) were used at 1:5000 dilution. Chemiluminescence was developed using WesternBright Sirius HRP substrate (Advansta, San Jose, CA, USA, cat no: K-12043-D20) and images were captured using the ChemiDoc XRS + system (Bio-Rad).
Real time bioluminescence of CRY1-knockout cells. 40 × 10 4 Cry1 -/-U2OS cells were seeded to 35 mm clear plates. Then, cells were transduced with Bmal1-dLuc lentiviral particles as described in Doruk et al 22 . Next cells were reset with dexamethasone (0.1 μM final) for 2 h and then media replaced with bioluminescence media described above with DMSO or molecules (final DMSO concentration 0.5%). Plates were sealed with vacuum grease and placed to luminometer LumiCycle (Actimetrics). Each plate was recorded continuously every 10 min for 70 s at 37 °C via photomultiplier tubes for a week. Raw luminescence data were analyzed using BioDare2 (biodare2.ed.ac.uk) 50 . For each molecule, the experiment was performed three times with duplicates (at least 6 plates per molecule) The unpaired t-test with Welch's correction was used to evaluate the significance.
Classification. PaDEL descriptors of molecules were produced using ChemDes web server 51 . The 1538 descriptors were evaluated to describe the properties of molecules; details of molecular descriptors analyzed by PaDEL in ChemDes server were given in Table 1. The molecules both in the toxicity and period change datasets belong to two groups and we can categorize these datasets using binary classification, a machine learning approach to classify objects into two groups. The toxicity molecule set is composed of toxic and nontoxic molecules whereas in the period change dataset we have group of molecules that significantly change the period www.nature.com/scientificreports/ and another that does not affect it. The class membership of each molecule is explained in Results and discussion section.
The number of the molecular descriptors (and so is the size of the feature space) 1538, is high relative to the number of molecules in both datasets that may cause overfitting. As such, a classifier with a good fit on the training set may produce poor results on the test dataset. To prevent overfitting, it is necessary to select the best set of molecular descriptors and eliminate the redundant features. As an initial step, the features with a single value for all the molecules are discarded since they do not provide any information for classification. For feature selection, we used Recursive Feature Elimination (RFE) 52 , which is originally proposed for selection of gene subset from patterns of gene expression data.
RFE necessitates an external estimator to weigh the features with respect to their importance. Starting from initial feature set, the estimator is trained on the current set to get the importance of each feature and the features with the least importance are discarded. The process continues the reduced sets until a feature set with a predefined size is reached.
Decision Tree (DTC) 53 , Random Forest (RFC) 54 , Extra Trees (ETC) 55 , and Gradient Boosting 56 . Classifiers were used as classification methods, all of which can also work as an estimator for RFE. DTC assigns labels to samples on leaves of a decision tree by partitioning the feature space on each node and it is superior to other methods considering its interpretability. RFC is an ensemble classification method where multiple DTCs are trained on several subsets of the dataset and prediction is made based on the outcome of individual trees. Like RFC, ETC is based on training several DTCs. The main difference is that ETC uses the full learning set instead of its subsets. In addition, to find the best split at any node ETC uses randomly selected features. As in RFC, ETC does final prediction by majority voting of the individual trees. Gradient Boosting Classifier is a boosting algorithm that converts weak learners to stronger ones. Starting from a weak learner, decision trees, it adds new trees sequentially by minimizing a loss function using a gradient descent procedure. We used Extreme Gradient Boosting 57 (XGBC) which is an efficient, and flexible implementation of Gradient Boosting. We implemented RFE and the classification methods using the Scikit-learn package 58 and coded in Python.

Results and discussion
Structure-based small molecule design. CRYs are core clock proteins that participate in generating circadian rhythm by acting as strong transcriptional repressors of BMAL1/CLOCK transactivation in mammals 4,59,60 . Studies revealed that CRYs SNPs are associated with different types of diseases. For example, CRY1 variants have been associated with depression and mood disorders 8,61,62 , elevated blood pressure and hypertension 62 . Additionally, a CRY1 variant is linked with familial delayed sleep phase disorder and attention deficit/hyperactivity disorder 18,63 . We, therefore, selected mammalian CRY1 as a target for in silico screening to find molecules that regulate the period of the circadian rhythm. The CRY1 crystal structure (ID: 4K0R) is solved 38 . Comparison of various CRYs from different organisms shows that CRYs have variable length of extended C-terminal domains that range from 30 to 300 amino acids 3,64,65 (Fig. 1). N-terminal domain has high homology to photolyases and is called the PHR domain. The PHR domain consists of two important regions, www.nature.com/scientificreports/ called the FAD-binding domain (primary pocket) and an α/β domain (secondary pocket) which are shown to be important for the interaction with the FBXL3 and the CLOCK PAS B domain, respectively 66 . Therefore, the FAD-binding and secondary pockets were selected as targets which are shown to be important for regulating repressor activity of the CRYs 67,68 ( Fig. 1).
To bring CRY1 structure (PDB ID: 4K0R) near physiological conditions it was minimized and gradually heated to 310 0 K. Then 10 ns MD simulation was run to obtain structure for the molecular docking simulations. To monitor the convergence of the simulation root mean square deviation (RMSD) of backbone atoms (C, N, C α ) of amino acid residues were analyzed throughout the simulation (Fig. S1).
We initiated in silico screening using a commercially available small molecule library (which contains ~ 8 million molecules). Since docking pockets are large enough to accommodate relatively large molecules, we filtered the library to eliminate irrelevant molecules as described in the material-method section. Thus, nearly ~ 1 million molecules were docked to primary and secondary pockets of CRY1 by using AutodockVina. Then, molecules were ranked based on their Vina binding energies. Additionally, Pan Assay INterference compoundS (PAINS) PAINS-Remover 69 was used to eliminate possible false-positive results. We tested 139 molecules designed for the primary packet of the CRY1 based on their availability. Similarly, 32 molecules designed for the secondary packet of the CRY1 were also tested for toxicity (Table S1).

Toxicity of molecules.
The toxicity studies were conducted using the human osteosarcoma (U2OS) cell line, which was also employed in the circadian bioluminescence assay. We initially tested the toxicity of the 171 compounds using an MTT (3-[4,5-dimethylthiazol-2-yl]-2,5 diphenyl tetrazolium bromide)-based assay at 20 µM and determined that 48 of them were non-toxic (Fig. 2). The remaining 123 molecules that show toxic effects at 20 µM were further evaluated at 10 µM. Results indicated that 26 molecules were not toxic at 10 µM. Finally, the other 97 molecules were tested at 2.5 µM and found that 41 molecules were non-toxic at this concentration (Fig. 2). The rest of 56 molecules with relative cell viability < 85% at 2.5 µM were labeled as toxic and, therefore, eliminated from further characterization. As a control, cells treated with 5% DMSO known to be toxic. In summary, of 171 tested molecules, 56 were toxic to U2OS cell lines whereas the 115 molecules were evaluated as non-toxic molecules at different concentrations. Structures of all molecules were provided in the supplementary data (Fig. S2). www.nature.com/scientificreports/ Classification of molecules based on toxicity. The toxicity data set is composed of 171 molecules with 1538 molecular descriptors. 334 features repeating the same value for all molecules were discarded. The remaining 1203 features were utilized to obtain the best feature set by Recursive Feature Elimination (RFE). Since RFE method 52 and so the selected feature set is dependent on the estimator used, Decision Tree Classifier (DTC) 53 , Random Forest Classifier (RFC) 54 , Extra Trees Classifier (ETC) 55 , and Gradient Boosting (XGBC) 56 were tested as the external estimators. To search for the promising regions in the space of the selected features, we generated feature sets with cardinality ranging from 2 to 20 in increments of 1 for each of these classifiers. To evaluate the potency of the selected features and compare the classifiers based on their prediction accuracy, 10-Fold crossvalidation (CV) were run on all the generated sets and replicated 100 times. The toxic and non-toxic groups of molecules are not evenly distributed. The 33% of all molecules are toxic and the rest, 67%, is non-toxic. To cope with the unbalanced groups, we used weights associated with each class which are inversely proportional to the class sizes. The weights corresponding to toxic and non-toxic molecules, 1.53 and 0.74 are calculated by w T = n/(2*n T ), w NT = n/(2*n NT ), respectively, where n is the total number of molecules, n T and n NT are the number of toxic and non-toxic molecules in the dataset.
To optimize the performance in discriminating between toxic and non-toxic molecules, we tuned the hyperparameters of each classifier. We did a grid To search for the promising regions in the space of the selected features, we generated feature sets with cardinality ranging from 2 to 20 in increments of 1 for each of the classifiers. To evaluate the potency of the selected features and compare the classifiers based on their prediction accuracy, 10-Fold cross-validation (CV) were run on all the generated sets and replicated 100 times. The grid search for parameter tuning is made for each feature set independently and CV repetitions are implemented based on tuned hyperparameters.
The average accuracies of classifications with 100 replications on generated feature sets are given in Table 2. The rows represent the number of features selected and the mean accuracies for the generated feature sets. The classifiers used for 10-Fold CV are placed in columns. The maximum average accuracies for each of the classifiers are marked in bold numbers. Our analyses show that DTC attained the highest mean accuracy of 78.77% for 19 feature set and is by far superior to the other classifiers studied in terms of prediction power. RFC follows DTC with the highest mean accuracy 72.99% for a set having 9 features, while ETC and XGBC are inferior, resulting in 71.36% and 71.17% maximum accuracies with 20 and 17 features, respectively. The maximum and standard deviation of 100 CV accuracies for each feature set and classifier pair are presented in Tables S2 and S3, respectively. In line with the mean accuracy comparison, DTC attained the best with 84.80% maximum accuracy on a set with 19 features. RFC and XGBC reached the highest score of 77.78% while ETC stayed at 76.02% level. The standard deviation of DTC values is greater than 2 except for one feature set and is higher compared to the other classifiers (Table S3). However, this high variation in DTC results is compensated by higher mean accuracies. The lower variation in RFC, ETC and XGBC results, mostly less than 2, does not pose an advantage due to their lower mean accuracies.
The feature set with cardinality 14 results in a mean accuracy of 78.49% by DTC and it is very close to the highest score of 78.77% for 19 features. We concluded that the additional 5 features do not provide significant improvement in the prediction power of DTC and we continued our study with 14 features. Tuning Hyperparameters of DTC by grid search for 14 features resulted in the optimized values: max_depth = None, max_features = 13, min_samples_leaf ' = 1, min_samples_split = 5.
The RFE method iteratively prunes the least important features to get the set with preferred cardinality. However, the generated set is not guaranteed to be optimal. To determine the most essential features, we iteratively pruned the features in a similar approach with RFE in the selected 14 features. At each iteration, we did 100 CV repetitions on the reduced sets obtained by dropping every feature one at a time. The feature that provides the highest mean accuracy among the reduced sets was pruned. Our analysis for the reduced sets together with the pruned molecular descriptor showed that removing the descriptor ATSC8v to get 13 features increased the mean accuracy from 78.49 to 79.63% (Table 3). Further reduction in the size of the feature set decreased the mean accuracies. This is probably due to excluding the informative descriptors. We concluded that 13 features are the best descriptive set among 1203 descriptors with DTC to classify the toxicity data. Note that since there is no max_depth limit for 14 features DTC parameters, at each pruning step we additionally tuned the max_depth parameter to get the best max_depth level of 10.

Scientific Reports
Circadian bioluminescence assay. U2OS cell is a commonly used cell line in the circadian rhythm field due to its robust rhythm 22,25,46,71 . Any agents such as small interfering RNA (siRNA) and chemicals or gene knockout (KO) affecting the stability or activity of clock proteins change the parameters (period, amplitude, and phase) of the circadian rhythm 25,72 . We analyzed the effect of non-toxic molecules on the period length of circadian rhythm in U2OS cells stably expressing destabilized firefly Luciferase (dluc) under the control of the Bmal1 promotor (U2OS Bmal1-dluc). www.nature.com/scientificreports/ Since the primary and secondary pockets of CRY1 are critical to interact with different proteins e.g. CLOCK and FBXL3, respectively, molecules designed for these two pockets might have differential impacts on the circadian rhythm. Thus, we focused only on the effect of 85 non-toxic molecules designed for the primary pocket of CRY1. U2OS Bmal1-dLuc cells treated with these molecules and their effect on circadian period length was analyzed by BioDare2 (biodare2.ed.ac.uk) 50 . Analysis revealed that 21 molecules significantly lengthen the period of circadian rhythm (Fig. S3). One molecule, N8, shortened the period and was excluded from further classification studies. The representative figure for period lengthener molecules is shown in Fig. 4A. Circadian rhythm results of all period lengtheners were given in Fig. S3. All period values were provided in Table S4. To verify the CRY1 dependency of molecules, we generated U2OS CRY1 -/-Bmal1-dLuc cells by utilizing CRISPR/Cas9 technology (Fig. S4). Knocking out the CRY1 in this cell line resulted in a shorter period (indicated with red line) compared to wild-type controls (indicated with black line) as in agreement with previously published data 73 (Fig. 4B). Notably, when U2OS CRY1 -/-Bmal1-dLuc cells were treated with potent molecules A7, K5, K14, M17, M35, M47, M49, M54, M78, and N15 no change was observed in the period length of the circadian rhythm (Fig. 4B). We, then, performed a classification study to determine the molecular characteristics leading to period change.
Next, we analyzed the stability of the interaction between molecules and CRY1 using the MD simulations. The seven of the most potent (A7, M17, M35, M47, M49, M54, and M78) molecules in complex with CRY1 were simulated. Parameters for molecules were generated using CHARMM-GUI server. The CRY1-molecule complexes, obtained from docking analysis, were simulated for 20 ns. The initial docking position of molecules and nearby amino acid residues on CRY1 were shown in (Fig. 5A). We identified the nature of interactions between CRY1 and molecules as followings. Ring structures in molecules generated pi-type (Pi-cation, Pi-alkyl, Pi-Pi stacked or T-shaped) interactions with at least one of either Arg-293, His-355 and Trp-399 of CRY1. In addition, all molecules interacted with Leu-255 and Ile-392 of the CRY1 through van der Waals forces. To to an opaque 96-well plate. Next day cells were synchronized by dexamethasone for 2 h. Then the medium was replaced with luminescence recording medium having molecules or DMSO. Bioluminescence readings were recorded for a week using Synergy H1 (BioTek). Period data was calculated using Biodare2 web-server (biodare2.ed.ac.uk) (all results were given in Fig. S3). To determine molecules that are changing the period of the rhythm significantly, the period length of molecule-treated cells was compared to that of DMSO control using unpaired t-test with Welch's correction (****p < 0.0001 ***p = 0.001 **p < 0.01, *p < 0.05, n = 3). Each biological replicate was the average of the 3 technical replicates. (B) 1 × 10 5 CRY1 -/-U2OS transduced with Bmal1-dLuc reporter. WT represents U2OS Bmal1-dLuc. Bioluminescence readings were recorded for a week using LumiCycle (Actimetrics, USA). Period data was calculated using Biodare2 web-server (biodare2.ed.ac. uk). Unpaired t-test with Welch's correction was used for significant analysis (*p < 0.05, n = 3). Each biological replicate was the average of the two technical replicates. www.nature.com/scientificreports/ evaluate the persistence of these interactions we generated a contact map of molecules with nearby amino acids through 20 ns MD simulations. RMSD values of backbone atoms of proteins showed that simulations reached the equilibrium (Fig. S5). Contact maps showed that these interactions were maintained during the simulations (Fig. 5B). Visual inspection showed that molecules did not cause any conformational changes and their interactions were maintained throughout the simulation. A list of highly interacting amino acids with each molecule was given in Table 5. In addition, amino acid residues, which formed hydrogen bonds, were determined from the initial docking position of molecules (Table 5) and the persistence of these interactions was confirmed from interaction maps (Fig. 5B).

Classification of period lengthening molecules.
Factors determining the period length in the circadian rhythm are quite complex. For example, the deletion of analogous Cry1 and Cry2 genes in mice causes short and long period phenotypes, respectively 74 . CRY binding small molecules discovered by high-throughput screening were reported to stabilize the CRY1. Interestingly, these molecules caused differential circadian phenotype in treated cells e.g. shorter or longer period length 25,27 . Thus, we focused only on the period lengthening molecules. Previously reported 6 molecules binding to CRY1 and lengthened the period of circadian rhythm (KL001, GO058, GO061, GO152, GO214, GO216) 25,27 were included in the classification analysis, GO203 which does not change the rhythm included as no-changer. We started the classification with 90 molecules of which 27 are period lengthening, 63 are no-changers. 1538 molecular descriptors were generated for all molecules. Interaction maps of molecules with CRY1 through 20 ns MD simulation were generated using VMD timeline applet. It has been evaluated as interaction if molecule and amino acid residues get 5 Å or closer and shown with a black bar. www.nature.com/scientificreports/ However, 360 of them have the same value for all molecules and were discarded. The remaining 1177 features were used to train the dataset. We followed a similar approach with the toxicity dataset for the classification of the set of period-lengthening molecules. The period dataset is also unbalanced as in toxicity since 30% of the molecules are period changers and the rest 70% are no-changer molecules. To deal with the possible bias, we set the weights of the periodlengthening and no-changer molecules as 1.67 and 0.71, respectively. We generated feature sets with cardinalities between 2 and 20 from the period dataset by RFE using DTC, RFC, ETC and XGBC as external estimators. Next, we tuned the parameters of the given classifiers on each of the feature sets and did 100 CV with the optimized parameters.
Mean accuracy levels for each feature set and classifier pair are presented in Table 6. All of the classifiers achieved mean accuracies greater than 80% for multiple numbers of feature sets and the highest mean accuracies for each classifier are marked in bold. RFC and XGBC are the best of all with highest mean accuracies, 83.62% for 16 features and 83.69% for 15 features. Since XGBC provided slightly higher mean accuracy than RFC with 1 less feature, we selected XGBC as the most promising classifier for the period dataset.
The maximum and standard deviations of 100 CV accuracies for each feature set and classifier pair are given in Tables S5 and S6 respectively. Among all, XGBC provided the maximum accuracy of 90% again with the 15 features. XGBC Parameters tuned for 15 features are such that, colsample_bytree = 0.5, learning_rate = 0.1, max_depth = 3, min_child_weight = 1, n_estimators = 100, and subsample = 0.7.
As performed in the toxicity dataset, we iteratively pruned the features in the selected 15 features to eliminate the redundant ones, this time using XGBC with the tuned hyperparameters. In Table 7 maximum, mean and standard deviation of accuracies of 100 CV applied on reduced feature sets are presented. Reducing the set from Table 5. Highly interacting and hydrogen bond generating amino acid residues of CRY1 with molecules.

Molecules
Interacting amino acids  We concluded that XGBC with the tuned parameters coupled with the reduced set with 10 features is the best classifier for the period changer dataset. We did a final 10,000 repetition of 10-Fold CV with XGBC on 10 features to get the maximum and mean accuracies of 93.33%, and 86.56% respectively. The histogram and fitted normal probability density function of the accuracies for 10,000 replications are presented in Fig. 6. The selected final 10 features includes: "ATSC8c, MATS1e, minsCH3, MATS4e, MATS4s, ATSC7i, SpMin4_Bhp, MLFER_S, ATSC4p, SpMax2_Bhm" (Table 8).

Conclusions
Drug discovery is a very expensive and time-consuming process posing several daunting challenges. Compared to the classical high-throughput approach to computer-assisted drug discovery, employing virtual screening (VS) is a promising approach to reduce the cost of the initial drug discovery. VS allows identifying hit compounds from large databases of drug-like molecules much faster and cheaper than traditional approaches. VS utilizes comprehensive evaluation of ADMET parameters by pharmacophore modeling 75 and quantitative structure-activity relationship (QSAR) analysis 76 . In addition to these, toxicity prediction is becoming a more significant part of current computer-assisted drug development, especially when libraries contain tens of millions of untested compounds. As a result, quick and inexpensive computational algorithms are frequently used to eliminate potentially toxic compounds and reduce the number of experimental tests required. Here we identified small molecules that bind functionally important regions of a core clock protein CRY1. First, of tested 171 molecules, 115 molecules are nontoxic while 56 molecules are toxic. Then we performed machine learning methods to classify toxic and nontoxic molecules. DTC identified 13 features that can predict the toxicity with accuracy of about 80%. Second, 21 molecules were identified as period lengthener among 85 molecules. Furthermore, machine learning approach using XGBC determined 10 molecular descriptors that can predict period lengthener molecules about 87% accuracy. These descriptors can be implemented in future VS studies on CRY1 to predict the toxicity and period lengthener effect of molecules from libraries containing several hundred million compounds. Table 7. Period lengthening dataset, maximum, mean, and standard deviation of 10-Fold CV accuracies with 100 repetitions. XGBC applied to reduced feature sets obtained by removal of a single feature at a time.  Figure 6. Period Change Dataset, histogram, and fitted normal probability density function of the accuracies for 10,000 replications of XGBC applied to final 10 molecular descriptors.