Introduction

The recent expansion of make-on-demand libraries to billions of synthesizable molecules has attracted significant attention from the drug-discovery community, because such ultra-large databases provide access to novel, uncharted areas of the chemical universe1,2,3. On the other hand, the emergence of ultra-large libraries has highlighted significant limitations of conventional docking approaches, which typically operate on the scale of millions of molecules. With chemical libraries comprising 100 billion molecules on the horizon4, it will soon become impossible to deploy conventional docking at its full potential, and so far, only a handful of billion-sized docking campaigns have been conducted on elite supercomputing facilities2,5.

It should also be noted that docking is not just computationally demanding but also a remarkably wasteful process, in which only a very small subset of top-scoring compounds is considered for experimental evaluation; most docked molecules are simply discarded6. To address this challenge in the context of the global shortage of computational docking power, we have recently developed Deep Docking (DD), an artificial intelligence (AI)–driven approach that provides very economical yet highly reliable access to ultra-large docking. After the chemical library and receptor preparation stages, DD relies on the iterative execution of five sequential stages (DD phases 1–5) comprising ligand sampling, ligand preparation, docking, model training and inference. A final docking stage is then performed to process the compounds that are retained by DD as prospective top-scoring molecules.

Development of DD to accelerate structure-based virtual screening

In 2006, we proposed Progressive Docking, a hybrid approach that uses information on already-docked molecules to predict the scores of yet-unprocessed database entries with quantitative structure–activity relationship models based on partial least squares regression, thus reducing the docking workload7. However, the acceleration offered by this and similar shallow-learning methods that followed8,9 was not yet sufficient to screen more than a few million compounds. To enable structure-based screening of billion-sized molecular libraries, we recently developed DD, a technique that iteratively trains deep neural networks (DNNs) with small batches of explicitly docked compounds to infer the ranking of the yet-unprocessed remainder of the library10. In this way, DD can discard unfavorable (undockable) molecular structures without wasting valuable computational resources (Fig. 1).

Fig. 1: AI-accelerated DD approach versus regular docking.

In the DD workflow, DNN models are trained to predict docking scores from 2D molecular fingerprints using only a very small portion (1%) of the database that needs to be docked. The score classes (top or low scoring) of the remaining molecules are then inferred rather than explicitly calculated with actual docking. In the end, only predicted best-scoring molecules remain to be conventionally docked, whereas unfavorable molecules are filtered out.

We have used the DD protocol to virtually screen the ZINC15 library11 (‘ZINC Is Not Commercial’; 1.36 billion molecules) against 12 proteins representing four major families of drug targets10. We demonstrated that DD needs to dock only 1% of the molecules to significantly reduce (100-fold) the size of a chemical library while retrieving 90% of the best-scoring structures (virtual hits defined by the FRED scores12), and that even larger reductions can be achieved by lowering the user-defined percentage of retrieved virtual hits.

In a subsequent study, we applied DD to screen ZINC15 against the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro), a prominent drug target for coronavirus disease 2019, using the Glide Single Precision (SP) approach13. The identified ‘make-on-demand’ compounds were structurally different from known protease inhibitors14. Our own experimental validation of DD-proposed hits showed that 15% were active against the target, with established IC50 values ranging from 8 to 251 μM15,16. Specifically, this screen led to the discovery of a novel series of compounds based on a dihydro-quinolinone core that were confirmed experimentally as low-micromolar Mpro inhibitors by us15,16 as well as by an independent study17. Following our initial results, similar machine learning approaches that emulate (rather than explicitly compute) docking scores have rapidly emerged18,19,20,21,22, highlighting a global demand for AI-accelerated virtual screening23,24.

The current version of the DD protocol can be seamlessly integrated into existing drug-discovery pipelines that rely on popular docking programs. An automated workflow has also been developed to facilitate the adoption of DD by drug-discovery scientists with minimal or no experience in machine learning and programming.

Experimental design

Preparation of chemical libraries

The most commonly used ultra-large chemical libraries are ZINC and the ‘make-on-demand’ collections from Enamine. The ZINC15 database11 contains 1.5 billion molecules that can be readily downloaded as a whole in simplified molecular-input line-entry system (SMILES) format; it also provides access to several smaller subsets such as drug-like and lead-like molecules. The library offers up to four precalculated stereoisomers per molecule and allows protomer and tautomer states to be computed at different pH ranges11. The newest version, ZINC20, has recently been released, comprising ~1 billion molecules25. The Enamine REAL database26 includes 1.95 billion make-on-demand molecules that can be synthesized with an 85% average success rate, whereas Enamine REAL Space27 is the largest database of commercially available compounds, accounting for 19 billion non-enumerated molecules.

As with conventional docking, a chemical library has to be preprocessed for DD. Explicit isomeric configurations and correct ionization states need to be enumerated for each molecule. OpenEye flipper (license required)28 or RDKit’s EnumerateStereoisomers (freely available)29 are examples of programs that can be used to enumerate isomers, whereas tautomers and protonation states can be calculated by various licensed (e.g., QUACPAC30 and ChemAxon (http://www.chemaxon.com)) or open-source (e.g., Open Babel31 and Ambit32) software. Circular binary Morgan fingerprints33,34 with radius 2 and a size of 1,024 bits, to be used as descriptors, can then be computed from the SMILES of the prepared molecules. These extended-connectivity fingerprints represent a machine-readable description of molecules based on a fixed-length binary bit vector encoding the presence or absence of specific substructures (circular atom neighborhoods at a specified radius; Supplementary Fig. 1)35. The fingerprint bits are used as features for the DNN models, which aim to learn which substructures are responsible for high predicted binding affinities (in terms of docking scores). In DD, Morgan fingerprints are stored as the indexes of bits set to 1 rather than the entire fixed-length vector (Box 1), and they are decompressed to the regular 1,024-bit representation on the fly.

Receptor preparation

Target structures need to be prepared before the docking grids can be generated. Non-structural water, lipids and solvent molecules are usually removed; the target protein may require structural optimization to repair any missing parts, add hydrogens, compute correct protonation states of residues and energetically relax the structure. This optimization process can be performed with various licensed (e.g., Schrödinger36 and Molecular Operating Environment37) and free (e.g., UCSF Chimera’s Dock Prep38) program suites. The Protein Preparation Wizard tool from Maestro36 provides a straightforward and automated way to perform such preparation. Any other protein preparation tool can be used for this purpose. The generation of docking grids, on the other hand, strictly depends on the docking program that the user intends to use. In the Procedure, we outline the steps to prepare grids for docking with Glide or FRED.

Molecular sample size

Validation, test and initial training sets are randomly sampled from the entire docking library at the first DD pass. From the second iteration on, the training set is iteratively augmented with random batches of molecules classified as virtual hits in the inference stage of the previous iteration. The chosen sample size should be large enough to properly represent the chemical diversity of the investigated library. At the same time, the maximum sample size inevitably depends on the total amount of docking (number of docked compounds) that is feasible on the user’s system.

Validation and test sets are generated only in the first iteration. Because the score threshold used to define virtual hits is decreased at each DD iteration, using small sets can cause generalization issues, especially in the last iteration, in which the proportion of positive samples in the two sets is very limited (e.g., 0.01%). This problem might be difficult to diagnose, because it may not be detectable in earlier iterations, when positive samples are sufficiently well represented. We recommend making the validation and test sets in the first iteration as large as possible, ideally comprising 1 million molecules each, and avoiding the use of fewer than 250,000 molecules (corresponding to 25 positive samples in the last iteration), because we have observed that these values are sufficient to obtain robust and generalizable docking models for libraries containing on the order of 1 billion compounds10. If at any iteration the number of virtual hits in the validation or test set is ≤10, model training is automatically canceled by the program.

The size of the training set, on the other hand, mainly influences model precision, and performance improves with larger training sets (700,000–1,000,000 molecules) and more iterations (8–11). If the deep learning–dedicated resources available to the user (such as graphics processing units (GPUs)) are limited, we recommend using larger training sets for a smaller number of iterations (at least four) rather than the alternative, because the resulting models, although usually showing slightly worse performance, require fewer iterations of training and inference to achieve an acceptable result. An example is reported in Fig. 2, illustrating the effect of different training set sizes and numbers of iterations on a DD run against the dimerization site of the androgen receptor (Protein Data Bank (PDB) ID: 1R4I39). For the same total amount of docking (~4,200,000 molecules including validation and test sets), the four-iteration strategy using a training size of 700,000 (depicted in light green) returned a slightly higher final number of qualified molecules than the eight-iteration strategy (depicted in violet) (~58 million versus ~49 million); however, it required half the training and inference computations. Thus, it may be a better choice for systems with a limited number of GPUs. On the other hand, the use of a larger training set (700,000 molecules) for the same number of DD iterations (four) substantially improved the DD reduction performance (~58 million versus ~108 million qualified molecules obtained by using a training size of 350,000, depicted in dark pink). If the resources for docking are limited, we recommend choosing an initial training sample size equal to the maximum number of molecules that can be docked with the user’s system (ideally 1,000,000), docking it, and then using the procedure outlined at Step 23B of Procedure 1 or Step 10B of Procedure 2 to identify a value that provides an optimal balance between the amount of docking and model performance for the user’s specific system and that can be used in successive iterations.

Fig. 2: Effect of varying training size and number of iterations on the number of remaining molecules (molecules that are classified as virtual hits, hence not discarded) for screening ZINC20 against the dimerization site of androgen receptor (PDB ID: 1R4I39).

Initial validation and test sets were of 700,000 molecules in all three cases. The final number of qualified molecules was ~58 million for the four-iteration/700,000-molecule training size run (light green), ~108 million for the four-iteration/350,000-molecule training size run (dark pink) and ~49 million for the eight-iteration/350,000-molecule training size run (violet). K, thousand; M, million.

Model training and inference

Each iteration of the DD protocol encompasses model training and inference. To identify virtual hits, the protocol uses binary classifiers in the form of feedforward DNN models (multilayer perceptrons) trained on 1,024-bit circular Morgan fingerprints. Binary ‘positive samples’ in the training, validation and test sets are virtual hits, that is, molecules with docking scores better than a threshold corresponding to a predefined top percentage of the docking-ranked molecules in the validation set. The rest of the molecules are labeled as ‘negative samples’. These top-percentage values can be specified by the user for the first and last iterations, whereas for the intermediate ones, the value changes linearly between those two.

After the binary labels are generated, a user-specified number of models with different combinations of hyperparameters (number of hidden layers and neurons, dropout frequencies, oversampling of minority class and class weights) are trained to optimize model test set accuracy by using a grid search strategy.
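
As a rough illustration of the type of classifier being optimized, the sketch below builds a simple Keras multilayer perceptron over 1,024-bit fingerprints for a few illustrative hyperparameter combinations. It is not the protocol's own training code, and the layer sizes, dropout values and optimizer settings shown are assumptions.

from itertools import product

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

def build_classifier(n_hidden_layers, n_units, dropout):
    """Feedforward DNN (multilayer perceptron) over 1,024-bit Morgan fingerprints."""
    model = Sequential()
    model.add(Dense(n_units, activation="relu", input_shape=(1024,)))
    model.add(Dropout(dropout))
    for _ in range(n_hidden_layers - 1):
        model.add(Dense(n_units, activation="relu"))
        model.add(Dropout(dropout))
    model.add(Dense(1, activation="sigmoid"))  # probability of being a virtual hit
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Illustrative grid of hyperparameter combinations (values are assumptions, not DD defaults)
hyperparameter_grid = product([1, 2, 3], [512, 1024, 1500], [0.2, 0.5])
models = [build_classifier(layers, units, drop) for layers, units, drop in hyperparameter_grid]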

After the training phase is finished for the initial iteration, the optimal binary classifier is used to infer the virtual hit-likeness of the remainder of the molecular library. In the subsequent iterations, the training set is augmented with new compounds randomly selected from the molecules with predicted virtual hit-likeness higher than a classification threshold, which corresponds to a user-defined recall value for the validation predictions. This recall value is a critical parameter: if set too high, it may significantly increase the number of remaining molecules; if set too low, large portions of virtual hits may be discarded. Thus, we recommend setting the recall between 0.75 and 0.90 (corresponding to 75% and 90% retrieved virtual hits, respectively).
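
A minimal sketch of how such a classification threshold can be derived from the validation set is given below (the variable names are assumptions); the cutoff is chosen so that the requested fraction of true virtual hits in the validation set is retained.

import numpy as np

def probability_cutoff(val_hit_probs, recall=0.90):
    """Return the virtual hit-likeness cutoff that keeps ~`recall` of the true validation hits.

    `val_hit_probs` holds the predicted probabilities of the molecules labeled as
    virtual hits in the validation set.
    """
    return float(np.quantile(np.asarray(val_hit_probs), 1.0 - recall))

# Molecules whose predicted virtual hit-likeness is >= the cutoff are retained;
# the rest of the library is discarded before the next iteration.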

It is important to note that virtual hit calling becomes more stringent as the top-percentage threshold value is decreased linearly with the number of iterations; thus, the definition of ‘positive’ and ‘negative’ labels also changes at each iteration for all molecules in the training, validation and test sets. The inference is always performed over the whole library, usually setting the initial percentage value for virtual hit selection to 1% and the final value to 0.01%. The total number of iterations typically ranges from 4 to 11, and we normally train 24 models at each iteration in the optimization step. For most docking campaigns, these parameters are sufficient to shrink a database of 1–1.5 billion molecules to a few million compounds that could be conventionally docked with regular computational resources. Alternatively, as we mentioned before, the preset recall value could be adjusted for more ‘aggressive’ DD-selection of top-scored compounds.
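
For example, the linearly interpolated top-percentage value used at a given iteration can be written as in the sketch below (assuming the stated defaults of 1% at the first and 0.01% at the last iteration):

def top_percentage(iteration, total_iterations=11, first_percent=1.0, last_percent=0.01):
    """Linearly interpolate the virtual-hit percentage between the first and last iterations."""
    fraction = (iteration - 1) / (total_iterations - 1)
    return first_percent + (last_percent - first_percent) * fraction

# e.g., top_percentage(6) is ~0.505% for an 11-iteration run starting at 1% and ending at 0.01%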

Applications

The DD protocol can be used in conjunction with any popular docking program. In our DD campaigns, we were able to dock billion-size (1B+) chemical libraries against various targets using FRED12, Glide13, Autodock-GPU40, QuickVina241 and ICM42,43 docking suites. The presented protocol explains how to set up and run DD against a generic protein target. Although the steps required for protein, ligand and docking grid preparation are explained using specific tools that we use in house, all of those can be readily adapted to similar programs and computer-aided drug discovery (CADD) packages.

Comparison with alternative methods

One of the major challenges of modern CADD is the constantly growing need for computational resources required to screen chemical libraries that are exploding in size because of recent advances in automated synthesis and robotics. A few docking packages have proven successful for screening 1B+ libraries by relying on code scalability across supercomputing clusters. For example, OpenEye GigaDocking44 was used to dock the Enamine REAL database into the purine nucleoside phosphorylase and heat shock protein 90 targets in <1 d using 27,000 and 45,000 Amazon Web Services (AWS) cloud CPUs, respectively. The popular Autodock program has been parallelized for Compute Unified Device Architecture (CUDA)45 and deployed on the Summit supercomputer (comprising >27,000 GPUs) to dock the same library into the SARS-CoV-2 Mpro active site in ~1 d5. VirtualFlow2 is another automated platform for docking large libraries on supercomputing systems; it has been used to screen 1.4 billion molecules in 2 weeks (using 8,000 CPUs). Very recently, Bender et al. developed a guide for ultra-large docking campaigns, highlighting that computing clusters of 500–1,000 CPU cores are required to perform billion-scale virtual screening in a timely manner46. These docking platforms achieve very high throughput but are extremely resource demanding in comparison with DD.

Consequently, because of the computational cost, conventional docking of ultra-large libraries remains unaffordable for most of the research community. Hence, various machine learning techniques that emulate docking have been proposed to perform such tasks without large computing resources. In Supplementary Table 1, we list several such methods that have been developed as proofs of concept to approximate docking scores from molecular structural features (descriptors)9,18,19,20,21,22. Although these methodologies cannot be easily compared (owing to the use of different benchmark sets and docking libraries), it is reasonable to state that DD is one of the fastest AI-enabled docking platforms and the only method that has been extensively tested on 1B+ libraries. In addition, the DD protocol does not rely on a particular docking program, and thus it is compatible with the emerging large-scale docking methods and can improve their high-throughput capabilities.

Limitations

Some technical limitations of DD are related to its extensive use of GPU acceleration. The protocol requires access to GPU resources for optimal performance, in contrast to most docking platforms, which rely on CPUs. In addition, DD is designed for fast and economical virtual screening and thus provides docking details exclusively for the top-scoring molecules, disregarding large fractions of chemical libraries. Therefore, docking campaigns assessing hit rate variability with docking scores1 or rescoring low-ranked molecules47 should consider a brute-force approach instead. Moreover, the quality of DD results entirely depends on the suitability of the docking program to prioritize active molecules from an ultra-large library. Hence, we anticipate that it will be challenging to discover active molecules by applying DD to a billion-molecule library if docking performs poorly on the specific target, just as in the case of conventional docking1. In this context, recent work by Bender et al. provides useful guidelines for establishing benchmarking calculations before performing large-scale docking46.

Overview of the protocol

The present protocol provides guidance on how to set up and run a DD campaign with Glide SP and FRED docking packages. Minor modifications can be applied to adapt the same workflow to other docking programs.

We outline two alternative procedures to set up and run a DD campaign. The first relies on manually running each individual script (Procedure 1). This option is well suited for users who do not have access to large computational facilities or for those who want to use only specific DD steps in their drug-discovery campaigns.

The second option relies on a series of scripts that automatically run each stage on a cluster managed by a Simple Linux Utility for Resource Management (Slurm) job scheduler (Procedure 2). This option is particularly suitable for automation purposes and for performing virtual screening campaigns on large computational systems. Procedures 1 and 2 share a number of common steps (Supplementary Table 2). Both procedures have been extensively tested by users with no prior knowledge of the DD protocol.

Materials

Equipment

Molecular data

  • A chemical library with molecules in the SMILES format. A ready-to-screen version of ZINC20 (downloaded in early March 2021) with SMILES and already calculated fingerprints has been deposited at https://files.docking.org/zinc20-ML/

  • 3D structure of the receptor in PDB format

  • An example iteration for testing the protocol can be downloaded from the Federated Research Data Repository at https://doi.org/10.20383/102.0489

Software

Hardware

  • Windows, Macintosh or Linux computer

  • Linux system or cluster (preferably with Slurm workload manager, https://slurm.schedmd.com/documentation.html) with CPU nodes and preferably also GPU nodes for machine learning. Our in-house setup consists of 50 GPUs (Nvidia Tesla V100 GPUs with 32 GB of memory) and 640 CPU cores (Intel Xeon Silver 4116 CPU @ 2.10 GHz)

Procedure 1: manual DD

  1. 1

    Access the Linux system, create a DeepDocking folder and clone or copy the DD_protocol repository there. No installation is required to run the scripts.

Stage I: chemical library processing

Timing ~13 h

  1. 2

    Download the library of SMILES in the DeepDocking folder. If the library is provided as separate files, concatenate them into a single file from the command line:

    cat *smi > library.smi

  2. 3

    Create a smiles folder and divide the library into evenly populated files of 10,000,000 molecules each, all placed inside the folder:

    split -d -l 10000000 library.smi smiles/smiles_all_ --additional-suffix=.smi

    This reorganization step enables one to efficiently run the random sampling and inference stages on a larger number of smaller files rather than a massive, memory-consuming library file. Because these processes are run on each file independently, any other value can be used instead of 10,000,000 to better fit specific computational setups (e.g., to have a number of files equal to the number of cores available on a node to optimize the sampling). The number of resulting files will depend on the size of the original database and the number of molecules specified for each file.

Enumerate stereoisomers, tautomers and protomers

Critical

Stereoisomers, tautomers and protomers must be enumerated using, for instance, OpenEye tools, and Morgan fingerprints must be calculated (Fig. 3). Other similar database preparation tools can be used for the same purpose. The procedure described here calculates all the possible stereoisomers of a given molecule and the dominant tautomer form (protonated at pH 7.4) for each isomer. These commands can be changed, for example, to enumerate multiple tautomers and protomers (see https://www.eyesopen.com/quacpac for guidance); each of these must be assigned a unique name, however, resulting in a substantial increase in the number of unique library entries. For some chemical libraries, the tautomers and protomers are partially or completely enumerated, and therefore these steps may not be required.

  1. 4

    For each file generated in Step 3, enumerate stereocenters with unspecified stereochemistry using the flipper tool from the OpenEye OMEGA module, assigning a unique name to each isomer:

    Fig. 3: Chemical library preparation for DD.

    Initially, the library is obtained in SMILES format and split into evenly populated files to facilitate sampling and inference. Depending on the available resources, the user can then follow two preparation procedures. Procedure 1 (manual preparation) requires each preparation action to be executed manually (enumeration of stereoisomers, tautomers and protomers; renaming the files; and calculation of Morgan fingerprints). Procedure 2 (automated) allows running all the preparation automatically; however, it requires access to a computing cluster running with a job scheduler (Slurm). Both processes will generate a folder with the prepared SMILES and another one with the corresponding Morgan fingerprints in DD format.

    flipper -in smiles/smiles_all_1.smi -out smiles/smiles_all_1_isom.smi -warts

  2. 5

    Calculate the dominant tautomeric and protonation form of each isomer at pH 7.4 using the tautomers tool from the OpenEye QUACPAC module:

    tautomers -in smiles/smiles_all_1_isom.smi -out smiles/smiles_all_1_states.smi -maxtoreturn 1 -warts false

  3. 6

    Create a folder for the prepared files (library_prepared) and move the files generated in the previous step inside it, renaming them and changing their extension from .smi to .txt:

    mv smiles/smiles_all_1_states.smi library_prepared/smile_all_1.txt

Morgan fingerprints

  1. 7

    Activate the conda environment and use the morgan_fp.py script (provided in the utilities folder of the repository) to calculate the Morgan fingerprints of the prepared structures from their SMILES:

    python morgan_fp.py --smile_folder_path ~/DeepDocking/library_prepared --folder_name ~/DeepDocking/library_prepared_fp --tot_process 60

    This will create a library_prepared_fp folder with the resulting fingerprints. The tot_process argument controls how many files from library_prepared will be processed in parallel using multiprocessing.

    Critical step

    DD works exclusively with Morgan fingerprints of radius 2 and 1,024 bits in the format generated by the morgan_fp.py script, which is the name of the molecule followed by the indexes of fingerprint bits that are set to 1, comma separated (Box 1).
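
    For reference, the minimal RDKit sketch below reproduces this compressed format (one line per molecule: name followed by the comma-separated indexes of bits set to 1). It is only an illustration of the expected layout; the protocol's morgan_fp.py script should be used in practice, and details such as the input SMILES column order are assumptions here.

    from rdkit import Chem
    from rdkit.Chem import AllChem

    def write_compressed_fingerprints(smiles_file, output_file):
        # Each input line is assumed to hold "SMILES name"
        with open(smiles_file) as fin, open(output_file, "w") as fout:
            for line in fin:
                fields = line.split()
                if len(fields) < 2:
                    continue
                smiles, name = fields[0], fields[1]
                mol = Chem.MolFromSmiles(smiles)
                if mol is None:
                    continue  # skip unparsable entries
                fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
                on_bits = ",".join(str(bit) for bit in fp.GetOnBits())
                fout.write(f"{name},{on_bits}\n")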

Stage II: receptor preparation

Timing ~30 min

  1. 8

    Obtain the PDB structure of the target molecule from a public repository (e.g., the PDB51). Where possible, selecting target structures with bound ligands is recommended.

  2. 9

    Use the Protein Preparation Wizard tool from Maestro suite36 to prepare and optimize the target structure by following the procedure illustrated at https://www.schrodinger.com/training/videos/protein-preparation. Any other protein preparation tool can be used for the same purpose.

  3. 10

    Save the resulting optimized structure in PDB format.

  4. 11

    Depending on the choice of docking software, the grid can be generated in different ways. Each process has been described extensively elsewhere; therefore, in the present protocol, we describe only the default workflows for Glide (A) and FRED (B). Detailed instructions on how to customize the grid generation process can be found in the training videos provided by Schrödinger for Glide (https://www.schrodinger.com/training/videos/docking-receptor-grid-generation) and in the manual of OEDocking for FRED (https://docs.eyesopen.com/applications/oedocking/make_receptor/make_receptor.html).

    1. (A)

      Generation of the Glide docking grid

      1. (i)

        Launch the Maestro graphical user interface (GUI).

      2. (ii)

        Load the prepared PDB in Maestro using ‘File’ → ‘Import Structures …’

      3. (iii)

        In the Tasks panel, search for ‘Receptor Grid Generation’ and select it.

      4. (iv)

        In the Receptor panel, check the ‘Pick to identify the ligand’ box and select the bound ligand in the Workspace.

      5. (v)

        Press ‘Run’. The docking grid will be generated in zip format.

    2. (B)

      Generation of the FRED docking grid

      1. (i)

        Open Make Receptor GUI.

      2. (ii)

        Import the prepared PDB structure using ‘File’ → ‘Open …’

      3. (iii)

        Check the ‘Pro’ box for the target protein and the ‘Lig’ box for the ligand.

      4. (iv)

        Adjust the box size in the Box panel, if necessary.

      5. (v)

        Press ‘Create Shape’ in the Shape panel.

      6. (vi)

        Type the name of the protein and ligand in the respective boxes in the Finish panel and press ‘Save’ to save the grid in oeb format.

  5. 12

    Move the grid into a docking_grid directory in the DeepDocking folder.

Stage III: random sampling of molecules in the first iteration (DD phase 1)

Timing ~15 min

  1. 13

    In a projects directory, create a protein_test project folder. The DD process (Fig. 4) will output all the results into this folder.

    Fig. 4: General DD workflow.

    The method comprises five phases to be repeated iteratively and one final phase in which predicted virtual hits are conventionally docked to the target. Validation and test sets are generated and docked only in the first iteration. Step numbering refers to Procedure 1 (manual version).

  2. 14

    Activate the conda environment and use the molecular_file_count_updated.py script provided in the scripts_1 folder with the following command to determine the number of molecules to sample from each library file to generate training, validation and test sets, each containing 1,000,000 molecules:

    python molecular_file_count_updated.py --project_name protein_test --n_iteration 1 --data_directory ~/DeepDocking/library_prepared_fp --tot_process 60 --tot_sampling 3000000

    A Mol_ct_file_updated_protein_test.csv file will be created in the ~/DeepDocking/library_prepared_fp Morgan fingerprint folder, reporting the number of molecules to sample from each file.

    Critical step

    The choice of molecular sample sizes in the first iteration is critical, because validation and test sets are generated only during this iteration and should be as large as possible (ideally, 1,000,000 molecules each). Refer to Experimental design for detailed information on how to properly choose the sizes of validation, test and training sets.

  3. 15

    Use the sampling.py script to perform random sampling:

    python sampling.py --project_name protein_test --file_path ~/DeepDocking/projects --n_iteration 1 --data_directory ~/DeepDocking/library_prepared_fp --tot_process 60 --train_size 1000000 --val_size 1000000

    This step will create an iteration_1 folder inside the project directory and will generate three files with the names of 1,000,000 molecules each. The train_size and val_size arguments control the number of molecules to sample for the training set and for the validation/test set files, respectively. Validation and test sets will always be read from the iteration_1 folder, regardless of the iteration.

    Critical step

    In the first iteration, the sum of the train_size argument and twice the val_size argument must be equal to the tot_sampling value of Step 14.

  4. 16

    Remove the overlaps between sets using the sanity_check.py script:

    python sanity_check.py --project_name protein_test --file_path ~/DeepDocking/projects --n_iteration 1

  5. 17

    Extract the fingerprints of the sampled molecules:

    python extracting_morgan.py --project_name protein_test --file_path ~/DeepDocking/projects --n_iteration 1 --morgan_directory ~/DeepDocking/library_prepared_fp --tot_process 60

    This will create a morgan folder in the iteration directory, with three .csv files containing the fingerprints of the molecules.

  6. 18

    Extract the corresponding SMILES:

    python extracting_smiles.py --project_name protein_test --file_path ~/DeepDocking/projects --n_iteration 1 --smile_directory ~/DeepDocking/library_prepared --tot_process 60

    The SMILES will be extracted into a smile folder inside the iteration directory.

Stage IV: ligand preparation (DD phase 2)

Timing ~12 h

  1. 19

    The following procedures can be used to prepare molecules for docking with Glide SP (A) or FRED (B) using the OMEGA module (https://docs.eyesopen.com/applications/omega/). Alternative conformer generation programs can be used for the same purpose.

    1. (A)

      Ligand preparation for Glide docking

      1. (i)

        Create an sdf folder in iteration_1. Run OMEGA conformer generation in classic mode (one conformer per molecule):

        oeomega classic -in smile/train_smiles_final_updated.smi -out sdf/training_sdf.sdf -maxconfs 1 -strictstereo false -mpi_np 60 -log training.log -prefix conf_training

        for each smi file inside the smile folder.

    2. (B)

      Ligand preparation for FRED docking

      1. (i)

        Go to the iteration_1 directory and create an sdf folder. Run OMEGA 3D conformer generation in pose mode:

        oeomega pose -in smile/train_smiles_final_updated.smi -out sdf/training_sdf.oeb.gz -strictstereo false -mpi_np 60 -log training.log -prefix conf_training

        for each file inside the smile folder. When running in pose mode, OMEGA will output a ligand-dependent number of 3D conformations for each molecule in an oeb.gz file in the sdf folder.

Stage V: molecular docking (DD phase 3)

Timing ~20 h

  1. 20

    Docking of the sampled molecules can be performed using Glide SP (https://www.schrodinger.com/products/glide) (A) or FRED (https://docs.eyesopen.com/applications/oedocking/fred/fred.html) (B) or any other docking program:

    1. (A)

      Glide SP docking

      1. (i)

        From scripts_1, run:

        python input_glide.py --project_name protein_test --file_path ~/DeepDocking/projects --grid_file ~/DeepDocking/docking_grid/glide_grid.zip --iteration_no 1 --glide_input ~/DeepDocking/DD_protocol/scripts_1/glide_template.in

        to create a docked folder in iteration_1 and generate a docking input script for each file in the sdf folder. Note that input_glide.py will modify only the GRIDFILE and LIGANDFILE lines of the glide_template.in script in scripts_1, whereas all the other docking parameters will be used as specified. Hence, the template can be modified to run alternative Glide docking protocols (e.g., Glide XP).

      2. (ii)

        Go to the docked folder. Note that the $SCHRODINGER environment variable must have been set to the installation directory of the program using:

        export SCHRODINGER=schrodinger-installation-directory

        Further details can be found at https://www.schrodinger.com/kb/1842. Run Glide from the command line:

        $SCHRODINGER/glide -OVERWRITE -JOBNAME docking_train training_docked.in

        for each docking input file generated in the previous step. The user can also use the -NJOBS option to run multiple subjobs in parallel, or run docking from Maestro GUI. Refer to the Glide manual or watch https://www.schrodinger.com/training/videos/docking-ligand-docking/glide-ligand-docking-calculation for more detailed instructions.

    2. (B)

      FRED docking

      1. (i)

        Create a docked folder and run:

        fred -receptor ~/DeepDocking/docking_grid/fred_grid.oeb -dbase sdf/training_sdf.oeb.gz -docked_molecule_file docked/phase_3_training_docked.sdf -hitlist_size 0 -mpi_np 60 -prefix training_docking

        for each oeb.gz file in sdf to dock the molecules.

        Critical step

        Regardless of the program, the docking results must be saved in sdf format to be processed correctly by the next steps; gz compression is allowed. A single docking output file must be saved for each input molecular set. The docking result files must contain the word ‘train’, ‘valid’ or ‘test’ in their name, depending on the set they derive from.

Stage VI: model training (DD phase 4)

Timing ~3 h

  1. 21

    In the DD workflow, a user-defined number of DNNs are trained with molecular fingerprints and docking scores of sampled molecules (converted to binary labels). The process is trivially adaptable to train score predictors for any docking program, because the only difference is the keyword used to extract scores from the docking results. Run the extract_labels.py script from the scripts_2 folder:

    python extract_labels.py --project_name protein_test --file_path ~/DeepDocking/projects --iteration_no 1 --tot_process 3 --score_keyword 'r_i_docking_score'

    The docking scores for each molecular set will be saved to a corresponding txt file inside iteration_1. The number of processes should match the number of docking sdf files (usually three). The score_keyword argument must match the title of the field that stores the docking score of a molecule in the sdf files. For example, for Glide SP results, the field title is ‘r_i_docking_score’, and for FRED results, it is ‘FRED Chemgauss4 score’.
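
    For orientation only, the sketch below shows how docking scores keyed by such a field can be read from an sdf file with RDKit; it is not a replacement for extract_labels.py, and the dictionary output is an assumption made for illustration.

    from rdkit import Chem

    def scores_from_sdf(sdf_path, score_keyword="r_i_docking_score"):
        # Collect {molecule name: docking score} for every record carrying the score field
        scores = {}
        for mol in Chem.SDMolSupplier(sdf_path):
            if mol is not None and mol.HasProp(score_keyword):
                scores[mol.GetProp("_Name")] = float(mol.GetProp(score_keyword))
        return scores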

    Troubleshooting

  2. 22

    Activate the environment and run simple_job_models_manual.py from scripts_2:

    python simple_job_models_manual.py --iteration_no 1 --morgan_directory ~/DeepDocking/library_prepared_fp --file_path ~/DeepDocking/projects/protein_test --number_of_hyp 24 --total_iterations 11 --is_last False --number_mol 1000000 --percent_first_mols 1 --percent_last_mols 0.01 --recall 0.90

    to generate the scripts for model training.

    Critical step

    The setup of DNN models is a critical part of DD. The arguments for the script are explained in detail in Table 1. The recommended values are based on our experience in screening 1B+ libraries.

    Table 1 Parameters for model training
  3. 23

    Model training can be performed using the entire training set generated during random sampling (A). Alternatively, it is possible to evaluate performance at different training sizes without additional docking and choose an optimal training size for the user’s system to be used in successive iterations (B). Option B works only for the first iteration and requires that model training and selection are performed for each investigated size; thus, it may significantly increase the computational cost. Importantly, if the user relies on a docking program with a stochastic algorithm for pose generation (e.g., Autodock52), the database reduction power associated with a specific training size will be higher for more ‘deterministic’ runs (which converge to the same outcome at the cost of more time-consuming simulations), and larger training sets will be required to obtain the same performance for more ‘random’ runs (Supplementary Fig. 2).

    1. (A)

      Regular training

      1. (i)

        Go to the simple_job folder of iteration_1 and run each script (preferably on a GPU) after activating the environment, to train the corresponding model:

        bash simple_job_1.sh

        The models are saved in the all_models folder, and a hyperparameter_morgan_with_freq_v3.txt file will be created in iteration_1. The file will report details for each trained model: model number, training time, hyperparameter values (oversampling, batch size, learn rate, number of hidden layers, number of units per layer, dropout frequency and class weight), score cutoff for virtual hits in the validation set, area under the curve (AUC)53 values for validation and test sets, precision for the validation and test sets, recall in the test set, number of true virtual hits in the test set and remaining molecules in the library on the basis of validation and test sets. In the first iteration, training should take ≤3 h per model.

    2. (B)

      Evaluation of different training sample sizes

      1. (i)

        Use the progressive_evaluator.py script from scripts_2 in add_train_num_mol mode to scale down the number of molecules used for training to 250,000 (or any desired value) in the .sh scripts of the simple_job folder of iteration_1:

        python progressive_evaluator.py --sample_size 250000 --project_name protein_test --project_path ~/DeepDocking/projects/ --mode add_train_num_mol

      2. (ii)

        Go to the simple_job folder of iteration_1 and run each script (preferably on a GPU) after activating the environment, to train the corresponding model:

        bash simple_job_1.sh

        In the first iteration, training should take ≤3 h per model.

      3. (iii)

        Activate the environment, preferably on a GPU, and use the hyperparameter_result_evaluation.py script in scripts_2 to run a grid search to identify the best-performing model:

        python hyperparameter_result_evaluation.py --n_iteration 1 --data_path ~/DeepDocking/projects/protein_test --morgan_directory ~/DeepDocking/library_prepared_fp --number_mol 1000000 --recall 0.90

      4. (iv)

        Run the progressive_evaluator.py script from scripts_2 with finished_iteration mode to extract the performance values of the best selected model:

        python progressive_evaluator.py --sample_size 250000 --project_name protein_test --project_path ~/DeepDocking/projects/ --mode finished_iteration

        This will generate an evaluation_250000 folder in the project folder, storing the files generated during model training and selection. It will also generate an evaluation.csv file in the project folder, reporting sample size, test set recall value, test set precision and predicted number of remaining molecules (based on the test set).

      5. (v)

        Repeat Step 23B(i–iv) for any desired training set size smaller than or equal to the original number of molecules in the training set.

      6. (vi)

        The evaluation.csv file (example in Supplementary Table 3) can be used to plot the expected number of remaining molecules at each sample size, identifying the value that represents a good balance between the number of molecules that can be docked as part of DD training on the user’s system and model performance in terms of database reduction power. See Steps 25–28 for further information on how to interpret the values reported in evaluation.csv and best_model_stats.txt (in the respective evaluation_N folder) files for the selected size.
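
        A hedged plotting sketch is shown below; the column names ('sample_size' and 'remaining_molecules') are assumptions and should be replaced with the actual headers of the generated evaluation.csv.

        import pandas as pd
        import matplotlib.pyplot as plt

        # Assumed column names; adjust them to match the headers of evaluation.csv
        df = pd.read_csv("evaluation.csv").sort_values("sample_size")
        plt.plot(df["sample_size"], df["remaining_molecules"], marker="o")
        plt.xlabel("Training sample size")
        plt.ylabel("Predicted remaining molecules (test set)")
        plt.tight_layout()
        plt.savefig("evaluation_plot.png", dpi=300)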

      7. (vii)

        Once a training set size has been chosen, move the content of the respective evaluation_N folder to iteration_1 and proceed to Step 25 (Step 24 has already been performed as part of the model evaluation).

Stage VII: inference (DD phase 5)

Timing ~4 h

  1. 24

    In the final phase of a DD iteration, the model with the highest precision is identified and used to predict the virtual hit-likeness of all molecules in the chemical library; molecules with virtual hit-likenesses below the probability threshold selected to ensure the predefined recall in the validation set are then discarded. Activate the environment, preferably on a GPU, and use the hyperparameter_result_evaluation.py script in scripts_2 to perform a grid search on the models:

    python hyperparameter_result_evaluation.py --n_iteration 1 --data_path ~/DeepDocking/projects/protein_test --morgan_directory ~/DeepDocking/library_prepared_fp --number_mol 1000000 --recall 0.90

    The performances and statistics of the best model will be saved in a best_model_stats.txt file in the iteration folder (Box 2); the Total Left Testing value in that file indicates an estimation of the number of molecules from the library that will be classified as prospective virtual hits, calculated by scaling the values observed for the test set. The model will be saved in the best_models folder together with a thresholds.txt file reporting model number, virtual hit probability threshold and docking score threshold for virtual hits.
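
    The scaling behind this estimate can be illustrated with the small sketch below (all numbers are purely illustrative):

    # Purely illustrative numbers: the fraction of test-set molecules predicted as virtual hits
    # is extrapolated to the full library to estimate the remaining molecules.
    predicted_hits_in_test = 9_000      # test molecules above the probability cutoff (example value)
    test_set_size = 1_000_000
    library_size = 1_360_000_000
    total_left_estimate = library_size * predicted_hits_in_test / test_set_size
    print(f"Estimated remaining molecules: {total_left_estimate:,.0f}")  # ~12,240,000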

  2. 25

    Read the recall value in the test set, reported in the Model Recall line in best_model_stats.txt. This should match the value that was chosen at Step 22.

    Critical step

    A test recall value that differs by >0.015 from the predefined value indicates poor model generalizability caused by validation and test sets that are not sufficiently large. As already indicated in the Critical step of Step 14, the user should generate validation and test sets of as large a size as possible in the first iteration.

  3. 26

    (If needed) If at any iteration the test set recall significantly differs from the predefined value, increase the size of validation and test sets and restart training at Step 21.

    Troubleshooting

  4. 27

    Check the precision and number of remaining molecules estimated from the test set, indicated by the Model Precision and Total Left Testing lines of best_model_stats.txt, respectively. These values will depend on the size of the training set and the predefined recall value. Low precision values (and consequently high numbers of remaining molecules) are due to a small training set and/or a challenging target. The procedure illustrated at Step 23B can guide the user to select a properly sized training set without additional docking.

  5. 28

    (If needed) When during the first iteration the precision value is <0.0225 (2.25%) and the number of remaining molecules is >40% of the starting number in the database, restart the iteration and increase the training set size. If the training set cannot be further increased or its increase does not have a significant effect on the number of remaining molecules, we recommend rerunning model training from Step 22 by decreasing the recall value at Steps 22 and 24 by 0.05 and repeating this process until <40% of the original molecules are retained; then, keep the new recall value for the remainder of the run. Two illustrative cases are reported in Box 2, where a DD iteration has been run against the same target using (i) excessively small and (ii) properly sized molecular sets. In the first case (Example 1), the recall value in the test set differs by 0.026 from the expected value (0.90). Moreover, the precision is low (1.59%), and >50% of the original molecules in the library are retained. Setting the size of the molecular sets to 700,000 molecules (Example 2) resulted in substantially higher precision and AUC values, which allowed us to discard ~79% of the original molecules, and improved model generalizability as well.

  6. 29

    Use simple_job_predictions_manual.py in scripts_2 to generate the scripts for inference:

    python simple_job_predictions_manual.py --project_name protein_test --file_path ~/DeepDocking/projects --n_iteration 1 --morgan_directory ~/DeepDocking/library_prepared_fp

    One inference script per file in library_prepared_fp will be generated in the simple_job_predictions folder, inside the iteration directory.

  7. 30

    Activate the environment, preferably on a GPU, and launch all the inference scripts from the simple_job_predictions folder of iteration_1:

    bash simple_job_1.sh

    The names and virtual hit-likenesses for the qualified molecules will be saved in the morgan_1024_predictions folder of iteration_1.

  8. 31

    (Recommended) When inference is finished, compare the number of positively predicted molecules in morgan_1024_predictions with the Total Left Testing value in best_model_stats.txt to confirm model generalizability. If the values are substantially different, larger validation and test sets must be regenerated, and training and inference must be repeated, as described in Step 26.

Successive iterations

Timing ~27 h per iteration

  1. 32

    In subsequent iterations, molecules are sampled from the pool of positive predictions of the previous iteration rather than the original library. For iteration N, activate the environment and from scripts_1 run:

    python molecular_file_count_updated.py --project_name protein_test --n_iteration N --data_directory ~/DeepDocking/projects/protein_test/iteration_N-1/morgan_1024_predictions --tot_process 60 --tot_sampling 1000000

    The script will generate a Mol_ct_file_updated_protein_test.csv file inside the morgan_1024_predictions folder of the N–1 iteration, reporting the number of molecules to sample from each file generated during the previous inference stage. Note that the total number of molecules to sample can now be reduced to 1,000,000 for training augmentation, because validation and test sets are generated only during the first iteration.

  2. 33

    Run the sampling.py script as:

    python sampling.py --project_name protein_test --file_path ~/DeepDocking/projects --n_iteration N --data_directory ~/DeepDocking/projects/protein_test/iteration_N-1/morgan_1024_predictions --tot_process 60 --train_size 1000000 --val_size 1000000

    Critical step

    train_size must be equal to the tot_sampling value of Step 32. val_size always refers to the size of the validation and test sets that were generated in the first iteration.

  3. 34

    Repeat Steps 16–31 as described previously, modifying the current-iteration-number argument where required. Model training will require more time, up to 12 h per model.

  4. 35

    (If needed) The progress of a DD campaign can be monitored using the plot_progress.py program in scripts_2. To do this, start by activating the conda environment. Then, plot the variation in the number of remaining molecules and the corresponding receiver operating characteristic (ROC) curves55 with the following command:

    python plot_progress.py --project ~/DeepDocking/projects/protein_test --size_test_set 1000000 --start_iteration 1 --end_iteration N --output_folder ~/DeepDocking

    start_iteration and end_iteration arguments can be adjusted to any range of iterations that the user would like to analyze. The program will generate a figure file in the folder specified by --output_folder argument, showing the number of remaining molecules at each iteration, as well as the ROC curve and AUC value (calculated in the test set), indicating the model performance.

  5. 36

    Run the predefined number of iterations or stop earlier if the number of prospective virtual hits is low enough to be actually docked with the available computational resources. After completing the last DD iteration, the remaining molecules should be explicitly docked to eliminate misclassified low-scoring entries and focus on true top-scoring molecules. Different strategies can be adopted to post-process the remainder of the library, depending on its size and available resources (see also Troubleshooting):

    • Dock all the remaining molecules to the target conventionally and identify virtual hits by their docking scores.

    • Decrease the recall value and re-run the final iteration; then, dock the resulting smaller set of the remaining molecules.

    • Rank the prospective hits by their virtual hit-likeness and dock only a top-ranked subset to the target. This strategy has been proven effective to retain active compounds both in retrospective10 and prospective14 virtual screening campaigns.

    • Run additional iteration(s) until reaching a number of ‘dockable’ molecules. Additional iterations with more stringent virtual hit selectivity criteria can be run after the predefined iterations have been completed. To do this, run Steps 14–31 by changing the number of the current iteration and by updating the number of total iterations (--total_iterations) and the percentage of top-scoring molecules to be considered as virtual hits in the additional iteration (--percent_last_mols) in Step 22.

      Troubleshooting

Stage VIII: final phase

Timing Depends on the number of final predicted virtual hits

  1. 37

    Extract the SMILES of molecules predicted as virtual hits in the last iteration. To extract all the remaining molecules, activate the environment and run the final_extraction.py script available in utilities:

    python final_extraction.py -smile_dir ~/DeepDocking/library_prepared -prediction_dir ~/DeepDocking/projects/protein_test/iteration_11/morgan_1024_predictions -processors 60

    The script will save the SMILES of the predicted virtual hits into a smiles.csv file and the names and virtual hit-likeness into an id_score.csv file.

  2. 38

    Prepare the SMILES for docking as described in Step 19 and dock the resulting structures into the target site. Virtual hits can be identified by comparing their docking scores with the score threshold used in the last iteration (printed to the thresholds.txt file in best_models).
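
    As a hedged sketch of this comparison (assuming a docking program for which more negative scores are better, as for Glide SP and FRED; the file name, score cutoff and score field are illustrative assumptions):

    from rdkit import Chem

    # Illustrative cutoff; in practice, read the last-iteration score threshold from thresholds.txt
    score_cutoff = -7.5
    final_hits = {}
    for mol in Chem.SDMolSupplier("final_docking_results.sdf"):   # hypothetical output file name
        if mol is None or not mol.HasProp("r_i_docking_score"):
            continue
        score = float(mol.GetProp("r_i_docking_score"))
        if score <= score_cutoff:                                 # more negative = better score
            final_hits[mol.GetProp("_Name")] = score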

Procedure 2: automated DD

Stage I: chemical library processing

Timing ~13 h

Critical

As in Procedure 1, stereoisomers, tautomers and protomers must be enumerated using, for instance, OpenEye tools, and Morgan fingerprints must be calculated (Fig. 3). Other similar database preparation tools can be used for the same purpose. The procedure described here calculates all the possible stereoisomers of a given molecule and the dominant tautomer form (protonated at pH 7.4) for each isomer. These commands can be changed, for example, to enumerate multiple tautomers and protomers (see https://www.eyesopen.com/quacpac for guidance); each of these must be assigned a unique name, however, resulting in a substantial increase in the number of unique library entries. For some chemical libraries, the tautomers and protomers are partially or completely enumerated, and therefore these steps may not be required.

  1. 1

    Prepare the DD repository and SMILES files as described in Steps 1–3 of Procedure 1.

    Critical step

    To run the jobs for automated DD on a computing cluster, the user must either modify the script headers or pass the values for the #SBATCH lines from the command line to meet the requirements of a specific Slurm cluster. Useful resources for translating Slurm syntax to other schedulers can be found at https://www.msi.umn.edu/slurm/pbs-conversion (Slurm/Portable Batch System (PBS) conversion) or https://srcc.stanford.edu/sge-slurm-conversion (Slurm/Sun Grid Engine (SGE) conversion).

  2. 2

    Use the compute_states.sh script provided in the utilities folder to submit SMILES preparation jobs (Fig. 3):

    for i in $(ls ~/DeepDocking/smiles/smiles_all_*.smi); do sbatch compute_states.sh $i ~/DeepDocking/library_prepared; done

    The prepared files will be output in a library_prepared folder.

  3. 3

    Use the compute_morgan_fp.sh script to calculate the corresponding Morgan fingerprints:

    sbatch --cpus-per-task 60 compute_morgan_fp.sh ~/DeepDocking/library_prepared ~/DeepDocking/library_prepared_fp 60 dd-env

    The script arguments are the path to the folder containing the prepared SMILES, the path to the output fingerprint folder, the number of cores for multiprocessing and the name of the conda environment. The job will save the fingerprint files in a library_prepared_fp folder.

    Critical step

    See the Critical step of Step 7 of Procedure 1.

Stage II: receptor preparation

Timing ~30 min

  1. 4

    Follow Steps 8–12 of Procedure 1 to prepare a receptor for docking.

Stage III: automated phase 1

Timing ~15 min

  1. 5

    Create a project folder inside ~/DeepDocking/projects called protein_test_automated.

  2. 6

    Using a text editor, create a logs.txt file (Box 3) in the project folder, listing the path to the project folder, the project folder name, the location of the docking grid, the location of the fingerprint library, the location of the SMILES library, the docking program (either Glide or FRED), the number of models to train at each iteration (the recommended value is 24; allowed values are 16, 24, 48, 72 and 144), the desired size of validation and test sets and the location of the Glide template docking script (not required if a different docking program is used).

    Critical step

    See the Critical step of Step 14 of Procedure 1.

  3. 7

    Run the phase_1.sh script in the DD_protocol folder to submit a Slurm job for automated random sampling:

    sbatch --cpus-per-task 60 phase_1.sh 1 60 ~/DeepDocking/projects protein_test_automated 1000000 dd-env

    The arguments for phase_1.sh script are the current iteration number, the number of cores to use for multiprocessing (on a single CPU node), the path to the project folder, the name of the project folder, the desired size of the training set and the name of the conda environment.

Stage IV: automated phase 2

Timing ~12 h

  1. 8

    The OpenEye OMEGA package is used for automated generation of 3D conformations of sampled molecules. Molecules can be prepared for docking with Glide (A) or FRED (B). For customizing the ligand preparation process, refer to OMEGA documentation at https://docs.eyesopen.com/applications/omega/.

    1. (A)

      Ligand preparation for Glide docking

      1. (i)

        From the DD_protocol folder, run:

        sbatch phase_2_glide.sh 1 60 ~/DeepDocking/projects protein_test_automated cpu-partition

        The script requires the following arguments: the current iteration number, the number of cores to use per CPU node (note that three jobs on three nodes will be submitted, each using the specified number of cores to process one of the molecular sets), the path to the project folder, the name of the project folder and the name of the CPU partition. The 3D conformations will be generated in an sdf folder of iteration_1.

    2. (B)

      Ligand preparation for FRED docking

      1. (i)

        From the DD_protocol folder, run:

        sbatch phase_2_fred.sh 1 60 ~/DeepDocking/projects protein_test_automated cpu-partition

        The script requires the following arguments: the current iteration number, the number of cores to use per CPU node, the path to the project folder, the name of the project folder and the name of the CPU Slurm partition. The 3D conformations will be saved in an sdf folder of iteration_1.
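    Whichever ligand-preparation option is used, the content of the conformer folder can be inspected once the jobs complete (an optional check, not part of the protocol scripts):

    ls ~/DeepDocking/projects/protein_test_automated/iteration_1/sdf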

Stage V: automated phase 3

Timing ~20 h

  1. 9

    Phase 3 can be performed by using either Glide SP (A) or FRED (B) docking:

    1. (A)

      Glide SP docking

      1. (i)

        Use the provided Slurm script to submit Glide docking jobs:

        sbatch phase_3_glide.sh 1 600 ~/DeepDocking/projects protein_test_automated

        The arguments of the script are the current iteration number, the total number of Glide jobs to submit, the path to the project folder and the project folder name. Note that the $SCHRODINGER environment variable must have been set to the installation directory of the program to perform Glide docking (see Step 20A(ii) of Procedure 1); an example export command is shown after this step.

    2. (B)

      FRED docking

      1. (i)

        Use the provided Slurm script to submit FRED docking jobs:

        sbatch phase_3_fred.sh 1 60 ~/DeepDocking/projects protein_test_automated cpu-partition

        The script requires the following arguments: the number of the current iteration, the number of cores to use per CPU node, the path to the project folder, the name of the project folder and the name of the CPU Slurm partition.
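    Regarding the $SCHRODINGER requirement noted in option A, the variable simply needs to point to the local Schrödinger installation directory before the Glide jobs are submitted; the path below is a hypothetical placeholder:

    export SCHRODINGER=/opt/schrodinger/suites2023-1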

Stage VI: automated phase 4

Timing ~3 h

  1. 10

    In the first iteration, the whole training set can be used to train the models (A). Alternatively, the user can evaluate the performance obtained with different training-set sizes, without any additional docking, to select an optimal sample size for the remainder of the run (B). Note that option B can be performed only in the first iteration; because it requires retraining the models multiple times, it can significantly increase the required time.

    1. (A)

      Regular training

      1. (i)

        From the DD_protocol, run:

        sbatch phase_4.sh 1 3 ~/DeepDocking/projects protein_test_automated gpu-partition 11 1 0.01 0.90 00-20:00 dd-env

        The arguments required by the phase_4.sh script are the current iteration number, the number of sdf docking files, the path to the project folder, the name of the project folder, the name of the GPU Slurm partition, the total number of iterations, the percentage of top molecules considered as virtual hits in the first iteration, the percentage of top molecules considered as virtual hits in the last iteration, the recall value, the maximum training time (in the format days-hours:minutes) after which a training job is canceled and the name of the conda environment. In the first iteration, training usually does not require more than 3 h.

        Critical step

        See the Critical step of Step 22 of Procedure 1 for recommendations on how to choose the values of the arguments of the script.

    2. (B)

      Evaluation of different training sample sizes

      1. (i)

        From the DD_protocol, run:

        sbatch progressive_evaluation.sh ~/DeepDocking/projects protein_test_automated gpu-partition 1 0.01 0.90 1000000 250000 4 00-20:00 dd-env

        to automatically perform training with four different set sizes between 250,000 and 1,000,000 molecules (modify the script arguments for other ranges). The script arguments are the path to the project folder, the name of the project folder, the name of the GPU partition, the percentage of top molecules considered as virtual hits in the first iteration, the percentage of top molecules considered as virtual hits in the last iteration, the recall value, the maximum size to consider for training (which must not exceed the original number of sampled training molecules), the minimum size to consider for training, the number of sizes to evaluate between the minimum and maximum (inclusive), the maximum training time (in the format days-hours:minutes) after which a training job is canceled and the name of the conda environment. This will generate four evaluation_N folders (N = 250,000; 500,000; 750,000; and 1,000,000) in the project directory, storing the related files generated during model training and selection. It will also generate an evaluation.csv file in the project folder, reporting the sample sizes, test set recall values, test set precision values and the predicted numbers of remaining molecules.

      2. (ii)

        Perform Step 23B(vi) of Procedure 1.

      3. (iii)

        Once a training set size has been chosen, move the content of the respective evaluation_N folder to iteration_1 and proceed with Step 11.

        Troubleshooting

Stage VII: automated phase 5

Timing ~4 h

  1. 11

    From the DD_protocol, run:

    sbatch phase_5.sh 1 ~/DeepDocking/projects protein_test_automated 0.90 gpu-partition dd-env

    The script arguments are the current iteration number, the path to the project folder, the project folder name, the recall value, the name of the GPU partition and the name of the conda environment. Predicted virtual hits will be saved in a morgan_1024_predictions folder of iteration_1.

    Critical step

    See the Critical step of Step 25 of Procedure 1.

    Troubleshooting

  2. 12

    When inference jobs are completed, we suggest comparing the number of positively predicted molecules in morgan_1024_predictions with the Total Left value in best_model_stats.txt, to confirm model generalizability. If the values are significantly different, validation and test sets must be regenerated with a larger size, and training and inference must be repeated, as described in Step 26 and the related Troubleshooting for Procedure 1.
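    A minimal way to perform this comparison from the command line, assuming that each positively predicted molecule occupies one line in the prediction files and that best_model_stats.txt is located in the iteration folder, is:

    cat ~/DeepDocking/projects/protein_test_automated/iteration_1/morgan_1024_predictions/* | wc -l
    grep 'Total Left' ~/DeepDocking/projects/protein_test_automated/iteration_1/best_model_stats.txt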

Successive iterations

Timing ~27 h per iteration

  1. 13

    Perform all the steps from Step 7 to Step 12 as indicated above, updating the current iteration number where necessary (an example for the second iteration is sketched below). Model training will take progressively longer, but no more than ~12 h.
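    For example, assuming the same argument values chosen in the first iteration (Glide shown; substitute the FRED variants if applicable), the second iteration corresponds to re-running the phase scripts with the current iteration number set to 2, waiting for all jobs of one phase to finish before submitting the next:

    sbatch --cpus-per-task 60 phase_1.sh 2 60 ~/DeepDocking/projects protein_test_automated 1000000 dd-env
    sbatch phase_2_glide.sh 2 60 ~/DeepDocking/projects protein_test_automated cpu-partition
    sbatch phase_3_glide.sh 2 600 ~/DeepDocking/projects protein_test_automated
    sbatch phase_4.sh 2 3 ~/DeepDocking/projects protein_test_automated gpu-partition 11 1 0.01 0.90 00-20:00 dd-env
    sbatch phase_5.sh 2 ~/DeepDocking/projects protein_test_automated 0.90 gpu-partition dd-env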

  2. 14

    (If needed) Run Step 35 of Procedure 1 to analyze the progress of the campaign.

  3. 15

    Run the predefined number of iterations, or continue until the number of remaining molecules is small enough to be docked conventionally. If the number of molecules is still too high after completion of the last iteration, additional iterations with stricter virtual-hit criteria can be run by performing Steps 7–12 and updating the current iteration number. In this case, it is also necessary to modify the total number of iterations specified in Step 10, as well as the percentage of top-scoring molecules considered as virtual hits in the additional iterations (thereby increasing the virtual-hit selectivity). This and other strategies to further reduce the number of compounds are illustrated in Step 36 of Procedure 1 and in the Troubleshooting section.

    Troubleshooting

Stage VIII: automated final phase

Timing Depends on the number of final predicted virtual hits

  1. 16

    To obtain the SMILES and virtual hit-likenesses of the retained molecules in the last iteration, run from utilities:

    sbatch --cpus-per-task 60 final_extraction.sh ~/DeepDocking/library_prepared ~/DeepDocking/projects/protein_test_automated/iteration_11/morgan_1024_predictions 60 'all_mol' dd-env

    The arguments required by the script are the location of the prepared SMILES library, the location of the prospective virtual hits identified in the last iteration, the number of cores to use for multiprocessing, the number of molecules to extract (if this value is set to 'all_mol', the SMILES of all the prospective virtual hits will be extracted) and the name of the conda environment. The SMILES and virtual hit-likenesses of the molecules will be saved in two corresponding files in the folder from which the script is launched.

  2. 17

    Perform Step 38 of Procedure 1.

Troubleshooting

Troubleshooting advice can be found in Table 2.

Table 2 Troubleshooting table

Timing

  • The timing for running the protocol was estimated on a cluster with 640 cores (64 cores per CPU node) and 50 GPUs using 60 cores for each OpenEye job.

  • Steps 1–7 of Procedure 1 and Steps 1–3 of Procedure 2, chemical library processing: ~13 h

  • Steps 8–12 of Procedure 1 and Step 4 of Procedure 2, receptor preparation: ~30 min

  • Steps 13–18 of Procedure 1 and Steps 5–7 of Procedure 2, random sampling: ~15 min

  • Step 19 of Procedure 1 and Step 8 of Procedure 2, ligand preparation: ~12 h

  • Step 20 of Procedure 1 and Step 9 of Procedure 2, docking: ~20 h

  • Steps 21–23 of Procedure 1 and Step 10 of Procedure 2, model training: ~3 h for a single training set size

  • Steps 24–31 of Procedure 1 and Steps 11 and 12 of Procedure 2, inference: ~4 h

  • Steps 32–36 of Procedure 1 and Steps 13–15 of Procedure 2, successive iterations: ~27 h per iteration

  • Steps 37 and 38 of Procedure 1 and Steps 16 and 17 of Procedure 2, final phase: depends on the size of the remainder

Anticipated results

The goal of the DD iterative workflow is to progressively improve docking score prediction and to focus on the most promising molecules in a library while rapidly discarding low-scoring entries. Hence, the number of predicted virtual hits is expected to decrease at each new iteration compared with the previous one, as illustrated in the example of Fig. 5a. At the same time, the accuracy of the models, as measured by the AUC values of the ROC curves (Fig. 5b), should increase at each iteration. As described in the procedure, the progress of a DD campaign can be monitored using the plot_progress.py program, which will automatically plot the number of predicted virtual hits and the ROC curves with associated AUC values for a given range of iterations (Fig. 5). Notably, Procedures 1 and 2 presented in this protocol differ only in their level of automation and required resources; hence, both Procedures are expected to lead to the same outcomes for a DD campaign.

Fig. 5: Iterative model improvement during DD iterations (virtual screening of ZINC20 library against the active site of SARS-CoV-2 papain-like protease (PDB ID: 7LBR56) using Glide SP).
figure 5

a, Decreasing number of prospective virtual hits in the library. b, Increasing AUC values. FPR, false-positive rate; it, iteration; TPR, true-positive rate.

For most cases, the protocol can reduce the size of the docking library from a billion entries to a manageable number of molecules (10–40 million). For certain targets, however, the number of predicted qualifying hits (after completing the predefined number of iterations) can still be too high (100+ million) to process with conventional docking. It is possible to detect such problematic cases early by checking the database reduction achieved in the first iteration, which is a reliable indication of the overall quality of score prediction; empirical approaches that we have often found effective are to either increase the training set size or lower the recall value, repeating training and inference until ≥60% of the library molecules are discarded in the first iteration, and then to proceed with the new recall for subsequent iterations (see Step 28 of Procedure 1 for more details). If the number of molecules is still challenging to dock with the available resources after the predefined number of DD iterations has been reached, we recommend making use of the strategies outlined in Step 36 of Procedure 1 and Step 15 of Procedure 2. By using the most suitable approach, a DD user should be able to explicitly dock the remaining molecules with the computational resources at their disposal, identifying top-scoring molecules that can advance to further CADD refinement (if needed) and experimental evaluation.
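As a rough illustration of this check, the fraction of the library discarded in the first iteration can be estimated as 1 − (predicted number of remaining molecules / library size). For example, with a library of 1.36 billion molecules and a hypothetical 400 million molecules predicted to remain, the reduction would be 1 − 0.4/1.36 ≈ 71%, which satisfies the ≥60% guideline; the same calculation can be done on the command line:

awk 'BEGIN { library = 1360000000; left = 400000000; printf "fraction discarded: %.2f\n", 1 - left/library }'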

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.