PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications

Computing binding affinities is of great importance in the drug discovery pipeline, and predicting them with advanced machine learning methods remains a major challenge because existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we have developed the PLAS-20k dataset, an extension of the previously developed PLAS-5k, with 97,500 independent simulations on a total of 19,500 different protein-ligand complexes. Our results show good correlation with the available experimental values and perform better than docking scores. This holds true even for a subset of ligands that follow Lipinski's rule, and for diverse clusters of complex structures, thereby highlighting the importance of the PLAS-20k dataset for developing new ML models. In addition, our dataset classifies strong and weak binders better than docking does. Further, the OnionNet model has been retrained on the PLAS-20k dataset and is provided as a baseline for the prediction of binding affinities. We believe that large-scale MD-based datasets, together with their trajectories, will form a new synergy, paving the way for accelerating drug discovery.


Binding Affinity Cutoff
Equations (1) and (2) give an experimental binding affinity of -9.54 kcal/mol for a strong binder (SB) with a binding constant K_i of 10 nM, and -6.82 kcal/mol for a weak binder (WB) with a binding constant K_i of 100 µM, respectively.
The corresponding correlation plot is shown in Fig. S5a.

Evaluation Metrics
Precision: Precision is a measure of how accurately a classifier predicts a class. It is computed as the ratio of correctly predicted positives to the total predicted positives.

Accuracy: Accuracy is a measure of how accurately a classifier can predict both positive and negative classes. It is computed as the ratio of correctly predicted samples to the total number of samples.

$$\text{Accuracy} = \frac{\text{True positives} + \text{True negatives}}{\text{True positives} + \text{True negatives} + \text{False positives} + \text{False negatives}}$$

F1-Score: F1-Score is a measure of a classifier's accuracy, taking into account both precision and recall. It is computed as the harmonic mean of precision and recall.

Support: Support is the number of samples belonging to a particular class.
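
As a minimal sketch, the metrics defined above can be computed from the four confusion-matrix counts of a binary classifier. The function name and the example counts are hypothetical and are not taken from the PLAS-20k results; recall, which the F1-Score requires, is assumed here to be the ratio of correctly predicted positives to all actual positives.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return precision, recall, accuracy, F1 and support for the positive class.

    tp, tn, fp, fn are counts of true positives, true negatives,
    false positives and false negatives (illustrative inputs).
    """
    # Precision: correctly predicted positives / all predicted positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (assumed definition): correctly predicted positives / all actual positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Accuracy: correctly predicted samples / all samples
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # F1: harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Support: number of samples that actually belong to the positive class
    support = tp + fn
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
        "support": support,
    }


# Hypothetical counts, e.g. strong binders treated as the positive class
metrics = classification_metrics(tp=40, tn=30, fp=10, fn=20)
```

Guarding the divisions keeps the sketch usable for a class that is never predicted; returning 0.0 in that case is one common convention, not the only one.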

Equation (3) introduces a cutoff value that is used to classify binding affinities as either strong or weak. The cutoff is the average of the experimental binding affinities for SB and WB:

$$Cutoff_{exp} = \frac{-9.54 + (-6.82)}{2} = -8.18 \text{ kcal/mol} \tag{3}$$

Equation (4) represents the linear regression line, where $Y^{MMPBSA}_{exp}$ is determined as a function of $X$, the experimental binding affinity. The regression line is derived from the pairwise correlation between experimental and MMPBSA binding affinities, and the MMPBSA values corresponding to SB and WB are estimated from it:

$$Y^{MMPBSA}_{exp} = -12.94 + 3.15 \times X \tag{4}$$

$$B.E^{strong}_{MMPBSA} = -12.94 + 3.15 \times (-9.54) = -42.99 \text{ kcal/mol} \tag{5}$$

$$B.E^{weak}_{MMPBSA} = -12.94 + 3.15 \times (-6.82) = -34.423 \text{ kcal/mol} \tag{6}$$

Equation (7) gives the MMPBSA cutoff value used to distinguish strong and weak binders:

$$Cutoff_{MMPBSA} = \frac{B.E^{weak}_{MMPBSA} + B.E^{strong}_{MMPBSA}}{2} = -38.70 \text{ kcal/mol} \tag{7}$$

PL complexes with MMPBSA affinities stronger (more negative) than the cutoff value of -38.70 kcal/mol are classified as SB, and complexes with MMPBSA affinities weaker (less negative) than the cutoff value are classified as WB.

The same procedure as for the MMPBSA cutoff was applied to estimate the docking cutoff. The regression equation based on docking affinities is given in Equation (8):

$$Y^{Docking}_{exp} = -3.08 + 0.40 \times X \tag{8}$$

$$B.E^{strong}_{Docking} = -3.08 + 0.40 \times (-9.54) = -6.896 \text{ kcal/mol} \tag{9}$$

$$B.E^{weak}_{Docking} = -3.08 + 0.40 \times (-6.82) = -5.808 \text{ kcal/mol} \tag{10}$$

Equation (11) gives the docking cutoff value used to distinguish strong and weak binders:

$$Cutoff_{Docking} = \frac{B.E^{weak}_{Docking} + B.E^{strong}_{Docking}}{2} = -6.352 \text{ kcal/mol} \tag{11}$$

PL complexes with docking affinities stronger (more negative) than the cutoff value of -6.352 kcal/mol are classified as SB, and complexes with docking affinities weaker (less negative) than the cutoff value are classified as WB.
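
The cutoff derivation above can be retraced numerically. The following is a sketch under stated assumptions: the regression coefficients and the experimental SB and WB affinities are the values quoted in the text, while the function names and the handling of a value exactly at the cutoff are hypothetical. It also assumes that the stronger binder is the one with the more negative affinity, which is how the SB and WB reference values are ordered.

```python
# Experimental reference affinities (kcal/mol) quoted in the text
EXP_STRONG = -9.54  # strong binder (SB)
EXP_WEAK = -6.82    # weak binder (WB)


def regression_cutoff(intercept, slope):
    """Map the SB and WB experimental affinities through a linear
    regression Y = intercept + slope * X and average the two results."""
    strong = intercept + slope * EXP_STRONG
    weak = intercept + slope * EXP_WEAK
    return (strong + weak) / 2


# Regression coefficients as given for the MMPBSA and docking fits
MMPBSA_CUTOFF = regression_cutoff(-12.94, 3.15)
DOCKING_CUTOFF = regression_cutoff(-3.08, 0.40)


def classify_binder(affinity, cutoff):
    """Label a complex as 'SB' or 'WB'.

    Assumption: more negative than the cutoff means strong binder;
    a value exactly at the cutoff is treated as weak (arbitrary choice).
    """
    return "SB" if affinity < cutoff else "WB"


# Hypothetical MMPBSA affinities for two complexes
labels = [classify_binder(a, MMPBSA_CUTOFF) for a in (-45.0, -30.0)]
```

Because both cutoffs come from the same two experimental reference points, only the regression coefficients differ between the MMPBSA and docking cases.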

Figure S1: Screenshot of the PDB Viewer.

Figure S3: The distribution of RMSD of the protein and ligand from molecular dynamics simulations.

Figure S4: The distribution of binding affinities computed for 14,500 protein-ligand complexes using the MMPBSA method.

Figure S5: Correlation plots for the PL complexes (4343) with experimental binding affinities, categorized as weak and strong binders: (a) Experimental vs MMPBSA, (b) Experimental vs Docking.

Figure S8: Correlation plots for a set of PDB IDs from PLAS-20k (molecular weight of the ligand as per Lipinski's rule of five) for which experimental binding affinities are known: (a) Experimental vs Docking, (b) Experimental vs MMPBSA.

Figure S9: Correlation plots for a set of PDB IDs from PLAS-20k (number of hydrogen bond acceptors of the ligand as per Lipinski's rule of five) for which experimental binding affinities are known: (a) Experimental vs Docking, (b) Experimental vs MMPBSA.

Figure S10: Correlation plots for a set of PDB IDs from PLAS-20k (number of hydrogen bond donors of the ligand as per Lipinski's rule of five) for which experimental binding affinities are known: (a) Experimental vs Docking, (b) Experimental vs MMPBSA.

Figure S11: The distribution of the ligand descriptors for PLAS-20k PL complexes: (a) number of amide bonds, (b) number of aromatic carbocycles, (c) number of aromatic rings, (d) number of rings, (e) number of rotatable bonds and (f) molecular weight.

Figure S12: The distribution of the calculated energy components of binding affinity from the MMPBSA method: (a) electrostatic, (b) van der Waals, (c) non-polar solvation free energy and (d) polar solvation free energy, for 14,500 protein-ligand complexes.