Molecular dynamics trajectories for 630 coarse-grained drug-membrane permeations

The permeation of small-molecule drugs across a phospholipid membrane bears much interest both in the pharmaceutical sciences and in physical chemistry. Connecting the chemistry of the drug and the lipids to the resulting thermodynamic properties remains of immediate importance. Here we report molecular dynamics (MD) simulation trajectories using the coarse-grained (CG) Martini force field. A wide, representative coverage of chemistry is provided: across solutes—exhaustively enumerating all 105 CG dimers—and across six phospholipids. For each combination, umbrella-sampling simulations provide detailed structural information of the solute at all depths from the bilayer midplane to bulk water, allowing a precise reconstruction of the potential of mean force. Overall, the present database contains trajectories from 15,120 MD simulations. This database may serve the further identification of structure-property relationships between compound chemistry and drug permeability.


Background & Summary
The passive permeation of small organic, drug-like molecules across phospholipid membranes has garnered much interest, not only to practically optimize pharmaceutical properties, but also as a more fundamental physical-chemistry problem 1 . The latter acts as a testbed to understand the molecular driving forces at play during a permeation process across a soft interface. A more robust understanding of the structure-property relationship can be obtained by screening across chemistries and systematically measuring the permeability coefficient from in vitro experiments 2,3 . Given the small size and apparent bias of databases of experimental compounds 4 , the perspective to harness computational methods at high throughput has been on the rise [5][6][7][8][9] .
Permeation is described using the inhomogeneous solubility-diffusion model to yield a diffusion process in terms of a one-dimensional Smoluchowski equation along z-the normal to the membrane midplane. The resulting permeability coefficient, P, takes the form is the inverse temperature, G(z) is the potential of mean force (PMF), and D(z) is the local diffusivity. As such, knowledge of G(z) and D(z) enables an in silico estimation of P, which may be obtained from molecular dynamics (MD) simulations. While the direct estimation of these quantities from brute-force MD typically fails, enhanced-sampling methods have provided a robust strategy to estimate both G(z) and D(z). Equation 1 depends exponentially on G(z), but only linearly on D(z), making the latter quantity less critical-it was also found to depend rather weakly on the chemistry of the drug 10 . A large number of enhanced-sampling studies have demonstrated the capability to not only converge the PMF, but also to provide permeability coefficients that exhibit high correlation with experimental measurements [10][11][12][13][14][15][16] . An illustrative example of the PMF is shown in Fig. 1a, together with a cartoon of a phospholipid membrane in the background.
The in silico route is predictive and generalizable in that it does not rely on adjustable parameters: the main input of an MD simulation is the force field, often parametrized on properties unrelated to interactions with a phospholipid membrane 17,18 . Critically, this limits the danger of overfitting observed in statistical models 19 . The main downside of using MD simulations is the computational investment: atomistic simulations with explicit solvent typically require 10 5 CPU-hours for a small molecule in a single-component lipid membrane 10,13,14,20 , hindering the prospects of running them at high-throughput.
We have recently proposed the use of coarse-grained (CG) models to tackle this problem. Coarse-graining can enable a more efficient sampling of the conformational space by lumping together atoms into super-particles or beads 21,22 . In particular we relied on the CG Martini model [23][24][25] , which is specifically tailored to reproduce the partitioning behavior of compounds in different environments-thus making it particularly well suited for permeability calculations. The modularity of Martini means that it constructs molecules based on a small set of bead types, each one encoding different chemical properties-mainly hydrophobicity, hydrogen-bonding, and charge (see Table 1 for details). We reported the systematic calculation of PMFs using umbrella sampling for all

Polarity
Type Table 1. Characteristics of non-charged Martini bead types. The table contains information about the beadtype name (Type), hydrogen-bonding capability (HB, "d" and "a" for donor and acceptor, respectively), and the octanol/water partitioning free energy (ΔG Ol→W = ΔG water − ΔG octanol , in units of kcal/mol), as reported elsewhere 38 . www.nature.com/scientificdata www.nature.com/scientificdata/ CG compounds made of one and two neutral beads (hereafter denoted unimers and dimers) 26 . This amounted to 14 unimers and 14 × 15/2 = 105 dimers (Fig. 1c). The thermodynamic parametrization of Martini yields accurate PMFs, as compared to reference curves from atomistic simulations, as well as remarkably-accurate permeability coefficients, as compared to reference simulations and experiments 27 . Because of the transferable nature of Martini, the CG model significantly reduces the size of chemical space, such that these 119 computer simulations offer estimates for more than 500,000 small molecules 26,27 . We more recently extended our approach to linear trimers and tetramers, demonstrating that the screening range can be significantly increased 28 . As such, Martini offers a robust methodology to run high-throughput computer simulations of drug-membrane permeability.
The present database reports the full umbrella-sampling MD trajectories necessary to run PMF calculations for all Martini dimers in six different single-component phospholipid membranes. The 105 dimers inserted in 6 membranes amounts to 630 drug-membrane combinations (Fig. 1). Given that each PMF calculation relied on 24 umbrella sampling simulations, the present database contains 15,120 MD trajectories. The diversity of compound and lipid chemistries can offer unprecedented insight into the underlying thermodynamics 26,29 . Below we present an example use of the present database by displaying the tilt angle of each compound across membrane-insertion depth and compound chemistry. We believe that the raw MD trajectories provided for this breadth of chemistries will provide further insight into the structure-property relationships governing drug-membrane permeability.

Methods
We follow previously established simulation protocols that are described in detail elsewhere 26 . In brief, we built symmetric, single-lipid bilayer membranes that contain 64 lipids per leaflet using the Insane script 30 . Table 2 informs on the composition of the various membrane systems that differ in the number of water beads. As it is common practice, we replaced at least 10% of the non-polarizable Martini waters by anti-freeze beads. Simulations were performed in Gromacs 4.6.6 31 using the Martini force field with standard input parameters 32 . We ran simulations in the NPT ensemble at 300 K and 1 bar controlled by means of a stochastic velocity-rescaling thermostat 33 and a Parrinello-Rahman barostat 34 , respectively. We performed umbrella sampling along the bilayer normal (z-axis) in a range from 0.0 to 4.1 nm at a step size of 0.1 nm by generating 24 windows in which the     www.nature.com/scientificdata www.nature.com/scientificdata/ solute is centered via a harmonic biasing potential (k = 240 kcal/mol/nm 2 ). For computational efficiency, each simulation box contained two solute compounds placed in different membrane leaflets. Each window included a sequence of minimization, heat-up, and equilibration runs prior to the production one, the latter being simulated for 1.2 · 10 5 τ using a time step of δt = 0.02 τ, where τ (1 ps) refers to the model's natural unit of time. PMF profiles were then reconstructed by means of the weighted histogram analysis method (WHAM) 35,36 , with error bars estimated from 100 bootstraps.

Data Records
We provide datasets for MD trajectories of solute-membrane systems at a CG resolution for 105 solutes inserted in six different phospholipid bilayers 37 . Each dataset is denoted by the abbreviated name of the lipid and deposited as a single archive file, e.g., DPPC.tar.bz2. Within a dataset, there are 105 folders containing the trajectories and PMF profile of a particular solute, following the naming convention DIM_bead1-bead2, where bead1 and bead2 denote the relevant bead types following the standard Martini notation (see Table 1). For improved sampling we have systematically placed two solutes in each simulation box, always separated by a normal distance (i.e., only along z) of 4.1 nm. The trajectories obtained from umbrella sampling (US) are stored in sub-folders denoted us-x, where x takes values 0.0, 0.1, …, 2.4, corresponding to the reference depth in the bilayer of one of the two solutes. For instance, the folder us-2.4 contains two solutes restrained around z 1 = 2.4 nm and z 2 = −1.7 nm. The US sub-folders contain all necessary input files to repeat the production runs as well as the respective output files, including trajectories and observables. The sub-folder pmf contains the input files to perform WHAM reweighting in Gromacs and the output PMF profiles. Table 3 lists all files included in the sub-folders us-x and pmf together with a brief description of their purpose. In Fig. 2  www.nature.com/scientificdata www.nature.com/scientificdata/ profiles obtained for hydrophobic (Fig. 2a), amphiphilic (Fig. 2b), and polar solute compounds (Fig. 2c) across all considered lipid environments.
Our records indicate a computational investment to run heat-up, equilibration, and production simulations of roughly 0.5, 1, and 8 CPU-hours per umbrella, respectively. Summing up over all 15,120 MD simulations supplied, this amounts to roughly 150,000 CPU-hours to generate the present dataset.

technical Validation
A number of studies using the same simulation protocol have demonstrated the thermodynamic validity and accuracy of CG Martini simulations. Bereau and Kremer found a mean absolute error between experimental and Martini transfer free energies between water and octanol of 0.79 kcal/mol across 653 neutral small organic molecules-an excellent result given the minimalism of the model 38 . By further invoking relations between bulk transfer free energies, used as proxy for various environments of the membrane, we deduced a mean absolute error on features of our CG PMFs of approximately 1.4 kcal/mol 26 . This remarkable agreement had been earlier probed specifically for amino acids 39 . On a structural level, we showed that backmapping CG snapshots and running short atomistic MD simulations offered a significant speedup in convergence of the atomistic PMF calculations, suggesting that the conformational ensemble of the CG model adequately matches its atomistic counterpart 40 . Moreover, as the permeability coefficient depends exponentially on the PMF, reliable estimates of the latter prove necessary 13 . We showed that the accuracy of the PMFs obtained through Martini translated into excellent predictability for the permeability coefficient-roughly 1 log unit 27,29 .

Usage Notes
The 15,120 MD trajectories in this dataset provide a rich amount of information. As an illustration, we focus on the orientation of the solute with respect to the normal of the membrane bilayer. We define a tilt angle θ between the bond vector of the solute and the normal vector of the membrane, oriented as to point from the the bilayer midplane to the membrane surface. Figure 3 displays the average tilt angle as a function of the depth z in a DPPC bilayer across all 105 solute compounds. The depicted angles are normalized to sin θ to account for the Jacobian of the transformation to spherical coordinates.
We find that solutes composed of two beads of identical or similar polarity display no preferred orientation, such that θ ≈ 90°. On the other hand, features appear for compounds that show a difference in polarity between the two beads of the solute. These features are markedly present in the range of depths 0.9 < z < 2.2 nm, which entails the lipid tail region. These amphiphilic solutes, such as C1-P5, show a strong preference for small tilt angles θ < 45°, where the more hydrophobic bead is facing the membrane core. The lack of features below z ≈ 0.9 nm is likely due to the force field's interaction cut-off. In addition, strongly amphiphilic solutes also show orientational order at the membrane/water interface (2.4 < z < 2.7 nm), but with a flipped bond vector (θ ≈ 130°), i.e., the polar site now faces the membrane.

Code availability
No custom code was used to generate this dataset. Simulations were run using Gromacs 4.6.6, and all input files used to generate the trajectories are contained in the present dataset. The output data contains a number of binary files, all generated from Gromacs. Gromacs is available across many platforms and architectures (see Gromacs manual) and its output files can be read either from a compiled version of the freely-available code, or from other analysis modules, such as mdanalysis 41 or mdtraj 42 .
The WHAM-reconstructed PMFs were obtained by running a command that can be found in each DIM_ bead1-bead2/pmf/bsResult.xvg file.