Introduction

Tools based on artificial intelligence (AI) are currently revolutionising many fields, yet their success relies on training data that are available in a programmatically accessible and standardised format1. For example, in structural biology, the Protein Data Bank2 has enabled revolutionary tools that can predict protein structures with unprecedented accuracy3. However, the lack of smart guidelines and community-consensus best practices for data sharing is limiting the development of such databanks and AI applications in many fields1. This is the case in biomolecular modelling, where a vast amount of data is already available for proteins, lipids, nucleic acids, and carbohydrates, but tools that make these data usable in data-driven applications are not yet available4. Here we propose a cost-effective solution to make data scattered in various locations and formats accessible for data-driven and machine learning (ML) applications: an overlay databank. We demonstrate the practical relevance of such an approach for understanding cellular membranes by incorporating available lipid bilayer simulations into the NMRlipids Databank. Importantly, the basic principles of an overlay databank can be applied to simulation data of any other biomolecules (which are becoming available in increasing amounts4), as well as to any other field where the development of AI-based tools is limited by the lack of data access and community best practices, such as the assignment of NMR spectra5. Incentives to share data can be further strengthened by combining overlay databanks with an open collaboration approach (see ref. 6).

Cellular membranes contain hundreds of different types of lipid molecules that regulate the membrane properties, morphology, and biological functions7,8,9. Membrane lipid composition is implicated in diseases, such as cancer and neurodegenerative disorders, and therapeutics that affect membrane compositions are emerging10. However, biomembranes are often difficult to study experimentally, because they are complex mixtures of proteins and lipids in a disordered fluid state with complicated phase behaviour at biological conditions. Molecular dynamics (MD) simulations can be used to model biomembranes in detail, but the computational cost of simulating all possible biological membrane compositions would be formidable. For these reasons, data-driven and machine-learning-based models that predict biomembrane properties will benefit a wide range of fields covering academia and industry, from cell membrane biology to lipid nanoparticle formulations.

Here we present the NMRlipids Databank—a community-driven, open-for-all database featuring programmatic access to atom-resolution MD simulations of lipid bilayers. To demonstrate its advantages over existing approaches, we build a ML model that predicts membrane properties from its lipid composition and show how information on rare phenomena that are beyond the scope of standard MD simulation investigations can be gleaned from the Databank. In addition, we demonstrate the immediate relevance of NMRlipids Databank in extending the scope of MD simulations to new fields: Using a data-driven approach, we are able to analyse how anisotropic diffusion of water depends on membrane properties; this benefits understanding in magnetic resonance imaging (MRI)11 and pharmacokinetics12, where MD simulations have, until now, been rarely applied. Furthermore, the Databank performs automatic quality evaluation of membrane simulations, which facilitates the selection of best-performing models for each given application and accelerates the development of simulation parameters and methodology. Notably, the overlay-databank and open-collaboration approaches would facilitate the collecting of community-contributed data and providing programmatic access to them also in fields other than membrane simulations. For example, substantial biomolecular MD simulation data are already available but in scattered locations and formats4. The technical advances presented here will immediately benefit making these data programmatically accessible. Furthermore, the overlay databank configuration will benefit a broad range of fields where access to training data is a bottleneck for building AI-based tools.

Results

NMRlipids overlay databank delivers access to MD simulations of membranes composed of the biologically most abundant lipids

NMRlipids Databank is a community-driven catalogue containing atomistic MD simulations of biologically relevant lipid membranes emerging from the NMRlipids open collaboration6,13,14,15,16. It has been designed to improve the Findability, Accessibility, Interoperability, and Reuse17 of MD simulation data, most importantly the output trajectories and the information necessary for their reuse. The NMRlipids Databank is constructed using the NMRlipids project protocol, in which all the content is openly accessible throughout the project6. Currently, the NMRlipids Databank contains 765 simulation trajectories with a total length of approximately 0.4 ms. Single-component lipid membranes and binary mixtures are currently most abundant in the NMRlipids Databank, yet mixtures with up to five lipid types are available. For available mixtures, see Fig. 1E. The distribution of lipids among the available simulations, shown in Fig. 1B, roughly resembles the biological relative abundance of different lipid types, with phosphatidylcholine (PC) being the most common followed by cholesterol, phosphatidylethanolamine (PE), phosphatidylserine (PS), phosphatidylglycerol (PG), phosphatidylinositol (PI), and other lipids, depending on organism and organelle7. Abbreviations and full names of all lipids present in the Databank are listed in the NMRlipids Databank documentation18. Force fields used in the simulations cover all the essential parameter sets commonly used in lipid simulations, including united-atom and polarisable force fields (see Fig. 1C and Supplementary Table 1). Therefore, the averages calculated over the Databank can be considered as mean predictions from available lipid models (average over force field parameters) for an average cell membrane (average over lipid compositions).

Fig. 1: Overview of the NMRlipids Databank.

A Schematic presentation of the overlay structure used in the NMRlipids Databank. A more detailed structure of the Databank layer is shown in Supplementary Fig. 1. B Distribution of lipids present in the trajectories of the Databank. ‘Others’ lists lipids occurring in six or fewer simulations. C Distribution of force fields in the simulations in the Databank. References for each force field are given in Supplementary Table 1. D Flowchart for performing an analysis of properties through all MD simulations in the NMRlipids Databank using the API. E Currently available lipid mixtures in the NMRlipids Databank. Colorbar shows the number of available simulations with the darkest green indicating three or more.

The overlay structure of the NMRlipids Databank, illustrated in Fig. 1A, is designed to enable efficient upcycling of MD simulations for data-driven and ML applications with minimal investment in new infrastructure. Raw simulation data in the Data layer can be stored in any publicly available location that offers long-term stability and permanent links to the data, such as digital object identifiers (DOIs); Zenodo (zenodo.org) is one example. The Databank layer (github.com/NMRlipids/Databank) is the core of the Databank containing all the relevant information about the simulations: links to the raw data, relevant metadata describing the systems, universal naming conventions for lipids and their atoms, quality evaluation of simulations against experimental data, and the computer programmes to create the entries and to analyse the five basic properties extracted from all simulations (area per lipid, C–H bond order parameters, X-ray scattering form factors, membrane thickness, and equilibration times of principal components). The values for these five basic properties are also stored in the Databank layer.

The Application layer is composed of repositories and tools that read information from the Databank layer for further analyses. Because the Application layer does not interfere with the Databank layer, it can be freely extended by anyone for a wide range of purposes. This is demonstrated here with two examples: the NMRlipids Databank graphical user interface (NMRlipids Databank-GUI) at databank.nmrlipids.fi and a repository exemplifying novel analyses utilising NMRlipids Databank as discussed below (github.com/NMRlipids/DataBankManuscript). A more detailed description of the NMRlipids Databank structure is available in the Supplementary Information.

NMRlipids Databank-GUI: graphical access to the MD simulation data

NMRlipids Databank-GUI, available at databank.nmrlipids.fi, provides easy access to the NMRlipids Databank content through a graphical user interface (GUI). Simulations can be searched based on their molecular composition, force field, temperature, membrane properties, and quality; the search results are ranked based on the simulation quality as evaluated against experimental data when available. Membranes can be visualised, and properties between different simulations and experiments compared. The NMRlipids Databank-GUI enables rapid surveying of what simulation data is available, selection of the best available simulations for specific systems based on ranking lists, and comparisons of basic properties between different types of membranes. Notably, the GUI enables these operations to be performed by scientists with a wide range of backgrounds—including those who do not necessarily have programming expertise or other means to access MD simulation data.

NMRlipids Databank-API: programmatic access to the MD simulation data

The NMRlipids Databank-API provides programmatic access to all simulation data in the NMRlipids Databank through an application programming interface (API). This enables a wide range of novel data-driven applications, from the construction of ML models that predict membrane properties to the automatic analysis of virtually any property across all simulations in the Databank. The flowchart in Fig. 1D illustrates the practical implementation of such an analysis. After cloning the Databank repository to a local computer, raw data of each simulation can be accessed and analyzed with the help of functions delivered by the NMRlipids Databank-API. Analyses over simulations with different naming conventions can be automatically performed with the help of mapping files that associate the specific naming conventions in each simulation with the universal molecule and atom names used by the Databank. Finally, the analysis results can be stored using the same structure as in the Databank layer. Documentation of the NMRlipids Databank-API and a template for new user-defined analyses with further instructions are available in the NMRlipids Databank documentation19.
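The loop described above (iterate over simulations, translate universal atom names through the mapping files, run an analysis, collect results) can be sketched as follows. This is a minimal illustration and not the actual NMRlipids Databank-API: the record layout, the universal atom names, and the function names `universal_to_local`, `analyse_all`, and `describe_selection` are hypothetical stand-ins for the metadata and mapping files of the Databank layer.

```python
"""Sketch of a Databank-wide analysis loop (hypothetical data layout)."""

# Hypothetical metadata records. In practice these would be read from the
# cloned Databank repository rather than written inline.
SIMULATIONS = [
    {
        "id": 1,
        "forcefield": "FF-A",
        "composition": {"POPC": 128},
        # Mapping: universal atom name -> name used in this simulation.
        "mapping": {"POPC": {"PHOSPHORUS": "P", "NITROGEN": "N"}},
    },
    {
        "id": 2,
        "forcefield": "FF-B",
        "composition": {"POPC": 100, "CHOL": 28},
        "mapping": {
            "POPC": {"PHOSPHORUS": "P8", "NITROGEN": "N4"},
            "CHOL": {"HYDROXYL_O": "O3"},
        },
    },
]


def universal_to_local(sim, lipid, universal_name):
    """Translate a universal atom name into the name used in one simulation."""
    return sim["mapping"][lipid][universal_name]


def analyse_all(simulations, lipid, universal_name, analysis):
    """Run `analysis` on every simulation that contains `lipid`."""
    results = {}
    for sim in simulations:
        if lipid not in sim["composition"]:
            continue  # skip systems without the molecule of interest
        local_name = universal_to_local(sim, lipid, universal_name)
        results[sim["id"]] = analysis(sim, local_name)
    return results


def describe_selection(sim, local_name):
    """Placeholder analysis: report which atom name would be selected."""
    return f"{sim['forcefield']}: select atom name {local_name}"


popc_results = analyse_all(SIMULATIONS, "POPC", "PHOSPHORUS", describe_selection)
chol_results = analyse_all(SIMULATIONS, "CHOL", "HYDROXYL_O", describe_selection)
```

The point of the design is that the analysis function never needs to know the force-field-specific naming: it receives the already translated atom name, so the same code runs unchanged over every entry.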

While analysis codes and results for basic membrane properties are included in the Databank layer, unlimited further analyses can be implemented by anyone in separate repositories in the Application layer. When Application layer repositories are organised by mimicking the Databank layer structure, they can be accessed programmatically and further analyzed using the tools in the NMRlipids Databank-API by implementing the flowchart demonstrated in Supplementary Fig. 2. Novel analyses that demonstrate the power of NMRlipids Databank in selecting the best simulation models, analysing rare phenomena, and extending MD simulations to new fields are implemented in an Application-layer repository located at github.com/NMRlipids/DataBankManuscript. The related codes are listed in Supplementary Table 2.

Selecting simulation parameters using NMRlipids Databank: Best models for most abundant neutral membrane lipids

MD simulations have been particularly useful in understanding membrane systems, although their accuracy has often been compromised by artefacts stemming from, for example, the quality of model parameters20,21. Presently, the accuracy of models is becoming increasingly important as researchers are progressing from simulations of individual molecules to simulating whole organelles or even cells using interdisciplinary approaches21,22,23. Such systems exhibit intricate emergent behaviour making inaccuracies more difficult to detect, and accumulation of even modest errors may have a dramatic impact on the conclusions drawn. To minimise the detrimental consequences of artificial MD simulation results for their applications, the quality of lipid bilayer MD simulations has to be carefully assessed20. This can be done, for example, against the C–H bond order parameters from NMR spectroscopy16,24 and the form factors from X-ray scattering13, although it requires comparisons between a large number of simulations, which is laborious even with collaborative approaches6,14,15,16.

Here we streamlined this process by defining quantitative quality measures for conformational ensembles of individual lipid molecules and membrane dimensions using C–H bond order parameters from NMR and X-ray scattering form factors13. These measures enable automatic ranking of lipid bilayer simulations based on their quality against experiments. Qualities of order parameters were evaluated by first calculating the probability for each C–H bond order parameter to lie within the experimental error, and then averaging the probabilities over different lipid segments (Phg, Psn1, Psn2, and Ptotal). Qualities against X-ray scattering experiments (FFq) were estimated as the difference in the experimental and simulated locations of the first form factor minimum. These measures are good proxies for membrane properties because they correlate with the membrane lateral packing and thickness (Fig. 2G, Supplementary Figs. 3 and 4). Ergodicity of conformational sampling of lipids was estimated by calculating τrel, the convergence time of the slowest principal component divided by the simulation length.
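A minimal sketch of how measures of this kind could be computed is given below. It assumes that each simulated order parameter is normally distributed around its mean with the simulated standard error as the width, and it takes the absolute value of the form factor difference. Both choices, together with the function names and the example numbers, are illustrative assumptions and not the Databank's own implementation. The default experimental error of 0.02 is the value quoted for the NMR order parameters in Fig. 2.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def order_parameter_quality(s_sim, sem_sim, s_exp, exp_error=0.02):
    """Probability that the simulated value lies within experimental error.

    Assumption: normal distribution with the simulated standard error of the
    mean (sem_sim > 0) as its standard deviation.
    """
    lower, upper = s_exp - exp_error, s_exp + exp_error
    return normal_cdf(upper, s_sim, sem_sim) - normal_cdf(lower, s_sim, sem_sim)


def segment_quality(bonds):
    """Average the per-bond probabilities over one lipid segment."""
    probabilities = [order_parameter_quality(*bond) for bond in bonds]
    return sum(probabilities) / len(probabilities)


def form_factor_quality(first_min_sim, first_min_exp):
    """Distance between simulated and experimental first form factor minima.

    Assumption: the absolute value is used, so that zero is the best score.
    """
    return abs(first_min_sim - first_min_exp)


def relative_equilibration_time(tau_slowest_pc, simulation_length):
    """Convergence time of the slowest principal component per simulation length."""
    return tau_slowest_pc / simulation_length


# Illustrative numbers only: (simulated value, simulated SEM, experimental value).
sn1_bonds = [(-0.20, 0.005, -0.20), (-0.15, 0.005, -0.20)]
p_sn1 = segment_quality(sn1_bonds)
ff_q = form_factor_quality(0.28, 0.30)
tau_rel = relative_equilibration_time(50.0, 100.0)
```

In this example the first bond agrees with experiment and scores close to one, the second bond is far outside the experimental error and scores close to zero, so the segment quality is close to one half.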

Fig. 2: Examples of data obtained using NMRlipids Databank.

A Area per lipid of POPC and POPE lipid bilayers predicted by different force fields at 310 K in simulations that are available in the NMRlipids Databank. The data points from the best-performing simulations, based on rankings in (B, C), are surrounded by black circles. B Best POPC simulations ranked based on the sn-1 acyl chain order parameter quality (Psn1). Also sn-2 acyl chain (Psn2), headgroup (Phg) and total (Ptotal) order parameter qualities, form factor quality (FFq), and relative equilibration time for conformations (τrel) are shown. Note that the best possible order parameter quality is one, while the best possible form factor quality is zero. C Best POPE simulations ranked based on the sn-1 acyl chain order parameter quality. Direct comparison against experimental (NMR order parameters and X-ray scattering) data exemplified for a simulation with the best overall order parameter quality (D), the best quality for POPE lipid (E), and the headgroup quality for POPE (F). Error bars for simulations are standard error of the mean over different lipids (n = number of lipids in a simulation shown in B). Error bars for experiments are 0.02 (ref. 13). G Scatter plots and Pearson correlation coefficients, r, for the membrane area per lipid, thickness, first minimum of X-ray scattering form factor and average order parameter of the sn-1 acyl chain extracted from the NMRlipids Databank. All correlation coefficients have a p-value below 0.001 with a two-sided test. For more correlations see Supplementary Fig. 3.

Figure 2 demonstrates how the automatic simulation-quality evaluation and the NMRlipids Databank-API enable rapid selection of the best models for membrane simulations. Figure 2A illustrates that predictions for the lateral packing of membranes composed of the two most biologically abundant neutral membrane lipids, POPC and POPE7, diverge between different force fields. To find the most realistic parameters to simulate membranes with these lipids, we first ranked all simulations based on order parameter quality (Supplementary Fig. 5), then picked force fields that occur in Fig. 2A (that is, force fields available for both POPC and POPE in the Databank), and then ranked them according to the quality of the sn-1 chain of POPC (Fig. 2B) and of POPE (Fig. 2C). Simulations with τrel clearly above one (larger than 1.3) were discarded in this analysis. Because the average sn-1 chain order parameter is a good proxy for the membrane packing (Fig. 2G), rankings in Fig. 2B and C can be used to select the simulations giving the most realistic results in Fig. 2A. Based on this, Lipid17 and Slipids simulations are most realistic for a POPC membrane, while CHARMM36 and GROMOS-CKP simulations predict overly packed bilayers (overestimated order in Supplementary Fig. 6). For POPE, on the other hand, GROMOS-CKP and Slipids are most realistic, while CHARMM36 and Lipid17 predict overly packed membranes. In conclusion, the quality evaluation based on the NMRlipids Databank suggests that the Slipids parameters are the best currently available choice for simulations with PC and PE lipids, at least for applications where membrane packing is relevant. Direct comparisons with the experimental data for the most relevant simulations are also shown in Fig. 2D–F and Supplementary Fig. 6A. Figure 2D shows the overall highest-ranked simulation, a POPC bilayer with OPLS3e parameters, for reference.

Using NMRlipids Databank as a training set for machine learning applications: Predicting multi-component membrane properties

To demonstrate the usage of the NMRlipids Databank to construct ML models that predict membrane properties, we trained a model that predicts areas per lipid and thicknesses of membranes with diverse compositions. First we used a randomly selected 80% of the Databank simulations to choose the hyperparameters for, and optimise, a set of ML models with the goal of predicting the area per lipid from the membrane composition.

After the parameter optimisation, we tested predictions from different ML models for the area per lipid both against the remaining 20% of the Databank and against areas per lipid reported from simulations of membranes containing mixtures of POPC, POPE, POPS, PI, sphingomyelin lipids, and cholesterol25,26,27 that are not included in the Databank. Essential differences between models were not observed when predicting the areas per lipid for the 20% of simulations selected as the test set, but linear regression and Ridge models gave the best correlations with the literature data (predictions from the linear regression model are shown in Fig. 3 and from other models in Supplementary Fig. 7). The linear regression model was selected for further studies due to its simplicity.
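The train-and-test procedure can be illustrated with a small, self-contained sketch of a composition-based linear regression. The data below are synthetic: the lipid list, the per-lipid areas in `TRUE_AREAS`, the noise level, and all function names are made up for illustration and are not Databank values or the models used in this work. The model has one weight per lipid type and no intercept, since mole fractions sum to one.

```python
import random

LIPIDS = ["POPC", "POPE", "CHOL"]
TRUE_AREAS = [0.65, 0.58, 0.30]  # made-up per-lipid areas (nm^2), NOT Databank values


def make_synthetic_set(n, rng):
    """Random mole fractions with an area per lipid that is linear plus noise."""
    data = []
    for _ in range(n):
        raw = [rng.random() for _ in LIPIDS]
        total = sum(raw)
        x = [r / total for r in raw]  # mole fractions summing to one
        apl = sum(xi * ai for xi, ai in zip(x, TRUE_AREAS)) + rng.gauss(0.0, 0.005)
        data.append((x, apl))
    return data


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit_linear(data):
    """Least-squares fit of APL = sum_i x_i * w_i via the normal equations."""
    n = len(data[0][0])
    xtx = [[sum(x[i] * x[j] for x, _ in data) for j in range(n)] for i in range(n)]
    xty = [sum(x[i] * y for x, y in data) for i in range(n)]
    return solve(xtx, xty)


def predict(weights, x):
    """Predicted area per lipid for mole fractions x."""
    return sum(wi * xi for wi, xi in zip(weights, x))


rng = random.Random(0)
dataset = make_synthetic_set(200, rng)
rng.shuffle(dataset)
split = int(0.8 * len(dataset))          # 80% for training, 20% for testing
train, test = dataset[:split], dataset[split:]

weights = fit_linear(train)
rmse = (sum((predict(weights, x) - y) ** 2 for x, y in test) / len(test)) ** 0.5
```

Evaluating `rmse` only on the held-out 20% mirrors the test described above. A validation against independent literature values would use `predict` with the reported compositions in the same way.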

Fig. 3: Predictions of areas per lipid (APL) of multi-component membranes composed of POPC, POPE, POPS, PI, sphingomyelin lipids, and cholesterol from linear regression model against literature data from simulations (green25, blue26, and red27).

Error bars are from the same publications as the values. Black line indicates x = y.

To demonstrate the usefulness of the constructed models for understanding multi-component membrane properties, we predicted lateral packing (areas per lipid) and membrane thicknesses of common biological membranes based on their lipid compositions reported in the literature, see Table 1. The model predicts a substantial 50% difference in area per lipid between the most densely (influenza virus) and most loosely (mitochondria) packed membranes. A difference of 0.8 nm in thickness is predicted between the thinnest (bacterial) and thickest (plasma) membranes. Such differences are expected to affect many biologically relevant functions of membranes, such as permeation, cholesterol flip-flops (see next sections), and interactions with proteins28, demonstrating that the NMRlipids Databank can be used to give valuable insights into biologically relevant properties of complex biological membranes. Most importantly, the delivered programmatic access to an increasing amount of MD simulation data enables anyone to train ML models that predict various membrane properties.

Table 1 Areas per lipid (APL) and membrane thicknesses predicted by the linear regression model trained using the NMRlipids Databank for membrane compositions corresponding to different biological membranes

Detecting rare phenomena using NMRlipids Databank: Cholesterol flip-flops

Lipid flip-flops from one bilayer leaflet to another play an important role in lipid trafficking and regulating membrane properties7. Phospholipid flip-flop events are rare when not facilitated by proteins, occurring spontaneously on the timescale of hours or days, while cholesterol, diacylglycerol, and ceramide flip-flop much more often. Still, the reported timescales range from minutes to sub-milliseconds7,29,30,31. These timescales were previously accessible only by coarse-grained simulations or free energy calculations30, and atomistic simulations reporting cholesterol flip-flop events have been published only recently31,32,33. The atomistic studies report an increase in cholesterol flip-flop rates with increasing acyl chain unsaturation level and decreasing cholesterol concentration31,32, but the amount of data in these individual studies was not sufficient to systematically assess correlations between cholesterol flip-flop rates and membrane properties. Here, we demonstrate that the NMRlipids Databank-API makes analyses of such rare phenomena accessible for all by enabling access to a large amount of MD simulation data as illustrated in Fig. 1. This is particularly useful for scientists in various fields of science and industry who lack access to the computational resources or the expertise to produce the large amounts of MD simulation data required for such analyses.

Using the general workflow depicted in Fig. 1D, we first calculated the flip-flop rates from all the simulations available in the NMRlipids Databank. Flip-flops were observed for cholesterol, DCHOL (18,19-di-nor-cholesterol), DOG (1,2-dioleoyl-sn-glycerol), and SDG (1-stearoyl-2-docosahexaenoyl-sn-glycerol). The observed cholesterol flip-flop rates, ranging from 0.001 to 1.6 μs−1 with a mean of 0.16 μs−1 and a median of 0.07 μs−1, are in line with the previously reported values from atomistic MD simulations31,32,33. The flip-flop rate of DCHOL, 0.2 μs−1, was close to the average value of cholesterol, while the average rates for diacylglycerols DOG (0.4 μs−1) and SDG (0.5 μs−1) were higher than for cholesterol. Flip-flops were not observed for other lipids, giving the upper limits for PC-lipid flip-flop rate as 9 × 10−6 μs−1 and for ceramide (N-palmitoyl-D-erythro-sphingosine) as 0.002 μs−1. Thus, the available data in the NMRlipids Databank suggest that the lipid flip-flop rate decreases in the order: diacylglycerols > cholesterol > other lipids including ceramides. However, the amount of data for diacylglycerols (8 simulations with the Lipid17 force field) and ceramide (3 simulations with CHARMM36) is less than that for cholesterol (83 simulations); thus we cannot fully exclude the effect of force field or composition on this comparison.
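One way to count flip-flop events and turn them into rates is sketched below. The counting rule (a molecule changes leaflet only after crossing a distance threshold on the opposite side of the bilayer centre), the per-molecule normalisation, and the "fewer than one event in the sampled time" upper limit are assumptions made for illustration. They are not necessarily the definitions used in the Databank analysis, and all numbers are invented.

```python
def count_flip_flops(z_rel, threshold):
    """Count leaflet changes in the trajectory of one molecule.

    z_rel: head-group z coordinate relative to the bilayer centre, per frame.
    A leaflet is assigned only when |z_rel| > threshold, so short excursions
    around the bilayer centre are not counted as flip-flops.
    """
    leaflet = 0  # +1 upper, -1 lower, 0 not yet assigned
    flips = 0
    for z in z_rel:
        if z > threshold:
            current = 1
        elif z < -threshold:
            current = -1
        else:
            continue  # in the core region: keep the previous assignment
        if leaflet != 0 and current != leaflet:
            flips += 1
        leaflet = current
    return flips


def flip_flop_rate(n_events, n_molecules, length_us):
    """Events per molecule per microsecond."""
    return n_events / (n_molecules * length_us)


def upper_limit_rate(molecule_time_us):
    """Bound on the rate when no event is seen in the sampled molecule-time."""
    return 1.0 / molecule_time_us


# Invented trajectory (nm): upper leaflet -> lower leaflet -> upper leaflet.
z_example = [1.5, 1.2, 0.3, -0.2, -1.4, -1.6, -0.1, 0.4, -1.3, 0.2, 1.5]
n_flips = count_flip_flops(z_example, threshold=1.0)

# Invented sampling: (number of molecules, simulation length in microseconds).
sampling = [(128, 1.0), (200, 0.5)]
total_molecule_time = sum(n * t for n, t in sampling)
limit = upper_limit_rate(total_molecule_time)
```

The threshold acts as hysteresis: the excursion to 0.4 nm in the middle of the example trajectory does not count, because the molecule never reaches the opposite leaflet before returning.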

Nevertheless, we used the general workflow depicted in Supplementary Fig. 2 to analyse how the flip-flop rates calculated from the NMRlipids Databank depend on membrane properties. Figure 4B–D shows cholesterol flip-flop rates and their histograms as a function of membrane thickness, lateral density, and acyl chain order. The results reveal a non-linear correlation between cholesterol flip-flop rate and membrane packing (depicted as area per lipid): flip-flop rates increase by an order of magnitude when membrane packing density decreases, and a major jump is observed at low membrane packing. Such order-of-magnitude changes in cholesterol flip-flop rate with the membrane composition may have major implications for understanding lipid trafficking and membrane biochemistry31,33. Because the results from the NMRlipids Databank are averaged over a large range of membrane compositions and force fields, they show that the strong dependence of cholesterol flip-flop rate on membrane properties is not limited to the particular lipid compositions or force fields used in the previous studies31,32,33.

Fig. 4: Quantification of cholesterol flip-flop events in NMRlipids Databank simulations.

A Illustration of cholesterol flip-flop. B–D Cholesterol flip-flops analyzed from the Databank as a function of membrane thickness, area per lipid, and acyl chain order. Values from simulations with non-zero flip-flop rates are shown with blue dots. Histogrammed values are shown with black dots. For the mean value in each bin, an average weighted with the simulation lengths was used, and error bars show the standard error of the mean.

Extending the scope of MD simulations to new fields using NMRlipids Databank: Water diffusion anisotropy in membrane systems

The anisotropic diffusion of water and hydrophilic molecules in directions parallel and perpendicular to membranes is an important parameter in models describing the translocation of drugs through biological material, particularly in the skin12,34,35,36. Water anisotropic diffusion plays a role also in the signal formation in diffusion-tensor MRI imaging11. MD simulations are rarely used to analyze the anisotropic diffusion of water, since only a few membrane permeation events of water are typically observed in a single MD simulation trajectory37,38, thereby making the collection of a sufficient amount of data challenging. Here, we show that the API access to the data in NMRlipids Databank enables systematic analysis on how the anisotropic diffusion of water depends on membrane properties in multilamellar membrane systems, thereby extending the application of MD simulations to new fields.

To this end, we first calculated the water permeability through membranes from all simulations in the NMRlipids Databank using the general workflow depicted in Fig. 1D. The resulting non-zero values range between 0.3 and 322 μm/s with a mean of 14 μm/s and a median of 8 μm/s. These values agree with the previously reported simulation results37,38, but are on average larger than experimental values reported for PC lipids in the liquid crystalline phase, 0.19–0.33 μm/s39. Using the workflow depicted in Supplementary Fig. 2, we then plotted the observed permeabilities and their histogrammed values in Fig. 5B–E as a function of temperature, membrane thickness, area per lipid, and acyl chain order. As expected, the permeability increases with the temperature, giving an average energy barrier of 14 ± 3 kBT for the water permeation from the Arrhenius plot in Fig. 5B. On the other hand, the water permeability on average decreases when membranes become more packed, that is, with decreasing area per lipid and increasing thickness and acyl chain order (Fig. 5C–E). Permeation of water through bilayers depends on membrane properties also according to previous studies, but there is no established consensus on whether the area per lipid40 or bilayer thickness41 is the main parameter determining the permeability. Our analysis over the NMRlipids Databank, containing significantly more data than what was available in previous studies, suggests non-linear dependencies on both of these parameters. Clear dependencies of permeability on hydration level or the fraction of charged lipids, cholesterol, or POPE in the membrane were not observed (Supplementary Fig. 8).
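The extraction of an energy barrier from an Arrhenius plot can be sketched as a straight-line fit of ln(P) against 1/T, assuming the standard Arrhenius form P = P0 exp(−Ea/(kB T)), so that the slope equals −Ea/kB. The input below is synthetic and noise-free, generated from a made-up barrier of Ea/kB = 3000 K. The function name and the reference temperature of 300 K used to express the barrier in units of kBT are likewise assumptions for illustration, not values from this work.

```python
import math


def arrhenius_fit(temperatures_k, permeabilities):
    """Fit ln(P) = ln(P0) - (Ea/kB) * (1/T); return (Ea/kB in kelvin, P0)."""
    xs = [1.0 / t for t in temperatures_k]
    ys = [math.log(p) for p in permeabilities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Ordinary least-squares slope and intercept of y versus x.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic, noise-free input built from made-up parameters.
TRUE_EA_OVER_KB = 3000.0   # kelvin, hypothetical
TRUE_P0 = 1.0e5            # same (arbitrary) units as the permeabilities
temps = [290.0, 300.0, 310.0, 320.0, 330.0]
perms = [TRUE_P0 * math.exp(-TRUE_EA_OVER_KB / t) for t in temps]

ea_over_kb, p0 = arrhenius_fit(temps, perms)
barrier_in_kbt = ea_over_kb / 300.0  # barrier in units of kB*T at an assumed 300 K
```

With real data the points scatter around the line, and the uncertainty of the slope would have to be propagated to the barrier. That step is omitted here.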

Fig. 5: Quantification of water diffusion in NMRlipids Databank simulations.

A Water diffusion, D⊥, and permeability, P, through membranes, and lateral diffusion along the membrane, D∥, illustrated in a multilamellar stack of lipid bilayers. B–E Water permeation through membranes analyzed from the Databank as a function of temperature, thickness, area per lipid, and acyl chain order. Inset in (B) shows the Arrhenius plot of permeation (ln(P) vs. 1/T) that gives 14 ± 3 kBT for the average activation energy for water permeation through a lipid bilayer. F Lateral diffusion of water as a function of hydration level. Experimental points for DMPC bilayers at 313 K at different hydration levels are shown76. G, H Diffusion anisotropy of water as a function of thickness and area per lipid. Non-zero permeation and diffusion values from simulations are shown with blue dots. Histogrammed values are shown with black dots. For the mean value in each bin, an average weighted with the simulation lengths was used, and error bars show the standard error of the mean. Only bins with more than one microsecond of data in total were used for water permeation.

To examine how water diffusion anisotropy depends on membrane properties in a multilamellar lipid bilayer system, we analyzed the water diffusion parallel to the membrane surface from all simulations in the NMRlipids Databank using the general workflows depicted in Fig. 1D and Supplementary Fig. 2. The parallel diffusion coefficient of water, D∥, decreases with reduced hydration and increases with the temperature, but dependencies on the membrane area per lipid, thickness, or fraction of charged lipids were not observed in Fig. 5 and Supplementary Fig. 9. Simulation results are close to the experimental values at low hydration levels in Fig. 5F, but at high hydration levels they rise to approximately 50% above the experimental bulk water diffusion coefficient (3.1 × 10−9 m2/s at 313 K, ref. 42). This is not surprising as the most common water model used in membrane simulations, TIP3P, overestimates the bulk water diffusion43. To estimate the diffusion anisotropy of water, D⊥/D∥, in a multilamellar membrane system, the permeability coefficients of water through membranes were translated to perpendicular diffusion coefficients, D⊥, using the Tanner equation44,45. The resulting perpendicular diffusion coefficients are approximately five orders of magnitude smaller than the lateral diffusion coefficients of water (Fig. 5G, H), which is at the upper limit of anisotropy estimated from experimental data12. A significant increase in the diffusion anisotropy with membrane packing is observed, as D⊥/D∥ deviates further from unity with decreasing area per lipid and increasing thickness in Fig. 5G, H. This follows from decreasing water permeability with membrane packing (Fig. 5C, D), while lateral diffusion remains approximately constant (Supplementary Fig. 9A, C).
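The conversion from permeability to a perpendicular diffusion coefficient can be illustrated with a commonly quoted form of the Tanner relation for diffusion across equally spaced permeable barriers, 1/D_perp = 1/D_bulk + 1/(P d), where d is the lamellar repeat distance. Whether exactly this form and these inputs were used here is an assumption. The repeat distance of 6 nm, the use of the lateral diffusion coefficient as the bulk value, and the function name are illustrative choices only.

```python
def perpendicular_diffusion(permeability, repeat_distance, d_bulk):
    """Effective diffusion across a stack of permeable barriers.

    Assumed form of the Tanner relation: 1/D_perp = 1/D_bulk + 1/(P * d).
    All quantities in SI units (m/s, m, m^2/s).
    """
    return 1.0 / (1.0 / d_bulk + 1.0 / (permeability * repeat_distance))


# Illustrative inputs, of the order of magnitude discussed in the text.
P = 1.0e-5            # permeability, m/s (10 um/s)
d = 6.0e-9            # lamellar repeat distance, m (assumed)
d_parallel = 3.0e-9   # lateral diffusion coefficient, m^2/s (also used as bulk value)

d_perp = perpendicular_diffusion(P, d, d_parallel)
anisotropy = d_perp / d_parallel
```

Because P d is several orders of magnitude smaller than the bulk diffusion coefficient, the barrier term dominates and D_perp is close to P d. With these inputs the ratio is of the order of 10−5, consistent with the five orders of magnitude quoted above.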

In summary, our results suggest that the bilayer packing has a substantial effect on anisotropic water diffusion in multi-membrane lipid systems. The several-fold larger anisotropy in membranes with higher lateral density is expected to play a role in pharmacokinetic models not only for water but also for other hydrophilic molecules12. Furthermore, the enhanced understanding of this anisotropy may help in developing new diffusion-tensor-based MRI imaging methods where signals originate from the anisotropic diffusion of water in biological matter11.

Discussion

Sharing of biomolecular MD simulation and other data is becoming increasingly important in the age of big data and AI1,4. Besides the data itself, also programmatic access is a necessary requirement for data-driven and ML applications. This is particularly challenging when field-specific smart guidelines and community-consensus best-practices have not yet been defined, which is the case in biomolecular simulations4. The NMRlipids Databank demonstrates how these issues can be solved by the overlay databank design, where the raw data are distributed to already publicly available decentralised locations, while the core of the databank is composed only of the metadata stored in a version-controlled git repository with an open-access license. On the other hand, the open-collaboration approach developed in the NMRlipids Project6 creates incentives for sharing the data by offering authorship in published articles to the contributors. Advantages of such an approach are demonstrated here for membrane simulations, yet the concept can be applied to any other biomolecules, as well as in any other field where similar barriers hinder the AI revolution, such as the assignment of NMR spectra5.

The NMRlipids Databank-API delivers programmatic access to MD simulation data that can be used as a training set for diverse data-driven and ML applications that predict membrane properties. Such applications could be analogous to AlphaFold3 and other tools46,47 that predict protein structures from their sequence using AI. This is demonstrated here by building ML models to predict multi-component membrane properties. Furthermore, the analyses of cholesterol flip-flop events (Fig. 4) and water permeation through membranes (Fig. 5) demonstrate how a large amount of accessible simulation data, in terms of quantity (e.g., simulation length and number of conformations) and content (e.g., lipid compositions and ion concentrations), enables analyses of rare phenomena that are beyond the current possibilities for a single research group. Such analyses also pave the way for applications of MD simulations in new fields, as demonstrated here by analysing an essential parameter in pharmacokinetic modelling and MRI11,12: the anisotropic diffusion of water in membrane systems (Fig. 5). These possibilities are particularly valuable for scientists who do not typically have access to large-scale MD simulation data.

The focus of biomolecular simulations is moving from studies of individual molecules to larger complexes and even whole cells and organelles21,22,23. Simultaneously, machine-learning-based models for predicting the behaviour of biomolecules and automatic approaches to parametrise models are emerging3,20. The resources delivered by the NMRlipids Databank will support developments in both of these directions. Automatic quality evaluation and ranking of simulations against experimental data enable the selection of the best simulations for specific applications without laborious manual force field evaluation. This also streamlines automatic parametrization procedures for atomistic and coarse-grained simulations by, for example, pinpointing typical failures of force fields and highlighting points of improvement. Such practices for fostering the accuracy of simulations are becoming increasingly important, as small errors accumulate when the complexity and size of simulated systems increase. Examples of the impact of NMRlipids and other overlay databanks in different disciplines are listed in Table 2, yet the scope of applications is expected to widen further with the increasing amount of publicly shared data.

Table 2 Examples of impact of NMRlipids and other overlay databanks in different disciplines

Methods

Structure of the databank

The overlay structure designed for the NMRlipids Databank is composed of three layers (Fig. 1A). The Data layer contains raw data that can be distributed to publicly available servers such as Zenodo (zenodo.org). The core content of the Databank is located in the Databank layer, which is a git repository at github.com/NMRlipids/Databank and is also permanently stored in a Zenodo repository (https://doi.org/10.5281/zenodo.7875567). The essential information of each simulation is stored in a human-and-machine-readable README.yaml file located in a subfolder of the /Data/Simulations folder in the Databank layer repository; each subfolder has a unique name constructed from a hash code of the trajectory and topology files of the simulation. The README.yaml files in these folders give access to all information that is needed for further analysis of simulations, such as links to the raw data and associations with the universal molecule and atom names. The content of these files is described in detail in Supplementary Table 3 and in the NMRlipids Databank documentation nmrlipids.github.io. Results from analyses of basic membrane properties (area per lipid, thickness, C–H bond order parameters, X-ray scattering form factors, and relaxation of principal components) are stored in the same folders as the README.yaml files. Experimental data used for ranking are stored in the /Data/experiments folder, the ranking results in /Data/Ranking, and the relevant scripts in the /Scripts/ folder in the Databank layer repository. The scripts in the NMRlipids Databank are mainly written in Python and many of them use the MDAnalysis module48,49. The Databank structure is illustrated in more detail in Supplementary Figure 1. Whenever specific files or folders are referred to here, they are located in the Databank layer repository unless stated otherwise.

Universal naming convention for molecules and atoms

When analysing simulation trajectories, atoms and molecules often need to be called by the names used in the trajectory. However, these names typically vary between force fields, as a universal naming convention has not been defined for lipids. To enable automatic analyses over all the simulations in the NMRlipids Databank, we have defined universal naming conventions for the molecules and atoms therein. The universal abbreviations used in the NMRlipids Databank for each molecule are listed in the NMRlipids Databank documentation18. The atom names used in simulation trajectories are connected to the universal atom names using the mapping files defined in the NMRlipids Project18. These files are located at /Scripts/BuildDatabank/mapping_files in the NMRlipids Databank repository. These files also define whether an atom belongs to the headgroup, glycerol backbone, or acyl chain region in a lipid. In practice, the force-field-specific molecule names and mapping file names are defined in the README.yaml files for each molecule in each simulation as described in the NMRlipids Databank documentation50.
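
The role of a mapping file can be illustrated with a minimal sketch. The dictionary layout, the key names (`ATOMNAME`, `FRAGMENT`) and all atom names below are hypothetical placeholders chosen for illustration; the actual file format is defined in the NMRlipids Databank documentation.

```python
# Hypothetical mapping content for two force fields: each universal atom name
# points to the force-field-specific atom name and to the lipid region.
# All names and keys here are illustrative assumptions, not Databank content.
MAPPING_FF_A = {
    "M_G1_M": {"ATOMNAME": "C3", "FRAGMENT": "glycerol backbone"},
    "M_G2_M": {"ATOMNAME": "C2", "FRAGMENT": "glycerol backbone"},
    "M_G3N6_M": {"ATOMNAME": "N", "FRAGMENT": "headgroup"},
}
MAPPING_FF_B = {
    "M_G1_M": {"ATOMNAME": "CA3", "FRAGMENT": "glycerol backbone"},
    "M_G2_M": {"ATOMNAME": "CA2", "FRAGMENT": "glycerol backbone"},
    "M_G3N6_M": {"ATOMNAME": "N4", "FRAGMENT": "headgroup"},
}


def atom_name(mapping, universal_name):
    """Translate a universal atom name to the name used in the trajectory."""
    return mapping[universal_name]["ATOMNAME"]


def atoms_in_fragment(mapping, fragment):
    """List the trajectory atom names that belong to one lipid region."""
    return [
        entry["ATOMNAME"]
        for entry in mapping.values()
        if entry["FRAGMENT"] == fragment
    ]
```

With such a lookup, one analysis written in terms of universal names can be applied to trajectories from different force fields by swapping only the mapping.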

Adding data to NMRlipids Databank

The NMRlipids Databank is open for additions of simulation data by anyone. In practice, the required information is first manually entered into an info.yaml file that is then added to the /Scripts/BuildDatabank/info_files folder through a git pull request. The rest of the information to be stored in the README.yaml files is then automatically extracted using the /Scripts/BuildDatabank/AddData.py script. The required manually entered and automatically extracted information is described in detail in Supplementary Table 3 and in the NMRlipids Databank documentation50. The documentation also includes detailed instructions for adding data. To avoid ineligible entries and minimise human errors, the pull requests are monitored before the acceptance and generation of the README.yaml files. Currently, the NMRlipids Databank is composed of simulations that are found in the Zenodo repository with an appropriate license; most, but not all, of these trajectories originate from previous NMRlipids projects6,14,15,16.

Experimental data

Experimental data used in the quality evaluation, currently composed of C–H bond order parameters and X-ray scattering form factors, are stored in /Data/experiments in the NMRlipids Databank repository. Similarly to simulations, each experimental data set has a README.yaml file containing all the relevant information about the experiment. The keys and their descriptions for the experimental data, as well as detailed instructions for adding the data, are given in Supplementary Table 4 and in the NMRlipids Databank documentation51. The NMR data currently in the NMRlipids Databank are taken from refs. 15,16,52,53,54,55,56 and the X-ray scattering data from refs. 56,57,58,59,60,61,62. In addition, previously unpublished NMR data for POPE, POPG, and DOPC were acquired as described in Supplementary Figs. 10–15 and contributed to the Databank.

Analysing simulations

In practice, simulations in the NMRlipids Databank can be analyzed by executing a program that (1) loops over the README.yaml files in the Databank layer, (2) downloads the data using the information in the README.yaml files, and then (3) performs the desired analysis on a local computer utilising the universal naming conventions for molecules and atoms defined in the README.yaml and mapping files. This general procedure is illustrated in Fig. 1D, and templates for user-defined analyses are available via the NMRlipids Databank documentation19. Further practical examples of codes performing such analyses are listed in Supplementary Tables 2 and 5. The equilibration period given by the user (TIMELEFTOUT in Supplementary Table 3) is discarded from the trajectories in all analysis codes. For further details, see the NMRlipids Databank documentation nmrlipids.github.io.
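
The three steps above can be sketched as a short generic loop. This is not the Databank-API itself: the function names and the injected `parse`, `fetch` and `analyse` callables are assumptions of this sketch. The parser is passed in so that the example does not depend on a particular YAML library; with PyYAML installed, one would typically pass `yaml.safe_load`.

```python
from pathlib import Path


def iter_simulations(databank_root, parse):
    """Step (1): yield (entry id, metadata) for every README.yaml found
    below Data/Simulations in a local clone of the Databank layer."""
    sim_root = Path(databank_root) / "Data" / "Simulations"
    for readme in sorted(sim_root.rglob("README.yaml")):
        entry_id = readme.parent.relative_to(sim_root).as_posix()
        yield entry_id, parse(readme.read_text())


def analyse_databank(databank_root, parse, fetch, analyse):
    """Run a user-defined analysis over all entries.

    fetch(meta) stands for step (2), obtaining the raw data referenced by
    the metadata; analyse(raw, meta) stands for step (3). Both are
    user-supplied placeholders in this sketch.
    """
    results = {}
    for entry_id, meta in iter_simulations(databank_root, parse):
        raw = fetch(meta)
        results[entry_id] = analyse(raw, meta)
    return results
```

Keeping the download and analysis steps as separate callables mirrors the overlay design: the metadata loop stays the same regardless of where the raw data are hosted.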

Principal components analysis of equilibration of simulations

To estimate how well conformational ensembles of lipids are converged in trajectories, Principal Component Analysis (PCA) following the PCALipids protocol was used63,64. To this end, each lipid configuration was first aligned to the average structure of that lipid type, and PCA was then applied to the Cartesian coordinates of all heavy atoms of the lipid. Because the motions along the first, major, principal component are the slowest ones63, the equilibration of each lipid type was estimated from the ratio between the distribution convergence time of the trajectories projected on the first PC and the trajectory length, \({\tau }_{{\rm{rel}}}={\tau }_{{\rm{convergence}}}/{\tau }_{{\rm{sim}}}\)63,64. If τrel < 1, simulations can be considered to be sufficiently long for the lipid molecules to have sampled their conformational ensembles, while in simulations with τrel > 1 individual molecules may not have fully sampled their conformational ensembles. Rigid molecules that do not exhibit significant conformational fluctuations, such as sterols, were excluded from the analysis. In practice, the distribution convergence times were calculated utilising their linear dependence on autocorrelation decay times, τconvergence = kτautocorrelation, because calculation of autocorrelation decay times is faster and computationally more stable than direct calculation of distribution convergence times63,64. The empirical coefficient k = 49 was calculated based on the analysis of 8 trajectories with a length of more than 200 ns, including simulations of POPC, POPS, POPE, POPG, and DPPC with the CHARMM36 force field. Because the coefficient k does not depend on the force field63, the value determined from these CHARMM36 simulations can be used for all simulations in the Databank. The script that calculates the equilibration of lipids is available at Scripts/BuildDatabank/NMRPCA_timerelax.py in the NMRlipids Databank repository. The resulting values are stored in files named eq_times.json in folders under /Data/Simulations in the NMRlipids Databank repository.
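
As a small worked example of the relation τrel = kτautocorrelation/τsim, the sketch below evaluates the relative equilibration time for a given autocorrelation decay time. The function names are illustrative and the input numbers in the usage note are invented; only k = 49 and the threshold τrel < 1 come from the text.

```python
K_CONVERGENCE = 49  # empirical coefficient k reported in the text


def relative_equilibration_time(tau_autocorrelation, tau_sim, k=K_CONVERGENCE):
    """tau_rel = k * tau_autocorrelation / tau_sim (same time units for both)."""
    tau_convergence = k * tau_autocorrelation
    return tau_convergence / tau_sim


def is_sufficiently_sampled(tau_rel):
    """tau_rel < 1 indicates that lipids have sampled their conformations."""
    return tau_rel < 1
```

For instance, a hypothetical autocorrelation decay time of 2 ns in a 200 ns trajectory gives τrel = 0.49, whereas 5 ns would give τrel = 1.225 and flag the lipid type as possibly undersampled.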

Calculation of C–H bond order parameters

The C–H bond order parameters were calculated directly from the carbon and hydrogen positions using the definition

$${S}_{{\rm{CH}}}=\frac{1}{2}\left\langle 3{\cos }^{2}\theta -1\right\rangle,$$
(1)

where angular brackets denote the ensemble average, i.e., the average over all sampled configurations of all lipids in a simulation, and θ is the angle between the C–H bond and the membrane normal. As in previous NMRlipids publications, the order parameters were first calculated separately for each lipid and the standard error of the mean over different lipids was used as the error estimate6. However, order parameters for simulations with τrel > 1 may be influenced by the starting structure, and their error bars may therefore be underestimated. The script that calculates C–H bond order parameters from all simulations in the NMRlipids Databank is available at /Scripts/AnalyzeDatabank/calcOrderParameters.py in the NMRlipids Databank repository. The resulting order parameters are stored for all simulations in files named [lipid_name]OrderParameters.json in folders under /Data/Simulations in the NMRlipids Databank repository.
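
A minimal sketch of Eq. (1) and of the error estimate is given below. It is not the calcOrderParameters.py script: it assumes that the C–H bond vectors have already been extracted for each lipid and that the membrane normal is the z-axis, and the function names are illustrative.

```python
import math
from statistics import mean, stdev


def order_parameter(ch_vectors, normal=(0.0, 0.0, 1.0)):
    """Average of (3 cos^2(theta) - 1) / 2 over a set of C-H bond vectors,
    where theta is the angle between each vector and the membrane normal."""
    norm_n = math.sqrt(sum(b * b for b in normal))
    values = []
    for vec in ch_vectors:
        dot = sum(a * b for a, b in zip(vec, normal))
        norm_v = math.sqrt(sum(a * a for a in vec))
        cos_theta = dot / (norm_v * norm_n)
        values.append(0.5 * (3.0 * cos_theta ** 2 - 1.0))
    return mean(values)


def order_parameter_with_error(per_lipid_vectors):
    """Compute S_CH separately for each lipid, then return the mean over
    lipids and the standard error of the mean as the error estimate."""
    per_lipid = [order_parameter(vectors) for vectors in per_lipid_vectors]
    sem = stdev(per_lipid) / math.sqrt(len(per_lipid))
    return mean(per_lipid), sem
```

A bond parallel to the normal gives 1 and a bond perpendicular to it gives −0.5, which are the limiting values of Eq. (1).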

Calculation of X-ray scattering form factors

X-ray scattering form factors were calculated using the standard equation for lipid bilayers that does not assume symmetric membranes13,

$$F(q)=\left| \int\nolimits_{-D/2}^{D/2}\Delta {\rho }_{e}(z)\exp (iz{q}_{z})\,{\rm{d}}z\right|,$$
(2)

where Δρe(z) is the difference between the total and solvent electron densities, and D is the simulation box size in the z-direction (normal to the membrane). For the calculation of density profiles, atom coordinates were first centred around the centre of mass of lipid molecules for every time frame, and a histogram of these centred positions, weighted with the number of electrons in each atom, was then calculated with a bin width of 1/3 Å. Electron density profiles were then calculated as an average of these histograms over the time frames in simulations. The script to calculate form factors for all simulations in the NMRlipids Databank is available at Scripts/AnalyzeDatabank/calc_FormFactors.py. The resulting form factors are stored for all simulations in files named FormFactor.json in folders under /Data/Simulations in the NMRlipids Databank repository.
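
Eq. (2) can be evaluated numerically from a binned density profile. The sketch below uses a simple Riemann sum over uniformly spaced bin centres; this discretisation and the function name are assumptions of the example and do not reproduce calc_FormFactors.py.

```python
import cmath


def form_factor(z, delta_rho, q):
    """Numerical version of Eq. (2) on a uniform grid of bin centres z.

    delta_rho[j] is the electron density minus the solvent density in bin j;
    the complex exponential is kept so that asymmetric profiles are handled.
    """
    dz = z[1] - z[0]
    total = sum(rho * cmath.exp(1j * q * zj) for zj, rho in zip(z, delta_rho))
    return abs(total * dz)
```

For a uniform slab of half-width a and unit contrast, the result should approach the analytical value |2 sin(qa)/q|, which provides a quick sanity check of the discretisation.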

Calculation of area per lipid and bilayer thickness

The area per lipid of each bilayer was calculated by dividing the time-averaged area of the simulation box by the total number of lipids and surfactant molecules in the simulation. The script that calculates the area per lipid from all simulations in the NMRlipids Databank is available at Scripts/AnalyzeDatabank/calcAPL.py in the NMRlipids Databank repository. The resulting areas per lipid are stored for all simulations in files named apl.json in folders under /Data/Simulations.

Thicknesses of lipid bilayers were calculated from the intersection points of lipid and water electron densities. The script that calculates the thicknesses of all simulations in the NMRlipids Databank is available at Scripts/AnalyzeDatabank/calc_thickness.py in the NMRlipids Databank repository. The resulting thicknesses are stored in files named thickness.json in folders under /Data/Simulations in the NMRlipids Databank repository.

Quality evaluation of C–H bond order parameters

As the first step in evaluating simulation quality against experimental data, a simulation is connected to an experimental data set if the molar concentrations of all molecules are within ± 3 percentage units, charged lipids have the same counterions, and temperatures are within ± 2 K. For molar concentrations of water, the exact hydration level is considered only for systems with a molar water-to-lipid ratio below 25; otherwise, the systems are considered fully hydrated. In practice, the connection is implemented by adding the experimental data path to the simulation README.yaml file using the /Scripts/BuildDatabank/searchDATABANK.py script in the NMRlipids Databank repository.
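
The pairing criteria can be written as a short predicate. This is a sketch, not the searchDATABANK.py logic: the dictionary layout and key names are hypothetical, and the tolerance applied to the hydration level below the full-hydration threshold (`hydration_tol`) is an assumption of this example, since the text does not state one.

```python
def can_be_paired(sim, exp, conc_tol=3.0, temp_tol=2.0,
                  full_hydration=25.0, hydration_tol=1.0):
    """Return True if a simulation and an experiment describe the same system.

    sim and exp are dictionaries with the hypothetical keys 'temperature' (K),
    'counterions' (set of names), 'molar_percent' (lipid name -> mol %) and
    'water_per_lipid' (molar water-to-lipid ratio).
    """
    if abs(sim["temperature"] - exp["temperature"]) > temp_tol:
        return False
    if sim["counterions"] != exp["counterions"]:
        return False
    if set(sim["molar_percent"]) != set(exp["molar_percent"]):
        return False
    for name, percent in sim["molar_percent"].items():
        if abs(percent - exp["molar_percent"][name]) > conc_tol:
            return False
    sim_w, exp_w = sim["water_per_lipid"], exp["water_per_lipid"]
    if sim_w >= full_hydration and exp_w >= full_hydration:
        return True  # both systems are treated as fully hydrated
    # Assumption of this sketch: compare hydration levels directly otherwise.
    return abs(sim_w - exp_w) <= hydration_tol
```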

The quality of each C–H bond order parameter is estimated by calculating the probability that a simulated value lies within the error bars of the experimental value. Because conformational ensembles of individual lipids are assumed to be independent in a fluid lipid bilayer, \(\frac{{S}_{{\rm{CH}}}-\mu }{s/\sqrt{n}}\) has a Student’s t-distribution with n − 1 degrees of freedom, with μ representing the real mean of the order parameter. The probability that an order parameter from a simulation lies within the experimental error bars can be estimated from the equation

$$P=f\left(\frac{{S}_{{\rm{CH}}}-\left({S}_{\exp }+\Delta {S}_{\exp }\right)}{s/\sqrt{n}}\right)-f\left(\frac{{S}_{{\rm{CH}}}-\left({S}_{\exp }-\Delta {S}_{\exp }\right)}{s/\sqrt{n}}\right),$$
(3)

where f(t) is the Student’s t-distribution, n is the number of independent sample points for each C–H bond (which equals the number of lipids in a simulation), SCH is the sample mean from Eq. (1), s is the standard deviation of SCH calculated over individual lipids, \({S}_{\exp }\) is the experimental value, and \(\Delta {S}_{\exp }\) is its error. An error of \(\Delta {S}_{\exp }=0.02\) is currently assumed for all experimental order parameters13, yet more accurate values may become available in the future65. Because a lipid bilayer simulation contains at least dozens of lipids, the Student’s t-distribution could be safely approximated with a normal distribution. However, with the quality of currently available force fields, the simulation values can be so far from experiments that a normal distribution leads to probability values below the numerical accuracy of computers. To avoid such numerical instabilities, we opted to use the first-order Student’s t-distribution, which has slightly higher probabilities for values far away from the mean. On the other hand, some force fields exhibit too slow dynamics, which leads to large error bars in the SCH values66. Such artificially slow dynamics widens the Student’s t-distribution in Eq. (3), thereby increasing the probability of finding the simulated value within the experimental error bars. Therefore, SCH values with simulation error bars above the experimental error of 0.02 are not included in the quality evaluation.
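
Eq. (3) can be sketched with the standard library alone if the first-order Student’s t-distribution is read as the distribution with one degree of freedom, whose cumulative distribution function has the closed form 1/2 + arctan(t)/π. That reading, the function names, and the interpretation of f in Eq. (3) as the upper-tail probability (which makes the expression non-negative as written) are assumptions of this sketch rather than a description of QualityEvaluation.py.

```python
import math


def t1_cdf(t):
    """Cumulative distribution function of Student's t with one degree of freedom."""
    return 0.5 + math.atan(t) / math.pi


def order_parameter_quality(s_sim, sem_sim, s_exp, delta_exp=0.02):
    """Probability mass of the t-distribution centred on the simulated value
    that falls within the experimental interval [s_exp - delta, s_exp + delta].

    sem_sim corresponds to s / sqrt(n) in Eq. (3). Values with simulation
    error bars above the experimental error are excluded (None is returned).
    """
    if sem_sim <= 0:
        raise ValueError("sem_sim must be positive")
    if sem_sim > delta_exp:
        return None
    upper = (s_exp + delta_exp - s_sim) / sem_sim
    lower = (s_exp - delta_exp - s_sim) / sem_sim
    return t1_cdf(upper) - t1_cdf(lower)
```

The heavy tails of this distribution keep the result representable even when the simulated value is many standard errors away from the experiment, which is the numerical motivation given in the text.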

To streamline the comparison between simulations, we define the average qualities for different fragments (frag = ‘sn-1’, ‘sn-2’, ‘headgroup’, or ‘total’, with the last referring to all order parameters within a molecule) within each lipid type in a simulation as

$${P}^{{\rm{frag}}}[{\rm{lipid}}]={\langle P[{\rm{lipid}}]\rangle }_{{\rm{frag}}}{F}_{{\rm{frag}}}[{\rm{lipid}}],$$
(4)

where 〈P[lipid]〉frag is the average of the individual SCH qualities within the fragment, and Ffrag[lipid] is the percentage of order parameters for which the quality is available within the fragment. The overall qualities of different fragments in a simulation (frag = ‘tails’, ‘headgroup’, or ‘total’) are then defined as molar-fraction-weighted averages over different lipid components

$${P}^{{\rm{frag}}}=\mathop{\sum}\limits_{{\rm{lipid}}}{\chi }_{{\rm{lipid}}}{P}^{{\rm{frag}}}[{\rm{lipid}}],$$
(5)

where χlipid is the molar fraction of a lipid in the bilayer and ‘tails’ refers to the average of all acyl chains.
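
Eqs. (4) and (5) amount to two short aggregation steps, sketched below. The function names are illustrative, Ffrag is expressed here as a fraction between 0 and 1, and returning 0.0 for a fragment without any available quality is a choice made for this example only.

```python
def fragment_quality(qualities):
    """Eq. (4): average of the available qualities within a fragment times
    the fraction of order parameters for which a quality is available.

    qualities holds one entry per order parameter; None marks a missing one.
    """
    available = [p for p in qualities if p is not None]
    if not available:
        return 0.0  # choice of this sketch for fragments without data
    average = sum(available) / len(available)
    coverage = len(available) / len(qualities)
    return average * coverage


def overall_quality(per_lipid_quality, molar_fractions):
    """Eq. (5): molar-fraction-weighted sum of the per-lipid qualities."""
    return sum(
        molar_fractions[lipid] * quality
        for lipid, quality in per_lipid_quality.items()
    )
```

Multiplying by the coverage penalises fragments where only a few order parameters could be compared with experiments, so that a high score cannot result from a single well-reproduced value.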

The quality evaluation of order parameters is implemented in /Scripts/BuildDatabank/QualityEvaluation.py in the NMRlipids Databank repository. The resulting qualities for each SCH are stored in files named [lipid_name]_OrderParameters_quality.json, for individual lipids in files named [lipid_name]_FragmentQuality.json, and for overall quality for fragments in files named system_quality.json in folders under /Data/Simulations in the NMRlipids Databank repository.

Quality evaluation of X-ray scattering form factors

Because experiments give form factors only on a relative intensity scale, they should be scaled before comparing with the simulation data. Here we use the scaling coefficient for experimental intensities defined in the SIMtoEXP program67

$${k}_{e}=\frac{\mathop{\sum }\nolimits_{i=1}^{{N}_{q}}\frac{| {F}_{s}({q}_{i})| | {F}_{e}({q}_{i})| }{{(\Delta {F}_{e}({q}_{i}))}^{2}}}{\mathop{\sum }\nolimits_{i=1}^{{N}_{q}}\frac{| {F}_{e}({q}_{i}){| }^{2}}{{(\Delta {F}_{e}({q}_{i}))}^{2}}},$$
(6)

where Fs(q) and Fe(q) are form factors from a simulation and experiment, respectively, ΔFe(q) is the error of the experimental form factor, and summation goes over the experimentally available Nq points.
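
Eq. (6) translates directly into code. The sketch assumes that the simulated form factor has already been evaluated at the experimentally available q points, so that the three input sequences have equal length; the function name is illustrative.

```python
def scaling_coefficient(f_sim, f_exp, df_exp):
    """Eq. (6): scaling coefficient k_e for experimental intensities.

    f_sim, f_exp and df_exp hold the simulated form factor, the experimental
    form factor and the experimental error at the same q points.
    """
    numerator = sum(
        abs(fs) * abs(fe) / d ** 2 for fs, fe, d in zip(f_sim, f_exp, df_exp)
    )
    denominator = sum(abs(fe) ** 2 / d ** 2 for fe, d in zip(f_exp, df_exp))
    return numerator / denominator
```

If the simulated form factor equals the experimental one multiplied by a constant, the function returns that constant irrespective of the error estimates.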

Also, a quality measure based on differences in the simulated and experimental form factors across the available q-range is defined in the SIMtoEXP program67. However, the lobe heights in the simulated form factors depend on the simulation box size, as shown in Supplementary Fig. 4; consequently, the quality measure defined in SIMtoEXP would also depend on the simulation box size. In contrast, locations of the form factor minima (or, in precise terms, the minima of the absolute value of the form factor) are independent of the simulation box size (Supplementary Fig. 4). Here we use only the location of the first form factor minimum for quality evaluation, because (due to fluctuations) the location of the second minimum is difficult to detect automatically in some experimental data sets, such as the POPE data in Fig. 2E, F. The first minimum correlates well with the thickness of a membrane (Fig. 2G), although the correlation of the second minimum would be even stronger (Supplementary Fig. 3). In practice, we first filter the fluctuations from the form factor data using the Savitzky–Golay filter (window length 30 and polynomial order 1) and locate the first minimum at q > 0.1 Å⁻¹ from both simulation (\(F\,{F}_{\min }^{{\rm{sim}}}\)) and experiment (\(F\,{F}_{\min }^{\exp }\)). The quality of a form factor is then defined as the Euclidean distance between the minima locations: \(F\,{F}_{q}=| F\,{F}_{\min }^{{\rm{sim}}}-F\,{F}_{\min }^{\exp }| \times 100\).
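
The last two steps, locating the first minimum and computing the distance, can be sketched as follows. The Savitzky–Golay smoothing is deliberately left out, so the inputs are assumed to be already smoothed absolute form factors on ascending q grids; the simple neighbour comparison used to detect the minimum is a choice of this example, not necessarily the one used in QualityEvaluation.py.

```python
def first_minimum(q, f_abs, q_min=0.1):
    """Return the q value of the first local minimum of |F(q)| at q > q_min,
    or None if no interior local minimum is found."""
    for i in range(1, len(q) - 1):
        if q[i] > q_min and f_abs[i] < f_abs[i - 1] and f_abs[i] <= f_abs[i + 1]:
            return q[i]
    return None


def form_factor_quality(q_sim, f_sim, q_exp, f_exp, q_min=0.1):
    """Distance between the first minima of simulation and experiment,
    multiplied by 100 as in the definition in the text."""
    min_sim = first_minimum(q_sim, f_sim, q_min)
    min_exp = first_minimum(q_exp, f_exp, q_min)
    if min_sim is None or min_exp is None:
        return None
    return abs(min_sim - min_exp) * 100
```

Smaller values indicate better agreement; a difference of 0.02 Å⁻¹ between the minima, for example, corresponds to a quality value of 2.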

The quality evaluation of form factors is implemented in /Scripts/BuildDatabank/QualityEvaluation.py in the NMRlipids Databank repository. The resulting form factor qualities are stored in files named FormFactorQuality.json in folders under /Data/Simulations in the NMRlipids Databank repository.

Training machine learning model to predict properties of multi-component membranes

Different ML approaches were tested to find the best model to predict the area per lipid of multi-component membranes from their lipid composition. To this end, we tested linear, Lasso, Ridge, elastic net, decision tree, random forest, multi-layer perceptron, k-nearest neighbour, gradient boosting, AdaBoost, and XGBoost regressors. The models were trained to predict the area per lipid for a membrane with the given molar fractions of membrane lipids. The most abundant lipids in the Databank, POPC, POPE, POPG, POPS, and cholesterol, were considered explicitly, while SM, cardiolipins, and PI lipids with more sparse data were all grouped together based on their headgroups independently of acyl chain content. Salt concentration, temperature and other conditions were not considered.

To find the optimal model for this task, we first randomly divided simulations into training (80% of the simulations) and test (20% of the simulations) sets. Then, we used the training set with 5-fold cross-validation to conduct a grid search for hyperparameters for the set of ML regression models. The best-performing hyperparameters were then used to fit each model to the training set. The performance of the models was then tested on the remaining 20% of the data extracted from the Databank and, additionally, by predicting the area per lipid from multi-component membrane simulations reported in the literature but not included in the Databank25,26,27. Due to its simplicity and good performance in these tests, we selected the linear regression model for further studies. The linear regression model predicting membrane thicknesses was trained similarly. The ML models and analysis were implemented in Python using scikit-learn, which was supplemented with the xgboost library to include an extreme gradient boosted decision tree regressor in the set of tested models. The Jupyter notebook that trains the models and predicts the area per lipid is available at scripts/APLpredictor.ipynb in the repository at https://github.com/NMRLipids/DataBankManuscript/. In further applications, it is important to consider potential limitations arising from the currently limited amount of data for certain types of systems. For example, the current models have been trained and tested only for mixtures with palmitoyl and oleoyl acyl chains because the Databank does not yet contain enough data with varying numbers of double bonds. Such limitations are expected to be alleviated with an increasing amount of data in the future.

Calculation of lipid flip-flops

Flip-flop rates were calculated using the AssignLeaflets and FlipFlop tools from the LiPyphilic package68. Headgroup atoms of each molecule, as defined in the mapping file, were used to determine in which leaflet the molecule is located. The midplane cutoff, defining the region between leaflets, was 1 nm and the frame cutoff was 100. This means that if the headgroup of a molecule came within 1 nm of the bilayer midplane and was found in the opposing leaflet after 100 frames, the event was counted as a successful flip-flop. The code that finds the flip-flop events from all simulations in the NMRlipids Databank is available at scripts/FlipFlop.py, and the results at Data/Flipflops/ in the repository at https://github.com/NMRLipids/DataBankManuscript/.
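
The role of the frame cutoff can be illustrated with a toy counter that operates on the leaflet assignment of a single molecule. This is a simplified stand-in written for illustration, not the LiPyphilic implementation: it takes a per-frame series in which 1 and −1 denote the two leaflets and 0 the midplane region, and it counts an event once the molecule has stayed in the opposing leaflet for `frame_cutoff` consecutive frames, which is one possible reading of the criterion.

```python
def count_flip_flops(leaflets, frame_cutoff=100):
    """Count successful flip-flops of one molecule.

    leaflets is a per-frame sequence with 1 (upper leaflet), -1 (lower
    leaflet) or 0 (midplane region).
    """
    events = 0
    home = None                      # leaflet the molecule currently belongs to
    run_value, run_length = None, 0  # current run of identical assignments
    for state in leaflets:
        if state == run_value:
            run_length += 1
        else:
            run_value, run_length = state, 1
        if state == 0:
            continue
        if home is None:
            home = state
        elif state == -home and run_length == frame_cutoff:
            events += 1
            home = state
    return events
```

Short excursions to the opposing leaflet that end before the cutoff is reached are not counted, which filters out molecules that merely fluctuate around the midplane.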

Analysing anisotropic diffusion of water in a membrane environment from NMRlipids Databank

Water permeability through membranes was calculated from the equation P = r/(2cw), where r is the rate of permeation events per time and area, and cw = 33.3679 nm⁻³ is the concentration of water in bulk37. The number of permeation events in each trajectory was calculated using the code by ref. 38, available at https://github.com/crobertocamilo/MD-permeation. The code that calculates permeabilities for all simulations in the NMRlipids Databank is available at /scripts/calcMD-PERMEATION.py, and the resulting permeabilities are stored at /Data/MD-PERMEATION in the repository containing all analyses specific to this publication at https://github.com/NMRLipids/DataBankManuscript/. This repository is organized similarly to the NMRlipids Databank repository, which also enables the upcycling of the analyzed data without overloading the main NMRlipids Databank repository.
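
As a worked example of P = r/(2cw), the sketch below converts a number of permeation events into a permeability. The function name and the choice of units (ns and nm², which give P in nm/ns, numerically equal to m/s) are assumptions of this example, and the input numbers in the usage note are invented.

```python
C_WATER = 33.3679  # bulk water concentration in nm^-3, as given in the text


def permeability(n_events, sim_time_ns, area_nm2, c_w=C_WATER):
    """P = r / (2 c_w), with r the number of permeation events per unit time
    and membrane area. The result is in nm/ns, which equals m/s."""
    rate = n_events / (sim_time_ns * area_nm2)
    return rate / (2.0 * c_w)
```

For instance, 100 hypothetical permeation events in a 1000 ns trajectory with a membrane area of 40 nm² would correspond to a permeability of roughly 3.7 × 10⁻⁵ m/s.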

The lateral diffusion of water along the membrane surface, D∥, was calculated with Einstein’s equation using the -lateral option in the gmx msd program within the Gromacs software package69. The code that calculates D∥ for water from all simulations in the NMRlipids Databank is available at /scripts/calcWATERdiffusion.py, and the resulting diffusion coefficients are stored at /Data/WATERdiffusion in the repository at https://github.com/NMRLipids/DataBankManuscript/.

Water diffusion along the perpendicular direction of lipid bilayers in a multilamellar stack was estimated from the Tanner equation \({D}_{\perp }=\frac{{D}_{\parallel }P{z}_{w}}{{D}_{\parallel }+P{z}_{w}}\)44,45, where the water layer thickness, zw, was estimated by subtracting the bilayer thickness from the size of the simulation box in the membrane normal direction.
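
The Tanner equation and the estimate of the water layer thickness can be sketched as follows. The function names are illustrative, and consistent units are assumed (for example nm²/ns for diffusion coefficients, nm/ns for the permeability and nm for lengths).

```python
def water_layer_thickness(box_z, bilayer_thickness):
    """z_w: box size along the membrane normal minus the bilayer thickness."""
    return box_z - bilayer_thickness


def perpendicular_diffusion(d_parallel, permeability, z_w):
    """Tanner equation: D_perp = D_par * P * z_w / (D_par + P * z_w)."""
    pz = permeability * z_w
    return d_parallel * pz / (d_parallel + pz)
```

When Pzw is much smaller than D∥, as expected for lipid membranes, the expression reduces to approximately Pzw, so the perpendicular diffusion is controlled by the membrane permeability rather than by the lateral mobility of water.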

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.