Background & Summary

Poor aqueous solubility and low permeability are recognized as major contributors to limited oral drug bioavailability. Both are integral considerations in frameworks such as Lipinski’s Rule of Five, the Biopharmaceutics Classification System (BCS), and the expanded Developability Classification System (DCS), which help to differentiate promising drugs for oral administration1,2,3. Over time, a growing number of small molecule drug candidates have been reported to exhibit properties that may hinder oral absorption. In fact, in the 20 years since the Rule of Five was first proposed, new chemical entities approved by the FDA have been shown to increase in molecular weight and calculated octanol-water partition coefficient (clogP)4,5. In general, the successful clinical approval of less traditionally drug-like molecules underscores the critical role of pharmaceutical formulations.

Advanced lipid-based formulation strategies have enhanced the oral absorption of drugs with poor water solubility and/or low intestinal permeability (i.e., BCS II and IV drugs). One example is self-emulsifying drug delivery systems (SEDDS), combinations of oils, surfactants, and/or cosolvents that spontaneously emulsify in the aqueous environment of the gastrointestinal tract6. The ability of SEDDS formulations to improve oral bioavailability has been attributed to several mechanisms, notably increased apparent solubility of highly lipophilic drugs and reduced metabolism or efflux7. As a result, several clinically approved drugs rely on delivery in SEDDS formulations, including cyclosporine A (e.g., Sandimmune, Neoral), tipranavir (e.g., Aptivus), and fenofibrate (e.g., Lipofen), among others8,9,10.

Despite the relative simplicity of SEDDS in principle, designing such formulations remains non-trivial. The traditional approach to SEDDS development is an empirical process that relies on iterative trial-and-error to screen, optimize, and evaluate formulations. One of the most pertinent questions is the selection of appropriate excipients and mixtures thereof. Typically, this begins with quantification of the drug solubility in excipients, followed by screening of excipient mixtures based on their emulsification properties through visual assessment11. Given the range of possible excipients for SEDDS (i.e., oils, surfactants, and cosolvents, all of which may differ in hydrophilicity/lipophilicity, purity, etc.), selection is often narrowed based on generally recognized as safe (GRAS) status. An established tool to facilitate formulation development is the Lipid-based Formulation Classification System (LFCS). The LFCS defines four categories of oral lipid-based formulations according to their compositions, which range from a pure mixture of oils to a combination of exclusively surfactants and cosolvents6. While the LFCS relates these compositional ranges to typical properties, it does not eliminate the need to develop bespoke formulations by exploring various excipient combinations. Nonetheless, methods that shift away from the traditional development of SEDDS have emerged, largely employing data-driven tools.

In recent years, there has been significant interest in the integration of artificial intelligence (AI) and machine learning (ML) in pharmaceutical sciences, including drug formulation. These tools have been used in a variety of advanced applications, from the expedited design of polymeric long-acting injectables to the engineering of peptides for sustained delivery to the eye and the development of ionizable lipids for lipid nanoparticle delivery of mRNA12,13,14. In the context of oral lipid-based formulations, ML and computational techniques have played a role in early-stage development, notably in small molecule drug solubility screening15. Preliminary ML modeling has been used to predict drug supersaturation in lipid-based formulations and increases in the apparent solubility of drug upon dispersion of SEDDS16,17. In these cases, only a limited number of formulation compositions (i.e., two representative examples) were explored. Few studies have extensively investigated SEDDS compositions. One example is an approach integrating ML and molecular dynamics to predict self-emulsification regions for SEDDS formulations, which also reported the distribution of excipients in its dataset18. However, that study did not identify the drugs contained in the formulations in its dataset.

Thus, although SEDDS are a well-established formulation strategy, there are currently no open-access SEDDS datasets with a focus on formulation composition. Here, we present a literature-mined SEDDS dataset containing 668 unique formulations, with drug, excipient, and formulation features that may be used to better understand composition patterns or relationships and to predict formulation properties (Fig. 1). Our dataset contributes to the development of SEDDS formulations by providing a resource of documented formulations and related information that may serve as a starting point for excipient selection and screening.

Fig. 1

A schematic overview of the study. Graphical illustration of self-emulsifying drug delivery systems (SEDDS), which spontaneously emulsify into colloidal particles upon dispersion of the preconcentrate (i.e., drug-excipient mixture) in aqueous media (a). Workflow for the collection of the SEDDS dataset (b).

Methods

Data collection

All SEDDS formulations in the dataset were collected from published literature. The dataset was constructed from a search of the Web of Science database covering its inception to March 2023, using the keywords “self-emulsifying drug delivery systems” or “SEDDS” or “SNEDDS” or “SMEDDS” and “drug”, where “drug” was taken from a list of 20 poorly water-soluble drugs (i.e., active pharmaceutical ingredients (APIs)). Search results were limited to articles and filtered by publisher (i.e., Elsevier, Springer Nature, Taylor & Francis, Wiley, MDPI). An initial pool of 307 articles was manually screened, yielding 152 articles that encompassed 668 unique formulations for inclusion in the dataset (Fig. 2). Articles were omitted if they provided insufficient formulation compositional details, described formulations that did not correspond to the drug in question, or reported a non-unique formulation. The full list of source studies is provided in the source and DOI columns of the sedds_dataset_full.csv file.

Fig. 2

Sankey diagram illustrating the number of articles identified and screened for construction of the SEDDS dataset. An initial pool of 307 articles was selected following a search of the Web of Science database. Manual screening of the articles yielded 152 articles containing 668 unique formulations for inclusion in the dataset. Meandering flows indicate article searches that corresponded to one drug but provided relevant information for a different drug.

Information obtained for each sample in the dataset included the identity and relative proportion of the drug and of each individual excipient (i.e., oils, surfactants, cosolvents, and other ingredients). Other additives or ingredients were grouped by function (e.g., absorption enhancer, precipitation inhibitor, etc.) rather than by individual identity, to facilitate downstream analysis. The proportions of the components of a given formulation were standardized as compositional data, such that they summed to 100% by weight. Additional descriptors included the average particle size (i.e., droplet diameter of SEDDS upon dispersion) and the average droplet polydispersity index, where applicable. A manually defined descriptor denoting whether a given formulation was found to be promising in the context of its source article was also included. A formulation was considered promising if it was selected for further development and/or exhibited the most favourable properties (as defined by the original study) from a panel of screened formulations.
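As an illustration of the compositional standardization described above, the following is a minimal Python sketch and not the procedure used to build the dataset (which is provided in the available R code). The function name `normalize_composition` and the example amounts are hypothetical; the sketch assumes all component amounts are reported in the same weight unit.

```python
def normalize_composition(amounts):
    """Rescale component amounts (same weight unit) to weight percentages.

    `amounts` maps a component label to its reported amount. The labels
    and values used below are hypothetical, for illustration only.
    """
    total = sum(amounts.values())
    if total <= 0:
        raise ValueError("total amount must be positive")
    return {name: 100.0 * amount / total for name, amount in amounts.items()}


# Hypothetical formulation reported in mg per unit dose.
example = {"drug": 10.0, "oil": 90.0, "surfactant": 60.0, "cosolvent": 40.0}
percentages = normalize_composition(example)
```

After rescaling, the component percentages of a formulation sum to 100% by weight regardless of the units in which the source study reported them.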

The literature-mined dataset was further extended by appending additional features relating to each component of each formulation. Drug physicochemical properties were sourced from DrugBank, while excipient properties were reported according to the literature and supplier or manufacturer information.

Data preprocessing and feature engineering

To obtain a tractable dataset amenable to downstream analysis and modeling, data cleaning and preprocessing were performed. First, the trade names of excipients were all converted to chemical names to remove redundancy. For each formulation, the oils, surfactants, cosolvents, and other ingredients were counted, and the total was converted into a single so-called SEDDS complexity feature. This feature was obtained by min-max normalization of the total number of ingredients in each formulation (x), according to:

$$x^{\prime} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
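The normalization above can be sketched in a few lines of Python. This is an illustrative example only, not the authors' R implementation; the function name `min_max_normalize` and the ingredient counts are hypothetical, and returning zeros when all counts are equal is an assumption made here to avoid division by zero.

```python
def min_max_normalize(counts):
    """Min-max normalize a list of ingredient counts to the range [0, 1]."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # Assumption: with no spread in the counts, map every value to 0.
        return [0.0 for _ in counts]
    return [(c - lo) / (hi - lo) for c in counts]


# Hypothetical total ingredient counts for four formulations.
ingredient_counts = [2, 3, 4, 6]
complexity = min_max_normalize(ingredient_counts)
```

With these hypothetical counts, the formulation with the fewest ingredients receives a complexity of 0 and the one with the most receives 1.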

Furthermore, features describing the oil, surfactant, and cosolvent properties of the formulation were derived from individual component properties. For instance, the oil character of a formulation was described by binary features indicating whether the dominant fatty acid within a given oil (or across a mixture of oils) has a long aliphatic chain and/or a saturated chain. For surfactant and cosolvent features, weight-average properties were calculated based on the proportion of each excipient in a formulation. The complete procedure and calculations used to generate the dataset are provided in the available R code.
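A weight-average property of the kind described above might be computed as in the following Python sketch. It is not the authors' implementation: the function name `weight_average`, the property values, and the choice to normalize by the total weight of the excipient class (rather than by the whole formulation) are all assumptions made for illustration.

```python
def weight_average(components):
    """Weight-average of a property over the excipients in one class.

    `components` is a list of (weight_percent, property_value) pairs.
    Returns None when the class is absent from the formulation.
    """
    total = sum(weight for weight, _ in components)
    if total == 0:
        return None
    return sum(weight * value for weight, value in components) / total


# Hypothetical formulation with two surfactants: 30% w/w with a property
# value of 15.0, and 10% w/w with a property value of 5.0.
surfactants = [(30.0, 15.0), (10.0, 5.0)]
avg = weight_average(surfactants)
```

Here the result is (30 × 15 + 10 × 5) / 40 = 12.5, so the more abundant surfactant dominates the formulation-level feature.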

Data Records

The SEDDS dataset and related data are available in CSV format on Open Science Framework (OSF)19. A summary of the available files is provided in Table 1. Data files contained in the Components folder report all individual drugs and excipients, as well as their associated properties, which are collated in the final dataset, sedds_df.csv. The dataset contains 20 drugs, 44 unique oils, 31 unique surfactants, and 17 unique cosolvents. In total, the final cleaned dataset comprises 29 features for 668 SEDDS formulations (Table 2).

Table 1 Summary of available data files and their descriptions.
Table 2 List of features in the SEDDS dataset and their related formulation component and description.

Technical Validation

Given that the dataset is sourced from the literature, its validity is directly related to the quality of the source studies. Therefore, limitations pertaining to the sparsity and accuracy of reported data, and the influence of publication bias, are to be expected. By including a range of drugs and all of their available SEDDS formulations, we strove to give the dataset a more representative breadth of samples (i.e., a combination of BCS II and IV drugs; some drugs are less amenable to SEDDS formulations than others). Furthermore, studies were assessed for completeness of information and uniqueness of the reported formulation, which ensured that all compositional details are available for each sample. All possible features from the source studies were included in the dataset, but there is scope to expand it with additional descriptors, such as structural representations (e.g., for drugs or excipients), for researchers aiming to use the dataset in ML applications. Notably, droplet size and PDI of SEDDS upon dispersion are not reported in all cases, with only 506 (75.7%) formulations reporting the former and 289 (43.3%) formulations reporting the latter. While this is inherent to the nature of the data, missing data may be addressed through imputation, the application of synthetic data generation techniques, or omission.
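Two of the options mentioned above, omission and imputation, can be sketched as follows. This Python example is illustrative only; the field name `size_nm`, the example records, and the choice of median imputation are assumptions and do not reflect the column names of the dataset or a recommended strategy.

```python
from statistics import median


def drop_missing(records, key):
    """Omission: keep only records that report a value for `key`."""
    return [r for r in records if r.get(key) is not None]


def impute_median(records, key):
    """Imputation: fill missing values of `key` with the observed median."""
    observed = [r[key] for r in records if r.get(key) is not None]
    if not observed:
        return [dict(r) for r in records]
    fill = median(observed)
    return [{**r, key: fill if r.get(key) is None else r[key]} for r in records]


# Hypothetical records; "size_nm" is an assumed field name for droplet size.
records = [
    {"size_nm": 120.0},
    {"size_nm": None},
    {"size_nm": 80.0},
    {"size_nm": 200.0},
]
complete_cases = drop_missing(records, "size_nm")
imputed = impute_median(records, "size_nm")
```

Omission preserves only measured values at the cost of sample size, whereas imputation retains all formulations but introduces values that were never reported; the appropriate choice depends on the downstream analysis.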