Introduction

The rapid technological progress of the last centuries has been largely fuelled by the success of the scientific method. However, in some of the most important fields, such as materials or drug discovery, productivity has been decreasing dramatically1: today it can take almost a decade to discover a new material, at a cost upwards of $10–$100 million. One of the most daunting challenges in materials discovery is hypothesis generation. The reservoir of natural products and their derivatives has been largely emptied2, and bottom-up, human-driven hypothesis generation has proven extremely challenging: it is hard to identify and select novel and useful candidates in search spaces of overwhelming size, e.g., the chemical space for drug-like molecules is estimated to contain >10³³ structures3.

To overcome this problem, machine learning-based generative models, e.g., variational autoencoders (VAEs4) and generative adversarial networks (GANs5), have emerged in recent years as a practical approach to designing and discovering molecules with desired properties, leveraging different representations of molecular structure, e.g., text-based ones like SMILES6 and SELFIES7 or graph-based ones8. Compared to exhaustive or grid searches, generative models navigate and explore vast search spaces learned from data more efficiently and effectively, based on user-defined criteria. With a series of seminal works9,10,11,12,13, research has covered a wide variety of applications of generative models, including the design, optimisation and discovery of sugar and dye molecules14, ligands for specific targets15,16,17,18, anti-cancer hit-like molecules19,20, antimicrobial peptides21 and semiconductors22.

At the same time, we have witnessed growing community efforts to develop software packages for evaluating and benchmarking machine learning models and their applications in material science. On the property prediction side, models, data-mining toolkits and benchmarking suites such as CGCNN23, pymatgen24, Matminer25 and Matbench/AutoMatminer26 were released. On the generative side, initial efforts towards generic frameworks implementing popular baselines and metrics, such as GuacaMol27 and Moses28, paved the way for domain-specific generative model software that is gaining popularity in drug discovery, such as TDC (Therapeutics Data Commons29,30).

More recently, novel families of methods have been proposed. Generative flow networks (GFNs31,32,33), generative models that leverage ideas from reinforcement learning to improve sample diversity, provide a non-iterative sampling mechanism for structured data over graphs. GFNs are particularly suited for molecule generation, where sample diversity is challenging. Diffusion models (DMs34,35,36) are generative models that learn complex high-dimensional distributions by denoising data at multiple noise scales. DMs achieve impressive results in terms of sample quality and diversity on unconditional and conditional vision tasks. Recently, text-conditional diffusion models37,38,39 have paved the way for a new age of human–machine interaction. Leveraging such advances in conditioning generative models, DMs have been used in the biological domain for molecular conformation generation using equivariant graph networks40, for generating the 3D pose in space conditioned on a 2D representation of the molecule41, for protein generation42,43 and for docking44.

In this landscape, there is a growing need for libraries and toolkits that lower the barrier to using generative models. This need is becoming ever more pressing as models grow in size and require considerable computational resources for training. This trend creates an imbalance between a small, privileged group of researchers at well-funded institutions and the rest of the scientific community, thus impeding open, collaborative and fair science principles45.

We introduce the generative toolkit for scientific discovery (GT4SD) as a remedy. This Python library aims to bridge this gap by providing a framework that eases the training, execution and development of generative models to accelerate scientific discovery. As visualised in Fig. 1, GT4SD provides a harmonised interface with a single application registry for all generative models and a separate registry for properties. This dispenses with the need to familiarise oneself with the original developers' code, thus significantly lowering the access barrier. Moreover, the high standardisation across models eases the integration of new models and facilitates consumption via containerisation or distributed computing systems. To the best of our knowledge, GT4SD provides the largest framework for accessing state-of-the-art generative models. It can be used to execute, train, fine-tune and deploy generative models, either directly through Python or via a highly flexible command line interface (CLI). All pre-trained models can be executed directly from the browser through web apps hosted on Hugging Face Spaces. Last, for advanced users, the GT4SD model hub simplifies the release of existing algorithms trained on new datasets for instant and continuous integration into their discovery workflows.
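For illustration, the minimal sketch below, adapted from the library's documentation, resolves a conditional generation algorithm from the application registry and samples from it. The algorithm identifiers follow the project README at the time of writing, and the truncated protein target is a placeholder; names may differ across GT4SD versions.

```python
from gt4sd.algorithms.registry import ApplicationsRegistry

# Protein target (truncated placeholder sequence) used to condition the generation.
target = "MVLSPADKTNVKAAW"

# Resolve an application from the registry by its coordinates
# (algorithm type, domain, algorithm name and application).
algorithm = ApplicationsRegistry.get_application_instance(
    target=target,
    algorithm_type="conditional_generation",
    domain="materials",
    algorithm_name="PaccMannRL",
    algorithm_application="PaccMannRLProteinBasedGenerator",
)

# All algorithms expose the same sampling entry point.
molecules = list(algorithm.sample(10))
print(molecules)
```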

Fig. 1: GT4SD overview and structure.

The library implements pipelines for the inference and training of generative models. In addition, GT4SD offers utilities for algorithm versioning and sharing for broader usage in the community. The standardised interface enables algorithm instantiation and sample generation with fewer than five lines of code (top, left panel). Furthermore, the CLI tools ease running a full discovery pipeline in the terminal (top, right panel). The library provides (bottom, from left to right) algorithms for inference, a CLI utility, target domains, a property prediction interface, interfaces and implementations of generative modelling frameworks, and training pipelines. In the blue box, we provide a sample of available frameworks and methodologies for inference algorithms.

GT4SD offers a set of capabilities for generating novel hypotheses (inference pipelines) and for fine-tuning domain-specific generative models (training pipelines). It is designed to be compatible and interoperable with existing popular libraries, including PyTorch46, PyTorch Lightning47, Hugging Face Transformers48, Diffusers49, GuacaMol27, Moses28, TorchDrug50, GFlowNets33 and MoLeR51. It includes a wide range of pre-trained models and applications for material design.

GT4SD provides simple interfaces to make generative models easily accessible to users who want to deploy them with just a few lines of code. The library provides an environment for researchers and students interested in applying state-of-the-art models in their scientific research, allowing them to experiment with a wide variety of pre-trained models spanning a broad spectrum of material science and drug discovery applications. Furthermore, GT4SD provides a standardised CLI, APIs for inference and training that do not compromise the ability to specify an algorithm's finer-grained parameters, and more than 15 web apps exposing various pre-trained models.

Results

A case study in molecular discovery

Arguably, the most considerable potential for accelerating scientific discovery lies in the field of de novo molecular design, particularly in material and drug discovery. With several (pre)clinical trials underway52, it is only a matter of time until the first AI-generated drug receives FDA approval and reaches the market. In a seminal study15, a deep reinforcement learning model (GENTRL) was utilised to discover potent inhibitors of DDR1, a prominent protein kinase target involved in fibrosis, cancer and other diseases53. Six molecules were synthesised, four were found active in a biochemical assay, and one compound (hereafter called gentrl-ddr1) demonstrated favourable pharmacokinetics in mice. As an exemplary case study in molecular discovery, we consider the contrived task of adapting the hit compound gentrl-ddr1 into a similar molecule with improved estimated water solubility (ESOL; Delaney54). Low aqueous solubility affects >40% of new chemical entities55, posing major barriers to drug delivery. Improving solubility requires exploring the local chemical space around the hit (i.e., gentrl-ddr1) to find an optimised lead compound.

A summary of how this task can be addressed using GT4SD is shown in Fig. 2. In the first step, a rich set of pre-trained molecular generative models is accessed via the harmonised interface of GT4SD. Two main model classes are available. The first category comprises graph generative models, such as MoLeR51 or models from the TorchDrug library, specifically a graph-convolutional policy network12 and a flow-based autoregressive model (GraphAF56). The second class comprises chemical language models (CLMs), which treat molecules as text (SMILES6 or SELFIES7 sequences). Most of the chemical language models in GT4SD are accessed via the libraries Moses28 or GuacaMol27; in particular, a VAE9, an adversarial autoencoder (AAE57) and an objective-reinforced GAN model (ORGAN58). We randomly sample molecules from the learned chemical space of each model. Assessing the Tanimoto similarity of the generated molecules to gentrl-ddr1 reveals that this approach, while producing many molecules with satisfying ESOL, did not sufficiently reflect the similarity constraint to the seed molecule (cf. Fig. 2, bottom left). This is expected because the investigated generative models are unconditional.
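To make this assessment concrete, the snippet below scores generated SMILES against a seed molecule using RDKit Morgan fingerprints. The seed shown is a placeholder for gentrl-ddr1 (whose structure we do not reproduce here), and the sample list stands in for the output of any GT4SD generator, e.g. list(algorithm.sample(100)).

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Placeholder SMILES standing in for gentrl-ddr1 (substitute the actual structure).
SEED_SMILES = "CC1=CC=CC=C1"

def tanimoto_to_seed(smiles_list, seed_smiles=SEED_SMILES):
    """Tanimoto similarity (Morgan fingerprints, radius 2) of each valid molecule to the seed."""
    seed_fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(seed_smiles), 2, nBits=2048
    )
    scores = {}
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:  # skip invalid generations
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
            scores[smiles] = DataStructs.TanimotoSimilarity(seed_fp, fp)
    return scores

# `samples` would come from a GT4SD generator in a real run.
samples = ["CCO", "c1ccccc1C"]
print(tanimoto_to_seed(samples))
```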

Fig. 2: Case study using the GT4SD for molecular discovery.

Starting from a compound designed using generative models15 (gentrl-ddr1), we show how GT4SD can be used to swiftly design molecules with desired properties using a battery of algorithms available in the library in two settings: unconditional (bottom left) and conditional (bottom right). The conditional models can be constrained with chemical scaffolds or conditioned on desired property values.

As a more refined approach, GT4SD includes conditional molecular generative models that can be primed with natural-text queries (Text+Chem T559), continuous property constraints or molecular substructures (e.g., scaffolds), such as MoLeR51 and REINVENT60, or even with combinations of property constraints and molecular substructures (Regression Transformer, RT61). The molecules obtained from those models, in particular MoLeR and the RT, largely respected the similarity constraint and produced many molecules with a Tanimoto similarity >0.5 to gentrl-ddr1. MoLeR and the RT improved the ESOL by more than 1 mol/L (cf. Fig. 2, right). In a realistic discovery scenario, the molecules generated with the described recipes could be manually reviewed by medicinal chemists and selectively considered for synthesis and screening.
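As an illustrative sketch of property-conditioned generation, the snippet below primes the Regression Transformer with a solubility goal around a seed molecule. It loosely follows the library's published Regression Transformer example, but the import path, parameter names and values should be treated as assumptions that can differ between GT4SD versions; the seed SMILES is again a placeholder for gentrl-ddr1.

```python
from gt4sd.algorithms.conditional_generation.regression_transformer.core import (
    RegressionTransformer,
    RegressionTransformerMolecules,
)

SEED = "CC1=CC=CC=C1"  # placeholder for gentrl-ddr1

# Configure the solubility-tuned RT: mask a fraction of the seed molecule and
# steer sampling towards a target ESOL value (hypothetical settings).
configuration = RegressionTransformerMolecules(
    algorithm_version="solubility",
    search="sample",
    temperature=1.4,
    tolerance=5,
    sampling_wrapper={
        "property_goal": {"<esol>": -3.0},
        "fraction_to_mask": 0.3,
    },
)
algorithm = RegressionTransformer(configuration=configuration, target=SEED)
candidates = list(algorithm.sample(10))
print(candidates)
```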

Discussion

GT4SD is a first step toward a harmonised generative modelling environment for accelerated material discovery. In the future, we plan to expand the application domains (e.g., inorganic materials, climate, weather62, sustainability, geo-informatics and human mobility63) and to integrate novel algorithms, ideally with the support of a steadily growing open-science community.

Future developments will focus on two main components: expanding model evaluation and sample property prediction, and developing an ecosystem for sharing models built on top of the functionalities exposed via the existing CLI commands for model lifecycle management. For the first aspect, we will expand the metrics currently integrated from GuacaMol and Moses and explore bias measures to better analyse performance in light of the generated examples and their properties. Regarding the sharing ecosystem, we believe GT4SD will further benefit from an intuitive application hub that facilitates the distribution of pre-trained generative models (largely inspired by the Hugging Face model hub48) and enables users to easily fine-tune models on custom data for specific applications.

We anticipate that GT4SD will democratise generative modelling in the material sciences and empower the scientific community to access, evaluate, compare and refine large-scale pre-trained models across a wide range of applications.

Methods

Library structure

The GT4SD library follows a modular structure (Fig. 1) whose main components are: (i) algorithms for serving models in inference mode following a standardised API; (ii) training pipelines sharing a common interface with algorithm family-specific implementations; (iii) domain-specific utilities shared across various algorithms; (iv) a property prediction interface to evaluate generated samples (currently covering small molecules, proteins and crystals); (v) frameworks implementing support for complex workflows, e.g., granular for training mixtures of generative and predictive models or enzeptional for enzyme design. Besides the core components, there are top-level sub-modules for configuration, handling the cloud object storage-based cache, and error handling.
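Component (iv) can be illustrated with the property registry pattern from the library's documentation. The property name "esol" and the callable-predictor usage follow that documentation but should be treated as assumptions, since the set of available properties may vary across versions.

```python
from gt4sd.properties import PropertyPredictorRegistry

# Retrieve a predictor for estimated water solubility (ESOL)
# from the property registry and score a sample molecule.
esol = PropertyPredictorRegistry.get_property_predictor("esol")
print(esol("CCO"))  # estimated log-solubility of ethanol
```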

Inference pipelines

The API implementation underlying the inference pipelines has been designed to support various generative model types: generation, conditional generation, controlled sampling and simple prediction algorithms. All the algorithms implemented in GT4SD follow a standard contract that guarantees a standardised way to call an algorithm in inference mode. The specific algorithm interfaces and applications are responsible for defining implementation details and loading the model files from a cache synced with the cloud object storage hosting the model versions.
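Concretely, the standard contract pairs a configuration object (defining the application and its parameters) with an algorithm class exposing a common sample generator, as in this sketch adapted from the library's README; the truncated target is a placeholder, and class names may vary across versions.

```python
from gt4sd.algorithms.conditional_generation.paccmann_rl.core import (
    PaccMannRL,
    PaccMannRLProteinBasedGenerator,
)

# Truncated placeholder protein target conditioning the generation.
target = "MVLSPADKTNVKAAW"

# The configuration defines the application (and, implicitly, the model version)...
configuration = PaccMannRLProteinBasedGenerator()
# ...while the algorithm class wraps it into the standard inference contract.
algorithm = PaccMannRL(configuration=configuration, target=target)

# The same sampling entry point is shared by all GT4SD algorithms.
molecules = list(algorithm.sample(10))
print(molecules)
```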

Training pipelines

Training pipelines follow the same philosophy adopted in implementing the inference pipelines. A common interface allows implementing algorithm family-specific classes with an arbitrarily customisable training method that can be configured using a set of data classes. Each training pipeline is associated with a class implementing the actual training process and a triplet of configuration data classes that control arguments for model hyper-parameters, training parameters and data parameters.
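The triplet-of-data-classes pattern can be sketched as follows. This is a self-contained illustration of the described interface with hypothetical names, not GT4SD's actual class definitions.

```python
from dataclasses import dataclass

@dataclass
class ModelArguments:
    # Hyper-parameters of the model itself.
    hidden_size: int = 256
    num_layers: int = 4

@dataclass
class TrainingArguments:
    # Parameters controlling the optimisation loop.
    epochs: int = 10
    learning_rate: float = 1e-4

@dataclass
class DataArguments:
    # Parameters describing the dataset.
    train_file: str = "train.smi"
    validation_file: str = "valid.smi"

class ExampleTrainingPipeline:
    """A training pipeline configured by the three argument data classes."""

    def train(
        self,
        model_args: ModelArguments,
        training_args: TrainingArguments,
        data_args: DataArguments,
    ) -> None:
        # The actual training process would be implemented here.
        print(
            f"Training a {model_args.num_layers}-layer model "
            f"for {training_args.epochs} epochs on {data_args.train_file}"
        )

ExampleTrainingPipeline().train(ModelArguments(), TrainingArguments(), DataArguments())
```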

CLI commands

To ease consumption of the pipelines and models implemented in GT4SD, a series of CLI endpoints is available alongside the package: (i) gt4sd-inference, to inspect and run pipelines for inference; (ii) gt4sd-trainer, to list and configure training pipelines; (iii) gt4sd-saving, to persist in a local cache a model version trained via GT4SD for usage in inference mode; (iv) gt4sd-upload, to upload model versions trained via GT4SD to a model hub and share algorithms with other users. The CLI commands make it possible to implement a complete discovery workflow where, starting from a source algorithm version, users can retrain it on custom datasets and make a new algorithm version available in GT4SD.
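A minimal terminal session for this workflow could look as follows. Apart from --help (standard for these endpoints) and the --training_pipeline_name selector, any pipeline-specific flags are placeholders and should be taken from the respective help output of the installed GT4SD version.

```console
# List the available training pipelines and their options.
gt4sd-trainer --help

# Inspect the arguments of a specific pipeline (name is a placeholder).
gt4sd-trainer --training_pipeline_name <pipeline-name> --help

# After training: persist the new model version locally, then share it on the hub.
gt4sd-saving --help
gt4sd-upload --help
```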