Abstract
Selecting diverse molecules from unexplored areas of chemical space is one of the most important tasks for discovering novel molecules and reactions. This paper proposes a new approach for selecting a subset of diverse molecules from a given molecular list by using two existing techniques studied in machine learning and mathematical optimization: graph neural networks (GNNs) for learning vector representations of molecules and a diverse-selection framework called submodular function maximization. Our method, called SubMo-GNN, first trains a GNN with property prediction tasks, and then the trained GNN transforms molecular graphs into molecular vectors, which capture both properties and structures of molecules. Finally, to obtain a subset of diverse molecules, we define a submodular function, which quantifies the diversity of molecular vectors, and find a subset of molecular vectors with a large submodular function value. This can be done efficiently by using the greedy algorithm, and the diversity of selected molecules measured by the submodular function value is mathematically guaranteed to be at least 63% of that of an optimal selection. We also introduce a new evaluation criterion to measure the diversity of selected molecules based on molecular properties. Computational experiments confirm that our SubMo-GNN successfully selects diverse molecules from the QM9 dataset regarding the property-based criterion, while performing comparably to existing methods regarding standard structure-based criteria. We also demonstrate that SubMo-GNN with a GNN trained on the QM9 dataset can select diverse molecules even from other MoleculeNet datasets whose domains differ from that of QM9. The proposed method enables researchers to obtain diverse sets of molecules for discovering new molecules and novel chemical reactions, and the proposed diversity criterion is useful for discussing the diversity of molecular libraries from a new property-based perspective.
Introduction
Chemical space^{1,2,3,4}, a concept to represent an ensemble of chemical species, was originally established in medicinal chemistry^{2,5} and is used in a wide area of chemistry. The size of chemical space, i.e., the number of molecules in it, is estimated to be \(10^{60}\) even if it is limited to drug-like molecules^{6}, and other estimations of chemical-space sizes have also been reported^{4,7}. In any case, the number of molecules is too large to explore exhaustively. Currently, more than 68 million molecules are registered in the chemical abstracts service (CAS) of the American Chemical Society^{8,9}, and some accessible online molecular databases, e.g., PubChem^{10} and ZINC^{11}, have been constructed. Moreover, owing to recent advances in high-throughput screening, chemoinformatics^{12}, and machine learning^{13}, many chemical compounds have been discovered from chemical space in the fields of organic light-emitting diodes^{14}, organic synthesis^{15}, and catalysis^{16}. These are, however, only small fractions of chemical space, and vast areas remain unexplored.
Selecting diverse molecules from chemical space is an important task for discovering molecules that exhibit novel properties and new chemical reactions^{3,17}. In medicinal chemistry, diversity selection algorithms have been widely studied for exploring chemical space and discovering bioactive molecules^{5,18,19,20,21}. The diversity of a set of molecules is also essential in molecular library design^{17,22}. Furthermore, when analyzing the quality of molecular libraries, the way to assess their diversity is crucial. This paper contributes to diverse molecular selection by proposing a novel selection framework and a new criterion for evaluating the diversity of molecules.
For computing the diversity of sets of molecules, most existing methods start by specifying molecular descriptors, which encode molecules as vectors. Examples of molecular descriptors include ECFP^{23}, MACCS keys^{24}, and Daylight fingerprints^{25}, which typically encode structural information of molecules as binary vectors. Given such descriptors, pairwise dissimilarities are defined to quantify how dissimilar two molecules are. A widely used pairwise dissimilarity is the Tanimoto coefficient (more precisely, the Tanimoto coefficient indicates a similarity value, and subtracting it from 1 yields the dissimilarity)^{26}. Computation of molecular similarities constitutes a broad research area, and other approaches based on, e.g., graph edit distances^{27}, cosine similarities of SMILES kernels^{28}, maximum common substructures^{29}, a root mean square deviation of 3D molecular structures^{30}, and persistent homology (a topological signature)^{31} have also been proposed. Given such pairwise (dis)similarity measures, the diversity of sets of molecules is usually evaluated with, e.g., the mean pairwise dissimilarity (MPD) or the mean distance to closest neighbors calculated over selected molecules.
While the diversity of molecules can be computed as above, selecting molecules that maximize a diversity measure from given molecular lists is computationally more challenging. For example, a naive brute-force search for selecting 10 out of 100 compounds requires calculating diversity values \(\left( {\begin{array}{l}100\\ 10\end{array}}\right) \) times. To overcome this computational difficulty, the greedy algorithm, which iteratively selects the new molecule most dissimilar to the set of currently selected molecules, has been widely used as an efficient heuristic method^{32}. In each iteration, the dissimilarity between a new molecule and the set of selected molecules is computed according to a certain rule, e.g., MaxSum^{33} or MaxMin^{34,35}, and the choice of such rules affects the output of the greedy algorithm. The diversity of molecular sets obtained by the greedy algorithm is usually evaluated with, e.g., the MPD defined with the Tanimoto coefficient of MACCS keys. Diversity values calculated in this way intrinsically depend on the choice of molecular descriptors and pairwise dissimilarities. Consequently, the existing framework for selecting molecules and evaluating their diversity puts much weight on structural information of molecules, since molecular descriptors usually encode structural information and pairwise dissimilarities are calculated from such structure-based descriptors.
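To make the computational gap concrete, the following back-of-the-envelope comparison counts diversity evaluations for both strategies (a sketch; exact counts depend on the selection rule used):

```python
from math import comb

# Brute force: evaluate the diversity of every size-10 subset of 100 molecules.
brute_force_evals = comb(100, 10)   # more than 1.7 * 10^13 subsets

# Greedy: in iteration t, score each of the remaining 100 - t candidates once.
greedy_evals = sum(100 - t for t in range(10))  # 955 evaluations in total

print(brute_force_evals, greedy_evals)
```

The greedy heuristic thus replaces tens of trillions of subset evaluations with fewer than a thousand candidate evaluations.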
On the other hand, exploration of chemical space that takes the diversity of molecular properties into account has been reported to be effective for discovering novel functional materials^{36}. Also, in drug discovery, the Fréchet ChemNet Distance (FCD), which is a novel property-based metric using hidden layers of prediction models for bioactivities as representations of molecules, has been reported to be useful for evaluating models for generating molecules^{37}. When it comes to discovering novel reactions, examining collections of molecules that are diverse regarding molecular properties (in particular, reactivity) is vital for efficient exploration of chemical space. Therefore, utilizing not only structural information but also properties of molecules can be a promising approach to pushing the diverse molecular selection framework to the next level, which will facilitate the discovery of novel molecules and new chemical reactions.
In the field of machine learning, neural network (NN) architectures have yielded great success in various areas such as image recognition and natural language processing. Following these achievements, researchers have applied them to molecular property prediction tasks. Among such approaches, graph neural networks (GNNs) have been gaining attention since many GNN-based prediction methods have achieved high performance^{38,39,40,41,42,43}. GNNs transform molecular graphs into vectors, which are used in downstream property prediction tasks. Notably, GNNs generate vectors taking both molecular properties and structural information of molecules into account, and it is reported that molecular vectors obtained from trained GNNs successfully reflect chemists’ intuition of molecular structures^{41}. Therefore, GNN-based molecular vectors can be effective alternatives to the aforementioned traditional molecular descriptors. However, to leverage GNN-based vectors for selecting diverse molecules, we need to discuss how to select diverse molecular vectors generated by GNNs, for which the existing structure-based selection framework is not necessarily appropriate.
Mathematically, the problem of selecting diverse items (in our case, molecular vectors) has been widely studied as submodular function maximization^{44,45}. This framework is one of the best ways for diverse selection due to the following two advantages. First, many submodular functions for quantifying the diversity have been developed in various fields, and thus we can choose an appropriate one to achieve desirable diverse selection. In particular, some submodular functions can represent relationships between multiple molecules that pairwise similarities cannot capture. For example, the log-determinant function, a submodular function our method will use, serves as a volumetric diversity measure of molecular vectors. Such functions can offer us the potential for going beyond the existing pairwise-similarity-based framework. Second, and more importantly, we can mathematically guarantee the greedy algorithm to select near-optimally diverse molecules in terms of submodular function values. Specifically, resulting submodular function values are guaranteed to be at least \(63\%\) of those achieved by optimal selection^{44}. Moreover, the empirical performance of the greedy algorithm for submodular function maximization is known to be much better; it often achieves more than \(90\%\) of optimal values^{46,47}. Therefore, the submodularity-based approach enables us to efficiently obtain near-optimally diverse sets of molecules without relying on costly selection algorithms such as the brute-force search.
This paper proposes a new approach to diverse molecular selection by utilizing the aforementioned GNN-based molecular vectors and the existing submodularity-based selection method. First, we train a GNN with property prediction tasks and use the trained GNN to transform molecular graphs into molecular vectors. Then, we define a submodular function that quantifies the diversity of those molecular vectors as volumes of parallelotopes spanned by them. Owing to the submodularity of the function, we can select near-optimally diverse molecular vectors by using the greedy algorithm. Both GNNs and submodular function maximization are known to be effective in various tasks, and thus each of them has been well studied. However, few existing studies utilize both of them for a single purpose. The only exception is a recent study on multi-robot action selection^{48}, which uses GNNs in selection methods, while we use GNNs to design submodular functions. In view of this, our work provides a new type of application combining GNNs and submodular function maximization. Furthermore, to evaluate the diversity of selected molecules based on molecular property values, we introduce a new diversity measure using the Wasserstein distance^{49,50} to uniform distributions defined on molecular property values. This property-based measure can play a complementary role to the existing structure-based measures such as the MPD of the Tanimoto coefficients, thus enabling researchers to more profoundly discuss the diversity of molecules. Computational experiments compare the proposed method with the existing structure-based methods and confirm that our method selects more diverse molecules regarding molecular properties. Furthermore, although our method does not explicitly use structure-based descriptors (e.g., ECFP and MACCS keys), it successfully selects diverse molecules in terms of MPD values calculated with the Tanimoto coefficient of such structure-based descriptors.
We also validate the practical effectiveness of our method via experiments on out-of-domain settings, where we use datasets in different domains between training of GNNs and selection of molecules.
Method
This section presents our molecular selection method, which comprises two steps: training a GNN that generates molecular vectors and selecting GNN-based molecular vectors via submodular function maximization. Figure 1 shows a high-level sketch of our method. In Step 1, we train a GNN and task-specific layers with property prediction tasks, where the GNN converts molecular graphs into molecular vectors and the task-specific layers take them as input and predict molecular properties. In this step, parameters of the GNN and the task-specific layers are updated by the error backpropagation method. In Step 2, we transform graphs of candidate molecules into molecular vectors by using the GNN trained in Step 1, and then select a predetermined number of molecular vectors based on submodular function maximization.
We also introduce a new property-based diversity criterion, which quantifies the diversity of selected molecules as the Wasserstein distance to uniform distributions defined on molecular property values. Intuitively, we regard a set of molecules as diverse if the property values of those molecules are evenly distributed.
Graph neural networks for generating molecular vectors
We briefly explain how GNNs generate molecular vectors. GNNs are deep learning architectures that work on graph domains. Taking a graph with node and edge features as input, GNNs capture structures of the graph by iteratively passing messages, which are calculated based on the features. Specifically, each node iteratively receives messages from its neighbors, aggregates them, and passes the aggregated result to its neighbors; after this message-passing phase, a molecular vector, denoted by \({\varvec{x}}\), is computed based on the resulting messages of all nodes. Along the way, messages are updated with certain parameterized functions. Our specific choice of GNN architecture is Attentive FP^{41}, which is reported to achieve high performance in molecular property prediction. For the sake of completeness, we present mathematical details of GNNs in the “Supplementary information”.
In the task-specific layer, molecular properties, \({\varvec{y}}\), are predicted with molecular vector \({\varvec{x}}\) via simple linear regression as \(\hat{{\varvec{y}}} = {\varvec{W}} {\varvec{x}}+ {\varvec{b}}\), where \(\hat{{\varvec{y}}}\) is a prediction of \({\varvec{y}}\). In the training step (Step 1 in Fig. 1), we update \({\varvec{W}}\), \({\varvec{b}}\), and the parameters of the GNN by backpropagation, where the loss function is the mean squared error between \(\hat{{\varvec{y}}}\) and \({\varvec{y}}\). Consequently, the GNN, which captures structures of molecular graphs via message passing, is trained to predict molecular properties. Therefore, the trained GNN generates molecular vectors taking both structures and properties of molecules into account.
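As an illustration, the task-specific head is just a linear map applied to the molecular vector. The dimensions below are hypothetical (a 16-dimensional molecular vector; 12 outputs matching the number of QM9 target properties), and the random weights stand in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(16)                  # hypothetical 16-dim molecular vector from the GNN
W = rng.standard_normal((12, 16))   # one output row per predicted property (12 for QM9)
b = np.zeros(12)

y_hat = W @ x + b                   # task-specific layer: y_hat = W x + b
```

In training, the mean squared error between `y_hat` and the true property vector is backpropagated through both the head and the GNN.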
Selection of diverse molecular vectors
Given molecular vectors generated by the trained GNN, we aim to obtain a set of diverse molecules by selecting diverse molecular vectors. For selecting diverse vectors, we utilize the mathematical framework called submodular function maximization.
Submodular function maximization
Submodular function maximization has been studied in the field of combinatorial optimization. This framework enables development of effective diverse selection methods by offering flexible models for representing the diversity and efficient selection algorithms with mathematical performance guarantees; below we detail these two advantages.
The first advantage of using the submodular-function-maximization framework is that there are various known functions for representing the diversity. To find a diverse subset from a large pool of molecules, researchers specify a diversity criterion and search for a diverse subset based on the criterion. Here, a diversity criterion is formally regarded as a set function, which assigns to each subset a real value that indicates how diverse the subset is. Some such functions have a special property called submodularity, and they are called submodular functions. Many submodular functions have been developed as diversity criteria for various kinds of data such as images, documents, and videos. Therefore, we can choose a suitable one from them for modeling the diversity of molecular vectors. For example, the Shannon entropy is known to satisfy submodularity with respect to the selection of random variables. Other diversity criteria that have submodularity include the ROUGE-N score for document summarization^{51,52} and facility location functions^{53}. In the area of bioinformatics, submodular functions for peptide identification have also been developed^{54}.
The second advantage of the submodular-function-maximization framework is that we can utilize various simple, efficient, and mathematically rigorous algorithms for selecting a diverse subset. When selecting a subset from a large number of molecular vectors, there are exponentially many possible candidate subsets. Therefore, we need efficient algorithms for finding diverse subsets. In a series of studies on submodular function maximization, many simple and efficient algorithms for finding subsets with large submodular function values have been developed. Notably, the resulting submodular function values are often guaranteed to be nearly optimal by mathematical analyses. Therefore, once we specify a submodular function as a diversity criterion, we can automatically ensure that those algorithms return highly diverse subsets with respect to the criterion. Among such algorithms, the greedy algorithm is widely used due to its simplicity, efficiency, and strong mathematical guarantee^{44}.
In the “Supplementary information”, we present mathematical details of submodular function maximization and the greedy algorithm.
Log-determinant function
In our computational experiments, we use a submodular function called the log-determinant function, which quantifies the diversity of selected molecular vectors based on the volume of the parallelotope spanned by the selected vectors. As depicted in Fig. 2a, the more diverse the directions of vectors are, the larger the volume of the parallelotope spanned by them. Thus the log-determinant function provides a volume-based measure of the diversity of vectors, and it is often used for expressing the diversity of vector datasets^{55}. Note that the volume-based diversity can capture relationships of vectors that cannot be represented in a pairwise manner. Therefore, the log-determinant function yields a selection rule different from those of existing methods such as MaxSum and MaxMin, which use pairwise dissimilarities of molecular structures.
Formally, the log-determinant function is defined as follows. Suppose that n candidate molecules are numbered \(1,\ldots ,n\) and that the ith molecule is associated with an m-dimensional molecular vector \({\varvec{x}}_i\) for \(i=1,\ldots ,n\). Let \({\varvec{X}}=[{\varvec{x}}_1\ {\varvec{x}}_2\ \ldots \ {\varvec{x}}_n]\) be the \(m \times n\) matrix whose columns are given by the n molecular vectors. For any \(S\subseteq N{:}{=} \{1,\ldots ,n\}\), we denote by \({\varvec{X}}[S]\) the \(m \times |S|\) submatrix of \({\varvec{X}}\) with columns restricted to S. We define the log-determinant function \(f_{\text {logdet}}\) by

\(f_{\text {logdet}}(S) = \log \det \left( {\mathbf {I}}_{S} + {\varvec{X}}[S]^{\top } {\varvec{X}}[S] \right)\)

for any \(S \subseteq N\), where \({\mathbf {I}}_{S}\) is the \(|S| \times |S|\) identity matrix. The relationship between the \(f_{\text {logdet}}\) value and the volume of a parallelotope can be formally described as follows. Let \(\tilde{{\varvec{x}}}_i\) (\(i=1,\ldots ,n\)) be a vector of length \(m+n\) such that the first m elements are given by \({\varvec{x}}_i\), the (\(m+i\))th element is 1, and the others are 0. When \(S\subseteq N\) is selected, \(f_{\text {logdet}}(S)\) equals the logarithm of the squared volume of the parallelotope spanned by \(\{\tilde{{\varvec{x}}}_i\}_{i\in S}\).
Given the function \(f_{\text {logdet}}\) and the number \(k\) of molecules to be selected, the greedy algorithm operates as follows: it first sets \(S = \emptyset \) and sequentially adds the \(i \in N\setminus S\) with the largest \(f_{\text {logdet}}(S\cup \{i\}) - f_{\text {logdet}}(S)\) value to S while \(|S| < k\) holds. In our computational experiments, we use a fast implementation of the greedy algorithm specialized for the log-determinant function^{56}. Function \(f_{\text {logdet}}\) satisfies \(f_{\text {logdet}}(\emptyset ) = 0\), monotonicity (i.e., \(S\subseteq T\) implies \(f_{\text {logdet}}(S) \le f_{\text {logdet}}(T)\)), and submodularity. With these properties, we can mathematically guarantee that the greedy algorithm returns a subset whose \(f_{\text {logdet}}\) value is at least \(1 - 1/{\text {e}}\approx 63\%\) of that of an optimal selection.
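The greedy loop can be sketched as follows. This naive version recomputes the determinant from scratch at every step (the fast specialized implementation cited above avoids this), and `logdet_value` follows the definition \(f_{\text {logdet}}(S) = \log \det ({\mathbf {I}} + {\varvec{X}}[S]^{\top } {\varvec{X}}[S])\) directly:

```python
import numpy as np

def logdet_value(X, S):
    """f_logdet(S) = log det(I_|S| + X[S]^T X[S]) for a list S of column indices."""
    if not S:
        return 0.0
    Xs = X[:, S]
    M = np.eye(len(S)) + Xs.T @ Xs
    return np.linalg.slogdet(M)[1]  # log of the (positive) determinant

def greedy_logdet(X, k):
    """Greedily pick k column indices maximizing the log-determinant function."""
    n = X.shape[1]
    S = []
    for _ in range(k):
        # Marginal gain of each remaining candidate; pick the largest.
        gains = [(logdet_value(X, S + [i]) - logdet_value(X, S), i)
                 for i in range(n) if i not in S]
        S.append(max(gains)[1])
    return S
```

On a toy matrix with two identical columns and one orthogonal column, the greedy rule picks one copy of each direction rather than the duplicate pair, reflecting the volume-based notion of diversity.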
Refinements to molecular vector generation: ReLU and normalization
We refine the GNN-based vector generation process so that it works better with the log-determinant function. Specifically, we make the GNN outputs nonnegative and normalized. Below we detail why these refinements are needed and explain how to modify the vector generation process.
First, as in Fig. 2b, if vectors are allowed to have negative elements, nearly origin-symmetric vectors form a parallelotope with a small volume even though their directions are diverse. Consequently, the log-determinant function fails to indicate that such molecular vectors correspond to diverse molecules. To circumvent this issue, we add a ReLU layer to the end of the GNN, which makes all entries of output vectors nonnegative.
Second, if GNNs are allowed to output vectors with different norms, task-specific layers may distinguish molecules with different properties based on the norm of molecular vectors. In such cases, maximizing the log-determinant function may result in selecting non-diverse vectors for the following reason. As mentioned above, the log-determinant function represents the volume of the parallelotope spanned by selected vectors, and the volume becomes larger if selected vectors have larger norms. Consequently, molecular vectors with larger norms are more likely to be selected, which may result in selecting molecules with almost the same properties as in Fig. 2c. To resolve this problem, after passing through the ReLU layer, we normalize molecular vectors so that their norms become 1 by projecting them onto a hypersphere. In other words, we add a normalization layer that transforms molecular vector \({\varvec{x}}\) as

\(\hat{{\varvec{x}}} = {\varvec{x}} / \Vert {\varvec{x}} \Vert _2.\)
As a result, \(\hat{{\varvec{x}}}\) becomes nonnegative and its norm is equal to 1. In the training phase, we train the GNN with the additional ReLU and normalization layers, where nonnegative normalized vector \(\hat{{\varvec{x}}}\) is used for predicting property values as \(\hat{{\varvec{y}}} = {\varvec{W}} \hat{{\varvec{x}}} + {\varvec{b}}\). Due to the above normalization, the task-specific layers cannot distinguish molecular vectors by using their norms, and thus the GNN learns to generate molecular vectors so that task-specific layers can predict molecular property values based not on norms but on angles of vectors. Consequently, as illustrated in Fig. 2d, diverse molecular vectors can be obtained by maximizing the log-determinant function value. We experimentally confirmed that GNNs trained with normalization yield similar prediction results to those obtained without normalization (see the “Supplementary information”). This implies that GNNs trained with normalization can successfully generate molecular vectors whose angles have enough information for predicting molecular properties.
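The two refinements amount to the following transformation of a raw GNN output vector (a minimal sketch; the `eps` guard against an all-zero post-ReLU vector is our own assumption, not part of the method as described):

```python
import numpy as np

def relu_normalize(x, eps=1e-12):
    """Apply ReLU, then project onto the unit hypersphere (L2 normalization)."""
    x = np.maximum(x, 0.0)        # ReLU: clip negative entries to zero
    norm = np.linalg.norm(x)
    return x / max(norm, eps)     # eps avoids division by zero for all-zero vectors
```

After this layer, every molecular vector is nonnegative with unit norm, so the log-determinant function compares vectors purely by their angles.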
Property-based evaluation of diversity
By using our selection method, we can select molecules so that the corresponding molecular vectors are diverse. However, even if molecular vectors are diverse, the selected molecules themselves may not be diverse. The same issue arises with the existing structure-based methods, and it has been overlooked in previous studies. That is, the existing methods select molecules that are diverse in terms of the Tanimoto coefficient of molecular descriptors (e.g., MACCS keys or ECFP), and thus those methods naturally achieve high mean pairwise dissimilarity (MPD) values, which are also calculated by using the Tanimoto coefficient of such descriptors. If we are to evaluate selection methods fairly, we need diversity criteria that do not use the molecular descriptors employed by selection methods. This section presents such a criterion for evaluating the diversity of selected molecules in terms of their property values without using molecular vectors. In contrast to the existing structure-based criteria (e.g., the aforementioned MPD values), our criterion is based on the diversity of property values, thus offering a new perspective for evaluating the diversity of molecules.
Our idea is to regard molecular property values as diverse if they are evenly distributed over an interval on the property-value line. We quantify this notion of diversity using a statistical distance between the distribution of property values of selected molecules and a uniform distribution. As a distance between two distributions, we use the Wasserstein distance, which is defined by the minimum cost of transporting the mass of one distribution to another, as detailed below. We call this diversity criterion the Wasserstein distance to a uniform distribution (WDUD). A smaller WDUD value implies that selected molecules are more diverse since the distribution of their property values is closer to being uniform.
Formally, WDUD is defined as follows. Let \(v_{\max }\) and \(v_{\min }\) be the maximum and minimum property values, respectively, in a given list of molecules. Suppose that \(k\) molecules with property values \(y_1,y_2,\ldots ,y_k\) are selected from the list. We assign probability mass \(1/k\) to each \(y_i\) and compute how far this discrete distribution is from a uniform distribution over \([v_{\min }, v_{\max }]\). Let V and U be the cumulative distribution functions of the two distributions, respectively. Defining the transportation cost from \(y\in [v_{\min }, v_{\max }]\) to \(y_i\) as \(|y - y_i|\), the WDUD value can be computed as \(\int _{v_{\min }}^{v_{\max }} |U(x) - V(x)| \, {\text {d}}x\)^{50}, which we use for quantifying the diversity of property values \(\{y_1,y_2,\ldots ,y_k\}\) of selected molecules.
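For a single property, the WDUD integral can be approximated on a grid by comparing the empirical CDF of the selected values with the uniform CDF (a sketch; the grid resolution is a numerical choice of ours, not part of the definition):

```python
import numpy as np

def wdud(y, v_min, v_max, grid=10_000):
    """Wasserstein distance between the empirical distribution of selected
    property values y and the uniform distribution on [v_min, v_max]."""
    y = np.sort(np.asarray(y, dtype=float))
    xs = np.linspace(v_min, v_max, grid)
    U = (xs - v_min) / (v_max - v_min)                   # uniform CDF
    V = np.searchsorted(y, xs, side="right") / len(y)    # empirical CDF
    return float(np.sum(np.abs(U - V)) * (xs[1] - xs[0]))  # Riemann sum of |U - V|
```

Evenly spread property values yield a small WDUD, while values clustered at one end of the interval yield a large one.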
There are other possible choices of statistical distances, such as the variance or the Kullback–Leibler (KL) divergence. However, the Wasserstein distance is more suitable for measuring the diversity than the variance and the KL divergence for the following reasons. If we use the variance of property values of selected molecules as a diversity measure, a set of molecules with extreme property values is regarded as diverse, although this selection is biased since it ignores property values near the mean (see Fig. 3a). If we use the KL divergence between the property-value distribution of selected molecules and the uniform distribution, the distance structure of the support is ignored, unlike WDUD, which takes the \(\ell _1\)-distance \(|y - y_i|\) into account. As a result, we cannot distinguish molecular sets with completely different diversities as in Fig. 3b.
Wasserstein greedy: a property-based benchmark method
In the computational experiments, we use a benchmark method that is intended to minimize the WDUD value directly. To the best of our knowledge, selecting a set of molecules that exactly minimizes the WDUD value requires solving a mixed-integer program, which is computationally hard in general. Instead, we select molecules with small WDUD values by using a simple greedy heuristic, which starts with the empty set and repeatedly selects the molecule that yields the largest decrease in the WDUD value. When considering the WDUD of multiple molecular properties, we normalize the property values to [0, 1] and use the mean WDUD value. In the experiments, property values are known only for a training dataset, while we have to select molecules from a test dataset. Therefore, we compute WDUD values by using property values predicted by the trained GNN (without the normalization technique). Compared with our proposed method, this benchmark method is specialized for achieving small WDUD values (i.e., diversity of molecular property values), while it does not explicitly use information on molecular structures.
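The benchmark's greedy loop can be sketched as follows. The WDUD helper is recomputed here so the snippet is self-contained, and evaluating every remaining candidate at every step is our simplification rather than the paper's exact implementation:

```python
import numpy as np

def wdud(y, v_min, v_max, grid=4_001):
    """Grid approximation of the Wasserstein distance to the uniform distribution."""
    y = np.sort(np.asarray(y, dtype=float))
    xs = np.linspace(v_min, v_max, grid)
    U = (xs - v_min) / (v_max - v_min)                   # uniform CDF
    V = np.searchsorted(y, xs, side="right") / len(y)    # empirical CDF
    return float(np.sum(np.abs(U - V)) * (xs[1] - xs[0]))

def wasserstein_greedy(values, k, v_min, v_max):
    """Repeatedly add the candidate whose inclusion most decreases the WDUD value."""
    S, rest = [], list(range(len(values)))
    for _ in range(k):
        best = min(rest, key=lambda i: wdud([values[j] for j in S] + [values[i]],
                                            v_min, v_max))
        S.append(best)
        rest.remove(best)
    return S
```

On a small example, the greedy loop first picks a value near the middle of the interval and then fills in values toward both ends, driving the empirical distribution toward uniformity.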
Existing structure-based selection methods and evaluation criterion
We also use MaxMin and MaxSum as baseline methods, which greedily select molecules based on dissimilarities of molecular descriptors. We use MACCS keys and ECFP as descriptors and define the dissimilarity of those descriptors based on the Tanimoto coefficient, i.e., given the ith and jth descriptors, we compute the Tanimoto similarity between them and subtract it from 1 to obtain dissimilarity values \(d_{i,j}\). Given dissimilarity values \(d_{i,j}\), MaxSum and MaxMin operate similarly to the greedy algorithm; formally, MaxMin (resp. MaxSum) sequentially adds the \(i\in N\setminus S\) with the largest \(\min _{j\in S} d_{i,j}\) (resp. \(\sum _{j\in S} d_{i,j}\)) value to S while \(|S| < k\) holds, where the first molecule \(i\in N\) is set to the one with the largest \(\sum _{j\ne i}d_{i,j}\) value. We denote the MaxMin and MaxSum methods by MM and MS, respectively, and MACCS keys and ECFP by MK and EF, respectively, for short. We use, for example, MS-MK to represent the MaxSum method that uses MACCS keys as descriptors.
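A minimal sketch of the Tanimoto dissimilarity and the MaxMin rule follows. Here `fps` is a list of hypothetical binary fingerprint vectors; a real pipeline would compute MACCS keys or ECFP with a cheminformatics toolkit such as RDKit:

```python
import numpy as np

def tanimoto_dissimilarity(a, b):
    """1 - Tanimoto coefficient of two binary fingerprint vectors."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return 1.0 - (inter / union if union else 0.0)

def maxmin_select(fps, k):
    """MaxMin: seed with the molecule of largest total dissimilarity, then
    repeatedly add the molecule farthest from its nearest selected neighbor."""
    n = len(fps)
    d = np.array([[tanimoto_dissimilarity(fps[i], fps[j]) for j in range(n)]
                  for i in range(n)])
    S = [int(np.argmax(d.sum(axis=1)))]
    while len(S) < k:
        rest = [i for i in range(n) if i not in S]
        S.append(max(rest, key=lambda i: min(d[i][j] for j in S)))
    return S
```

Replacing the `min` in the selection key with a `sum` over `S` yields the MaxSum variant described above.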
When evaluating selection methods in the experiments, we also use the mean pairwise dissimilarity (MPD), the existing structure-based criterion, in addition to WDUD. Specifically, given dissimilarity values \(d_{i,j}\) for all pairs among n molecules, we compute an MPD value as \(\frac{1}{\left( {\begin{array}{c}n\\ 2\end{array}}\right) } \sum _{i < j} d_{i,j}\). We define the dissimilarity values by the Tanimoto dissimilarity of MACCS keys or ECFP. Depending on the choice of descriptors, we denote the diversity criterion by MPD-MK or MPD-EF, respectively.
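Given a precomputed dissimilarity matrix for the selected molecules, MPD is simply the average over the \(\binom{n}{2}\) unordered pairs (a sketch):

```python
def mean_pairwise_dissimilarity(d):
    """MPD: average of d[i][j] over all unordered pairs of n selected molecules."""
    n = len(d)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(d[i][j] for i, j in pairs) / len(pairs)
```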
Details of computational experiments
We conducted computational experiments with the QM9 dataset in MoleculeNet^{57,58}, which is a quantum mechanical dataset with labels of energetic, electronic, and thermodynamic properties computed based on density functional theory (DFT). Each molecule in the dataset is associated with 12 property values: dipole moment in Debye (mu), isotropic polarizability in \({\hbox {Bohr}}^{3}\) (alpha), highest occupied molecular orbital energy in Hartree (HOMO), lowest unoccupied molecular orbital energy in Hartree (LUMO), gap between HOMO and LUMO in Hartree (gap), electronic spatial extent in \({\hbox {Bohr}}^{2}\) (R2), zero-point vibrational energy in Hartree (ZPVE), internal energy at 0 K in Hartree (U0), internal energy at 298.15 K in Hartree (U), enthalpy at 298.15 K in Hartree (H), free energy at 298.15 K in Hartree (G), and heat capacity at 298.15 K in cal/(mol K) (Cv). Following the previous work^{41}, we used all 12 properties to train GNNs. The QM9 dataset contains 133,885 molecules, and we randomly divided them into three datasets as is done in the previous study^{41}: 80% (107,108 molecules) for training a GNN, 10% (13,389 molecules) for validating prediction accuracy of the trained GNN, and 10% (13,388 molecules) for a test dataset, from which we selected molecules. Each method selected 133 molecules (1% of the test data) from the test data. Note that when training GNNs, we did not use the test data. We thus created the situation where we select molecules whose property values are unknown in advance.
The diversity of property values of selected molecules was evaluated by computing WDUD values for each molecular property. In this evaluation, we used the above 12 properties in the QM9 dataset. However, among the 12 properties, the use of U0, U, H, and G would be inappropriate for evaluating chemical diversity because their magnitudes depend mostly on the system size. For example, these values are more similar between acetone and acetamide, isoelectronic molecules, than between acetone and methyl ethyl ketone, even though most chemists would say that acetone and methyl ethyl ketone are both alkyl ketones and chemically more similar. Therefore, we additionally used molecular energy values divided by the number of electrons (denoted by \(N_{\text{{elec}}}\)) in the evaluation to weaken the system-size dependence and focus more on chemical diversity. These values for U0, U, H, and G are denoted by U0/\(N_{\text{{elec}}}\), U/\(N_{\text{{elec}}}\), H/\(N_{\text{{elec}}}\), and G/\(N_{\text{{elec}}}\), respectively. Similarly, we used variants of the two values, ZPVE and Cv, divided by \(N_{\text{{mode}}} = 3N_{\text{{atom}}} - 6\), where \(N_{\text{{atom}}}\) is the number of atoms. These values for ZPVE and Cv are denoted by ZPVE/\(N_{\text{{mode}}}\) and Cv/\(N_{\text{{mode}}}\), respectively. Consequently, for evaluating molecular diversity based on WDUD values, we used 18 properties in total: the 12 properties of the QM9 dataset and the additional six properties, ZPVE/\(N_{\text{{mode}}}\), U0/\(N_{\text{{elec}}}\), U/\(N_{\text{{elec}}}\), H/\(N_{\text{{elec}}}\), G/\(N_{\text{{elec}}}\), and Cv/\(N_{\text{{mode}}}\).
We also conducted computational experiments in the out-of-domain setting. That is, while the GNN is trained with the QM9 dataset, we select molecules from test datasets other than QM9, for which we know nothing about the target property labels. This setting is more challenging than the previous one since the test datasets are completely different from QM9; in particular, the target property labels differ from the aforementioned 12 properties of QM9. On the other hand, this setting is more realistic: GNNs are usually trained on some large dataset, while we often want to select molecules from new test datasets whose domains differ from that of the training dataset. In the experiments, we used three test datasets obtained from MoleculeNet: the Delaney dataset (ESOL)^{59}, the free solvation database (FreeSolv)^{60}, and the lipophilicity dataset (Lipop)^{61}. ESOL contains 1128 molecules labeled by log-scale water solubility in mol/L. FreeSolv contains 642 molecules labeled by experimentally measured hydration free energy in water in kcal/mol. Lipop contains 4200 molecules labeled by experimentally measured octanol/water distribution coefficient (logD). These property labels were used only when computing WDUD values for evaluation. For each of the three datasets, we selected 100 molecules and evaluated their diversity. Note that unlike the previous case, we select molecules without knowing what properties are used when computing WDUD values. Thus, this setting models a situation where we want to select molecules that are diverse regarding some unknown properties.
Results and discussion
We present the results obtained by the following molecular selection methods:

SubMoGNN (Submodularity-based Molecular selection with GNN-based molecular vectors) is our proposed method, which greedily maximizes the log-determinant function^{55} defined with GNN-based molecular vectors.

WGGNN (Wasserstein Greedy with GNN-based prediction) is our new benchmark method. It selects molecules by greedily minimizing the WDUD values, where molecular property values are predicted by the trained GNN.

MSMK is the existing MaxSum algorithm^{33} that uses MACCS keys^{24} as molecular descriptors.

MMMK is the existing MaxMin algorithm^{34,35} that uses MACCS keys^{24} as molecular descriptors.

MSEF is the existing MaxSum algorithm^{33} that uses ECFP^{23} as molecular descriptors.

MMEF is the existing MaxMin algorithm^{34,35} that uses ECFP^{23} as molecular descriptors.

Random selects molecules randomly according to the distribution of a test dataset.
We briefly mention the position of each method. WGGNN is a benchmark method specialized for the diversity of property values, while the structure-based baseline methods, MSMK, MMMK, MSEF, and MMEF, focus on the diversity of molecular structures. Our SubMoGNN is intermediate between the two kinds of methods and can leverage information on both molecular structures and properties, since the GNN-based molecular vectors are generated by taking molecular graphs as input and training the GNN on property-prediction tasks.
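To make the selection step of SubMoGNN concrete, the sketch below greedily maximizes a log-determinant objective \(f(S) = \log\det(I + X_S X_S^\top)\) over molecular vectors. This is one common form of the log-determinant function; the paper's exact definition and its efficient implementation may differ, and the naive recomputation here is only for illustration:

```python
import numpy as np

def logdet_value(X, S):
    """f(S) = log det(I + X_S X_S^T), a monotone submodular
    diversity objective over the row vectors X_S (a common form;
    the paper's exact definition may differ)."""
    Xs = X[list(S)]
    k = len(S)
    _sign, val = np.linalg.slogdet(np.eye(k) + Xs @ Xs.T)
    return val

def greedy_select(X, k):
    """Naive greedy maximization: repeatedly add the vector giving
    the largest objective value."""
    S = []
    remaining = set(range(len(X)))
    for _ in range(k):
        best = max(remaining, key=lambda i: logdet_value(X, S + [i]))
        S.append(best)
        remaining.remove(best)
    return S
```

Because the objective is monotone and submodular, the greedy solution is guaranteed to attain at least \(1 - 1/e \approx 63\%\) of the optimal function value; note how, on duplicate vectors, the greedy step prefers a geometrically distinct (more diverse) vector.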
Property-based diversity evaluation with WDUD
We evaluated the diversity of property values of selected molecules by the Wasserstein distance to uniform distribution (WDUD). Note that a smaller WDUD value is better since it means the distribution of selected molecules is closer to being uniform.
Table 1 shows the WDUD values attained by the six methods for the aforementioned 18 properties. Since the results of SubMoGNN and WGGNN fluctuate due to the randomness caused when training GNNs, we performed five independent runs and calculated the mean and standard deviation. The results of Random also vary from trial to trial, and thus we present the mean and standard deviation of five independent trials. Figure 4 visualizes the results in Table 1, where the WDUD values are rescaled so that those of Random become 1 for ease of comparison.
In this experiment, each method obtains a single set of molecules, for which we calculate the 18 WDUD values. Therefore, choosing a set of molecules that attains small WDUD values for some properties may result in large WDUD values for others. Such a choice of molecules does not meet our purpose, and it is better to balance the trade-off so that none of the 18 WDUD values becomes too large. A reasonable way to check whether this is achieved is to compare the results with those of Random. If the WDUD values of some properties become larger than those of Random, the selected molecules are probably biased; that is, the diversity of some properties is sacrificed to achieve small WDUD values for other properties. On the other hand, WGGNN is expected to achieve almost the best WDUD values since it minimizes WDUD values directly (this, however, can result in non-diverse selection regarding aspects other than WDUD, as we will discuss later). Therefore, below we treat the WDUD values of WGGNN as benchmarks that are close to the best possible ones.
We first discuss the results of SubMoGNN and the existing structure-based methods in comparison with Random and WGGNN. Figure 4 shows that SubMoGNN, MSEF, and MMEF attained smaller WDUD values than Random for all molecular properties. This indicates that both our method and the ECFP-based methods were able to choose diverse molecules in terms of WDUD, even though they do not explicitly minimize WDUD. Since we did not feed the test dataset when training GNNs, the results suggest that the GNNs generalized well to unknown molecules and achieved diverse selection from the test dataset consisting of unknown molecules. In contrast to SubMoGNN and the ECFP-based methods, MSMK and MMMK resulted in larger WDUD values in mu than Random, as shown in Fig. 4a. That is, the selection methods based on MACCS keys failed to select diverse molecules with respect to mu values. This suggests that selection methods that use only structural information can sometimes result in non-diverse selection in terms of molecular property values. On the other hand, as expected, WGGNN achieved the smallest WDUD values in 12 out of the 18 properties. Surprisingly, however, SubMoGNN achieved better WDUD values than WGGNN in six properties, demonstrating the effectiveness of SubMoGNN for selecting molecules with diverse property values.
We then compare our SubMoGNN with the existing structure-based selection methods. Compared to the MaxMin-based methods (MMMK and MMEF), SubMoGNN achieved smaller WDUD values for all properties. SubMoGNN also outperformed the MaxSum-based methods (MSMK and MSEF) for all but six properties (U0, U, H, G, Cv, and ZPVE/\(N_{\text{mode}}\)). Note that U0, U, H, and G are related to molecular energies and their values are strongly correlated with each other; previous studies have reported that property prediction methods applied to the QM9 dataset exhibited almost the same performance on the four properties^{41}. This is consistent with our results in Fig. 4b, where each method attained almost the same performance on the four properties. Furthermore, when the energy-related properties are divided by \(N_{\text{elec}}\), MSMK and MSEF are outperformed by SubMoGNN (see the results on U0/\(N_{\text{elec}}\), U/\(N_{\text{elec}}\), H/\(N_{\text{elec}}\), and G/\(N_{\text{elec}}\) in Fig. 4c). In view of this, the MaxSum-based methods seem to have put too much weight on the diversity of properties correlated with \(N_{\text{elec}}\), which resulted in biased selections and degraded the WDUD values of mu. In summary, in terms of WDUD values, the overall performance of SubMoGNN is better than those of the existing structure-based methods.
Figure 5 shows property-value distributions of all molecules in the dataset (blue) and of molecules selected by each method (red). The horizontal and vertical axes represent property values and frequency, respectively. For ease of comparison, the histogram height is normalized to indicate density rather than count. We regard a set of molecules as diverse if its distribution is close to being uniform. As expected, the distribution of molecules selected by Random is close to the distribution of the original dataset. By contrast, SubMoGNN and MSMK selected molecules that appeared infrequently in the dataset, particularly for HOMO, R2, U0, and U0/\(N_{\text{elec}}\). As a result, the distributions of selected molecules became closer to being uniform than those of Random. Regarding the results of mu, both SubMoGNN and MSMK chose many molecules with near-zero mu values; this seems necessary for selecting diverse molecules regarding properties other than mu due to the aforementioned trade-off between properties. Nevertheless, MSMK chose too many molecules with near-zero mu values, resulting in a biased distribution. This visually explains why the WDUD value of MSMK in mu is larger than that of Random. Compared with MSMK, SubMoGNN selected more molecules with large mu values, which alleviated the bias and led to diverse selection in all properties. SubMoGNN selected more molecules with large R2 and high HOMO values than MSMK, and consequently SubMoGNN's distributions were closer to being uniform. In U0, however, MSMK selected more molecules with high U0 values than SubMoGNN, and MSMK's distribution was closer to being uniform than SubMoGNN's. By contrast, as regards U0/\(N_{\text{elec}}\), MSMK selected too many molecules with high \(N_{\text{elec}}\) values compared with SubMoGNN, resulting in a distribution that is farther from being uniform.
To conclude, by incorporating supervised learning of GNNs into the diverse molecular selection pipeline, our method can select diverse molecules regarding target molecular properties in the sense that their distributions are close to being uniform. On the other hand, if we use standard molecular descriptors (e.g., MACCS keys and ECFP) that encode only structural information of molecules, the selected molecules can be non-diverse regarding some molecular properties.
Structure-based diversity evaluation with MPD
We then evaluated the selection methods in terms of the diversity of molecular substructures. As a criterion for evaluating the diversity of molecular substructures, we used the mean pairwise dissimilarity (MPD), where molecular descriptors were given by MACCS keys or ECFP. We denote these criteria by MPDMK and MPDEF for short. A larger MPD value is better since it implies that the selected molecules are more dissimilar to each other. It should be noted that MSMK and MSEF greedily maximize MPDMK and MPDEF, respectively, and thus they are inherently advantageous in this setting. MMMK and MMEF also explicitly maximize the diversity calculated with MACCS keys and ECFP, respectively, and thus this setting is also favorable for them. By contrast, SubMoGNN and WGGNN use neither MACCS keys nor ECFP, and thus they have no inherent advantage over the structure-based methods.
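For reference, MPD over binary fingerprints (such as MACCS keys or ECFP) can be sketched as follows; the use of 1 minus the Tanimoto similarity as the pairwise dissimilarity is a common convention that we assume here, and the paper's exact measure may differ:

```python
import numpy as np
from itertools import combinations

def tanimoto_dissimilarity(a, b):
    """1 - Tanimoto similarity between two binary fingerprint vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    inter = np.count_nonzero(a & b)
    union = np.count_nonzero(a | b)
    # Two all-zero fingerprints are treated as maximally dissimilar here
    # (an edge-case convention of this sketch).
    return 1.0 - (inter / union if union else 0.0)

def mpd(fingerprints):
    """Mean pairwise dissimilarity over a set of fingerprints."""
    pairs = list(combinations(range(len(fingerprints)), 2))
    total = sum(tanimoto_dissimilarity(fingerprints[i], fingerprints[j])
                for i, j in pairs)
    return total / len(pairs)
```

A set of identical fingerprints gives MPD = 0, while a set of pairwise non-overlapping fingerprints gives MPD = 1, which is why larger MPD values indicate structurally more dissimilar selections.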
Table 2 shows the results. As expected, MSMK and MMMK, which explicitly aim to maximize the diversity calculated with MACCS keys, achieved high MPDMK values. In particular, MSMK attained a far higher MPDMK value than the others. This result is natural since MSMK has the inherent advantage of greedily maximizing MPDMK. As regards MPDEF, all methods achieved high MPD values. Note that although SubMoGNN and WGGNN used neither MACCS keys nor ECFP, they attained higher MPD values than Random and performed comparably to (sometimes outperforming) the structure-based methods. These results imply that selecting molecules with diverse property values is helpful in selecting molecules with diverse structures.
At this point, the effectiveness of selecting molecules with diverse predicted property values has been confirmed for the case where GNNs are trained on a QM9 training dataset and molecules are selected from a QM9 test dataset, i.e., the training and test datasets are in the same domain. In practice, however, we often encounter a situation where GNNs are trained on some large dataset, while we select molecules from new datasets whose domains differ from that of the training dataset. In such cases, the diversity of properties registered in the training dataset does not always imply the diversity of molecules in test datasets. Below we experimentally study such out-of-domain settings.
Experiments on out-of-domain setting
We performed experiments in the out-of-domain setting. Specifically, while we trained GNNs on the QM9 dataset as in the previous section, we selected molecules from other test datasets: ESOL, FreeSolv, and Lipop. SubMoGNN used molecular vectors generated by the trained GNN, and WGGNN selected molecules greedily to minimize the WDUD values of the QM9 properties predicted by the trained GNN. Note that we cannot train GNNs to predict ESOL, FreeSolv, and Lipop values since nothing about those properties is available. In other words, we consider training GNNs without knowing that they will be used for selecting diverse molecules from the ESOL, FreeSolv, and Lipop datasets. On the other hand, the structure-based descriptors, ECFP and MACCS keys, have nothing to do with the property labels of the test datasets. Therefore, the existing structure-based methods select molecules in the same way as in the previous section. Unlike the previous QM9 case, we selected 100 molecules independently for each of ESOL, FreeSolv, and Lipop since the three datasets consist of different molecules.
In this setting, since target property labels and structures of molecules in test datasets are unavailable in advance, we want to select diverse molecules regarding a wide variety of unknown molecular characteristics. To this end, selection methods should not overfit to certain molecular characteristics; they should select molecules that are diverse regarding various aspects, including both property values and structures.
Table 3 shows the WDUD values achieved by each method for ESOL, FreeSolv, and Lipop, and Fig. 6 visualizes the results. SubMoGNN and WGGNN selected molecules more diversely than Random, even though the GNN was fed no information on ESOL, FreeSolv, and Lipop. From the fact that WGGNN achieved small WDUD values, we can say that molecules with diverse ESOL, FreeSolv, and Lipop values can be obtained by selecting molecules with diverse QM9 property values. On the other hand, although the structure-based methods achieved small WDUD values for FreeSolv and Lipop, they selected less diverse molecules than Random on ESOL. This implies that, as with the case of mu values in the QM9 dataset, structure-based methods can result in non-diverse selection regarding some property values.
Table 4 and Fig. 7 present the MPDMK and MPDEF values for each dataset. SubMoGNN achieved higher MPD values than WGGNN and Random in all cases, and it performed comparably to the structure-based methods. On the other hand, WGGNN failed to outperform Random in terms of MPDMK on ESOL and Lipop. These results suggest that WGGNN does not always perform well regarding the diversity of structures in the out-of-domain setting. By contrast, the results of SubMoGNN imply that the GNN-based molecular vectors learned on the QM9 dataset generalize well to out-of-domain datasets and successfully convey information on both molecular properties and structures, thus enabling SubMoGNN to select diverse molecules regarding both properties and structures even in the out-of-domain setting.
Note that in the above QM9 and out-of-domain experiments, only SubMoGNN achieved better performance than Random in all criteria. This suggests that the proposed combination of log-determinant function maximization and the GNN-based descriptors, which are designed to represent both molecular properties and structures, is effective for delivering stable performance in diverse molecular selection regarding various aspects of molecules.
Discussion on MaxSum and MaxMin with GNN vectors and effects of normalization
The previous experiments confirmed that GNN-based molecular vectors can capture both properties and structures of molecules, which enabled our SubMoGNN to select diverse molecules. In this additional experiment, we again use the QM9 training and test datasets and present an ablation study to see how the choice of selection method affects outputs when GNN-based molecular vectors are used as descriptors. Moreover, in an attempt to elucidate how the black-box GNN-based vector generation affects the molecular selection phase, we take a closer look at the norms of molecular vectors generated by GNNs and examine how the normalization procedure changes the behavior of selection methods.
In this section, all selection methods use GNN-based molecular vectors, and thus we denote our SubMoGNN simply by SubMo. We use three selection methods: SubMo, MaxSum (MS), and MaxMin (MM). Each method employs GNN-based molecular vectors with and without normalization, denoted by “w/ N” and “w/o N”, respectively, as molecular descriptors. For MaxSum and MaxMin, the pairwise dissimilarity between two vectors is given by their Euclidean distance.
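The MaxSum and MaxMin baselines and the normalization step can be sketched as follows for vectors with Euclidean distance; the initialization from the largest-norm vector is our assumption for illustration, and the cited implementations may differ:

```python
import numpy as np

def greedy_maxsum(X, k):
    """MaxSum greedy: each step adds the point with the largest total
    Euclidean distance to the already selected points."""
    S = [int(np.argmax(np.linalg.norm(X, axis=1)))]  # assumed start point
    while len(S) < k:
        d = np.linalg.norm(X[:, None, :] - X[None, S, :], axis=2).sum(axis=1)
        d[S] = -np.inf  # never reselect
        S.append(int(np.argmax(d)))
    return S

def greedy_maxmin(X, k):
    """MaxMin greedy: each step adds the point whose minimum distance
    to the selected set is largest."""
    S = [int(np.argmax(np.linalg.norm(X, axis=1)))]  # assumed start point
    while len(S) < k:
        d = np.linalg.norm(X[:, None, :] - X[None, S, :], axis=2).min(axis=1)
        d[S] = -np.inf
        S.append(int(np.argmax(d)))
    return S

def normalize(X):
    """Row-wise L2 normalization (the "w/ N" variant)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```

Note how both greedy rules favor points far from the selected set; since distances from the origin-heavy bulk grow with vector norm, MaxSum in particular drifts toward large-norm vectors unless normalization removes the norm information, which matches the behavior discussed below.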
Table 5 presents the WDUD values achieved by each method. SubMo and MS tended to achieve smaller WDUD values than MM. The table also shows that normalization did not always yield better WDUD values. From the WDUD values alone, it may seem that MaxSum without normalization (MS w/o N) performs as well as (or better than) SubMo w/ and w/o N. As discussed below, however, the superiority of MS w/o N is brittle, and it can result in non-diverse selection in some cases.
Figure 8 illustrates the relationship between property values and the norms of molecular vectors generated by a GNN without normalization. The vertical and horizontal axes indicate norms and property values, respectively. The blue and red points correspond to all molecules in the test dataset and the selected molecules, respectively. The green vertical lines show the means of the property values in the test dataset. The figures suggest a correlation between the norm and the deviation of property values from their means. That is, GNNs tend to assign large norms to molecules whose property values are far from the means, while molecules with small norms tend to have property values close to the means. This tendency suggests that GNNs convey the information of how far molecular property values are from the means through the norms of molecular vectors.
Since MS greedily maximizes the sum of pairwise dissimilarity values, it prefers to select molecular vectors that are distant from each other. As a result, MS tends to select molecular vectors with large norms, as we can confirm in the rightmost column of Fig. 8. In the case of the QM9 dataset, GNNs assigned large norms to some molecules whose property values were close to the means. Therefore, by simply selecting molecules with large norms, as MS did, molecules with diverse property values could be obtained. However, depending on the dataset and how GNNs are trained, the correlation between norms and property values can become much stronger. In such cases, MS cannot select molecules whose property values are close to the means, resulting in biased selection.
Compared with MS, SubMo selected molecular vectors with various norms. Therefore, even if norms and property values are strongly correlated, SubMo is expected to select molecules with more diverse property values than MS. As regards normalization, norms of vectors selected by SubMo w/ N were almost the same as those selected by SubMo w/o N, while there is a clear difference between MS w/ N and MS w/o N.
To conclude, no single selection method is superior in all cases, and thus we should employ a selection method suitable for the dataset at hand. Nevertheless, MaxSum seems to rely much more heavily on the norms of molecular vectors than SubMo, and thus we need to carefully examine molecular vectors when using MaxSum. We finally emphasize that a notable advantage of SubMo is its theoretical guarantee: the log-determinant function value achieved by the greedy algorithm is always at least 63% of the optimal function value.
Detailed experimental settings and running times
We trained Attentive FP with the following hyperparameters: radius = 2, T = 2, fingerprint dimension = 280, dropout = 0.5, weight decay = 0, learning rate = 0.0004, and epoch = 300, where radius and T are the number of times the hidden states are updated in the message passing and readout phases, respectively. In the QM9 experiment, the size of the matrix \({\varvec{X}}\) in the logdeterminant function is \(13388\times 280\) (the number of candidates \(\times \) the dimension of molecular vectors). In the ESOL, FreeSolv, and Lipop experiments, the sizes of \({\varvec{X}}\) are \(1128\times 280\), \(642\times 280\), and \(4200\times 280\), respectively.
We performed computational experiments on an Amazon EC2 P3.2xlarge instance, which has a single Tesla V100 GPU (16 GB) and 8 vCPUs (61 GB of memory). In the QM9 experiments, training the GNN took 2900 s. For selecting molecules, SubMoGNN, WGGNN, MSMK, MMMK, MSEF, and MMEF took 5.1, 5700, 240, 240, 200, and 200 s, respectively. In the ESOL experiments, SubMoGNN, WGGNN, MSMK, MMMK, MSEF, and MMEF took 0.56, 650, 1.7, 1.7, 1.3, and 1.3 s, respectively. In the FreeSolv experiments, they took 0.51, 420, 0.57, 0.57, 0.40, and 0.40 s, respectively, in the same order. In the Lipop experiments, they took 0.58, 2600, 26, 26, 35, and 36 s, respectively, in the same order. Note that while the greedy algorithm in SubMoGNN used a specialized implementation technique^{56}, the other algorithms are implemented naively and thus have room for acceleration. Therefore, the presented running times are only for reference. Faster implementation of the baseline algorithms is beyond the scope of this paper.
Conclusion
We addressed the problem of selecting diverse molecules for facilitating chemical space exploration. Our method consists of two steps: construction of molecular vectors using a GNN and selection of molecules by maximizing a submodular function defined with the molecular vectors. Owing to the use of GNNs trained on property prediction tasks, we can take both molecular structures and properties into account when selecting diverse molecules. Moreover, the submodular function maximization framework enables the greedy algorithm to return subsets of molecules that are mathematically guaranteed to be nearly optimal. We also introduced a new evaluation criterion, the Wasserstein distance to uniform distribution (WDUD), to measure the diversity of sets of molecules based on property values. Computational experiments on the QM9 dataset showed that our method could successfully select diverse molecules as regards property values. Regarding the diversity of molecular structures, it performed comparably to the existing structure-based methods (MaxSum and MaxMin with MACCS keys and ECFP). Experiments in the out-of-domain setting demonstrated that our method with the GNN trained on the QM9 dataset could select molecules with diverse property values and structures from out-of-domain datasets: ESOL, FreeSolv, and Lipop. To conclude, our diverse selection method can help researchers efficiently explore the chemical space, which will bring great advances in searching for novel chemical compounds and reactions.
We finally mention some future directions. In this study, we evaluated the diversity of molecular properties using the 12 properties of the QM9 dataset, ESOL, FreeSolv, and Lipop. On the other hand, molecular properties used in medicinal chemistry, e.g., pharmacokinetic properties (logP), drug-likeness (QED), and biological activities, are important in the field of virtual screening. Although the goal of diverse selection is different from that of virtual screening, evaluating diverse selection methods based on properties such as logP and QED may offer an interesting direction of study. Mathematically, studying the relationship between the log-determinant function value and the WDUD value is interesting future work.
Data availability
The source code of our method, implemented in Python 3.7.10, is available at https://github.com/tomotomonakanaka/SUBMO.git. We converted SMILES into MACCS keys, ECFP, and molecular graphs by using RDKit 2018.09.1, which is available at https://www.rdkit.org/. The QM9 dataset was downloaded from Xiong’s GitHub repository (https://github.com/OpenDrugAI/AttentiveFP). The ESOL, FreeSolv, and Lipop datasets were downloaded through DeepChem^{62}. GNNs were implemented using PyTorch 1.8.0^{63}, DGL 0.5.3^{64}, and DGL-LifeSci 0.2.6 (available at https://github.com/awslabs/dgl-lifesci).
References
Kirkpatrick, P. & Ellis, C. Chemical space. Nature 432, 823–823 (2004).
Reymond, J.L., Ruddigkeit, L., Blum, L. & van Deursen, R. The enumeration of chemical space. WIREs Comput. Mol. Sci. 2, 717–733 (2012).
Reymond, J.L. & Awale, M. Exploring chemical space for drug discovery using the chemical universe database. ACS Chem. Neurosci. 3, 649–657 (2012).
Reymond, J.L. The chemical space project. Acc. Chem. Res. 48, 722–730 (2015).
AlainDominique, G. Diversity in medicinal chemistry space. Curr. Top. Med. Chem. 6, 3–18 (2006).
Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structurebased drug design: A molecular modeling perspective. Med. Res. Rev. 16, 3–50 (1996).
Ertl, P. Cheminformatics analysis of organic substituents: Identification of the most common substituents, calculation of substituent properties, and automatic identification of druglike bioisosteric groups. J. Chem. Inf. Comput. Sci. 43, 374–380 (2003).
Hamill, K. A., Nelson, R. D., Vander Stouw, G. G. & Stobaugh, R. E. Chemical abstracts service chemical registry system. 10. Registration of substances from pre1965 indexes of chemical abstracts. J. Chem. Inf. Comput. Sci. 28, 175–179 (1988).
American Chemical Society. CAS—Chemical abstracts service—Database counter. http://web.cas.org/cgibin/regreport.pl (Accessed 31 January 2021)
Kim, S. et al. PubChem 2019 update: Improved access to chemical data. Nucleic Acids Res. 47, D1102–D1109 (2019).
Irwin, J. J. & Shoichet, B. K. ZINC: A free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177–182 (2005).
Takeda, S., Kaneko, H. & Funatsu, K. Chemicalspacebased de novo design method to generate druglike molecules. J. Chem. Inf. Model. 56, 1885–1893 (2016).
SanchezLengeling, B. & AspuruGuzik, A. Inverse molecular design using machine learning: Generative models for matter engineering. Science 361, 360–365 (2018).
GómezBombarelli, R. et al. Design of efficient molecular organic lightemitting diodes by a highthroughput virtual screening and experimental approach. Nat. Mater. 15, 1120–1127 (2016).
Ahneman, D. T., Estrada, J. G., Lin, S., Dreher, S. D. & Doyle, A. G. Predicting reaction performance in c–n crosscoupling using machine learning. Science 360, 186–190 (2018).
Zahrt, A. F. et al. Prediction of higherselectivity catalysts by computerdriven workflow and machine learning. Science 363, eaau5631 (2019).
Gillet, V. J. Diversity selection algorithms. WIREs Comput. Mol. Sci. 1, 580–589 (2011).
Lajiness, M. & Watson, I. Dissimilaritybased approaches to compound acquisition. Curr. Opin. Chem. Biol. 12, 366–371 (2008).
Rognan, D. The impact of in silico screening in the discovery of novel and safer drug candidates. Pharmacol. Ther. 175, 47–66 (2017).
Gorgulla, C. et al. An opensource drug discovery platform enables ultralarge virtual screens. Nature 580, 663–668 (2020).
Grygorenko, O. O., Volochnyuk, D. M., Ryabukhin, S. V. & Judd, D. B. The symbiotic relationship between drug discovery and organic chemistry. Chem. Eur. J. 26, 1196–1237 (2020).
Maldonado, A. G., Doucet, J. P., Petitjean, M. & Fan, B.T. Molecular similarity and diversity in chemoinformatics: From theory to applications. Mol. Divers. 10, 39–79 (2006).
Rogers, D. & Hahn, M. Extendedconnectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Symyx Technologies Inc. MACCS keys.
Daylight Chemical Information Systems, Inc. Daylight fingerprints.
Tanimoto, T. T. An Elementary Mathematical Theory of Classification and Prediction (International Business Machines Corporation, 1958).
GarciaHernandez, C., Fernández, A. & Serratosa, F. Ligandbased virtual screening using graph edit distance as molecular similarity measure. J. Chem. Inf. Model. 59, 1410–1421 (2019).
Öztürk, H., Ozkirimli, E. & Özgür, A. A comparative study of SMILESbased compound similarity functions for drug–target interaction prediction. BMC Bioinform. 17, 128 (2016).
Cao, Y., Jiang, T. & Girke, T. A maximum common substructurebased algorithm for searching and predicting druglike compounds. Bioinformatics 24, i366–i374 (2008).
Fukutani, T., Miyazawa, K., Iwata, S. & Satoh, H. GRMSD: Root mean square deviation based method for threedimensional molecular similarity determination. Bull. Chem. Soc. Jpn. 94, 655–665 (2021).
Keller, B., Lesnick, M. & Willke, T. L. Persistent homology for virtual screening. ChemRxiv (2018).
Lajiness, M. S. Molecular SimilarityBased Methods for Selecting Compounds for Screening 299–316 (Nova Science Publishers Inc., 1990).
Holliday, J. D., Ranade, S. S. & Willett, P. A fast algorithm for selecting sets of dissimilar molecules from large chemical databases. Quant. Struct.Act. Relat. 14, 501–506 (1995).
Snarey, M., Terrett, N. K., Willett, P. & Wilton, D. J. Comparison of algorithms for dissimilaritybased compound selection. J. Mol. Graph. Model. 15, 372–385 (1997).
Agrafiotis, D. K. & Lobanov, V. S. An efficient implementation of distancebased diversity measures based on \(k\)–\(d\) trees. J. Chem. Inf. Comput. Sci. 39, 51–58 (1999).
Terayama, K. et al. Pushing property limits in materials discovery via boundless objectivefree exploration. Chem. Sci. 11, 5959–5968 (2020).
Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: A metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58, 1736–1741 (2018).
Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. Adv. Neural Inf. Process. Syst. 28, 2224–2232 (2015).
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. Proc. 34th Int. Conf. Mach. Learn. 70, 1263–1272 (2017).
Schütt, K. T. et al. Schnet: A continuousfilter convolutional neural network for modeling quantum interactions. Adv. Neural Inf. Process. Syst. 30, 991–1001 (2017).
Xiong, Z. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 63, 8749–8760 (2020).
Rahaman, O. & Gagliardi, A. Deep learning total energies and orbital energies of large organic molecules using hybridization of molecular fingerprints. J. Chem. Inf. Model. 60, 5971–5983 (2020).
Hwang, D. et al. Comprehensive study on molecular supervised learning with graph neural networks. J. Chem. Inf. Model. 60, 5936–5945 (2020).
Nemhauser, G. L., Wolsey, L. A. & Fisher, M. L. An analysis of approximations for maximizing submodular set functions-I. Math. Program. 14, 265–294 (1978).
Krause, A. & Golovin, D. Submodular Function Maximization 71–104 (Cambridge University Press, 2014).
Sharma, D., Kapoor, A. & Deshpande, A. On greedy maximization of entropy. Proc. 32nd Int. Conf. Mach. Learn. 37, 1330–1338 (2015).
Balkanski, E., Qian, S. & Singer, Y. Instance specific approximations for submodular maximization. Proc. 38th Int. Conf. Mach. Learn. 139, 609–618 (2021).
Zhou, L. et al. Graph neural networks for decentralized multirobot submodular action selection. arXiv preprint. arXiv:2105.08601 (2021).
Vaserstein, L. N. Markov processes over denumerable products of spaces, describing large systems of automata. Probl. Peredachi Inf. 5, 64–72 (1969).
Peyré, G. & Cuturi, M. Computational optimal transport: With applications to data science. Found. Trends Mach. Learn. 11, 355–607 (2019).
Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 74–81 (ACL, 2004).
Lin, H. & Bilmes, J. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 510–520 (ACL, 2011).
Cornuejols, G., Fisher, M. L. & Nemhauser, G. L. Location of bank accounts to optimize float: An analytic study of exact and approximate algorithms. Manag. Sci. 23, 789–810 (1977).
Bai, W., Bilmes, J. & Noble, W. S. Submodular generalized matching for peptide identification in tandem mass spectrometry. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 1168–1181 (2019).
Kulesza, A. & Taskar, B. Determinantal Point Processes for Machine Learning (Now Publishers Inc., 2012).
Chen, L., Zhang, G. & Zhou, E. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In Advances in Neural Information Processing Systems, vol. 31 (eds Bengio, S. et al.) 5627–5638 (Curran Associates, Inc., 2018).
Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 140022–140029 (2014).
Wu, Z. et al. MoleculeNet: A benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Delaney, J. S. ESOL: Estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Comput. Sci. 44, 1000–1005 (2004).
Mobley, D. L. & Guthrie, J. P. FreeSolv: A database of experimental and calculated hydration free energies, with input files. J. Comput. Aided Mol. Des. 28, 711–720 (2014).
Wenlock, M. & Tomkinson, N. Experimental in vitro DMPK and physicochemical data on a set of publicly disclosed compounds. https://doi.org/10.6019/CHEMBL3301361 (2015).
Ramsundar, B. et al. Deep Learning for the Life Sciences (O’Reilly Media, 2019).
Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
Wang, M. et al. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint. arXiv:1909.01315 (2019).
Acknowledgements
This work was supported by JST ERATO Grant JPMJER1903. Support was also provided by the Institute for Chemical Reaction Design and Discovery (ICReDD), which was established by the World Premier International Research Initiative (WPI), MEXT, Japan.
Author information
Contributions
S.S., K.F., and Y.H. formulated the problem and developed a basic method. T.N. made substantial contributions to conceptual design and conducted computational experiments with extensive help from S.S., K.F., and Y.H. T.N., S.S., K.F., and Y.H. analyzed the computed results. S.M. and S.I. improved the analysis of the results. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Nakamura, T., Sakaue, S., Fujii, K. et al. Selecting molecules with diverse structures and properties by maximizing submodular functions of descriptors learned with graph neural networks. Sci. Rep. 12, 1124 (2022). https://doi.org/10.1038/s41598-022-04967-9
This article is cited by

A semi-automated material exploration scheme to predict the solubilities of tetraphenylporphyrin derivatives
Communications Chemistry (2022)