Chemical property prediction under experimental biases

Predicting the chemical properties of compounds is crucial in discovering novel materials and drugs with specific desired characteristics. Recent significant advances in machine learning technologies have enabled automatic predictive modeling from past experimental data reported in the literature. However, these datasets are often biased because of various reasons, such as experimental plans and publication decisions, and the prediction models trained using such biased datasets often suffer from over-fitting to the biased distributions and perform poorly on subsequent uses. Hence, this study focused on mitigating bias in the experimental datasets. We adopted two techniques from causal inference combined with graph neural networks that can represent molecular structures. The experimental results in four possible bias scenarios indicated that the inverse propensity scoring-based method and the counter-factual regression-based method made solid improvements.

Predicting the chemical properties of compounds is crucial in discovering novel materials and drugs with specific desired characteristics. Various computational approaches, including those based on density functional theory, have been widely used to accelerate the discovery process; however, they remain expensive and time consuming. In recent decades, researchers have shifted to data-driven approaches by fast-progressing machine learning technologies 1 . Recently, this trend has been further accelerated by the remarkable development of deep learning; in particular, graph neural networks (GNNs) 2,3 have achieved remarkable performance in predicting chemical properties via automatic feature extraction from molecular structures represented as graphs [4][5][6][7][8][9][10] . Their applications have expanded to various tasks, such as molecular generation [11][12][13] , molecular explanation 14,15 , and analysis of inter-molecular interactions 16,17 .
Accurate predictive models often rely on large-scale labeled datasets; they are frequently collections of knowledge (e.g., experimental results reported on scientific papers) that are the product of extensive scientific efforts. Unsurprisingly, scientists do not uniformly sample molecules from a large chemical space at random nor based on their natural distribution (i.e., the actual distribution of molecules existing in the chemical space) . Rather, their decisions on experimental plans or publication of results are biased due to physical, economic, or scientific reasons. For instance, a large proportion of molecules are not investigated experimentally because of molecular mechanics-related factors, such as solubility 18 , weights 19 , toxicity 20 , and side effects, or molecular structure-related factors, such as crystals 21 . In the pharmaceutical domain, drug likeness is an important factor for target selection, as exemplified by "Lipinski's rule of five" 22 . Further, concerns about the cost and availability of molecules can be some reasons to exclude certain groups of molecules. Conversely, popularity considerations based on current research trends 23 and the experimental methods wherein each lab specializes 24 influence the selection of compounds. These propensities related to researchers' experience and knowledge can contribute to more efficient search and discovery in the chemical space; however, they influence the data in an undesirable manner. Biases from human scientific research result in datasets that are sampled from distributions that differ from the natural ones. Thus, prediction models trained using such biased datasets suffer from over-fitting to the biased distributions, leading to poor performance on subsequent uses [25][26][27][28] .
Evidence on the existence of bias in scientific experiments and their harmful effects has been reported [18][19][20][21][22][23][24][25][26][27][28] . However, almost no attempts to address this challenge have been found. The problems of learning prediction models, when the distributions of the training and test datasets are different, are called domain adaptation, covariate shift adaptation 29 , or transfer learning 30 , and they have been a major topic in machine learning. Recently, deep neural networks for domain adaptation based on the concept of domain-invariant representation learning were proposed [31][32][33][34][35][36][37] , which have been primarily applied in the study of images. The problem of biased observations is also discussed in the context of causal inference. Inverse propensity scoring (IPS) is a general method from causal inference 38,39 , which has been successfully applied to various applications such as recommender systems 39,40 , natural language processing 41 , and treatment estimation in different fields ranging from healthcare 42 , economy 43 , and education 44 . Meanwhile, the domain-invariant representation learning concept is also introduced in the • We first address the problem of predicting properties of chemical compounds under experimental biases.
• We introduce two bias mitigation techniques, IPS and CFR, combined with GNN-based prediction of chemical properties. • We validated the two proposed approaches using various experimentally biased sampling scenarios and demonstrated that both of them improves the predictive performance significantly.

Results
Problem setting of predicting chemical properties under experimental biases. We assume a that includes N molecular graphs, where G i ∈ G is a molecular graph (biasedly) sampled from a large chemical space G , and y i ∈ R is the target chemical property. Each molecular graph G i = (V i , E i , σ i ) ∈ G has a set of nodes (i.e., atoms) V i , a set of edges (i.e., bonds) E i ⊆ V i × V i , and a label function σ : V i ∪ E i → L . Here, L is the set of possible node labels (i.e., atom types such as hydrogen and oxygen) and edge labels (i.e., bond types such as double bond and aromatic bond). Our aim is to obtain a prediction function f : G → R that predicts a particular target chemical property over the chemical space G that we are interested in. We assume that we have a uniformly randomly sampled part of (or possibly the entire) . This problem is difficult because the training data are constructed from past experiments reported in the literature; therefore, they are significantly biased with respect to the uniform distribution over the chemical space The left-hand figure shows the biased and unbiased distribution compared with the natural universe distribution of a chemical domain or sub-domain. The right-hand figure shows the proposed methods to train GNNs for chemical property prediction. Our aim was to apply bias cancelling techniques for GNNs to achieve significantly lower errors (i.e., MAE) when tested on a randomly sampled test dataset, whose distribution is similar to nature. G is the molecular graph, and y k is the value of the k-th chemical property. www.nature.com/scientificreports/ G because of the decisions implemented by researchers on experimental plans or publication options. Hence, there is no guarantee that the predictor derived from the biased training data has high predictive performance even on the chemical space G . Note that the D test can also be a biased sample; however, without loss of generality, we only assume it is a uniformly random subset of the molecules that we are interested in.
In summary, the inputs and outputs of the problem are as follows:  www.nature.com/scientificreports/ bias mitigation) with IPS and CFR, respectively; ** means the p-value was less than 0.01, and * means that it was less than 0.05. Note that, in the case of QM9, the result for the HOMO-LUMO gap (denoted by 'gap') in Scenario 3 was equivalent to that in Scenario 4 because QM9 contains HOMO-LUMO gap information, The overall results for all the scenarios indicated that the IPS approach improved the performance for many properties and scenarios; in particular the performance was statistically significantly improved for the five properties of QM9 (zvpe, u0, u298, h298, and g298) in all of the four scenarios, and for the three properties of QM9 (mu, alpha, cv) in three out of four scenarios, which indicated that IPS has solid effectiveness and potential in mitigating experimental biases on these tasks. However, we found that there were some statistically insignificant comparison and even significant failure for the four properties of QM9 (homo, lumo, gap, r2) and the properties of ZINC, ESOL, and FreeSolv. These failures indicated that although IPS achieved improvements on most of the tasks, it was not stable. In addition, the improvements by IPS of QM9 were more significant for Scenarios 1 and 2 than Scenarios 3 and 4. The differences might be explained by the accuracy of the propensity score model; the accuracies in the four scenarios were 81.05%, 87.49%, 76.04%, and 79.02%, respectively, which meant that the propensity scores were more accurate in Scenarios 1 and 2.
The CFR approach achieved more remarkable predictive performance than the IPS approach for most of the properties and scenarios. For the four properties of QM9 (homo, lumo, gap, and r2) and the three properties of ZINC, ESOL, and FreeSolv in all of the four scenarios (except for lumo and the property of ESOL in Scenario 2), where IPS failed to improve the predictive performance, CFR achieved statistically significant improvements comparing to the baseline method. Further, for the four properties of QM9 (mu, alpha, zvpe, and cv) in all of the four scenarios (except for alpha in Scenario 1), the improvements achieved by CFR were more remarkable than IPS. However, we found that for the four properties of QM9 (u0, u298, h298, g298), where IPS achieved remarkable improvements, CFR failed to show statistically significant increasing or decreasing of predictive performance comparing to the baseline method, which indicated that CFR was not always effective.
In summary, both of the IPS and CFR approaches made solid improvements in mitigating experimental biases for most of the properties and scenarios, and CFR showed better performance than IPS. However, some failures also indicated that the IPS and CFR approaches were not stable.

Prediction accuracy depending on indicators for biased sampling.
We further investigated why the IPS and CFR approaches successfully corrected the bias by visualizing the prediction accuracy depending on the indicators used in the biased sampling scenarios. Because of space limitations, we only show the predictive performance for the chemical property mu and u0 of QM9 in For predicting mu, according to the MAE comparison results, we know that both IPS and CFR achieved significant improvements of predictive performance for most scenarios and CFR outperformed IPS. In Scenario 1, most of the molecules used for training had a number of atoms ranging from 5 to 15. For predicting mu of molecules with 15 or more atoms, primarily in the test dataset, IPS and CFR consistently outperformed the baseline method, and in addition, CFR consistently outperformed IPS. In Scenario 2, most of the molecules used for training had a proportion of double/triple/aromatic bonds higher than 0.4. Again, on those molecules with proportion less than 0.4, IPS and CFR consistently performed better than the baseline method, and CFR almost outperformed IPS. Similarly in Scenarios 3 for predicting mu, we observed the advantage of CFR for the smaller indicator values (less than 0.2) corresponding to the test datasets. However, in Scenario 4, we failed to observe the similar advantages of IPS and CFR for smaller indicator values, which was not consistent with our findings in Figure 2. This failure also indicated that the IPS and CFR approaches lacked stability on some tasks.
For predicting u0, we know that IPS achieved significant improvements of predictive performance for all scenarios while CFR failed. For predicting mu of molecules with 15 or more atoms, IPS consistently outperformed www.nature.com/scientificreports/ the baseline method while CFR almost failed. In Scenario2, IPS achieved better performance on those molecules with proportion less than 0.4, which was consistent with our findings. However, the performance of CFR was not consistent, which further indicated the low stability of this method on this task. In Scenario 3 and 4 for predicting u0, we observed the advantage of IPS for the smaller indicator values corresponding to the test datasets while we failed to observe the advantage of CFR, which was consistent with our findings. In summary, by visualizing the prediction accuracy depending on those indicators for biased sampling, we can partially conclude that, on most of the tasks, IPS and CFR improved the predictive performance on molecules with lower chance for observation, which led to the overall improvements.

Discussion
We considered the prediction of chemical properties from datasets that have experimental biases. We introduced two promising bias mitigation techniques by combining the recent developments in causal inference and GNNbased graph learning. We tested four practical biased sampling scenarios on the well-known QM9, ZINC, ESOL, and FreeSolv datasets for experiments. The experimental results confirmed that the two approaches improved the predictive performance in all scenarios on most of the tasks with statistical significance compared with the baseline method, which had no effort for bias mitigation. We also found that the more modern CFR approach outperformed the IPS approach for most of the tasks and scenarios. However, based on our experimental results, the IPS and CFR approaches were not stable enough, which should be further improved. We attribute some of this failure in part to the discussion on transferability and discriminability 52 . When the information that is useful for discriminating the biased and unbiased data is also useful for predicting some chemical properties, it is eliminated by IPS and CFR, resulting in poor discriminability (i.e., prediction power).
Further, as a new study, we would like to describe the future scope in two directions. The fundamental direction of our research is to mitigate experimental biases in scientific research. In this study, we only focused on the population of chemical compounds and the task of property prediction. Possible future topics are extensions to more complex prediction tasks under experimental biases, including chemical-chemical link prediction in the chemical domain, biological network prediction in the biological domain, and discovery of unknown compounds in the material science domain. Another possible direction is applications of more modern and robust methods in covariate shift, causal inference, and domain adaptation. Since the difficulty in our problem setting lies mainly in learning unbiased representation of biased observed instances, it would be promising to adapt general ideas from those related areas and to develop theories and techniques for our specific tasks.

Materials and methods
Datasets. We first used the chemical molecule dataset QM9 48   We further used three chemical molecule datasets, i.e., ZINC 49 , ESOL, and FreeSolv 50 . The ZINC dataset contains 250,000 drug-like commercially available molecules graphs with up to 38 heavy atoms. The task of ZINC is to predict a molecular property known as constrained solubility. ESOL is a small dataset consisting of water solubility data for 1,128 molecular graphs. FreeSolv is also a small dataset consisting of 643 molecular graphs with hydration free energy of small molecules in water. Note that molecule structures in these three datasets are far less than exhaustively enumerated. Thus, in contrast with QM9, there is no guarantee that the D test of these three datasets can be regarded as an unbiased subset of a chemical space . However, without loss of generality, we only assume the D test is unbiased or less biased (compared with the D train ) corresponding to the indicators that we used to induce biases.
Biased sampling scenarios. Because we cannot know why the compounds reported in the literature were selected, we simulated several possible scenarios to sample our training datasets. We considered the following four possible biased sampling scenarios: • Scenario 1 assumes that molecules with a smaller number of atoms have higher sampling chances, because smaller molecules are considered more common and better explored 22 . • Scenario 2 prefers molecules with smaller proportions of single bonds (i.e., with higher proportions of the other bond types). The spectral manifestations of chemical functional groups significantly depend on bond types within the group 53 . For simplicity, we assumed that less single-bonded molecules would have stronger spectral manifestations. • Scenario 3 selects molecules with higher values of the gap between the highest energy occupied molecular orbital (HOMO) and lowest energy unoccupied molecular orbital (LUMO), because larger HOMO-LUMO gap values often indicate higher stability and lower chemical reactivity 54 . • Scenario 4 assumes that scientists focus on compounds with high target property values based on their expertise and experience to conduct experiments. Further, compounds with higher targets values have higher sampling chances.
We first sampled a test dataset with a size of 10% of the entire dataset uniformly at random. Then, according to each of the four biased sampling scenarios, we sampled a biased training dataset from the remaining molecules. Each compound had a sampling chance determined by the sigmoid function depending on the corresponding sampling criteria. Take QM9 for example, in Scenario 1, the smallest molecules with three atoms had the largest sampling chances, while the ones with 27 atoms had the smallest chances. The larger the gain of the sigmoid curve becomes, the more the training and test datasets were separated; we tuned the gain to bring the average sampling probability to 10%. We used sampling without replacement; therefore, no graph belonged to the training and test sets simultaneously. We repeated the sampling procedure 30 times to build training and test datasets for statistical testing. Figure 5 shows

Methods.
We begin by reviewing the GNN architecture that is the fundamental building block of our model, and then we describe two bias canceling schemes: IPS and CFR. They are combined to solve the problem of chemical graph property prediction under experimental biases. We summarize the symbols used in this paper in Table 1. www.nature.com/scientificreports/ GNNs to predict chemical properties. Among many successful GNNs, we selected the message-passing GNN architecture proposed by Gilmer et al. 5 owing to its generality, simplicity, and fair performance in the chemical domain.
A GNN uses a graph G = (V, E, σ ) ∈ G as its input. In the t-th layer of the GNN, it updates the current set of the node representation vectors {h t v } v∈V i to {h t+1 v } v∈V i . Specifically, the representation vector of node v is updated depending on the current vectors of its neighbor nodes using the following update formula: where N(v) denotes the set of the neighbor nodes of v, and m t is a so-called message passing function that collects the information (i.e., the representation vectors) of the neighbors, and it is a linear function of (h t v , h t u ) depending on the edge label σ (v, u) . Further, a is an activation function, for which we select the rectified linear unit (ReLU) function, and u t is the vertex update function, for which we use the gated recurrent unit (GRU). The initial node representations {h 0 v } v∈V are initialized depending on their atom types. As the message passing operation is repeatedly applied, the node representation vector gradually incorporates information about its surrounding structure.
After being processed through T layers, the final node representations {h T v } v∈V are obtained; they are aggregated to the graph-level representation h G using a readout function h G = r {h T v } v∈V , where we could simply use summation, followed by a linear transformation as the readout function. However, in our implementation, we used a slightly more complex solution: a long short-term memory (LSTM) pooling layer, followed by a linear layer. The graph-level representation h G is passed to the final layer to achieve the outputs of the GNN, such as the chemical property, propensity score, and domain weights prediction, as we discuss later.
Bias correction using IPS. If we assume that no biases exist in our training dataset, the distributions of the training and target (test) datasets are identical. This implies that minimizing the empirical mean of the loss function ℓ , that is, 1 , directly obtains a good prediction model g that achieves a small expected loss E[ℓ(y, g(G))] for the test data. However, because the training dataset is sampled in a biased manner in our scenario, the minimization of the standard empirical loss results in a biased prediction model. Table 1. List of symbols used in this paper.

Symbol in PROBLEM SETTING Description
Test dataset of M molecules  www.nature.com/scientificreports/ A possible remedy to this problem is the use of a propensity score 38 to adjust the importance weight of each training instance. The propensity score π(G) of molecular graph G is the probability that the molecule is included in the experimental data. The loss for each molecule is inversely weighted with the propensity score, resulting in the modified objective function: With the correct propensity score function π , the weighted loss function is unbiased with respect to the uniform sampling from the chemical space G . As shown in Figure 6, the IPS approach involves two steps: propensity score estimation and chemical property prediction. We first estimated the propensity score function that represents the probability of each molecule being experimented. It is estimated from the biased training dataset and the unbiased test set (uniformly sampled from the chemical space G ). This step is frequently performed by solving a two-class probabilistic classification problem to classify the two datasets using the logistic loss (also called the cross-entropy loss). Note that the target property values are not used for propensity modeling. The second step estimates the chemical property prediction model with the loss function weighted using the inverse of the propensity score. We used the squared loss function as ℓ , and the problem is cast as a weighted regression problem. The propensity score function and chemical property prediction model are both implemented as GNNs because they use graph-structured molecules as their inputs.
Bias correction using CFR. The CFR approach is another option to correct sample selection biases. We adopted the method proposed by Hassanpour et al. 47 , which is an improvement of the best-known method proposed by Shalit et al. 45 . To apply CFR on our tasks, we introduce d i ∈ {0, 1} to indicate the treatment (i.e., domain in general) that G i belongs to (i.e., test or training). The method has four components: a feature extractor, label predictor (replacement of the original treatment outcome predictors), internal probability metric, and weight estimator, denoted by f F , f L , f IPM , and f W , respectively. Different from the original causal effects estimation setting, only the molecular graphs from D train will be passed trough the treatment outcome predictors because molecular graphs from D test have no labels. Thus, there is only one treatment outcome predictor exists in our model for predicting chemical property, which we call label predictor. As shown in Figure 7, in contrast with IPS, which has two separate GNN models, three paths exist in one single model; the first path is for predicting chemical property of G i ∈ D train , and it is specified by a label predictor f L concatenated after a feature extractor f F . The second path, which takes a batch of molecular graphs {G i } ⊆ D train ∪ D test as input, is for measuring distance between distributions, which is specified by internal probability metric f IPM concatenated after the feature extractor f F . The third path is for estimating weight of G i ∈ D train ∪ D test , which is specified by an estimator f W concatenated after the feature extractor f F ; Note that f F commonly appears in all of the paths. The final deliverable f that we desire is the composite function of f F and f L ; that is, f (G) = f L (f F (G) . The weight estimator and the internal probability metric are not our final objective, but they aid in extracting debiased representations of the inputs in the training phase.
In the training phase, the first path (consisting of f F and f L ) aims to predict the target chemical property, by minimizing where ℓ is the loss function for chemical properties, namely, the squared loss in our case. Term w i is the importance sampling weight. According to Hassanpour et al 47 , if we denote φ i = f W (f F (G i )) ∈ R 2 (outputs of two domains) and use softmax function to obtain probabilities, . www.nature.com/scientificreports/ is similar to inverse propensity score in the IPS approach. Note that in our case, d i is fixed to 1 while calculating w i because only G i ∈ D train will be passed through f L . In the second path (including f F and f IPM ), the internal probability metric f IPM aims to measure the distance between distributions of molecular graphs from D train and D test . The objective function for the second path is expressed as In the third path (including f F and f W ), the f W aims to correctly classify the domain (i.e., test or training) of the input. Denoting the loss function for domain classification (i.e., the cross-entropy loss in our case) by c, the objective function for the third path is expressed as Parameters of f F , f L , and f IPM are updated by minimizing the objective function Implementation details. We used the PyTorch Geometric library 55 to implement the GNN models. In QM9 and ZINC, each atom is encoded as a 13 and 28-dimensional vector (one hot), respectively, depending on the atom type. In ESOL and FreeSolv, we followed the settings in MoleculeNet 50 , where each atom is encode as a 9-dimensional vector. The message passing function m t depends on edge types and is 32-dimensional. The update function u t is a gated recurrent unit with 32 internal dimensions. The readout function r is a sequence-to-sequence layer followed by two linear layers with 32 internal dimensions. The IPS approach required two GNN models, one for the propensity score function and the other for chemical property prediction. In the former, we used T = 3 GNN layers and a logistic function as the final layer because the propensity score indicated the probability that a chemical compound is observed. We used the ADAM 56 optimizer with no weight decay and a batch size of 64 for each iteration. We used a learning rate updating scheduler with an initial learning rate of 1e − 5 ; this reduced the learning rate by a factor of 0.7 until the validation error stopped reducing for five training epochs. The validation datasets were randomly selected to include 20% of the training and test sets. The optimized model that achieved the lowest validation error on the validation dataset was applied to infer the importance. The chemical property prediction models also had T = 3 GNN layers, and the other training settings were almost the same as those for the propensity score model. The validation set was 20% of the training set. Because we had 15 target chemical properties, we trained 15 different GNN predictors.
The network structure for the CFR approach had three output paths: the label predictor, internal probability metric, and weight estimator. The label predictor and weight estimator shared the common feature extraction layers on the input side. We set the number of the GNN layers corresponding to the feature extractor and those for the weight estimator and label predictor to 3. Similar to the IPS approach, the weight estimator had a readout function for classification and the label predictor had one for regression. Note that the feature extractor had no readout function. We used Wasserstein distances 57 for the internal probability metric. In addition, before the internal probability metric, there was readout function to aggregate features extracted by the feature extractor to a batch of graph-level features. We set the α for QM9, ZINC, ESOL, and FreeSolv to 10, 10, 100, and 10, respectively. As with the IPS approach, we used ADAM with no weight decay as the optimizer. We also used a learning rate updating scheduler, with an initial learning rate of 1e − 5 and reduced the learning rate by a factor www.nature.com/scientificreports/ of 0.7 until the validation error stopped reducing for five training epochs. The batch size was set to 64. In contrast with IPS, the entire network was trained in an end-to-end manner. We trained 15 GNNs for each of the target properties in each trial. 20% of the training and test sets were used to validate the domain classifier, and 20% of the training set was used for the label predictor.
As the baseline method, we used the same GNN structure as the one for IPS, but without bias mitigation. In contrast to the IPS approach, we used the standard unweighted average loss for the training dataset. The same settings as for IPS were used except for those specific to IPS, such as the number of GNN layers, the selection of the hyperparameters, and the training and validation sets.

Data availability
The datasets used and/or analysed during the current study are all available at https:// pytor ch-geome tric. readt hedocs. io/ en/ latest/ modul es/ datas ets. html. We further processed these datasets with biased sampling scenarios, which were introduced in Material and Methods section. The processed datasets used and/or analysed during the current study available from the corresponding author on reasonable request.