Accelerated knowledge discovery from omics data by optimal experimental design

How to design experiments that accelerate knowledge discovery on complex biological landscapes remains a tantalizing question. We present an optimal experimental design method (coined OPEX) to identify informative omics experiments, using machine learning models for both experimental space exploration and model training. OPEX-guided exploration of Escherichia coli populations exposed to biocide and antibiotic combinations led to more accurate predictive models of gene expression with 44% less data. Analysis of the proposed experiments shows that broad exploration of the experimental space followed by fine-tuning emerges as the optimal strategy. Additionally, analysis of the experimental data reveals 29 cases of cross-stress protection and 4 cases of cross-stress vulnerability. Further validation reveals the central role of chaperones, stress response proteins and transport pumps in cross-stress exposure. This work demonstrates how active learning can be used to guide omics data collection for training predictive models, making evidence-driven decisions and accelerating knowledge discovery in the life sciences.


Overview
We have designed the Optimal Experimental Design Framework (OPEX) to identify the optimal set of transcriptomic experiments for maximizing prediction power in unobserved culture conditions in three steps (Fig. 1, steps 1-3). In the first step, we use the available transcriptomic data to Build Predictive Model of gene expression using the culture condition as the model input. In the second step, we Calculate Utility Scores for unobserved culture conditions using the predictive model from the first step. In the third step, we Select Optimal Conditions amongst all unobserved culture conditions given their utility scores from the second step. In its general form, OPEX is the following optimization problem:

$$X_s^{*} = \operatorname*{arg\,max}_{X_s} \; h(X_s \mid X_o, Y_o, k), \qquad (1)$$

where the matrix $X_s$ denotes the culture conditions for the next batch of experiments, the matrix $X_o$ denotes the culture conditions for the observed experiments (with each row of the matrix being an experiment), the matrix $Y_o$ contains the gene expression profiles that map to the corresponding experiments of $X_o$, and the scalar $k$ denotes the batch size (i.e. the number of conditions to run for the next batch). The optimality of a batch of candidate conditions in $X_s$ is determined by the utility function $h$, and the optimal batch is returned by the arg max.

General Mathematical Formulation
The following describes the three-step OPEX algorithm for finding $X_s$. The modular design of the OPEX algorithm (Algorithm 1) allows different methods to be used in each step. The vector $\mathbf{u}$ contains $n_u$ real-valued utility scores, one for each unobserved condition, which is encoded by a corresponding row of $X_u$. Next, we describe the methods used in our implementation and results.
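As a minimal sketch of one OPEX iteration (the function names, toy model and toy utility below are our own illustration, not the implementation used in the paper):

```python
import numpy as np

def opex_step(model_fit, utility, X_obs, Y_obs, X_pool, batch_size=1):
    """One OPEX iteration with pluggable components (hypothetical signatures):
    model_fit(X, Y) -> model; utility(model, X_pool) -> one score per pool row."""
    model = model_fit(X_obs, Y_obs)               # step 1: build predictive model
    scores = utility(model, X_pool)               # step 2: score unobserved conditions
    return np.argsort(scores)[::-1][:batch_size]  # step 3: select optimal conditions

# Toy stand-ins: the "model" just stores the data; the utility favors pool
# points far (in L1 distance) from every observed condition.
fit = lambda X, Y: (X, Y)
util = lambda m, P: np.array([np.min(np.abs(p - m[0]).sum(axis=-1)) for p in P])
X_obs = np.array([[0.0], [1.0]])
Y_obs = np.array([0.0, 1.0])
X_pool = np.array([[0.1], [5.0], [1.2]])
picked = opex_step(fit, util, X_obs, Y_obs, X_pool, batch_size=1)  # farthest pool point
```

In the actual framework each step is swapped out: a GP for the model, and one of the utility functions described below for the scoring.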

Build Predictive Model
For each gene of E. coli, we used Gaussian processes (GP) to build a predictive model, i.e. to predict the expression level of the gene under a culture condition characterized by a row vector $x$. In our real-data results, $x$ is a 14-bit binary vector representing the presence/absence of ten biocides and four antibiotics, which characterizes a given culture condition. In our synthetic-data results, $x$ is a real-valued vector. The covariance matrix $K = [k(x_i, x_j)]$ represents all pairwise correlations for the given gene. The parameter $\sigma_f$ represents the amplitude of the overall correlation along all dimensions in the squared-exponential (SE) kernel, while the length-scale parameters $\ell_i$ are used for automatic relevance determination [1]. A larger value of $\ell_i$ represents a smaller influence of the $i$-th independent variable of a culture condition on the gene expression. These parameters are learned by maximizing the marginal likelihood of the observed data given the parameters. For a detailed derivation of the equations related to GPs, see [1].
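An SE kernel with ARD length-scales, as described above, can be sketched as follows (the symbol names amplitude and lengthscales are our notation for $\sigma_f$ and $\ell_i$):

```python
import numpy as np

def se_ard_kernel(x1, x2, amplitude, lengthscales):
    """SE kernel with automatic relevance determination: a larger lengthscale
    for a dimension means that dimension influences the covariance less."""
    d = (np.asarray(x1, float) - np.asarray(x2, float)) / np.asarray(lengthscales, float)
    return amplitude**2 * np.exp(-0.5 * float(np.dot(d, d)))

k_same = se_ard_kernel([1, 0, 1], [1, 0, 1], amplitude=2.0, lengthscales=[1.0, 1.0, 1.0])
# identical inputs give the maximum covariance, amplitude**2
```

Note that a 14-bit binary condition vector works directly here, since the kernel only needs differences between coordinates.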
Given the selection of the GP as our model in this work (equations (1-4)), for each gene the trained predictor $f(x_{\text{new}})$ is fully defined by the posterior mean and variance at $x_{\text{new}}$, which are used to predict the gene expression via equation (2) and to calculate utility scores as described next.

Calculate Utility Scores
We evaluated OPEX using three different utility score calculation methods, described here. The utility score $u_{\text{new}}$ for a new unobserved condition $x_{\text{new}}$ is calculated using the utility function. The utility scores of all unobserved conditions are represented by the vector $\mathbf{u} = [u_1, u_2, \ldots, u_{n_u}]$, where $u_i$ is the utility score of the $i$-th unobserved condition for a given gene.

Mutual Information (MI). In the setting of MI, the idea is to select the most representative culture condition amongst all possible unobserved culture conditions. The representativeness of a culture condition is quantified by the mutual information between the observed conditions and the unobserved ones. One sequential design implementation is to select the culture condition that provides the highest increase in covariance between the observed and unobserved datapoints [3]. The covariance matrix can be calculated by the following equation for a given gene.

Entropy (EN). In the setting of EN, the unobserved culture condition with the highest predictive entropy under the trained GP is selected; for a Gaussian predictive distribution, this is the condition with the highest predictive variance.

Covariance. $\Sigma_{u,o}$ is a covariance matrix composed of the pairwise correlations between the unobserved and the observed conditions; each entry in the matrix is calculated by the kernel function of the GP. $\Sigma_{o,o}$ is a covariance matrix composed of the pairwise correlations between the observed conditions. The covariance utility function is equal to the increment in the trace of the covariance matrix when a condition is moved from the unobserved to the observed set, where $\Sigma_{\{u-\text{new}\},\{o+\text{new}\}}$ is the same as $\Sigma_{u,o}$ except that a given unobserved condition $x_{\text{new}}$ is removed from the set of unobserved conditions and added to the observed conditions for a given gene.
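One way to read the covariance utility above is as the reduction in total posterior variance over the remaining unobserved conditions when a candidate is moved to the observed set; a sketch under that reading, with a unit SE kernel and helper names of our own:

```python
import numpy as np

def k_mat(A, B, ls=1.0, amp=1.0):
    # SE kernel matrix between two sets of conditions (rows of A and B)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return amp**2 * np.exp(-0.5 * d2 / ls**2)

def trace_utility(X_obs, X_pool, i, jitter=1e-6):
    """Trace reduction from moving pool condition i into the observed set."""
    def post_trace(O, U):
        K_oo = k_mat(O, O) + jitter * np.eye(len(O))
        K_uo = k_mat(U, O)
        return np.trace(k_mat(U, U) - K_uo @ np.linalg.solve(K_oo, K_uo.T))
    U_rest = np.delete(X_pool, i, axis=0)
    O_plus = np.vstack([X_obs, X_pool[i:i + 1]])
    return post_trace(X_obs, X_pool) - post_trace(O_plus, U_rest)
```

A candidate that also explains other unobserved conditions scores higher than one already explained by the observed data.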

Select Optimal Conditions for each gene
Following the general optimization equation (1) and using the terminology above, the next condition to select for the $j$-th gene is the one with the maximum utility score:

$$x_{\text{new}}^{(j)} = \operatorname*{arg\,max}_{x_{\text{new}}} \; u_{\text{new}},$$

where $u_{\text{new}}$ is the utility score corresponding to a given unobserved condition $x_{\text{new}}$, calculated by one of equations 8, 10 and 12, depending on the utility function used. For example, when we use mutual information (i.e. equation (10)) as the utility function, the most informative condition for the $j$-th gene is selected by solving this OPEX optimization problem with the MI utility, where $x_{\text{new}}$ is a 14-bit binary vector representing the presence/absence of ten biocides and four antibiotics for a culture condition.
For a larger batch size (i.e. $k > 1$), the next conditions in the batch were selected by greedy, constrained or adaptive sampling, as described in the respective results. In greedy sampling, the conditions are ranked by their utility scores and the top $k$ conditions are selected for the next batch. In constrained sampling, the condition with the highest utility score is selected and added to the batch; we then iterate through the remaining conditions, ordered by their utility scores, and calculate their Euclidean distance to the items already in the batch. Conditions whose distance exceeds a predefined minimum threshold are added until the batch-size limit is reached. Finally, in adaptive sampling, the condition with the highest utility score is selected and added to the batch; the predicted gene expression profile of the newly selected condition is then treated as observed, the model is retrained and the utility scores are updated. The condition with the highest updated utility score is then added to the batch, and this process is repeated until the batch-size limit is reached.
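The greedy and constrained strategies can be sketched as follows (a minimal illustration; function names and the Euclidean-distance check are our reading of the text):

```python
import numpy as np

def greedy_batch(scores, k):
    """Top-k conditions by utility score."""
    return [int(i) for i in np.argsort(scores)[::-1][:k]]

def constrained_batch(X_pool, scores, k, min_dist):
    """Greedy by score, but only accept candidates at least min_dist
    (Euclidean) away from every condition already in the batch."""
    batch = []
    for i in np.argsort(scores)[::-1]:
        if all(np.linalg.norm(X_pool[i] - X_pool[j]) >= min_dist for j in batch):
            batch.append(int(i))
        if len(batch) == k:
            break
    return batch
```

Adaptive sampling differs in that after each pick the model is retrained with the predicted profile treated as observed, so the scores themselves change between picks.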

Select Optimal Conditions for the Majority of Genes
With the optimal unobserved conditions selected for each gene, we count the frequency of each selected condition and select the most frequent one from $[x^{(1)}, x^{(2)}, \ldots, x^{(G)}]$, where $G$ is the number of genes.
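This consensus step reduces to a frequency count over the per-gene selections, e.g.:

```python
from collections import Counter

def consensus_condition(per_gene_choices):
    """Return the condition selected most often across per-gene selections."""
    return Counter(per_gene_choices).most_common(1)[0][0]
```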

Alternative Optimal Experimental Design Methods
For benchmarking, we used three other optimal experimental design approaches: query by committee using different types of models [4][5], query by committee using bootstrapping [6], and D-optimal experimental design [7]. Compared to OPEX, all three approaches differ in the utility function used, while the last approach also employs a different predictive model.

Query by Committee Using Different Types of Models.
We used a feedforward neural network (FNN), linear regression, a Gaussian process and support vector regression (SVR). For training the FNN and SVR, we used the R packages neuralnet and e1071, respectively. The number of hidden nodes of the FNN and the two hyperparameters of SVR were optimized by grid search. In each iteration, the condition with the highest disagreement amongst the different models (i.e. the highest variance of their predictions) was selected for the next iteration. When generating a learning curve, the GP model was used.
Query by Committee Using Bootstrapping. Here we used one type of model (GP), but changed the training set using bootstrapping to build a committee of four GP models. Likewise, the condition with the highest disagreement amongst different models (i.e. highest variance) was selected for the next iteration. When generating the learning curve, the GP model was trained without bootstrapping.
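A sketch of the bootstrapped committee (fit_predict is a stand-in for training a GP and predicting on the pool; its signature is our own):

```python
import numpy as np

def qbc_bootstrap_select(fit_predict, X_obs, Y_obs, X_pool, n_models=4, seed=0):
    """Train n_models on bootstrap resamples and pick the pool condition where
    their predictions disagree most (highest variance across the committee)."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_obs), size=len(X_obs))  # bootstrap resample
        preds.append(fit_predict(X_obs[idx], Y_obs[idx], X_pool))
    disagreement = np.var(np.stack(preds), axis=0)
    return int(np.argmax(disagreement))
```

With identical training labels the committee cannot disagree, so the variance vanishes everywhere; disagreement only arises where the resampled training sets lead to different fits.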

D-optimal Experimental Design.
Here, we used a linear model to predict gene expression (linear models were trained with R's built-in implementation of linear regression). The condition that increased the determinant of the information matrix (X^T X) the most was selected at each iteration, where X is a matrix whose rows each represent a culture condition.
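The D-optimal selection step can be sketched as follows (our minimal illustration):

```python
import numpy as np

def d_optimal_pick(X_obs, X_pool):
    """Pick the candidate maximizing det(X^T X) of the augmented design matrix."""
    dets = []
    for x in X_pool:
        X_aug = np.vstack([X_obs, x])          # design matrix with the candidate added
        dets.append(np.linalg.det(X_aug.T @ X_aug))
    return int(np.argmax(dets))
```

A candidate lying along an already-sampled direction leaves the determinant at zero, so the candidate spanning a new direction of the design space wins.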

Expert Sampling
We designed three strategies for expert sampling, consulting one chemist and three biologists, to evaluate the effectiveness of OPEX compared to human experts.

First strategy: Structural similarity. The first strategy relied on comparing the pairwise structural similarity among the 10 biocides and 4 antibiotics; the least similar culture condition was selected in each iteration. Specifically, a 1024-bit topological fingerprint was generated for each chemical using the Python package rdkit, and the pairwise Tanimoto similarities among the biocides and antibiotics were calculated [8]. The similarity between two culture conditions was defined as the sum of the similarity between the two biocides and that between the two antibiotics. When exploring the space defined by the biocides and antibiotics, we looked up the similarity of each unobserved culture condition to all the observed culture conditions and picked the least similar one.
Second strategy: Mechanism of action. The second strategy was based on similarity in the mechanism of action of each antibiotic and biocide. In each iteration, we sampled the antibiotic and biocide that were most different from the observed ones. For the mechanism of action of each antibiotic and biocide, see Supplementary Table 1.
Third strategy: Effect size. For the third strategy, three experts first ordered the four antibiotics based on their expected dominant impact on transcription within the central dogma of molecular biology: Ampicillin < Norfloxacin < Kanamycin < Rifampicin. Rifampicin is known to inhibit RNA polymerase, hence directly impacting transcription [9]. Kanamycin interferes with translation, hence indirectly impacting transcription through transcription factors [10]. Norfloxacin and Ampicillin are known to impact DNA replication and cell wall synthesis, respectively, and are hence ordered last with respect to their impact on the overall transcription profile [11][12]. If an antibiotic is dominant, the choice of biocide would be expected to have a smaller impact on the gene expression of E. coli. We reasoned that, given the gene expression under a culture condition with a more dominant antibiotic, we are likely to make good predictions for culture conditions that share the same antibiotic but differ in biocide. Based on this reasoning, we grouped the unobserved culture conditions into four groups by antibiotic. A group with a less dominant antibiotic was sampled before a group with a more dominant antibiotic. Within each group, we split the culture conditions into 5 buckets based on the mechanism of each biocide. Among the 5 buckets, we randomly selected one in each iteration while ensuring that two consecutive datapoints were not from the same bucket.
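The structural-similarity strategy above uses rdkit fingerprints; as a dependency-free illustration, Tanimoto similarity over fingerprints represented as sets of "on" bit positions, combined per the condition-similarity definition in the first strategy:

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity: |intersection| / |union| of the 'on' bits."""
    a, b = set(fp1), set(fp2)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def condition_similarity(bio1, abx1, bio2, abx2):
    """Similarity between two culture conditions: biocide-biocide plus
    antibiotic-antibiotic Tanimoto similarity, as defined in the text."""
    return tanimoto(bio1, bio2) + tanimoto(abx1, abx2)
```

In rdkit terms the bit sets would come from each molecule's 1024-bit topological fingerprint; here they are arbitrary illustrative sets.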

Random Sampling
For random sampling, we randomly selected a datapoint (an experimental condition in our setting) from all the unobserved datapoints as the next datapoint to collect. The default random function in R was used.

Exploration-Exploitation Tradeoff
The calculated utility scores have a potential myopic bias; relying on them exclusively for selecting the next batch of experiments (i.e. exploitation of the model) can therefore lead to overfitting. To avoid this, a portion of the conditions for the next batch can be selected randomly (i.e. exploration of the sample space). The exploration-exploitation trade-off is fundamental in optimal experimental design [2].
Exploration refers to switching to a strategy different from the predefined, utility-based strategy; its role is similar to that of random moves in simulated annealing [14]. Exploitation means exploiting the information learned from the collected data, selecting the next datapoint based on the predictions of a model trained on that data. We used the exploration frequency to control the tradeoff.
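One simple way to implement an exploration frequency f (our illustration; the text does not fix a particular schedule) is to explore on every round(1/f)-th iteration and exploit otherwise:

```python
import random

def pick_next(scores, exploration_frequency, iteration, seed=0):
    """Exploit (argmax of utility) by default; on a fraction of iterations set
    by exploration_frequency (e.g. 1/6), explore with a uniform random pick."""
    period = round(1 / exploration_frequency) if exploration_frequency > 0 else 0
    if period and iteration % period == 0:
        return random.Random(seed + iteration).randrange(len(scores))  # explore
    return max(range(len(scores)), key=lambda i: scores[i])            # exploit
```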
The effect of the exploration frequency parameter was evaluated on synthetic datasets and the RNA-Seq dataset.

Synthetic and Real RNA-Seq Datasets
OPEX was evaluated on a synthetic dataset and real RNA-Seq datasets.

RNA-Seq Dataset
We measured the gene expression profiles of E. coli under 45 culture conditions: 4 antibiotic-only treatments, 40 combinations of 10 biocides and 4 antibiotics, and an untreated control. Out of all the genes of E. coli, 1,123 genes had a count per million (CPM) larger than 100 in at least half of the samples. The fold changes of genes with a CPM less than 100 in at least half of the samples are expected to be sensitive to the sequencing depth [13]. To exclude the effect of those genes, we tested OPEX using only the 1,123 genes. We also tested OPEX and its variations on all 4,391 genes, with similar results (Section 2.5).

Validating OPEX on Synthetic Datasets
When validating the performance of OPEX on the synthetic datasets, we first randomly split each synthetic dataset into three parts: a training dataset, a pool of candidate conditions, and a benchmark set for evaluating the prediction performance of a trained predictive model. We then evaluated OPEX over 30 iterations. In each iteration, we trained a GP model on the training dataset, calculated the utility score of each candidate condition remaining in the pool, and selected a batch of conditions to add to the training set. Finally, we evaluated the predictive performance by the mean absolute error (MAE) of predictions on the benchmark set. After running OPEX for 30 iterations, we visualized the prediction accuracy at each iteration as a learning curve and compared the learning curves of OPEX and the baseline. When selecting a batch of conditions, we tested the three approaches outlined in section 1.2.3. Random sampling was used as the baseline for evaluating the performance of OPEX.

Validating OPEX on an RNA-Seq Dataset
Since the RNA-Seq dataset does not have as many records as the synthetic datasets, we slightly modified our validation method for it. We first split the whole dataset into two parts instead of three. One part served as the starting point of the training set; the other served both as the pool of culture conditions for selection and as the benchmark dataset. The initial training set consisted of 15 randomly selected culture conditions in which each antibiotic and biocide was represented at least once. In each iteration, we trained a GP on the current training set, selected a candidate culture condition from the pool and moved it to the training dataset, then evaluated the prediction performance of the retrained GP on the benchmark set. Note that the size of the benchmark was reduced at each iteration. We ran the whole process 50 times, with a different random seed each time.
Two types of methods were used for comparison, random sampling and expert sampling (for details, see the section entitled Expert Sampling). Each sampling method was evaluated against random sampling using the MAE of gene expression predictions in a given iteration.

Cluster Analysis on 40 Culture Conditions
We ran Principal Component Analysis (PCA) [15] and t-Distributed Stochastic Neighbor Embedding (t-SNE) [16] on the gene expression profiles of all 40 conditions using prcomp and Rtsne respectively in the R programming language, and projected the first two dimensions.
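The projection step (done with prcomp in R) is equivalent to an SVD of the centered data; a numpy sketch:

```python
import numpy as np

def pca_2d(X):
    """Scores of the rows of X on the first two principal components."""
    Xc = X - X.mean(axis=0)                       # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # PC1 and PC2 scores
```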
Hierarchical clustering was also performed on all 40 gene expression profiles (Fig. 3B).

OPEX Accelerates Knowledge Discovery
We tested whether OPEX can accelerate knowledge discovery (Fig. 3C).

Antimicrobials
Ten biocides and four antibiotics were used in this study. Biocides were selected based on their widespread use in hospitals and households [17], and antibiotics were selected based on their unique cellular targets (Supplementary Table 1).

Strains and Culture Conditions
Escherichia coli MG1655 was used in all experiments, except those performed to validate the genes involved in cross-protection and cross-vulnerability, where the wild-type Keio strain BW25113 and its derivative single-gene knockout (KO) strains [18] were used. Since the KO strains carried a kanamycin resistance gene, which might influence the validation experiments, it was removed by a method described elsewhere [19].

Synthetic Datasets
Seven datasets whose skewness in the distribution of the output varies from 1.17 to 7.86 were generated.
In a highly skewed dataset, the output was close to zero in most cases and only a few sharp peaks existed in the space. See Supplementary Figure 1 for a visualization of the datasets, the distribution of the output of each dataset and summary statistics. The data for the synthetic datasets are in Supplementary Data 2.

The Performance of OPEX on Synthetic Datasets
We evaluated the performance of OPEX on seven synthetic datasets with respect to five factors: skewness in the distribution of the output, the noise level in the measured output, the frequency of exploration, the initial dataset size, and the batch size. To our knowledge, no other study has systematically analyzed these factors.

The Effect of Skewness and Noise. Interestingly, the advantage of OPEX over the baseline was inversely proportional to the skewness of the dataset (p-value < 10^-3 by t-test; Supplementary Figure 2A). OPEX was robust to noise in the training set, outperforming the baseline even at very high noise levels (for the entropy utility function, 16% better than the baseline at 90% white noise on the 1st synthetic dataset, p-value < 10^-6 by t-test; Supplementary Figure 2D).

The Effect of Initial Dataset Size. OPEX outperformed the baseline within a window of initial dataset sizes, and that window varied with the dataset skewness (Supplementary Figure 5). When the initial dataset size was too small, the benefit of OPEX was generally limited until more samples were collected; when enough information is not initially available, we do not expect OPEX to effectively drive experimentation. Similarly, when the initial dataset size was large relative to the size of the experimental space, the experimental space had already been largely explored. In that case, increasing the dataset size with more experiments does not substantially change the information content of the dataset, regardless of the sampling method (e.g. OPEX vs. random sampling).

OPEX Performance on the Biocide-Antibiotic Transcriptional Profiling
OPEX with entropy as the utility function significantly outperformed expert sampling and random sampling for exploring the interaction between biocides and antibiotics (Supplementary Figure 9). The gap between the learning curves of OPEX and random sampling kept widening until 23 more datapoints were collected, at which point the MAE achieved by OPEX was 22% smaller than that of random sampling. To reach the same prediction accuracy as random sampling, OPEX needed 50% fewer datapoints.

Surprisingly, the performance of all three expert sampling strategies was worse than that of random sampling. Among the three, the strategy based on the chemical structure of the antibiotics and biocides was slightly better than the other two (Supplementary Figure 10). As more exploration was added, the performance of the expert sampling strategies approached that of random sampling but never surpassed it (Supplementary Figure 10).

Retrospective Analysis of the OPEX Strategy
To analyze the effectiveness of OPEX in exploring the space of unexplored culture conditions, we plotted the distance between the gene expression profiles of consecutively selected datapoints (the consecutive distance) over 30 iterations. Not surprisingly, for random sampling the consecutive distance fluctuated and no pattern was observed (Fig. 2D). For OPEX with an even tradeoff between exploration and exploitation, however, the consecutive distance increased gradually in the first 10 iterations (p-value = 0.05) and then kept decreasing (p-value < 10^-6, Fig. 2E), indicating that OPEX can capture the similarity of gene expression profiles under different culture conditions. This reveals the underlying strategy of OPEX: progressive exploration of the condition space, first at a coarse granularity and then at a finer one, confirmed by the fact that the distance in the first 15 iterations was above the median distance of all 30 iterations, while the distance in the latter 15 iterations fell below the median. The impact of the exploration percentage used by OPEX on the sampling strategy is illustrated in Supplementary Figure 11.
In the case of expert sampling based on structural similarity, the consecutive distance first fluctuated and then increased sharply at the end (Supplementary Figure 12A). For the other two expert sampling approaches, the consecutive distance was flat in the first 10 iterations, then increased slightly and finally decreased slightly (Supplementary Figure 12B-C). Not surprisingly, these curves approach the random sampling curve as the exploration percentage is increased (Supplementary Figure 12D-I).

Sensitivity Analysis of the OPEX Method
Here, we investigated the impact of the exploration frequency, skewness and noise level on the performance of OPEX evaluated using RNA-Seq dataset.
The Effect of Exploration. When the space to explore is of low complexity (e.g. convex fitness functions, few parameters/dimensions), following a single sampling strategy with zero exploration is sufficient, as was the case with the synthetic datasets (Supplementary Figure 4). However, for a complex space, as in the case of the RNA-Seq data with 14 independent variables and thousands of genes to predict, OPEX with zero exploration can overfit (Supplementary Figure 11A). We analyzed the diversity of the selected condition at each iteration across all OPEX runs, using the Shannon index to quantify it (Supplementary Figure 15A). The diversity was very low compared to that of random sampling, indicating a tendency of OPEX to select particular conditions at each iteration; in particular, OPEX tended to sample one outlier condition (i.e. peracetic acid + kanamycin) regardless of the starting training datapoints. We confirmed this by visualizing the distribution of the culture conditions selected by OPEX at the last two iterations.
At the 27th and 28th iterations, OPEX chose the peracetic acid + kanamycin condition 33 times among the 50 OPEX runs (Supplementary Figure 15B-C). Note that the peracetic acid + kanamycin condition was part of the initial dataset (i.e. the 15 randomly selected conditions) in only 11 of the 50 OPEX runs; the expected number of runs (out of 50) whose initial dataset includes a specific condition is 11.2.
Thus, OPEX chose to sample the peracetic acid + kanamycin condition at the end in 84% of the remaining 39 OPEX runs (39 = 50 − 11). When 50% exploration was added, the diversity increased (Supplementary Figure 15D-E) and the performance of OPEX was optimal (Supplementary Figure 14). Similarly, diversity can be increased by adding more exploration in the case of expert sampling (Supplementary Figure 15F), but since the expert sampling strategy itself was not effective, its performance could not surpass that of random sampling (Supplementary Figure 10).

We also performed gene ontology (GO) enrichment analysis on the list of 969 genes using DAVID [20]. The resulting enriched biological process GO terms were related to translation, glycolytic processes, cell division, peptidoglycan biosynthetic processes, and regulation of cell shape (with a threshold of 0.01 for the adjusted p-value). No enriched biological process GO terms were observed for the 154 genes (14%) for which OPEX did not outperform expert sampling.
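The Shannon index used above to quantify sampling diversity is computed from the frequencies of the selected conditions:

```python
import math
from collections import Counter

def shannon_index(selected_conditions):
    """H = -sum(p_i * ln p_i) over condition frequencies; higher H means the
    runs selected more diverse conditions at that iteration."""
    n = len(selected_conditions)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(selected_conditions).values())
```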

Peracetic Acid and Kanamycin Condition as an Outlier
We investigated further why OPEX deprioritized selection of the peracetic acid + kanamycin condition until later iterations while performing poorly at predicting its gene expression profile. We hypothesized that the GP model predicted a gene expression profile for this condition that was similar to the ones in the training dataset. We therefore visualized the predicted gene expression profile of this condition together with the measured gene expression profiles of the other conditions in 2D space by t-SNE (Supplementary Figure 17). The predicted gene expression was close to other conditions in which peracetic acid or kanamycin was present. Since gene expression was predicted from the culture conditions, it is reasonable that the model made such a prediction. However, the true gene expression under the peracetic acid + kanamycin condition was close to that of another antibiotic (the cluster in the top right of Fig. 3A; the green cluster in Supplementary Figure 18), suggesting that the model could not determine the gene expression of this condition from the antibiotic and biocide used.

Validating OPEX on all 4,391 Genes
We have shown the superior performance of OPEX when evaluated on the 1,123 genes that meet the minimum sequencing coverage criterion (count per million > 100, as described under RNA-Seq data analysis in the Methods section of the main manuscript), following recommended guidelines for RNA-Seq analysis. This raises the question of whether OPEX also performs well when the set of inactive genes is unknown beforehand. We therefore also assessed the performance gain of OPEX over random sampling without removing any genes (i.e. using all 4,391 genes) and obtained similar results. OPEX needed approximately 40% fewer datapoints to reach the same accuracy as random sampling (see Supplementary Table 2 and Supplementary Figure 19), similar to the results of Fig. 2B where only the 1,123 genes were used. We compared OPEX runs that consider all 4,391 genes versus OPEX runs that rely on the 1,123 active genes and found that they provide a similar improvement in performance relative to random sampling (Supplementary Figure 19A versus Fig. 2B). The remaining difference can be explained by the higher variance in the measurements among replicates when non-active genes are included (Supplementary Figure 21).

OPEX Comparison to Other OED Approaches
We compared the performance of OPEX with three alternative approaches: query by committee using different types of models, query by committee using bootstrapping, and D-optimal experimental design (see section 2.1 of this file for more information). The performance of each approach was evaluated based on the average MAE of the predictions for the expression of all 4,391 genes (Supplementary Table 1 and Supplementary Figure 19). OPEX with entropy as the utility function reached better performance than query by committee using different types of models (QBC-Mixed-Models) when compared on the maximum percentage of datapoints saved relative to random sampling (13 versus 14 iterations to reach the MAE that random sampling achieves at iteration 27). With respect to the overall improvement of MAE relative to random sampling across all iterations, OPEX with entropy achieved 12.7% while QBC-Mixed-Models achieved 11.0%, a slight advantage for OPEX with entropy (p-value = 9 × 10^-9). OPEX with mutual information as the utility function performed similarly to QBC-Mixed-Models. The query by committee using bootstrapping (QBC-Bootstrap) and D-optimal methods did not show a consistent advantage over random sampling. The prediction performance of the four types of models from query by committee was ranked in this order: support vector regression, Gaussian process, linear regression and feedforward neural network (Supplementary Figure 22).

Exposure to Biocides and Cross-protection to Antibiotics
Our fitness measurements demonstrated that biocide-treated E. coli cells, in the majority of cases, exhibited cross-protection against antibiotics, with a few cases of cross-vulnerability. In 29 out of 40 treatment conditions, biocide treatment increased the fitness in the antibiotic, while in 4 cases it reduced the fitness. Cross-protection between biocides and antibiotics has previously been brought to the attention of researchers [21][22][23] and regulatory agencies [24][25][26]. Biocides are regularly used as sanitizers in hospitals, households and the food industry, and such studies could help guide regulations to reduce the emergence of antimicrobial resistance.

Interestingly, pre-exposure to all biocides conferred protection against the antibiotic rifampicin (Fig. 3B), which was also the group that formed a cluster in the t-SNE/PCA analysis of the transcriptomics of biocide/antibiotic pairs (Fig. 3A and Supplementary Figure 16). The highest fitness value was observed for the povidone iodine/kanamycin combination, and the lowest for chlorophene/norfloxacin. These two extreme cases were selected for further investigation.

Three distinct clusters for all conditions
We examined cross-resistance of wild-type E. coli to each of the four antibiotics after pre-exposure to one of the ten biocides. Although there are 40 pairs of biocides and antibiotics, the gene expression (GE) profile was often dominated by only one factor (biocide or antibiotic). The dominating factor can be explained by three rules, as evident from the three clusters in Fig. 3A. First, the alcohol biocides (Ethanol, Isopropanol and Chlorhexidine) had a dominating effect on the GE profile regardless of the antibiotic they were paired with. Second, apart from the alcohols, in the majority of cases Rifampicin had a dominating effect on the GE profile regardless of the biocide it was paired with. Third, the choice of biocide determined the GE profile except when Rifampicin was used; this is evident from the proximity of the points for each of the biocides Benzalkonium chloride, Chlorhexidine, Chlorophene, Ethanol, Glutaraldehyde, H2O2, Isopropanol and Peracetic acid on the t-SNE plot (Fig. 3A). The same clusters were also detected in the PCA plot (Supplementary Figure 17). The kanamycin/peracetic acid pair is particularly interesting since it did not follow the general pattern. We further asked whether these clustering patterns can be explained

Supplementary Figures
Supplementary Figure 1
Supplementary Figure 3: The effect of noise on the performance of the OED methods compared to random sampling on synthetic datasets 2-7 (A-F), whose skewness values are 2.07, 3.05, 4.13, 5.29, 6.55 and 7.06, respectively. The settings for the other hyper-parameters are as follows: starting size=300, exploration frequency=1/6, batch size=3, number of iterations=50. The error bar denotes the standard deviation (number of datapoints=50). The bar represents the mean of 50 runs.

Supplementary Figure 4 A-F:
The effect of exploration frequency on the performance of the OED methods compared to random sampling on synthetic datasets 2-7, whose skewness values are 2.07, 3.05, 4.13, 5.29, 6.55 and 7.06, respectively. The y-axis is the MAE of random sampling minus the MAE of OPEX, divided by the MAE of random sampling; a positive value means OPEX is more effective. The settings for the other hyper-parameters are as follows: starting size=300, noise level=20, batch size=3, number of iterations=50. The error bar denotes the standard deviation (number of datapoints=50). The bar represents the mean of 50 runs.
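The y-axis metric described above can be written out explicitly; the function name below is a hypothetical helper, not part of OPEX:

```python
def relative_improvement(mae_random, mae_opex):
    """Relative MAE reduction of OPEX over random sampling:
    (MAE_random - MAE_OPEX) / MAE_random.
    Positive values mean OPEX is more effective."""
    return (mae_random - mae_opex) / mae_random

print(relative_improvement(0.50, 0.35))  # ~0.3: OPEX error 30% below random
print(relative_improvement(0.40, 0.50))  # negative: random sampling wins
```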

Supplementary Figure 5:
The effect of starting size on the performance of the OED methods compared to random sampling on datasets 2-7 (A-F), whose skewness values are 2.07, 3.05, 4.13, 5.29, 6.55 and 7.06, respectively. The settings for the other hyper-parameters are as follows: noise level=20, batch size=3, minimum distance=0.2, exploration frequency=1/6, number of iterations=50. The error bar denotes the standard deviation (number of datapoints=50). The bar represents the mean of 50 runs.
Supplementary Figure 6: Performance of adaptive sampling, constrained sampling and greedy sampling on the dataset whose skewness is 1.1. The settings for the other hyper-parameters are as follows: starting size=300, noise level=20, exploration frequency=1/6, total number of additional datapoints sampled=160. For Panel B, the minimum distance between datapoints in a batch is 0.2. As the batch size k goes beyond 16, we cannot select k points with pairwise distances greater than 0.2; hence the x-axis runs from 2 to 16 in Panel B. The number of datapoints for each box in the boxplot is 50. Each box represents the interquartile range, which consists of the data points between the 25th and 75th percentiles. The whiskers extend to the maximum and minimum values, but no further than 1.5 times the interquartile range for a given whisker. The horizontal line within each box represents the median.
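A minimal sketch of the constrained-sampling idea (the helper name and toy utilities are illustrative assumptions): greedily take the highest-utility candidates while skipping any point closer than the minimum distance to one already in the batch.

```python
import numpy as np

def constrained_batch(candidates, utility, k, min_dist=0.2):
    """Greedily select up to k candidates by descending utility,
    skipping any point within min_dist of an already-selected one."""
    order = np.argsort(utility)[::-1]  # highest utility first
    selected = []
    for i in order:
        if all(np.linalg.norm(candidates[i] - candidates[j]) >= min_dist
               for j in selected):
            selected.append(int(i))
        if len(selected) == k:
            break
    return selected

# 1-D toy example: 5 equally spaced candidates on [0, 1] (spacing 0.25)
cands = np.linspace(0, 1, 5).reshape(-1, 1)
util = np.array([0.1, 0.9, 0.8, 0.2, 0.7])
print(constrained_batch(cands, util, k=3))  # -> [1, 2, 4]
```

The same geometry explains the caption's x-axis limit: once k exceeds the number of points that can be packed at pairwise distance ≥ 0.2, the constraint becomes infeasible.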
Supplementary Figure 10: The performance of three expert sampling approaches. We used expert sampling to sample one datapoint every k iterations and used random sampling otherwise to introduce exploration. The percentage of exploration in the legends is equal to k/(k+1), e.g. 67%=2/(2+1). (A) Structure similarity means that the culture condition most dissimilar to all the observed conditions was selected in each iteration; the similarity between two culture conditions is quantified by the structural similarity between the biocides used in the two conditions and between the antibiotics used. (B) Mechanism similarity means the mechanism of each antibiotic and biocide was considered when selecting the most dissimilar culture condition. (C) Dominance of antibiotic means that the more dominant an antibiotic is, the later the culture conditions containing that antibiotic were selected.
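A hedged sketch of the dissimilarity-based expert pick in panels A-B (the function names, the cosine similarity and the toy condition vectors are illustrative assumptions; the paper's similarity measures are structure- and mechanism-based):

```python
import numpy as np

def cosine(a, b):
    """Toy similarity between two condition feature vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def most_dissimilar(observed, candidates, similarity):
    """Expert-style pick: the candidate whose maximum similarity to any
    observed condition is smallest, i.e. least like everything seen."""
    scores = [max(similarity(c, o) for o in observed) for c in candidates]
    return int(np.argmin(scores))

observed = [[1.0, 0.0], [0.9, 0.1]]
candidates = [[1.0, 0.1], [0.0, 1.0]]  # the second is clearly most dissimilar
print(most_dissimilar(observed, candidates, cosine))  # -> 1
```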

Supplementary Figure 12:
The distance between the datapoints selected in every two adjacent iterations by expert sampling with various percentages of exploration. The number on the right in each row of panels represents the percentage of exploration used. We sampled one datapoint every k iterations based on expert sampling and used random sampling otherwise to introduce exploration. The percentage of exploration in the legend is equal to k/(k+1), e.g. 66%=2/(2+1). The number of datapoints for each box in the boxplot is 50. The box plot is defined the same way as in Supplementary Figure 11.
Supplementary Figure 13: The performance of OPEX using mutual information (A) or entropy (B) as the utility function, visualizing the effect of the tradeoff between exploration and exploitation. We sampled one datapoint every k iterations based on the entropy or mutual information and used random sampling otherwise to introduce exploration. The percentage of exploration in the legends is equal to k/(k+1), e.g. 67%=2/(2+1).
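The k/(k+1) exploration fraction used across these captions can be sketched as a simple schedule (the function name is a hypothetical helper): one utility-driven pick per cycle of k+1 iterations, with the remaining k picks made at random.

```python
def sampling_schedule(n_iters, k):
    """One utility-driven pick (entropy or mutual information) followed by
    k random exploration picks, so the exploration fraction is k/(k+1)."""
    return ["utility" if t % (k + 1) == 0 else "random"
            for t in range(n_iters)]

sched = sampling_schedule(30, k=2)
frac = sched.count("random") / len(sched)
print(frac)  # 20 random picks out of 30 -> 2/3, i.e. 67% exploration
```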