Machine learning enables polymer cloud-point engineering via inverse design

Inverse design is an outstanding challenge in disordered systems with multiple length scales, such as polymers, particularly when designing polymers with a desired phase behavior. We demonstrate high-accuracy tuning of the poly(2-oxazoline) cloud point via machine learning. With a design space of four repeating units and a range of molecular masses, we achieve a root-mean-squared error (RMSE) of 4 °C over a temperature range of 24-90 °C, employing gradient boosting with decision trees. This RMSE is more than 3x better than that of linear and polynomial regression. We perform inverse design via particle-swarm optimization, predicting and synthesizing 17 polymers with constrained designs at 4 target cloud points from 37 to 80 °C. Our approach challenges the status quo in polymer design with a machine learning algorithm that is capable of fast and systematic discovery of new polymers.

Figure 1: Study framework. First, we train a machine learning model to predict cloud point on the basis of poly(2-oxazoline) structure, with varying ratios of four monomer units (building blocks) and molecular weights. Second, we demonstrate inverse design using the trained algorithm and particle swarm optimization, predicting 17 polymer structures from user-defined cloud points. The model accommodates the inherent complexity of polymers over multiple length scales.

Results & Discussion
We combine and curate literature and experimental data to create the input to our machine learning framework. Historical cloud-point data for poly(2-oxazoline)s 15,[25][26][27][28][29][30] were curated into a set of input variables ((1) molecular weight of the polymer; (2) polydispersity index; (3) polymer type (homo, statistical, or block); (4) total number of each monomer unit in the final polymer (A: EtOx, B: nPropOx, C: cPropOx, D: iPropOx, E: esterOx)) and an output variable (cloud point in ˚C) (Table S1). We synthesized a series of poly(2-oxazoline)s by similar methods to augment these data (Table S2). Cloud point was evaluated by dynamic light scattering (DLS) in accordance with best practices, 31 particularly since DLS gives greater weight to the modal mass, correcting for the unsymmetric molecular weight distributions (MWD) of our synthesized polymers (details in ESI). Owing to data scarcity, esterOx was neither synthesized nor considered in inverse design. While a general relationship between input variables and output can be observed in Figure 2, it is well documented that machine learning methods generally achieve superior predictive accuracy in multi-variable parameter spaces. [32][33][34] We compared the root-mean-squared errors (RMSE) of simple linear and quadratic regressions against machine learning methods including support vector regression (SVR), neural networks (NN), and gradient boosting regression with decision trees (GBR) (Figures 3, S3). The accuracy of each model was determined by splitting the input dataset into training, validation, and test sets: training and validation were performed on historical data, while testing was performed on experimental data. The literature data were split into 68 training points and 7 validation points; the test set comprised 42 experimental data points produced in the lab. The RMSE and inference times are reported in Table S3.
We observe that GBR achieves the best generalization (Figure 3, bottom row: final GBR model performance on 3 different random train-test splits of the combined dataset).
Linear and polynomial regressions, while significantly faster than the other methods, performed poorly compared with SVR, NN, and GBR. Of the latter three, GBR was the most accurate "out of the box". Moreover, it possesses fast inference speed, which is essential for efficient exploration of the parameter space during inverse design. We increased the predictive accuracy by tuning via a cross-validated grid search over hyper-parameters. We used both historical and experimental data, holding out 10% as a test set in order to validate our choice of hyper-parameters against the test error on 3 randomly split training and test sets (Figure 3). With the enlarged dataset and thorough tuning, we observe improved performance.
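The cross-validated grid search described above can be sketched as follows. This is an illustrative example on synthetic data shaped like the paper's inputs (7 descriptors, cloud points in the 24-90 °C range); the grid values and random data are assumptions, not the values from the study.

```python
# Hypothetical sketch of a cross-validated hyper-parameter grid search for GBR.
# The dataset and grid values are illustrative stand-ins, not the paper's.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
# Stand-in for the 7 descriptors (molecular weight, PDI, type, monomer counts).
X = rng.uniform(0, 1, size=(110, 7))
y = 24 + 66 * X[:, 0] + rng.normal(0, 2, size=110)  # synthetic cloud points

# Hold out 10% as a test set, mirroring the split described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

rmse = np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2))
print(search.best_params_, f"test RMSE = {rmse:.2f} C")
```

The same pattern scales to any of the other regressors by swapping the estimator and grid.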
The algorithm generalizes well across a polymer dataset of varying polydispersity. The historical datasets had narrow polydispersity indices, with the assumption of symmetrical MWDs, while the synthesized polymers had broad and unsymmetrical MWDs. The robustness of the algorithm in handling such "noisy" data makes it far more powerful than a simple algorithm that works only on the highest-quality data. With a sufficiently accurate model in hand, we finally retrain (using the tuned hyper-parameters) on the entire dataset to produce the finalized forward model used for subsequent inverse design.
While forward predictive models are fairly common in machine learning approaches to materials science, inverse design is far more challenging. This is because the descriptors, which are usually high dimensional, are difficult to predict from outputs, which are low dimensional.
In the case of our polymer dataset, the output (cloud point) is a single number, attributed to the 5 numbers representing the molecular mass and composition of the polymer.
Inverse design provides the ability to design polymers from a desired final property and accelerates the synthesis of target polymers under design constraints that meet desired cloud points. To push toward new material discovery, we extrapolate from our training dataset by designing terpolymers, which are absent from our training set, and by limiting the EtOx composition, which is common in it.
Typically, inverse optimization on piecewise-constant functions yields a large number of distinct predicted designs, all of which achieve our optimization and constraint targets according to the fitted GBR model. However, the quality of these designs varies, particularly in the case of extrapolation. Validating all of them experimentally would be inefficient, so a filtering method based on an ensemble of M three-layer fully connected neural networks (NN) was employed to select the most promising candidates for experimental validation: we retain only designs for which the variance across the ensemble predictions is small. This ensures that the cloud point is predicted with high confidence and is not an ad-hoc extrapolation. Figure 4 illustrates the principle of this approach. Although the NNs are also good approximators of the cloud point, they were not used as the forward model for producing inverse design candidates because the feed-forward step of the NN ensemble is still too slow compared with GBR, which consists of a simple sum of piecewise-constant functions. Using this technique, we down-selected 17 polymers over our 4 target cloud points (37, 45, 60, 80 ˚C), imposing design criteria weighted toward minimizing EtOx and toward polymers with more than two components, unseen in the training data. These polymers were synthesized, although an average of 3 iterations was required to achieve the target mass and composition of the designs, owing to the difficulties of terpolymer synthesis, for which the Mayo-Lewis equation does not apply when calculating the monomer feed ratio required for a desired final copolymer composition. The masses and compositions of the synthesized polymers are reported in Table S4, showing minimal deviation from the algorithmic designs, along with their cloud points (averages of 3 measurements). The RMSE of the obtained cloud points was 3.9 ˚C; however, when the structures of the new polymers are fed back into the NN ensemble, a larger RMSE of 6.1 ˚C is observed (Figure 4).
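The ensemble-variance filter described above can be sketched as follows: train M small neural networks on bootstrap resamples of the data, then keep only the candidate designs on which the ensemble agrees. The data, ensemble size, and retention threshold are illustrative assumptions, not the paper's values.

```python
# Sketch of filtering inverse-design candidates by NN-ensemble variance.
# All data here are synthetic stand-ins; M and the quantile cutoff are assumed.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 7))
y = 24 + 66 * X[:, 0] + rng.normal(0, 2, size=100)

M = 5  # ensemble size (the paper's M is not specified here)
ensemble = []
for m in range(M):
    idx = rng.integers(0, len(X), len(X))  # bootstrap resample
    net = MLPRegressor(hidden_layer_sizes=(32, 32, 32),
                       max_iter=2000, random_state=m)
    net.fit(X[idx], y[idx])
    ensemble.append(net)

# Candidate designs, standing in for the GBR + PSO proposals.
candidates = rng.uniform(0, 1, size=(50, 7))
preds = np.stack([net.predict(candidates) for net in ensemble])  # (M, n)
std = preds.std(axis=0)  # disagreement across the ensemble

# Keep the candidates with the lowest prediction variance.
keep = candidates[std < np.quantile(std, 0.2)]
print(f"kept {len(keep)} of {len(candidates)} candidates")
```

Designs that survive this filter are those for which the bootstrap-trained networks agree, i.e. predictions that are not ad-hoc extrapolations.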
Deviation from the target cloud points was within the test RMSE between 37 and 60 ˚C but above it at 80 ˚C, which can be attributed to the sparseness of the dataset at higher temperatures (Figure 2F); an in-depth analysis is provided in the ESI. These results show that our combination of slow and fast algorithms is able to design polymers of unique composition with control over both the desired physical property and the structural design.
Overall, a significant conceptual advance in polymer inverse design has been achieved via the judicious application of machine learning methods. This was done in three steps. First, we curated and categorized historical and new data. Second, we selected and fine-tuned a machine learning model based on gradient boosting regression with decision trees, resulting in a cloud-point predictive accuracy of 3.9 ˚C (RMSE). The model generalized well both to well-defined historical datasets and to newly synthesized polymers with unsymmetrical MWDs. Third, we performed inverse design by particle swarm optimization, which predicted new polymers from desired cloud points (37, 45, 60, 80 ˚C). Extrapolation beyond the training set was achieved via an ensemble of neural networks as a cross-validation technique, down-selecting 17 polymers with the lowest variance across predictions. The RMSE of the predicted polymers was similar to that of the forward model. This methodology offers unprecedented control of polymer design and may significantly accelerate the development of polymers with other physical properties.
Details of our code implementation and dataset can be found in our repository.

Material synthesis and characterization
The other 2-oxazoline monomers were synthesized as described in the literature, distilled over calcium hydride, and stored over molecular sieves (size 5 Å) in a glovebox. 2-Ethyl-2-oxazoline (EtOx, Sigma-Aldrich) was distilled over calcium hydride and stored over molecular sieves (size 5 Å) in a glovebox. All other reagents were used as supplied unless otherwise stated.

Analytical Methods
Nuclear magnetic resonance (NMR). The compositions of the polymers were determined using 1H NMR spectroscopy. 1H NMR spectra were recorded on a JEOL 500 MHz NMR system (JMN-ECA500IIFT) in CDCl3. The residual protonated solvent signals were used as reference.

Dynamic Light Scattering (DLS).
Measurements at various temperatures were conducted on a Malvern Instruments Zetasizer Nano ZS equipped with a 4 mW He-Ne laser operating at λ = 633 nm, an avalanche photodiode detector with high quantum efficiency, and an ALV/LSE-5003 multiple-tau digital correlator. Solutions of polymers (5 mg/mL) were prepared by dissolving the polymer in deionized water at room temperature. The solutions were then heated to 100 °C and cooled down to remove thermal memory before measurements were taken.

Experiments
For all polymerizations, the polymerization mixture was prepared in vials that were dried in a 100 °C oven overnight before use and crimped air-tight in a glovebox.

Curation and synthesis of the polymer library
To augment the historical dataset reported in Table S1, (4-10) a series of poly(2-oxazolines) were synthesized by cationic ring-opening polymerization in a microwave reactor at 140 °C and terminated with tetramethylammonium hydroxide at the end of the reaction. All copolymers were synthesized with EtOx and one of the propyl oxazolines, with variations in feed ratio. SEC results for all synthesized polymers are reported in Table S2.

DLS measurements were performed in triplicate by preparing solutions of polymers at a concentration of 5 mg/mL in deionized water. The solutions were then heated to 100 °C and cooled down before measurements were taken, to negate the effect of thermal history. DLS measurements of the polymer solutions were performed over a temperature sweep from 20 to 90 ˚C. The cloud-point temperature of the synthesized polymers (Table S2) was determined as the temperature at which the dissolved polymer chains of small hydrodynamic diameter agglomerate to form large particles or mesoglobules, as demonstrated in Figure S1 for poly(nPropOx-co-EtOx) copolymers with compositional variation in 20% increments.

Figure S1: Temperature-dependent DLS measurements for poly(nPropOx-co-EtOx) at various compositional ratios, demonstrating the dependence of cloud point on polymer composition.

The PDIs obtained experimentally are much higher than the PDIs from the historical data. It can be assumed that the molecular weight distributions (MWD) for the historical data, where the PDI is lower than 1.4, are typically symmetrical. Conversely, the MWDs of the polymers made experimentally had a long low-molecular-weight tail (Figure S2). In the case of cationic ring-opening polymerization, this long tail can be attributed to impurities such as water, which terminate actively propagating chains.
Due to the unsymmetrical MWD, the number-average molecular weight (Mn) is no longer a proper representation of the MWD, particularly when comparing the dataset to historical data from polymers of narrow polydispersity.
Zhang et al. (11) propose that DLS is one of the better methods to characterize cloud points. They note that the intensity of scattered light, which rises due to a sharp change in refractive index, is dominated by the chains that are dehydrating and thereby changing morphology from coil to globule; in contrast, only a minor difference in refractive index is observed for the hydrated chains. For broad or unsymmetric MWDs such as those of our polymers, it is therefore intuitive that the cloud point measured by DLS at the modal polymer molecular weight represents the polymer as a whole. To validate this theory, a polymer was selected at random and dialyzed against water to remove some of the low-molecular-weight tail. Comparison of the MWD before and after dialysis (Figure S2) shows the removal of the low-molecular-weight tail and the narrowing of the MWD. However, the DLS results (Figure S2, inset) show no change in the cloud point of the polymer. Thus, to better represent the polymer dataset, the modal molecular weight, or peak molecular weight (Mp), was used to represent the molecular weight of the polymers in Table S2.

Figure S2: Gel permeation chromatogram and temperature-dependent DLS data of poly(nPropOx-co-EtOx) (sample numbers 38 & 39, Table S2) before and after dialysis, showing a narrowing of the molecular weight distribution with no change in cloud point.

Machine-Learning Methodology

Establishing a Machine Learning Baseline
It is often useful to establish a baseline for statistical methods on the currently available data before further data collection and algorithm exploration. In this section, we outline the development of our basic data-driven approach. Such approaches are broadly classified into statistical models (e.g., multivariate analysis (12) and Bayesian inference (13)) and machine learning models (e.g., support vector machines, (14) decision tree learning, (15) and deep neural networks (16)). The former perform well on relatively small datasets but require non-trivial domain information, such as statistical priors and a forward mathematical model, which may not always be available and can thus limit their applicability. Machine learning models, on the other hand, are applicable to datasets where the underlying physical mechanisms are unclear or where the data are corrupted by noise. (17) While machine learning typically requires large datasets and cannot infer underlying physical relationships, its accuracy and fast inference speed make it suitable for inverse design via global optimization.
In this work, we recall that we wish to predict the cloud point (y ∈ ℝ) from the polymer composition and other properties (x ∈ ℝ⁷). We assume that there is some relationship y = f(x) for some unknown function f. Hence, our goal is to parameterize and fit an approximator f̂ of f. The literature dataset is split into 68 training samples and 7 validation samples, and we evaluate a total of five fitting methods: 1) linear regression; 2) polynomial regression of degree up to 2; 3) support vector regression; 4) a neural network regressor (2 hidden layers); and 5) gradient boosting regression with decision trees (GBR). (17) Below, we sketch the basic idea of the GBR method, which is the final choice of forward model for inverse design, and refer the reader to the text by Hastie, Tibshirani and Friedman (18) for more details.
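The five-method baseline comparison can be sketched as follows. The data here are synthetic stand-ins with the paper's shape (68 training / 7 validation samples, 7 descriptors); the actual dataset lives in the authors' repository, and the model settings are illustrative assumptions.

```python
# Illustrative baseline comparison of the five regressors on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(75, 7))
y = 24 + 50 * X[:, 0] + 16 * X[:, 1] ** 2 + rng.normal(0, 2, size=75)
# 68 training samples, 7 validation samples, mirroring the split in the text.
X_tr, X_va, y_tr, y_va = X[:68], X[68:], y[:68], y[68:]

models = {
    "linear": LinearRegression(),
    "poly-2": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "SVR": SVR(),
    "NN": MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
}
rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse[name] = mean_squared_error(y_va, model.predict(X_va)) ** 0.5
print(rmse)
```

Each method is fit on the training split and scored by validation RMSE, the discriminating quantity used throughout this section.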
A Sketch of Gradient Boosting Regression

GBR makes use of "boosting", a class of sequential ensembling methods in which weak regressors (regression models with low capacity or approximation power) are iteratively combined to form a strong regressor. The basic idea is as follows: fix a space H of weak approximators (e.g., decision trees) and start with a constant function f_0. For each t ≥ 0, we set

f_{t+1} = f_t + h_t, where h_t = argmin_{h ∈ H} Σ_i L(y_i − f_t(x_i), h(x_i)), (1)

and the loss function L measures the "distance" between its arguments. In other words, at each step we fit some function h_t to approximate the current residual error y − f_t, and this successively improves the approximation. Of course, in practice the minimization step in (1) may be hard to evaluate, hence one uses "gradient boosting", where h_t is not chosen as a true minimizer but as a function in the "steepest-descent direction" of the loss with respect to h. A detailed exposition on gradient boosting can be found in the previously mentioned text. (18)

The results of the comparisons are shown in Figures 3 and S3, where we measure the root-mean-squared error on training, validation, and test sets, the last of which is the quantity used to discriminate model performance. The RMSE and the inference time are reported in Table S3. Note that while the training and validation sets are random splits of the literature data, the test set consists of sample points obtained in our experiments. Thus, a model that performs well on the test set has the ability to fuse both literature data and our experimental data into a more robust model. From our results, we observe that linear and polynomial regression, while having fast inference speeds, perform poorly in terms of test error. Moreover, polynomial regression suffers from the "curse of dimensionality" when higher-order polynomials are included, since the number of terms increases exponentially with the maximum degree.
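The boosting iteration sketched above can be made concrete with a minimal from-scratch implementation for squared loss: each round fits a shallow decision tree to the current residuals y − f_t(x) and adds a damped copy to the ensemble. This is an illustrative toy, not the production model; the data, depth, and learning rate are assumptions.

```python
# Minimal gradient boosting with squared loss: iteratively fit trees to residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, size=200)

f = np.full(len(y), y.mean())  # f_0: best constant approximator
trees, lr = [], 0.1            # learning rate damps each correction
for t in range(200):
    h = DecisionTreeRegressor(max_depth=2).fit(X, y - f)  # fit the residual y - f_t
    f += lr * h.predict(X)                                # f_{t+1} = f_t + lr * h_t
    trees.append(h)

def predict(X_new):
    # Sum of the constant start plus all damped tree corrections.
    return y.mean() + lr * sum(h.predict(X_new) for h in trees)

rmse = np.sqrt(np.mean((predict(X) - y) ** 2))
baseline = np.sqrt(np.mean((y.mean() - y) ** 2))
print(f"boosted RMSE {rmse:.3f} vs constant baseline {baseline:.3f}")
```

For squared loss the residual is exactly the negative gradient of the loss, so this simple residual-fitting loop is the special case of gradient boosting described in the text.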
While all of the more sophisticated machine learning methods perform significantly better, GBR performs best when weighing both RMSE and inference time, even with minimal tuning. The inference time is important because the forward model is called repeatedly during inverse design, and faster inference greatly enhances our exploration of the design space. Moreover, GBR (with decision trees as base regressors) gives us a measure of feature importance via the Gini impurity. (18) In the present application, this provides an estimate of the sensitivity of our cloud-point model to the polymer properties, shown in Figure S3. To optimize the GBR for inverse design, hyperparameter tuning was further conducted, bringing the RMSE down to 3.9 ˚C; details are presented in our data repository. With a tuned model, we turn to inverse design in order to predict polymer structure from desired cloud points.
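Extracting impurity-based feature importances from a fitted GBR is a one-liner in scikit-learn. The data and descriptor labels below are illustrative stand-ins for the paper's input variables, not its fitted values.

```python
# Impurity-based feature importances from a fitted GBR (synthetic data;
# descriptor labels are illustrative, matching the paper's input variables).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(150, 7))
y = 24 + 60 * X[:, 0] + 6 * X[:, 3] + rng.normal(0, 1, size=150)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)
names = ["Mw", "PDI", "type", "EtOx", "nPropOx", "cPropOx", "iPropOx"]
for name, imp in sorted(zip(names, gbr.feature_importances_), key=lambda p: -p[1]):
    print(f"{name:10s} {imp:.3f}")
```

The importances sum to 1 and rank each descriptor by how much impurity reduction its splits contribute across the ensemble, which is the sensitivity measure referenced above.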

Inverse Design via Particle Swarm Optimization
Our data-driven approximation f̂ of the forward relationship between the polymer properties and the cloud point was demonstrated previously to be close to the true function f. In this section, we consider the problem of inverse design: we want to find a polymer configuration x that achieves certain targets (e.g., cloud point, desired proportions) while respecting certain constraints (e.g., molecular weight). Mathematically, this can be posed as a constrained optimization problem

min_x L(x, f̂(x)) subject to g(x, f̂(x)) ≥ 0, (2)

where L : ℝ⁷ × ℝ → ℝ is the objective function and g : ℝ⁷ × ℝ → ℝᵐ is the vector-valued constraint function.
Problem (2) is posed as a global optimization problem. In general, there are many heuristic methods for solving it, including simulated annealing, (19) genetic algorithms, (20) differential evolution, (21) etc. In this paper, we employ the particle swarm optimization (PSO) algorithm. (22) It is