Introduction

The ubiquity of polymers in modern technology highlights their importance to the modern world. The features responsible for their widespread use and applicability are favorable mechanical and thermal properties, high durability, and general resistance to corrosion1. Within the broad category of polymeric materials, there is a subgroup known as polyesters, which are composed of ester repeat units2,3,4. The most commonly produced polyester is polyethylene terephthalate (PET), which can be found in a host of applications including packaging, textiles, thermoplastic resins, and photovoltaic devices5,6,7. PET is composed of repeating units of terephthalic acid and ethylene glycol. The versatility of the esterification reaction allows many different types of multifunctional acids and glycols to be polymerized into polyesters, and indeed some applications make use of multiple acids and multiple glycols in the same composition. Adding to this complexity are the various molecular weights, end-group distributions, and degrees of backbone branching that have been realized. Together, this gives rise to an ever-larger materials design space.

To explore such a large materials space, machine learning (ML) can be utilized to derive highly nonlinear relationships between polymeric materials and their corresponding properties8,9,10,11,12,13,14,15,16. The glass transition temperature (Tg), molecular weight (MW), molecular weight distribution or polydispersity index (PDI), and inherent viscosity (IV) are properties that correlate with the functionality and performance of the material. Developing polyesters with favorable Tg, MW, and IV values for a given application often relies on a costly and time-consuming approach that involves testing many different combinations of diacids and diols under different experimental setups, including synthesis conditions, catalyst selection, and monomer ratios. This can take hours or days for a single batch and cannot cover a large number of targeted materials in a single instance. To circumvent such demanding processes, ML can be used to map the correlation between a given structural input (the identities of the diacids and diols used) and output (the desired properties such as Tg, MW, and IV) in order to help guide experimental work toward the targeted synthesis of materials with enhanced properties.

Previous work has demonstrated strong performance of ML models in predicting glass transition temperatures14,16,17,18,19,20. Tao and coworkers tested a large array of ML models with varying structure and feature representations to provide Tg predictions14. Using a dataset of about 7000 homopolymers, they developed an ML model with good predictive power and provided estimates for ~5700 homopolymers with unknown experimental Tg values. A similar study examined polyacrylamides with quantum chemical descriptors in order to provide Tg predictions16. A Gaussian process regression model was developed from a small dataset (20 instances) to estimate Tg using the thermal energies and total electronic energies of the repeat units as input values.

A task that still remains elusive for ML applications to polymers is the prediction of multiple properties by a single model, which can lead to more effective material optimization15. Another important challenge is the accurate prediction of IV values. ML models based solely on monomer composition ignore important structural information, such as the number of end-groups or the polymer chain length (often approximated by the molecular weight), and struggle to differentiate between values within a narrow range (between 0.2 and 0.4 dL/g). In this range, the relationship between Tg and IV can vary substantially21, and while Tg prediction is fairly straightforward, IV prediction is more challenging. Few studies have targeted IV values using ML9,11, with limited success, and thus alternative methodologies should be considered. Recent work has shown that graph neural networks (GNNs) provide increased predictive accuracy for thermal and mechanical properties of polymers22,23, for other families of materials24,25, and for molecular properties26,27,28,29,30. Given this success, graphs are a promising direction for representing molecules, and neural networks are particularly well suited to combining molecular graphs into macromolecules, in a manner similar to how representation schemes such as BigSMILES31,32 represent polymers. Another recently explored approach applies ML models to feature-engineered polymer data that capture higher-order structural interactions between monomeric units14,33.

In this work, we have developed a multitask ML architecture that aims to provide reliable values for polymer properties. To access these complex structure-function correlations, we utilized a GNN-based model (PolymerGNN), which was tested on a dataset of experimentally measured polyester properties (Tg and IV values). We demonstrate the generality of this model through its ability to predict Tg and IV as single tasks as well as both Tg and IV in a multitask learning framework. PolymerGNN outperforms other molecular embedding techniques in the tested prediction tasks, while retaining the ability to work in low-data regimes. In addition, we demonstrate the robustness of PolymerGNN through an explainability study, showing that the model appears to learn chemically relevant patterns and features in the dataset. The proposed methodology, while demonstrated for polyester property prediction, is transferable to other families of materials. While GNN-based models for machine learning predictions of polymer properties have been previously developed and successfully tested22,23,33, the pooling mechanism introduced here further advances these models (vide infra). This mechanism creates a centralized vector enriched with information from all monomers and allows PolymerGNN to make predictions from monomer input without any direct modeling of polymers.

Results

The polymer database

A diverse database of polyester resins with experimental data was generated. These materials contain between 1 and 4 different diacids (referred to in the following paragraphs as acids) and between 1 and 4 diols (glycols), while a small number also include trimethylolpropane (TMP), which allows the synthesis of branched polymers (Fig. 1a; the full list of all monomers is given in Supplementary Note 1). The overall database contains 186 linear polymers (62.8%) and 110 branched polymers (37.2%). The linear polymers can be further classified as “homopolyesters”, which have only 1 acid and 1 glycol (24.0%), and “co-polyesters”, which have multiple acids and/or glycols (38.8%). A fraction of the linear polyesters (21.3% of the total database) comprises high molecular weight polymers with a characterized amount of cyclic oligomers (referred to in the following sections as “cyclic”). Pictorial representations of the subsets of polyester resins are shown in Fig. 1b–d. The polymer properties collected for each material include Tg, IV, weight-average molecular weight (Mw, relative to polystyrene standards), acid number (AN), and hydroxyl number (OHN). Figure 1e, f, and g show the distributions of the Tg and IV values for each subset (linear, branched, and “cyclic”, respectively), while representative examples of the three subsets are given in Fig. 1h, i, and j, respectively. It is thus evident that the compiled database has extensive diversity with respect to material composition and structure, as well as with respect to the targeted properties. Note that not all data entries have measured Tg and IV values: 210 instances in the database contain measured Tg values, 243 instances have measured IV values, and 163 instances have both Tg and IV values.

Fig. 1: The polyester database.
figure 1

a Polyesters are composed of combinations of diacid (red) and glycol (blue) monomers, and they can form b linear, c branched, and d cyclic chains that are present in linear polyesters. For branched polyesters, a small amount of trimethylolpropane (TMP, blue/white striped monomer) is required. The distributions of Tg and experimentally refined IV values for the e linear, f branched, and g “cyclic” polyesters demonstrate the heterogeneity of the total database. Representative examples of input/output values of a h linear, i branched, and j “cyclic” polyester. Each sample consists of a set of acid and glycol monomers together with their corresponding percentages, and a vector of resin properties: end-group statistics (AN and OHN) and weight-average molecular weight (Mw). Tg in °C, IV in dL/g, Mw in g/mol.

Initial model analysis

Using the diverse polyester resin dataset, we initially performed a wide-scale study to examine how different machine learning architectures, molecular representations, and polyester chain lengths affect the prediction of Tg and IV values (Supplementary Note 2). With regards to the machine learning architecture, we found that the kernel ridge regression (KRR) method resulted in the highest or near-highest predicted R2 values for Tg and IV, with values of 0.8624 and 0.7067, respectively. From this study, we found that including Mw in the input vector significantly improves the prediction of IV but does not improve the prediction of Tg: with the KRR model, IV was predicted with R2 values of 0.4288 and 0.7067 without and with Mw, while Tg was predicted with R2 values of 0.8624 and 0.8582 without and with Mw. We also found no systematic increase in accuracy when lengthier oligomers were used as input to the ML model, since the use of individual acid and glycol monomers resulted in the highest R2 values for both Tg and IV. These values therefore serve as the baseline for the PolymerGNN architecture.
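A minimal sketch of such a KRR baseline is shown below; the descriptor dimensions, hyperparameters, and random data are placeholders rather than the values used in this study.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 26))   # placeholder composition descriptors (+ optional Mw column)
y = rng.normal(size=200)         # placeholder Tg or IV values

# RBF-kernel ridge regression scored with 5-fold cross-validated R2.
krr = KernelRidge(kernel="rbf", alpha=1e-3, gamma=1e-2)
scores = cross_val_score(krr, X, y, cv=5, scoring="r2")
print(f"mean 5-fold R2: {scores.mean():.3f}")
```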

PolymerGNN architecture

We introduce PolymerGNN, a neural network and general training procedure to predict properties of polymers of known monomer composition. The overall data modality introduces a challenge as it is not straightforward to represent the polymer composition as simple vectors or mathematical objects that can naturally be input to machine learning algorithms. For that reason, PolymerGNN leverages graph neural networks (GNNs) and a pooling mechanism to produce outputs for varying numbers of inputs (number of acids and glycols in a given resin). Importantly, PolymerGNN separates acid and glycol inputs and combines representations from both of these sets of monomers to produce downstream representations with rich chemical information. As PolymerGNN utilizes a neural network, it can also perform multitask learning to produce embedding vectors that are optimized for predicting multiple properties.

The full PolymerGNN architecture consists of three separate units: (1) a molecular embedding block, (2) a central embedding block, and (3) a prediction network, as seen in Fig. 2. Each unit is presented separately in the following paragraphs.

Fig. 2: Model architecture for PolymerGNN.
figure 2

a The three major sections of the PolymerGNN architecture are: (1) the molecular embedding blocks, (2) the central embedding mechanism, and (3) the prediction network. b Architecture for the Tg prediction network, where tanh and \(\exp (\cdot )\) correspond to a hyperbolic tangent activation function and the exponential function \({e}^{x}\), respectively, while the multiplication node corresponds to the product of the scalar outputs from tanh and \(\exp (\cdot )\). c Architecture for the IV prediction network. d Joint model setup, with “Tg Prediction Network” and “IV Prediction Network” corresponding to the architectures shown in b and c, respectively.

Molecular embedding block

The molecular embedding block is responsible for transforming input molecular graphs into vectors, or representations of the molecular structures. Each resin is represented by its constituent monomers—initial inputs into the synthesis of the resin. The molecular structure of each monomer is then encoded into a molecular graph where nodes, or vertices, correspond to atoms, and edges correspond to chemical bonds. A GNN with two graph convolutional layers is used for each acid and glycol. Through rigorous testing of various GNN layers, we found that a two-layer GNN, with a Graph Attention Network (GAT) layer34 followed by a GraphSAGE layer35, provided exceptional performance. Following standard GNN design principles suggested by You, Ying, and Leskovec36, we use a Parameterized ReLU activation function37 and a Batch Normalization layer38 between graph convolutional layers within the GNN. These previous steps work to embed the nodes of each molecular graph, and to produce a graph-level embedding, we use a Self-Attention Graph Pooling mechanism39,40.
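A minimal sketch of this embedding block in PyTorch Geometric is given below; the hidden sizes and pooling ratio are illustrative assumptions and not necessarily those of the original implementation.

```python
import torch
from torch import nn
from torch_geometric.nn import GATConv, SAGEConv, SAGPooling, global_max_pool

class MonomerEmbedder(nn.Module):
    """GAT layer -> PReLU/BatchNorm -> GraphSAGE layer -> self-attention graph pooling."""
    def __init__(self, in_dim=6, hidden=32, out_dim=32):
        super().__init__()
        self.gat = GATConv(in_dim, hidden)
        self.act = nn.PReLU()
        self.bn = nn.BatchNorm1d(hidden)
        self.sage = SAGEConv(hidden, out_dim)
        self.pool = SAGPooling(out_dim, ratio=0.5)

    def forward(self, x, edge_index, batch):
        h = self.bn(self.act(self.gat(x, edge_index)))
        h = self.sage(h, edge_index)
        # Self-attention pooling keeps the highest-scoring nodes, then a global
        # max pool yields one fixed-size vector per molecular graph.
        h, edge_index, _, batch, _, _ = self.pool(h, edge_index, batch=batch)
        return global_max_pool(h, batch)
```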

We train two GNN blocks, one to embed the molecular structure of acids (Φa) and one to embed the molecular structure of glycols (Φg). These GNN components share an identical architecture. We present two ways of using these separate GNN blocks. The first considers training of both Φa and Φg with the same weights, updating them simultaneously within the model. In the second approach, each block is trained separately where weights are not shared across each of the models. Intuitively, this corresponds to learning two different models that embed acids and glycols in a way that is more advantageous for the downstream prediction network.

We obtain sets of molecular embeddings \({{{{\mathcal{A}}}}}_{z}=\{{{{{\bf{z}}}}}_{1}^{{{{\rm{a}}}}},..,{{{{\bf{z}}}}}_{n}^{{{{\rm{a}}}}}\,| \,{{{{\bf{z}}}}}_{i}^{{{{\rm{a}}}}}\in {{\mathbb{R}}}^{d}\}\) and \({{{{\mathcal{G}}}}}_{z}=\{{{{{\bf{z}}}}}_{1}^{{{{\rm{g}}}}},..,{{{{\bf{z}}}}}_{m}^{{{{\rm{g}}}}}\,| \,{{{{\bf{z}}}}}_{i}^{{{{\rm{g}}}}}\in {{\mathbb{R}}}^{d}\}\) for the acids and glycols, respectively, with the size of each set n, m varying with each input sample. Both sets of acids and glycols are permutation invariant, i.e., the ordering within each set is arbitrary.

While the GNN shows the best performance in subsequent experiments with the proposed architecture, we note that any type of molecular embedding technique can be used in this pipeline, as long as the output is a one-dimensional vector of constant size. Therefore, this model can be easily amended to future developments in molecular representations, including more advanced GNN architectures. The advantage of the GNN embedding tool over deterministic molecular fingerprinting methods is that the model can be trained in an end-to-end fashion, i.e., the molecular representations can be tuned to different tasks and datasets.

Central embedding block

The central embedding block of PolymerGNN combines all molecular embeddings into a chemically informed, constant-size vector for use in downstream tasks. The output \({{{{\mathcal{A}}}}}_{z}\) and \({{{{\mathcal{G}}}}}_{z}\) sets from the GNN blocks are permutation invariant sets of variable length. Thus, combining the embeddings requires a permutation-invariant aggregating function, or pooling function POOL(). Some examples of these pooling operations are the element-wise SUM, MAX, or MEAN. We choose to use an element-wise MAX pooling in PolymerGNN for predicting Tg and IV as early experiments showed slight gains in performance with this pooling method. Applying the POOL() function to both \({{{{\mathcal{A}}}}}_{z}\) and \({{{{\mathcal{G}}}}}_{z}\) produces constant-size output vectors denoted as za and zg.

The final portion of the central embedding layer incorporates the resin properties. Three key resin properties that are characteristic of polyesters were encoded in PolymerGNN in addition to the structural information of each monomer. The considered properties are the weight-average molecular weight (Mw), the terminal acid number (AN), and the terminal hydroxyl number (OHN) of the polymer chains. An additional input is the explicit percentage of TMP, which facilitates the synthesis of branched polymers. The introduction of branching can significantly change the shape of the polymer architecture, since increasing the level of branching agent (TMP in this case) causes Mw to build more rapidly than the number-average molecular weight Mn as a function of reaction progress (decreasing OHN and AN), which leads to significant differences in the polydispersity index and IV41,42. The TMP percentage is relevant for nearly half of the dataset and ranges from 0.0% (linear chains) to 15.7%. Explicitly providing TMP in the input gives the model a direct way to account for the approximate amount of branching in the final product of the synthesis. We will generally denote resin properties as \({{{\bf{p}}}}\in {{\mathbb{R}}}^{n\times m}\), for n samples in the dataset and m properties. The resin properties for sample i are denoted as \({{{{\bf{p}}}}}_{i}\in {{\mathbb{R}}}^{m}\). Denoting \(\oplus\) as the concatenation operator, we construct one vector \({{{{\bf{z}}}}}_{{{{\rm{agp}}}}}={{{{\bf{z}}}}}_{{{{\rm{a}}}}}\oplus {{{{\bf{z}}}}}_{{{{\rm{g}}}}}\oplus {{{{\bf{p}}}}}_{i}\) for sample i. Note that this vector zagp has a constant size, as all constituent vectors composing it are also of constant size. We then use a fully connected neural network layer to transform zagp into a central embedding vector, zcentral, which is enriched with information from the acid embeddings, glycol embeddings, and resin properties of the given sample. This vector serves as input to downstream prediction models. In addition to resin properties, additional information can be added in the central embedding block, such as experimental data related to the differential scanning calorimetry parameters, the quenching rate used for measuring the glass transition temperature Tg, or the temperature at which IV values were obtained. Since a constant rate and a fixed temperature of 25 °C were applied for the Tg and IV measurements, respectively, our data are independent of the experimental conditions.
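As an illustration, the pooling and concatenation steps can be written compactly as follows; the embedding dimension, the number of resin properties, and the linear-layer width are illustrative placeholders.

```python
import torch
from torch import nn

d, m, d_central = 32, 4, 64            # illustrative sizes
linear = nn.Linear(2 * d + m, d_central)

def central_embedding(acid_embs, glycol_embs, resin_props):
    # Element-wise MAX pooling over variable-size, permutation-invariant sets.
    z_a = torch.stack(acid_embs).max(dim=0).values      # (d,)
    z_g = torch.stack(glycol_embs).max(dim=0).values    # (d,)
    z_agp = torch.cat([z_a, z_g, resin_props])          # (2d + m,)
    return linear(z_agp)                                 # z_central

# Example: a resin with two acids and one glycol.
z_central = central_embedding(
    [torch.randn(d), torch.randn(d)], [torch.randn(d)], torch.randn(m))
```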

Prediction network

The prediction network predicts a given target value (Tg and/or IV) from the central embedding vector. We have explored whether it is advantageous to predict each property of interest separately (Fig. 2b, c for Tg and IV, respectively) or both in a jointly trained model (Fig. 2d).

The prediction network for Tg consists of two separate branches, the prediction branch and the multiplier branch (Fig. 2b). In the prediction branch, the model uses a two-layer neural network to learn an output Tg value, transforming the output with an exponentiation. This exponentiation is motivated by observations of a log-log relationship between some of the resin properties such as Tg and Mw as well as results from the ablation study (see Supplementary Note 5).

The prediction network for IV is a simple two-layer neural network with PReLU activation functions (Fig. 2c). We experiment with the same log-log transformation applied to the Tg network, as inspired by the Mark-Houwink Equation43 relating Mw to IV. However, it was found through ablation studies that log-log transformation of input data and model output decreased performance of the IV model, so we use standard scaling of resin properties to produce the best results.

Finally, the joint model is trained to predict simultaneously both target values and shares similarities with the individual models. The difference from single-task models lies after the pooling and concatenation operation. After applying a linear layer and a PReLU activation function37, the network diverges into two prediction branches—the Tg and IV branch. Each branch adopts an identical architecture to the prediction networks for Tg and IV.
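A minimal sketch of these prediction heads is given below; the layer widths and the exact branch wiring are illustrative assumptions based on the description above and Fig. 2, not a definitive reproduction of the original implementation.

```python
import torch
from torch import nn

class TgHead(nn.Module):
    """Two-branch Tg head: a bounded (tanh) multiplier branch times an exponentiated prediction branch."""
    def __init__(self, d_central=64, hidden=32):
        super().__init__()
        self.pred = nn.Sequential(nn.Linear(d_central, hidden), nn.PReLU(), nn.Linear(hidden, 1))
        self.mult = nn.Sequential(nn.Linear(d_central, hidden), nn.PReLU(), nn.Linear(hidden, 1))

    def forward(self, z):
        return torch.tanh(self.mult(z)) * torch.exp(self.pred(z))

class IVHead(nn.Module):
    """Plain two-layer MLP with PReLU activation for IV."""
    def __init__(self, d_central=64, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_central, hidden), nn.PReLU(), nn.Linear(hidden, 1))

    def forward(self, z):
        return self.net(z)

class JointHead(nn.Module):
    """Shared linear layer + PReLU, then separate Tg and IV branches."""
    def __init__(self, d_central=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_central, d_central), nn.PReLU())
        self.tg, self.iv = TgHead(d_central), IVHead(d_central)

    def forward(self, z):
        h = self.shared(z)
        return self.tg(h), self.iv(h)
```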

Model behavior and performance

We experiment with replacing the model’s molecular embedding layer with several types of molecular representations, the simplest of which involves application of one resin property (Mw) to predict another property (Tg or IV). In addition, we encode the composition of each resin using a binary approach. This is done by placing a 1 in the input vector at the location of each monomer (acid or glycol) that is present and a 0 in the input vector if the monomer is not present. This forms a vector of length twenty-five (thirteen different acids and twelve different glycols), which can be used as an identification of the specific resin in the dataset. To augment this approach, we have also added the Mw value at the end of this input vector to analyze its effect on the model accuracy. These two approaches are referred to as the ‘Binary’ method and the ‘Binary + Properties’ method, respectively.
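A minimal sketch of this binary encoding is shown below; the monomer names are placeholders rather than the actual monomer list used in the dataset.

```python
import numpy as np

ACIDS = [f"acid_{i}" for i in range(13)]      # 13 acids in the dataset
GLYCOLS = [f"glycol_{i}" for i in range(12)]  # 12 glycols in the dataset
VOCAB = ACIDS + GLYCOLS                       # 25-dimensional vocabulary

def encode_resin(monomers, mw=None):
    vec = np.zeros(len(VOCAB))
    for name in monomers:
        vec[VOCAB.index(name)] = 1.0          # 1 if the monomer is present
    if mw is not None:                        # 'Binary + Properties' variant
        vec = np.append(vec, mw)
    return vec

x = encode_resin(["acid_0", "acid_3", "glycol_1"], mw=12000.0)
```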

Four additional molecular representations were tested: Coulomb matrices (CM)44, smooth overlap of atomic positions (SOAP)45, persistence images (PI)46, and many-body tensor representations (MBTR)47, which are popular non-deep-learning methods for vectorizing molecular structures. To keep comparisons similar to the PolymerGNN trials, we use the same resin features for each respective task that were found to be optimal for prediction with PolymerGNN (see Methods). A kernel ridge regression algorithm is used to predict values from the CM, SOAP, PI, and MBTR representations, as this was found to be the optimal model for learning on these representations in our previous wide-scale analysis.

Figure 3 shows a comparison of distributions of performance metrics across 50 trials of 5-fold cross validation on the dataset using the previously mentioned methods. PolymerGNN outperforms the other methods, yielding a higher R2 score for both Tg and IV prediction tasks and an approximately 0.25 dL/g lower mean absolute error (MAE) when predicting IV. MBTR yields the next-best performance, even outperforming PolymerGNN in MAE for the Tg prediction task. The distributions of PolymerGNN metrics are wider (i.e., have a higher standard deviation) than those of other approaches such as CM and MBTR. This is because the training of neural networks is less stable under cross validation than a method such as kernel ridge regression, which is used to predict values for the other representations.

Fig. 3: Ridgeline plots of performance comparisons across different models for polymer property prediction.
figure 3

Tg results are shown on the top row while IV results are shown on the bottom. In these trials, each task is a singular output; thus, the top row shows results predicting only Tg while the bottom row shows results predicting only IV. The two left plots show the R2 scores while the two right plots show the mean absolute error. Results are sorted by lowest mean MAE across each prediction task. Supplementary Table 13 shows the numerical results of these model comparisons. “PGNN” is an abbreviated notation for the PolymerGNN model.

In order to test the joint prediction model, the molecular representations described in the previous experiments for the Tg and IV model comparison were applied and their performance was evaluated. This modification replaced the “Molecular Embedding” block in Fig. 2 with different molecular representations. The downstream architecture of the joint prediction model was held fixed in order to maintain the joint prediction task across each trial. The results are shown in Fig. 4. The PolymerGNN model ultimately outperforms the other methods across both tasks, producing the highest R2 scores and lowest MAE for both the Tg and IV prediction tasks. However, several embedding techniques perform well on this task, especially MBTR, demonstrating that the proposed model architecture can sufficiently learn both Tg and IV jointly with multiple types of molecular embedding techniques.

Fig. 4: Ridgeline plots of performance comparisons across multiple representations used for the joint learning task of Tg and IV.
figure 4

Tg results are shown on the top row while IV results are shown on the bottom. The two left plots show the R2 scores while the two right plots show the mean absolute error. Results are sorted by lowest mean MAE across each prediction task. Supplementary Table 14 shows the numerical results of these model comparisons. “PGNN” is an abbreviated notation for the PolymerGNN model.

Computational screening of polymers

In order to demonstrate the applicability of PolymerGNN, we screened a virtual database of 1000 materials with variable compositions. We chose isophthalic acid (IPA), terephthalic acid (TPA), adipic acid (AA), 2-methyl-1,3-propanediol (MP Diol), and 1,4-cyclohexanedimethanol (1,4-CHDM) due to their widespread use in polyester materials. In addition, we varied the OHN value in the input vector while the AN was kept fixed (value of 1), effectively varying the molecular weight by changing the stoichiometry of diacids and diols. We train a joint PolymerGNN instance on the entire labeled dataset described in section “Results”; this model is then used to predict Tg and IV values on the large virtual database. This procedure is performed ten separate times on the same dataset in order to provide confidence levels for each sample in the set.
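The per-candidate uncertainty follows from the spread of the repeated, independently trained models; a minimal sketch of this aggregation step, with random numbers standing in for actual model predictions, is shown below.

```python
import numpy as np

# preds: stacked predictions from n_repeats independently trained joint models
# over the virtual library; shape (n_repeats, n_candidates, 2) for (Tg, IV).
# Random values stand in for real model output in this sketch.
n_repeats, n_candidates = 10, 1000
preds = np.random.default_rng(0).normal(size=(n_repeats, n_candidates, 2))

mean_pred = preds.mean(axis=0)                              # per-candidate mean prediction
std_err = preds.std(axis=0, ddof=1) / np.sqrt(n_repeats)    # per-candidate standard error
```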

In Fig. 5a, we plot the results of our screening analysis for Tg and IV predictions. This plot shows the strong correlation between Mw and IV prediction, which is expected based on classical relationships43. Interestingly, PolymerGNN identifies several candidates falling into the high-Tg, low-IV region (top left) of the plot, a region that could be of interest to some niche applications. We provide additional information on the composition of these interesting polymers along with other experimental details in Supplementary Note 7.

Fig. 5: Results of a large-scale screen of PolymerGNN on a computationally generated dataset.
figure 5

a is colored by Mw to accentuate the strong positive correlation of Mw and IV learned by the model. b shows how adipic acid is negatively correlated to Tg; points are colored by standard error to highlight a low-confidence region in the high-adipic acid, high-Tg scenario.

Figure 5b shows the inverse relationship between adipic acid and Tg. As the concentration of flexible monomers such as adipic acid increases in the composition, the polymer backbone requires less thermal energy to move and therefore does not form a glass until lower temperatures. Increasing the concentration of more rigid, stiffer components has the opposite effect. A small region of the plot seems to contradict this largely negative correlation, namely the polymers above 60% adipic acid; these polymers seem to increase in Tg as the percentage of adipic acid increases. However, these samples have a higher standard error relative to the rest of the plot. While the Mw distributions seem to be the same, the OHN values are slightly lower on average for the outliers (see Supplementary Note 7). This discrepancy might be causing some out-of-distribution effects, since OHN typically directly correlates with Mw. It is reasonable to conclude that these samples were simply out-of-distribution relative to the original training data, thus causing the model to predict outside of the expected relationship between adipic acid and Tg. This shows the utility of using the standard error as an uncertainty statistic for predictions in the screen.

Explainability

We examine the attribution scores given to the resin properties for a given material using the Grad-CAM attribution method48. Attribution scores in this context can be interpreted as the relative importance of the input variables, with more positive values indicating greater importance.

In the Tg plot (Fig. 6a), we see that Mw is important for the prediction, but less important than having information on the molecular structure of the acids and glycols. For the IV prediction, Mw has the largest overall attribution of all variables, including acid and glycol embeddings (Fig. 6b). As a result, it is reasonable to conclude that Mw is very important for predicting IV, which matches chemical intuition based on the Mark-Houwink equation43 directly relating IV to Mw and the strong correlation seen in Mw and IV predictions in the computational screening. AN, OHN, and TMP seem to have less of an importance in predicting IV values, which mirrors results seen in the ablation study. This also highlights the fact that although these parameters can be used to calculate a theoretical Mw41,42, additional variables that are difficult to experimentally capture in complex copolymer compositions must be considered (i.e., the presence of additional end-groups beyond COOH and OH and non-statistical distributions of monomers throughout the polymer backbone). Finally, both acid and glycol embeddings are shown to have great importance for both prediction tasks. Glycol embeddings are slightly more important than acid embeddings in the IV prediction task, but both acid and glycol embeddings seem to be equally important for Tg prediction.

Fig. 6: Attribution scores.
figure 6

Attribution scores for the PolymerGNN Tg model (a) and the PolymerGNN IV model (b), computed by Grad-CAM on the central embedding layer of the model. A log scale is used in b to show the separation between components with small attribution scores. Distributions are shown for attributions from every model trained over 50 repetitions of 5-fold cross validation on the dataset.

Discussion

This work proposes PolymerGNN, a general, GNN-based machine learning framework for single-task and multitask learning of polymer properties. PolymerGNN uses as input a graph-based representation of each monomer present in a material, and it provides high accuracy for predicting polymer properties. The model can embed and process an arbitrary number of input monomers as permutation-invariant sets, i.e., the order of molecular inputs is not relevant. In addition, PolymerGNN computes embeddings that are useful in downstream tasks. This benefit is demonstrated in the joint PolymerGNN model, where a single model is trained to predict both Tg and IV with performance metrics close to those of the models trained on a single task. Because of this strong joint prediction performance, representations learned by the model may be transferable to other downstream tasks. Therefore, PolymerGNN could potentially learn properties for which there is limited data (few-shot learning) by using embeddings from a model pre-trained on another task with more abundant data.

For this project, more than 240 polyesters were synthesized, and their properties, such as the glass transition temperature (Tg) and inherent viscosity (IV), were compiled in a database. PolymerGNN demonstrated remarkable accuracy for both properties, independently of whether it was trained on one or both of them. In addition, combining the PolymerGNN architecture with other commonly used molecular representations provided increased performance compared to models that do not use the architecture developed for this project. Further development of PolymerGNN could include the use of self-attention mechanisms49, a useful approach to encode dependencies between monomer inputs. Finally, this type of design is not restricted to polyesters, as described in this work, but can be transferred to the prediction of other types of polymers and properties.

Methods

Polyester resin synthesis

The polyols were produced using either a resin kettle reactor setup via solvent-assisted polycondensation or a resin rig reactor setup via melt polycondensation, both of which were controlled with automated control software.

The solvent-assisted resins were produced on a 3.5 mole scale using a 2 L kettle with overhead stirring and a partial condenser topped with a total condenser and Dean-Stark trap. Approximately 10 wt% (based on reaction yield) of a high-boiling azeotroping solvent (A150 and A150ND) was used to both encourage egress of the water condensate out of the reaction mixture and keep the reaction mixture viscosity at a reasonable level using the standard paddle stirrer. Chemical reagents were added to the kettle, which was then completely assembled. The Fascat 4100 (monobutyltin oxide) catalyst was added via the sampling port after the reactor had been assembled and blanketed with nitrogen for the reaction. Additional A150/A150ND solvent was added to the Dean-Stark trap to maintain the ~10 wt% solvent level in the reaction kettle. The reaction mixture was heated without stirring from room temperature to 150 °C using a set output controlled through the automation system. Once the reaction mixture was fluid enough, stirring was started to encourage even heating of the mixture. At 150 °C, the heating was switched to automated control and the temperature was ramped to 230 °C over the course of 4 h. The reaction was held at 230 °C for 1 h and then heated to 240 °C over the course of 1 h. The reaction was then held at 240 °C and sampled every 1–2 h upon clearing until the desired acid value was reached. The ~90%-solids resins were ground into 6 mm pellets and thoroughly dried in a vacuum oven at 150 °C for 24 h prior to characterization.

The melt polycondensation resins were produced on a 0.5 mole scale. A 500 mL, one-neck, round-bottom flask was carefully charged with all chemical reagents and Fascat 4100 (monobutyltin oxide) catalyst. The flask was equipped with a polymer head adapter with a stainless steel mechanical stirrer and securely clamped to the polymerization rig. To the polymer head, a distillation side arm and Erlenmeyer flask were attached. The automatically controlled vacuum system was attached to the flask side arm to allow for a reduction in pressure of the reaction vessel. The Belmont metal bath was preheated to 20 °C above the recipe starting temperature (180 °C). The apparatus was subjected to two iterations of a nitrogen (N2) purge to remove oxygen and then dunked into the metal bath to begin. The flask was held at 180 °C for 10 min to melt the starting materials, and then stirring was started to encourage even heating of the mixture. The flask was heated to 240 °C over 4 h and then held there for an additional hour. The pressure in the reaction flask was reduced to 1.5 torr over 45 min, and the reaction was then held at 1.5 torr until the final acid value was reached. Dry ice was used to ensure that the solvent traps were sufficiently cold to prevent any solvent/organic matter from reaching the vacuum pump. After completion, the flask was slowly brought back to atmospheric pressure and removed from the hot metal bath. Upon solidification, the polymer was pulled from the round-bottom flask by partially melting the edges; the glass flask was then broken with a hammer to give the solid polymer ‘lollipop’ on the stir rod. The polymer was cooled in dry ice, removed from the stir rod, and ground into 6 mm pellets prior to characterization.

Polyester resin characterization

The acid number (AN) was determined using colorimetric titration in pyridine with phenolphthalein indicator and 0.1 N KOH titrant administered with an auto-dispensing titrator. The hydroxyl number (OHN) was determined via 1H NMR end-group analysis on a Bruker 500 MHz spectrometer or by reaction of the hydroxyl groups with p-toluenesulfonyl isocyanate and subsequent potentiometric titration of the acid carbamate product. The OHN results obtained were then corrected for the contributing acid number. The inherent viscosity (IV) of all polymers was determined in 0.5 wt% PM 95 (60/40 phenol/1,1,2,2-tetrachloroethane) solution at 25 °C. Molecular weights were determined by gel permeation chromatography (GPC) with a 95/5 methylene chloride/HFIP mobile phase and calibration curves from polystyrene standards. The monomer composition was determined via gas chromatography (GC) after hydrolysis. The glass transition temperature (Tg) was determined using differential scanning calorimetry (DSC) at a 20 °C/min ramp rate with a N2 sweep. The Tg was based on second-heat thermograms.

Linear polyesters typically have a polydispersity index (PDI) between 1.5 and 2.5, and branched polyesters can have a much wider range depending on how much branching agent is added, the degree of polymerization, and other factors. Our polyester dataset includes both linear and branched resins. IV is affected by both resin composition and molecular weight, with Mw affecting IV more strongly than the number-average molecular weight Mn. As Mw increases, so typically does IV. In our dataset, increased PDI is usually associated with increased Mw due to branching, and this consequently increases the IV. Similarly, Tg increases as Mw and IV increase and begins to plateau at higher IVs and Mws. These relationships can be quantified for specific polyester compositions, but they become very complicated in complex datasets such as the one described in this paper, and therefore machine learning can help make useful predictions in this design space.

Quantum chemical calculations

The individual monomers for each resin were optimized using metadynamics to sample the conformational space of the monomer. The Conformer-Rotamer Ensemble Sampling Tool (CREST)50 was utilized to sample the conformational space of each monomer and generate a list of minimum-energy conformations using semiempirical density functional tight binding (DFTB). Tight convergence criteria were used for both the geometry optimizations and the self-consistent-field cycles of DFTB. The lowest-energy conformer generated was then used as the input structure for the different molecular representations.

Graph neural network

We will establish some preliminaries for graph neural networks. Let G = (V, E) be a graph with node set V and edge set E. If G is a molecular graph, we consider V to consist of all atoms in the molecule and E to comprise all bonds between those atoms. Indeed, if vi and vj are atoms in V, a bond between them is denoted by the edge (vi, vj) ∈ E. It is also useful to define the neighborhood of a node vi, denoted \({{{\mathcal{N}}}}({v}_{i})\). The neighborhood is the set of all nodes connected to vi by an edge, i.e., \({{{\mathcal{N}}}}({v}_{i})=\{{v}_{j}\,| \,({v}_{i},{v}_{j})\in E\}\). For each vi ∈ V, we have a d-dimensional feature vector xi. The collection of node features for all n nodes in a graph is denoted by \({{{\mathcal{X}}}}=\{{x}_{1},...,{x}_{n}\},{x}_{i}\in {{\mathbb{R}}}^{d}\), and may include atomic properties such as atomic charge, atomic mass, or other scalar properties associated with each atom in the molecule. In PolymerGNN, we use six properties: the charge, degree, mass, aromaticity (a Boolean variable indicating whether the atom is found within an aromatic portion of the molecule), the explicit number of hydrogen atoms bonded to the atom, and the number of valence electrons. All features are extracted automatically using RdKit51. In a similar manner to node features, edge features can also be introduced into the graph construction; however, we omit edge features in this work since preliminary benchmarking showed no empirical boost in performance. This formulation allows us to treat a molecular structure as a graph to which we can apply graph machine learning algorithms and methods, namely graph neural networks.
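As an illustration, the node features and bond list can be constructed with RDKit roughly as follows; the exact accessors used in the original implementation may differ.

```python
from rdkit import Chem

def atom_features(atom, ptable=Chem.GetPeriodicTable()):
    return [
        atom.GetFormalCharge(),                      # charge
        atom.GetDegree(),                            # degree
        atom.GetMass(),                              # atomic mass
        float(atom.GetIsAromatic()),                 # aromaticity flag
        atom.GetNumExplicitHs(),                     # explicit hydrogens
        ptable.GetNOuterElecs(atom.GetAtomicNum()),  # valence electrons
    ]

mol = Chem.MolFromSmiles("OCC(CO)(CO)CC")            # e.g., trimethylolpropane
node_feats = [atom_features(a) for a in mol.GetAtoms()]
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
```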

A graph neural network (GNN) is a machine learning algorithm that learns embeddings of nodes within a graph. These so-called node-level embeddings can be combined into a graph-level embedding that represents the entire graph G. Graph-level embeddings can be used in downstream prediction tasks, such as predicting Tg or IV. We specifically focus on graph convolutional neural networks, thus when mentioning the term “GNN” in this work, it is assumed that a graph convolutional neural network is being discussed.

A GNN model consists of iterations of AGGREGATE-COMBINE steps that update the representation of nodes by aggregating information from the local topology of the graph. We denote \({h}_{i}^{(l)}\) as the representation of node vi at layer l of the network. As an initial step, we set \({h}_{i}^{(0)}={x}_{i}\). Each layer l then performs the following functions to obtain the successive layer's node embeddings: \({a}_{i}^{(l)}={{{{\rm{AGGREGATE}}}}}^{(l)}(\{{h}_{j}^{(l-1)}\,| \,{v}_{j}\in {{{\mathcal{N}}}}({v}_{i})\})\), such that \({h}_{i}^{(l)}={{{{\rm{COMBINE}}}}}^{(l)}({h}_{i}^{(l-1)},{a}_{i}^{(l)})\). Intuitively, the AGGREGATE and COMBINE functions mix information between neighboring atoms within the molecule. Different GNN layers introduce variations to the AGGREGATE and COMBINE functions. Common functions for the AGGREGATE step are MEAN and MAX, while COMBINE is commonly performed by a single, fully connected neural network, as in refs. 34, 35, and 52. To produce a graph-level embedding, hG, a READOUT function is used to pool all node representations from the graph, i.e., \({h}_{G}={{{\rm{READOUT}}}}(\{{h}_{i}^{(L)}\,| \,{v}_{i}\in V\}).\) After the READOUT operation, hG is guaranteed to be a constant-size vector regardless of the size of G. READOUT can be performed by a simple, permutation-invariant function such as MEAN or MAX, or by more advanced pooling methods39,53. In this work, we utilize the self-attention pooling mechanism39. After the READOUT function is applied, the resulting hG should contain information about the entire graph in question, making this representation useful in downstream prediction tasks.
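A minimal, library-free sketch of one AGGREGATE-COMBINE layer with a MEAN aggregator and a MAX readout illustrates the pattern; the actual PolymerGNN layers (GAT, GraphSAGE, and self-attention pooling) are parameterized, learnable versions of these steps, and the toy graph and identity weights below are placeholders.

```python
import numpy as np

def gnn_layer(H, neighbors, W_self, W_agg):
    # One AGGREGATE (mean over neighbors) + COMBINE (linear maps + ReLU) step.
    H_next = np.zeros_like(H @ W_self)
    for i in range(H.shape[0]):
        agg = H[neighbors[i]].mean(axis=0) if neighbors[i] else np.zeros(H.shape[1])
        H_next[i] = np.maximum(0.0, H[i] @ W_self + agg @ W_agg)
    return H_next

def readout(H):
    return H.max(axis=0)   # permutation-invariant graph-level embedding

# Toy graph: 3 atoms in a chain, 4-dimensional input features.
H0 = np.random.default_rng(0).normal(size=(3, 4))
neighbors = {0: [1], 1: [0, 2], 2: [1]}
W_self = W_agg = np.eye(4)
h_G = readout(gnn_layer(H0, neighbors, W_self, W_agg))
```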

Loss function

Both the Tg and IV tasks utilize the mean squared error (MSE) loss function to train the networks. The joint model was trained using the following loss function, \(L=\gamma {L}_{IV}+{L}_{{T}_{g}},\) where LIV is the MSE of the IV prediction with respect to the true IV value and \({L}_{{T}_{{{{\rm{g}}}}}}\) is the MSE of the Tg prediction with respect to the true Tg value. The constant γ serves as a weighting factor that scales the IV loss in proportion to how strongly the model should prioritize learning IV-relevant features. The IV and Tg targets have different units and scales: Tg values are much larger than IV values, so the MSE for Tg would dominate the MSE for IV even if the performance were equal on both tasks. Therefore, we set γ to a large arbitrary value (10,000 herein) to offset this effect of units in the joint learning problem. Another rationale for a large γ is that it prioritizes the IV task, which is more difficult to learn based on previous trials.
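A minimal sketch of this joint loss, with random tensors standing in for model outputs and targets, is:

```python
import torch

mse = torch.nn.MSELoss()
gamma = 1.0e4  # weighting factor for the IV term, as described above

def joint_loss(pred_tg, pred_iv, true_tg, true_iv):
    return gamma * mse(pred_iv, true_iv) + mse(pred_tg, true_tg)

# Example with a batch of 8 placeholder predictions and targets.
loss = joint_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
```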