polyBERT: a chemical language model to enable fully machine-driven ultrafast polymer informatics

Polymers are a vital part of everyday life. Their chemical universe is so large that it presents unprecedented opportunities as well as significant challenges to identify suitable application-specific candidates. We present a complete end-to-end machine-driven polymer informatics pipeline that can search this space for suitable candidates at unprecedented speed and accuracy. This pipeline includes a polymer chemical fingerprinting capability called polyBERT (inspired by Natural Language Processing concepts), and a multitask learning approach that maps the polyBERT fingerprints to a host of properties. polyBERT is a chemical linguist that treats the chemical structure of polymers as a chemical language. The present approach outstrips the best presently available concepts for polymer property prediction based on handcrafted fingerprint schemes in speed by two orders of magnitude while preserving accuracy, thus making it a strong candidate for deployment in scalable architectures including cloud infrastructures.

Polymers are an integral part of our everyday life and instrumental in the progress of technologies for future innovations. 1 The sheer magnitude and diversity of the polymer chemical space provide opportunities for crafting polymers that accurately match application demands, yet also come with the challenge of efficiently and effectively browsing this gigantic space. The nascent field of polymer informatics [2][3][4][5] allows access to the depth of the polymer universe and demonstrates the potency of machine learning (ML) models to overcome this challenge. ML frameworks have enabled substantial progress in the development of polymer property predictors [6][7][8][9][10] and solving inverse problems in which polymers that meet specific property requirements are either identified from candidate sets, 11,12 or are freshly designed using genetic 13,14 or generative [15][16][17] algorithms.
An essential step in polymer informatics pipelines is the conversion of polymer chemical structures to numerical representations that are often called fingerprints, features, or descriptors (see blue boxes in Figure 1a). Past and current handcrafted fingerprinting approaches [20][21][22][23][24] utilize cheminformatics tools that numerically encode key chemical and structural features of polymers. Although such handcrafted fingerprints build on invaluable intuition and experience, they are tedious to develop, involve complex computations that often consume most of the time during model training and inference, and lack generalization to all polymer chemical classes (i.e., new features may have to be added to the catalog of features in an ad hoc manner). ML pipelines that use handcrafted fingerprints are thus prone to errors during the exploration of new polymer chemical classes. Also, handcrafted fingerprints present barriers for the development and deployment of fully machine-driven informatics pipelines, which are suited for scalability in cloud computing and high-throughput environments.
The present contribution overcomes the previously mentioned limitations by replacing handcrafted fingerprints with fully machine-crafted "Transformer" fingerprints (see right pipeline of Figure 1a). Transformers 25 were recently developed in the field of Natural Language Processing (NLP) and have swiftly become the gold standard in ML language modeling. In this work, we envision simplified molecular-input line-entry system (SMILES) 18 strings, which have been used to represent polymers, as the "chemical language" of polymers.
We use millions of polymer SMILES (PSMILES) strings for training a language model called polyBERT to become an expert -a linguist -of the polymer chemical language. In combination with multitask deep neural networks, 6,7 polyBERT enables a fully end-to-end machine-driven polymer informatics pipeline that uses and unleashes the true power of artificial intelligence methods. Multitask deep neural networks harness inherent correlations in multifidelity and multi-property data sets, scale effortlessly in cloud computing environments, and generalize to multiple prediction tasks.
Recent studies [26][27][28] demonstrated the benefits of using Transformers in the molecule chemical space. For example, Wang et al. 27 have trained a BERT model 29 (the most common general language model) with a data set of molecule SMILES strings. Using BERT's latent space representations of molecules as fingerprints, the authors show that their approach outperforms other fingerprinting methods (including fingerprints of an unsupervised recurrent neural network and a graph-based neural network). Similar Transformer-based approaches have been reported by Schwaller et al. 28 In this work, we map the polyBERT fingerprints to about three dozen polymer properties using our multitask ML framework to yield fully machine-driven ultrafast polymer property predictors.
For benchmarking, the performance (both accuracy and speed) of this new end-to-end property prediction pipeline is compared with the state-of-the-art handcrafted Polymer Genome 8 (PG) fingerprint based pipeline pioneered previously. Using the ultrafast polyBERT polymer informatics pipeline, we are in a position to predict the properties of the 100 million hypothetical polymers in order to find property boundaries of the polymer universe. This work contributes to expediting the discovery, design, development, and deployment of polymers by harnessing the true power of language, data, and artificial intelligence models.

Results
Data Sets

Figure 1c outlines the creation of the 100 million hypothetical PSMILES strings. Once polyBERT has completed its unsupervised learning task using the 100 million hypothetical PSMILES strings, multitask supervised learning maps polyBERT polymer fingerprints to multiple properties to produce property predictors. We use the property data set in Table 1 for training the property predictors. The data set contains 28 061 (≈80 %) homopolymer and 7 456 (≈20 %) copolymer (total of 35 517) data points of 29 experimental and computational polymer properties that pertain to 11 145 different monomers and 1 338 distinct copolymer chemistries, respectively. Each of the 7 456 copolymer data points involves two distinct comonomers at various compositions. All data points in the data set have been used in past studies 6,7,11,[34][35][36][37][38][39][40][41] and were produced using computational methods or obtained from literature and other public sources. Supplementary Figures S3-S8 provide further details of the data set.

Third, polyBERT masks 15 % (the default parameter for masked language models) of the tokens to create a self-supervised training task. In this training task, polyBERT is taught to predict the masked tokens using the non-masked surrounding tokens by adjusting the weights of the Transformer encoders (fill-in-the-blanks task). We use 80 million PSMILES strings for training and 20 million PSMILES strings for validation. The validation F1-score is > 0.99. This exceptionally good F1-score indicates that polyBERT finds the masked tokens in almost all cases. The total CO 2 emissions for training polyBERT on our hardware are estimated to be 12.6 kgCO 2 eq (see CO 2 Emission section).
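The masking step of the fill-in-the-blanks task can be illustrated with a minimal sketch. The token list, mask symbol, and random seed below are illustrative only; polyBERT uses SentencePiece tokens and the standard masked-language-model recipe from the DeBERTa/Transformers stack:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a fraction of tokens with a mask token, as in
    masked-language-model pretraining (fill-in-the-blanks task)."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))  # mask ~15 % of tokens
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)

# Illustrative token sequence for the PSMILES string [*]CC([*])c1ccncc1
tokens = ["[*]", "C", "C", "(", "[*]", ")",
          "c", "1", "c", "c", "n", "c", "c", "1"]
masked, positions = mask_tokens(tokens)
```

During training, the model sees `masked` and is optimized to recover the original tokens at `positions` from the surrounding context.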
The training with 80 million PSMILES strings renders polyBERT an expert polymer chemical linguist who knows the grammatical and syntactical rules of the polymer chemical language. polyBERT learns patterns and relations of tokens via the multi-head self-attention mechanism and fully connected feed-forward network of the Transformer encoders. 25 The attention mechanism instructs polyBERT to devote more focus to a small but essential part of a PSMILES string. polyBERT's learned latent spaces after each encoder block are numerical representations of the input PSMILES strings. The polyBERT fingerprint is the average over the token dimension (sentence average) of the last latent space (dotted line in Figure 1b). We use the Python package SentenceTransformers 44 for this pooling operation.

Fingerprints

For acquiring analogies and juxtaposing chemical relevancy, we compare polyBERT fingerprints with the handcrafted Polymer Genome 8 (PG) fingerprints that numerically encode polymers at three different length scales. A description of PG fingerprints can be found in the Methods section. The PG fingerprint vector for the data set in this work has 945 components and is sparsely populated (93.9 % zeros). The reason for this ultra sparsity is that many PG fingerprint components count chemical groups in polymers. 8
A fingerprint component of zero indicates that a chemical group is not present. In contrast, polyBERT fingerprint vectors have 600 components and are fully dense (0 % zeros). Fully dense and lower-dimensional fingerprints are often advantageous for ML models whose computation time scales superlinearly (O(n s ), s > 1) with the data set size (n), such as Gaussian process or kernel ridge techniques. Moreover, in the case of neural networks, sparse and high-dimensional input vectors can cause an unnecessarily high memory load that reduces training and inference speed. We note that the dimensionality of polyBERT fingerprints is a parameter that can be chosen arbitrarily to yield the best training result. A summary of the key figures can be found in Supplementary Table S2.
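The sentence-average (mean-pooling) step that turns the last latent space into a fingerprint can be sketched in a few lines. The dimensions below are toy values; actual polyBERT fingerprints have 600 components:

```python
def mean_pool(token_embeddings):
    """Average over the token dimension: an (n_tokens x dim) latent
    space becomes a single dim-dimensional fingerprint vector."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(row[j] for row in token_embeddings) / n_tokens
            for j in range(dim)]

# Toy last-encoder output for a 3-token "sentence" in a 4-dimensional space
latent = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 2.0, 0.0],
    [2.0, 4.0, 2.0, 2.0],
]
fingerprint = mean_pool(latent)  # [2.0, 2.0, 2.0, 2.0]
```

The resulting vector has the same dimensionality regardless of the input string length, which is what makes the pooled representation usable as a fixed-size fingerprint.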
Figure 2 shows uniform manifold approximation and projection (UMAP) 45 plots for all homo- and copolymer chemistries in Table 1. The colored triangles in the first column indicate the coordinates of three selected polymers for polyBERT and PG fingerprints. We observe for both fingerprint types that the orange and blue triangles are very close, while the green triangle is separate. We also note that polymers corresponding to the orange and blue
Figure 4a shows the color-encoded five-fold cross-validation coefficient of determination (R 2 ) averages across the five validation data sets for 29 polymer properties. Root-mean-square errors (RMSEs) can be found in Supplementary Table S1. Overall, PG performs best (R 2 = 0.81) but is very closely followed by polyBERT (R 2 = 0.80). This overall performance order of the fingerprint types is consistent across the category averages and properties, except for X c , X e , and ϵ b , where polyBERT slightly outperforms PG fingerprints. We note that polyBERT and PG fingerprints are both practical routes for polymer featurization because their R 2 values lie close together and are generally high. polyBERT fingerprints have the accuracy of the handcrafted PG fingerprints but are over two orders of magnitude faster (see Figure 3).
Figure 4b shows high R 2 values for each meta learner (one for each category), suggesting an exceptional prediction performance across all properties. We train the meta learners on an unseen 20 % of the data set and validate using 80 % of the data set (also used for cross-validation). The reported validation R 2 values thus only partly measure the generalization performance with respect to the full data set. Meta learners can be conceived as taking decisive roles in selecting the best values from the predictions of the five cross-validation models. We use the meta learners for all property predictions in this work. Supplementary Figures S9-S14 show the meta learners' parity plots.
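The ensemble protocol can be sketched schematically: the five cross-validation models each produce a prediction, and the meta learner consolidates them into a final value. In the sketch below a (possibly weighted) average stands in for the trained meta learner, which in the actual pipeline is a neural network:

```python
def meta_predict(cv_predictions, weights=None):
    """Consolidate the predictions of the five cross-validation models.
    The trained meta learner selects/weights the best values; here a
    simple weighted average serves as a stand-in."""
    if weights is None:
        weights = [1.0 / len(cv_predictions)] * len(cv_predictions)
    return sum(w * p for w, p in zip(weights, cv_predictions))

# Five cross-validation predictions of, e.g., a glass transition
# temperature in K (toy values)
cv_preds = [401.0, 405.0, 398.0, 402.0, 404.0]
final = meta_predict(cv_preds)  # ≈ 402.0
```

The stand-in preserves the key design property: the final prediction is a function of the full cross-validation ensemble rather than of any single fold's model.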
The ultrafast and accurate polyBERT-based polymer informatics pipeline allows us to predict all 29 properties of the 100 million hypothetical polymers that were originally created to train polyBERT. Figure 4c shows the minimum, mean, and maximum for each property.
Histograms are given in Supplementary Figures S17-S22. Given the vast size of our data set and the consequent chemical space of the 100 million hypothetical polymers, the minimum and maximum values can be interpreted as potential boundaries of the total polymer property space. The data set with 100 million hypothetical polymers including the predictions of 29 properties is available for academic use. The total CO 2 emissions for predicting 29 properties of 100 million hypothetical polymers are estimated to be 5.5 kgCO 2 eq (see CO 2 Emission section).

Other Advantages of polyBERT: Beyond Speed and Accuracy
The feed-forward network (last layer in Figure 1b), which predicts masked tokens during the self-supervised training of polyBERT, enables the mapping of numerical latent spaces (i.e., fingerprints) to PSMILES strings. However, because we average over the token dimension of the last latent space to compute fingerprints, we cannot unambiguously map the current polyBERT fingerprints back to PSMILES strings. A modified future version of polyBERT that provides PSMILES string encoding and fingerprint decoding could involve inserting a dimensionality-reducing layer after the last Transformer encoder. Fingerprint decoders are important elements of design informatics pipelines that invert the prediction pipeline to meet property specifications. We note that the current choice of computing polyBERT fingerprints as pooling averages stems from basic dimensionality reduction considerations that require no modification of the Transformer architecture.
A second advantage of the polyBERT approach is interpretability. Analysing the chemical relevancy of polyBERT fingerprints (as discussed in the Fingerprints section) in greater detail can reveal chemical functions and interactions of structural parts of the polymers.
As shown for trained NLP Transformers, 46 deciphering and visualizing the attention layers of the Transformer encoders can reveal such information.Saliency methods 47 may explain the relationships between structural parts of the PSMILES strings (inputs) and polymer properties (outputs).

Discussion
Here, we show a generalizable, ultrafast, and accurate polymer informatics pipeline that is seamlessly scalable on cloud hardware and suitable for high-throughput screening of huge polymer spaces. polyBERT, which is a Transformer-based NLP model modified for the polymer chemical language, is the critical element of our pipeline. After training on 100 million hypothetical polymers, the polyBERT-based informatics pipeline arrives at a representation of polymers and predicts polymer properties over two orders of magnitude faster but at the same accuracy as the best pipeline based on handcrafted PG fingerprints.
The accurate prediction of 29 properties for 100 million hypothetical polymers in a reasonable time demonstrates that polyBERT is an enabler of extensive explorations of the polymer universe at scale. polyBERT paves the pathway for the discovery of novel polymers 100 times faster (and potentially even faster with newer GPU generations) than state-of-the-art informatics approaches -but at the same accuracy as slower handcrafted fingerprinting methods -by leveraging Transformer-based ML models originally developed for NLP. polyBERT fingerprints are dense and chemically pertinent numerical representations of polymers that adequately measure polymer similarity. They can be used for any polymer informatics task that requires numerical representations of polymers, such as property predictions (demonstrated here), polymer structure predictions, ML-based synthesis assistants, etc. polyBERT fingerprints have a huge potential to accelerate past polymer informatics pipelines by replacing the handcrafted fingerprints with polyBERT fingerprints. polyBERT may also be used to directly design polymers based on fingerprints (that can be related to properties) using polyBERT's decoder that has been trained during the self-supervised learning. This, however, requires retraining and structural updates to polyBERT and is thus part of a future work.

Methods
Polymer Fingerprinting

Fingerprinting converts geometric and chemical information of polymers (based upon the PSMILES string) to machine-readable numerical representations in the form of vectors. These vectors are the polymer fingerprints and can be used for property predictions, similarity searches, or other tasks that require numerical representations of polymers.
We compare the polyBERT fingerprints, developed in this work, with the handcrafted Polymer Genome (PG) polymer fingerprints. PG fingerprints capture key features of polymers at three hierarchical length scales. 8,22 At the atomic scale (1 st level), PG fingerprints track the occurrence of a fixed set of atomic fragments (or motifs). 23 The block scale (2 nd level) uses the quantitative structure-property relationship (QSPR) fingerprints 20,36 for capturing features on larger length-scales as implemented in the cheminformatics toolkit RDKit. 33 The chain scale (3 rd level) fingerprint components deal with "morphological descriptors" such as the ring distance or the length of the largest side-chain. 36 The PG fingerprints were developed within the Ramprasad research group and are used, for example, at https://PolymerGenome.org. More details can be found in References 8 and 36.
As discussed recently, 6,11 we sum the composition-weighted comonomer fingerprints to compute copolymer fingerprints, F = Σ i c i F i , where the sum runs over the N comonomers in the copolymer, F i is the i th comonomer fingerprint, and c i the fraction of the i th comonomer.
This approach renders copolymer fingerprints invariant to the order in which one may sort the comonomers and satisfies the two main demands of uniqueness and invariance to different (but equivalent) periodic unit specifications. Contrary to homopolymer fingerprints, copolymer fingerprints may not be interpretable (e.g., the composition-weighted sum of the fingerprint component "length of largest side-chain" of two homopolymers has no physical meaning).

All parameters of the neural networks, such as the number of layers, number of nodes, dropout rates, and activation functions, are optimized using the Hyperband method 48 of the Python package KerasTuner. 49 The multitask deep neural networks are implemented using the Python API of TensorFlow. 50
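The composition-weighted sum F = Σ i c i F i and its order invariance can be sketched directly. The fingerprint vectors and compositions below are toy values, not real polyBERT or PG fingerprints:

```python
def copolymer_fingerprint(comonomer_fps, fractions):
    """Composition-weighted sum of comonomer fingerprints:
    F = sum_i c_i * F_i. Invariant to the comonomer ordering."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    dim = len(comonomer_fps[0])
    return [sum(c * fp[j] for c, fp in zip(fractions, comonomer_fps))
            for j in range(dim)]

fp_a = [1.0, 0.0, 2.0]   # fingerprint of comonomer A (toy values)
fp_b = [0.0, 4.0, 2.0]   # fingerprint of comonomer B (toy values)

# 30/70 copolymer of A and B; reversing the comonomer order with the
# matching fractions yields the identical fingerprint
f_30_70 = copolymer_fingerprint([fp_a, fp_b], [0.3, 0.7])  # ≈ [0.3, 2.8, 2.0]
f_70_30 = copolymer_fingerprint([fp_b, fp_a], [0.7, 0.3])
```

Note how a component such as "length of largest side-chain" loses its literal meaning under this weighted sum, as stated above.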

CO 2 Emission
Experiments were conducted using a private infrastructure, which has an estimated carbon efficiency of 0.432 kgCO 2 eq kWh −1 . A cumulative of 31 hours of computation was performed on four Quadro-GP100-16GB GPUs (thermal design power of 235 W) for training polyBERT. Total emissions are estimated to be 12.6 kgCO 2 eq. The total emissions for predicting 29 properties for 100 million hypothetical polymers are estimated to be 5.5 kgCO 2 eq. Estimations were conducted using the Machine Learning Impact calculator presented in Reference 51.
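The training estimate follows directly from runtime × GPU count × power × carbon efficiency, and can be reproduced in a few lines:

```python
def co2_emissions_kg(hours, n_gpus, tdp_watts, kgco2_per_kwh):
    """Estimate emissions from runtime, GPU count, thermal design
    power, and grid carbon efficiency (kgCO2eq per kWh)."""
    energy_kwh = hours * n_gpus * tdp_watts / 1000.0
    return energy_kwh * kgco2_per_kwh

# 31 h on four 235 W GPUs at 0.432 kgCO2eq/kWh
emissions = co2_emissions_kg(31, 4, 235, 0.432)  # ≈ 12.6 kgCO2eq
```

This simple power-draw model matches the Machine Learning Impact calculator's assumption that the GPUs run at their thermal design power for the full duration.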

Data and Code Availability
The polyBERT code and the data set of 100 million hypothetical polymers with the predictions of 29 properties are available for academic use at https://github.com/Ramprasad-Group/polyBERT. The trained polyBERT model is available at https://huggingface.co/kuelumbus/polyBERT. The Python package for canonicalizing PSMILES strings is available at https://github.com/Ramprasad-Group/canonicalize_psmiles. polyBERT-based property predictions will be made accessible through the polymer informatics platform Polymer Genome at https://PolymerGenome.org.

Declaration of Interests
R.R. is the founder of the company Matmerize, Inc., which intends to provide polymer informatics services. A provisional patent on polyBERT has been filed by R.R. and C.K.

Figure 1 :
Figure 1: Polymer informatics with polyBERT. a Prediction pipelines. The left pipeline shows the prediction using handcrafted fingerprints from cheminformatics tools, while the right pipeline (present work) portrays a fully end-to-end machine-driven predictor using polyBERT. Property symbols are defined in Table 1. ID1 and ID3 are copolymers, and ID2 is a homopolymer. b polyBERT is a polymer chemical language linguist. polyBERT canonicalizes, tokenizes, and masks polymer SMILES 18 (PSMILES) strings before passing them to the DeBERTa model. A last dense layer with a softmax activation function finds the masked tokens. polyBERT fingerprints (dashed arrow) are the averages over the token dimension (sentence average) of the last Transformer encoder. c 100 million hypothetical PSMILES strings. First, 13 766 known (i.e., previously synthesized) polymers are decomposed into 4 424 fragments using the BRICS 19 method. Second, re-assembling the BRICS fragments in many different ways generates 100 million hypothetical polymers by randomly and enumeratively combining the fragments.
polyBERT

polyBERT iteratively ingests 100 million hypothetical PSMILES strings to learn the polymer chemical language, as sketched in Figure 1b. polyBERT is a DeBERTa 32 model (as implemented in Huggingface's Transformers Python library 42 ) with a supplementary three-stage preprocessing unit for PSMILES strings. First, polyBERT transforms an input PSMILES string into its canonical form (e.g., [*]CCOCCO[*] to [*]COC[*]) using the canonicalize_psmiles Python package developed in this work. Details can be found in the Methods section. Second, polyBERT tokenizes canonical PSMILES strings using the SentencePiece 43 tokenizer. The tokens are frequent patterns in PSMILES strings and are determined in a pretraining process of SentencePiece with the 100 million hypothetical PSMILES strings.
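A minimal stand-in for the tokenization stage can be sketched with a regular expression. polyBERT's actual tokens come from a pretrained SentencePiece vocabulary of frequent PSMILES patterns; the character-level tokenizer below is only an illustration of splitting a PSMILES string into chemically meaningful symbols:

```python
import re

# Illustrative atom-level tokenizer; NOT the SentencePiece vocabulary.
# Matches the [*] endpoints, bracket atoms, two-letter elements, then
# single-letter atoms, bonds, branches, and ring-closure digits.
PSMILES_TOKEN = re.compile(
    r"\[\*\]|\[[^\]]+\]|Br|Cl|Si|se|[BCNOPSFIbcnops]|[=#()\-+/\\@%0-9]"
)

def tokenize(psmiles):
    """Split a PSMILES string into symbol-level tokens."""
    return PSMILES_TOKEN.findall(psmiles)

tokens = tokenize("[*]CC([*])c1ccncc1")
# ['[*]', 'C', 'C', '(', '[*]', ')',
#  'c', '1', 'c', 'c', 'n', 'c', 'c', '1']
```

A learned SentencePiece tokenizer would merge frequent multi-symbol patterns (e.g., whole ring fragments) into single tokens, which is what gives polyBERT its compact vocabulary.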

Figure 2 :
Figure 2: Two-dimensional UMAP 45 plots for polyBERT and Polymer Genome fingerprints and all homo- and copolymer chemistries in Table 1. The triangles (blue, orange, and green) in the first column indicate fingerprint positions in the UMAP spaces of three selected polymers. The colored dots in columns two, three, and four indicate property values of T g , T d , and E gc , which stand for the glass transition temperature, degradation temperature, and band gap (chain), respectively. Light gray dots show polymers with unknown property values. The PSMILES strings [*]CC([*])CC, [*]CC([*])CCC, and [*]CC([*])c1ccncc1 denote poly(but-1-ene), poly(pent-1-ene), and poly(4-vinylpyridine), respectively.
triangles, namely poly(but-1-ene) and poly(pent-1-ene), have similar chemistry (differing by only one carbon atom), but poly(4-vinylpyridine), represented by a green triangle, is different. This chemically intuitive positioning of fingerprints suggests the chemical relevancy of fingerprint distances. The cosine fingerprint distances reported in Supplementary Figure S1 allow for the same conclusion. The second, third, and fourth columns of Figure 2 display the same UMAP plots as in the first column. Colored dots indicate the property values of T g , T d , and E gc , while light gray dots show polymer fingerprints with unknown property values. We observe localized clusters of similar color in each plot pertaining to polymers of similar properties. Although this finding is not surprising for the PG fingerprint, because it relies on handcrafted chemical features that purposely position similar polymers next to each other, it is remarkable for polyBERT. With no chemical information and purely based on training on a massive amount of PSMILES strings, polyBERT has learned polymer fingerprints that match chemical intuition. This again shows that polyBERT fingerprints have chemical pertinence and that their distances measure polymer similarity (e.g., using the cosine distance metric).
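The cosine distance used for these similarity comparisons can be computed directly; the three-component vectors below are toy fingerprints, not actual polyBERT outputs:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 for identical directions,
    up to 2 for exactly opposite vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy fingerprints: chemically similar polymers should yield a small
# cosine distance, dissimilar polymers a large one
fp_butene   = [0.9, 0.1, 0.3]
fp_pentene  = [0.8, 0.2, 0.3]
fp_pyridine = [0.1, 0.9, 0.7]

d_close = cosine_distance(fp_butene, fp_pentene)
d_far   = cosine_distance(fp_butene, fp_pyridine)  # d_close < d_far
```

Because the cosine distance depends only on vector direction, it is insensitive to overall fingerprint magnitude, which makes it a convenient similarity metric for dense embeddings.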

PSMILES Canonicalization

The string representations of homopolymer repeat units in this work are polymer SMILES (PSMILES) strings. PSMILES strings follow the SMILES 18 syntax definition but use two stars to indicate the two endpoints of the polymer repeat unit (e.g., [*]CC[*] for polyethylene). The raw PSMILES syntax is non-unique; i.e., the same polymer may be represented using many PSMILES strings; canonicalization is a scheme to reduce the different PSMILES strings of the same polymer to a single unique canonicalized PSMILES string. polyBERT requires canonicalized PSMILES strings because polyBERT fingerprints change with different writings of PSMILES strings. In contrast, PG fingerprints are invariant to the way of writing PSMILES strings and, thus, do not require canonicalization. Figure 5 shows three variances of PSMILES strings that leave the polymer unchanged. The translational variance of PSMILES strings allows one to move the repeat unit window of polymers (cf. the white and red boxes). The multiplicative variance permits writing polymers as multiples of the repeat unit (e.g., the two-fold repeat unit of Nylon 6), while the permutational variance stems from the SMILES syntax definition 18 and allows syntactical permutations of PSMILES strings that leave the polymer unchanged. For this work, we developed the canonicalize_psmiles Python package that finds the canonical form of PSMILES strings in four steps: (i) it finds the shortest PSMILES string by searching and removing repetition patterns, (ii) it connects the polymer endpoints to create a periodic PSMILES string, (iii) it canonicalizes the periodic PSMILES string using RDKit 33 's canonicalization routines, and (iv) it breaks the periodic PSMILES string to create the canonical PSMILES string.

Figure 5 :
Figure 5: Translational, multiplicative, and permutational variances of PSMILES strings. The gray and red boxes represent the smallest repeat unit of poly(hexano-6-lactam) (Nylon 6). The red box can be translated to match the gray box. The dashed boxes show the second smallest repeat unit (two-fold repeat unit) of Nylon 6.
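Step (i) of this canonicalization, removing multiplicative repetition, can be sketched for simple linear backbones. This is a toy stand-in: the actual canonicalize_psmiles package also performs steps (ii)-(iv) with RDKit and handles rings, branches, and multi-character element symbols, none of which the sketch below supports:

```python
def shortest_repeat_unit(psmiles):
    """Sketch of step (i): remove multiplicative repetition from a
    linear, ring- and branch-free PSMILES backbone. Steps (ii)-(iv)
    (building the periodic string, RDKit canonicalization, and
    re-breaking it) are omitted here."""
    assert psmiles.startswith("[*]") and psmiles.endswith("[*]")
    body = psmiles[3:-3]  # strip the two [*] endpoints
    for period in range(1, len(body) + 1):
        n = len(body) // period
        if len(body) % period == 0 and body == body[:period] * n:
            return "[*]" + body[:period] + "[*]"
    return psmiles

# A two-fold repeat unit collapses to the smallest repeat unit
unit = shortest_repeat_unit("[*]CCOCCO[*]")  # "[*]CCO[*]"
```

The full pipeline would then make "[*]CCO[*]" periodic, canonicalize the cycle with RDKit, and re-break it, which is how equivalent inputs converge to one canonical string.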
Figure S2 details this process. After shuffling, we split the data set into two parts and use 80 % for the five cross-validation models and for validating the meta learners. 20 % of the data set is used for training the meta learners. We perform data set stratification of all splits based on the polymer properties.

Table 1 :
Training data set for the property predictors. The properties are sorted into categories, shown at the top of each block. The data set contains 29 properties (dielectric constants k f are available at 9 different frequencies f ). HP and CP stand for homopolymer and copolymer, respectively.
Figure 4 :
Figure 4: R 2 performance values for polyBERT (PB) and Polymer Genome (PG) fingerprints. a R 2 averages of the five cross-validation validation data sets along with standard deviations (1σ). b R 2 values of the meta learner's test data set. The category-averaged R 2 values are stated in the last rows of each block, while overall R 2 values are given in the very last block. The properties gas permeabilities (µ x ) and elongation at break (ϵ b ) are trained on a log base 10 scale (x → log 10 (x+1)). The R 2 values are reported on this scale. c Minimum, mean, and maximum of polyBERT-based property predictions for 100 million hypothetical polymers. Symbols are defined in Table 1.

For the polyBERT and PG fingerprints, we train multitask deep neural networks for each property category defined in Table 1. Multitask deep neural networks have demonstrated best-in-class results for polymer property predictions, 6,7,11 while being fast, scalable, and readily amenable if more data points become available. Unlike single-task models, multitask models simultaneously predict numerous properties (tasks) and harness inherent but hidden correlations in data to improve their performance. Such a correlation exists, for instance, between T g and T m , but the exact correlation varies across specific polymer chemistries. Multitask models learn and improve from these varying correlations in data. The training protocol of the multitask deep neural networks follows state-of-the-art methods involving five-fold cross-validation and a consolidating meta learner that forecasts the final property values based upon the ensemble of cross-validation predictors. More details about multitask deep neural networks are provided in the Methods section. Their training process is outlined in Supplementary Figure S2.