Introduction

Accurate and efficient property prediction is essential to the design of polymers for applications including polymer electrolytes1,2, organic optoelectronics3,4, energy storage5,6, and many others7,8. Rational representations that map polymers to a continuous vector space are crucial for applying machine learning tools to polymer property prediction. Fingerprints (FPs), which have proven effective in molecular machine learning models, have been introduced for polymer-related tasks9. Recently, deep neural networks (DNNs) have revolutionized polymer property prediction by directly learning expressive representations from data to generate deep fingerprints, instead of relying on manually engineered descriptors10. Rahman et al. used convolutional neural networks (CNNs) to predict the mechanical properties of polymer-carbon nanotube surfaces11, although CNNs fail to account for molecular structure and interactions between atoms. Graph neural networks (GNNs)12, which have outperformed many other models on several molecule and polymer benchmarks13,14,15,16,17, are capable of learning representations from graphs and finding optimal fingerprints for downstream tasks10. For example, Park et al.18 trained graph convolutional neural networks (GCNNs) to predict thermal and mechanical properties of polymers and found that the GCNN representations yielded model performance comparable to the popular extended-connectivity circular fingerprint (ECFP)19,20 representation. Recently, Aldeghi et al. adapted a graph representation of molecular ensembles along with a GNN architecture to capture pivotal features and accurately predict the electron affinity and ionization potential of conjugated polymers21. However, GNN-based models require explicitly known structural and conformational information, which can be computationally or experimentally expensive to obtain. Moreover, the degree of polymerization varies from polymer to polymer, which makes it even harder to represent polymers accurately as graphs; using only the repeating unit as the graph is likely to miss structural information. Therefore, the optimal graph representation for polymers remains unclear.

Meanwhile, language models, such as those based on recurrent neural networks (RNNs)22,23,24,25, treat polymers as character sequences for featurization. Chemical sequences share the same statistical structure as a natural language like English, as suggested by Cadeddu et al., in terms of the distribution of text fragments and molecular fragments26. This insight motivates the development of sequence models, akin to those in computational linguistics, for extracting information from chemical sequences and for understanding chemical texts much like natural languages. Multiple works have investigated deep language models for polymer science. Simine et al. predicted spectra of conjugated polymers with long short-term memory (LSTM) networks from coarse-grained representations of polymers27. Webb et al. proposed coarse-grained polymer genomes as sequences and applied LSTM to predict the properties of different polymer classes28. Patel et al. further extended the coarse-grained string featurization to copolymer systems and developed GNN, CNN, and LSTM models for encoded copolymer sequences29. Bhattacharya et al. leveraged RNNs with sequence embeddings to predict the aggregate morphology of macromolecules30. In addition, sequence models can represent molecules and polymers with the Simplified Molecular-Input Line-Entry System (SMILES)31 and convert the strings to embeddings for vectorization. Some works, such as BigSMILES32, have also investigated string-based encodings of macromolecules. Goswami et al. created encodings from polymer SMILES as input to an LSTM model for polymer glass transition temperature prediction33. However, RNN-based models are generally not competitive at encoding chemical knowledge from polymer sequences because they rely on previous hidden states to capture dependencies between words and tend to lose information at deeper steps. In recent years, the exceptional performance of the Transformer34 on numerous natural language processing (NLP) tasks has motivated its use in chemistry and materials science. Since its introduction, the Transformer and its variants have rapidly reshaped NLP. The Transformer relies solely on the attention mechanism, so it can capture relationships between tokens in a sentence without depending on past hidden states. Many pretrained language models such as BERT35, RoBERTa36, GPT37, ELMo38, and XLM39 have emerged as effective methods for self-supervised learning of representations from unlabeled texts, leading to performance gains on various downstream tasks. Accordingly, many works have applied Transformers to property prediction of small organic molecules40,41,42,43. SMILES-BERT was proposed to pretrain a BERT-like architecture through a masked SMILES recovery task and then generalize to different molecular property prediction tasks44. Similarly, ChemBERTa45, a RoBERTa-like model for molecular property prediction, follows the same pretrain-finetune pipeline; it demonstrated competitive performance on multiple downstream tasks and scaled well with the size of the pretraining dataset. Transformer-based models can even process reactions: Schwaller et al. framed reaction prediction as machine translation and trained a Transformer on SMILES-represented reaction sequences with high accuracy46.
Recently, the Transformer has further proven effective as a structure-agnostic model in materials science, for example, in predicting MOF properties from a text-string representation47. Despite the wide investigation of Transformers for molecules and materials, such models have not yet been leveraged to learn representations of polymers. Compared with small molecules, designing Transformer-based models for polymers is more challenging because the standard SMILES encoding fails to model the polymer structure and misses fundamental factors influencing polymer properties, such as the degree of polymerization and the temperature of measurement. Moreover, the polymer sequences used as input should contain information on not only the definition of the monomers but also their arrangement in the polymer48. In addition, sequence models for polymers face an inherent scarcity of readily available, well-labeled data, given the laborious characterization processes in the laboratory. The situation becomes even worse when some polymer data sources are not fully accessible49,50.

Herein, we propose TransPolymer, a Transformer-based language model for polymer property prediction. To the best of our knowledge, this is the first work to introduce a Transformer-based model to polymer science. Polymers are represented by sequences based on the SMILES of their repeating units together with structural descriptors and then tokenized by a chemically aware tokenizer as the input of TransPolymer, as shown in Fig. 1a. Even though some information, such as bond angles or the overall polymer chain configuration, cannot be explicitly obtained from the input sequences, it can still be learned implicitly by the model. TransPolymer consists of a RoBERTa architecture and a multi-layer perceptron (MLP) regressor head for predicting various polymer properties. In the pretraining phase, TransPolymer is trained through Masked Language Modeling (MLM) with approximately 5M augmented unlabeled polymer sequences from the PI1M database51. In MLM, tokens in sequences are randomly masked and the objective is to recover the original tokens from their contexts. Afterward, TransPolymer is finetuned and evaluated on ten polymer datasets covering various properties, including polymer electrolyte conductivity, band gap, electron affinity, ionization energy, crystallization tendency, dielectric constant, refractive index, and p-type polymer OPV power conversion efficiency52,53,54,55. For each entry in the datasets, the corresponding polymer sequence, containing the polymer SMILES as well as useful descriptors such as temperature and special tokens, is tokenized as input to TransPolymer. The pretraining and finetuning processes are illustrated in Fig. 1b and d. Data augmentation is also implemented for better learning of features from polymer sequences. TransPolymer achieves state-of-the-art (SOTA) results on all ten benchmarks and surpasses the baseline models by large margins in most cases. Ablation studies provide further evidence of what contributes to the superior performance of TransPolymer by investigating the roles of MLM pretraining on large unlabeled data, finetuning both the Transformer encoder and the regressor head, and data augmentation. Visualization of attention scores shows that TransPolymer can encode chemical information about internal interactions within polymers and influential factors of polymer properties. Such a method learns generalizable features that transfer to polymer property prediction, which is of great significance in polymer design.

Fig. 1: Overview of TransPolymer.
figure 1

a Polymer tokenization. As illustrated by the example, the sequence, which comprises polymer SMILES and other descriptors, is tokenized with chemical awareness. b The whole TransPolymer framework with a pretrain-finetune pipeline. c Sketch of the Transformer encoder and multi-head attention. d Illustration of the pretraining (left) and finetuning (right) phases of TransPolymer. The model is pretrained with Masked Language Modeling to recover the original tokens, while the feature vector corresponding to the special token ‘〈s〉’ in the last hidden layer is used for prediction during finetuning. Within the TransPolymer block, lines of deeper color and larger width indicate higher attention scores.

Results

TransPolymer framework

Our TransPolymer framework consists of tokenization, a Transformer encoder, pretraining, and finetuning. Each polymer data point is first converted to a string of tokens through tokenization. Polymer sequences are more challenging to design than molecule or protein sequences because polymers contain complex hierarchical structures and compositions; for instance, two polymers with the same repeating unit can differ in the degree of polymerization. Therefore, we propose a chemically aware polymer tokenization method, as shown in Fig. 1a. The repeating units of polymers are encoded as SMILES, and additional descriptors (e.g., degree of polymerization, polydispersity, and chain conformation) are included to model the polymer system. Copolymers are modeled by combining the SMILES of each constituent repeating unit along with the ratios and arrangements of those repeating units. Materials consisting of mixtures of polymers are represented by concatenating the sequences of each component together with the descriptors of the material. Each token represents an element, the value of a polymer descriptor, or a special separator. Therefore, the tokenization strategy is chemically aware and has an edge over tokenizers trained for natural languages, which tokenize based on single letters. More details about the design of our chemically aware tokenization strategy can be found in the Methods section.

Transformer encoders are built from stacked self-attention and point-wise, fully connected layers34, as shown in Fig. 1c. Unlike RNN or CNN models, the Transformer depends on the self-attention mechanism, which relates tokens at different positions in a sequence to learn representations. Scaled dot-product attention across tokens is applied, relying on the query, key, and value matrices. More details about self-attention can be found in the Methods section. In our case, the Transformer encoder is made up of 6 hidden layers, each containing 12 attention heads. The hyperparameters of TransPolymer are chosen by starting from the common setting of RoBERTa36 and then tuned according to model performance.
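As a rough illustration, such an encoder could be instantiated with the Hugging Face transformers library as sketched below; apart from the 6 hidden layers and 12 attention heads stated above, the remaining values (vocabulary size, hidden size, maximum sequence length) are placeholders rather than the exact settings used for TransPolymer.

```python
from transformers import RobertaConfig, RobertaModel

# Hypothetical encoder configuration mirroring the architecture described above:
# 6 hidden layers with 12 attention heads each. Other values are illustrative only.
config = RobertaConfig(
    vocab_size=265,                   # size of the chemically aware vocabulary (assumed)
    num_hidden_layers=6,              # 6 Transformer encoder layers
    num_attention_heads=12,           # 12 attention heads per layer
    hidden_size=768,                  # token embedding dimension (assumed)
    intermediate_size=3072,           # feed-forward layer width (assumed)
    max_position_embeddings=514,      # maximum sequence length plus offsets (assumed)
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
encoder = RobertaModel(config)
```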

To learn better representations from large amounts of unlabeled polymer data, the Transformer encoder is pretrained via Masked Language Modeling (MLM), a universal and effective pretraining method for various NLP tasks56,57,58. As shown in Fig. 1d (left), 15% of the tokens in a sequence are randomly chosen for possible replacement, and the pretraining objective is to predict the original tokens by learning from the contexts. The pretrained model is then finetuned to predict polymer properties with labeled data. Specifically, the final hidden vector of the special token ‘〈s〉’ at the beginning of the sequence is fed into a regressor head consisting of one hidden layer with SiLU as the activation function, as illustrated in Fig. 1d (right).
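The regressor head described above can be sketched as a small PyTorch module; the class name and hidden size below are illustrative assumptions, with only the one-hidden-layer MLP and the SiLU activation taken from the text.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Minimal sketch of a regressor head: one hidden layer with SiLU activation
    applied to the final hidden vector of the '<s>' token (position 0)."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.SiLU(),
            nn.Linear(hidden_size, 1),   # single scalar property prediction
        )

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, seq_len, hidden_size); take the '<s>' token
        cls_vector = last_hidden_state[:, 0, :]
        return self.mlp(cls_vector).squeeze(-1)
```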

Experimental settings

PI1M, a benchmark database for polymer informatics, is used for pretraining. The database, containing roughly 1M entries, was built by Ma et al. by training a generative model on polymer data collected from the PolyInfo database51,59. The generated sequences consist of monomer SMILES with ‘*’ signs marking the polymerization points. The ~1M database was shown to cover a chemical space similar to that of PolyInfo while populating regions where PolyInfo data are sparse. Therefore, it can serve as an important benchmark for multiple tasks in polymer informatics.

To finetune the pretrained TransPolymer, ten datasets are used in our experiments, covering various properties of different polymer materials with distinct distributions of polymer sequence lengths (shown in Supplementary Fig. 1). The datasets also differ in data type: sequences from the Egc, Egb, Eea, Ei, Xc, EPS, and Nc datasets concern polymers only, so the inputs are just polymer SMILES, while the PE-I, PE-II, and OPV datasets describe polymer-based materials, so the sequences contain additional descriptors. In particular, PE-I, which concerns polymer electrolytes, involves mixtures of multiple components in the polymer materials. Hence, these datasets provide challenging and comprehensive benchmarks for evaluating the performance of TransPolymer. A summary of the ten datasets for downstream tasks is shown in Table 1.

Table 1 Summary of datasets for downstream tasks.

We apply data augmentation to each dataset by removing canonicalization from SMILES and generating non-canonical SMILES that correspond to the same structures as the canonical ones. For the PI1M database, each data entry is augmented fivefold, yielding an augmented dataset of ~5M sequences for pretraining. For downstream datasets, we limit the number of augmented SMILES for large datasets with long SMILES for two reasons: long SMILES tend to generate more non-canonical SMILES, which might alter the original data distribution, and we are not able to use all the augmented data for finetuning given limited computational resources. We include the number of data points after augmentation in Table 1 and summarize the augmentation strategy for each downstream dataset in Supplementary Table 1.

Polymer property prediction results

The performance of our pretrained TransPolymer model on ten property prediction tasks is reported below. We use root mean square error (RMSE) and R2 as evaluation metrics. For each benchmark, the baseline models and data splitting are adopted from the original literature. Except for PE-I, which is trained on data from the year 2018 and evaluated on data from the year 2019, all datasets are split by five-fold cross-validation. When cross-validation is used, the metrics are averaged over the folds. We also train Random Forest models using the Extended Connectivity Fingerprint (ECFP)19,20, one of the state-of-the-art fingerprint approaches, for comparison with TransPolymer. In addition, we develop a long short-term memory (LSTM) model, another widely used language model, as well as an unpretrained TransPolymer trained purely via supervised learning as baselines on all benchmarks. TransPolymerunpretrained and TransPolymerpretrained denote the unpretrained and pretrained TransPolymer, respectively.
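For reference, the ECFP-based Random Forest baseline can be approximated with RDKit Morgan fingerprints and scikit-learn, as in the hedged sketch below; the fingerprint radius, bit length, and number of trees are assumptions, not the exact baseline settings.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

def ecfp_features(smiles_list, radius=2, n_bits=2048):
    """Morgan (ECFP-like) bit vectors for a list of polymer SMILES."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)   # '*' polymerization points are parsed as wildcard atoms
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        fps.append(np.array(fp))
    return np.stack(fps)

# Hypothetical usage with placeholder splits (train_smiles, y_train, test_smiles, y_test):
# X_train, X_test = ecfp_features(train_smiles), ecfp_features(test_smiles)
# model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)
# preds = model.predict(X_test)
# print(mean_squared_error(y_test, preds, squared=False), r2_score(y_test, preds))
```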

The results of TransPolymer and the baselines on PE-I are reported in Table 2. The original literature used a gated GNN to generate fingerprints for predicting polymer electrolyte conductivity with a Gaussian Process53. The fingerprints are also passed to random forest and support vector machine (SVM) models for comparison. Another random forest is trained on ECFP fingerprints. The results of most baseline models indicate strong overfitting, attributed to the introduction of unconventional conductors consisting of conjugated polybenzimidazole and ionic liquid. For instance, the Gaussian Process trained on GNN fingerprints achieves an R2 of 0.90 on the training set but only 0.16 on the test set, and the Random Forest trained on GNN FP obtains a negative test R2 even though its train R2 is 0.91. The Random Forest trained on ECFP stands out among the baseline models, but its performance on the test set is still poor. In contrast, TransPolymerpretrained not only achieves the highest scores on the training set but also improves the test-set performance significantly, reaching an R2 of 0.69. This demonstrates that TransPolymer is capable of learning the intrinsic relationship between polymers and their properties and suffers less from overfitting. Notably, TransPolymerunpretrained also achieves competitive results and shows only mild overfitting compared with other baseline models, indicating the effectiveness of the attention mechanism of Transformer-based models. The scatter plots of ground truth vs. predicted values for PE-I by TransPolymerpretrained are shown in Fig. 2a and Supplementary Fig. 2a.

Table 2 Performance of TransPolymer and baseline models on PE-I.
Fig. 2: Ground truth vs. predicted values by TransPolymerpretrained.
figure 2

Scatter plots of ground truth vs. predicted values for downstream tasks: a PE-I, b PE-II, c Egc, d Egb, e Eea, f Ei, g Xc, h EPS, i Nc, and j OPV. The dashed lines on diagonals stand for perfect regression.

Table 3 lists the results of TransPolymer and the baselines on PE-II, including Ridge, Random Forest, Gradient Boosting, and Extra Trees models trained on chemical descriptors generated from the polymers in the original paper52, as well as a Random Forest trained on ECFP. Although Gradient Boosting surpasses the other models on the training sets with nearly perfect regression, its performance drops significantly on the test sets. In contrast, TransPolymerpretrained, which achieves the lowest average RMSE of 0.61 and the highest average R2 of 0.73 across the cross-validation sets, exhibits better generalization. The scatter plots of ground truth vs. predicted values for PE-II by TransPolymerpretrained are shown in Fig. 2b and Supplementary Fig. 2b.

Table 3 Performance of TransPolymer and baseline models on PE-II.

Table 4 summarizes the performance of TransPolymer and the baselines on the Egc, Egb, Eea, Ei, Xc, EPS, and Nc datasets from Kuenneth et al.54. In the original literature, both Gaussian process and neural network models were trained on each dataset with polymer genome (PG) fingerprints60 as input, some of which performed well while others did not. PG fingerprints were also shown to surpass ECFP on the datasets used by Kuenneth et al. For Egc, Egb, and Eea, despite the high scores of the other models, TransPolymerpretrained still enhances the performance, lowering RMSE and raising R2. In contrast, the baseline models perform poorly on Xc, with test R2 scores below 0, whereas TransPolymerpretrained significantly lowers the test RMSE and increases R2 to 0.50. Notably, the authors of the original paper used multi-task learning to enhance model performance and achieved higher scores than TransPolymerpretrained on some of the datasets, such as Egb, EPS, and Nc (average test RMSE and R2 of 0.43 and 0.95 for Egb, 0.39 and 0.86 for EPS, and 0.07 and 0.91 for Nc, respectively). Access to multiple properties of one polymer, however, is not always available, which limits the application of multi-task learning. In addition, TransPolymerpretrained still outperforms the multi-task learning models on four of the seven chosen datasets. Hence the improvement by TransPolymer over the single-task baselines should still be highly valued. The scatter plots of ground truth vs. predicted values for the Egc, Egb, Eea, Ei, Xc, EPS, and Nc datasets by TransPolymerpretrained are depicted in Fig. 2c–i and Supplementary Fig. 2c–i, respectively.

Table 4 Performance of TransPolymer and baseline models on datasets from literature by Kuenneth et al.54.

TransPolymer and the baselines are also trained on the p-type polymer OPV dataset, with results shown in Table 5. The original paper trained a random forest and an artificial neural network (ANN) on the dataset using ECFP55. TransPolymerpretrained gives a slightly better performance than the baselines: the average RMSE is the same as that of the random forest, and the average test R2 is higher by 0.05. Although no model achieves fully satisfactory performance, possibly owing to noise in the data, TransPolymerpretrained still outperforms the baselines. The scatter plots of ground truth vs. predicted values for OPV by TransPolymerpretrained are depicted in Fig. 2j and Supplementary Fig. 2j.

Table 5 Performance of TransPolymer and baseline models on p-type polymer OPV.

Table 6 summarizes the improvement of TransPolymerpretrained over the best baseline models and over TransPolymerunpretrained on each dataset. TransPolymerpretrained outperforms all other models on all ten datasets, further evidencing the generalization ability of TransPolymer. Compared with the best baseline models, TransPolymerpretrained reduces the evaluation RMSE by 7.70% on average (in percentage) and increases the evaluation R2 by 0.11 (in absolute value); compared with TransPolymerunpretrained, the corresponding values are 18.5% and 0.12. Therefore, the pretrained TransPolymer could serve as a universal pretrained model for polymer property prediction and be applied to other tasks by finetuning. Moreover, TransPolymer equipped with MLM pretraining shows significant advantages over other models in dealing with complicated polymer systems. Specifically, on the PE-I benchmark, TransPolymerpretrained improves R2 by 0.37 compared with the previous best baseline model and by 0.39 compared with TransPolymerunpretrained. PE-I contains not only polymer SMILES but also key descriptors of the materials, such as temperature and component ratios. The data in PE-I are noisy owing to the presence of different types of components in the polymer materials, for instance, copolymers, anions, and ionic liquids. Also, models are trained on data from the year 2018 and evaluated on data from the year 2019, which gives a more challenging setting. It is therefore reasonable to infer that TransPolymer is better at learning features from noisy data and gives robust performance. Notably, LSTM is the least competitive model in almost every downstream task, which demonstrates the significance of the attention mechanism in extracting chemical knowledge from polymer sequences.

Table 6 Improvement of performance of TransPolymerpretrained compared with baselines and TransPolymerunpretrained in terms of decrease of test RMSE (in percentage) and increase of test R2 (in absolute value).

Ablation studies

The effects of pretraining can be further demonstrated by visualizing, with t-SNE61, the chemical space occupied by polymer SMILES from the pretraining and downstream datasets, as shown in Fig. 3. Each polymer SMILES is converted to a TransPolymer embedding of size sequence length × embedding size. Max pooling converts the embedding matrices to vectors so that the strongest characteristics of the embeddings are preserved in the input of t-SNE. We use the openTSNE library62 to create 2D embeddings from the pretraining data and map the downstream data to the same 2D space. As illustrated in Fig. 3a, almost every downstream data point lies in the space covered by the original ~1M pretraining data points, indicating the effectiveness of pretraining for representation learning in TransPolymer. Data points from datasets such as Xc, which show little clustering in the chemical space, cover a wide range of polymers, explaining why other models struggle on Xc while the pretrained TransPolymer learns reasonable representations. Meanwhile, for datasets that cluster in the chemical space, other models obtain reasonable results whereas TransPolymer achieves better ones. Additionally, the numbers of unique polymer SMILES in PE-I and PE-II are much smaller than the sizes of those datasets because many instances share the same polymer SMILES while differing in descriptors such as molecular weight and temperature; hence the visualization of polymer SMILES cannot fully reflect the chemical space occupied by the polymers from these datasets.
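A minimal sketch of this visualization pipeline, assuming model and tokenizer denote the pretrained encoder and the chemically aware tokenizer, is given below; the max pooling and the fit-then-transform use of openTSNE follow the description above.

```python
import numpy as np
import torch
from openTSNE import TSNE

def max_pooled_embedding(model, tokenizer, smiles, device="cpu"):
    """Max-pool token embeddings into one vector per polymer sequence.
    (Sketch only; 'model' and 'tokenizer' are assumed to be the pretrained
    TransPolymer encoder and the chemically aware tokenizer.)"""
    inputs = tokenizer(smiles, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, hidden_size)
    return hidden.max(dim=1).values.squeeze(0).cpu().numpy()

# Hypothetical arrays of pooled vectors for pretraining and downstream polymers:
# X_pretrain = np.stack([max_pooled_embedding(model, tokenizer, s) for s in pretrain_smiles])
# X_downstream = np.stack([max_pooled_embedding(model, tokenizer, s) for s in downstream_smiles])
# tsne_map = TSNE(n_components=2, random_state=0).fit(X_pretrain)   # fit on pretraining data
# downstream_2d = tsne_map.transform(X_downstream)                  # map downstream data to same space
```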

Fig. 3: t-SNE visualization of pretraining and downstream data.
figure 3

The embeddings are obtained by first fitting on the (a) 1M (original), (b) 50K, and (c) 5K pretraining data and then transforming downstream data to the corresponding data space.

We have also investigated how the size of the pretraining dataset affects downstream performance. We randomly sample 5K, 50K, 500K, and 1M (original size) data points from the initial pretraining dataset without augmentation, pretrain TransPolymer on each subset, and compare the results with those of TransPolymer pretrained on the 5M augmented data. The results are summarized in Supplementary Table 5. Figure 4 presents a bar plot of R2 for each experiment, with error bars included where cross-validation is used. The results show a clear trend of improved downstream performance (decreasing RMSE and increasing R2) with increasing pretraining size. In particular, with the smallest pretraining sets, the model performance on some datasets, for example PE-I, Nc, and OPV, is even worse than training TransPolymer from scratch (the TransPolymerunpretrained results in Tables 2–5). A possible explanation is that a small pretraining set covers a limited data space, leaving some downstream data points outside the distribution of the pretraining data. Figure 3b, c visualize the data space obtained by fitting on 50K and 5K pretraining data, respectively, in which much of the space occupied by downstream data points is not covered by the pretraining data. Therefore, the results emphasize the benefit of pretraining with a large number of unlabeled sequences.

Fig. 4: Model performance with varying pretraining data sizes.
figure 4

The R2 values for each downstream task with different pretraining data sizes are presented in the bar plot. Error bars are included where cross-validation is implemented.

The TransPolymerpretrained results so far are all obtained by pretraining first and then finetuning the whole model on the downstream datasets. We also consider another setting in which only the regressor head is finetuned while the pretrained Transformer encoder is frozen. The comparison of the performance of TransPolymerpretrained between finetuning the regressor head only and finetuning the whole model is presented in Table 7, with standard deviations included where cross-validation is applied. Reasonable results can be obtained by freezing the pretrained encoder and training the regressor head only: for instance, the performance on the Xc dataset already surpasses the baseline models, and the performance on the Ei, Nc, and OPV datasets is only slightly worse than the corresponding best baselines. However, performance on all downstream tasks increases significantly when both the Transformer encoder and the regressor head are finetuned, indicating that the regressor head alone is not enough to learn task-specific information. In fact, the attention mechanism plays a key role in learning not only generalizable but also task-specific information. Even though the frozen pretrained TransPolymer is transferable to various downstream tasks and more efficient, finetuning the Transformer encoder with task-related data is necessary for better performance.
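The two settings can be contrasted with a short PyTorch sketch; encoder and regressor_head refer to the components from the earlier sketches, and the learning rates are illustrative rather than the tuned values in Supplementary Table 2.

```python
import torch

# Setting 1: freeze the pretrained encoder and train the regressor head only.
for param in encoder.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(regressor_head.parameters(), lr=1e-4)

# Setting 2: finetune the whole model, typically with a smaller learning rate for the
# pretrained encoder than for the freshly initialized regressor head.
for param in encoder.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 5e-5},
    {"params": regressor_head.parameters(), "lr": 1e-4},
])
```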

Table 7 Comparison of performance of TransPolymerpretrained between finetuning the regressor head only and finetuning the whole model in terms of test RMSE and R2.

Data augmentation is implemented not only in pretraining but also in finetuning. The comparison between model performance on downstream tasks with pretraining on the original ~1M dataset and on the augmented ~5M dataset (shown in Supplementary Table 5) has already demonstrated the importance of data augmentation. Here, we use the model pretrained on the ~5M augmented dataset but finetune TransPolymer without augmenting the downstream datasets, to investigate to what extent TransPolymer can still improve upon the best baseline models. The performance enhancement with and without data augmentation relative to the best baseline models is summarized in Table 8. For most downstream tasks, TransPolymerpretrained improves model performance even without data augmentation, and the improvement becomes more significant when augmentation is applied. For the PE-II dataset, however, TransPolymerpretrained is not comparable to the best baseline model without data augmentation, since the original dataset contains only 271 data points in total. Because of the data-hungry nature of Transformers, data augmentation can be a crucial factor in finetuning, especially when data are scarce (which is very common in chemistry and materials science). Therefore, data augmentation helps the model generalize to sequences unseen in the training data.

Table 8 Improvement of performance of TransPolymerpretrained without and with data augmentation in finetuning compared with best baselines in terms of decrease of test RMSE (in percentage) and increase of test R2 (in absolute value).

Self-attention visualization

Attention scores, which indicate how closely two tokens align with each other, can be used to understand how much chemical knowledge TransPolymer learns from pretraining and how each token contributes to the prediction results. Take poly(ethylene oxide) (*CCO*), one of the most prevalent polymer electrolytes, as an example. The attention scores between tokens in the first and last hidden layers are shown in Fig. 5a and b, respectively. The attention score matrices of the 12 attention heads in the first hidden layer indicate strong relationships between neighboring tokens, as seen from the high attention scores around the matrix diagonals. This trend makes sense because nearby tokens in polymer SMILES usually represent atoms bonded to each other in the polymer, and atoms are most strongly affected by their local environments. Therefore, the first hidden layer, which is closest to the inputs, captures such chemical information. In contrast, the attention scores from the last hidden layer tend to be more uniform and thus lack an interpretable pattern. A similar phenomenon was observed by Abnar et al., who found that token embeddings in deeper hidden layers become contextualized and may carry similar information63.
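For readers who wish to reproduce such plots, the per-layer, per-head attention matrices can be retrieved from a Hugging Face encoder as sketched below; model and tokenizer are again assumed to be the pretrained TransPolymer components.

```python
import torch

# Retrieve attention matrices for poly(ethylene oxide), '*CCO*'.
inputs = tokenizer("*CCO*", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

attentions = outputs.attentions          # tuple with one tensor per hidden layer
first_layer = attentions[0][0]           # (num_heads, seq_len, seq_len) for batch element 0
last_layer = attentions[-1][0]
print(first_layer.shape)                 # e.g. torch.Size([12, seq_len, seq_len])
```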

Fig. 5: Visualization of attention scores from pretrained TransPolymer.
figure 5

a Attention scores in the first hidden layer. b Attention scores in the last hidden layer.

When finetuning TransPolymer, the vector of the special token ‘〈s〉’ from the last hidden state is used for prediction. Hence, to examine the impact of individual tokens on prediction results, the attention scores between ‘〈s〉’ and the other tokens from all 6 hidden layers in each attention head are illustrated for the example of the PEC-PEO blend electrolyte from PE-II, whose polymer SMILES is ‘*COC(=O)OC*.*CCO*’. In addition to the polymer SMILES, the sequence also includes ‘F[B-](F)(F)F’, ‘0.17’, ‘95.2’, ‘37.0’, ‘−23’, and ‘S_1’, which stand for the anion in the electrolyte, the ratio between lithium ions and functional groups in the polymer, the comonomer percentage, the molecular weight (kDa), the glass transition temperature (Tg), and the linear chain structure, respectively. As illustrated in Fig. 6, the ‘〈s〉’ token tends to focus on certain tokens, such as ‘*’, ‘$’, and ‘−23’, which are marked in red in the example sequence in Fig. 6. Since Tg usually plays an important role in determining the conductivity of polymers64, the finetuned TransPolymer appears to identify the parts of a polymer sequence that influence its properties. However, it is also widely argued that attention weights cannot fully depict the relationship between tokens and prediction results, because a high attention score does not necessarily mean that a pair of tokens is important to the prediction, given that attention scores do not consider the value matrices65. More work is needed to fully address the attention interpretation problem.

Fig. 6: Visualization of attention scores from finetuned TransPolymer.
figure 6

The attention scores between the ‘〈s〉’ token and other tokens at different hidden layers in each attention head after finetuning are visualized. At the bottom is the sequence used for visualization in which the tokens having high attention scores with ‘〈s〉’ are marked in red.

Discussion

In summary, we have proposed TransPolymer, a Transformer-based model with MLM pretraining, for accurate and efficient polymer property prediction. By rationally designing a polymer tokenization strategy, we map each polymer instance to a sequence of tokens. Data augmentation is implemented to enlarge the data available for representation learning. TransPolymer is first pretrained on approximately 5M unlabeled polymer sequences by MLM and then finetuned on different downstream datasets, outperforming all baselines and the unpretrained TransPolymer. The superior performance can be attributed to pretraining with large unlabeled data, finetuning the Transformer encoder, and data augmentation for enlarging the data space. The attention scores from the hidden layers of TransPolymer provide evidence that the learned representations are chemically aware and suggest which tokens influence the final predictions.

Given the strong model performance and generalization from a small number of labeled downstream data, we anticipate that TransPolymer will serve as a potential solution for predicting the properties of newly designed polymers and guiding polymer design. For example, the pretrained TransPolymer could be applied in an active-learning-guided polymer discovery framework66,67, in which TransPolymer virtually screens the polymer space, recommends candidates with desirable properties based on model predictions, and is updated by learning from experimental evaluation. In addition, the outstanding performance of TransPolymer on copolymer datasets compared with existing baseline models sheds light on the exploration of copolymers. In short, even though the main focus of this paper is regression, TransPolymer can pave the way for several promising (co)polymer discovery frameworks.

Methods

Polymer tokenization

Unlike small molecules, which are easily represented by SMILES, polymers are more complex to convert to sequences, since SMILES fails to incorporate pivotal information such as the connectivity between repeating units and the degree of polymerization. As a result, the polymer sequences are designed to account for that information. To build a polymer sequence, each repeating unit of the polymer is first recognized and converted to SMILES, and ‘*’ signs are added at the ends of the repeating unit to indicate the connectivity between repeating units. Such a strategy has been widely used in string-based polymer representations68,69. For copolymers, ‘.’ is used to separate different constituents, and ‘^’ is used to indicate branches. Other information, such as the degree of polymerization and molecular weight, if accessible, is appended after the polymer SMILES, separated by special tokens. In the example sequence in Fig. 1a, the sequence describes a polymer electrolyte system including two components separated by the special token ‘’. Descriptors such as the ratio between repeating units in the copolymer, the component type, and the glass transition temperature (Tg for short) are added for each component, separated by ‘$’, and the ratio between components and the temperature are placed at the end of the sequence. Adding these descriptors can improve the performance of property prediction, as suggested by Patel et al.29. Unique ‘NAN’ tokens are assigned to missing values of each descriptor in the dataset; for example, ‘NAN_Tg’ indicates a missing glass transition temperature and ‘NAN_MW’ a missing molecular weight. These NAN tokens are added during finetuning to include the available chemical descriptors in the datasets, so different datasets can contain different NAN tokens. Notably, other descriptors such as molecular weight and degree of polymerization are omitted in this example because their values for each component are missing; in practice, these values should also be included with unique ‘NAN’ tokens when absent. In addition, considering the varying numbers of constituents in copolymers and of components in composites, the ‘NAN’ tokens for ratios are padded to the maximum possible number.

When tokenizing the polymer sequences, the regular expression in the tokenizer, adapted from the RoBERTa tokenizer, is modified to match all possible elements in polymers as well as the vocabulary of descriptors and special tokens. Consequently, the polymer tokenizer correctly slices polymers into their constituent atoms. For example, ‘Si’, which represents a silicon atom in polymer sequences, is recognized as a single token by our polymer tokenizer, whereas ‘S’ and ‘i’ would likely be separated into different tokens by the RoBERTa tokenizer. Values of descriptors and special tokens are converted to single tokens as well: all non-text values, e.g., temperature, are discretized and treated as single tokens by the tokenizer.
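A simplified sketch of such a regex-based tokenizer is shown below; the pattern and the descriptor tokens it matches are illustrative assumptions and cover far less than the actual TransPolymer vocabulary.

```python
import re

# Simplified sketch of a chemically aware tokenization regex (not the exact pattern
# or vocabulary used by TransPolymer). Multi-character element symbols and descriptor
# tokens are matched before single characters so that, e.g., 'Si' stays one token.
PATTERN = re.compile(
    r"(NAN_[A-Za-z]+"          # NAN tokens for missing descriptors
    r"|S_1"                    # example structure descriptor token (assumed)
    r"|\[[^\]]+\]"             # bracketed atoms such as [B-]
    r"|Si|Cl|Br"               # multi-character elements
    r"|-?\d+\.?\d*"            # discretized numeric descriptor values
    r"|[A-Za-z]|[*.$^=#()/\\@+-])"
)

def tokenize(sequence: str):
    return PATTERN.findall(sequence)

print(tokenize("*COC(=O)OC*.*CCO*"))
# ['*', 'C', 'O', 'C', '(', '=', 'O', ')', 'O', 'C', '*', '.', '*', 'C', 'C', 'O', '*']
```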

Data augmentation

To enlarge the available polymer data for better representation learning, data augmentation is applied to the polymer SMILES within the polymer sequences of each dataset we use. The augmentation technique is borrowed from Lambard et al.70. First, canonicalization is removed from the SMILES representations; then, atoms in the SMILES are renumbered by rotating their indices; finally, for each renumbering, grammatically correct SMILES are reconstructed that preserve the isomerism of the original polymers or molecules and prevent Kekulisation31,71. Duplicate SMILES are removed from the expanded list. SMILES augmentation is implemented with the RDKit library72. In particular, data augmentation is only applied to training sets after the train-test split to avoid information leakage.
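A minimal sketch of this augmentation procedure with RDKit, assuming rotation of atom indices followed by non-canonical SMILES writing, is given below; the cap of five variants mirrors the pretraining augmentation, but the helper itself is illustrative rather than the exact implementation.

```python
from rdkit import Chem

def augment_smiles(smiles: str, max_variants: int = 5):
    """Sketch of SMILES enumeration by rotating atom indices (after Lambard et al.):
    non-canonical SMILES of the same structure, isomerism preserved, no Kekulisation."""
    mol = Chem.MolFromSmiles(smiles)
    n_atoms = mol.GetNumAtoms()
    variants = set()
    for shift in range(n_atoms):
        order = [(i + shift) % n_atoms for i in range(n_atoms)]
        rotated = Chem.RenumberAtoms(mol, order)
        smi = Chem.MolToSmiles(rotated, canonical=False,
                               isomericSmiles=True, kekuleSmiles=False)
        variants.add(smi)                       # duplicates removed via the set
        if len(variants) >= max_variants:
            break
    return list(variants)

print(augment_smiles("*CCO*"))   # prints up to five non-canonical SMILES equivalent to '*CCO*'
```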

Transformer-based encoder

Our TransPolymer model is based on the Transformer encoder architecture34. Unlike RNN-based models, which encode temporal information by recurrence, the Transformer uses self-attention layers instead. The attention mechanism used in the Transformer is Scaled Dot-Product Attention, which maps the input data into three matrices: queries (Q), keys (K), and values (V). The attention is computed by taking the dot product of the query with all keys, dividing each by \(\sqrt{{d}_{k}}\) for scaling, where dk is the dimension of the keys, applying the softmax function to obtain the weights on the values, and finally computing the weighted sum. The dot product between queries and keys measures how closely aligned the keys are with the queries; the attention score therefore reflects how closely related two token embeddings are. The formula of Scaled Dot-Product Attention can be written as:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_{\mathrm{k}}}}\right)V$$
(1)

Multi-head attention is performed instead of single attention by linearly projecting Q, K, and V with different learned projections and applying the attention function in parallel. The outputs are concatenated and projected again to obtain the final result. In this way, information from different representation subspaces can be learned by the model.
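Equation (1) and the multi-head layout can be expressed compactly in PyTorch; the sketch below uses random tensors with an assumed 12 heads and 64 dimensions per head purely for illustration.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Eq. (1): softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # alignment between queries and keys
    weights = F.softmax(scores, dim=-1)                  # attention scores
    return weights @ V, weights

# Toy example with random tensors: 12 heads, a sequence of 8 tokens, 64 dimensions per head.
Q = torch.randn(12, 8, 64)
K = torch.randn(12, 8, 64)
V = torch.randn(12, 8, 64)
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)   # torch.Size([12, 8, 64]) torch.Size([12, 8, 8])
```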

The inputs to the Transformer model, namely embeddings, map tokens in sequences to vectors. Because of the absence of recurrence, word embeddings alone are not sufficient to encode sequence order. Therefore, positional encodings are introduced so that the model knows the relative or absolute position of each token in the sequence. In the Transformer, positional encodings are represented by trigonometric functions:

$$\mathrm{PE}_{(\mathrm{pos},2i)}=\sin\left(\mathrm{pos}/10000^{2i/d_{\mathrm{model}}}\right)$$
(2)
$$\mathrm{PE}_{(\mathrm{pos},2i+1)}=\cos\left(\mathrm{pos}/10000^{2i/d_{\mathrm{model}}}\right)$$
(3)

where pos is the position of the token and i is the dimension. In this way, the relative positions of tokens can be learned by the model.
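Equations (2) and (3) translate directly into a short PyTorch function; the sequence length and model dimension below are illustrative.

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Eqs. (2)-(3): sine on even dimensions, cosine on odd dimensions."""
    position = torch.arange(max_len).unsqueeze(1).float()                      # (max_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe

pe = sinusoidal_positional_encoding(max_len=512, d_model=768)
print(pe.shape)   # torch.Size([512, 768])
```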

Pretraining with MLM

To pretrain TransPolymer with Masked Language Modeling (MLM), 15% of the tokens in a sequence are chosen for possible replacement. Among the chosen tokens, 80% are masked, 10% are replaced by randomly selected vocabulary tokens, and 10% are left unchanged, in order to generate proper contextual embeddings for all tokens and bias the representation toward the actually observed tokens35. Such a pretraining strategy enables TransPolymer to learn the “chemical grammar” of polymer sequences by recovering the original tokens, so that chemical knowledge is encoded by the model.
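The corruption scheme can be sketched as follows (an equivalent utility is provided by the transformers library's DataCollatorForLanguageModeling); the function below is an illustrative re-implementation, not the exact training code.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_token_ids, mlm_prob=0.15):
    """Illustrative MLM corruption: select 15% of tokens; of those, 80% -> <mask>,
    10% -> random vocabulary token, 10% left unchanged. Modifies input_ids in place."""
    labels = input_ids.clone()
    prob = torch.full(input_ids.shape, mlm_prob)
    special = torch.isin(input_ids, torch.tensor(special_token_ids))
    prob.masked_fill_(special, 0.0)                           # never corrupt special tokens
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100                                  # loss is computed only on selected tokens

    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id                         # 80% of selected tokens -> <mask>
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]  # 10% -> random
    # the remaining ~10% of selected tokens stay unchanged
    return input_ids, labels
```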

The pretraining database is split into training and validation sets with an 80/20 ratio. We use AdamW as the optimizer, with a learning rate of 5 × 10−5, betas of (0.9, 0.999), epsilon of 1 × 10−6, and weight decay of 0. A linear scheduler with a warm-up ratio of 0.05 is used, so the learning rate increases from 0 to the value set in the optimizer over the first 5% of training steps and then decreases linearly to zero. The batch size is set to 200, and the hidden-layer dropout and attention dropout are set to 0.1. The model is pretrained for 30 epochs, during which the cross-entropy loss decreases steadily from over 1 to around 0.07, and the checkpoint with the best performance on the validation set is used for finetuning. The whole pretraining process takes approximately 3 days on two RTX 6000 GPUs.
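A sketch of the corresponding optimizer and warm-up schedule is shown below; config is assumed to come from the earlier encoder sketch, and the step count is a placeholder derived from the stated batch size.

```python
import torch
from transformers import RobertaForMaskedLM, get_linear_schedule_with_warmup

model = RobertaForMaskedLM(config)           # 'config' from the earlier encoder sketch (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, betas=(0.9, 0.999),
                              eps=1e-6, weight_decay=0.0)

steps_per_epoch = 20000                      # placeholder: roughly (training split) / batch size 200
total_steps = 30 * steps_per_epoch           # 30 pretraining epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),   # linear warm-up over the first 5% of steps
    num_training_steps=total_steps,             # then linear decay to zero
)
```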

Finetuning for polymer property prediction

The finetuning process involves the pretrained Transformer encoder and a one-layer MLP regressor head, so that representations of polymer sequences can be used for property prediction.

For the experimental settings of finetuning, AdamW is used as the optimizer, with betas of (0.9, 0.999), epsilon of 1 × 10−6, and weight decay of 0.01. Different learning rates are used for the pretrained TransPolymer and the regressor head. For some experiments, a layer-wise learning rate decay (LLRD) strategy, suggested by Zhang et al.73, is applied: the learning rate is decreased layer by layer from top to bottom with a multiplicative decay rate. This strategy is based on the observation that different layers learn different information from sequences: top layers near the output learn more local and specific information and thus require larger learning rates, while bottom layers near the inputs learn more general and common information. The specific learning rates for each dataset, as well as the other hyperparameters of the optimizer and scheduler, are given in Supplementary Table 2. For each downstream dataset, the model is trained for 20 epochs and the best model is selected according to the RMSE and R2 on the test set used for evaluation.
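LLRD can be implemented by assigning each encoder layer its own parameter group, as in the sketch below; the base learning rate, the decay factor, and the Hugging Face attribute names (roberta.encoder.layer, roberta.embeddings) are assumptions for illustration, not the settings in Supplementary Table 2.

```python
def llrd_parameter_groups(model, head, base_lr=1e-4, decay=0.9):
    """Sketch of layer-wise learning rate decay (LLRD): the regressor head and top
    encoder layers get the largest learning rates, decayed multiplicatively toward
    the bottom layers and the embeddings. Attribute names follow Hugging Face
    RoBERTa models and are assumptions here."""
    groups = [{"params": head.parameters(), "lr": base_lr}]
    layers = list(model.roberta.encoder.layer)           # ordered bottom -> top
    lr = base_lr
    for layer in reversed(layers):                       # top layer first
        lr *= decay
        groups.append({"params": layer.parameters(), "lr": lr})
    groups.append({"params": model.roberta.embeddings.parameters(), "lr": lr * decay})
    return groups

# Hypothetical usage:
# optimizer = torch.optim.AdamW(llrd_parameter_groups(model, regressor_head),
#                               betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01)
```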