Introduction

Plastics are an integral part of our everyday life and modern technology. Their simple, yet diverse, chemistries and tunable properties make plastics versatile and desirable; plastics display high or low flexibility, strength, thermal, or electronic conductivity along with low cost, low weight, and abundance1. The global plastic production of 2019 amounts to an unimaginable 368 million tonnes and is expected to further increase in the coming years. About 40% (145 million tonnes) of the yearly plastic production accounts for packaging products such as bags, food containers, cutlery, or bottles, which have very short service lifetimes and often end up in landfills, seawater, or other natural environments2,3,4. It is therefore not surprising that packaging plastics are one of the largest polluters of our world’s ecosystems, severely threatening the existence of animals and humans through waste and microplastic particles on land and in oceans that last for decades or longer5. Finding eco-friendly plastics (bioplastics) with properties akin to conventional plastics but with sustainable recycling options is therefore of utmost importance for a circular economy.1 The bio-derived and biodegradable family of polyhydroxyalkanoates (PHAs) is a promising cradle-to-cradle material that can be synthesized by several microorganisms directly using sunlight and CO2 from the environment or industrial point sources3,6,7. Existing commercial entities have already studied the biosynthesis, processing, and industrialization of PHA-based polymers and blends8. Diverse chemistries harbored in PHAs span a large property space with ample opportunities to design mechanical and thermal properties such as the Young’s modulus (E), tensile strength (σ), elongation (ϵ), glass transition temperature (Tg), melting temperature (Tm), and degradation temperature (Td)3,9,10,11,12,13,14.

PHAs provide copious opportunities for chemical modification and property modulation3,15. Key parameters of these modifications are the numbers of carbons in the main-chain and side-chain (n and m in Fig. 1a), and the terminating functional groups of the side-chain (R in Fig. 1a)15,16. For instance, the most widely known PHA, poly-3-hydroxybutyrate (P3HB, with n = 1, m = 1, and no R), is brittle and inflexible17. As the number of carbon atoms in the backbone increases, the resulting polymers tend to display higher elongation at break (ϵb) combined with improved mechanical strength and enhanced tendency for degradability3. Also, PHAs with side-chain-terminating phenyl groups exhibit higher Tgs because of increased rigidity due to enhanced interchain interactions resulting from the polar side chain functional groups15,18. Besides systematic structural and chemical alterations, copolymers provide an additional knob to grow the accessible property space by not only combining multiple PHA-based motifs but also PHAs with conventional polymers19. In the past, PHA-only copolymers have been found to improve mechanical properties while keeping high Tm and low Tg values, which is ideal for applications that require large temperature operation windows3,14. By forming copolymers of PHAs with conventional polymers, one may harness synergistic effects, potentially leading to recyclable polymers with enhanced mechanical strength and improved gas permeability. Polymers with low gas permeabilities are of particular interest to the food packaging industry where the containment of gases such as CO2, or the prevention of gases such as O2 from permeating into the food products, is an important factor in deciding application-specific suitability of a polymer.

Fig. 1: Bioplastic design using multitask deep learning predictors.

a Design pipeline. A large search space is created by combining 540 polyhydroxyalkanoates (PHAs) and 13 conventional polymers into copolymers. Property predictors and property requirements of commonly used polymers allow us to identify bioplastic candidates within the search space. b Architecture of the multitask neural network predictors. Three separate predictors are trained, one for each of the selector vectors (S1, S2, S3). S is a binary vector (1 for the selected component and 0 everywhere else) that selects the output property. c Prediction pipeline. The two SMILES58 strings belong to comonomers in a copolymer.

Just contemplating copolymer compositions on a rather coarse composition grid (c = 0, 0.1, …, 1), the total number of distinct PHA-only copolymer possibilities is far beyond a million, effectively rendering trial and error-based high-throughput experiments an impractical route of searching for application-specific candidate materials20. Also, the sheer size of the search space disqualifies time-consuming computational methods such as density functional theory (DFT) or even classical molecular dynamics (MD) simulations. The burgeoning field of polymer informatics21,22,23,24,25,26 offers an exciting alternative route to address such search problems by using modern data-driven machine learning approaches18,27,28,29.

The present study, with the details of the workflow and machine learning framework outlined in Fig. 1, has several vital elements. First, we develop efficient multitask deep neural network-based multiproperty predictors for copolymers that forecast three different thermal (Tg, Tm, and Td), four different mechanical (E, σy, σb, and ϵb), and six gas permeability (μg with g ∈ {O2, CO2, N2, H2, He, CH4}) properties using nearly 23,000 experimental data points pertaining to a diverse range of homo- and copolymer chemistries. Here, Tg, Tm, Td, E, σy, σb, and ϵb are the glass transition temperature, melting temperature, degradation temperature, Young’s modulus, tensile strength at yield, tensile strength at break, and elongation at break, respectively. \(\mu_{\mathrm{O}_2}\), \(\mu_{\mathrm{CO}_2}\), \(\mu_{\mathrm{N}_2}\), \(\mu_{\mathrm{H}_2}\), \(\mu_{\mathrm{He}}\), and \(\mu_{\mathrm{CH}_4}\) stand for the gas permeabilities of O2, CO2, N2, H2, He, and CH4. The thermal, mechanical, and gas permeability properties are selected as they play a critical role in the design and selection of plastics for packaging and other large-scale industrial applications. Second, we create a bioplastic candidate space of nearly 1.4 million bioplastics, which is spanned by 540 PHAs and 13 conventional polymer chemistries. Third, we follow a two-step protocol to find several PHA-only and PHA-conventional polymer bio-replacements in the candidate space for seven petroleum-based and commonly used plastics. Possible synthesis routes of the bio-replacements are discussed. This work represents the state of the art in polymer informatics, and contributes to and accelerates the identification of sustainable functional polymer candidate materials.

Results and discussion

Data Set

Our data set for training of the multitask property predictors includes a total of 22,731 homopolymer (≈67%) and copolymer (≈33%) data points of the thermal, mechanical, and small-molecule gas permeability properties as reported in Table 1. Each of the 7512 copolymer data points involves two distinct comonomers at various compositions, and together they span over 1440 distinct copolymer chemistries. Homo- and copolymer data points of Tg, Tm, and Td, and homopolymer data points of μg, E, and σb were already utilized in previous studies30,31,32,33,34,35. The copolymer data points belonging to μg, E, σy, σb, and ϵb, and homopolymer data points of σy and ϵb were collected from the PoLyInfo35 repository for this study. If multiple data points were available for a single polymer in PoLyInfo, we used the average values after a manual curation. For consistency and uniformity, only Tg and Tm data points measured via differential scanning calorimetry (DSC), Td data points measured via thermogravimetric analysis (TGA), and mechanical data points recorded around room temperature (300 K) were included in the data set. Moreover, for configurational consistency, all copolymer data points in this study are from random copolymers. As part of an additional curation step and our due diligence strategy, we employed a clustering algorithm (DBSCAN using standard parameters as implemented in Scikit-learn36) to identify outliers and select suspicious data points for manual inspection. The degree of polymerization and molecular weight were not taken into account because they were not uniformly available for all data points. As required by the multitask method, all property values were scaled to the range of [0, 1] (min-max scaling) for training and transformed back to the actual ranges before computing the respective error metrics. Additionally, ϵb and the gas permeabilities were transformed to a base-10 logarithmic scale (\(x \mapsto \log_{10}(x+1)\)) before training because of their power-law-shaped data distributions (see Supplementary Figs. 4–6).
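The two preprocessing steps described above (the \(\log_{10}(x+1)\) transform followed by min-max scaling, and the inverse of both before error metrics are computed) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names and the example permeability values are hypothetical.

```python
import math

def log_transform(x):
    """Map x to log10(x + 1), as described for elongation at break and gas permeabilities."""
    return math.log10(x + 1.0)

def inverse_log_transform(y):
    """Undo log_transform."""
    return 10.0 ** y - 1.0

def scale(value, lo, hi):
    """Min-max scale a value to [0, 1] given the property's minimum and maximum."""
    return (value - lo) / (hi - lo)

def unscale(scaled_value, lo, hi):
    """Undo min-max scaling."""
    return scaled_value * (hi - lo) + lo

# Hypothetical permeability values in barrer (illustrative only, not from the data set).
permeabilities = [0.0, 0.5, 9.0, 99.0, 999.0]

logged = [log_transform(p) for p in permeabilities]
lo, hi = min(logged), max(logged)
scaled = [scale(v, lo, hi) for v in logged]  # values used for training

# Transform back to the original range before computing error metrics.
recovered = [inverse_log_transform(unscale(z, lo, hi)) for z in scaled]
```

Fitting `lo` and `hi` per property keeps every task on the same [0, 1] range, which is what lets a single network output serve several properties.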

Table 1 Overview of our data set used for training the multitask predictors.

Property predictors

Multitask deep neural networks with meta learners have shown best-in-class performance in past polymer informatics studies33,34 due to their ability to utilize inherent correlations in data that helps to overcome data sparsity. Here, we create three multiproperty predictors (one for each category in Table 1) to predict, in total, 13 polymer properties using the data set and categories profiled in Table 1 and fingerprints outlined in the Methods section. Figure 1b schematically shows the architecture of the multitask predictors, while implementation details are given in the Methods section and in Supplementary Fig. 1.

The developed meta learners and cross-validation predictors display outstanding overall coefficient of determination (R2) values of 0.97 and 0.93, respectively. The root-mean-square error (RMSE) and R2 values of all properties are reported in Table 2. The meta learner (cross-validation) predictors of the thermal properties perform very well with R2 values as high as 0.98 (0.92), 0.97 (0.84), and 0.96 (0.72) for Tg, Tm, and Td, respectively. This is expected because of the large number of data points and high data fidelity of the thermal property data points. The reported validation metrics here are slightly better than those reported in Ref. 35, which uses a very similar thermal property data set. This is because of the extra data curation and cleaning steps adopted in this work, as discussed above in the Data Set section. Wu et al.37 and Tao et al.38 report similar R2 values of 0.91 and 0.93, respectively, for a similar-sized Tg data set. Chen et al.39 and Tao et al.40 obtain slightly worse R2 values for learning Tg. We note that different training and evaluation schemes lead to different metrics and only the exact same data sets (and splits) should be compared for benchmarking purposes. The mechanical and gas permeability meta learners show very high R2 values of 0.94, 0.96, 0.94, and 0.91 for E, σy, σb, and ϵb, respectively, and 0.99, 0.99, 0.99, 1.00, 0.99, and 0.99 for the six gases g ∈ {O2, CO2, N2, H2, He, CH4}, respectively. The overall performance of the three developed meta learners with averaged R2 values of 0.97, 0.94, and 0.99 is exceptional and may be credited to the large data set of almost 23,000 data points, additional data curation measures, well-conditioned and smooth fingerprints, and fully hyperparameter-optimized multitask deep neural networks. We use the meta learners for property predictions. The cross-validation metrics indicate the generalization error for learning the data set (see also Supplementary Discussion on “Generalization and data set errors”). The individual parity plots of the meta learners for each property can be found in Supplementary Figs. 7–9. Supplementary Fig. 11 shows good agreement between predictions and data points for four copolymers, which are included in the test data set only.
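For reference, the two metrics quoted throughout this section follow the standard definitions of RMSE and the coefficient of determination. The sketch below uses hypothetical Tg values chosen only to make the arithmetic easy to follow. They are not data from this study.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical glass transition temperatures in K (illustrative only).
tg_measured = [250.0, 300.0, 350.0, 400.0]
tg_predicted = [255.0, 295.0, 355.0, 395.0]

tg_rmse = rmse(tg_measured, tg_predicted)    # every error is 5 K, so RMSE = 5 K
tg_r2 = r2_score(tg_measured, tg_predicted)  # 1 - 100 / 12500 = 0.992
```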

Table 2 Average RMSEs and R2 values of the five cross-validation models and of the meta learner, computed from predictions on the respective validation data sets. The cross-validation uncertainties are one standard deviation (1σ).

Bioplastic search space

In the next step, we consider a bioplastic space that can be searched using the property predictors developed in the last section. As shown in Fig. 1a, 540 PHAs and 13 conventional polymers define and bound this space. The 540 PHAs are devised through variations of the number of carbon atoms in the main-chain and side-chain from 1 to 6 (n and m in Fig. 1a), and by terminating the side-chains with 15 different functional groups (6 × 6 × 15 = 540; see Supplementary Fig. 2). The bio-copolymers of this space are generated by the outer product of PHAs and conventional polymers at eleven different compositions (c = 0, 0.1, …, 1). The total number of bioplastics in the search space amounts to 1 373 503 and is composed of 553 homopolymers, 1 309 770 copolymers of PHA-only, and 63 180 copolymers of PHAs and conventional polymers. The 13 conventional polymers were selected according to the list of most commonly used plastics and are documented in Supplementary Fig. 3.
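The stated total of 1 373 503 candidates can be reproduced with a short count. The sketch below rests on three assumptions that the text implies but does not spell out: copolymer pairs are unordered, the end points c = 0 and c = 1 are counted once as homopolymers (leaving nine intermediate compositions), and conventional-conventional pairs are not part of the space.

```python
from math import comb

n_pha = 540
n_conventional = 13
n_intermediate = 9  # c = 0.1, 0.2, ..., 0.9 (c = 0 and c = 1 are homopolymers)

# All single-component polymers in the space.
homopolymers = n_pha + n_conventional

# Unordered pairs of distinct PHAs at each intermediate composition.
pha_only = comb(n_pha, 2) * n_intermediate

# Each PHA paired with each conventional polymer at each intermediate composition.
pha_conventional = n_pha * n_conventional * n_intermediate

total = homopolymers + pha_only + pha_conventional
print(homopolymers, pha_only, pha_conventional, total)
```

Under these assumptions the three subgroup counts add up exactly to the total quoted in the text.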

Figure 2a, b display the 2D uniform manifold approximations and projections (UMAPs)41 of two different fingerprint subspaces. These subspaces contain all fingerprint components of the selected polymers. The fingerprint subspace of Fig. 2a contains five PHAs (green dots, c = 1), the remaining 548 polymers (blue dots, c = 0), and copolymers (c = 0.1, 0.2, …, 0.9) that connect the five PHAs and the remaining 548 polymers. Interestingly, UMAP has identified similar polymers and agglomerated them into the shape of stars. The corner vertices of these stars are the fingerprints of the copolymers closest to the five PHAs (c = 0.9), while all other dots in the stars indicate fingerprints of polymers of similar chemistry. For example, the dots of the topmost star in Fig. 2a show fingerprints of PHA copolymers containing a nitro phenyl functional group, while the rightmost star includes fingerprints of PHA copolymers containing benzonitrile. The different clusters in Fig. 2b have 13 corner vertices (some of them are hidden) that indicate the fingerprints of the copolymers closest to the 13 conventional polymers (c = 0.9, see Fig. 1a), which are included in the fingerprint subspace of Fig. 2b, instead of the five PHAs as in Fig. 2a. Equivalent to the stars, the different clusters contain fingerprints of polymers of similar chemistry. The agglomeration to stars or clusters illustrates that the used fingerprints (i) unambiguously distinguish polymers; (ii) position related polymers of similar chemistries in the vicinity (i.e., pack all copolymers with a specific side-chain functional group in the same part of the space); (iii) and thus create a physically meaningful learning space with chemical similarity integrated that is well-suited for machine learning.

Fig. 2: Two-dimensional UMAP41 plot of two fingerprint subspaces that contain only fingerprints of selected polymers.

a Five PHAs. The dark green dots (c = 1) show the fingerprints of the SMILES strings [*]OCCC(=O)[*], [*]OC(O)CC(=O)[*], [*]OC(C(=O)O)CC(=O)[*], [*]OC(C(=O)OC)CC(=O)[*], and [*]OC(N)CC(=O)[*]. b 13 conventional polymers. The dark green dots show the fingerprints of the 13 conventional polymers. The dark blue dots (c = 0) in panels a and b indicate the fingerprints of the remaining 548 and 540 polymers in the bioplastic search space (a total of 553 polymers), respectively. The dots with intermediate colors (green to blue) indicate the fingerprints of connecting copolymers. PC1 and PC2 represent the first and second principal components in the UMAP projection, respectively.

Figure 3 displays property relations for a selected set of properties of the bioplastic search space in order to qualitatively assess our predictions and verify physical trends expected based on chemical intuition. The trend in Fig. 3a is that polymers of high Tg values also have high Tm and room temperature E values. This confirms our chemical intuition that Tg is approximately linearly correlated to Tm, and high Tg and/or Tm polymers have stiffer morphologies thus possessing high E values. Also, we observe that the correlation of Tg and Tm is not sharp but broad, which arises from the different side-chain functional groups in the search space. In contrast, Fig. 3b suggests little to no correlation of Tm and σb, except that the range spanned by σb at a given Tm broadens as Tm increases. However, σb is intuitively correlated to E, i.e., stiffer materials (high E) break at higher stresses (high σb). Figure 3c shows a roughly linear correlation of \(\mu_{\mathrm{CO}_2}\) and \(\mu_{\mathrm{O}_2}\) that again agrees with chemical intuition and lends credibility to the developed predictors.

Fig. 3: Property relations of almost 1.4 million bioplastic candidates.

Tm, Tg, σb, \(\mu_{\mathrm{O}_2}\), \(\mu_{\mathrm{CO}_2}\), and E in panels a–c stand for melting temperature, glass transition temperature, tensile strength at break, O2 gas permeability, CO2 gas permeability, and Young’s modulus, respectively. The data point densities are indicated in the plot margins.

Bioplastic Replacements

Up to this point, we have discussed the training and validation of three multitask deep neural networks (each targeting separately the thermal, mechanical, and gas permeability properties) to forecast 13 polymer properties, the consideration of a search space of over 1.3 million bioplastic candidates, and predictions for each of the candidates in the search space. Next, we search the candidate set for suitable replacements for seven petroleum-based and commonly used plastics listed in Table 3. The search is performed following a two-step protocol. In the first step, we employ a nearest neighbors search to find the five closest replacements (within the target property space) for each of the seven plastics and in each copolymer subgroup of PHA-only and PHAs with conventional polymers. We employ the nearest neighbors algorithm using standard parameters as implemented in Scikit-learn36. Table 3 reports averaged property values of polymers at standard condition (films or pieces from neat resin). The values of PE match the properties of medium-/ high-density PE the best. Process and manufacturing conditions can impact these property values. We note that the nearest neighbors search may be performed for any polymer with property values that deviate from the average values considered here. The nearest neighbors search algorithm and nearly 1.4 million bioplastic candidates, including the predicted properties, are shared on GitHub (see Code Availability section).
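The first step of the protocol can be illustrated with a brute-force nearest neighbors search. The study itself uses the Scikit-learn implementation with standard parameters. The dependency-free stand-in below is only a sketch, and the per-property min-max normalization, the candidate names, and all property values are assumptions made for illustration.

```python
import math

def column_bounds(rows):
    """Return (min, max) for each property column."""
    return [(min(col), max(col)) for col in zip(*rows)]

def normalize(row, bounds):
    """Scale each property to [0, 1] so that no single unit dominates the distance."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]

def nearest_candidates(target, candidates, k=5):
    """Return the names of the k candidates closest to the target in property space."""
    names = list(candidates)
    bounds = column_bounds([candidates[n] for n in names] + [target])
    scaled_target = normalize(target, bounds)
    ranked = sorted(
        names,
        key=lambda n: math.dist(normalize(candidates[n], bounds), scaled_target),
    )
    return ranked[:k]

# Hypothetical predicted properties: (Tg in K, E in MPa, O2 permeability in barrer).
candidates = {
    "cand_A": (350.0, 2900.0, 1.5),
    "cand_B": (250.0, 500.0, 4.0),
    "cand_C": (370.0, 3100.0, 2.5),
    "cand_D": (300.0, 1500.0, 0.5),
}
target = (373.0, 3000.0, 2.0)  # hypothetical commodity plastic profile

top2 = nearest_candidates(target, candidates, k=2)
```

Because the target is just a property vector, the same call works for any polymer whose property values differ from the averaged values used here.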

Table 3 Measured properties of petroleum-based commodity plastics that in total account for 75.1% of Europe’s yearly plastic production in 2019 (see usage column)2.

In the second step, we use our domain expertise to pick the most promising bio-replacement from the five candidates based on its potential to be synthesized. The most promising bio-replacements for each commodity plastic and for each of the two copolymer subgroups are reported in Fig. 4. The full list of bio-replacements (70) is provided as Supplementary Data 1.

Fig. 4: PHA-only and PHA-conventional bio-replacements for seven commodity plastics.

Full polymer names are listed in Table 3.

Figure 5a shows the property distributions of the bioplastic candidate set along with the properties of the seven commodity plastics (see Table 3) indicated as triangles. As expected, the Tm peak is shifted to higher temperatures (by around 80 K) with respect to the Tg peak. Among the mechanical properties, the densities of E and σb peak at around 540 MPa and 20 MPa, respectively, while ϵb shows a broad distribution, which covers most of the data range. Moreover, the majority of \(\mu_{\mathrm{O}_2}\) and \(\mu_{\mathrm{CO}_2}\) values are below 4 barrer and 20 barrer, respectively, which matches the expected value range of this polymer class35. Overall, all commodity plastics (triangles) lie within the property ranges spanned by the bioplastic search space. However, because the triangles often lie in the tails of the property distributions, it is challenging (but possible) to find suitable replacements. Figure 5b–h compares the experimental properties of the commodity plastics with the predicted properties of their top bio-replacements identified in Fig. 4 using radar charts. Qualitative graphical pairwise similarities between the property profiles in each radar chart indicate that the identified bio-replacements do indeed exhibit similar sets of properties with respect to the experimental properties. A comparison of the predicted and experimental properties of the seven commodity plastics is shown in Supplementary Fig. 10.

Fig. 5: Experimental and predicted properties.

a Property density profiles computed over the entire prediction set of bioplastic candidates. Missing x-axes beyond a certain cutoff indicate zero predicted property densities over those property ranges. The triangles show experimental properties of the seven commodity plastics. Full polymer names are listed in Table 3. b–h Property radar charts for each commodity plastic. Triangles with solid lines show the experimental properties. Circles with dashed lines and diamonds with dotted lines indicate predicted properties of the bio-replacements in Fig. 4 for the copolymer subgroups of PHA-only and PHA-conventional polymers, respectively.

Synthesis Opportunities

It is interesting to note that all PHA-only and PHA-conventional bio-replacements in Fig. 4 contain aromatic groups in the side-chain. The biosynthesis of PHAs containing an aromatic monomer was first reported in 1990 by Fritzsche et al.42 for Poly(3-hydroxy-5-phenylvalerate) and since then a wide range of aromatic side-chain functional groups have been introduced into PHAs through biosynthesis routes. For instance, see a recent comprehensive review by Ishii-Hyakutake et al.43 for a more detailed discussion. Looking at the complexity of the chemistries that are already accessible via biosynthesis, prospects of producing the identified PHA-only bio-replacements are rather optimistic. For example, Aróstegui et al.44 reported that the Pseudomonas oleovorans bacterium synthesizes PHAs with nitrophenyl side-chain functional groups, which occur in the PHA-only bio-replacements of PVC, PS, and PEN in Fig. 4. Moreover, engineering the bacterial PHA synthesis remains a highly active area of research with novel genome editing techniques, such as clustered regularly interspaced short palindromic repeats interference (CRISPRi),45 that carry enormous potential for future breakthroughs in terms of both the accessible chemical diversity in PHA biosynthesis of homo- and copolymers as well as the yield optimization of the produced polymer chemistries.

Chemical synthesis routes for PHAs have been widely reported in the literature as well46,47,48,49. The potential for the chemical synthesis of the copolymers in this work (see Fig. 4 or Supplementary Data 1) lies in the ability to combine two comonomers of differing composition. Methods used in the synthesis of polystyrene-co-lactone copolymers50,51 can lead to the desired formation of the PHA-conventional bio-replacements of PP and PS. A chemical synthesis route for the PE and Nylon 6 PHA-conventional bio-replacements may follow similar steps used for the copolymerization of PHA/PEO (polyethylene oxide) copolymers19. The PVC, PET and PEN bio-replacements from PHA-conventional polymers may be produced via a reactive twin-screw extrusion process to form block structures containing phthalate-co-lactones.52 Alternatively, a copolymer of repeating phthalate/lactone units has recently been produced via a copolymerization reaction of ε-lactone with degraded PET53,54. We believe the predictions based on our work have potential to translate into new PHA biopolymers or copolymers and will inspire development of new PHA-only or hybrid conventional-PHA polymer synthesis routes.

The developed property predictors used in this work have several limitations which are largely tied to the availability of training data. The property predictors ignore process and manufacturing conditions as well as certain details of morphology (e.g., % crystallinity). Furthermore, other relevant factors such as the molecular weight distribution, form factors (e.g., linear versus branched), additives, etc. are currently not integrated in the property prediction pipeline. In addition, low level configurational trends across diverse chemistries such as subtle variations stemming from chain morphologies or different relative placements of distinct monomer units across the chains are not accounted for. If such information becomes available in the future, the developed predictors can easily be updated to account for these additional effects. The proposed informatics pipeline should be considered an essential first step to solve the sustainable polymer design problem that calls for future improvements such as the development of improved property predictors, integrated synthesizability criteria, and above all, process optimization protocols to make the designed polymers a practical reality.

Conclusion

We have developed an informatics-based bioplastic design pipeline, which has identified promising PHA-based bioplastic replacements for seven petroleum-based commodity plastics. Our study starts with the data collection and curation of approximately 23,000 homo- and copolymer data points spanning 13 properties critical for everyday applications and use. Multitask neural networks with a meta learner, pioneered by us for polymer informatics, forecast thermal, mechanical, and gas permeability properties for polymers over a broad chemical space with unprecedented performance. Using the trained models, we predict the 13 key thermal, mechanical, and gas permeability properties of all polymers in a bioplastic search space of almost 1.4 million polymers. The property predictions are validated and subsequently utilized to find bio-replacements for seven commodity plastics that, in total, account for more than 75% of the yearly plastic production. Using a two-step selection protocol of a nearest neighbors search and synthesizability criteria, we propose two bio-replacements for each commodity plastic and discuss chemical synthesis and biosynthesis routes for these promising polymer replacements. Informatics can help to identify suitable synthesis strategies as well55.

The implications of this work are far-reaching. We currently produce by far more plastics than we can recycle,4 and the demand for plastics is expected to continue to grow at an annual rate of 4%.56 As countries begin to implement restrictions on plastic use, there is an urgent need for bioplastic alternatives to conventional plastics. Yet, the options of commercially available biopolymers are currently very limited.57 Our approach to design and discover functional biopolymers can be applied to greatly accelerate the replacement of conventional plastic materials with more sustainable alternatives, and with possibly even greater performance advantages. The candidate biopolymers, in particular PHAs, might be synthesized by known chemical or biosynthetic routes, hybrid routes, or routes yet to be developed. Our approach can augment conventional empirically based design approaches by guiding the way to more targeted experiments, fewer experimental trials, or shorter times to market. Our work provides an informatics-based screening tool for researchers and developers aiming to produce bioplastics with improved thermomechanical and transport properties for better performance in specific applications, thus accelerating the transition to a circular economy.

Methods

Fingerprinting

The fingerprinting process converts geometric and chemical information of polymers to machine-readable numerical representations for training machine learning models. Homopolymer repeat units are represented as simplified molecular-input line-entry system (SMILES)58 strings that use two stars to indicate the two endpoints of the repeat unit of the polymers, but otherwise follow the SMILES syntax. The fingerprint vector (F) in this work has 849 components and is calculated based upon the SMILES string following a two-step protocol34: First, we compute hierarchical fingerprints that capture structural and key chemical features of each comonomer at three different length scales59,60. At the atomic scale, our fingerprints track the occurrence of a fixed set of atomic fragments (or motifs)61. For example, the fragment “C3-S2-C3” is composed of three contiguous atoms, in this order, a three-fold coordinated carbon, a two-fold coordinated sulfur, and a three-fold coordinated carbon. A vector of such triplets represents the fingerprint components at the lowest hierarchy. The next level uses the quantitative structure-property relationship (QSPR) fingerprints62 to capture features on larger length scales. QSPR fingerprints are often used in chemical and biological sciences, and are used here as implemented in the cheminformatics toolkit RDKit63. Examples of such fingerprints are the van der Waals surface area64, the topological polar surface area (TPSA)65,66, the fraction of atoms that are part of rings (i.e., the number of atoms associated with rings divided by the total number of atoms in the formula unit), and the fraction of rotatable bonds. The fingerprint components at the largest length scale are “morphological descriptors”. They include features such as the shortest topological distance between rings, the fraction of atoms that are part of side-chains, and the length of the largest side-chain30.

Second, we sum the composition-weighted comonomer fingerprints to compute the total copolymer fingerprint vector \(\mathcal{F} = \sum_{i=1}^{N} \mathbf{F}_i c_i\), where N is the number of comonomers in the copolymer, \(\mathbf{F}_i\) the ith comonomer fingerprint, and \(c_i\) the fraction of the ith comonomer. This copolymer fingerprint satisfies the two main demands of uniqueness and invariance to different (but equivalent) periodic unit specifications and renders the fingerprinting routine invariant to the order in which one may sort the comonomers. Contrary to homopolymer fingerprints, copolymer fingerprints may not be interpretable (e.g., the composition-weighted sum of the fingerprint component “shortest distance between rings” of two homopolymers has no physical meaning). In our work, all copolymer data points are of random copolymers, and alternating copolymers were treated as homopolymers.
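The composition-weighted sum and its invariance to comonomer order can be checked with a few lines of code. This is a minimal sketch with hypothetical four-component fingerprints, whereas the fingerprints in this work have 849 components. The function name and the check that the fractions sum to one are assumptions.

```python
def copolymer_fingerprint(comonomer_fps, fractions):
    """Composition-weighted sum of comonomer fingerprints: F = sum_i F_i * c_i."""
    if len(comonomer_fps) != len(fractions):
        raise ValueError("need one fraction per comonomer fingerprint")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("comonomer fractions must sum to 1")
    length = len(comonomer_fps[0])
    return [
        sum(fp[j] * c for fp, c in zip(comonomer_fps, fractions))
        for j in range(length)
    ]

# Hypothetical toy fingerprints (illustrative only).
fp_a = [1.0, 0.0, 0.5, 2.0]
fp_b = [0.0, 1.0, 0.25, 4.0]

# The same copolymer specified in both comonomer orders.
f_ab = copolymer_fingerprint([fp_a, fp_b], [0.75, 0.25])
f_ba = copolymer_fingerprint([fp_b, fp_a], [0.25, 0.75])
```

Both calls yield the same vector, which is the order invariance the text refers to. With a single comonomer at fraction 1, the function returns the homopolymer fingerprint unchanged.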

Multitask predictor and meta learner

Multitask deep neural networks simultaneously learn multiple polymer properties to utilize possible inherent correlations in data. Figure 1b schematically portrays the architecture of the three concatenation-conditioned multitask predictors: the copolymer fingerprint and selector vector are fed to a feed-forward deep neural network that outputs a single property. The selector vector is a binary vector and specifies the output property. For instance, the selector vector of the thermal properties predictor (S1) has three components and encodes Tg as [100], Tm as [010], and Td as [001]. All hyperparameters of the neural networks, such as the number of layers, number of nodes, dropout rates, and activation functions, are optimized using the Hyperband method67 of the Python package KerasTuner68. The final hyperparameters are reported in Supplementary Table 1. All models were implemented using the Python API of TensorFlow69.
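The concatenation conditioning can be made concrete by building the network input for one property. The sketch below only assembles the input vector and leaves out the network itself. The helper names and the toy fingerprint are hypothetical, and placing the selector after the fingerprint is an assumption about ordering.

```python
THERMAL_PROPERTIES = ("Tg", "Tm", "Td")

def selector_vector(prop, properties=THERMAL_PROPERTIES):
    """One-hot selector: 1 for the requested property, 0 everywhere else."""
    if prop not in properties:
        raise ValueError(f"unknown property: {prop}")
    return [1.0 if p == prop else 0.0 for p in properties]

def network_input(fingerprint, prop, properties=THERMAL_PROPERTIES):
    """Concatenate the copolymer fingerprint with the selector vector."""
    return list(fingerprint) + selector_vector(prop, properties)

# Hypothetical four-component fingerprint (illustrative only).
toy_fingerprint = [0.75, 0.25, 0.4375, 2.5]

x_tg = network_input(toy_fingerprint, "Tg")
x_tm = network_input(toy_fingerprint, "Tm")
```

Predicting all three thermal properties of one copolymer therefore means three forward passes that share the fingerprint and differ only in the selector.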

The training protocol of the predictors follows state-of-the-art techniques involving five-fold cross-validation and a meta learner that forecasts the final property values based upon the ensemble of cross-validation predictors34 (see also Supplementary Fig. 1). The parameters of the cross-validation models are fixed when used in the meta learner. The meta learner has the same network architecture as the multitask predictors but receives the five outputs of the multitask predictors as inputs (rather than the copolymer fingerprint). The cross-validation process ensures that each data point has at least once been in the validation data set and allows us to report the generalization error as averaged RMSEs and R2s of the validation data sets. The three meta learners operate as production predictors. After shuffling, the data set was split into two parts. 20% of the data set was set aside for training the meta learners, while the remaining 80% was used for five-fold cross-validation and the validation of the meta learner. All data set splits were stratified by the properties.
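The data partition described above can be sketched as follows. This is an illustration only: it shuffles and splits indices but omits the stratification by property, and the function name, seed, and round-robin fold assignment are assumptions.

```python
import random

def split_indices(n_points, meta_fraction=0.2, n_folds=5, seed=0):
    """Shuffle indices, hold out a share for the meta learner, and fold the rest."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)

    n_meta = round(n_points * meta_fraction)
    meta_indices = indices[:n_meta]  # used to train the meta learner
    cv_indices = indices[n_meta:]    # used for five-fold cross-validation

    # Round-robin assignment gives folds of near-equal size.
    folds = [cv_indices[i::n_folds] for i in range(n_folds)]
    return meta_indices, folds

meta_indices, folds = split_indices(100)

# In cross-validation round k, fold k is the validation set
# and the other four folds form the training set.
validation = folds[0]
training = [i for fold in folds[1:] for i in fold]
```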