The topology of a polymer chain, or equivalently the chain architecture, can substantially influence its properties and those of derivative materials. For example, among natural polymers, linear amylose forms dense aggregates with low aqueous solubility, whereas the analogous but highly branched structure of amylopectin impedes association of chains, thereby enhancing its solubility1. In the realm of synthetic polymers, the branching in low-density polyethylene improves its processability for applications like blow and extrusion molding, whereas linear high-density polyethylene possesses superior mechanical strength and chemical resistance. There is also growing interest in understanding the implications of polymer topology due to advancements in various controllable synthetic methodologies2,3,4,5. These methods enable the creation of polymers with a wide range of complex topologies, such as stars6,7, combs8,9, branches10,11, hyperbranches12,13, dendrimers14,15, rings16,17, and brushes18,19.

Establishing quantitative relationships between polymer topology and material properties remains challenging. Both experimental and computational investigations have enhanced understanding of how polymer topology influences properties of interest to many areas, such as enhanced oil recovery20,21, coatings and adhesives22,23, rheology and fluid dynamics24,25,26, energy storage27,28,29,30,31,32,33, and biomedical applications34,35,36,37,38. Nevertheless, labor-intensive and potentially costly synthesis and characterization typically limit experimental studies to a small set of systems, which may still not yield well-defined topological ensembles3,39. Computationally, although there is no ambiguity associated with the underlying topologies of the polymers or their construction, simulations are often restricted to a particular class of topologies owing to computational costs and perhaps uncertainty about how to tangibly compare diverse topologies40,41. Overall, these factors hinder the construction of general topology-property correlations, which also precludes facile design of topologically complex polymers.

Recent advancements in and applications of machine learning have spurred significant developments in polymer design. These efforts span many applications, such as tailoring the structures of single-chain nanoparticles42,43, enhancing enzyme stability44,45, delivering drugs and therapeutics46,47,48, and identifying gas-separation membranes49. Generative machine learning models50 are a particularly intriguing class of algorithms for chemical design. For example, variational autoencoders (VAEs) are adept at encoding complex data into lower-dimensional latent spaces51,52 and have previously facilitated the generation of small molecules53,54. Applications of VAEs in polymer science are also emerging55,56. Shmilovich et al. combined VAEs with molecular dynamics (MD) simulations and Bayesian optimization to guide the discovery of π-conjugated oligopeptides57 with desirable aggregation behavior to influence optoelectronic properties. In devising the Open Macromolecular Genome (OMG), Kim et al. utilized a generative framework with VAEs that can not only provide polymer structures but also retrosynthesis58, thereby facilitating optimization of synthetically accessible materials. Nevertheless, these and other studies primarily focus on specific chemical spaces or linear polymers, highlighting the need for methods to generate polymers with complex topologies and tailored properties.

In this study, we create a multi-task VAE to generate polymers with specified topology and desired characteristics. This model is developed using an original dataset comprising coarse-grained MD data for over 1300 polymers of various topologies, including star, comb, αω-branched, linear, cyclic, and dendrimer structures, spanning a range of molecular weights. Input and encoding strategies are critically assessed by training several models that aim to reconstruct the polymer topology and also perform auxiliary tasks of estimating the characteristic size of the polymer and classifying its topology. We find that auxiliary tasks enhance the physical interpretability of the learned latent space of the VAE. Our most effective generative modeling framework, TopoGNN, incorporates both graph and topological descriptor features. For demonstrative purposes, TopoGNN is leveraged to produce sets of topologically diverse polymers that exhibit the same characteristic size in dilute solution (Fig. 1, top) but contrasting rheological behavior at finite concentrations (Fig. 1, bottom). This work expands the utility of generative modeling for polymer design and demonstrates how such algorithms can also facilitate controlled studies across complex, topologically diverse polymers.

Fig. 1: Strategy underlying a variational autoencoder of polymer topology.

In the Training Phase (top), molecular dynamics (MD) simulations are employed to compute computationally tractable descriptors, such as the average squared radius of gyration, \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\), for a set of polymers. Information regarding topological descriptors and the polymer graph are then encoded into a lower-dimensional latent space using an artificial neural network (ANN) and a graph neural network (GNN). The latent space is decoded to accomplish reconstruction, regression, and classification tasks. These encoded features are concatenated to form a reduced-dimensional latent space, from which a decoder reconstructs the polymer structure. In the Search Phase (bottom), points are sampled from the latent space to proffer polymers that are predicted to exhibit a target \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) and specified topology. These predictions are evaluated against MD simulations, and post-validation, enable systematic analysis of how topology impacts additional properties, such as viscosity.


Polymer dataset

We first generate and characterize a topologically diverse set of polymers for training and evaluating the VAE. In particular, we initially prepare and simulate 1342 polymers across six architectural classes (11 each for linear and cyclic and 330 each for αω-branched, comb, star and dendrimer); the αω-branched architecture possesses side-chains at two backbone termini and is simply denoted “branch” in figures and tables. The degree of polymerization ranges from 90 to 100 for each architectural class. The VAE here is tasked to encode and decode a specific manifestation of a polymer topology, although the representation of ensembles of such structures is of future interest.

Figure 2a showcases the diversity of structures across a representative set of these polymers. This diversity is also manifest through the variation of numerous topological descriptors shown in Fig. 2b: \({N}_{{{{\rm{nodes}}}}}\) represents the number of nodes, \({N}_{{{{\rm{edges}}}}}\) the number of edges, \(\bar{d}\) the average node degree, \({\bar{d}}_{{{{\rm{nb}}}}}\) the average neighbor degree, \(\delta\) the graph density, \({\phi }_{{{{\rm{diam}}}}}\) the graph diameter, \({\phi }_{{{{\rm{rad}}}}}\) the graph radius, \({\lambda }_{{{{\rm{alg}}}}}\) the algebraic connectivity, \({C}_{{{{\rm{deg}}}}}\) the degree centrality, \({C}_{{{{\rm{bet}}}}}\) the betweenness centrality, and \({r}_{{{{\rm{deg}}}}}\) the degree assortativity. These descriptors, which are derived purely from knowledge of the molecular graph/polymer connectivity, provide a first means to quantitatively characterize and distinguish polymer topologies. The current dataset is restricted to architectures that possess at most one cycle (corresponding to the macrocycle of the cyclic polymer) and also does not describe polymer networks; however, including descriptors related to cyclization or meshes may benefit future models. Despite the uniformity in the number of nodes and edges, which are commonly used to characterize polymers, significant variations are observed in other topological descriptors. For instance, comb, branch, star, and dendrimer topologies exhibit notable differences in descriptors like graph diameter, radius, betweenness centrality, and degree assortativity, even when node and edge counts are identical. Our primary aim is to assess the efficacy of ML to describe properties of topologically complex polymers. Consequently, the CG simulations used to generate data for training and benchmarking are based on the phenomenological Kremer-Grest (KG) model, which is agnostic to constitutional unit chemistry. Furthermore, the results do not represent any specific polymer, although KG can be descriptive of many polymer systems based on mapping schemes59. We anticipate that future work could straightforwardly leverage similar ML architectures with chemically specific parameterized CG models, including those with hydrodynamic interactions.

Fig. 2: Characteristics of generated polymers.

a Representative graphs of polymers from each architectural class. The number of polymers is proportional to occurrence in the dataset. b Comparison of topological descriptors across architectural classes. Values are standard-normalized for the dataset for each topological descriptor. Within a class, data for polymers are organized from left-to-right in ascending order of descriptor values, starting with the top (i.e., “Number of nodes”) and proceeding downward to successively break ties. c The distribution of simulated \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) for each architectural class. The white dot represents the median, the black bar spans the inter-quartile range (i.e., 25th to 75th percentiles), and the width indicates the distribution density. The colors of the graphs in (b) align with those of the violins positioned over the respective classes in (c).

Figure 2c and Supplementary Fig. 1 illustrate the range of characteristic polymer sizes, as expressed through the simulated mean squared radius of gyration \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\), observed in each class. Because the present study imposes a maximum number of monomers, polymers from the linear, cyclic, and dendrimer classes exhibit relatively narrow distributions in \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) by contrast to comb, branch, and star classes. Dendrimers notably form compact, globular structures over the range of simulated molecular weights relative to all other classes. Overall, the dataset is partitioned into a 64/16/20 train/validation/test split for future model construction and evaluation; stratified sampling is used to ensure proportional representation of architectural classes across all splits.
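As an illustration of the stratified 64/16/20 partition described above, the following is a minimal sketch in plain Python. The function name `stratified_split`, the per-class rounding rule, and the random seed are assumptions made for illustration; only the class sizes and the split fractions come from the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.64, 0.16, 0.20), seed=0):
    """Return (train, val, test) index lists, splitting each class separately.

    Assumption: train and validation counts are rounded per class and the
    remainder of each class goes to the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        n_train = round(fractions[0] * n)
        n_val = round(fractions[1] * n)
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Class sizes as stated in the dataset description.
counts = {"linear": 11, "cyclic": 11, "branch": 330,
          "comb": 330, "star": 330, "dendrimer": 330}
labels = [name for name, n in counts.items() for _ in range(n)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting within each class keeps the two small classes (linear and cyclic, 11 polymers each) represented in every subset, which a purely random split would not guarantee.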

Polymer reconstruction and property prediction

Based on prior work on linear polymer featurization60,61, we hypothesized that polymer reconstruction with a VAE could be enhanced if derived topological descriptors were supplied as inputs. To examine this, we evaluate three distinct encoding strategies: TopoGNN, which integrates topological descriptors with graph features; GNN, which exclusively relies on graph features; and Topo, which solely employs topological descriptors. For each strategy, we consider a multitude of models with distinct hyperparameters and their performance across a broad range of evaluation metrics. For example, reconstruction performance is quantitatively evaluated with balanced accuracy (BACC), which measures the accuracy of individual entries in the reconstructed adjacency matrix. For topology classification, F1 score is chosen to address the class imbalance in our dataset. Other metrics include the coefficient of determination \({R}^{2}\) for regression on \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) and the Kullback-Leibler (KL) divergence. Representative models for each encoding strategy are selected using a comprehensive evaluation score (CES) that simultaneously considers all criteria:

$$\,{{\mbox{CES}}}\,\equiv \sqrt{{(\overline{1-{{\mbox{BACC}}}})}^{2}+{(\overline{{{\mbox{KL}}}})}^{2}+{(\overline{1-{R}^{2}})}^{2}+{(\overline{1-{F}_{1}})}^{2}}$$

where \(\overline{a}\) denotes the min-max normalized value of a; CES can be interpreted as the distance from the origin (a perfect model) in a vector space spanned by error metrics.
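The CES definition above can be sketched in a few lines of Python. This is a minimal illustration that assumes the min-max normalization is taken across the set of candidate models being compared; the function names and the metric values in the example are placeholders, not results from this study.

```python
import math


def min_max(values):
    """Min-max normalize a list of values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def comprehensive_evaluation_score(bacc, kl, r2, f1):
    """Return one CES value per candidate model.

    Each argument is a list with one entry per model. Assumption: every
    error term is normalized across the candidate models before combining.
    """
    terms = [
        min_max([1.0 - b for b in bacc]),
        min_max(kl),
        min_max([1.0 - r for r in r2]),
        min_max([1.0 - f for f in f1]),
    ]
    return [math.sqrt(sum(t[i] ** 2 for t in terms)) for i in range(len(bacc))]


# Placeholder metrics for three hypothetical candidate models.
scores = comprehensive_evaluation_score(
    bacc=[0.90, 0.95, 0.92],
    kl=[0.5, 0.2, 0.3],
    r2=[0.95, 0.99, 0.90],
    f1=[0.90, 0.97, 0.95],
)
best = min(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

Because each of the four normalized terms lies in [0, 1], CES is bounded between 0 (best on every metric) and 2 (worst on every metric).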

Table 1 summarizes the performance of these representative models. Across encoding strategies, TopoGNN emerges as the most effective overall, registering the smallest CES. By comparison, the Topo model yields slightly superior performance on regression and a comparable F1 score. Conversely, the GNN model demonstrates a slightly higher balanced accuracy in reconstruction tasks and a lower KL divergence; however, it significantly underperforms in regression and classification. These results support the inclusion of topological descriptors during construction of the VAE.

Table 1 Performance of representative models for each encoding strategy on validation set

To assess model generalizability, we examine the performance of the representative models on the held-out test set. Figure 3 again indicates that TopoGNN delivers consistently strong performance across several evaluation criteria, while GNN and Topo can be deficient in particular metrics. Balanced accuracy is highest for GNN (0.9397), closely followed by TopoGNN (0.9369) and then Topo (0.9164). This suggests that topological descriptors do not necessarily enhance reconstruction performance, although the ability of Topo to effectively reconstruct certain topologies (e.g., branch polymers) highlights the extensive information content encompassed by the 11 topological descriptors. By contrast, directly supplying topological information is clearly advantageous for predicting the characteristic polymer size. Here, TopoGNN stands out as the most effective, achieving the highest mean \({R}^{2}\) (0.9920), surpassing Topo (0.9854) and GNN (0.9639). Meanwhile, GNN achieves the highest mean F1 score (0.9783), followed by TopoGNN (0.9689) and Topo (0.9678); however, all models display statistically comparable results regarding this classification metric. Taken together, this suggests that workflows with VAEs can effectively address complexities induced by these polymer architectures.
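To make the reconstruction metric concrete, the sketch below computes balanced accuracy over the entries of a binary adjacency matrix as the mean of the true-positive and true-negative rates. Restricting the count to the upper triangle (undirected graph, no self-loops) is an assumption made here; the text does not state which entries are included.

```python
def balanced_accuracy(true_adj, pred_adj):
    """Balanced accuracy of predicted edges versus true edges.

    Assumption: only entries above the diagonal are counted, since the
    polymer graph is undirected and has no self-loops.
    """
    tp = tn = fp = fn = 0
    n = len(true_adj)
    for i in range(n):
        for j in range(i + 1, n):
            t, p = true_adj[i][j], pred_adj[i][j]
            if t and p:
                tp += 1
            elif t and not p:
                fn += 1
            elif not t and p:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall on true edges
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # recall on true non-edges
    return 0.5 * (tpr + tnr)


def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an edge list."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1
    return a


# Toy example: a 4-bead linear chain and a reconstruction that drops one
# bond and adds a spurious one.
true_adj = adjacency(4, [(0, 1), (1, 2), (2, 3)])
pred_adj = adjacency(4, [(0, 1), (1, 2), (0, 3)])
bacc = balanced_accuracy(true_adj, pred_adj)
print(bacc)
```

Balanced accuracy is a natural choice for this task because adjacency matrices of polymers are sparse: a predictor that outputs no edges at all would score high on plain accuracy but only 0.5 here.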

Fig. 3: Performance of variational autoencoder models.

Comparison of TopoGNN, GNN, and Topo in terms of polymer graph reconstruction, \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) regression, and topology classification. BACC represents balanced accuracy, \({R}^{2}\) is the coefficient of determination, and F1 measures accuracy based on the harmonic mean of precision and recall. The error bars represent the standard deviation arising from 10 random samplings of the latent space.

For a more nuanced assessment of model quality, Fig. 4 breaks down TopoGNN performance across architectures; comparable information for other models is in Supplementary Figs. 2 and 3. In polymer reconstruction, TopoGNN excels but faces challenges with specific cyclic and comb polymers (Fig. 4a, gray dashed boxes). Notably, GNN generates errors, especially for star polymers, while Topo exhibits minor errors across most architectures. Regarding the prediction of \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) (Fig. 4b), TopoGNN performs well regardless of polymer class. Both GNN and Topo display high correlation, but errors are generally larger for GNN (Supplementary Fig. 2), indicating the difficulty in establishing a direct relationship between graph features and \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\). A saliency analysis (Supplementary Fig. 3) reveals that graph diameter, betweenness centrality, and algebraic connectivity most strongly influence \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\), aligning with their direct correlation with \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) (Supplementary Fig. 1). For topology classification, TopoGNN (Fig. 4c) is broadly effective, with most misclassifications occurring in linear, αω-branched, and comb architectures. These issues are more pronounced in Topo and GNN (Supplementary Fig. 3) and can be compounded by other misclassifications. Overall, TopoGNN, which utilizes both graph and topological features, not only consistently outperforms other models but also delivers high-quality results. The remainder of the article therefore focuses on analysis and applications of TopoGNN to illustrate its practical deployment.

Fig. 4: Performance decomposition of TopoGNN.

a Polymer graph reconstructions by TopoGNN, contrasting true (blue) and predicted (red) polymer topologies. b Regression parity plot. The diagonal line signifies ideal regression accuracy, and error bars show standard deviation from random latent space sampling. c Confusion matrix representing the classification performance across various topologies: linear (lin), cyclic (cyc), branch (brn), comb, star, and dendrimer (den). Diagonal entries correspond to accurate classifications, while off-diagonal entries indicate misclassifications.

Latent space exploration and polymer generation

Figure 5 presents the UMAP projection of the 8-dimensional latent space of TopoGNN into a 2-dimensional space for visualization. Distinct topological clusters emerge in Fig. 5a and b, which reveal an organization of the latent space that depends on relationships amongst architectures and their physical properties. Dendrimers, characterized by their high orders of branching, form three mostly isolated and distinct clusters that reflect how the dendrimer architectures were algorithmically generated; they are most closely related to star polymers and αω-branched polymers (particularly those with pom-pom architectures). Branch, comb, and star polymers all notably overlap within the latent space, which is attributed to topological similarities (Fig. 2b). Cyclic and linear polymers are interspersed within comb and branch clusters, with linear polymers sharing a long backbone and cyclic polymers possessing a long ring-closed backbone. This organization is clearly informed by training with auxiliary tasks for predicting \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) and classifying topologies, as illustrated in Fig. 5b. A vertical trajectory in the UMAP space (marked by an increase in Z2) results in an almost monotonic increase in \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) for the generated polymer topologies (Fig. 5c). Conversely, a horizontal trajectory (associated with an increase in Z1) mostly transitions between topology classes with slight variations in \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) (Fig. 5d). Omitting the auxiliary tasks leads to less distinct separation of topological classes and disrupts the monotonicity of \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) (Supplementary Fig. 6). The latent spaces of GNN and Topo (Supplementary Fig. 5) are prone to similar issues. Overall, this highlights the effectiveness of the workflow for TopoGNN to produce an intuitive and physically meaningful latent space.

Fig. 5: Visualization and exploration of the latent space.

a Two-dimensional visualization of the TopoGNN latent space using the Uniform Manifold Approximation and Projection (UMAP) technique. A subset of the data is displayed for clarity, with each marker representing a polymer graph based on its latent vector. Different colors denote distinct topologies. b Organization of (left) \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) and (right) topology in the UMAP-coordinate space. The dots signify the latent vectors of polymer graphs. The two arrows mark regions in the latent space targeted for exploration (i.e., new polymer topology generation). c As exploration progresses with an increase in Z2 in the latent space (represented by a solid line), there is a near-monotonic rise in \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) for the generated polymers. d Progression with an increase in Z1 (indicated by a dashed line) showcases shifts in polymer topology, moving through clusters characteristic of star, comb, and branch topologies.

The latent space of TopoGNN can be used to generate a diverse set of polymer topologies. This is exemplified by computing the Vendi Score (VS) for each architecture (see section “Machine Learning Details” for details) and comparing it to that of the originally constructed dataset. Whereas the VS for the original dataset (1342 points) is 2.0968, that for 1342 topologies generated using TopoGNN is 5.0684, which exceeds those for GNN (4.9580) and Topo (4.3305). Examples of the generated polymer topologies and their distribution are shown in Supplementary Figs. 7–10. This indicates that all models can generate a more diverse range of polymer topologies compared to the original handcrafted dataset, which could have implications for downstream tasks, as explored in the next section.
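For readers unfamiliar with the Vendi Score, the sketch below shows its standard definition: the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n × n positive semidefinite similarity matrix with unit diagonal. The choice of similarity kernel between polymer graphs is specified in the section referenced above and is not reproduced here; the matrices in the example are toy inputs chosen only to show the two limiting cases.

```python
import numpy as np


def vendi_score(similarity):
    """Vendi Score of a sample given its pairwise similarity matrix.

    `similarity` is assumed to be symmetric positive semidefinite with
    ones on the diagonal, so the eigenvalues of K/n sum to one.
    """
    K = np.asarray(similarity, dtype=float)
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # treat 0 * log(0) as 0
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Limiting cases: fully distinct items versus identical items.
vs_distinct = vendi_score(np.eye(4))
vs_identical = vendi_score(np.ones((4, 4)))
print(vs_distinct, vs_identical)
```

The score can be read as an effective number of distinct items: it equals n when all items are mutually dissimilar and 1 when all items are identical.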

Property-guided polymer topology generation

To illustrate one application for TopoGNN, we generate a series of distinct polymer topologies that exhibit specific \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\). While \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) itself is a fundamental characteristic of the polymers, the rationale here is more to demonstrate the production of alternative materials with similar characteristics and further to assess how topology affects other polymer properties, such as rheology, without conflation of other factors. We therefore select target \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) ranges of 7.5 ± 2, 30 ± 2, and 50 ± 2, which represent the low, intermediate, and high regions of \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) in the dataset, respectively (Fig. 2), and conditionally sample polymers from the latent space across the different topological classes. The \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) are then validated for the generated polymer topologies using MD simulation. These results are shown in Fig. 6, which illustrates that TopoGNN can indeed produce a range of distinct structures that exhibit effectively similar \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\). Targeting \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle =7.5\pm 2\) predominantly yields dendrimer and star topologies, targeting \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle =30\pm 2\) yields branch, comb, cyclic, and star topologies, and targeting \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle =50\pm 2\) mostly yields branch and comb architectures. With the current approach, however, architectures that satisfy specific targets cannot be arbitrarily produced owing to the molecular-weight restrictions. For example, dendrimers are more or less restricted to low \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\), while linear polymers are mostly restricted to larger \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\). Moreover, relatively few polymers meet the ambitious target of 50 ± 2, which is consistent with the paucity of data points around \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle =50\pm 2\) within the original dataset; however, the \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) values of this group uniformly exceed those of the polymers generated for the smaller 30 ± 2 target. Interestingly, TopoGNN also proffers architectures, such as irregular dendrimers and nuanced branching patterns in stars and combs, that go beyond those of the original dataset. Overall, these results reflect the intended capability of TopoGNN to generate a broad spectrum of original polymer topologies that align with a target property.

Fig. 6: Generation of polymer topologies with target \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\).

Topologies are generated aiming for target \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) values of 7.5 ± 2, 30 ± 2, and 50 ± 2. Each generated topology is accompanied by its type and the predicted \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) from TopoGNN, presented in parentheses on the x-axis. A violin plot showcases the revalidation of \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) via MD simulation for every topology. The gold dot marks the \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\), while the white dot stands for the median. The black bar represents the interquartile range, and the plot width reflects the distribution density of \({R}_{{{{\rm{g}}}}}^{2}\). Two dashed lines highlight the \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) range used in the guided search.

Rheological Analysis

The viscosity-modifying properties of polymers are key to numerous applications62,63,64 and depend on a variety of factors, including unit chemistry, polymer composition, and chain topology65,66. The relative impacts of such factors can be difficult to disentangle. Using TopoGNN, we specifically explore the influence of polymer topology on rheological characteristics. While solution viscosity at dilute concentrations is primarily determined by polymer size, which sets the overlap concentration67, we control for this factor by designing topologies with specified \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) and examine topological implications across a range of concentrations. Figure 7a presents the concentration-dependent shear viscosity as determined from MD simulations of four selected topologies. Differences emerge beyond \(0.4\,{\sigma }^{-3}\), with cyclic polymers showing lower viscosities due to reduced entanglements, and branched polymers exhibiting elevated viscosities due to extended side chains. Star and comb polymers demonstrate similar, somewhat lower shear viscosities compared to branched polymers, highlighting the impact of side-chain position and density on entanglement effectiveness. Additionally, we observe nuanced differences in frequency-dependent storage and loss moduli, \(G^{\prime}\) and \(G^{\prime\prime}\), across topologies and concentrations (Fig. 7b, c). At concentrations below \(0.6\,{\sigma }^{-3}\), all solutions exhibit liquid-like viscous behavior at low frequencies and solid-like behavior at high frequencies; as concentration increases, star, branch, and comb polymers display three crossover frequencies. In contrast, cyclic polymers maintain a single crossover frequency, indicating less nuanced viscoelastic behavior. The presence of multiple crossover frequencies at higher concentrations (Fig. 7c and Supplementary Fig. 11) has been previously observed in both simulations and experiments68,69,70,71,72. Notably, the plateau between the lowest and second lowest crossover frequencies, where \(G^{\prime}\, >\, G^{\prime\prime}\), signifies a rubbery plateau attributed to polymer entanglement. Regarding the relative viscosities of differing architectures, some results are also consistent with prior work. For example, cyclic polymers exhibit relatively lower viscosities, which is due to the absence of free ends that tends to reduce entanglements73, and αω-branched polymers tend to possess higher viscosities, which is consistent with expectations set by experimental investigation of the impact of side-chain length on viscosity74,75. Here, polymers classified with comb architectures have a similar number of side chains and similar backbone lengths as those classified as αω-branched architectures; however, the side-chains are shorter, resulting in less effective friction and lower viscosity. This highlights the potential for modulating rheological properties through strategic architecture design.
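The crossover frequencies discussed above are the frequencies at which the storage and loss moduli are equal. As a minimal sketch of how they might be extracted from tabulated moduli, the function below locates sign changes of log(G′/G″) and interpolates linearly in log-frequency. The function name, the interpolation scheme, and the single-mode Maxwell model used as test input are assumptions for illustration and do not represent the analysis pipeline of this study.

```python
import math


def crossover_frequencies(omega, g_storage, g_loss):
    """Frequencies where G' = G'', found by log-log linear interpolation.

    All three inputs are equal-length sequences of positive numbers, with
    `omega` sorted in ascending order.
    """
    diff = [math.log(gp) - math.log(gl) for gp, gl in zip(g_storage, g_loss)]
    crossings = []
    for k in range(len(omega) - 1):
        d0, d1 = diff[k], diff[k + 1]
        if d0 == 0.0:
            crossings.append(omega[k])  # grid point lies exactly on a crossover
        elif d0 * d1 < 0.0:
            frac = d0 / (d0 - d1)  # fractional position of the zero
            log_w = math.log(omega[k]) + frac * (
                math.log(omega[k + 1]) - math.log(omega[k]))
            crossings.append(math.exp(log_w))
    if diff[-1] == 0.0:
        crossings.append(omega[-1])
    return crossings


# Synthetic input: single-mode Maxwell fluid with relaxation time tau,
# for which the only crossover is at omega = 1 / tau.
tau, modulus = 2.0, 1.0
omega = [0.01, 0.1, 1.0, 10.0]
g_storage = [modulus * (w * tau) ** 2 / (1 + (w * tau) ** 2) for w in omega]
g_loss = [modulus * (w * tau) / (1 + (w * tau) ** 2) for w in omega]
crossings = crossover_frequencies(omega, g_storage, g_loss)
print(crossings)
```

For noisy simulation data, smoothing the moduli before searching for sign changes would likely be needed to avoid spurious crossings.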

Fig. 7: Effect of polymer topology on shear viscosity and complex moduli at comparable \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\).

a Influence of polymer topology and concentration on viscosity, featuring topologies such as star, branch, comb, and cyclic, each with a \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) of approximately 30 ± 2. b Relationship between polymer topology, concentration, and complex moduli crossover frequencies. c Complex moduli for various topologies at concentrations of 0.1 and 0.8, with the star symbol marking the crossover point.


This study employed variational autoencoders to address emergent combinatorial complexity of diverse polymer topologies, which has been scarcely addressed in machine learning of macromolecules. We constructed an extensive dataset featuring the average squared radius of gyration (\(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\)) for 1342 polymers with various architectures, including linear, cyclic, branch, comb, star, and dendrimer structures. By analyzing different encoding strategies and input representations, we found that meaningful latent spaces of polymers with complex topologies can be established by (i) incorporating both graph-explicit and graph-derived features and (ii) coupling graph reconstruction tasks with auxiliary prediction tasks, such as those related to physical properties. Probabilistic sampling over the latent space was shown to result in rich topological diversity. These generative capabilities were then used to produce distinct polymer topologies with target characteristic sizes in dilute solution. This enabled subsequent investigation by coarse-grained molecular dynamics into how topology influences rheological properties, such as shear viscosity and viscoelastic moduli, while controlling for polymer size. While all architectures exhibited similar rheological behavior at relatively low concentrations, distinct responses emerged at higher concentrations. For instance, localized branches at chain ends resulted in more viscous solutions compared to other architectures, including cyclic structures that exhibited minimal entanglements. Apart from illustrating how rheological behavior might be tuned or altered via polymer architecture, this also showcases a paradigm for studying the physical properties of topologically distinct systems.

This work also invites several future research directions. Particularly, TopoGNN exhibits promising potential as a generative model, offering a cost-effective alternative to experiments or simulations in predicting properties like \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\). While \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) serves as a straightforward and computationally accessible quantity, there is interest in extending the strategy to incorporate or utilize other properties. Although this work leveraged TopoGNN to simply compare rheological properties in systematic fashion, in the future, it may be deployed to guide design efforts aimed at optimizing polymer properties. We also note that the dataset and machine learning framework are currently limited to polymers with a narrow range of bead numbers (equivalently, molecular weights). Future research will explore the extensibility and transferability of machine learning architectures across various molecular weights, potentially through the use of string-based representations76,77,78.

This study also focused on specific structural and rheological characterizations of chemically homogeneous and precise polymers at coarse-grained resolution. The ML framework might be feasibly extended to address compositional complexity; however, such efforts will need to address increased data burdens to capture the behavior of such systems. For this and other reasons, TopoGNN and related ML strategies will benefit from advancements that accelerate molecular simulation to increase data throughput, and those that improve the accuracy of CG models, which will expand the validity and range of properties that can be reliably computed. In particular, parameterized CG models may not be transferable across all thermodynamic conditions of interest, and dynamical consistency between CG models and high-resolution systems poses a persistent challenge40. Finally, the dataset and ML models introduced here feature precisely defined polymer architectures. Although the architectural classes studied are broadly accessible, precision control over architecture is beyond current synthetic capabilities. Therefore, future efforts in both the ML and modeling space must address how to predict and represent ensembles of polymer structures77 that are accessible with modern synthetic approaches and how to appropriately tailor generative capabilities towards these58. Overall, understanding and controlling the properties of polymers, which involve chemical, compositional, and topological complexity, and aligning these properties with synthesizable polymer systems remain a significant challenge in polymer science. This study provides a foundation to pursue these directions.


Description of dataset

The dataset comprises 1342 polymer architectures, each containing between 90 and 100 constitutional units, or beads. Polymer architectures encompass a wide range of topologies, including linear, cyclic, branch, comb, star, and dendrimer structures. Due to limitations on bead count, linear and cyclic topologies are restricted to 11 distinct polymers each, whereas the other topologies are represented by 330 distinct polymers each. The polymers are chemically homogeneous with all beads treated equivalently. The procedure for generating polymer graphs is described in the Supplementary Discussion Section 2. For each polymer graph, we calculate an 11-dimensional topological descriptor vector43,79 using the number of nodes, number of edges, average degree, average neighbor degree, density, diameter, radius, algebraic connectivity, degree centrality, betweenness centrality, and degree assortativity as elements. For further details on these descriptors, readers are referred to Supplementary Discussion Section 1.

Calculation of polymer properties

Radius of gyration

We investigate the structural properties of individual polymer chains using coarse-grained molecular dynamics. To do so, we compute the gyration tensor S:

$${{{\bf{S}}}}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left({{{{\bf{r}}}}}_{i}-{{{{\bf{r}}}}}_{{{{\rm{cm}}}}}\right){\left({{{{\bf{r}}}}}_{i}-{{{{\bf{r}}}}}_{{{{\rm{cm}}}}}\right)}^{T}$$

where ri denotes the position vector of the ith bead, rcm represents the center-of-mass position of the polymer, and T indicates the transpose operation. Diagonalizing yields \({{{\bf{S}}}}={{{\rm{diag}}}}({\lambda }_{1}^{2},{\lambda }_{2}^{2},{\lambda }_{3}^{2})\) where the diagonal elements are the principal moments of the gyration tensor ordered as λ1 ≤ λ2 ≤ λ3. The squared radius of gyration can be subsequently computed as

$${R}_{{{{\rm{g}}}}}^{2}={\lambda }_{1}^{2}+{\lambda }_{2}^{2}+{\lambda }_{3}^{2}$$

and quantifies the size of a given polymer conformation. The ensemble average \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) is then constructed by averaging over all sampled configurations. This ensemble-averaged quantity serves as the target for the regression auxiliary task.
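As an illustration of the definitions above, the following minimal sketch computes the gyration tensor and \({R}_{{{{\rm{g}}}}}^{2}\) with NumPy. The function names are hypothetical and are not taken from the original analysis code; equal bead masses are assumed, consistent with the chemically homogeneous polymers considered here.

```python
import numpy as np

def gyration_tensor(positions):
    """Return the 3x3 gyration tensor for an (N, 3) array of bead positions.

    Assumes equal bead masses, so the center of mass is the plain mean.
    """
    r = np.asarray(positions, dtype=float)
    dr = r - r.mean(axis=0)        # displacements from the center of mass
    return dr.T @ dr / len(r)      # (1/N) * sum_i dr_i dr_i^T

def squared_radius_of_gyration(positions):
    """Sum of the eigenvalues of the gyration tensor (equal to its trace)."""
    eigenvalues = np.linalg.eigvalsh(gyration_tensor(positions))
    return float(eigenvalues.sum())

def ensemble_average_rg2(frames):
    """Average R_g^2 over an iterable of sampled configurations."""
    return float(np.mean([squared_radius_of_gyration(f) for f in frames]))

# Toy configuration: two beads separated by a distance of 2 along x.
two_beads = [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
rg2_two_beads = squared_radius_of_gyration(two_beads)
```

Because the sum of the eigenvalues equals the trace, the diagonalization is not strictly required for \({R}_{{{{\rm{g}}}}}^{2}\) alone; it is kept here to mirror the text.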

Rheological properties

We also characterize several rheology-related properties for select polymer systems. The shear viscosity η of the polymer solution is formally calculated via

$$\eta =\int_{0}^{\infty }G(t)dt$$

where G(t) denotes the stress relaxation modulus. We determine G(t) using the Green-Kubo relation

$$G(t)=\frac{1}{3}\sum\limits_{\alpha \beta =xy,xz,yz}\frac{V}{{k}_{B}T}\langle \overline{{\sigma }_{\alpha \beta }(t)}\overline{{\sigma }_{\alpha \beta }(0)}\rangle ,$$

with V representing the simulation box volume, \(\overline{{\sigma }_{\alpha \beta }}(t)\) signifying the off-diagonal stress tensor components averaged at intervals of 1000 steps, and 〈  〉 denoting an ensemble-average. Often, G(t) exhibits significant noise at long times, which renders direct numerical integration of Eq. (4) unreliable. Therefore, following prior work80, we fit the simulated G(t) data to a generalized Maxwell model, given by \(G(t)={\sum }_{p}{G}_{p}\exp (-t/{\tau }_{p})\), where Gp and τp represent the modulus and relaxation time of the p-th element, respectively. This approach yields the viscosity η = ∑pGpτp. We also compute the storage modulus (\(G^{\prime}\)) and the loss modulus (\(G^{\prime\prime}\)) to better characterize the viscoelastic properties of the polymers. These moduli are obtained from the Fourier transform of the stress relaxation modulus, yielding

$$\begin{array}{l}{G}^{* }(\omega )\,=\,i\omega \int_{0}^{\infty }G(t){e}^{-i\omega t}dt\\\qquad\quad \,=\,G^{\prime} (\omega )+iG^{\prime\prime} (\omega ).\end{array}$$

Here, \(G^{\prime} (\omega )\), the storage modulus, reflects the elastic, or energy-storing, aspect of the material, while \(G^{\prime\prime} (\omega )\), the loss modulus, represents the viscous, or energy-dissipating, component. This analysis is thus restricted to linear viscoelasticity.
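For a generalized Maxwell model, the integrals above have closed forms: \(\eta ={\sum }_{p}{G}_{p}{\tau }_{p}\), \(G^{\prime} (\omega )={\sum }_{p}{G}_{p}{\omega }^{2}{\tau }_{p}^{2}/(1+{\omega }^{2}{\tau }_{p}^{2})\), and \(G^{\prime\prime} (\omega )={\sum }_{p}{G}_{p}\omega {\tau }_{p}/(1+{\omega }^{2}{\tau }_{p}^{2})\). The sketch below evaluates these expressions for a given set of fitted modes. It assumes the mode parameters are already available; the fitting procedure itself is not shown, and the function names are hypothetical.

```python
import numpy as np

def maxwell_modulus(t, G_p, tau_p):
    """Stress relaxation modulus G(t) = sum_p G_p * exp(-t / tau_p)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))[:, None]
    G_p = np.asarray(G_p, dtype=float)
    tau_p = np.asarray(tau_p, dtype=float)
    return (G_p * np.exp(-t / tau_p)).sum(axis=1)

def maxwell_viscosity(G_p, tau_p):
    """Zero-shear viscosity from the analytic time integral of G(t)."""
    return float(np.dot(np.asarray(G_p, float), np.asarray(tau_p, float)))

def maxwell_dynamic_moduli(omega, G_p, tau_p):
    """Storage and loss moduli of a generalized Maxwell model."""
    w = np.atleast_1d(np.asarray(omega, dtype=float))[:, None]
    G_p = np.asarray(G_p, dtype=float)
    x = w * np.asarray(tau_p, dtype=float)       # omega * tau_p per mode
    storage = (G_p * x**2 / (1.0 + x**2)).sum(axis=1)
    loss = (G_p * x / (1.0 + x**2)).sum(axis=1)
    return storage, loss

# Single illustrative mode (values chosen only for demonstration).
eta_demo = maxwell_viscosity([2.0], [0.5])
storage_demo, loss_demo = maxwell_dynamic_moduli([2.0], [2.0], [0.5])
```

For a single mode, the storage and loss moduli cross at ω = 1/τ, which provides a quick sanity check on fitted parameters.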

MD simulation details

MD simulations are used to generate polymer configurations for the characterization of polymer properties. All simulations are conducted using the LAMMPS simulation package81 in reduced units; the units of mass, distance, and energy are denoted by m, σ, and ε, respectively. The reduced time unit follows as \({(m{\sigma }^{2}/\varepsilon )}^{1/2}\). All simulations are considered to take place in an implicit athermal solvent environment, with dynamics of the polymer(s) governed by the Langevin equation, such that hydrodynamic interactions are neglected. The equations-of-motion are numerically integrated using the velocity-Verlet integration scheme with a 0.001 timestep. The solvent friction coefficient is set to ς = 0.1.

Polymer interactions are modeled via a combination of bonded and nonbonded potential energy contributions. The total potential energy U of a system with configuration rN is expressed as:

$$U({{\bf{r}}}^{N})=\sum\limits_{{\rm{bonds}}}{U}_{{\rm{vib}}}({r}_{ij})+\sum\limits_{i < j}{U}_{{\rm{nb}}}({r}_{ij}),$$

where rij represents the internal distance calculated from the coordinates rN. The nonbonded energy contributions for all pairs of beads are computed using the following equation:

$${U}_{{{{\rm{nb}}}}}({r}_{ij})=\left\{\begin{array}{ll}4{\varepsilon }_{ij}\left[{\left(\frac{{\sigma }_{ij}}{{r}_{ij}}\right)}^{12}-{\left(\frac{{\sigma }_{ij}}{{r}_{ij}}\right)}^{6}\right]+{\varepsilon }_{ij},\quad {{{\rm{if}}}}\,i,j\,{{{\rm{are}}}}\,{{{\rm{bonded}}}}\,{{{\rm{and}}}}\,{r}_{ij} \,<\, {2}^{1/6}\\ 0,\qquad\qquad\qquad\qquad\qquad\qquad\quad {{{\rm{otherwise}}}},\end{array}\right.$$

where εij and σij are set to 1. For directly bonded beads, the stretching energy is calculated as:

$${U}_{{{{\rm{vib}}}}}({r}_{ij})=-\frac{1}{2}{K}_{ij}{({R}_{ij}^{(0)})}^{2}\ln \left[1-{\left(\frac{{r}_{ij}}{{R}_{ij}^{(0)}}\right)}^{2}\right],$$

where Kij is assigned a value of 30, and \({R}_{ij}^{(0)}\) is fixed at 1.5.
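The two functional forms can be sketched as follows, using the stated parameters (εij = σij = 1, Kij = 30, \({R}_{ij}^{(0)}\) = 1.5) as defaults. This is an illustrative, standalone evaluation and not the LAMMPS input used for the simulations. The function names are hypothetical, and the sketch applies only the distance cutoff of the shifted Lennard-Jones term, leaving the choice of which pairs to evaluate to the caller.

```python
import math

def wca_energy(r, epsilon=1.0, sigma=1.0):
    """Purely repulsive, shifted Lennard-Jones energy, truncated at 2^(1/6) sigma."""
    r_cut = 2.0 ** (1.0 / 6.0) * sigma
    if r >= r_cut:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6) + epsilon

def fene_energy(r, k=30.0, r0=1.5):
    """Finitely extensible bond energy; diverges as r approaches r0."""
    if r >= r0:
        raise ValueError("bond length must be smaller than the maximum extension r0")
    return -0.5 * k * r0**2 * math.log(1.0 - (r / r0) ** 2)

# Energies at a separation of one bead diameter (reduced units).
u_wca_at_sigma = wca_energy(1.0)
u_fene_at_sigma = fene_energy(1.0)
```

The shift by ε makes the repulsive term vanish continuously at the cutoff.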

Single-chain simulations

Simulations of single coarse-grained polymer chains (no boundary conditions) are used to characterize \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\). Each simulation is conducted for \(2\times {10}^{7}\) steps, allocating the first half for system equilibration. Configurations for analysis are sampled every \(2\times {10}^{3}\) timesteps during the latter half of the simulation.

Many-chain simulations

Simulations of many chains within a simulation cell with cubic periodic boundary conditions are used for rheological analysis of a subset of polymers with comparable ensemble-averaged square radii of gyration, \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\). Simulations are performed across various concentrations (0.1 to 0.8) to cover both semi-dilute and semi-concentrated regimes. Each simulation uses 100 chains with the simulation cell dimensions adjusted to match the desired concentration. Equilibration periods of \({10}^{7}\) steps are utilized for all simulation concentrations. Upon achieving equilibrium, data are collected for \({10}^{7}\) steps at a timestep of 0.001. We note that using an implicit-solvent environment and neglecting hydrodynamic interactions has implications for simulating rheological properties82. However, while these choices affect data generation and its interpretation relative to the physics of real polymer solutions, they do not affect the analysis of the ML task.

Machine learning details

Data preprocessing

Polymers are represented using graph notation \({{{\mathcal{G}}}}=(V,E)\), where V is the set of nodes, and E is the set of edges. To address the variability in node counts across different polymers, ranging from 90 to 100, we introduce “ghost” nodes with zero-edge connections to standardize graph sizes to 100 nodes using node padding83,84. Because all polymer beads are equivalent, the adjacency vector \({a}_{i}\in {{\mathbb{R}}}^{100}\) serves as the sole node feature for each polymer bead. Elements of this vector are defined such that ai = 1 if node i is connected to the current node, and ai = 0 otherwise. All bonds are also equivalent, and so edge features are not included in the representation. Polymers are also characterized by an 11-dimensional topological descriptor vector \({{{\bf{t}}}}\in {{\mathbb{R}}}^{11}\) as previously described. For the task of polymer reconstruction, an adjacency matrix \({{{\bf{A}}}}\in {{\mathbb{R}}}^{100\times 100}\) is associated with each polymer, where Aij = 1 indicates an edge between nodes i and j, and Aij = 0 indicates no edge. For the auxiliary regression task, each polymer is associated with a label for \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\), denoted \({y}_{r}\in {\mathbb{R}}\). For the auxiliary classification task, each polymer is associated with a one-hot encoded topology label, denoted \({{{{\bf{y}}}}}_{{{{\rm{t}}}}}\in {{\mathbb{R}}}^{6}\). The dataset of 1342 polymers is divided into three subsets: 858 for training (64%), 215 for validation (16%), and 269 for testing (20%). Stratified splitting is used to ensure each subset represents all polymer topologies. The training set is utilized to train the VAE, the validation set for hyperparameter optimization, and the test set to evaluate the model generalizability.
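The padding scheme can be illustrated with a short sketch. It assumes that a polymer is supplied as a list of zero-indexed bonded bead pairs; the helper names are hypothetical and do not correspond to the original preprocessing code.

```python
import numpy as np

MAX_NODES = 100  # graph size after padding with "ghost" nodes

def pad_adjacency(edges, n_nodes, max_nodes=MAX_NODES):
    """Build a symmetric adjacency matrix padded with zero-degree ghost nodes."""
    if n_nodes > max_nodes:
        raise ValueError("polymer has more beads than the padded graph size")
    A = np.zeros((max_nodes, max_nodes), dtype=np.float32)
    for i, j in edges:
        if not (0 <= i < n_nodes and 0 <= j < n_nodes):
            raise ValueError("edge refers to a bead outside the polymer")
        A[i, j] = A[j, i] = 1.0
    return A

def node_features(A):
    """Row i of the adjacency matrix is the adjacency vector of node i."""
    return A.copy()

# Example: a linear chain of 90 beads, padded with 10 ghost nodes.
chain_edges = [(i, i + 1) for i in range(89)]
A_chain = pad_adjacency(chain_edges, n_nodes=90)
X_chain = node_features(A_chain)
```

Under this scheme, the node feature matrix coincides with the padded adjacency matrix, and ghost nodes appear as all-zero rows and columns.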

Model architectures

Overall, we explore three distinct encoder architectures while maintaining a uniform decoder architecture. The first model, designated as TopoGNN, combines a graph encoder with a topological descriptor encoder, thus operating as a multi-input model. The second model, GNN, exclusively employs the graph encoder. The third model, Topo, relies solely on the topological descriptor encoder. The architecture of the VAE for TopoGNN is depicted in Fig. 8. The encoder transforms input data into a latent space representation. Graph inputs are represented using an adjacency matrix \({{{\bf{A}}}}\in {{\mathbb{R}}}^{100\times 100}\) and a node feature matrix \({{{\bf{X}}}}\in {{\mathbb{R}}}^{100\times 100}\), with the adjacency vector serving as the node feature due to identical nodes. The Graph Isomorphism Network (GIN) encoder85, equipped with two graph convolutional layers, maps these inputs into a 32-dimensional feature vector hg. Despite its shallow architecture and narrow receptive fields, GIN has demonstrated robust performance across a range of tasks in materials science and chemistry86,87. The topological descriptor vector is similarly converted into a 32-dimensional feature vector ht by a dense neural network (DNN) encoder. Subsequently, the feature vectors hg and ht are concatenated to yield a combined feature vector \({{{\bf{h}}}}\in {{\mathbb{R}}}^{64}\). Additional dense layers generate the parameters of the latent Gaussian distribution: the mean μ and the logarithm of variance \(\log {{{{\boldsymbol{\sigma }}}}}^{2}\). These parameters define the latent space embedding \({{{\bf{z}}}} \sim {{{\mathcal{N}}}}({{{\boldsymbol{\mu }}}},{{{\boldsymbol{\sigma }}}})\), which has a dimensionality of 8. The decoder then uses samples of z to reconstruct data. A convolutional neural network is used to reconstruct the adjacency matrix \(\hat{{{{\bf{A}}}}}\).
In addition, two distinct neural networks are tasked with predicting \({\hat{y}}_{{{{\rm{r}}}}}\) and \({\hat{{{{\bf{y}}}}}}_{{{{\rm{t}}}}}\). We note that the present approach does not enforce symmetry of the reconstructed adjacency matrix during training, similar to the approach of prior work using a VAE to generate conjugated peptides57. However, symmetry is enforced during the polymer graph reconstruction and generation process by selecting \({\hat{{{{\bf{A}}}}}}_{{{{\rm{sym}}}}}={\max }_{ij}\{{\hat{{{{\bf{A}}}}}}_{ij},{\hat{{{{\bf{A}}}}}}_{ij}^{T}\}\).
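The symmetrization step amounts to an elementwise maximum of the decoder output and its transpose, as in the sketch below. The subsequent thresholding at 0.5 to obtain binary edges is an assumption made for illustration and is not specified in the text; the function names are hypothetical.

```python
import numpy as np

def symmetrize(A_hat):
    """Elementwise maximum of a predicted adjacency matrix and its transpose."""
    A_hat = np.asarray(A_hat, dtype=float)
    return np.maximum(A_hat, A_hat.T)

def binarize(A_sym, threshold=0.5):
    """Convert edge probabilities to 0/1 edges (threshold is an assumed value)."""
    return (np.asarray(A_sym) >= threshold).astype(int)

# Toy decoder output with asymmetric edge probabilities.
A_pred = np.array([[0.0, 0.9],
                   [0.2, 0.0]])
A_sym = symmetrize(A_pred)
A_bin = binarize(A_sym)
```

Taking the maximum retains an edge if either of the two directed predictions supports it.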

Fig. 8: Architecture of the variational autoencoder (VAE) for TopoGNN.

The model compresses information from the graph and topological descriptors. These two sets of compressed features are then concatenated and passed to the latent space, where the model learns a normal distribution characterized by parameters μ and σ. Subsequently, samples drawn from this distribution are used by the decoder to reconstruct the adjacency matrix of the input graph. Additionally, the same samples are used in two auxiliary tasks: predicting the radius of gyration and classifying the topology. The numbers in parentheses indicate the size of each layer.

Loss functions

Training of the VAE uses a composite loss function \({{{{\mathcal{L}}}}}_{{{{\rm{VAE}}}}}\)

$${{{{\mathcal{L}}}}}_{{{{\rm{VAE}}}}}={{{{\mathcal{L}}}}}_{{{{\rm{Rec}}}}}+{{{{\mathcal{L}}}}}_{{{{\rm{KL}}}}}+{\lambda }_{{{{\rm{Reg}}}}}{{{{\mathcal{L}}}}}_{{{{\rm{Reg}}}}}+{\lambda }_{{{{\rm{Cls}}}}}{{{{\mathcal{L}}}}}_{{{{\rm{Cls}}}}},$$

which features terms associated with reconstruction via binary cross-entropy (BCE), \({{{{\mathcal{L}}}}}_{{{{\rm{Rec}}}}}\); Kullback-Leibler (KL) divergence, \({{{{\mathcal{L}}}}}_{{{{\rm{KL}}}}}\); regression for yr via mean absolute error (MAE), \({{{{\mathcal{L}}}}}_{{{{\rm{Reg}}}}}\); and classification for yt via cross-entropy (CE), \({{{{\mathcal{L}}}}}_{{{{\rm{Cls}}}}}\). In Eq. (10), λReg and λCls are hyperparameter weights that can be adjusted to optimize performance. The individual loss terms are defined as follows:

$$\begin{array}{ll}{\mathcal{L}}_{\rm{Rec}}\,&=\,{\rm{BCE}}({{\bf{A}}},\hat{{\bf{A}}})\\ \,&=\,-\mathop {\sum} \limits_{i=1}^{100}\mathop{\sum}\limits_{j=1}^{100}\left[{A}_{ij}{\log} ({\hat{A}}_{ij})+(1-{A}_{ij}){\log} (1-{\hat{A}}_{ij})\right],\end{array}$$
$$\begin{array}{ll}{\mathcal{L}}_{\rm{KL}}\,&=\,{D}_{\rm{KL}}({\bf{z}}\,| | \,{\mathcal{N}}(0,{\bf{I}}))\\ \,&=\,-\displaystyle\frac{1}{2}\sum\limits_{i=1}^{8}(1+{\log} ({\sigma }_{i}^{2})-{\sigma }_{i}^{2}-{\mu }_{i}^{2}),\end{array}$$
$$\begin{array}{ll}{\mathcal{L}}_{\rm{Reg}}\,&=\,{\rm{MAE}}({y}_{\rm{r}},{\hat{y}}_{\rm{r}})\\ \,&=\,| {y}_{\rm{r}}-{\hat{y}}_{\rm{r}}| ,\end{array}$$
$$\begin{array}{ll}{\mathcal{L}}_{\rm{Cls}}\,&=\,{\rm{CE}}({\bf{y}}_{\rm{t}},{\hat{\bf{y}}}_{\rm{t}})\\ \,&=\,-\sum\limits_{i = 1}^{6}{y}_{\rm{t},i}\log ({\hat{y}}_{\rm{t},i}).\end{array}$$
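For concreteness, the four loss terms and their weighted sum can be written in NumPy as below. This is a sketch for a single polymer and is not the TensorFlow implementation used for training; the clipping constant `EPS`, which guards against taking the logarithm of zero, and all function names are assumptions.

```python
import numpy as np

EPS = 1e-7  # assumed clipping constant to avoid log(0)

def reconstruction_loss(A, A_hat):
    """Binary cross-entropy summed over all adjacency-matrix entries."""
    A = np.asarray(A, dtype=float)
    A_hat = np.clip(np.asarray(A_hat, dtype=float), EPS, 1.0 - EPS)
    return float(-np.sum(A * np.log(A_hat) + (1.0 - A) * np.log(1.0 - A_hat)))

def kl_loss(mu, log_var):
    """KL divergence between N(mu, sigma^2) and the standard normal."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(-0.5 * np.sum(1.0 + log_var - np.exp(log_var) - mu**2))

def regression_loss(y_r, y_r_hat):
    """Absolute error for the radius-of-gyration target."""
    return float(abs(y_r - y_r_hat))

def classification_loss(y_t, y_t_hat):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    y_t = np.asarray(y_t, dtype=float)
    y_t_hat = np.clip(np.asarray(y_t_hat, dtype=float), EPS, 1.0)
    return float(-np.sum(y_t * np.log(y_t_hat)))

def vae_loss(A, A_hat, mu, log_var, y_r, y_r_hat, y_t, y_t_hat,
             lambda_reg=1.0, lambda_cls=1.0):
    """Composite loss: reconstruction + KL + weighted auxiliary terms."""
    return (reconstruction_loss(A, A_hat) + kl_loss(mu, log_var)
            + lambda_reg * regression_loss(y_r, y_r_hat)
            + lambda_cls * classification_loss(y_t, y_t_hat))
```

The KL term vanishes when μ = 0 and log σ² = 0, that is, when the encoded distribution coincides with the standard normal prior.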

Model training and hyperparameter tuning

All models are implemented using TensorFlow88. Models undergo training for 1000 epochs with the Adam optimizer89. A broad range of hyperparameters is explored, encompassing batch sizes {32, 64, 128}, learning rates {0.0001, 0.001, 0.01}, and regularization terms λReg {0.01, 0.1, 1, 10, 100} and λCls {0.01, 0.1, 1, 10, 100}. Criteria for model weight saving include overall validation loss, Evidence Lower Bound (ELBO), and reconstruction balanced accuracy. Across three encoder types, this approach results in 2025 distinct hyperparameter combinations. For each encoder type, the optimal hyperparameter configuration is selected based on a composite validation metric that combines several key performance indicators: reconstruction balanced accuracy (BACC), KL divergence, \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) regression R2 value, and the topology classification F1 score.
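One reading of the search space that reproduces the stated total of 2025 treats the three weight-saving criteria as an additional grid axis alongside the encoder type. The enumeration below follows that reading, which is an assumption; the string labels are placeholders for illustration.

```python
from itertools import product

batch_sizes = [32, 64, 128]
learning_rates = [0.0001, 0.001, 0.01]
lambda_reg_values = [0.01, 0.1, 1, 10, 100]
lambda_cls_values = [0.01, 0.1, 1, 10, 100]
# Assumed grid axis: criterion used to decide which model weights are saved.
save_criteria = ["validation_loss", "elbo", "reconstruction_bacc"]
encoders = ["TopoGNN", "GNN", "Topo"]

grid = list(product(encoders, batch_sizes, learning_rates,
                    lambda_reg_values, lambda_cls_values, save_criteria))
n_configurations = len(grid)
n_per_encoder = n_configurations // len(encoders)
```

Under this reading, each encoder type is evaluated over 675 configurations.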

These metrics are min-max normalized

$$\bar{{{{\bf{a}}}}}=\frac{{{{\bf{a}}}}-\min ({{{\bf{a}}}})}{\max ({{{\bf{a}}}})-\min ({{{\bf{a}}}})}$$

and consolidated into a four-dimensional vector, with each component oriented such that smaller values indicate better performance.


Subsequently, the optimal hyperparameter configuration is determined as that nearest to the origin (0, 0, 0, 0). Because hyperparameter optimization does not involve updating model weights, we base the selection on these metrics rather than on abstract loss functions; the metrics are more interpretable and relate directly to our objectives, such as improving reconstruction, prediction accuracy, and model generalization.
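The selection rule can be sketched as follows. Because the explicit form of the four-dimensional vector is not reproduced here, the sketch assumes that metrics for which larger values are better (BACC, R2, and F1) enter as one minus their normalized value, whereas the KL divergence enters directly; this orientation is an assumption chosen so that the origin corresponds to the best attainable performance. The function names are hypothetical.

```python
import numpy as np

def min_max(a):
    """Min-max normalize a 1D array of metric values across configurations."""
    a = np.asarray(a, dtype=float)
    span = a.max() - a.min()
    if span == 0.0:
        return np.zeros_like(a)
    return (a - a.min()) / span

def select_configuration(bacc, kl, r2, f1):
    """Index of the configuration whose metric vector is nearest the origin."""
    vectors = np.column_stack([
        1.0 - min_max(bacc),   # higher balanced accuracy is better (assumed flip)
        min_max(kl),           # lower KL divergence is better
        1.0 - min_max(r2),     # higher R^2 is better (assumed flip)
        1.0 - min_max(f1),     # higher F1 is better (assumed flip)
    ])
    return int(np.argmin(np.linalg.norm(vectors, axis=1)))

# Three hypothetical configurations with illustrative metric values.
best_index = select_configuration(
    bacc=[0.90, 0.80, 0.95],
    kl=[1.0, 0.5, 2.0],
    r2=[0.90, 0.70, 0.95],
    f1=[0.90, 0.80, 0.99],
)
```

In this toy example, the first configuration is selected because it balances all four metrics, whereas the third attains the best accuracy-type metrics at the cost of the largest KL divergence.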

Random polymer generation

To generate random polymer topologies, points are sampled from a predefined latent distribution, and the resultant latent vector, zgen, is transformed into an adjacency matrix, Agen. Each element in Agen indicates the connectivity between nodes. To avoid spurious and unphysical edge-formation or other errors during reconstruction, generated polymers then undergo a graph-cleansing step. This step principally removes isolated nodes and breaks small rings. Because this modifies the original adjacency matrix, we implement a validation protocol, which is fully described in Supplementary Discussion Section 3. Briefly, the cleansed graph and its recalculated topological descriptors are re-encoded to derive updated values for \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) and topology class. Cleansed graphs are considered valid if they satisfy three criteria. First, the difference in \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) values before and after cleansing is less than \(2{\sigma }^{2}\). Second, the topology classification is unchanged. Third, the mean squared difference between the pre- and post-cleansing latent vectors is less than 1. These criteria preserve the inherent properties of the generated polymers.
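The three validity criteria can be expressed as a single check, sketched below. The cleansing and re-encoding steps are not shown; the function takes their outputs as inputs, and its name and argument names are hypothetical. Values of \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) are assumed to be in reduced units of σ².

```python
import numpy as np

def is_valid_after_cleansing(rg2_before, rg2_after, topo_before, topo_after,
                             z_before, z_after, rg2_tol=2.0, latent_tol=1.0):
    """Apply the three validity criteria to a cleansed polymer graph."""
    z_before = np.asarray(z_before, dtype=float)
    z_after = np.asarray(z_after, dtype=float)
    rg2_ok = abs(rg2_after - rg2_before) < rg2_tol            # criterion 1
    topology_ok = topo_before == topo_after                   # criterion 2
    latent_ok = float(np.mean((z_after - z_before) ** 2)) < latent_tol  # criterion 3
    return bool(rg2_ok and topology_ok and latent_ok)

# Illustrative call with made-up values for an 8-dimensional latent vector.
example_valid = is_valid_after_cleansing(
    20.0, 21.0, "star", "star", np.zeros(8), np.full(8, 0.5))
```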

Polymer generation with target properties

To generate polymers with specific target properties, namely \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) and topology, “parent” polymers that exhibit these desired characteristics are first identified from the original dataset. The criterion for \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) is relaxed to allow a tolerance range of ± 2 around the target value. Points are then sampled near the latent-space vectors of the parent polymers by introducing Gaussian noise with a mean of 0 and a variance of 0.1. The \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) and topology of each generated candidate polymer are then predicted using the trained ML model. Candidates that do not exhibit the target topology or that deviate in \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) by more than \(2{\sigma }^{2}\) are discarded. Following this initial screening, polymer graphs undergo cleansing as previously described, except that \(\langle {R}_{{{{\rm{g}}}}}^{2}\rangle\) of candidates must more stringently remain within \(2{\sigma }^{2}\) of both the initial target and pre-cleansing values. Subsequently, non-distinct graphs, either duplicated from the original dataset or already present within the generated pool, are identified and removed through graph isomorphism checks. Additional details are in the Supplementary Discussion Section 4. The proportion of generated polymer graphs with target properties that undergo graph cleansing and pass all validation checks is detailed in Supplementary Table 1.
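The sampling and initial screening steps can be sketched as below. Note that a variance of 0.1 corresponds to a standard deviation of \(\sqrt{0.1}\approx 0.316\), which is the scale passed to the random number generator. The property predictions are assumed to come from the trained model and are supplied here as plain arguments; the function names and the fixed seed are hypothetical.

```python
import numpy as np

def sample_near_parent(z_parent, n_samples, variance=0.1, seed=0):
    """Perturb a parent latent vector with zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    z_parent = np.asarray(z_parent, dtype=float)
    noise = rng.normal(0.0, np.sqrt(variance), size=(n_samples, z_parent.size))
    return z_parent + noise

def passes_screen(pred_rg2, pred_topology, target_rg2, target_topology, tol=2.0):
    """Keep candidates with the target topology and R_g^2 within the tolerance."""
    return pred_topology == target_topology and abs(pred_rg2 - target_rg2) <= tol

# Eight-dimensional parent vector, matching the latent-space dimensionality.
candidates = sample_near_parent(np.zeros(8), n_samples=5)
```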

Latent-space visualization

The latent space is visualized using the Uniform Manifold Approximation and Projection (UMAP) algorithm90. The parameters follow those of prior work43, wherein the UMAP local neighborhood size is fixed at 200, the minimum embedding distance between points is set to 1, and the Euclidean distance metric is utilized in feature space analysis. This results in a mapping from \({{\mathbb{R}}}^{8}\) to \({{\mathbb{R}}}^{2}\): UMAP(z) = u, where z denotes a latent vector and u its corresponding low-dimensional representation.

Diversity evaluation

To calculate the diversity of a set of polymer topologies, each graph representation is transformed into its Laplacian spectrum, which comprises all eigenvalues of the graph Laplacian matrix. The Laplacian matrix is defined as the degree matrix minus the adjacency matrix of the graph. Diversity is quantified using the Vendi Score (VS)91, defined as:

$${{{\rm{VS}}}}({{{\bf{K}}}})=\exp \left(-\mathop{\sum }_{i=1}^{n}{\lambda }_{i}\log {\lambda }_{i}\right),$$

where λi represents the eigenvalues of the matrix K/n, with the convention \(0\log 0=0\). The similarity function in use is the dot product between normalized Laplacian spectra, denoted as \({{{\bf{X}}}}\in {{\mathbb{R}}}^{n\times 100}\), with 100 indicating the maximum eigenvalue count. For spectral vectors shorter than 100, zero-padding ensures length standardization. For reference, the minimum VS value is unity.
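A minimal sketch of this calculation is given below. It interprets "normalized" as scaling each zero-padded spectrum vector to unit Euclidean length, so that the similarity of each graph with itself is 1; this interpretation, the numerical cutoff for near-zero eigenvalues, and the function names are assumptions.

```python
import numpy as np

def laplacian_spectrum(A, length=100):
    """Eigenvalues of L = D - A in ascending order, zero-padded to a fixed length."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    eigenvalues = np.linalg.eigvalsh(L)
    spectrum = np.zeros(length)
    spectrum[:len(eigenvalues)] = eigenvalues
    return spectrum

def vendi_score(X):
    """Vendi Score for rows of X using a dot-product similarity kernel.

    Rows must be nonzero; each row is scaled to unit length (assumed normalization).
    """
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = X @ X.T
    lam = np.linalg.eigvalsh(K / len(X))
    lam = lam[lam > 1e-12]             # convention: 0 log 0 = 0
    return float(np.exp(-np.sum(lam * np.log(lam))))

# Spectrum of a three-bead linear chain.
path3 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
spectrum_path3 = laplacian_spectrum(path3)
```

With this kernel, a set of identical graphs gives VS = 1, and a set of n graphs with mutually orthogonal spectra gives VS = n.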