Inverse-design of nonlinear mechanical metamaterials via video denoising diffusion models

The accelerated inverse design of complex material properties - such as identifying a material with a given stress-strain response over a nonlinear deformation path - holds great potential for addressing challenges from soft robotics to biomedical implants and impact mitigation. While machine learning models have provided such inverse mappings, they are typically restricted to linear target properties such as stiffness. To tailor the nonlinear response, we here show that video diffusion generative models trained on full-field data of periodic stochastic cellular structures can successfully predict and tune their nonlinear deformation and stress response under compression in the large-strain regime, including buckling and contact. Unlike commonly encountered black-box models, our framework intrinsically provides an estimate of the expected deformation path, including the full-field internal stress distribution closely agreeing with finite element simulations. This work has thus the potential to simplify and accelerate the identification of materials with complex target performance.


Introduction
Creating materials with tailored properties has gained popularity across disciplines ever since additive manufacturing enabled the manipulation of multi-material and cellular architectures across scales. Instead of choosing from the limited catalog of natural materials, engineers and designers now have access to the drastically expanded design and property spaces of so-called metamaterials, which have been designed, among other goals, to achieve mechanical properties previously not attainable. Realizations of metamaterials take various forms, most commonly involving the periodic arrangement of small-scale structural building blocks.[1][2][3] The physical mechanisms governing the mechanical behavior of such architected materials are mostly well understood, and various numerical frameworks such as the finite element (FE) method provide accurate structure-to-property relations, predicting the effective material properties based on an underlying small-scale architecture. By contrast, the inverse problem of identifying possible small-scale designs yielding a desired property has remained a challenge. Methods to address the latter include topology optimization (TO) [4][5][6] and, more recently, data-driven algorithms. Most of these approaches have, however, been restricted to linear material properties such as the effective elastic stiffness in three dimensions (3D) [7,8] or Poisson's ratio.[9] Extensions to nonlinearity (e.g., via multi-material configurations) have been presented recently [10] but involve computationally expensive simulations.
While tuning a material's stiffness is sufficient for applications involving small deformation (such as patient-specific bone implants matching the native bone properties, or vibration insulation by attenuating linear waves), controlling the nonlinear response of soft metamaterials over a finite deformation path can unlock advanced functionality for emerging fields such as soft robotics [11], tissue engineering [12], and impact energy absorption.[13] Metamaterials with tailored stress-strain responses can, e.g., mimic the nonlinear response of human fingers [14], enable actuation of soft robots via "snap-through instabilities" [15], or serve as biomimetic scaffolds assisting in artery restoration.[16] Unfortunately, the nonlinear setting drastically adds to the complexity of the (inverse) map from property to structure. Extensions of TO to nonlinear properties exist [17,18] but remain challenging due to a strong dependence on the initial guess and discretization [19], the lack of physical effects such as contact [20], and degrading solver stability when considering non-trivial mechanisms such as post-buckling.[21] Most importantly, a single optimization study may require hours of runtime, which is a prime reason why recent studies have focused on rather simple design spaces and optimization objectives.[22,23]
Over the past decade, the rise of deep-learning models with their unparalleled ability to identify highly nonlinear maps has presented a potential alternative. When applied to nonlinear material property prediction, deep learning has served as an efficient forward approximation (replacing costly FE simulations) in combination with genetic algorithms to iteratively identify structures with a target nonlinear response, including, e.g., shell-like metamaterials and quadrilateral structures.[24,25] However, the considered design spaces have remained limited, and predictions may lack physical intuition and rely on costly FE simulations to validate up to a hundred generated designs and to select the one closest to the desired stress-strain response.[25]
These challenges resemble those addressed recently in the image generation community by (video) diffusion models. Diffusion models [26] have gained attention due to their ability to generate seemingly photo-realistic images based on text descriptors, a famous representative being DALL-E 2 [27], and have recently been extended to generate short video sequences with remarkable results.[28] Compared to variational autoencoders [29] or generative adversarial networks [30], diffusion models offer improved sample quality and simpler and more stable training protocols by gradually removing noise from a sample drawn from a prior distribution (typically unit Gaussian).[31] The shift from linear to nonlinear material properties can, at a high level, be compared to going from image to video generation. In both cases, a new data dimension must be learned, which requires some notion of consistency, whether in a temporal sense (consecutive images in a video must maintain temporal consistency) or in a mechanical sense (stresses in consecutive deformation steps must ensure mechanical consistency). Analogous to a text descriptor prompting an image sequence, the nonlinear target response here serves as input to predict a sequence of mechanically deformed microstructural configurations along the deformation path, ultimately resulting in the effective stress-strain response. This requires the definition of an efficient design/property space to be considered as training data for our generative model, whose key concepts and the considered model architecture are summarized in the following.

Generation of metamaterials with diverse properties
As our diffusion framework operates in a data-driven setting, we require a large collection of paired mechanical designs and their corresponding nonlinear stress-strain responses. The options for potential design spaces are virtually unlimited, ranging from truss descriptors [7] over shells [2] to composite structures.[32] We here consider a pixel-based design space parameterization with minimal constraints (aside from a periodic structure) to fully harness the generative power of diffusion models. While two-material composites could be generated with randomly drawn binary pixels and span a tremendous design space [32], the subset of structures with a non-trivial stress-strain response is comparably small. We therefore consider cellular structures (each pixel representing solid or void) as our design space to enable interesting mechanical behavior such as buckling (an instability that quickly transitions between distinct equilibrium configurations) and contact (arising under compressive loads and producing a sudden stiffness increase), overall resulting in a rich and possibly non-monotonic stress-strain curve. While modeling those effects by the FE method is challenging, inversely designing such structures is even more difficult due to the sensitivity of, e.g., the buckling response to small changes in the design. At the same time, incorporating such effects gives access to a highly diverse range of achievable stress-strain responses. To keep the problem tractable yet without loss of generality, we restrict our study to two dimensions (2D) and a periodic structure based on a square unit cell (UC).
The generation of the dataset used for training is performed as follows (see Fig. 1). To generate a random design with a certain level of structural features, we sample from a 2D Gaussian random field on a square domain and apply a binary threshold. Values above a specific threshold are considered material, those below are void. We ensure that opposite boundaries of the domain are connected with each other (and repeat the sampling until this condition is met) and mirror the pattern sequentially along both edges (see Fig. 1) to obtain mechanically intricate, periodic structures. Despite its simplicity, this stochastic approach produces a diverse dataset of designs with a broad range of stress-strain responses. We further induce different levels of relative density (or fill fraction) by randomly shifting the threshold within a specified range. Higher threshold values promote low-density structures prone to buckling, which is important for the aforementioned reasons.
The stress-strain response of each design is obtained from FE simulations. As a technologically relevant load case, we place all samples between two rigid plates and apply a quasi-static compressive strain of up to ε = 20% in the vertical direction. Uniaxial compression is a frequent load characteristic of, e.g., impact applications [24], the compression of shoe soles [33], or so-called passive compliance in soft robotics (e.g., allowing a soft gripper to adapt its shape to the object being grabbed [34]). By applying periodic boundary conditions along the horizontal direction, we simulate an infinite periodic layer of the chosen design, as found in sandwich-type configurations. Within the cellular UC, we account for frictional contact and use an experimentally calibrated elastoplastic material model [35] (representative of a thermoplastic resin) to ensure realistic responses. Simulation details are provided in Methods.
Using this setup, we generate 53,007 pairs of unique designs and the corresponding stress-strain responses. We also collect the full-field stress distribution in the vertical direction, σ_22, as well as the displacement components u_1 and u_2 (all in the Lagrangian frame), since this data contains valuable information about the underlying physics, as also observed by Nie et al. [36] The overall effective stress response can be extracted either from the nodal reaction forces or directly from the full-field data, since in the considered quasi-static setting internal forces must be in equilibrium for any free cut of the UC (e.g., for any pixel row, see Supplementary Information Section S5.1). We evaluate all fields on a 96 × 96 pixel grid together with the overall (average) vertical stress at eleven equidistant strain increments between 0 and 20% (with the first increment evaluated at 0.2% rather than 0% strain, see Methods). This strikes a reasonable balance between accuracy and computational feasibility and provides the training data for the generative model.
Figure 1 (panel b). To obtain the stress-strain response, we place the UC between two rigid plates with periodic boundary conditions in the horizontal direction and apply a compressive strain of up to 20%. The corresponding stress and displacement fields within the UC are computed by FE simulations, and the overall effective stress-strain response σ_eff is extracted from the nodal reaction forces, though it can equally be obtained from the full-field data. A representative selection of responses of the generated designs is plotted in gray.

Video denoising diffusion model
Diffusion models are trained to reverse a stochastic forward process that gradually converts a data point (i.e., an image) drawn from the underlying data distribution, x_0 ∼ q(x), to a prior distribution (typically a standard Gaussian) in T steps.[26,37] This can formally be understood as a fixed Markov chain with a given variance schedule {β_t ∈ (0, 1)}, t = 1, ..., T:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right).$$

With α_t = 1 − β_t and ᾱ_t = α_1 α_2 ⋯ α_t, this allows us to sample x_t at any timestep t via

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

We approximate the reverse process q(x_{t−1}|x_t) by a neural network p_θ(x_{t−1}|x_t) parameterized by θ. To generate new samples x* ∼ q(x), we run the reverse Markov chain, starting from x_T ∼ N(0, I), to arrive at

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right). \tag{2}$$

Such models are typically trained to maximize the variational lower bound of the log-likelihood, which can be computed in closed form when conditioned on x_0. As observed by Ho et al. [37], µ_θ can be decoupled into two terms relating to x_t and ϵ_θ, which allows the loss to be simplified and re-parameterized in terms of the Gaussian noise as

$$L(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\, \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2 \right].$$

To condition the model on some additional input c, we consider classifier-free guidance [38], which does not require an additional classifier p_θ(c|x_t). We steer the reverse diffusion process by replacing ϵ_θ by a linear combination of the conditional and unconditional noise estimates, i.e.,

$$\tilde{\epsilon}_\theta(x_t, t, c) = (1 + w)\, \epsilon_\theta(x_t, t, c) - w\, \epsilon_\theta(x_t, t, \varnothing),$$

where w ≥ 0 is the guidance weight, which trades off sample quality against the strength of the conditioning, and ∅ denotes a fixed random embedding to represent the lack of conditioning. Details are provided in Supplementary Information Section S2.

Figure 2. Denoising diffusion model architecture. The denoising diffusion model is based on the 3D U-Net video architecture [28], which iteratively adds information to a Gaussian prior. To include a temporal dimension, each spatial convolution and attention layer is followed by temporal attention computed over the eleven strain steps. We condition the model by transforming the stress-strain response to a token embedding, which is added via cross-attention into both spatial and temporal attention layers.
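The closed-form forward sampling and the guided noise estimate described above can be illustrated with a minimal NumPy sketch. The linear variance schedule, its end points, and T = 1000 are assumptions made purely for illustration (the schedule used in this work is given in the Supplementary Information), and the function names are hypothetical.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Assumed linear variance schedule (illustrative values only)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is a 0-based step index."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond

betas = linear_beta_schedule(T=1000)
alpha_bar = np.cumprod(1.0 - betas)       # cumulative product of alpha_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal((11, 96, 96))    # stand-in for eleven strain frames
x_t, eps = q_sample(x0, t=999, alpha_bar=alpha_bar, rng=rng)
```

For w = 0 the guided estimate reduces to the purely conditional one; larger w extrapolates away from the unconditional estimate.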
Diffusion models map noisy input data to less noisy data of the same shape, making symmetric U-Net architectures [39] a common choice for ϵ_θ. As our primary interest is in mapping from a target stress-strain curve to a design, training the model on simple images of UCs conditioned on the corresponding stress-strain curve is a straightforward approach and has been explored in recent work.[40] In our investigations, we observed similar success of such approaches for generating structures with a relatively simple stress-strain response (such as the ones shown in [40]). However, the same setup proved ineffective in modeling more challenging responses such as those induced by contact and buckling. We attribute this limitation to the highly indirect mapping the model must learn to predict the full deformation history and the corresponding internal stress distributions, which in turn dictate the overall stress-strain response. To facilitate the training, to improve the sample efficiency, and to obtain a full-field prediction of the expected deformation path and internal stresses for physical validation, we train the model not only on the UC design but also on the full-field data of the vertical stresses σ_22 for each strain step, as described in Section 2.1. We observed the best results when using a Lagrangian frame instead of an Eulerian one (i.e., evaluating all evolving fields on the undeformed initial configuration). We additionally supply the horizontal and vertical displacements u_1 and u_2, which allows us to optionally convert the data to the Eulerian frame and provides information about the deformation path to the model.
Instead of simply concatenating this data along the image channels of the U-Net, we distinguish between the two fundamentally different causal relations of the data (space and applied strain), similar to recently proposed video generative models.[28] Here, variants of the 2D (space-only) U-Net architecture are extended by a temporal dimension, which is effectively treated as a batch axis and thus leaves the base architecture unaffected. The extension is a temporal attention [41] block (taking the pixels as batch axis and computing self-attention over the applied strain steps) placed after the spatial attention and convolution (taking the strain steps as batch axis and computing convolutions and self-attention over the pixels) to learn physical consistency across different strain steps.
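The distinction between the two batch axes can be made concrete with a small NumPy sketch. It uses a single attention head without learned projections, which is a deliberate simplification and not the architecture of this work; the tensor layout (batch, strain steps, height, width, channels) is likewise an assumption for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention without learned weights; tokens: (batch, n, d)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def spatial_attention(x):
    """x: (B, T, H, W, C). Strain steps are folded into the batch; attend over pixels."""
    B, T, H, W, C = x.shape
    out = self_attention(x.reshape(B * T, H * W, C))
    return out.reshape(B, T, H, W, C)

def temporal_attention(x):
    """Pixels are folded into the batch; attend over the T strain steps."""
    B, T, H, W, C = x.shape
    tokens = x.transpose(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
    out = self_attention(tokens)
    return out.reshape(B, H, W, T, C).transpose(0, 3, 1, 2, 4)
```

In this factorized form, spatial attention never mixes information between strain steps and temporal attention never mixes information between pixels; only their alternation couples the full space-strain volume.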
This architecture allows for mechanically motivated conditioning of the model on a given nonlinear stress-strain response. The conditioned effective stress at the eleven strain steps is directly associated with the corresponding full-field response, which we leverage in our model architecture (unlike in video generation, in which words, as conditioning, do not directly correspond to specific image frames). To do so, we convert each stress value to a high-dimensional token embedding by a (learnable) linear layer and fuse it with the pixel representation via cross-attention [41] in the spatial attention module of the corresponding strain step. In the subsequent temporal attention layer across all strain steps, we add a relative position encoding [42] to both the strain steps and token embeddings, so that the model receives information on the strain step order, and we apply "pseudo-temporal" cross-attention over the strain steps. Lastly, we augment the conditioning by adding a latent representation of the tokens to the diffusion time embedding (required as input to the model to indicate the diffusion time step). For further details see Methods, Supplementary Information Sections S3 and S4, and Code Availability.

Full-field predictions for generated metamaterials
A key advantage of our setup over other deep-learning frameworks is its capability to provide physical insight into the deformation mechanisms of the generated metamaterial and the associated stress response. By reversing the diffusion process conditioned on the desired stress-strain curve, we obtain not only a potential design but also a predicted full-field σ_22 distribution subjected to the applied strain throughout the deformation path. This enables us to evaluate the proposed deformation mechanism for physical validity and to extract the predicted stress-strain response by row-wise pixel averaging of the internal stress σ_22. In contrast to alternative approaches [40], our framework unifies inverse design and forward prediction in a single model without the need for an ad-hoc secondary model to evaluate the performance of the predicted designs. This also allows for the adoption of further design criteria (e.g., enforcing a maximum local stress to prevent failure).
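The row-wise averaging can be sketched as follows. The array layout (strain steps, pixel rows, pixel columns) and the function name are assumptions for illustration; void pixels are taken to carry zero stress, so that each row average represents the net vertical force per unit undeformed width.

```python
import numpy as np

def effective_stress_from_field(sigma22):
    """sigma22: (n_steps, n_rows, n_cols) vertical stress in the Lagrangian frame.

    Returns the effective stress per strain step (mean over all row averages)
    and the spread of the row averages, which should be small if the predicted
    field satisfies vertical force equilibrium across every horizontal cut.
    """
    row_avg = sigma22.mean(axis=2)            # average along each pixel row
    return row_avg.mean(axis=1), row_avg.std(axis=1)

# Synthetic check: a field that is uniform within each strain step.
field = np.ones((11, 96, 96)) * np.arange(11)[:, None, None]
stress, spread = effective_stress_from_field(field)
```

A large spread at a given strain step would flag a predicted field that violates equilibrium and hence should not be trusted.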
We demonstrate the ability of the model to predict designs matching a given target stress-strain response by considering 100 responses of randomly generated designs (unseen during training). We plot four predictions in Fig. S6 of Supplementary Information and compute the average Normalized Root Mean Square Error (NRMSE, see Methods) of the FE-reconstructed response vs. the target response as ϵ = 6.98%. This is close to the mismatch of ϵ = 2.74% between the predicted and target responses, which underlines the model's ability to propose designs and concurrently estimate their mechanical behavior. The agreement between the predicted and true (i.e., high-fidelity FE) responses suggests an accurate estimate of the stress distribution, confirmed both qualitatively in Fig. S6 of Supplementary Information and quantitatively with ϵ_L2 = 14.39%, averaged over all samples and strain steps. (Supplementary Information Section S6.1 summarizes a similar study on unconditionally sampled designs.)
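For orientation, an NRMSE between two stress-strain curves sampled at the same strain steps may be computed as sketched below. The normalization by the range of the target response is an assumption made here; the definition used in this work is the one given in Methods.

```python
import math

def nrmse(predicted, target):
    """Root mean square error normalized by the target range (assumed normalization)."""
    n = len(target)
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / n)
    return rmse / (max(target) - min(target))

# A constant offset of 1 on a target spanning a range of 2 gives an NRMSE of 0.5.
error = nrmse([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
```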

Inverse design of unseen stress-strain responses
The above results provide only a limited measure of the model's generalization performance: although the conditioned stress-strain responses are based on designs not seen during training, they are, on average, well-represented by samples in the training data. To assess the model's generalization capability, we next examine its performance on responses not closely represented in the training data. We create four benchmark examples of diverse stress-strain responses that cover a wide range of material responses of engineering interest and include the non-trivial mechanisms of contact and buckling. For each case, we leverage the probabilistic nature of the model, generate ten samples conditioned on the target response, and plot the best match. A guidance weight of w = 5 was observed to enhance the match between generated design and target response without sacrificing the accuracy of the generated full-field predictions.
* The relative L_2-error of the stress fields (ϵ_L2) is numerically inflated due to the small magnitude of the stress field and is hence not truly indicative (but included for completeness).
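Selecting the best of several generated samples amounts to a simple arg-min over an error measure, as sketched below. The helper names are hypothetical, and the plain RMSE stands in for whichever error measure is used to rank the candidates.

```python
import math

def rmse(a, b):
    """Plain root mean square error between two equally sampled curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(b))

def best_match(candidates, target, error_fn=rmse):
    """Return (index, error) of the candidate response closest to the target."""
    errors = [error_fn(c, target) for c in candidates]
    best = min(range(len(candidates)), key=errors.__getitem__)
    return best, errors[best]

# Three hypothetical candidate responses for one target curve.
index, error = best_match([[0, 0, 0], [1, 2, 3], [1, 2, 4]], [1, 2, 3])
```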
First, we generate a design with high stiffness, strong (nonlinear) hardening, and large deformability, as used, e.g., in impact applications. We condition the model with an effective stress response 20% above the stiffest sample of the training set. As illustrated in Fig. 3a, the model generates a structure with a large fill fraction, closely matching the ground truth in both the FE-reconstructed response (with ϵ = 1.5%, compared to ϵ = 20% for the best match in the training data) and the underlying stress distribution (ranging from ϵ_L2 = 18.2% to ϵ_L2 = 5.8%). Analogously, compliant low-density designs can be generated by choosing a target stress-strain response well below the most compliant design in the training data (see Supplementary Information Section S6.4), which is matched with ϵ = 4.3%.
Second, we consider a more complex target response exhibiting an abrupt stiffness increase midway through the loading path (at 10% applied strain, see Fig. 3b), which necessitates a change in deformation mode. Such stiffness changes can be leveraged, e.g., in soft robotic grippers.[43] The design proposed by the model indeed closely matches the target response (ϵ = 1.4%) and significantly outperforms the closest match in the training data (ϵ = 10.1%). Moreover, we observe that the generated design contains a fillet in its interior, which establishes contact at 10% strain in both forward prediction and FE simulation, leading to the desired stiffness increase. This demonstrates that the model can introduce new contact mechanisms to match unseen responses, while, importantly, contact has so far been outside the scope of, e.g., computational TO.
Third, we consider the more exotic target of a highly compliant response until 15% strain, followed by a drastic stiffness increase. (Such behavior can be caused by contact within the UC but is also characteristic of, e.g., structural transformations in metals.[44]) While, as expected, the generated design does not match its target as closely as in the previous examples (ϵ = 14.1%), it considerably outperforms the best match in the training set (ϵ = 39.6%). The initial compliance and sudden stiffness increase are realized through a delicate interplay of an almost purely rotational, auxetic response of an inner segment of the UC and the subsequent emergence of contact at the critical strain level where hardening sets in (see Fig. 3c).
Fourth, we consider a response with significant softening, which is utilized, e.g., in snapping and release mechanisms. As illustrated in Fig. 3d, the model's design again outperforms the best match in the training data (ϵ = 2.4% vs. ϵ = 8.3%). The response is accommodated by a buckling mechanism. Interestingly, the relative L_2-error of the predicted stress fields significantly increases in the post-buckling regime. This, however, stems from the symmetric buckling mode of the design and the fact that the FE simulation buckles to the right while the model predicts buckling to the left. (Unlike contact, buckling is highly sensitive to the design: when a vertical column is compressed in 2D, it can buckle to the left or to the right, depending on the smallest imperfections.) In this case, we cannot reasonably expect the model to match the FE stress fields. Instead, the prediction demonstrates the model's temporal consistency, as it logically completes the deformation trajectory: once the structure has buckled to one side, the post-buckling response follows that direction. (An example of a generated design with a predicted deformation mode matching the FE simulation is shown in Supplementary Information Section S6.5.)

Discussion
Soft robots and biomimetic structures, among others, require materials with precise nonlinear mechanical functionality, a challenge for conventional optimization techniques due to the complex inherent deformation mechanics including buckling and contact. Gradient-based optimizers may become numerically unstable due to the nonlinear and non-convex objective function. This issue worsens when considering contact, which leads to abrupt, nonsmooth kinks in the stress response. Our model, inspired by generative video modeling, is particularly suited to this nonlinear setting. It accurately captures the non-trivial mechanics at play and unifies an efficient surrogate forward model with the ability to generate unseen metamaterial designs exhibiting complex nonlinear responses, which must leverage buckling and contact. This is accomplished by training the model on the complete deformation trajectory rather than solely on the underlying designs (akin to extending image to video generative models); training on designs alone may suffice for linear conditioning but is inadequate for complex nonlinear situations (see the ablation study in Supplementary Information Section S7).
The complex target responses may be associated with multiple designs, posing a challenge for direct optimization. Addressing this one-to-many mapping is a recurring issue in inverse problems across disciplines, for which the probabilistic nature inherent in the diffusion architecture is ideally suited. By repeatedly generating samples for identical target responses, our model proposes a variety of designs (which may be checked for secondary objectives such as manufacturability). Our work further demonstrates the efficacy of video diffusion models when data of different modalities, such as the effective stress-strain response and the full-field internal stress distribution, must be synthesized and optimized, a task where conventional optimization techniques may fail.
The presented framework admits extension to related fields such as fluid dynamics, serving both as a surrogate simulator and nonlinear optimizer. The efficiency of the denoising diffusion process may be increased by operating in a latent space, which may include additional data modalities in the conditioning such as the underlying base material. Moreover, alternative design spaces such as trusses [7] provide a more compact design parameterization for 3D structures and low fill fractions. As trusses can naturally be represented by graphs, graph diffusion models, mainly used in molecule design, can serve as a viable model architecture.

Methods
We here provide details of the data generation procedure, the methods employed for creating the metamaterials under consideration, and the FE setup to evaluate the nonlinear mechanical response of UCs. We further present the model architecture as well as the training and sampling protocol. Additional explanations can be found in Supplementary Information.

Design generation
We generate a random mechanical metamaterial by sampling a 2D Gaussian Random Field (GRF) on a square domain based on the algorithm proposed by Lang & Potthoff.[45] To do so, we sample complex Gaussian noise for a centered (even) N × N grid of Fourier coordinates and introduce spatial correlation by a power-law scaling in the Fourier coordinates with exponent parameter α, where we set α = 3 to ensure sufficient smoothness for manufacturable structures. This representation is converted to the corresponding real N × N pixel set X by considering the standardized real part of the inverse Discrete Fourier Transform (DFT). Next, we convert it to binary values (1 representing material, 0 representing void) by considering a threshold t sampled as t ∼ U(0, t_max) with t_max = 3/5, which was chosen to increase the variance (in terms of sparsity) of the sampled structures. Lastly, we check for the connectedness of the four boundaries of the square grid, which we consider given if there exists a single material domain that covers at least 10% of the pixels (rounded down) of each side. This avoids structures with extremely sparse connectivity (and hence questionable manufacturability). We repeat the process until a valid structure has been found. The metamaterial is created by mirroring the found structure sequentially along the vertical and horizontal boundaries to ensure periodicity. While we only focus on periodicity in the horizontal direction in the examples presented in this work, the generated structures can also be tessellated along the vertical direction to produce 2D tessellations. Note that the GRFs are periodic by construction, so they can also be tessellated without mirroring. However, we found that mirroring generally produces more diverse stress-strain responses and further simplifies the mesh generation for periodic boundary conditions, which is why we chose this procedure. The pseudocode of this process is given in Algorithm 1 in Supplementary Information Section S1.
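The procedure can be sketched in NumPy as follows. This is an illustrative approximation, not the reference implementation (see Algorithm 1 in the Supplementary Information): the amplitude scaling |k|^(−α), the removal of the zero-frequency mode, the 4-connectivity used for the material domain, the quadrant size n = 48, and all function names are assumptions.

```python
from collections import deque
import numpy as np

def sample_grf(n, alpha, rng):
    """Standardized real GRF from power-law-filtered complex Gaussian noise."""
    k = np.fft.fftfreq(n) * n
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k_norm = np.sqrt(kx**2 + ky**2)
    k_norm[0, 0] = 1.0                       # avoid division by zero
    amplitude = k_norm ** (-alpha)           # assumed form of the power law
    amplitude[0, 0] = 0.0                    # drop the mean (zero-frequency) mode
    noise = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    field = np.fft.ifft2(noise * amplitude).real
    return (field - field.mean()) / field.std()

def label_components(mask):
    """Label 4-connected material regions by breadth-first flood fill."""
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue
        count += 1
        labels[seed] = count
        queue = deque([seed])
        while queue:
            i, j = queue.popleft()
            for a, b in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                inside = 0 <= a < mask.shape[0] and 0 <= b < mask.shape[1]
                if inside and mask[a, b] and not labels[a, b]:
                    labels[a, b] = count
                    queue.append((a, b))
    return labels, count

def boundaries_connected(mask, fraction=0.1):
    """True if one material domain covers >= fraction of the pixels of every side."""
    need = max(1, int(fraction * mask.shape[0]))   # rounded down, at least one pixel
    labels, count = label_components(mask)
    sides = (labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1])
    return any(all((s == c).sum() >= need for s in sides) for c in range(1, count + 1))

def mirror_to_unit_cell(mask):
    """Mirror the quadrant along both edges to obtain a periodic unit cell."""
    top = np.concatenate([mask, mask[:, ::-1]], axis=1)
    return np.concatenate([top, top[::-1, :]], axis=0).astype(np.uint8)

def generate_design(n=48, alpha=3.0, t_max=0.6, rng=None, max_tries=10000):
    """Resample field and threshold until the connectivity criterion is met."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_tries):
        mask = sample_grf(n, alpha, rng) > rng.uniform(0.0, t_max)
        if boundaries_connected(mask):
            return mirror_to_unit_cell(mask)
    raise RuntimeError("no valid design found within max_tries")
```

Calling `generate_design(rng=np.random.default_rng(0))` would return a binary array of twice the quadrant size in each direction, i.e., 96 × 96 for the assumed n = 48.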

FE simulations
To evaluate the stress-strain responses of the generated structures, we use Abaqus CAE 2020. All of the following steps are implemented via User Subroutines. Note that we apply a smoothing of the boundary of the generated pixel structures to bypass issues with the meshing, as presented in Supplementary Information Section S1.2. We generate a mesh compatible with periodic boundary conditions (i.e., featuring matching nodes on opposite boundaries) and select 3-node linear (CPE3) and 4-node bilinear elements with reduced integration and hourglass control (CPE4R) using default settings. The mesh was refined until sufficient convergence in the stress distributions and overall stress-strain responses was observed. We consider plane-strain conditions to represent the realistic scenario of an extruded structure in the out-of-plane dimension (thus avoiding challenges with out-of-plane buckling under compression).
The metamaterial is virtually positioned between two rigid horizontal platens, to which we attach the nodes on the top and bottom boundary. We assume lubricated surfaces, so that nodes may slip horizontally relative to the horizontal platens. Within the UC, we consider frictional self-contact with a friction coefficient k_fric = 0.4. Due to the presence of large deformations including buckling and contact, an implicit dynamic solver is chosen for numerical stability. We ensure a quasi-static simulation by setting the mass density to ρ = 10^−8, applying displacements with a smooth amplitude from time t = 0 to t = 1, and confirming that the kinetic energy (ALLKE) does not exceed 1% of the internal energy (ALLIE) for all strain steps. We furthermore verify that artificial energy measures (ALLAE and ALLSD), introduced for stability reasons, do not individually exceed 1% of the internal energy across all strain steps. In general, we use unit-less values for all lengths in simulations (due to size invariance), while stresses are presented in units of MPa.
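The energy criteria above translate into a simple post-processing check on the energy histories, sketched here under the assumption that the four Abaqus history outputs have already been exported as equally long sequences (one value per strain step); the function name is hypothetical.

```python
def is_quasi_static(allke, allie, allae, allsd, tol=0.01):
    """Check that kinetic and artificial energies each stay within tol * internal energy."""
    for ke, ie, ae, sd in zip(allke, allie, allae, allsd):
        limit = tol * ie
        if ke > limit or ae > limit or sd > limit:
            return False
    return True

# Hypothetical energy histories for two strain steps.
accepted = is_quasi_static(
    allke=[0.0, 0.001], allie=[0.0, 1.0], allae=[0.0, 0.005], allsd=[0.0, 0.002]
)
```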
We record the horizontal and vertical displacement components (u_1 and u_2, respectively), as well as the vertical stress component σ_22, on a 96 × 96 pixel grid at eleven evenly spaced strain increments from the undeformed configuration to the total applied vertical strain in the Lagrangian (undeformed reference) frame. Note that instead of taking the initial step at 0% strain, we consider all fields at 0.2% strain, as this provides information on the small-strain response of the structure instead of trivial all-zero values. To compute the effective, overall stress response (which is the net vertical force per initial (undeformed) area on the top or bottom surfaces) at any strain level, we record the vertical reaction forces (RF2) of those nodes in contact with the upper rigid surface. Details on the considered base material can be found in Supplementary Information Section S1.3. All simulations were carried out on the Euler high-performance cluster of ETH Zurich.
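The strain sampling and the reaction-force-based effective stress can be written compactly as below. The function names, the unit out-of-plane thickness, and the omission of a sign convention for compressive forces are assumptions for illustration.

```python
def strain_steps(max_strain=0.20, n_steps=11, first_step=0.002):
    """Evenly spaced strain levels, with the trivial 0% step replaced by 0.2%."""
    steps = [max_strain * i / (n_steps - 1) for i in range(n_steps)]
    steps[0] = first_step
    return steps

def effective_stress_from_rf2(reaction_forces, initial_width, thickness=1.0):
    """Net vertical reaction force per initial (undeformed) area of the loaded surface."""
    return sum(reaction_forces) / (initial_width * thickness)

steps = strain_steps()
sigma_eff = effective_stress_from_rf2([1.0, 2.0, 3.0], initial_width=2.0)
```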

Spatial 2D U-Net architecture
We refer to the Code Availability Section for full technical details and below provide a high-level summary of the denoising diffusion model architecture.

The PyTorch framework [46] was used throughout our implementation. Diffusion models iteratively remove noise from data, typically images. Consequently, their input and output dimensions must be equal, making U-Net architectures a prevalent choice. Our model builds upon the work of Ho et al. [28] and its implementation provided by Phil Wang [47], which, in turn, are based on derivations of the original 2D U-Net architecture [39]. This encoder-decoder architecture incrementally reduces spatial information while increasing latent feature information before reversing this operation by expanding the latent representation back to the spatial domain. In our work, each down- and upsampling pass comprises two ResNet [48] blocks consisting of a series of convolutional layers and SiLU activation functions [49], spatial linear self-attention [50] (to reduce computational complexity) across the (latent) pixel representation, and a down- or upsampling convolutional layer. The middle block between the encoder and decoder equally consists of two ResNet blocks with a (full) spatial self-attention layer in between. We use four feature map resolutions (96 × 96 → 12 × 12) with expanding latent dimensions (64 → 512). Each attention block consists of 8 attention heads, each with a dimension of 32. We summarize the most relevant hyperparameters in Table S2 in Supplementary Information Section S3.
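The stated resolutions and latent dimensions are consistent with halving the resolution and doubling the channel count at each of the four levels. The short sketch below spells out this arithmetic; the factor-of-two schedule is an assumption inferred from the end points, and the exact multipliers are listed in Table S2.

```python
def unet_levels(base_resolution=96, base_channels=64, n_levels=4):
    """(resolution, channels) per level, assuming a factor of two between levels."""
    return [(base_resolution // 2**i, base_channels * 2**i) for i in range(n_levels)]

levels = unet_levels()
attention_inner_dim = 8 * 32   # 8 heads with 32 dimensions each
```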

Extension to temporal 3D U-Net architecture
We extend the 2D U-Net by incorporating a temporal dimension [28], where we understand the "temporal" dimension as the applied strain steps. In all building blocks described above, the temporal dimension is treated as a batch dimension and therefore does not affect the setup. The key difference is that we insert a temporal self-attention layer before the encoder-decoder architecture and, additionally, after every spatial attention layer. This layer treats the spatial dimensions as batch axes and performs attention over the eleven strain steps. We use relative positional encoding [42] to pass information on the order of the strain steps to the model.
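The reshaping behind "spatial dimensions as batch axes" can be sketched as follows. This is a deliberately simplified single-head version in NumPy with random projection matrices and without the relative positional encoding; the tensor layout (batch, channels, steps, height, width) and all names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x, w_q, w_k, w_v):
    """x: (batch, channels, steps, height, width); attention runs over `steps` only."""
    b, c, t, h, w = x.shape
    # Fold the spatial axes into the batch axis: (b*h*w, t, c)
    tokens = x.transpose(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (b*h*w, t, t)
    out = softmax(scores) @ v                                  # (b*h*w, t, c)
    # Restore the original layout (batch, channels, steps, height, width)
    return out.reshape(b, h, w, t, c).transpose(0, 4, 3, 1, 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 11, 6, 6))                  # eleven strain steps
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
y = temporal_self_attention(x, w_q, w_k, w_v)
```

Because every pixel is processed as its own sequence of eleven tokens, a change at one pixel cannot influence the output at another pixel within this layer; spatial mixing is left to the spatial attention and convolutional layers.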

Conditioning on nonlinear stress-strain responses
To condition the model on the stress-strain response, we convert all eleven scalar stress values at the corresponding strain steps to an embedding via a (learnable) linear layer. Note that we omit the corresponding strain values since we keep these fixed in this work, so that they provide no further information. A future extension could explore adaptive stepping techniques, such as sampling more densely at strain steps with significant deformation changes. These token embeddings are concatenated to the spatial attention tokens at the corresponding strain step for cross-attention, while we concatenate all eleven token embeddings with a relative positional encoding to the temporal attention tokens in the temporal attention layer. Note that for cross-attention we derive the queries from the pixel embedding but the keys and values from the conditioning embedding. To further enhance the conditioning, we average all eleven token embeddings over the strain steps and convert the result to a latent representation via a two-layer MLP with SiLU activation function [49], which transforms this representation to the same dimension as the latent embedding of the diffusion time step $t$. The latter is necessary for the model to determine the current step of the denoising process. We add both embeddings and incorporate them into the ResNet blocks.
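The global conditioning path described above (per-step tokens, averaging over strain steps, MLP, addition to the time embedding) might look roughly like the following NumPy sketch. All dimensions, the random weights, and the placement of the SiLU between the two MLP layers are assumptions; the cross-attention path and positional encodings are omitted.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))   # z * sigmoid(z)

rng = np.random.default_rng(0)
n_steps, token_dim, time_dim = 11, 64, 256   # hypothetical sizes

# Stand-in for the (learnable) linear layer mapping each scalar stress to a token
w_tok, b_tok = rng.normal(size=(1, token_dim)), np.zeros(token_dim)
# Stand-in for the two-layer MLP mapping to the size of the time embedding
w1, b1 = rng.normal(size=(token_dim, time_dim)) * 0.1, np.zeros(time_dim)
w2, b2 = rng.normal(size=(time_dim, time_dim)) * 0.1, np.zeros(time_dim)

def embed_condition(stress_curve):
    """stress_curve: (n_steps,) normalized effective stresses."""
    tokens = stress_curve[:, None] @ w_tok + b_tok      # (n_steps, token_dim)
    pooled = tokens.mean(axis=0)                        # average over strain steps
    global_vec = silu(pooled @ w1 + b1) @ w2 + b2       # (time_dim,)
    return tokens, global_vec

stress_curve = np.linspace(-1.0, 1.0, n_steps)          # hypothetical target response
tokens, global_vec = embed_condition(stress_curve)
time_emb = rng.normal(size=time_dim)                    # stand-in for the embedding of t
resnet_cond = time_emb + global_vec                     # passed on to the ResNet blocks
```

In this sketch, `tokens` would feed the attention layers, while `resnet_cond` carries both the denoising step and the averaged target response into the ResNet blocks.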

Training protocol
We first pre-process the data as follows. We apply a min-max normalization to transform all input data $x$ (i.e., stress and displacement distributions) and conditioning (i.e., stress-strain responses) to the range [−1, 1], i.e.,

$$\tilde{x} = 2\,\frac{x - \min(x)}{\max(x) - \min(x)} - 1,$$

where the min and max operators are applied across all corresponding data points. For the stress and displacement fields, we consider all corresponding pixel values for all strain steps in the entire training dataset. For the stress-strain responses, we consider the minimum and maximum recorded stress response for all strain steps in the entire training dataset. Note that we store the image/video data generated with Abaqus in the GIF format to reduce storage requirements. We provide the training hyperparameters in Table S3 and the loss plots in Supplementary Information Section S4. The model was trained on the Euler high-performance cluster of ETH Zurich, utilizing parallel and mixed-precision processing. We use the Accelerate library from Hugging Face to facilitate the training setup, which was conducted on eight Nvidia Quadro RTX 6000 GPUs, each equipped with 24 GB of GDDR6 memory. The training process took approximately 70 h.
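A min-max normalization to [−1, 1] with dataset-wide extrema, together with its inverse for mapping predictions back to physical units, can be sketched as follows. The function names and the random demo data are hypothetical; in practice the extrema would be computed once per field type over the full training set.

```python
import numpy as np

def fit_min_max(data):
    """Global extrema over all samples, strain steps, and pixels."""
    return float(np.min(data)), float(np.max(data))

def normalize(x, lo, hi):
    """Map values from [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def denormalize(x_norm, lo, hi):
    """Inverse map from [-1, 1] back to [lo, hi]."""
    return (x_norm + 1.0) / 2.0 * (hi - lo) + lo

rng = np.random.default_rng(0)
# Hypothetical stand-in for one field type: (samples, strain steps, height, width)
fields = rng.normal(size=(5, 11, 8, 8))
lo, hi = fit_min_max(fields)
fields_norm = normalize(fields, lo, hi)
fields_back = denormalize(fields_norm, lo, hi)
```

Storing `lo` and `hi` alongside the model is what allows predicted fields to be reported in physical units such as MPa after sampling.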

Sampling protocol
Since the model does not directly predict binary pixels but stress and displacement distributions (which may be close to zero at the initial deformation stages), we require a robust method of extracting the underlying (undeformed) structure. We achieve this by considering the vertical displacement $u_2$ of the upper left quarter (corresponding to the gray area in Fig. 1a) of the predicted field, which is sufficient to extract the full topology due to symmetry. For each pixel, we check whether its value is within a 2% tolerance around zero displacement (relative to the maximum displacement range) across all strain steps. If so, we consider it void (and otherwise material). We found this method to be highly robust, as the upper boundary of the structure is compressed and thus all 'material pixels' will likely undergo some level of displacement (exceeding the set tolerance). We remove any disconnected sub-domains of the obtained design (though these were rarely observed). Further details on the effective stress response prediction and the mitigation of accuracy losses are provided in Supplementary Information Section S5.
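The extraction procedure described above can be sketched as three small steps: thresholding the quarter, mirroring it to the full unit cell, and discarding disconnected regions. The sketch assumes displacements in physical units, a mirror symmetry about both midlines, 4-connectivity, and that "removing disconnected sub-domains" means keeping the largest connected region; horizontal periodicity is ignored. All names are hypothetical.

```python
import numpy as np
from collections import deque

def extract_quarter_mask(u2, tol_frac=0.02):
    """u2: (n_steps, H, W) vertical displacement. Returns a boolean material
    mask of the upper-left quarter (True = material, False = void)."""
    _, h, w = u2.shape
    tol = tol_frac * (u2.max() - u2.min())        # 2% of the displacement range
    quarter = u2[:, : h // 2, : w // 2]
    void = np.all(np.abs(quarter) <= tol, axis=0)  # near zero at every strain step
    return ~void

def mirror_to_full(quarter):
    """Mirror the quarter horizontally, then vertically."""
    top = np.concatenate([quarter, quarter[:, ::-1]], axis=1)
    return np.concatenate([top, top[::-1, :]], axis=0)

def keep_largest_component(mask):
    """Breadth-first flood fill that keeps only the largest material region."""
    labels = np.zeros(mask.shape, dtype=int)
    sizes, current = {}, 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        current += 1
        labels[start] = current
        queue, size = deque([start]), 0
        while queue:
            i, j = queue.popleft()
            size += 1
            for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                inside = 0 <= ni < mask.shape[0] and 0 <= nj < mask.shape[1]
                if inside and mask[ni, nj] and not labels[ni, nj]:
                    labels[ni, nj] = current
                    queue.append((ni, nj))
        sizes[current] = size
    if not sizes:
        return mask.copy()
    return labels == max(sizes, key=sizes.get)

# Hypothetical demo: a solid strut plus one isolated speck in the quarter.
steps = np.linspace(0.002, 0.2, 11)
true_quarter = np.zeros((4, 4), dtype=bool)
true_quarter[:, 2:] = True
true_quarter[0, 0] = True
u2 = -steps[:, None, None] * mirror_to_full(true_quarter)
quarter = extract_quarter_mask(u2)
full = mirror_to_full(quarter)
cleaned = keep_largest_component(full)
```

In the demo, the isolated corner pixels are recovered as material by the threshold but are then dropped by the connectivity filter, leaving only the central strut.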

Error measures
To obtain an objective and scale-invariant error norm of the stress-strain curves, we consider the Normalized Root Mean Square Error (NRMSE) computed as

$$\epsilon\left(\sigma_\text{eff.}^\text{pred.}, \sigma_\text{eff.}^\text{target}\right) = \frac{\left\| \sigma_\text{eff.}^\text{pred.} - \sigma_\text{eff.}^\text{target} \right\|}{\left\| \sigma_\text{eff.}^\text{target} \right\|},$$

where $\sigma_\text{eff.} \in \mathbb{R}^{11}$ is the vector collecting the effective stress values $\sigma_\text{eff.}$ at the eleven strain steps, and $\|\cdot\|$ is the Euclidean norm. For the full-field responses, we compute the analogous relative $L_2$-error per strain step as

$$\epsilon_{L_2}\left(\sigma_{22}^\text{pred.}, \sigma_{22}^\text{true}\right) = \frac{\left\| \sigma_{22}^\text{pred.} - \sigma_{22}^\text{true} \right\|}{\left\| \sigma_{22}^\text{true} \right\|},$$

where $\sigma_{22} \in \mathbb{R}^{N \times N}$ denotes the $\sigma_{22}$-stress values of the discretized pixel grid in the Lagrangian frame for the corresponding strain step, and $\|\cdot\|$ is the Frobenius norm.

Supplementary Information
We will release the Supplementary Information along with the publication.

Data availability
We will release the training and validation dataset consisting of pairs of full-field data and the effective stress-strain response along with the publication.

Code availability
We provide the code used to train the model and generate new metamaterial designs conditioned on a given stress-strain response in https://github.com/jhbastek/VideoMetamaterials.

Figure 1
Figure 1 Metamaterial generation process. (a) A 2D cellular UC is generated by sampling from a 2D Gaussian random field, applying a varying threshold to extract a binary field, and mirroring the resulting pattern when connectivity to the boundaries is ensured. (b) To obtain the stress-strain response, we place the UC between two rigid plates with periodic boundary conditions in the horizontal direction and apply a compressive strain of up to 20%. The corresponding stress and displacement fields within the UC are computed by FE simulations, and the overall effective stress-strain response $\sigma_\text{eff.}$ is extracted from the nodal reaction forces, though it can equally be obtained from the full-field data. A representative selection of responses of the generated designs is plotted in gray.

Figure 3
Figure 3 Metamaterial synthesis for four stress-strain responses not represented in the training dataset. (a-d) The model is conditioned on four technically relevant, challenging target responses. Validation of the predicted effective stress response $\sigma_\text{eff.}$ ('Fwd. eval.'; NRMSE with respect to the target response in brackets) of the generated designs is achieved by FE simulations ('FE eval.'), agreeing with the predicted response and significantly outperforming the best match in the training dataset ('Best match'). We additionally compare the predicted full-field $\sigma_{22}$-distribution (indicated in MPa in the Eulerian frame) with the FE ground truth and provide the corresponding relative $L_2$-errors. To highlight the range of responses in the training dataset, we plot a representative selection in gray in (a). *The relative $L_2$-error is numerically inflated due to the small magnitude of the stress field and is hence not truly indicative (but included for completeness).
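The NRMSE of the stress-strain curves and the per-step relative $L_2$-error of the stress fields reported in Figure 3 can be sketched as follows. The sketch assumes the common norm-ratio form, i.e., the norm of the difference divided by the norm of the reference, using the Euclidean norm for the eleven-entry stress vectors and the Frobenius norm for each $N \times N$ field; function names and demo values are hypothetical.

```python
import numpy as np

def nrmse(sigma_pred, sigma_target):
    """Euclidean norm of the difference, normalized by the norm of the target."""
    sigma_pred = np.asarray(sigma_pred, dtype=float)
    sigma_target = np.asarray(sigma_target, dtype=float)
    return np.linalg.norm(sigma_pred - sigma_target) / np.linalg.norm(sigma_target)

def relative_l2(field_pred, field_true):
    """Per-strain-step relative error for fields of shape (n_steps, N, N).

    With a 2-tuple `axis`, np.linalg.norm defaults to the Frobenius norm.
    """
    diff = np.linalg.norm(field_pred - field_true, axis=(1, 2))
    ref = np.linalg.norm(field_true, axis=(1, 2))
    return diff / ref

# Hypothetical demo data
target = np.linspace(0.1, 1.0, 11)
pred = 1.1 * target                       # uniformly 10% too stiff
curve_error = nrmse(pred, target)

field_true = np.ones((11, 4, 4))
field_pred = field_true.copy()
field_pred[3] *= 1.5                      # perturb a single strain step
field_errors = relative_l2(field_pred, field_true)
```

Since the reference norm appears in the denominator, the measure grows sharply when the reference field is close to zero, which is consistent with the caveat in the caption of Figure 3 about inflated relative errors for small stress magnitudes.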