Main

The success of life on Earth derives from its use of molecules to carry information, implement algorithms that control chemistry and respond intelligently to the environment. Genetic information encodes not only molecules with structural and chemical functionality, but also biochemical circuits that in turn process internal and external information relevant for cellular decision-making. Whereas some biological systems may, like modern modular engineering, isolate information processing from the physical subsystems being controlled8, other critical decision-making may be embedded within and inseparable from processes such as protein synthesis, metabolism, self-assembly and structural reconfiguration. Understanding such physically entangled computation is necessary not only for understanding biology, but also for engineering autonomous molecular systems such as artificial cells, in which it is essential to pack as much capability as possible within limited space and energy budgets.

The interplay of structure and computation is particularly rich in molecular self-assembly. In biological cells, decisions about navigation, chemotaxis and phagocytosis are made through structural rearrangements of the cytoskeleton that integrate mechanical forces and chemical signals9,10,11,12, but where and how information processing occurs remains elusive. In DNA nanotechnology13, self-assembly of DNA tiles has been shown theoretically and experimentally to be capable of Turing-universal computation through simulation of cellular automata and Boolean circuits14,15,16, but this digital model of computation lacks a clear analogue in biology.

Neural computation is an alternative form of naturally compact computation with several distinctive hallmarks1,2,3: mixed analogue and digital decision-making, recognition of high-dimensional patterns, reliance on the collective influence of many distributed weak interactions, robustness to noise and an inherent ability to learn and generalize. A paradigmatic neural network model is the Hopfield associative memory17, which conceptualizes dynamics as a random walk on an energy landscape that has been sculpted by learning to contain attractor basins at each memory. Remarkably, neural network models map naturally onto models of well-mixed chemical networks4,5, genetic regulatory networks6 and signal transduction cascades7; such networks have been experimentally demonstrated both in cell-free systems and within living cells18,19,20,21. However, these well-mixed approaches still separate decision-making from downstream processes.

Neural information-processing principles embedded within molecular self-assembly have been harder to discern, and perhaps at first appear as a contradiction in terms. An early thermodynamic view of how free-energy minimization in molecular self-assembly could be akin to the Hopfield model did not lead to concrete realizations22. However, a recent kinetic view of multicomponent systems that permit assembly of many distinct structures using the same components (‘multifarious self-assembly’)23,24 revealed concrete connections to Hopfield associative memories17 and models of hippocampal place cells25 at the level of collective dynamics, even though individual molecules do not explicitly mimic the mechanistic behaviour of individual neurons.

Here we reformulate this connection as an intrinsic feature of heterogeneous nucleation kinetics and experimentally demonstrate its power for high-dimensional pattern recognition using DNA nanotechnology13. The phenomenon arises when the same components can form several distinct assemblies in different geometric arrangements (Fig. 1). Nucleation proceeds by spontaneous formation of a critical seed that subsequently grows into a structure26. Because the nucleation rate of a seed depends strongly on the bulk concentrations of components that occur in that seed, and many distinct seeds and pathways may be viable, the overall rate of formation of a given structure is a complex function of the concentration pattern. Further, because components are shared between structures, competition for resources27 results in a winner-take-all (WTA) effect that accentuates the discrimination between concentration patterns.

Fig. 1: Conceptual framework for pattern recognition by nucleation.

When one set of molecules can potentially assemble multiple distinct structures, the nucleation process that selects between outcomes is responsive to high-dimensional concentration patterns. Assembly pathways can be depicted on an energy landscape (schematic shown) as paths from a basin for unassembled components that proceed through critical nucleation seeds (barriers) to a basin for each possible final structure. Seeds that colocalize high-concentration components will lower the nucleation barrier for corresponding assembly pathways. The resulting selectivity of nucleation in high-dimensional self-assembly is sufficiently expressive to perform complex pattern recognition in a manner analogous to neural computation (Extended Data Fig. 1).

Molecular system design

To explore these principles experimentally, we take advantage of the powerful foundation provided by DNA nanotechnology for programming molecular self-assembly. The well-understood kinetics and thermodynamics of Watson–Crick base pairing enable systematic sequence design28,29 for DNA tiles that reliably self-assemble into periodic, uniquely addressed and algorithmically patterned structures with hundreds to thousands of distinct tile types15,16,30,31,32,33,34. These classes of self-assembly differ in the structures produced and in the nature of interactions: in periodic and uniquely addressed structures, each molecular component typically has a unique possible binding partner in each direction. For algorithmic patterns (as for multifarious assembly), some components have multiple possible binding partners, such that which one attaches at a given location is decided during self-assembly on the basis of which forms more bonds with neighbouring tiles.

We build on these ideas to create a molecular system capable of assembling multiple target structures (H, A and M in Fig. 2) from a shared set of interacting components by colocalizing them in different ways. The first stage of design begins with a set S of shared tiles that do not directly bind each other; three sets of interaction-mediating tiles (also called H, A and M), one for each of the desired structures, are then introduced. Each interaction tile in, for example, H, binds four specific S tiles together in a chequerboard arrangement that reflects neighbourhood constraints between shared S tiles in structure H. These H interaction tiles are unique to structure H and do not occur in the assembled A or M structures.

Fig. 2: A multifarious mixture of 917 molecular species that can assemble into three distinct structures from one set of molecules.

a, Here 42-nucleotide DNA strands self-assemble into two-dimensional (2D) structures by forming bonds with four complementary strands using four 10- or 11-nucleotide domains. The strands can be abstracted as square tiles, each named and shown with distinct binding domains identified by number, such that, for example, 708 is complementary to 708*. At nucleation and growth temperatures, attaching by two bonds or more is favourable whereas attaching by one bond is insufficient. b, One pool of 917 tile types assembles into three distinct shapes, H, A and M, through a multitude of pathways. Whereas each tile occurs at most once in each shape, the shared purple species recur in multiple shapes, in distinct spatial arrangements; for example, S149 is highlighted in red. c, Annealing an equal mix of all tiles results in a mixture of fully and partially assembled H, A and M, imaged by AFM. This is the same sample as SHAM60 in Fig. 6e. The inset illustrates the expected slant of the shapes due to SST geometry. Scale bars, 50 nm. d, A typical experiment mixes the desired concentrations of each tile type into a single tube, with some tiles swapped for fluorophore- and quencher-modified versions. The sample is heated to remove any pre-existing binding, cooled to a temperature slightly above where any growth is observed, then slowly annealed through a small range of temperatures while fluorescence is measured in a qPCR machine; samples are then imaged by AFM.


Tiles in a 1:1 stoichiometric mix of S + H, S + A or S + M will have no promiscuous interactions and will assemble H, A or M, respectively, as with previous work on uniquely addressable structures32. But a 1:1:1:1 mix of S + H + A + M, henceforth called our SHAM mix, can assemble three distinct structures. This additive construction of interaction-mediating tiles is analogous to Hebbian learning of multiple memories in Hopfield neural networks17,23 (Extended Data Fig. 1). Furthermore, the use of interaction-mediating tiles avoids constraints from Watson–Crick complementarity, allowing almost arbitrary interactions to be engineered between S tiles. To avoid undesired consequences of the extensive promiscuous interactions present in the SHAM mix, later design stages optimized this initial layout using self-assembly proofreading principles to reduce errors35,36 (Extended Data Figs. 2 and 3).
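As a reminder of the analogy (a standard textbook formula, not a quantitative model of the tile system), Hebbian learning stores P memories \(\xi^{\mu}\in \{-1,+1\}^{N}\) in a Hopfield network by adding one outer-product term per memory, \(W_{ij}=\sum_{\mu =1}^{P}\xi_{i}^{\mu}\xi_{j}^{\mu}\) (up to normalization, with \(W_{ii}=0\)); likewise, each target structure here contributes its own set of interaction-mediating tiles to the shared pool without modifying the tiles contributed for the other structures.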

The resulting design in Fig. 2b has 168 tiles shared across all three shapes, 203 tiles shared across a pair and 546 tiles unique to a specific shape. Our experimental implementation used 42-nucleotide single-stranded DNA tiles32 (Fig. 2a) with sequences designed using tools from previous work16 to reduce unintended interactions and secondary structure and to ensure nearly uniform binding energies.

To test whether proofreading was sufficient to combat promiscuity and to test the unbiased yield of different structures, we annealed all tiles at equal concentration (60 nM) in solution over 150 hours from 48 to 45 °C. Atomic force microscopy (AFM) revealed a roughly equal yield of all three structures (Fig. 2c). Despite the slow anneal, this uniform distribution is incompatible with an equilibrium Boltzmann distribution that would exponentially magnify differences in the area and perimeter (and thus free energy) of H, A and M; but it is compatible with kinetically controlled assembly in which nucleation rates are linearly proportional to a shape’s area, as nucleation could occur anywhere within the shape. Furthermore, we did not observe significant chimeric structures or uncontrolled aggregation, indicating that proofreading was functioning as desired. However, many structures appeared to be incomplete—often missing tiles from two specific corners, perhaps due to asymmetric growth kinetics or lattice curvature31—or (in the case of A only) showed signs of spiral defect growth (Extended Data Fig. 3).

Colocalization controls nucleation

Understanding nucleation in multicomponent self-assembly has required extensions of classical nucleation theory26 that have effectively guided the design of programmable DNA tile systems with well-defined assembly pathways37,38,39,40. Building on this work, here we examine how selection between target structures that differ in colocalization of tiles can be determined by nucleation kinetics and controlled by concentration patterns. We model the free energy of a structure A with B total bonds as \(G(A)=\sum_{i\in A}G_{\mathrm{mc}}^{\,i}-BG_{\mathrm{se}}-\alpha\), where α depends on the choice of reference concentration \(u_0\), \(G_{\mathrm{mc}}^{\,i}=\alpha-\log (c_i/u_0)\) is the chemical potential (or equivalently, translational entropy) of tile i at concentration \(c_i\) and \(G_{\mathrm{se}}\) is the energy of each bond in units of RT, the molar gas constant times temperature. G(A) has competing contributions that scale with the structure’s area and perimeter, and is hence maximized for certain partial assemblies called critical nucleation seeds. The formation of such seeds is often rate-limiting: once these seeds are assembled, subsequent growth is faster and mostly ‘downhill’ in free energy. If the nucleation rate \(\eta_{\mathrm{shape}}\) for a given shape is dominated by a single critical nucleus \(A_{\mathrm{s}}\), we could use an Arrhenius-like approximation \(\eta_{\mathrm{shape}}\propto e^{-G(A_{\mathrm{s}})}\); in the case that multiple critical nuclei are significant, we must perform a sum.
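A minimal sketch of this free-energy model in Python (our own illustration; the function names, the uniform bond energy and the example numbers are assumptions, not part of the published model):

```python
import math

def assembly_free_energy(tiles, n_bonds, conc, G_se, u0=1e-9, alpha=0.0):
    """Free energy G(A), in units of RT, of a partial assembly.

    tiles   : list of tile identities in the assembly A
    n_bonds : total number of bonds B formed within the assembly
    conc    : dict mapping tile identity -> bulk concentration (M)
    G_se    : energy of each bond, in units of RT
    u0      : reference concentration defining the chemical-potential zero
    alpha   : offset fixed by the choice of u0 (see main text)
    """
    # Chemical-potential (translational entropy) cost of localizing each tile:
    # G_mc^i = alpha - ln(c_i / u0); dilute tiles are costly to recruit.
    g_mc = sum(alpha - math.log(conc[t] / u0) for t in tiles)
    return g_mc - n_bonds * G_se - alpha

def arrhenius_weight(barrier_G):
    """Relative nucleation rate when a single critical nucleus dominates:
    eta_shape proportional to exp(-G(A_s)), up to kinetic prefactors."""
    return math.exp(-barrier_G)

# Example with hypothetical numbers: a 4-tile, 4-bond seed of 100 nM tiles, G_se = 10 RT.
conc = {t: 100e-9 for t in ("a", "b", "c", "d")}
print(arrhenius_weight(assembly_free_energy(["a", "b", "c", "d"], 4, conc, G_se=10.0)))
```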

When such analyses are applied to homogeneous crystals with uniform concentration ci = c of components, critical nuclei are simply those with the appropriate balance of size and perimeter. Heterogeneous concentration patterns require a more nuanced analysis: critical seeds can now be arbitrarily shaped, potentially offsetting a larger perimeter penalty by incorporating tiles with higher bulk concentration. Therefore, we implemented a stochastic sampling algorithm to estimate the nucleation rate of a structure with an uneven pattern of concentrations (Extended Data Fig. 4).

Consider the examples in Fig. 3 where the concentrations of some shared tiles in the SHAM mix have been enhanced. These high-concentration tiles are colocalized in structure A but scattered across H and M. Consequently, such a pattern will lower kinetic barriers for the nucleation of A while maintaining high barriers for H and M. The typical area K over which colocalization promotes nucleation can be estimated from the size of critical seeds predicted by classical nucleation theory and is generally larger at higher temperatures26. Hence, we expect a trade-off between speed and complexity of pattern recognition (Fig. 3e), with more subtle discrimination at higher temperatures (large K)—at the expense of slower experiments—and lower discriminatory power at lower temperatures (small K).

Fig. 3: Theory shows selective nucleation when high-concentration tiles are colocalized in one shape more than in others.

a, One pattern (A flag 9) enhancing the concentration of shared tiles colocalized in A but relatively dispersed in H and M. b, A flag 9, plotted by tile locations in each shape, along with example flag patterns that have colocalization in H and M. c, For A flag 9, free energies of assemblies along predicted nucleation pathways for each shape (Extended Data Fig. 4). Several example assemblies are shown; the green and red ones are critical seeds for the A and H pathways, respectively. d, Regions predicted to participate in nucleation by the simulation for three concentration patterns (lighter colours correspond to higher participation). e, Macrostate free energies for sets of partial assemblies of increasing size (number of tiles) and predicted AFM results at several temperatures spanning the melting temperature. Small plots show the full size range, thus illustrating the independence of the nucleation barrier kinetics from the complete-assembly thermodynamics. f, For on-target (A, green) and off-target (H, red) shapes, nucleation rates (dashed) and growth rates (solid) are plotted as a function of temperature, according to the simplified model of Extended Data Fig. 4f. Rates are given relative to the time to completely consume the lowest-concentration tile; the horizontal dotted line indicates the rate of annealing between the on-target and off-target nucleation temperatures. Owing to the higher nucleation temperature for the on-target shape, when annealing time scales are comparable to or slower than growth time scales, depletion of shared tiles during a temperature anneal can lead to a WTA effect. Slower annealing and faster growth can increase the WTA effect. g, In this model, WTA leads to higher selectivity (on-target versus total nucleation) compared to systems with no shared components; for slower anneals, selectivity increases for systems with shared components, but decreases for systems with no shared components.


To experimentally characterize the basis of selectivity, we systematically tested a series of 37 concentration patterns, which we call ‘flags’ because each one uses high concentrations in a chequerboard localized somewhere in one of the shapes (three examples are shown in Fig. 3b). We did not enhance concentrations of tiles unique to shapes, to avoid additional thermodynamic bias towards any one structure. We ramped the temperature down slowly, from 48 to 46 °C (the expected range for nucleation, a few degrees below the melting temperatures) to provide robustness to variations in nucleation temperatures among flags in different locations and to probe for slow off-target nucleation. To monitor nucleation and growth in real time, we designed distinct fluorophore–quencher pairs on adjacent tiles in four locations on each shape, using tiles not shared between shapes. Each pair quenches when the local region of that specific structure assembles (Fig. 4a).

Fig. 4: Selective nucleation in experiments with shape-specific localized concentration patterns of shared tiles.

a, Pairs of alternative tiles with a fluorophore and quencher (Fig. 2d) have their fluorescence quenched when incorporated together in an assembly; small assemblies of just a few strands do not effectively quench (Extended Data Fig. 5). b, Samples were annealed with a temperature protocol that cooled from 71 °C (well above the melting temperature) to 48 °C over roughly 6 hours, cooled to 46 °C over 100 hours and finally cooled to 39.5 °C over 3 hours (Extended Data Fig. 6). c, Experimental results for the three flag patterns shown in Fig. 3. The positions of fluorophore–quencher tile pairs used in each of the four samples are shown by the inset icons. Points where fluorescence signals dropped by 10% below their maximum (to which signals were normalized) are shown with coloured dots for on-target nucleation and with separate markers for off-target nucleation. ‘Growth times’ measure the period from ‘10% quenching’ to the end of the experiment, shown as horizontal bars. Sample AFM images from one of the samples are shown for each flag. Scale bars, 100 nm. d, Total growth times for on-target versus off-target nucleation are summarized for all 37 flag patterns. Each numbered box indicates the location of the corresponding 5 × 5 chequerboard flag; good performance is indicated by a tall green bar and a short red bar. e, The same data shown as a ternary plot, with proximity to triangle corners indicating relative fractions of growth time and circle size indicating overall growth time. f, Average change in quenching (a measure of nucleation) of on- and off-target structures with flag patterns compared to equimolar SHAM mixes. Each dot represents a single flag pattern (Extended Data Fig. 7). For most patterns, increasing shared tile concentrations reduces the absolute off-target nucleation, supporting a WTA effect.


Experimental results illustrating selective nucleation are shown in Fig. 4c for three example flag concentration patterns. When the pattern localizes high-concentration species in a structure, for example, H, the fluorophore in the expected nucleation region of that structure quenched first and rapidly. After a delay, fluorophore signals from other parts of the same structure also dropped, indicating growth. Fluorophores on off-target structures showed minimal to no quenching until late in the experiment. AFM images from samples at the end of the experiment confirm that fluorophore quenching corresponded to selective self-assembly of complete or partial shapes. Of the 37 flag positions, roughly half showed robust selective nucleation and growth (Fig. 4d,e), while other positions were either not selective or did not grow well, for reasons we have not been able to determine.

In multifarious systems, we expect enhanced selectivity because of a competitive suppression of nucleation. Using an annealing protocol that spends sufficient time at temperatures in which A can nucleate and grow significantly, but H cannot nucleate (Fig. 3f), we expect a WTA effect in which the assembly of A depletes shared tiles S and thus actively suppresses nucleation of H. As shown in Fig. 4f, we see evidence for this effect in most experiments, suggesting that WTA dynamics is amplifying small differences in nucleation kinetics.

Pattern recognition by nucleation

Our work thus far shows that the space of all concentration patterns, which includes patterns not experimentally tested, consists of regions that result in the selective assembly of H, A or M. These regions together represent a phase diagram for this self-assembling system23 that reflects the decisions it makes to classify concentration patterns. Whereas phase boundaries of traditionally studied physical systems are usually low dimensional and not fruitfully interpreted as decision boundaries, in multicomponent heterogeneous systems such as ours, the phase diagram is naturally high dimensional. More generally, phase boundaries in disordered many-body systems tend to be complex and thus implicitly solve complex pattern recognition problems, a perspective that also underlies Hopfield’s associative memory in neural networks17,41.

Fig. 5: Design of self-assembly phase diagrams to solve pattern recognition problems.

a, Phase diagram showing desired outcomes of kinetically controlled self-assembly in different regions of the N = 917-dimensional concentration space (2D schematic shown). Each grayscale image represents a vector of tile concentrations. b, θ specifies which pixel location corresponds to which tile. c, Given a map θ, any image can be converted to a tile concentration vector by associating the grayscale value of pixel location n with the concentration of the corresponding tile i = θ(n). We compute the loss for a given pixel-to-tile map θ using simulations to estimate the nucleation rates of desired and undesired structures for each image and summing over a training set. Stochastic optimization in θ space gives a putative optimal θopt that we used for experiments. d, Images used for training. e, Extra images used to test generalization power. Sources and names of individuals in d are, from left to right, as follows (for details, see Supplementary Information section 2.7); some of the images are also shown in a, c and e, with credits as for d. d, Top row: D. Hodgkin, Keystone/Getty Images; J. Hopfield, Princeton University; Horse, Pixabay. Second row: Hazelnuts, Pixabay; Harom, MNIST; H, EMNIST. Third row: A. Avogadro, C. Sentier/University of Pennsylvania; L. Abbott, himself; Anchovy, NOAA/NMFS/SEFSC Pascagoula Laboratory. Fourth row: Apples, M. Shemesh; Aon, MNIST; A, EMNIST. Fifth row: E. Mitscherlich, William Sharpe/Smithsonian Institute; M.-B. Moser, BI Basmo/Kavli Institute of Systems Neuroscience; Mockingbird, Pixabay. Sixth row: Magnolia, D. Richardson; Mbili, MNIST; M, EMNIST.


Here nucleation is solving a particular pattern recognition problem based on which molecules are colocalized in different structures. Similar colocalization-based decision boundaries arise in neural place cells studied by the Mosers24,25,42,43 and are complex enough to solve pattern recognition problems and permit statistically robust learning. Having demonstrated that multifarious self-assembly can solve a specific pattern recognition problem, we asked whether different molecules could be designed to solve other tasks, such as recognizing or classifying images. Here the grayscale value of each pixel position in the 30 × 30 images is taken to represent the concentration of a distinct molecule. Instead of synthesizing new molecules with new interactions to solve this challenge, we show that the design problem is solvable with our existing molecules by an optimized choice of a pixel-to-tile map θ that specifies which existing tile should correspond to which pixel position (Fig. 5b). In addition to saving DNA synthesis costs, this approach helps demonstrate that a random molecular design can be exploited, ex post facto, to solve a specific computational problem by modifying how the problem is mapped onto physical components, as done in reservoir computing44.

We specified our design problem by picking arbitrary images as the training sets shown in Fig. 5d. Note that images within one class resemble each other no more than they resemble images in other classes (for example, class H is Hodgkin, Hopfield, Horse and so on), although the number of pixels and the grayscale histogram were standardized across images (Methods). In this way, the number of distinct images per class (six in the experiments presented below) tests the flexibility of decision surfaces inherent to this self-assembling molecular system as a classifier.

We then used an optimization algorithm (Fig. 5c and Methods) on θ that sought to maximize nucleation of the on-target structure for the concentration pattern corresponding to each image while also minimizing off-target nucleation. That is, our algorithm sought to map high-concentration pixels in each image (for example, Mitscherlich) to colocalized tiles in the corresponding on-target structure (here, M) to enhance nucleation, while mapping those same pixels to scattered tiles in undesired structures (here, A and H). Note that this map θ is simultaneously optimized for all images and not independently for each image. Hence no map θ might be able to perfectly satisfy all the above requirements simultaneously for all images in all classes; analogous to associative memory capacity17,23,41, performance drops as one attempts to train more patterns (Extended Data Fig. 8).
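A minimal sketch of this training objective in Python (our own illustration; the function and variable names are assumptions, and the actual heuristic also includes nucleation-location and reporter constraints described in Methods):

```python
def training_loss(theta, training_set, nucleation_rate, shapes=("H", "A", "M")):
    """Score a candidate pixel-to-tile map over the whole training set at once (lower is better).

    theta           : dict mapping pixel index n -> tile identity theta(n)
    training_set    : list of (pixel_concentrations, target_shape) pairs, where
                      pixel_concentrations[n] is the concentration encoded by pixel n
    nucleation_rate : callable (tile_concentrations, shape) -> estimated nucleation rate
    """
    loss = 0.0
    for pixel_conc, target in training_set:
        # Route each pixel's concentration to its assigned tile.
        tile_conc = {theta[n]: c for n, c in enumerate(pixel_conc)}
        on = nucleation_rate(tile_conc, target)
        off = sum(nucleation_rate(tile_conc, s) for s in shapes if s != target)
        # Reward on-target nucleation and penalize off-target nucleation; a single
        # map theta must work for every training image simultaneously.
        loss += off - on
    return loss
```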

For pattern recognition experiments, we enhanced concentrations of tiles in the SHAM mix in accordance with each of the 18 training images (using the optimized θ) and annealed each of the 18 mixes with a 150 hour ramp from 48 to 45 °C. As verified by AFM imaging and real-time fluorescence quenching, we found that the 18 training images yielded correct nucleation, in the sense that there was more of the correct shape than any other shape, and in all but five cases nucleation was highly (more than 80%) selective (Fig. 6).

Fig. 6: Pattern recognition results with a pre-existing multifarious system.

a–c, All images (three shown) are converted (a) through a single pixel-to-tile map θ to vectors of tile concentrations (b), which are shown mapped onto tile locations in each shape, with nucleation rate predictions (c). d, Normalized fluorescence over time in hours (one label per shape; other label configurations shown in Extended Data Fig. 9) during a 150 hour temperature ramp from 48 to 45 °C, and final AFM images. Scale bars, 100 nm. e, Summary of results for both fluorescence and AFM for all 36 images, and a uniform 60 nM tile concentration control sample. Above, colours of vertical lines indicate the target shape for each pattern, whereas triangular markings of each colour indicate the relative fraction of growth time (on right) or fraction of shapes counted in AFM images (on left) for the corresponding shape (solid markings indicate target shape). Dashed lines indicate samples with no significant quenching or observed shapes. Below, ternary plots summarize the same results, with proximity to triangle corners indicating relative fractions of growth time (right) or counted shapes (left) and circle size indicating overall growth time (right) or total number of shapes (left).


We also tested 12 degraded images and six alternate handwriting images (Fig. 5e), with the same trained pixel-to-tile map θ. Pattern recognition was successful for random speckle distortions and for all but one partly obscured image. Generalization, the ability to recognize related images not present in a training set, is a critical aspect of learning in neural networks. A given architecture can be naturally robust to certain families of distortions (for example, convolutional networks can handle translation) but not others (for example, dilation). As nucleation is a cooperative process, often dominated by one or a few critical seeds involving just a handful of tiles, flipping random uncorrelated pixels or obscuring parts of an image that do not involve those critical pixel combinations will not inhibit nucleation, demonstrating robustness. On the other hand, only three of the six alternate handwriting images were correctly recognized by self-assembly, indicating a lack of robustness to this type of variation without further training.

Discussion

The phenomena underlying pattern recognition by multifarious self-assembly may be exploited by complex evolved or designed systems (Extended Data Fig. 10). Beyond self-assembly, molecular folding processes could potentially recognize patterns in the concentrations of cofactors or subcomponents if folding kinetics can select between distinct stable states45. Similarly, the phase boundaries for multicomponent condensates governing genetic regulation46 may also contain inherent information-processing capabilities. In such cases, the ‘pixel-to-tile’ map would instead correspond to a layer of phosphorylation or binding circuitry that activates or deactivates specific components on the basis of the levels of upstream information-bearing molecular signals. Within artificial cells47, multicomponent nucleation may be an especially compact way to implement decision-making within the limited space constraints.

To better understand the information-processing potential of nucleation, we may treat this physical process as a machine learning model. A key issue is how the complexity of decision surfaces, quantified in terms of computational power or learning capacity, depends on underlying physical aspects of self-assembly such as the number of molecular species, binding specificity and geometry48,49. Our work already suggests that temperature mediates a trade-off between speed, accuracy and complexity of pattern recognition; at higher temperatures, nucleation seeds are larger, allowing discrimination on the basis of higher-order correlations in the concentration patterns, but the physical process is also correspondingly slower. The trade-off derives from how computation here exploits the inherently stochastic nature of nucleation; monomers must make many unsuccessful attempts at forming a critical seed for both on- and off-target structures, with repeated disassembly before discovering the seed for the correct pattern recognition outcome. Relating such backtracking to stochastic search algorithms for NP-complete problems, as has been done for well-mixed chemistry50, might characterize the computational power of stochastic nucleation.

Viewing nucleation as a machine learning model raises the question of whether there is a natural physical implementation of learning. Here we trained decision boundaries in silico using ideas from reservoir computing44,51; molecules with a fixed set of interactions could nevertheless solve an arbitrary problem by changing the mapping between inputs and fixed components (Extended Data Fig. 1). The analogy between Hopfield associative memories and multifarious self-assembly, especially those based on random colocalization23,24,25,42,43, suggests a way to go beyond fixed components to a scenario in which interactions between components are learned in a Hebbian manner by a natural physical process. Notably, interactions between shared tiles in our system are mediated by shape-specific molecules. If these interaction-mediating tiles could be physically created or activated in response to environmental inputs, for example, through proximity-based ligation, molecular systems could autonomously learn new self-assembling behaviours from examples52 without the need for computer-based learning. Alternatively, the natural evolution of hydrophobic residues to stabilize multi-protein complexes may have the necessary properties for inducing multifarious pattern recognition53.

The connection between self-assembly and neural network computation raises many questions for further exploration, the broadest being a variant on Anderson’s observation that ‘more is different’54. Anderson was referring to the fact that systems containing many copies of the same simple component can show emergent phenomena, such as fluid dynamics, that are best understood at a higher level. Biology also explores another sense of ‘more is different’: it often makes use of a few copies of a great many different types of component8. Here new phenomena naturally emerge in the ‘large N limit’: robustness, programmability and information processing. These phenomena are best explored in information-rich model systems devoid of the distracting complexities of biology. DNA nanotechnology provides one such platform that already hints at such ‘more types is different’ phenomena. For example, self-assembled few-component DNA structures are often sensitive to sequence details and molecular purity, thus taking years to refine experimentally, whereas DNA origami55 and uniquely addressed tile systems32,33,34 use hundreds to thousands of components and usually work on the first try, even with unpurified strands, imprecise stoichiometry and no sequence optimization. Such observations suggest heterogeneity as a defining principle for biological self-assembly56.

Our work adds sophisticated information processing as a new emergent phenomenon in which self-assembly, in the multicomponent limit, gains programmable and potentially learnable phase boundaries to solve specific pattern recognition problems, analogous to earlier results for large N neural networks41. This neural network inspired perspective may help us recognize information processing in high-dimensional molecular systems that is deeply entangled within physical processes, whether in biology or in molecular engineering: multicomponent liquid condensates, multicomponent active matter and other systems might have similar programmable and learnable phase boundaries.

Methods

Multifarious DNA tile system design

Previous theoretical proposals23,24,57 for multifarious mixtures require each component to accept multiple strongly binding partners at each binding site. However, in DNA tile assembly, each binding site can usually only bind its Watson–Crick complement, not an arbitrary set of other domains. Hence, we used an alternate approach: we laid out three structures made of entirely unique, abstract tiles, designed a merging algorithm to reuse tiles in multiple locations if the consequences for unintentional binding between other tiles were minimal, and then designed DNA sequences reflecting the resulting abstract layout of tiles.

The three target shapes were drawn on a 24 × 24 single-stranded tile (SST) molecular canvas32, at an abstract level without sequences. Each location in each shape was initially a unique tile, with four abstract binding sites referred to as ‘glues’ in place of binding domains with sequences: after sequence design, ‘matching’ glues correspond to domains with complementary sequences. Edges of the shapes used a special ‘null glue’ with no valid binding partner. In total, this initial design had 2,706 glues and 1,456 tiles.

The three shapes were then processed through a ‘merging’ algorithm that attempted to reuse the same tiles in different shapes. Each step of the algorithm randomly chose two tiles in two different shapes, with null glues on the same sides of each tile, if any. It then considered a modified set in which the two tiles were identical, made by giving them the same four glues and propagating the changes in the glues to all other places they occurred within all shapes, starting with the neighbouring tiles (for example, Extended Data Fig. 2c). Such a change could create undesired growth pathways, for example, allowing chimeras of multiple shapes. Thus, the algorithm then checked the modified set for two criteria taken from algorithmic self-assembly (Extended Data Fig. 2a,b). The self-healing criterion requires that, for any correct subassembly of any shape, whereas attachments of the wrong tile for a particular location may take place by one bond, only the correct tile can attach by two or more bonds58. The second-order sensitivity criterion for proofreading requires that, for any correct subassembly of any shape, if an incorrect attachment by one bond takes place, the incorrectly attached tile will not create a neighbourhood where an additional incorrect tile can attach by two bonds, and thus the initial error will be likely to fall off35,36. If the modified set satisfied these two criteria, which are trivially satisfied when every tile and bond is unique to a particular location, then the merging algorithm accepted the modified set and continued to another step with a different pair of randomly chosen tiles. In this way, we ensured that there is at least a minimum barrier to continued incorrect growth in the regime where tile attachment by two or more bonds is favourable and attachment by one bond is unfavourable, which is the case close to the melting temperature of most DNA tile assembly systems59,60.
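A minimal Python sketch of one merge step (our own illustration with simplified data structures; the pairing restrictions, glue complements and renaming-order subtleties of the real algorithm are omitted, and the two criteria checks are stand-ins for the full algorithmic self-assembly tests described above):

```python
import copy
import random

def merge_step(shapes, tiles, check_self_healing, check_second_order):
    """Attempt to merge one randomly chosen pair of tiles.

    shapes : dict shape_name -> 2D grid of tile names
    tiles  : dict tile_name -> {'N': glue, 'E': glue, 'S': glue, 'W': glue}
    The check_* callables stand in for the self-healing and second-order
    sensitivity tests; each returns True if its criterion still holds.
    """
    a, b = random.sample(sorted(tiles), 2)   # candidate pair (ideally from different shapes)
    proposal = copy.deepcopy(tiles)
    # Make tile b identical to tile a by renaming b's glues to a's glues everywhere
    # they occur, which propagates the change to b's neighbours in every shape.
    for side in ("N", "E", "S", "W"):
        old, new = proposal[b][side], proposal[a][side]
        for t in proposal.values():
            for s in ("N", "E", "S", "W"):
                if t[s] == old:
                    t[s] = new
    if check_self_healing(shapes, proposal) and check_second_order(shapes, proposal):
        return proposal   # accept the merge
    return tiles          # reject it and keep the previous tile set
```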

The algorithm repeatedly merged tiles that satisfied the two criteria until no further acceptable merges were possible. Because each merge changes the glues around the affected tiles and can therefore affect the acceptability of later merges, the algorithm was initially restricted to considering pairs of tiles from an alternating ‘chequerboard’ subset, which, apart from edges, were likely to be mergeable; this guided the algorithm towards a mutually compatible sequence of merges. After exhausting acceptable merges from this subset, the algorithm attempted merges using all tiles in the system. After repeating this stochastic algorithm multiple times and selecting the system with the smallest number of tiles, the final system had 698 binding domains and 917 tiles, with 371 tiles shared between at least two shapes (Extended Data Fig. 2d).

After the assignment of abstract binding domains to each tile by the merging algorithm, the sequences for the binding domains, and thus tiles themselves, were generated using the sequence design software of Woods et al.16. Tiles used a standard SST motif, with alternating 10 and 11 nt binding domains, designed to have similar binding strengths as predicted using a standard thermodynamic model16,29,61. Following Woods et al.16, we set a target range of −8.9 to −9.2 kcal mol−1 for a single domain at 53 °C, which was between the melting temperature and growth temperature for their system. Null binding domains on the edges of shapes, not intended to bind to any other tiles, were assigned poly-T sequences.

Models of nucleation

To model the dependence of the nucleation rates of the three shapes on patterns of unequal concentration, we developed a simple nucleation model based on the stochastic generation of possible nucleation pathways and critical nuclei, which we call the Stochastic Greedy Model (SGM). The model estimates nucleation rates by analysing stochastic paths generated in a greedy manner by making single-tile additions starting from a particular monomer in the system. At each step, all favourable attachments are added and then an unfavourable attachment is performed with probability weighted by the relative free-energy differences of the available tile attachment positions. When multiple favourable attachments are available, the most favourable attachment is made deterministically. This procedure is repeated for many paths over all possible initial positions within the shape considered, and the barrier (highest free-energy state visited in ‘growing’ a full structure) is recorded for each path. A nucleation rate is estimated by assuming an equilibrium occupation of this barrier state (Arrhenius’ approximation26) and summing over the kinetics of the available attachments from this state (see Extended Data Fig. 4 and Supplementary Information section 2.2 for a detailed discussion). The approximations here could be improved by running fully reversible simulations, for example, using xgrow and the kinetic Tile Assembly Model59,62 augmented with Forward Flux Sampling63.
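A minimal Python sketch of the SGM’s greedy path sampling (our own illustration; free_energy and neighbours are assumed helper callables, and the attachment kinetics, prefactors and geometry handling of the full model are simplified away):

```python
import math
import random

def sample_barrier(start_site, shape, conc, free_energy, neighbours):
    """Greedily grow one assembly path from a single starting tile and return the
    highest free energy (the nucleation barrier) visited along the way."""
    assembly = {start_site}
    barrier = free_energy(assembly, shape, conc)
    while len(assembly) < shape.size:
        g0 = free_energy(assembly, shape, conc)
        # Free energy of the assembly after each available single-tile attachment.
        options = [(free_energy(assembly | {s}, shape, conc), s)
                   for s in neighbours(assembly, shape)]
        favourable = [(g, s) for g, s in options if g <= g0]
        if favourable:
            # Most favourable attachment is made deterministically.
            _, site = min(favourable, key=lambda x: x[0])
        else:
            # All attachments are unfavourable: pick one with probability weighted
            # by its relative free-energy cost.
            weights = [math.exp(-(g - g0)) for g, _ in options]
            site = random.choices([s for _, s in options], weights=weights)[0]
        assembly.add(site)
        barrier = max(barrier, free_energy(assembly, shape, conc))
    return barrier

def estimated_nucleation_rate(barriers):
    """Arrhenius-like estimate: sum of exp(-barrier) over sampled paths, up to prefactors."""
    return sum(math.exp(-b) for b in barriers)
```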

Fluorophore labels and DNA synthesis

Sites for fluorophore and quencher modifications were chosen to avoid edges, modify only unshared tiles and provide a reasonable distribution of locations on each shape. Fluorophores were chosen for spectral compatibility and temperature stability64. ROX, ATTO550 and ATTO647N were paired with Iowa Black RQ, and FAM was paired with Iowa Black FQ. Both fluorophore and quencher modifications were made on the 5′ ends of tiles; to sufficiently colocalize fluorophores and quenchers, one tile in the label pair used a reversed orientation (Fig. 4a). Fluorophore labels are discussed in detail in Supplementary Information section 3.

Tiles without fluorophore or quencher modifications were ordered unpurified (desalted) and normalized to 400 μM in TE buffer (Integrated DNA Technologies). Tiles with fluorophore or quencher modifications were ordered purified by high-performance liquid chromatography (HPLC) and normalized to 100 μM. Given that typically only 40 to 60% (or fewer) of the molecules in unpurified synthetic oligonucleotides are full length, it is remarkable (although consistent with Woods et al.16) that this did not prevent successful pattern recognition by nucleation.

Experimental overview

The basic workflow for the main experiments was as follows: for a chosen set of concentration patterns (flag or image), samples were prepared on a 96-well plate using an acoustic liquid handler to mix strand stocks in the necessary proportions; vortexed, spun and transferred to PCR tubes for the days-long anneal in the quantitative PCR (qPCR) machine; then samples were deposited on mica for AFM imaging. Fluorescence from the qPCR machine and AFM images were subsequently analysed.

Mixing and growth

Individual tiles were mixed, in the concentration patterns used for experiments, using an Echo 525 acoustic liquid handler (Beckman Coulter). Samples used TEMg buffer (TE buffer with 12.5 mM MgCl2) in a total volume of roughly 20 μl. Flag experiments used a 50 nM base concentration for unenhanced tiles and an 880 nM concentration for enhanced tiles, whereas pattern recognition experiments used tiles with nominal concentrations between 16.67 and 450 nM, which were then quantized into ten discrete values to simplify mixing and conserve material (Supplementary Information section 2.8).

For each concentration pattern in the flag experiments and pattern recognition of trained images, four samples were prepared, each with the same concentration pattern of tiles, but with tiles in different locations replaced by their fluorophore–quencher-modified alternates: one sample for each shape with tiles for all four fluorophore labels on only that shape, to monitor growth of multiple regions on each shape, and an additional sample with one fluorophore on each shape: ROX, ATTO550 (‘five’) and ATTO647N (‘six’) on H, A and M structures, respectively. To reduce the total number of samples, only the lattermost sample type was prepared for pattern recognition of test images. Fluorophore and quencher-modified tile locations always had tiles mixed at the lowest concentration used in the experiment.

After transferring samples to PCR tubes, samples were grown in an mx3005p qPCR machine (Agilent), to provide a program of controlled temperature over time while monitoring fluorescence. Growth protocols began with a ramp from 71 to 53 °C over 40 min to ensure any potentially pre-existing complexes were melted, and then a slower ramp from 53 °C to an initial growth temperature at 1 °C h−1. At this point, three different protocols were used. For constant temperature flag growth experiments, the growth temperature was 47 °C and this was held for 51 h. For temperature ramp flag growth, the initial growth temperature was 48 °C, which was reduced over the course of 100 h to 46 °C. For pattern recognition, a ramp from 48 to 45 °C over 150 h was used. For constant temperature experiments, fluorescence readings were taken every 12 min, and for other experiments, every 30 min. After the growth period, temperature was lowered to 39 °C at 1 °C per 26 min. See Supplementary Information sections 5 and 6 for temperature protocols plotted as a function of time. The experimental timescales and temperatures were chosen not to test the potential speed of selective nucleation, but rather to provide robustness to unknown nucleation temperatures and to convincingly show that nucleation of incorrect structures is limited over long timescales. Thus, on-target nucleation often took place during a comparatively short time and temperature window in the experiment, with the remaining time spent either above the expected nucleation temperature or waiting to observe potential off-target nucleation. We also did not try to optimize the system’s speed: the WTA mechanism suggests that significantly faster timescales are possible, and smaller assemblies would reduce the time needed for growth after nucleation. Because of the small sample volumes and long experiment duration, great care to avoid evaporation was necessary. Once protocols were finished, samples were stored at room temperature until ready for AFM imaging.
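For reference, the three growth protocols described above can be summarized as simple temperature segments; a sketch assuming a (start °C, end °C, hours) representation for the segments that follow the common initial ramps (71 to 53 °C over 40 min, then 53 °C down to the initial growth temperature at 1 °C h−1):

```python
# Each protocol: ordered list of (start_C, end_C, hours) growth segments.
protocols = {
    "flag_constant":       [(47.0, 47.0, 51.0)],    # hold at 47 C for 51 h
    "flag_ramp":           [(48.0, 46.0, 100.0)],   # 48 -> 46 C over 100 h
    "pattern_recognition": [(48.0, 45.0, 150.0)],   # 48 -> 45 C over 150 h
}
```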

Imaging

AFM imaging was performed using a FastScan AFM (Bruker) in fluid tapping mode directly after annealing was completed. In contrast to previous studies32,33,34 in which uniquely addressed SST shapes were gel purified before imaging, we did not do so here, thus we were able to observe assembly intermediates. To achieve better images, two techniques were combined: sample warming to prevent non-specific clumping of structures, and washing with Na-supplemented buffer to prevent smaller material, such as unbound, single DNA tile strands, from adhering to the mica surface. Each sample was diluted 50 times into TEMg buffer with an added 100 mM NaCl, then warmed to roughly 40 °C for 15 min. Next, 50 μl of the sample mix was deposited on freshly cleaved mica, then left for 2 min. As much liquid as possible was pipetted off the mica and discarded, then immediately replaced with Na-supplemented buffer again and mixed by pipetting up and down. This washing process of buffer removal and addition was repeated twice with added-Na buffer, then once with TEMg buffer to remove remaining Na, before imaging was performed in TEMg buffer. As adhesion of DNA to mica is dependent on the ratio of monovalent and divalent cations in the imaging buffer, this process was meant to ensure that unbound tiles were removed during the washing steps where both Na and Mg were present, whereas imaging itself took place with only Mg, so that the lattice structures adhered more strongly to the surface, resulting in better image quality.

Fluorescence and AFM data analysis

Fluorophore signals are known to be affected by extraneous factors such as temperature, pH, secondary structure and the local base sequence near the fluorophore64, which complicates quantitative interpretation of absolute fluorescence levels. Our own control experiments also illustrated effects due to partial assembly intermediates as well as due to the total amount of single-stranded DNA in solution (Supplementary Information section 3). For this reason, the fluorescence of each fluorophore was normalized to the maximum raw fluorescence value of that fluorophore in that particular sample, and the time at which the fluorescence signal decreased by 10% was then used as a measure of the extent of nucleation that appears less sensitive to these artefacts (Extended Data Fig. 5). The duration between the point of 10% quenching and the end of the growth segment of the experiment was defined as the ‘growth time’ for that fluorophore label; the growth time was defined as 0 in the event of quenching never reaching 10%. For concentration patterns with four samples with different fluorophore arrangements, the total growth time of a shape was defined as the average of the growth time of the five total fluorophore labels on the shape across the four samples (four in the shape-specific sample and one in the each-shape sample), whereas for concentration patterns with only one sample, the growth time of the corresponding fluorophore label was used. As the position of the fluorophore within the shape, relative to where nucleation occurs, has a substantial influence on growth time measurements, these measurements carry considerable variability relative to the true nucleation kinetics.
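A minimal sketch of this growth-time extraction (our own illustration; the data structures and time units are assumptions):

```python
def growth_time(times, fluorescence, end_of_growth, threshold=0.10):
    """Time from 10% quenching to the end of the growth segment, or 0 if never quenched.

    times         : list of measurement times (hours) for one fluorophore label
    fluorescence  : raw fluorescence values for that label in one sample
    end_of_growth : time (hours) at which the growth segment of the protocol ends
    """
    peak = max(fluorescence)
    normalized = [f / peak for f in fluorescence]   # normalize to the maximum raw value
    for t, f in zip(times, normalized):
        if f <= 1.0 - threshold:                    # first point 10% below the maximum
            return max(end_of_growth - t, 0.0)
    return 0.0                                      # quenching never reached 10%
```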

For flag experiments, AFM imaging was done only for qualitative confirmation of the selective nucleation and growth indicated by fluorescence results. For pattern recognition and equal-concentration experiments, however, shapes in AFM images were uniformly quantified. At least one sample of each of the patterns had three 5 × 5 μm images taken under comparable conditions. The correspondence between samples and images was blinded, and structures were counted independently by each of the four authors, who classified structures as either ‘nearly complete’ or ‘clearly identifiable’ examples of each of the three shapes. For the purposes of analysing pattern-dependent nucleation and growth, no clear distinction between the numbers of nearly complete and clearly identifiable shapes was found, and so the two categories were summed. Counts were averaged across the three images, then averaged across the counts of the four authors, to obtain a count per shape per 25 μm2 region for each pattern. Each author used their own subjective interpretation of ‘nearly complete’ and ‘clearly identifiable’ structures, and the total number of structures counted in each image differed by up to ±50% for different authors. However, the ratios of different shapes in each image counted by each author remained within 5% of the mean ratios for most images, and across all images no author had a bias of more than ±4% towards identifying a particular shape more or less often than average. Results are detailed in Supplementary Information section 6.3.

To measure the selectivity of a pattern, we used the on-target shape’s fraction of the total growth time and of the total AFM counts, summed over all three shapes. The total growth time and total AFM count of the on-target shape were used to measure overall shape growth.

Pattern recognition training

Images for pattern recognition were adapted from several sources (Fig. 5d). Each image was rescaled to 30 × 30, discretized to ten grayscale values and adjusted so that the number of pixels with each value was consistent across all images. Each pixel’s grayscale value, 0 ≤ pn ≤ 1, was converted to the concentration ci for the corresponding tile ti, where i = θ(n), using an exponential formula, \(c_i=c\,e^{3p_n\ln 3}\), where the base concentration is c = 16.67 nM; these values were chosen so that the average tile concentration was 60 nM for each image. As each image had 900 pixels and there are 917 tiles in the system, 17 tiles did not have their concentrations set by any pixel; these tiles were uniformly set to the lowest concentration, and their assignment was used to ensure that fluorophore label locations did not vary in concentration.
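As a quick numeric check of this formula (our own sketch, with concentrations in nM):

```python
import math

def pixel_to_concentration(p_n, c=16.67):
    """c_i = c * exp(3 * p_n * ln 3) in nM, for grayscale value 0 <= p_n <= 1."""
    return c * math.exp(3 * p_n * math.log(3))

# p = 0 gives the 16.67 nM base concentration; p = 1 gives 16.67 * 27, roughly 450 nM,
# matching the stated 16.67-450 nM range of nominal tile concentrations.
print(pixel_to_concentration(0.0), pixel_to_concentration(1.0))
```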

The tile-pixel assignment was optimized through a simple hill-climbing algorithm, starting from a random assignment, in which random modifications to the assignment map were attempted at each step and accepted if the move increased the efficacy of the map. This efficacy was quantified through a heuristic function that accounts for relative nucleation rates, location of nucleation sites (with preference given to locations that succeeded in the flag experiments shown in Fig. 4d) and satisfaction of constraints related to the fluorescent reporters. Because the nucleation algorithm described above, the SGM, is computationally expensive, a simpler model of nucleation, which we call the Window Nucleation Model (WNM), was used to evaluate relative nucleation rates for most of the optimization steps. The WNM is based on the Boltzmann-weighted sum of concentrations over a k × k window swept over each structure, similar to the model used in Zhong et al.24. The more detailed but computationally costly SGM was then used for an additional several hours of optimization in the hope of further improving the mapping. The WNM, along with all constraints on nucleation location and fluorescent reporters, was also used to explore the capacity of this map-training procedure in Extended Data Fig. 8. Details of the pattern recognition training and the window-based nucleation model are discussed in Supplementary Information sections 2.4 and 2.5.
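A minimal sketch of the hill-climbing loop with a window-style nucleation score (our own illustration; the window score below is one plausible reading of a ‘Boltzmann-weighted sum over a k × k window’ and is only a stand-in for the actual WNM, loss terms and constraints described in Supplementary Information sections 2.4 and 2.5):

```python
import math
import random

def window_score(tile_conc, shape_grid, k=5, u0=1.0):
    """Crude window-style nucleation score: a Boltzmann-type weight accumulated over
    every k x k window of a shape (stand-in for the WNM)."""
    rows, cols = len(shape_grid), len(shape_grid[0])
    total = 0.0
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            log_w = sum(math.log(tile_conc[shape_grid[r + dr][c + dc]] / u0)
                        for dr in range(k) for dc in range(k))
            total += math.exp(log_w)
    return total

def hill_climb(theta, efficacy, propose_modification, steps=100_000):
    """Accept random modifications of the pixel-to-tile map only if efficacy improves."""
    best = efficacy(theta)
    for _ in range(steps):
        candidate = propose_modification(theta)   # e.g. swap the tiles assigned to two pixels
        score = efficacy(candidate)
        if score > best:
            theta, best = candidate, score
    return theta
```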