## Introduction

RNA molecules form structures through base-pairing interactions between complementary regions. Frequently, a given region of an RNA molecule is complementary both to another region on the same molecule and to a region on a different RNA molecule. How is the competition between forming intra- and inter-molecular contacts decided?

Predicting the outcome of this competition is a major open question, affecting a wide swath of both in vivo and in vitro phenomena. The effects of this competition are particularly stark in the context of biological condensates, in which RNA–RNA interactions play a major, largely understudied, role1,2,3,4,5,6. While typical condensates often involve RNA–protein contacts, purely RNA-based aggregation phenomena have been observed both in vitro and in vivo for certain transcripts associated with repeat expansion disorders7.

The expansion of repeats in certain sections of DNA has been implicated in a significant number of (primarily) neurodegenerative disorders including Huntington’s disease, myotonic dystrophy, and Fragile X syndrome8,9,10. While the proximate cause of many of these disorders is the effect of the expansion on the protein sequence, these expansions can lead to effects at the level of the RNA as well11,12,13,14,15,16,17, including an aggregation transition7,18. In particular, RNAs containing CAG or CUG repeats were found by Jain & Vale to phase separate depending on the number of repeats present in each molecule, driven by GC stickers binding to one another7. Since all GC stickers are self-complementary, it is not immediately clear what leads RNA molecules in certain parameter regimes to form inter- vs. intra-molecular contacts at different rates. Aggregation was observed when the number of repeats per strand exceeded ~30, roughly the same number of repeats that leads to disease in humans7. This phenomenon was also observed and further studied in molecular dynamics (MD) simulations of the system by Nguyen et al.19. These simulations were able to explore the molecular details of the aggregation transition, at the cost of each simulation (at a different concentration or number of repeats per strand) requiring ~3 months of supercomputer time.

Current models are insufficient to explore the properties of the aggregation transition demonstrated by these studies. State-of-the-art models of associative polymers either do not include a competition between intra- and inter-molecular binding (as is more natural for rigid proteins and for heterotypic interactions) or (erroneously) assume it has no qualitative effects on the resulting system20,21,22. While intra-chain interactions are typically ignored, exceptions do exist. These include Dobrynin’s 2004 study extending the Flory–Stockmayer approach to include intra-chain associations23, and a recent publication by Weiner et al. which found that self-bonds play a crucial role in determining phase behavior in a lattice system with heterotypic binding motifs of varying lengths24.

Here, we derive an analytical model to describe a system of polymers with self-complementary stickers. Eschewing mean-field-theory approaches that have dominated the field, we employ a multimerization-based framework that predicts the entire multimerization landscape in addition to the phase behavior, and thus naturally and explicitly considers the competition between intra- and inter-molecular contacts25. Quantitative consideration of this competition reveals that configurational entropy, arising from the multiplicity of ways to form bonds, is the driving force for aggregation in this system. Mapping out the complete phase diagram, we find that as a result of the competition between intra- and inter-molecular bonds, the system exhibits a tunable reentrant phase transition as a function of sequence or temperature. With very strong stickers (or low temperatures) the polymers fold into stable monomers and dimers, and are more likely to form aggregates at intermediate sticker strengths. We furthermore find that, for long enough linkers that enable adjacent stickers to bind, the parity of the number of stickers per strand affects not only the dimerization transition but the large-scale aggregation behavior as well. We validate our results by comparing them to a computational model that enumerates the complete landscape of intra- and inter-molecular structures that the RNA can form, and by comparing them to the results of the Jain & Vale and Nguyen et al. studies7,19. Our work provides a unified framework to explain both dimerization and aggregation phenomena in CAG repeat systems17,19 and extends these to arbitrary sequences, temperatures, and concentrations, thus setting the stage for the construction of novel materials and new techniques based on programmable RNA condensates.

## Results

### Equilibrium behavior is predicted by an analytical model

We consider a nucleic acid sequence comprised of n identical stickers (Fig. 1a). The stickers are separated by n−1 equally spaced linkers that do not interact with the stickers. Each linker consists of l nucleotides. Stickers are self-complementary and bind through base pairing interactions, such that each sticker can be bound to at most one other. Each bonded sticker has a free energy contribution of Fb; however, bonds that create closed loops also have an entropic cost ΔSloop that depends on the loop length lloop. This is because nucleotides comprising a closed loop (such as a hairpin, internal, or multi-loop) are constrained in the conformations they can adopt. A simple model treating unbound nucleotides as a polymer random walk estimates that the entropic cost of forming loops scales logarithmically with the loop length (see the “Methods” section)26,27. Assuming a characteristic loop length leff, the effective strength of the sticker interactions is F ≡ Fb − TΔSloop(leff) (see the “Methods” section).

In this work, we are concerned with the behavior resulting from such sequences interacting with one another. Two stickers that bind to one another may be on the same strand or on two different strands. Moreover, many strands can be connected to one another through a chain of such bonds. We call a group of m strands connected through a series of intermolecular bonds a multimer of size m, or an m-mer. There are many ways a multimer of size m can form: any combination of bonds that occur either intra- or inter-molecularly within a group of m strands, such that each strand is reachable from every other by following a series of intermolecular bonds, is an m-mer.
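The definition of an m-mer above amounts to finding connected components of strands, where only intermolecular bonds join strands together. The following is a minimal sketch of that bookkeeping, not part of the original model code; the function name `multimer_sizes` and the bond representation as `((strand, sticker), (strand, sticker))` pairs are assumptions made for illustration.

```python
from collections import Counter

def multimer_sizes(num_strands, bonds):
    """Return a Counter mapping multimer size m -> number of m-mers.

    bonds: iterable of ((strand_a, sticker_i), (strand_b, sticker_j)) pairs
    (hypothetical representation). Bonds with strand_a == strand_b are
    intramolecular and therefore never merge two strands.
    """
    parent = list(range(num_strands))

    def find(x):
        # Union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (sa, _), (sb, _) in bonds:
        if sa != sb:  # only intermolecular bonds connect strands
            ra, rb = find(sa), find(sb)
            if ra != rb:
                parent[ra] = rb

    strands_per_root = Counter(find(s) for s in range(num_strands))
    return Counter(strands_per_root.values())

# Five strands: 0-1-2 are chained by intermolecular bonds (a trimer),
# strand 3 only has an intramolecular bond, strand 4 is unbound.
bonds = [((0, 0), (0, 3)), ((0, 1), (1, 2)), ((1, 0), (2, 0)), ((3, 1), (3, 2))]
sizes = multimer_sizes(5, bonds)
```

In this toy configuration the intramolecular bonds on strands 0 and 3 change the bond count but not the multimer sizes, which is the distinction the text draws between intra- and inter-molecular contacts.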

We consider a system of M strands present in a container of volume V, such that their concentration is ctot = M/V. We take the thermodynamic limit of M and V going towards infinity with their ratio staying constant. We seek to predict how frequently multimers comprised of m strands form in this system, and how this frequency changes with m. We define cm as the concentration of multimers of size m, such that

$$c^{\rm tot}=\sum_{m=1}^{\infty} m\,c_{m}. \tag{1}$$

There are two possible regimes for the system: For large m, cm either decreases or increases with m (Fig. 1a). In the former case, the system is in the dilute phase, with only small multimers typically forming. In contrast, if cm increases with m, large aggregates of the order of the system size dominate the landscape. The aggregation transition is defined as the crossover point between the regime in which very large multimers are suppressed, to that in which they are dominant.

In equilibrium, cm is proportional to the ratio of the partition function of m-mers, Zm, to the partition function of m monomers, $${({Z}_{1})}^{m}$$ (see the “Methods” section). The partition functions are comprised of three terms:

$$Z_{m}={\rm e}^{-\beta (m-1)\Delta F}\sum_{N_{\rm b}} g(n,\,m,\,N_{\rm b})\,{\rm e}^{-\beta F N_{\rm b}}. \tag{2}$$

Here, the multiplicity factor g(n, m, Nb) represents the number of distinct ways to make Nb bonds connecting m identical strands, each with n stickers. ΔF is the effective free energy cost of multimerization (see below) and β = 1/kBT is the inverse thermal energy, where T is temperature. g can be calculated exactly (see the “Methods” section and Supplementary Note 1) and is qualitatively different depending on whether the linkers are long enough to allow adjacent stickers to bind to one another or not (Fig. 1b).

In order to fit experimental data on the prevalence of multiple nucleic acid strands binding to one another in vitro, nucleic acid models include a free energy penalty for multimerization. This leads to the term (m−1)ΔF in Eq. 2. This penalty is motivated by the enthalpic and entropic costs of nucleic acids binding, including ion effects and the translational and orientational entropies lost upon association28,29,30. This penalty scales linearly with the number of strands in a multimer, such that each additional strand added to a multimer carries the same penalty31. See the “Methods” section and Supplementary Note 2 for further discussion.

The sum in Eq. 2 can be approximated by its dominant term (a saddlepoint approximation). There are three regimes to consider, corresponding to strong, intermediate, and weak binding, in which the sum in Eq. 2 is dominated by large, intermediate, and small values of Nb, respectively (Fig. 1c). The value of $${N}_{{\rm {b}}}={N}_{{\rm {b}}}^{\star }$$ that dominates the sum is that which maximizes a combination of the bond energy F and configurational entropy g. For example, the strong binding regime is characterized by bond energy considerations overwhelming configurational entropy effects, while the intermediate binding regime is characterized by a degree of balance between the two.

### The model is validated by comparing to exact computational enumeration and previously published results

To validate the analytical model, we constructed a dynamic programming-based computational model that exactly enumerates Zm in polynomial time (Supplementary Note 5.2). The analytical model described above makes three primary approximations compared to the computational model: (1) it assumes a constant entropy for all loops; (2) it considers only structures with a given number of bonds Nb (with a single next-order correction term); (3) it uses an approximate form for g(n, m, Nb) (see the “Methods” section). The computational model makes none of these approximations, considering all (non-pseudoknotted; see Supplementary Note 5.1) structures that can form and including a loop-length-dependent loop entropy term.

Nevertheless, the analytical model closely approximates the exact computational model, as demonstrated in Fig. 2. The analytical model requires only one fitting parameter: the normalized effective loop length $${l}_{{{{{{{{\rm{eff}}}}}}}}}^{{{{{{{{\rm{fit}}}}}}}}}$$ (see the “Methods” section). That parameter is fit separately to the regimes allowing and disallowing neighbor binding. Importantly, it is fit only once for each regime—to the monomer partition function with strong binding—and not separately for different values of n, m, or Fb. We demonstrate quantitative agreement between the analytical and computational models in Fig. 2, and in Supplementary Fig. 5.

We further sought to compare the model’s predictions to previously published results, namely the MD simulations performed by Nguyen et al.19. Those simulations examined 64 CAG-repeat RNA strands at varying numbers of repeats per strand and varying RNA concentrations. We considered the same system of CAG sequences, using the value Fb = −10 employed in the MD simulations and no fitting parameters beyond the aforementioned single parameter fit to the computational model. We enumerated the monomer and dimer partition functions computationally, and used the analytical model to extrapolate up to m = 64, the number of strands used in the MD simulations. The extrapolation was performed by fitting the single parameter to our computational results for m = 1, and using Supplementary Eqs. S38 and S48 to obtain the results for m > 2. The primary difference between our model predictions and those of the MD simulations is that the former are purely equilibrium predictions, while the latter are decidedly not in equilibrium, even after significant simulation time. (A secondary difference is that the former considers an infinite system of given concentration, while the latter considers a finite number of strands.)

We plot the propensity of the system to form aggregates as a function of n and ctot in Fig. 3. Following ref. 19, we define multimers of size 2 ≤ m ≤ 4 as oligomers; however, this ensemble is dominated by dimers, with trimers and tetramers forming at very low fractions. We find that for certain concentrations, the system forms either monomers or dimers depending on the parity of n, in agreement with experimental results17; however, this parity does not significantly affect aggregation. We plot the results of Nguyen et al. on top of our predictions as colored points, finding excellent quantitative agreement between the two.

### A reentrant phase transition governs aggregation as a function of sticker strength

For very low temperatures or strong stickers, the ensemble of multimers is dominated by small structures such as dimers, in which all bonds can be satisfied. However, for intermediate sticker strengths, the configurational entropy gain of having a few unsatisfied bonds exceeds the energetic cost. This configurational entropy grows with multimer size, driving the system to aggregate. Finally, for very weak stickers or high temperatures, the structures melt. This phenomenon corresponds to a reentrant phase transition. We demonstrate this transition in our computational model in Fig. 4, enumerating up to m = 15. As shown in the figure, the two dilute phases at strong and weak binding regimes are quite different from one another. In the strong binding regime, (almost) all bonds are satisfied in a typical structure, mainly through intramolecular interactions or dimerization. In the weak binding regime, (almost) no bonds are typically satisfied.

We next explored whether this reentrant transition was merely a small-m effect. We employed the analytical model, for which we can consider arbitrarily large values of m. Even when considering m → ∞, we find a reentrant transition in the threshold concentration above which the system is expected to form aggregates, $$c_{\rm thresh}^{\rm tot}$$ (see Supplementary Note 4), as shown in Fig. 5a. This transition is especially prominent for short linkers that disallow neighbor binding since the configurational entropy of dimers in this regime is quite limited (regardless of n, only one dimer configuration can satisfy all stickers). For longer linkers (allowing neighbor binding), this transition is most pronounced for even values of n for which monomers can satisfy all their own bonds, although it is apparent also for odd n, for which dimers can satisfy all bonds.

The behavior shown in Fig. 5 is in agreement with what we would expect from configurational entropy concerns alone (Supplementary Fig. 6). That the propensity of the system to aggregate occurs at more negative values of βF, and is more pronounced, for the case of disallowing neighbor binding than for the case of allowing neighbor binding, is predicted by the different forms of the configurational entropy in these two regimes. Similarly, larger values of n increase the propensity of the system to aggregate because of their effect on configurational entropy, rather than any enthalpic considerations (Supplementary Fig. 6).

## Discussion

In this work, we have considered a simple model of competition between intra- and inter-molecular binding: a polymer with n identical evenly spaced self-complementary stickers. We have shown that the system is characterized by three parameters: n, the number of repeats per strand; βF, the effective strength of each bond accounting for the loop entropy cost; and $$c^{\rm tot}{\rm e}^{-\beta \Delta F}$$, a dimensionless concentration that accounts for multimerization cost.

Our model computes the prevalence of all possible multimers that can form, considering both intra-strand and inter-strand contacts. Our framework quantitatively recapitulates previously published MD simulation results, each data point of which required 3 months of supercomputer simulation time19. We substantially extend these results to arbitrary sequences, temperatures, and concentrations, and to arbitrarily large multimers (i.e. aggregates) in an analytical framework.

In this system, aggregation is not necessarily predicted as the regime where the most possible bonds are satisfied, as bonds can be satisfied by intramolecular as well as by intermolecular contacts. Instead, aggregation is predicted by the relative stability of the aggregate compared to smaller multimers. The stability of each structure is a function of three terms, as seen in Eq. 2: (1) the number of stickers bound (each contributes F to the free energy); (2) the number of strands in the structure (each contributes μ + ΔF, where μ is the chemical potential); and (3) the configurational entropy of the structure. This last term contributes $$-\log (g)/\beta$$ to the free energy, where g is the number of ways to satisfy the given number of bonds with the given number of strands in the structure.

This last term is the driving force for aggregation in this system. Aggregates are no more stable than dimers in terms of the first term, the possible number of stickers bound (both are able to satisfy all stickers). Aggregates are further penalized by the second term, the multimerization cost. If these two terms were the only terms in the free energy, we would not see any aggregates. It is the third term, the configurational entropy, that drives aggregation. Larger multimers are able to satisfy their bonds in many more configurations than a corresponding collection of smaller multimers, leading to an enormous entropic benefit in forming aggregates. This has been described as a competition between configurational and translational entropies in other contexts24,32. In our system, the benefit due to g peaks when most, but not all, stickers are satisfied (Supplementary Fig. 6).

This behavior leads to a reentrant phase transition. For −βF ≫ 1, the number of bonds satisfied is the primary consideration. Dimers are able to satisfy all their bonds, and the multiplicity benefit of aggregates is not sufficiently large when all bonds are satisfied, suppressing aggregation in this regime. Aggregation is also suppressed for very positive values of βF, which as a result of loop entropy costs can occur even when the sticker binding itself is favored (i.e. Fb < 0). However, for intermediate values of βF, when dimers prefer having some bonds left unsatisfied, the configurational entropy benefit of forming aggregates is overwhelming. Aggregates form at 1–2 orders of magnitude lower concentrations in this regime than in the strong binding regime.

The predicted aggregation transition of the system is completely described in Fig. 5b. We plot the (dimensionless) threshold concentration $${c}_{{{{{{{{\rm{thresh}}}}}}}}}^{{{{{{{{\rm{tot}}}}}}}}}$$ as a function of n and βF. Aggregation is more prevalent for short linkers (disallowing neighbor binding) than for longer linkers (allowing neighbor binding). For short linkers, small structures are quite constrained in the number of ways they can satisfy all of their bonds, leading the differential configurational entropy benefit of aggregates to grow quite large. For longer linkers, smaller structures are more stable since the corresponding multiplicity is much larger. For similar reasons, the reentrant phase transition is most pronounced with short linkers. For long linkers, even values of n demonstrate a more pronounced reentrant transition than odd values, since their competition is between monomers—with no multimerization penalties—and aggregates. In all other cases, the reentrant transition is primarily due to competition between dimers and aggregates. For short linkers, the parity of n is found in our model to affect monomerization vs. dimerization in agreement with previously published results17, but has almost no effect on aggregation properties. The reason is that for short linkers and strong stickers, dimers behave similarly regardless of the parity of n: both odd and even n can form a dimer satisfying all bonds with only one configuration.

Although there is a qualitative difference between short linkers of l < 3 and long linkers of l ≥ 3, within each regime, increasing the linker length leads to larger values of ΔS and weaker binding. Decreasing the persistence length, for example by changing ionic conditions, would be expected to lead to a similar result. These effects and the predicted phase diagram as a whole (Fig. 5b) could be at least qualitatively tested experimentally by replicating the Jain & Vale experiments for multiple sequences with different sticker strengths and linker lengths and measuring the change in the concentration needed to form aggregates for the different conditions. The available published data is in good agreement with our predictions, in that larger values of n show a greater propensity for aggregation in both experiments and our model predictions7.

Our results bear similarities to the so-called “magic number effect” whereby for heterotypic mixtures, aggregation is suppressed when the number of binding sites in one species is a small integer multiple of the other’s32,33. In such systems, small stable clusters can form with all bonds satisfied. In our homotypic system, dimers can always exhibit a magic number-like effect for strong stickers, and in the regime in which neighbor binding is allowed, for even n, monomers can as well. In fact, a weak reentrant transition has been observed in some simulations of the magic number effect in heterotypic systems (see Fig. 3A of ref. 34). Our results suggest that a reentrant transition may be a generic feature of the magic number effect and that the strength of the reentrant behavior may decay the more molecules are involved.

Our model has several limitations. To make the expression analytically tractable, our formalism makes a heuristic approximation for the multimer multiplicity factor g in the regime disallowing neighbor bonds. For similar reasons, we were unable to analytically explore the weak binding regime, applicable for systems where the loop entropy cost of forming stickers outweighs their energetic benefit. A limitation of our model’s physiological applicability is that we did not explicitly consider magnesium. Magnesium can act as a bridge between negatively charged RNA molecules such that even in the absence of base pairing, Mg–RNA mixtures can form aggregates18,35. Experimental results thus rely on magnesium aiding the aggregation process7. However, the MD simulations to which we compare here do not explicitly consider magnesium19 and the high concentrations required for the system to aggregate (e.g. Fig. 3) are the result. To first order, the effects of magnesium could be accounted for in our model as modifying ΔF (along with Fb), which effectively modifies the concentrations, as concentrations only enter the model as $$c^{\rm tot}{\rm e}^{-\beta \Delta F}$$. For clarity, we opted to leave ΔF unmodified; therefore, the high concentrations we consider should be significantly decreased for a system including magnesium.

While non-equilibrium effects are relevant in these systems, our analysis is entirely an equilibrium prediction. Indeed, kinetic trapping appears to be the biggest experimental hurdle to testing our reentrant phase predictions. At the same time, the results of decidedly out-of-equilibrium MD simulations19 show excellent quantitative agreement with our equilibrium predictions (Fig. 3). For this reason, it is likely that out-of-equilibrium effects are not the dominant factor in repeat RNA aggregation behavior. In vivo RNA aggregates are even more fluid-like and dynamic than in vitro aggregates, for reasons that remain largely unclear but appear to be the result of active enzymes in the cell7. Future work may consider how such active processes affect the aggregation properties, and the connection between in vivo non-equilibrium steady states and the equilibrium steady state discussed here.

Given the radical simplicity of the model used here, there is a host of extensions to consider. For example: How does this model interact with complex coacervation, as when including polycations in the solution? How does a polymer pattern with multiple orthogonal stickers behave? How do multiple different polymers, with both cis and trans binding, interact with one another? And how do physiological RNA molecules use the principles explored here to control their aggregation properties?

Our work demonstrates that the competition between intra- and inter-molecular binding can lead to remarkable and (perhaps) unintuitive behavior. Our results mapping the control knobs for this phase behavior create a framework for the study of RNA–RNA interactions in in vivo biological condensates and set the stage for the construction of novel materials and new techniques based on programmable RNA condensates.

## Methods

### Partition functions determine equilibrium behavior

We consider a nucleic acid sequence comprised of n stickers separated by n−1 linkers (Fig. 1a). Stickers are self-complementary and bind through base pairing interactions, such that each sticker can be bound to at most one other sticker. The strength of the sticker interactions, Fb, is determined by the sequence of the stickers; for example, an RNA GC sticker with A nucleotide linkers in standard conditions has Fb = −6.4 kcal/mol (or, for DNA, −1.4), while a GCGC sticker has Fb = −12.2 kcal/mol (−5.8 for DNA). These are calculated using the classic nearest-neighbor model for RNA or DNA base-pairing interactions29,30. The linkers, each of which is of length l, are inert.

We seek to predict how frequently multimers comprised of m strands form, and how this frequency changes with m. Aggregation occurs in the parameter regime where the concentration of multimers comprised of m strands, cm, increases with m. cm is defined as the total concentration of all structures that have m strands connected by base pairing interactions. In equilibrium, cm is proportional to the partition function of m-mers, Zm:

$$Z_{m}=\sum_{\sigma_{m}}{\rm e}^{-\beta F(\sigma_{m})}. \tag{3}$$

Here, σm is a structure comprised of m strands linked by base pairing, including potential intramolecular bonds; and β = 1/kBT where kB is Boltzmann’s constant and T is the temperature measured in Kelvin. F(σm) is the free energy of the structure, given by29

$$F(\sigma_{m})=F_{\rm b}N_{\rm b}(\sigma_{m})+(m-1)\Delta G_{\rm assoc}-T\sum_{\rm loops}\Delta S_{\rm loop}(l_{\rm loop}), \tag{4}$$

where Nb(σm) is the number of bonds in the structure, and ΔGassoc is the hybridization penalty associated with intermolecular binding (discussed below). Each closed loop of length lloop leads to an entropic penalty of ΔSloop(lloop), associated with the decrease in three-dimensional configurations of the single-stranded region of the loop compared to a free chain, given by26,27

$$\Delta S_{\rm loop}(l_{\rm loop})=k_{\rm B}\left[\ln v_{\rm s}+\frac{3}{2}\ln \left(\frac{3}{2\pi b\,l_{\rm loop}}\right)\right], \tag{5}$$

where vs is the volume within which two nucleotides can bind, and b is the persistence length of single-stranded regions. This equation treats the single-stranded loop as an ideal chain. An excluded volume term $$v m^{2}$$ can be added to Eq. 4 (ref. 20), but we assume v is small enough that this term is negligible except for very large m (see Supplementary Note 4 for further discussion).
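To illustrate the logarithmic dependence in Eq. 5, the sketch below evaluates the loop entropy in units of kB for a few loop lengths. The default values `v_s = 1.0` and `b = 1.0` are placeholders in arbitrary consistent units, chosen only for illustration and not taken from this work; the function name `delta_s_loop` is likewise an assumption.

```python
import math

def delta_s_loop(l_loop, v_s=1.0, b=1.0):
    """Ideal-chain loop entropy of Eq. 5, returned in units of k_B.

    v_s and b are placeholder values (assumed, arbitrary consistent units).
    """
    return math.log(v_s) + 1.5 * math.log(3.0 / (2.0 * math.pi * b * l_loop))

for l_loop in (4, 8, 16, 32):
    print(l_loop, round(delta_s_loop(l_loop), 3))

# Doubling the loop length always shifts the entropy by -(3/2) ln 2,
# independent of v_s, b, and the starting length.
step_short = delta_s_loop(8) - delta_s_loop(4)
step_long = delta_s_loop(32) - delta_s_loop(16)
```

Because each doubling of the loop length costs the same fixed amount of entropy, moderate variation in loop length shifts ΔSloop only slightly, which is why a single characteristic loop length can be a reasonable stand-in.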

Given the partition functions Zm for all m-mers, we can calculate the equilibrium concentrations of m-mers, cm, for all m, by solving a set of m simultaneous equations. Zm affects physical observables such as cm only through the ratio $${Z}_{m}/{Z}_{1}^{m}$$, describing, in essence, the propensity of m strands to form an m-mer as opposed to m monomers25,31:

$$\begin{aligned} c_{m} &=\frac{Z_{m}}{Z_{1}^{m}}\,c_{1}^{m},\\ \sum_{m} m\,c_{m} &=c^{\rm tot}, \end{aligned} \tag{6}$$

where the concentrations are made dimensionless by normalizing by a reference concentration (see Supplementary Note 2) and ctot is the total concentration of strands added to solution. In short, this equation arises from $$c_{m}=Z_{m}{\rm e}^{m\beta \mu}$$, where μ is the chemical potential and the fugacity is $${\rm e}^{\beta \mu}=c_{1}/Z_{1}$$ in equilibrium25.
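One way to solve Eq. 6 numerically is sketched below: since every cm is fixed by c1, the mass-conservation constraint becomes a single monotonic equation for c1 that can be solved by bisection. This is a minimal illustration, not the procedure used in this work; the function name, the truncation at a finite maximum m, and the toy values of Zm/Z1^m are all assumptions.

```python
import math

def solve_concentrations(log_K, c_tot, iters=200):
    """Solve sum_m m * K_m * c_1**m = c_tot for c_1, then return all c_m.

    log_K[m-1] = ln(Z_m / Z_1**m), so log_K[0] must equal 0.
    The sum is truncated at m_max = len(log_K) (an assumption of this sketch).
    """
    def total(c1):
        log_c1 = math.log(c1)
        return sum(m * math.exp(lk + m * log_c1)
                   for m, lk in enumerate(log_K, start=1))

    # total() increases with c_1, and total(c_tot) >= c_tot because K_1 = 1,
    # so the root lies in (0, c_tot].
    lo, hi = 0.0, c_tot
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > c_tot:
            hi = mid
        else:
            lo = mid
    log_c1 = math.log(0.5 * (lo + hi))
    return [math.exp(lk + m * log_c1) for m, lk in enumerate(log_K, start=1)]

# Toy ratios Z_m / Z_1**m for m = 1, 2, 3 (illustrative values only)
log_K = [0.0, math.log(5.0), math.log(2.0)]
c = solve_concentrations(log_K, c_tot=1.0)
```

Working with logarithms of the ratios keeps the terms well behaved when Zm/Z1^m spans many orders of magnitude, although for very large m a log-sum-exp formulation would be safer than the direct sum used here.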

Solutions to Eq. 6 have two typical regimes. In one, cm decays exponentially with m. In the other, cm grows with m (until excluded volume effects begin to dominate). The latter regime corresponds to aggregation (Fig. 1a).

### An analytical model for the partition functions

The calculation of Zm is too computationally intensive to perform directly, by explicitly enumerating all possible structures that can form, as the number of possible structures grows exponentially with n and m. In order to predict phase behavior for a wide range of sequences and experimental conditions, we develop an analytical framework for computing Zm. This framework enables us to search a broad parameter space and tune phase behavior in the system. We validate our analytical model against a computational model that exactly calculates Zm with a dynamic programming approach (Supplementary Note 5.2) thus providing an exact baseline model for comparison.

We rely on one major assumption to enable an analytical approach: we approximate the loop entropies as independent of loop length; or equivalently, we assume that the model is dominated by loops of one characteristic length, leff. This length depends on the length of the linkers in the system, l. This approximation is reasonable because of two factors. First, because of the logarithmic dependence of ΔSloop on loop length (Eq. 5), moderate heterogeneities in loop length lead to only small differences in ΔSloop. Second, because the typical number of loops in a multimer scales linearly with the size of the multimer (see Supplementary Note 3), we expect similar levels of heterogeneity in loop length independent of the size of the multimer. This approximation is expected to break down for very large n and weak binding (Fb > 0), in which case the few loops that typically form will likely have a broad distribution of lengths; this regime is not considered here.

With this approximation, for monomers, each bond provides a constant free energy of F = Fb − TΔS, where ΔS = ΔSloop(leff). Since the number of loops is given by Nb−(m−1), we also define ΔF ≡ (ΔGassoc + TΔS). This quantity enters Eq. 6, such that it allows us to define a rescaled concentration $$c\,{\rm e}^{-\beta \Delta F}$$ (see also Supplementary Note 2). Without rescaling concentration, the partition function Zm can thus be written as

$$\begin{aligned} Z_{m} &={\rm e}^{-\beta (m-1)\Delta F}\sum_{\sigma_{m}}{\rm e}^{-\beta F N_{\rm b}(\sigma_{m})}\\ &={\rm e}^{-\beta (m-1)\Delta F}\sum_{N_{\rm b}} g(n,\,m,\,N_{\rm b})\,{\rm e}^{-\beta F N_{\rm b}}, \end{aligned} \tag{7}$$

where the multiplicity factor g(n, m, Nb) represents the number of distinct ways to make Nb bonds connecting m identical strands, each with n stickers. This is identical to Eq. 2.
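As a brief consistency check, sketched here using only the definitions above and assuming every loop carries the same entropy ΔS = ΔSloop(leff), the free energy of Eq. 4 with a loop count of Nb−(m−1) becomes

$$\begin{aligned} F(\sigma_{m}) &=F_{\rm b}N_{\rm b}+(m-1)\Delta G_{\rm assoc}-T\Delta S\left[N_{\rm b}-(m-1)\right]\\ &=(F_{\rm b}-T\Delta S)\,N_{\rm b}+(m-1)\left(\Delta G_{\rm assoc}+T\Delta S\right)\\ &=F\,N_{\rm b}+(m-1)\,\Delta F, \end{aligned}$$

so the Boltzmann weight of each structure factorizes into $${\rm e}^{-\beta (m-1)\Delta F}\,{\rm e}^{-\beta F N_{\rm b}}$$. Grouping structures by their number of bonds then replaces the sum over structures with a sum over Nb weighted by the multiplicity g(n, m, Nb).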

This multiplicity factor is most straightforward to consider for the case of monomers. We make the approximation that the contribution of pseudoknots to the partition function is negligible due to their high entropic cost (see Supplementary Note 5.1). Our goal is therefore to calculate the number of ways to form non-pseudoknotted structures containing Nb bonds given a strand of n stickers. For monomers, the multiplicity can be calculated exactly. However, the result depends on whether adjacent stickers are able to bind to one another or not. For a long enough linker length (≥3 nts for the case of RNA), neighboring stickers can bind; for shorter linker lengths (as, for example, for CAG repeats), they cannot (see Fig. 1b). As derived in Supplementary Note 1.1,

$$g(n,\, 1,\, {N}_{{\rm {b}}})=\left\{\begin{array}{ll}\frac{n!}{(n-2{N}_{{\rm {b}}})!\,({N}_{{\rm {b}}}+1)! \,{N}_{{\rm {b}}}!} \hfill &\,{{\mbox{if adjacent stickers can bind}}}\, \hfill \\ \frac{(n-{N}_{{\rm {b}}})!\,(n-{N}_{{\rm {b}}}-1)!}{(n-2{N}_{{\rm {b}}})!\,(n-2{N}_{{\rm {b}}}-1)!\,({N}_{{\rm {b}}}+1)!\,{N}_{{\rm {b}}}!} \hfill &\,{{\mbox{otherwise}}}\hfill \end{array}\right.$$
(8)

The top line (allowing neighbor binding) is simply calculated as the product of two factors: $$\binom{n}{2{N}_{{\rm {b}}}}$$ (the number of ways to choose 2Nb bound stickers from n possibilities); and the Catalan number $${C}_{{N}_{{\rm {b}}}}$$ (the number of non-pseudoknotted ways to construct bonds between the chosen stickers). The bottom line (disallowing neighbor bonds) requires a brief additional calculation to derive (Supplementary Note 1.1).
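As a sanity check on Eq. 8, the closed forms can be compared against a direct recursive count of non-pseudoknotted structures. The sketch below is illustrative only: the function names `g_monomer` and `count_structures` and the `min_sep` parameter are assumptions introduced here, not part of the original analysis. A minimum separation of 1 corresponds to allowing neighbor binding, and 2 to disallowing it.

```python
from functools import lru_cache
from math import factorial


def g_monomer(n, nb, neighbors_can_bind):
    """Closed-form monomer multiplicity of Eq. 8 (exact integer arithmetic)."""
    if nb < 0 or 2 * nb > n:
        return 0
    if neighbors_can_bind:
        return factorial(n) // (
            factorial(n - 2 * nb) * factorial(nb + 1) * factorial(nb)
        )
    if nb == 0:
        return 1  # the fully unbound strand
    if n - 2 * nb - 1 < 0:
        return 0  # the innermost bond must enclose at least one free sticker
    numerator = factorial(n - nb) * factorial(n - nb - 1)
    denominator = (
        factorial(n - 2 * nb) * factorial(n - 2 * nb - 1)
        * factorial(nb + 1) * factorial(nb)
    )
    return numerator // denominator


@lru_cache(maxsize=None)
def count_structures(n, nb, min_sep):
    """Count non-crossing sets of nb bonds on n stickers in a row.

    Stickers i < j may bond only if j - i >= min_sep.
    """
    if nb == 0:
        return 1
    if n < 2:
        return 0
    # Case 1: the first sticker is unbound.
    total = count_structures(n - 1, nb, min_sep)
    # Case 2: the first sticker (index 0) bonds to sticker `partner`,
    # splitting the rest into an enclosed and a trailing segment.
    for partner in range(min_sep, n):
        inside = partner - 1
        outside = n - partner - 1
        for k in range(nb):
            total += (
                count_structures(inside, k, min_sep)
                * count_structures(outside, nb - 1 - k, min_sep)
            )
    return total


if __name__ == "__main__":
    for n in range(2, 9):
        for nb in range(n // 2 + 1):
            print(n, nb, g_monomer(n, nb, True), g_monomer(n, nb, False))
```

The recursion splits on whether the first sticker is free or bonded, which is the standard decomposition for non-crossing pairings; it is used here only as an independent check of the closed forms.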

Calculating g(n, m, Nb) from g(n, 1, Nb) also depends on whether or not adjacent stickers can bind (see Supplementary Note 1.2). While the exact calculation requires large numbers of sums with no closed-form solution, a close approximation is given by

$$g(n,\, m,\, {N}_{\rm {{b}}})\, \approx \left\{\begin{array}{ll}\frac{g(nm,\, 1,\, {N}_{\rm {{b}}})}{m} \hfill &\,{{\mbox{if adjacent stickers can bind}}}\, \hfill \\ \frac{g(nm+\alpha (m-1),\, 1,\, {N}_{\rm {{b}}})}{m}&\,{{\mbox{otherwise}}}\,\hfill\end{array}\right.$$
(9)

where α ≈ 0.42 is a heuristic correction that applies only when neighbor binding is disallowed. This value fits the strong interaction regime especially well, and other approximations may improve on it (see Supplementary Fig. 1). The factor of 1/m corrects for overcounting due to symmetry (Supplementary Note 1.2.3; see also Supplementary Fig. 2)36.
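One possible way to evaluate Eq. 9 numerically is sketched below. Because nm + α(m − 1) is generally not an integer, the sketch replaces each factorial x! in Eq. 8 with Γ(x + 1); this continuation is an assumption made here for illustration, and the text does not specify how non-integer arguments were handled. The function names are likewise hypothetical.

```python
from math import lgamma, log

ALPHA = 0.42  # heuristic value quoted in the text


def log_g_monomer(n_eff, nb, neighbors_can_bind):
    """Logarithm of Eq. 8 with x! written as Gamma(x + 1).

    Accepting a non-integer n_eff is an assumption of this sketch.
    Returns -inf where the multiplicity vanishes.
    """
    slack = n_eff - 2 * nb  # stickers left unbound
    if nb < 0 or slack < 0 or (not neighbors_can_bind and slack <= 0):
        return float("-inf")
    if neighbors_can_bind:
        return (
            lgamma(n_eff + 1) - lgamma(slack + 1)
            - lgamma(nb + 2) - lgamma(nb + 1)
        )
    return (
        lgamma(n_eff - nb + 1) + lgamma(n_eff - nb)
        - lgamma(slack + 1) - lgamma(slack)
        - lgamma(nb + 2) - lgamma(nb + 1)
    )


def log_g_multimer(n, m, nb, neighbors_can_bind):
    """Logarithm of the approximate multimer multiplicity of Eq. 9."""
    if neighbors_can_bind:
        n_eff = n * m
    else:
        n_eff = n * m + ALPHA * (m - 1)
    return log_g_monomer(n_eff, nb, neighbors_can_bind) - log(m)
```

Working with logarithms avoids overflow for large nm, and for m = 1 the expressions reduce to Eq. 8.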

Given expressions for the multiplicity factor, the partition functions (Eq. 7) are now in principle computable. However, the full sum in that equation remains too computationally intensive to be useful. We therefore turn to a saddle-point approximation: sums of exponentials are typically dominated by their maximum terms, and Eq. 7 is no exception.

To find the maximum term, there are three cases to consider, corresponding to physically meaningful distinctions (Fig. 1c). In the first, the “strong binding” regime, the ensemble is dominated by structures that maximize the bond energy, and the sum is dominated by the last terms ($${N}_{{\rm {b}}}={N}_{{\rm {b}}}^{\max }$$). In the second, the “intermediate binding” regime, the ensemble is dominated by structures that maximize a combination of the bond energy and configurational entropy measured by g, and the sum is dominated by an intermediate term ($${N}_{{\rm {b}}}={N}_{{\rm {b}}}^{\star }$$). In the third, the “weak binding” regime, the ensemble is dominated by structures that have almost no bonds, and the sum is dominated by the first terms ($${N}_{{\rm {b}}}={N}_{{\rm {b}}}^{\min }$$). These three cases must be treated separately: in the strong and weak binding regimes, the discrete nature of the sum is crucial, while in the intermediate regime, the sum can be well-approximated by an integral. The boundary between these regimes occurs approximately when $${N}_{{\rm {b}}}^{\star }={N}_{{\rm {b}}}^{\max }-1$$ or $${N}_{{\rm {b}}}^{\star }={N}_{{\rm {b}}}^{\min }+3$$. For Figs. 3 and 5, we set the boundary between the strong and intermediate regimes at $${N}_{{\rm {b}}}^{\star }={N}_{{\rm {b}}}^{\max }-\frac{1}{4}$$ (allowing neighbor binding) and $${N}_{\rm {{b}}}^{\star }=\frac{n}{2}-2$$ (disallowing neighbor binding).

After computing the dominant term of the sum, the next-order correction to Zm comes from either considering the next-dominant term (strong and weak regimes) or the curvature at the maximum (intermediate regime); see Supplementary Note 3 for more details.
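The three regimes can be illustrated numerically for the simplest case: a monomer with neighbor binding allowed, using the top line of Eq. 8 in the sum of Eq. 7. The sketch below is a minimal illustration rather than the procedure of Supplementary Note 3. The function names and the dimensionless parameter `beta_F` (standing for βF) are assumptions introduced here. It locates the largest term of the sum and compares its logarithm with the logarithm of the full sum.

```python
from math import exp, lgamma, log


def log_g_monomer(n, nb):
    """Logarithm of the top line of Eq. 8 (neighbor binding allowed)."""
    return lgamma(n + 1) - lgamma(n - 2 * nb + 1) - lgamma(nb + 2) - lgamma(nb + 1)


def dominant_term(n, beta_F):
    """Return (Nb of the largest term, log of that term, log of the full sum).

    The sum is Z_1 = sum over Nb of g(n, 1, Nb) * exp(-beta_F * Nb).
    """
    log_terms = [log_g_monomer(n, nb) - beta_F * nb for nb in range(n // 2 + 1)]
    nb_star = max(range(len(log_terms)), key=log_terms.__getitem__)
    peak = log_terms[nb_star]
    # Log-sum-exp, shifted by the peak for numerical stability.
    log_z = peak + log(sum(exp(t - peak) for t in log_terms))
    return nb_star, peak, log_z


if __name__ == "__main__":
    for beta_F in (-10.0, 0.0, 10.0):
        nb_star, peak, log_z = dominant_term(20, beta_F)
        print(f"beta_F={beta_F:+.1f}  Nb*={nb_star}  "
              f"log(max term)={peak:.3f}  log Z={log_z:.3f}")
```

For strongly negative `beta_F` the largest term sits at the maximum number of bonds, for strongly positive `beta_F` it sits at zero bonds, and in between it moves to an interior value. The gap between the two logarithms is what the next-order corrections described above account for.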

When comparing between the analytical and computational models, we use a single fitting parameter $${l}_{{{{{{{{\rm{eff}}}}}}}}}^{{{{{{{{\rm{fit}}}}}}}}}$$, which tunes the normalized effective loop length. That parameter is fit separately to the monomer partition functions allowing and disallowing neighbor binding, but is kept constant for all values of m. For different binding strengths, a different fraction of stickers will be bonded, leading to a different value of leff. Rather than having a separate fitting parameter for each parameter set, we only fit once (to monomers) in each of the two linker length regimes (allowing and disallowing neighbor binding). We then assume that leff changes linearly with the fraction of stickers bonded, leading to:

$${l}_{{{{{{{{\rm{eff}}}}}}}}}=\frac{nm}{2{N}_{{\rm {b}}}^{\star }}{l}_{{{{{{{{\rm{eff}}}}}}}}}^{{{{{{{{\rm{fit}}}}}}}}}.$$
(10)

We fit $${l}_{{{{{{{{\rm{eff}}}}}}}}}^{{{{{{{{\rm{fit}}}}}}}}}$$ to the strong binding regime (Fig. 2) for which $${l}_{{{{{{{{\rm{eff}}}}}}}}} \, \approx \, {l}_{{{{{{{{\rm{eff}}}}}}}}}^{{{{{{{{\rm{fit}}}}}}}}}$$. We find intuitively reasonable values for $${l}_{{{{{{{{\rm{eff}}}}}}}}}^{{{{{{{{\rm{fit}}}}}}}}}$$. When using l = 1 (disallowing neighbor binding), we find $${l}_{{{{{{{{\rm{eff}}}}}}}}}^{{{{{{{{\rm{fit}}}}}}}}}=4.3$$ nucleotides. This value is in between the length of an internal loop formed by two individual linkers (4 nucleotides) and the length of a hairpin loop formed by two linkers and a sticker (5 nucleotides). When using l = 4 (allowing neighbor binding), we find $${l}_{{{{{{{{\rm{eff}}}}}}}}}^{{{{{{{{\rm{fit}}}}}}}}}=7$$ nucleotides. This value is also in between the length of an internal loop formed by two individual linkers (10 nucleotides) and the length of a hairpin loop formed by a single linker (5 nucleotides).

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.