Physical principles of retroviral integration in the human genome

Certain retroviruses, including HIV, insert their DNA in a non-random fraction of the host genome via poorly understood selection mechanisms. Here, we develop a biophysical model for retroviral integration as stochastic and quasi-equilibrium topological reconnections between polymers. We discover that physical effects, such as DNA accessibility and elasticity, play important and universal roles in this process. Our simulations predict that integration is favoured within nucleosomal and flexible DNA, in line with experiments, and that these biases arise due to competing energy barriers associated with DNA deformations. By considering a long chromosomal region in human T-cells during interphase, we discover that at these larger scales integration sites are predominantly determined by chromatin accessibility. Finally, we propose and solve a reaction-diffusion problem that recapitulates the distribution of HIV hot-spots within T-cells. With few generic assumptions, our model can rationalise experimental observations and identifies previously unappreciated physical contributions to retroviral integration site selection.

(1-2) the viral DNA finds a candidate integration site by diffusive search; (3)(4) an integration move is performed; (5) the difference in configurational energy is computed and the integration event is accepted if ∆E < 0 or with probability p = exp [−∆E/k B T ]. (6) If rejected the viral DNA can continue its diffusive search for the next candidate. C The preference for HIV integration in nucleosomal DNA can be explained by considering this DNA is tightly wrapped and hence heavily bent. For this reason an integration event is likely to reduce the bending of this segment. On the contrary integration in naked DNA must locally increase the bending. D Integration probability along the viral DNA. In our simplified model, any site along the viral DNA (of length L v = 40σ = 294 bp) can be used as integration point. Considering a more refined model where the first bond along the viral DNA is kinked (thus mimicking the presence of intasome joining the longterminal repeeats, or LTR, [1]) naturally favours the integration at that site (green line in the plot). At the same time, integration profile in the tDNA is unaltered (see main text). E Integration profile in a target DNA segment with heterogeneous flexibility (of length L t = 200σ = 1470 bp). The flexible part is naturally favoured by our quasi-equilibrium model since the energy barrier for integration is lower. This is in agreement with experiments [2]. When a purely non-equilibrium model is used, i.e. no ∆E is computed and every event is accepted, the integration probability is uniform. Integrations profiles are obtained by averaging at least 1000 integration events.
Supplementary Figure 2: Probability distribution ρ(r). Data points are obtained averaging over 5 10 5 random walks starting from the surface of a sphere of radius R and allowed to diffuse and react within the sphere at constant D and κ. The line is the prediction obtained from the analytic solution to the reaction-diffusion problem (see eq. (9)). We span the 2D space (D r , K r ) and compute the goodness of fit as r 2 = i (o i − e i ) 2 (plotted as log 10 r 2 in grey-scale). From left to right we consider three values of the amplitude A = 1, 10, 50σ (or 50, 500, 2500 nm) which is used to randomly displace the boundary of the three concentric shells for each independent replica of the simulation (average over 10 6 replicas). The green/red dots in the parameter space denote the values for which we find best/worst agreement with experiments. Line in panel B is a guide for the eye to identify the region of parameter space where we find a systematically good agreement with experiments. (D-F) Examples of best agreements between simulations and experiments for a given boundary displacement. (G-I) Examples of worst agreements between simulations and experiments for a given boundary displacement. Experimental dataset is taken from Ref. [3].

Supplementary Note 1: Integration Algorithm
In this section we discuss in more detail the integration algorithm used in this work (also see main text). The integration of viral DNA into the host is a complex process that requires many steps [4]. In our model we summarise this complexity into two main processes: diffusive search of integration site and quasi-equilibrium integration event.
The former process is natural to account for as both target and viral DNAs are polymers diffusing in a noisy environment. Random and diffusive search has been seen in HIV integration in vitro [5] and it is very likely that it also occurs in vivo with even more pronounced complexity given the heterogeneous nuclear composition.
The latter, quasi-equilibrium integration, is our only assumption for the integration event: We start from the observation that HIV integration needs to locally deform the substrate in order to perform the double-strand cleavage and integration; then, we reason that in order to deform the substrate, the integrase enzyme needs to either expend some energy or to exploit thermal fluctuations that facilitate local deformations. In either case, we model this process by computing the total configurational energy immediately before (E) and immediately after (E ) the integration move and compute the difference ∆E = E − E. If ∆E < 0 then this means that the integration move is locally energetically favoured and the move is accepted. On the contrary, if ∆E > 0 we accept it with probability p = exp [−∆E/k B T ]. This effectively mimics thermal fluctuations (of order k B T ) that can induce spontaneous local deformations and thus facilitate integration. If the move is rejected the viral DNA resumes the diffusive search (see Suppl. Fig. 1B for a schematics).

Supplementary Note 2: Integration within Single Nucleosomes
In the first part of our work, we study the integration of viral loops within nucleosomes. The size of viral loops is not a crucial parameter in our model as the reconnection moves occur only "locally", i.e. between two short polymer segments, and irrespectively of the rest of the chain. This is compatible with experiments in vivo and in vitro. Indeed, in the latter case HIV DNA is generally much smaller than in vivo and yet it is still successfully integrated within DNA substrates [6]. For this reason, we consider viral DNA (vDNA) as made of 20 or 40 beads (about 150-294 bp) for computational efficiency. We model the presence of a histone protein as represented by a sphere of size σ h = 3σ = 7.5 nm. By assigning an attractive interaction between the histone and a selected segment of the polymer, we can model the wrapping of the DNA around the histone core. This is done using the LJ potential (see Methods in main text) and setting r c = 1.8σ and = 4k B T . With this simple model we cannot control the handedness of the wrapping. On the other hand, this is not crucial for our argument. Notice that this model has been used in the literature to describe chromatin reconstitution in vitro [7].
Because the integration process is weighted by computing the change in energy before and after the recombination event, the kinetics of integration becomes very slow when we set the persistence length to be l p = 20σ. This is because any bending introduced during the recombination heavily contributes to increasing the conformational energy. To speed up the kinetics while maintaining a realistic persistence length, we employ an extensible bond potential, i.e. instead of FENE bonds we use harmonic bonds where x 0 = 1.1σ and κ = 20k B T . This effectively allows bonds to be more easily stretched and uniformly lowers the overall energy barrier along the DNA. Importantly, by using this potential for all bonds along the DNA we are not introducing explicit biases in the integration into any specific region. Any bias in the integration statistics is a result of the presence of histone-like particles and the consequent local larger bending introduced within the nucleosome core (see main text Fig. 1).
The observed bias for nucleosomal DNA can be readily understood by considering the fact that this DNA is tightly wrapped around the histone and hence heavily deformed and storing a large bending energy. This is because the nucleosomal DNA is comparable to DNA's persistence length. If now one imagines using a much larger contour length to wrap the same histone octamer, the resulting configuration is much looser, less bent and thus storing less bending energy. For this reason, one can imagine that an integration event on a segment of nucleosomal DNA lowers the total configurational energy and it is thus an energetically favoured process. on the contrary, integration in naked DNA requires a local deformation that locally increases the bending energy of the tDNA and it is thus energetically disfavoured but possible because of thermal fluctuations (see Suppl. Fig. 1C for a schematics).

Supplementary Note 3: A Refined Model for Viral DNA with LTR
Up until now, we have considered a model where any segment along the viral DNA can be selected for the integration event. Indeed we observe a uniform integration probability along the viral DNA (vDNA). Although this may seem biologically inaccurate, it is done for computational efficiency as it speeds up the search process.
Restricting the integration event to occur in one segment on the vDNA would not change our results.
An improvement in this respect can be done as follows: we can consider a refined model where the viral DNA is now modelled as a loop with a "kink". More specifically, we set a fixed persistence length equal to l p = 50 nm for all the segments of the viral DNA apart from one, where we set l p = 0. By doing so we effectively allow the angle between beads N, 1, 2 to assume any sterically accessible conformation without paying any energy penalty. In practice, this translates into the viral DNA often displaying a "kink" at this location. This kink can be seen as the point where the intasome is located and where it keeps the long-terminal-repeats (LTR) together [1].
By performing integrations with this model we observe no change in the integration profile along the target DNA, as expected (indeed it is only the deformation of the substrate that determines the integration profile along the tDNA). Yet we observe a difference in the integration profile along the viral DNA. This is because the segment with zero persistence length is more easily deformable with respect to others. For this reason we expect, and observe, a marked increase in integration probability in this segment (see Suppl. Fig. 1D). See Suppl. Movie 2 for an integration event with kinked vDNA.

Supplementary Note 4: Integration in Naked DNA with Heterogeneous Flexibility
In this section we discuss our results for integrations in substrates with heterogeneous flexibility. We consider a segment of naked DNA 200 beads long (about 1470 bp or 500 nm): the first half of the segment is rigid (persistence length l p = 50 nm) while the second half is flexible (persistence length l p = 30 nm). As one can see in Suppl. Fig. 1E our simulations show that the integration of a 40 beads (294 bp) viral DNA is favoured in the flexible region, as seen experimentally [2]. This finding can be understood again in terms of energy barriers. The flexible regions allow more easily local deformations that are exploited by the integration machinery to integrate the viral DNA into the substrate. Importantly, our model can account for this preference and for that shown for nucleosomal DNA with one simple assumption, i.e. that the retroviral integration is a quasi-equilibrium process that needs to overcome an energy barrier to integrate the viral genetic material into the host (through local deformation of the substrate).

Supplementary Note 5: Non-equilibrium Integration
In this section we discuss an alternative integration strategy that is fully nonequilibrium. Here, retroviral integration maintains its diffusive search but it can perform integration within the host without being restricted to the Metropolis test.
In other words, any integration move is always accepted. This strategy can be understood as mimicking the fact that an enzyme may consume ATP to actively deform the substrate and integrate the viral DNA without the need to wait for thermal fluctuations of the substrate.
We repeat the simulations performed in the previous section (in naked DNA substrates with heterogeneous flexibility) and report our findings in Supplementary Figure 1E. As one can notice, this time the integration profile is uniform, i.e. the preference towards flexible regions is lost. This finding (flat integration profile in naked DNA with heterogeneous flexibility) is not seen in experiments considering HIV, which instead find a preference for flexible (or curved) DNA [2]. For this reason, we argue that our model for HIV integration as quasi-equilibrium process is a better model for real HIV integration as this gives integration profiles that are in quantitative agreement with experiments (see main text Fig. 1). Yet, our model predicts that retroviruses that do not require to bend the substrate or that employ ATP to deform the substrate should display integration profiles that are insensitive to the underlying DNA flexibility.

Supplementary Note 6: Integration within Nucleosomal Arrays
In the main text (see main Fig. 2) we consider a model in which several histones form a short reconstituted chromatin fibre. In formulating this model we are inspired by the idea that the chromatin fibre is not a static, crystalline structure but that it is a dynamic, "soft" assembly that minimises energetic costs and that can display heteromophic conformations [8]. Our model is motivated by in vitro [9] and in vivo [10] observations that chromatin assumes a range of possible structures rather than one typical structure.
In our model, the substrate DNA is made of 290 beads with 10 nucleosomal segments 20 beads (160 bp) long interspersed with linker DNA 10 beads long (80 bp). This choice considers a linker DNA slightly longer than the typical one in eukaryotes; yet this accelerates the self-assembly of the chromatin fibre as there is a smaller energy penalty to be paid to loop longer linker DNA.
The reconstitution is simulated by using a model where each of the 10 histones in the system experiences a short-ranged attraction (as before mediated by the LJ potential with r c = 1.8σ and = 4k B T ) with a specific nucleosomal segment. In other words, for 10 nucleosomal segments and 10 histone-like particles, we assume attractive interactions between each pair (1-1, 2-2, 3-3, ...) and purely steric interactions with all other possible pairs (e.g., 1-2, 2-4, 5-9, ...). Although this is far from realistic (as histones do not have one binding site in the entire genome) we are not here interested in the process of reconstitution per se but on the integration statistics within a reconstituted chromatin fibre acting as substrate. For this reason, we employ this simple strategy to reconstitute many (> 1000) chromatin fibres in open, partially folded and condensed states and study the statistics of retroviral integration on these different substrates with varying local chromatin structure (see main text). These states are generated as follows: in the open chromatin state histone-like particles display purely repulsive interactions between one another (i.e. via WCA potential described above with σ histone = 3σ). Starting from this open state, we then include nearest-neighbour attractions (i, i ± 1)between the histone-like particles: depending on the strength of the attraction this leads to either partially folded fibres (nnp, r c = 9σ = 3σ histone = 22.5 nm and = 40k B T ) or fully condensed fibres (nnf, r c = 9σ = 3σ histone = 22.5 nm and = 80k B T ). Alternatively, we also consider next-nearest-neighbour attractions which lead to zig-zagging fibres (nnn, r c = 9σ = 3σ histone = 22.5 nm and = 60k B T ). These large values of binding affinity are required to stabilise short DNA loops which are formed by the linker DNA (10 beads is half the persistence length of DNA in this model) and overcome steric hindrance of the DNA.
We stress that although this model may be seen as oversimplified, it allows us to reproduce a range of chromatin substrates with conformations that are not too far from the heteromorphic structures seen in vitro or in vivo [10,8,9]. More sophisticated and realistic models for chromatin reconstitution employing patchy-particles can in principle be used (see Ref. [7]) but the main physical drivers of retroviral integration, i.e. bending and accessibility, are already fully captured by our simplified model.

Supplementary Note 7: Random Walk Model for Viral Integration in Nuclei
In this section we discuss the random walk model to describe the behaviour of viral DNA entering cell nuclei. We assume that the viral DNA enters the nucleus from the nuclear envelope and begins an unbiased Brownian walk with diffusion coefficient D, i.e. satisfying the Langevin equation where η is a vector of Gaussian numbers with zero mean and unit variance. Furthermore, we describe the integration process as happening at rate κ. In other words, at each time-step we draw a random number and if it is smaller than κ∆t, we stop the random walk and record its position among the "integrated" viruses.
We average over typically 10 5 random walks starting from positions uniformly distributed on the surface of a sphere of radius R and obtain a distribution of integration events. This distribution is then binned in the radial location r of the integrated event and the count for each bin is divided by the area of the shell at position r, i.e. 4πr 2 dr. This distribution is finally normalised so that its integral over the range [0, R] is unity.
In this model, we take σ = 50 nm and R = 200σ = 10 µm. By taking the viscosity of the nucleoplasm at η = 150cP we can then obtain τ br = 50 ms. We set D = 0.05 µm 2 /s, close to the diffusivity of viral capsids in nuclei [11] and choose κ = 0.002 s −1 (a parameter largely unknown). The retroviral penetration length is thus l = D/κ = 5 µm. Importantly, Eq. (2) can be simulated with spatially varying diffusion coefficient and reaction rates.

Supplementary Note 8: Solution of the Reaction-Diffusion Equation for Viral Integration in the Nucleus -Uniform case
Here we describe the solution of the reaction-diffusion equation with D and κ constants. The equation can be written as By assuming that the solution takes the form of ρ(r) = r n f we can write r n ∂ rr f + 2 [n + 1] r n−1 ∂ r f + + 2nr n−2 + n(n − 1)r n−2 − κ D r n f = 0 By setting n = −1/2 we find which is the modified Bessel equation with m = 1/2, x = r/l and l = D/κ. The generic solution of this equation is with I m (x) and K m (x) the modified Bessel functions of order m of the first and second kind respectively. Using this generic form, the steady state distribution ρ(r) = r −1/2 f is thus Here, we note that ρ(r) must not diverge at r → 0; this entails that C 2 = 0 and we thus find ρ(r) = sinh (r/l) r shi(R/l) (9) with the normalisation hyperbolic sine integral shi(t) ≡ R 0 sinh(t)/t dt obtained imposing R 0 ρ(r)dr = 1. Alternatively, one can define ρ = f /r and find a simple equation for f leading to the following solution of Eq. (4) with A and B coefficients to be determined. In order for this solution to be well behaved at r → 0 we need A = −B, reobtaining Eq. (9).

Supplementary Note 9: Reaction-Diffusion Equation in Heterogeneous Nuclei
In this case we need to solve the system of equations: with the continuity conditions In these equations, R 1 and R 2 are the position of the boundaries between zones 1-2 and 2-3 respectively. We solve this system of equations with Mathematica DSolve which gives where l i = D i /κ i and N is a normalisation factor that can be set by imposing where Θ(x, b 1 , b 2 ) is here defined as unity if b 1 ≤ x < b 2 and zero otherwise.

Supplementary Note 10: Resizing Nuclear Shells
In order to obtain quantitative agreement between our model and the data from experiments [3] we observe that we need to reshape the concentric regions in the nucleus of a model T-cell. First we fix the mass (or amount of genetic material) in heterochromatin to be twice the one in euchromatin, i.e.
We further impose that the volume occupied by heterochromatin is the inner and outer layer, whereas the one of euchromatin is the middle one, i.e.
In these equations, R 1 and R 2 are the boundaries between the first and second layers and between the second and the third, respectively. We now insert eq. (14) into eq. (13) and can solve for ρ eu /ρ het : obtaining the equation reported in the main text.

Supplementary Note 11: Robustness in Parameter Space
In this section we discuss the robustness of our theoretical model with respect to the parameter space. Our reaction-diffusion model has two main free parameters: the ratios of diffusion coefficient D r and integration rate K r for euchromatin versus heterochromatin. The heterochromatic value of diffusion D 0 is set to be similar to that experimentally measured for the viral capsid in the nucleus [11], i.e. D 0 = 0.05µm 2 /s. The heterochromatic integration rate is, to the best of our knowledge, unknown and we set it to K 0 = 0.01s −1 to broadly match the experimental data from Ref. [3]. As explained above, the key parameter here is the "penetration length" l = D 0 /K 0 which we find needs to be about l √ 5µm to yield broad agreement with the experimental trend (see main text). Albeit our model with D r = 1 and K r = 1 (i.e. both eu-and heterochromatin have the same D 0 and K 0 ) is in fair agreement with the data (see main text) the fine details of the experimental curve cannot be captured by these parameters. We then note that while varying D r and K r we can obtain a closer agreement, our simulations display sudden jumps in the integration profile that are not found in experiments (see main text and Suppl. Fig. 3A). We thus reasoned that our model neglected the cell-to-cell heterogeneity in the boundaries' location. For this reason we randomised the location of the shells' boundaries and associated an amplitude A which delimited their maximum displacement. By averaging over many (∼ 10 7 ) independent replicas, we obtained smoother curves that started to closely follow the experiments (see main text).
In particular, one can notice that the goodness of the fit -computed as the sum of the residuals r 2 = i (o i − e i ) 2 , where o i is the observed (calculated numerically) and e i the expected (experimental) value -is generally robust across the parameter space but it shows a systematic improvement in a diagonal strip K r 0.25 + 1.3D r (see Suppl. Fig. 3B).
Finally, by varying the maximum amplitude of the randomisation of the boundary (A = 1, 10 and 50 σ, with σ = 50 nm are shown in Suppl. Fig. 3) we show that this parameter affects the qualitative behaviour of the integration profile. Specifically, small values A yield abrupt changes which correspond to the jump from one shell to the next. Yet, large values of A smooth out the curves so much that they nearly recover the case with constant diffusion and integration rate. We find that intermediate values (e.g., A 10σ = 500 nm) best match the trend seen in the experimental curve. This is shown in the heat maps of Suppl. Fig. 3, as the case