Relative rate and location of intra-host HIV evolution to evade cellular immunity are predictable

Human immunodeficiency virus (HIV) evolves within infected persons to escape being destroyed by the host immune system, thereby preventing effective immune control of infection. Here, we combine methods from evolutionary dynamics and statistical physics to simulate in vivo HIV sequence evolution, predicting the relative rate of escape and the location of escape mutations in response to T-cell-mediated immune pressure in a cohort of 17 persons with acute HIV infection. Predicted and clinically observed times to escape immune responses agree well, and we show that the mutational pathways to escape depend on the viral sequence background due to epistatic interactions. The ability to predict escape pathways and the duration over which control is maintained by specific immune responses open the door to rational design of immunotherapeutic strategies that might enable long-term control of HIV infection. Our approach enables intra-host evolution of a human pathogen to be predicted in a probabilistic framework.


Supplementary Figure 2 | Different sequence backgrounds lead to different patterns of escape in the Gag TW10 epitope. Strong interactions between the Gag TW10 epitope escape mutations 242N (a) and 248T (b) and specific residues in the sequence background in patient CAP239 lower the fitness cost of these two mutations. All strong interactions (|J|>0.1, see equation (1) in the main text) between these escape mutations and the p24 protein sequence background are shown, with the width of the link proportional to the magnitude of the coupling. 223V and 219Q are known compensatory mutations. Similarly, 146P has been positively associated with variation in the TW10 epitope [1], and 256V is known to strongly suppress TW10 variation [2]. In patient CAP239, escape occurs through mutations 242N and 248T. Compensatory residues in the sequence background in patient CH198 lower the fitness cost of the 242N escape mutation (c), but other escape mutations such as 248T are suppressed (d). In patient CH198, escape occurs only through the 242N mutation.
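
To make the |J|>0.1 criterion concrete, the following minimal Python sketch lists the couplings between one escape mutation and the residues present in a given sequence background, keeping only those whose magnitude exceeds the threshold. The dictionary layout, the function name `strong_links`, and all numerical values are hypothetical choices made for illustration; they are not the fitted model parameters or the authors' code.

```python
from typing import Dict, List, Tuple

# A (position, amino acid) pair, e.g. (242, "N") for the 242N mutation.
Site = Tuple[int, str]


def strong_links(
    mutation: Site,
    background: Dict[int, str],
    couplings: Dict[Tuple[Site, Site], float],
    threshold: float = 0.1,
) -> List[Tuple[Site, float]]:
    """Return background residues whose coupling J to `mutation` satisfies
    |J| > threshold, sorted by decreasing magnitude."""
    links = []
    for pos, aa in background.items():
        if pos == mutation[0]:
            continue  # a residue does not couple to its own position
        other = (pos, aa)
        # Couplings are symmetric, so accept either key order (assumed storage).
        j = couplings.get((mutation, other), couplings.get((other, mutation), 0.0))
        if abs(j) > threshold:
            links.append((other, j))
    return sorted(links, key=lambda item: -abs(item[1]))


# Toy, made-up coupling values; only their magnitudes matter for the filter.
toy_couplings = {
    ((242, "N"), (223, "V")): -0.35,
    ((219, "Q"), (242, "N")): -0.20,
    ((242, "N"), (256, "V")): 0.05,
}
toy_background = {219: "Q", 223: "V", 256: "V"}
links = strong_links((242, "N"), toy_background, toy_couplings)
```

With these toy inputs, the weak coupling to 256V falls below the threshold and is dropped, while the remaining links come back ordered by magnitude, which is the quantity the caption maps to link width.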
Supplementary Figure 3 | Exploring the potential contributions of multiple escape pathways. (a) Difference in energy (gap) between the predicted fittest and second fittest potential escape mutants for each epitope. When the gap is large, this indicates that alternative escape mutations may come at a much larger fitness cost to the virus, compared to the easiest escape path. In contrast, a low value for the gap indicates that multiple alternative escape routes with similar fitness costs exist. Typically, multiple potential escape mutations are available that have comparable fitness costs, but in some cases the fitness cost of escape increases sharply for suboptimal escape paths. (b) Logarithm of the entropy of the sequence distribution (see equation (1) in the main text) restricted to the set of escape mutant sequences for each epitope only, which can be interpreted as an effective number of likely escape paths.
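
The two quantities in this caption can be sketched in a few lines of Python. The sketch assumes that lower energy corresponds to higher fitness and that the sequence distribution of equation (1) assigns each sequence a probability proportional to exp(-E); both are assumptions made here for illustration, as are the function names and the example energies.

```python
import math
from typing import Sequence


def energy_gap(energies: Sequence[float]) -> float:
    """Energy difference between the second-lowest and lowest energy
    escape mutants (assumes lower energy = fitter)."""
    lowest, second = sorted(energies)[:2]
    return second - lowest


def escape_entropy(energies: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the distribution p ~ exp(-E),
    renormalized over the supplied escape mutant energies only."""
    e_min = min(energies)
    # Subtracting the minimum keeps the exponentials numerically stable.
    weights = [math.exp(-(e - e_min)) for e in energies]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical energies for three candidate escape mutants of one epitope.
example = [3.0, 1.0, 1.5]
gap = energy_gap(example)
entropy = escape_entropy(example)
```

Two equally costly escape mutants give an entropy of ln 2, whereas a single dominant escape route drives the entropy toward zero, which is why the restricted entropy tracks the number of plausible escape paths.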

Supplementary Figure 4 | Empirical correlation between viral replicative capacity and energy. Using experimental measurements of viral replicative capacity [3] taken from a study unrelated to this work, along with corresponding energy measurements for these viral sequences, we can derive an empirical relationship for variation in viral fitness as a function of energy.

Supplementary Figure 5 | Correlation between escape time and fitness-based measures can improve when epitopes where escape is observed at the time the T cell response was first detected are included. (a-d) Analogous to Fig. 2 in the main text, including epitopes where ≥50% of the virus population consists of escape mutants at the time the T cell response was first detected. Total of n=71 epitopes, including 3 epitopes where escape occurs through putative antigen processing (AgP) mutation outside the epitope, and 10 epitopes where no escape is observed. Vertical immunodominance measurements are available for a subset (n=53) of these epitopes. Error bars show first/third quartiles for time to escape in the Wright-Fisher simulations, computed from the statistics of 10^3 simulation runs.

Supplementary Figure 6 | Fitness-based methods accurately predict the residues at which escape mutations occur. In the great majority of epitopes, the most common residue where escape mutations are observed in patients during the entire time course of evolution corresponds to one of the two top residues where escape mutations are predicted to incur the lowest fitness costs (41/51=80% of epitopes where escape is observed) or where mutations are most frequently observed in simulated evolution (43/51=84%). Less frequently, the residue where escape mutations are observed most often has one of the top two highest Shannon entropies (34/51=67%). Epitopes where escape was observed at the time the T cell response was detected are excluded (n=6), as is one epitope without detailed escape sequence data.
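
The "top two" criterion used in this caption can be illustrated with a short Python sketch: score every residue of an epitope, rank the residues, and check whether the clinically observed escape residue is among the two best-ranked. The function names, the toy alignment column, and the scores below are hypothetical; the sketch only shows the ranking logic, not the authors' pipeline.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def site_entropy(column: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the amino acids observed at one residue."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def in_top_two(
    scores: Dict[int, float], observed: int, higher_is_likelier: bool = True
) -> bool:
    """True if `observed` is one of the two residues ranked most likely to
    escape. Use higher_is_likelier=False for fitness costs, where the
    lowest cost marks the most likely residue."""
    ranked = sorted(scores, key=scores.get, reverse=higher_is_likelier)
    return observed in ranked[:2]


# Hypothetical per-residue scores for a three-residue stretch of an epitope.
toy_entropy = {240: 0.1, 242: 0.9, 248: 0.5}
toy_cost = {240: 4.0, 242: 1.2, 248: 2.5}
hit_by_entropy = in_top_two(toy_entropy, observed=242)
hit_by_cost = in_top_two(toy_cost, observed=242, higher_is_likelier=False)
```

Counting such hits over all epitopes and dividing by the number of epitopes gives fractions of the kind quoted in the caption.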

Supplementary Figure 7 | Predicting the residues of escape mutations in individual epitopes. Here we show the single-site entropy, fitness cost of mutation, and frequency of escape mutations in simulated evolution at each residue for all epitopes where nonsynonymous mutations were observed in the epitope (n=51). Each epitope is represented by a row of residues, with the residue where escape mutations were most frequently observed in the clinical data denoted by a circle. Predictions for the same epitope based on epitope entropy, fitness cost, and simulated evolution are placed side by side in each row. Darker colors indicate residues where escape mutations are predicted to be more likely. Predictions are correct when the circle in each row is more darkly shaded than the boxes in the same row. Epitopes where escape was observed at the time the T cell response was detected are excluded (n=6), as is one epitope without detailed escape sequence data.
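
Several of these captions refer to escape times obtained from simulated Wright-Fisher evolution and summarized by first and third quartiles over repeated runs. The Python sketch below shows that kind of summary for a deliberately simplified model with only two types (wild type and escape mutant), one-way mutation, and a fixed relative fitness advantage for the escape mutant. The study's simulations operate on full viral sequences with fitness derived from the energy model, so the two-type reduction, the 50% threshold used to define escape, and every parameter value here are assumptions made for illustration.

```python
import random
import statistics
from typing import Optional, Tuple


def time_to_escape(
    pop_size: int,
    mutation_rate: float,
    advantage: float,
    max_generations: int,
    rng: random.Random,
) -> Optional[int]:
    """Generations until escape mutants make up at least 50% of the population
    in a two-type Wright-Fisher model, or None if that never happens."""
    freq = 0.0
    for generation in range(1, max_generations + 1):
        # Mutation: wild type -> escape mutant (one-way, for simplicity).
        freq += mutation_rate * (1.0 - freq)
        # Selection: escape mutants have relative fitness 1 + advantage.
        weighted = freq * (1.0 + advantage)
        p = weighted / (weighted + (1.0 - freq))
        # Drift: binomial resampling of a fixed-size population.
        mutants = sum(1 for _ in range(pop_size) if rng.random() < p)
        freq = mutants / pop_size
        if freq >= 0.5:
            return generation
    return None


def escape_time_quartiles(
    runs: int, seed: int = 0, **params: float
) -> Tuple[float, float, float]:
    """First quartile, median, and third quartile of the escape time over
    independent runs (runs that never escape are ignored)."""
    rng = random.Random(seed)
    times = [time_to_escape(rng=rng, **params) for _ in range(runs)]
    escaped = [t for t in times if t is not None]
    q1, median, q3 = statistics.quantiles(escaped, n=4)
    return q1, median, q3


q1, median, q3 = escape_time_quartiles(
    runs=40, seed=1, pop_size=100, mutation_rate=1e-3,
    advantage=0.5, max_generations=300,
)
```

The quartiles returned here play the role of the error bars described for the simulated escape times; a larger number of runs narrows the sampling noise in those quartiles at the price of a longer run time.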

Supplementary tables
Our analysis employs HIV sequence data broadly sampled from thousands of individuals infected by both clade B and clade C viruses, far beyond the cohort of 17 individuals considered here, in order to obtain a more accurate estimate of the distribution of HIV sequences at the population level. Here we report the total number of sequences (and the number of unique individuals from which they were obtained) used to train the Potts models for each protein/clade. All sequences were downloaded from the Los Alamos National Laboratory HIV sequence database (www.hiv.lanl.gov). In order to reduce the influence of selection for drug resistance, only sequences from drug-naïve individuals were used for protease and reverse transcriptase.

Analogous to Table 2, but with random, patient-specific baseline escape rates included in the CPH model. Contributions of vertical immunodominance (%M) and purely fitness-related measures (S, ΔE, t_WF) again are mostly independent. Note that here the maximum possible pseudo-R^2 is substantially lower than in Table 2.

Escape occurs more rapidly when escape mutations are present in the virus population at the time that T cell responses are first detected. This is because the fitness cost of escape appears to be particularly low for these epitopes (Methods). Nonetheless, the correlation between fitness cost/time to escape in simulated evolution and the true escape time remains robust even if these epitopes are omitted.