# MPL resolves genetic linkage in fitness inference from complex evolutionary histories

## Abstract

Genetic linkage causes the fate of new mutations in a population to be contingent on the genetic background on which they appear. This makes it challenging to identify how individual mutations affect fitness. To overcome this challenge, we developed marginal path likelihood (MPL), a method to infer selection from evolutionary histories that resolves genetic linkage. Validation on real and simulated data sets shows that MPL is fast and accurate, outperforming existing inference approaches. We found that resolving linkage is crucial for accurately quantifying selection in complex evolving populations, which we demonstrate through a quantitative analysis of intrahost HIV-1 evolution using multiple patient data sets. Linkage effects generated by variants that sweep rapidly through the population are particularly strong, extending far across the genome. Taken together, our results argue for the importance of resolving linkage in studies of natural selection.

## Data availability

Raw data used in our analysis are available in the GitHub repository at https://github.com/bartonlab/paper-MPL-inference. Source data are provided with this paper.

## Code availability

Code used in our analysis is available in the GitHub repository at https://github.com/bartonlab/paper-MPL-inference. The repository also contains Jupyter notebooks that can be run to reproduce the results presented here. The source code is shared under the GPL-3.0 license (https://github.com/bartonlab/paper-MPL-inference/blob/master/LICENSE-GPL). An executable version is also provided on Code Ocean at https://codeocean.com/capsule/3400567/tree (ref. 30), distributed under the GPL-3.0 license (https://opensource.org/licenses/gpl-license/).

## References

1. Bignell, G. R. et al. Signatures of mutation and selection in the cancer genome. Nature 463, 893–898 (2010).

2. Greaves, M. & Maley, C. C. Clonal evolution in cancer. Nature 481, 306–313 (2012).

3. Burrell, R. A., McGranahan, N., Bartek, J. & Swanton, C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 501, 338–345 (2013).

4. Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).

5. Landau, D. A. et al. Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell 152, 714–726 (2013).

6. Łuksza, M. et al. A neoantigen fitness model predicts tumour response to checkpoint blockade immunotherapy. Nature 551, 517–520 (2017).

7. McMichael, A. J., Borrow, P., Tomaras, G. D., Goonetilleke, N. & Haynes, B. F. The immune response during acute HIV-1 infection: clues for vaccine development. Nat. Rev. Immunol. 10, 11–23 (2010).

8. Allen, T. M. et al. Selective escape from CD8+ T-cell responses represents a major driving force of human immunodeficiency virus type 1 (HIV-1) sequence diversity and reveals constraints on HIV-1 evolution. J. Virol. 79, 13239–13249 (2005).

9. Zanini, F. et al. Population genomics of intrapatient HIV-1 evolution. eLife 4, e11282 (2015).

10. Strelkowa, N. & Lässig, M. Clonal interference in the evolution of influenza. Genetics 192, 671–682 (2012).

11. Łuksza, M. & Lässig, M. A predictive fitness model for influenza. Nature 507, 57–61 (2014).

12. Muller, H. J. The relation of recombination to mutational advance. Mut. Res. 1, 2–9 (1964).

13. Smith, J. M. & Haigh, J. The hitch-hiking effect of a favourable gene. Genet. Res. 23, 23–35 (1974).

14. Hegreness, M., Shoresh, N., Hartl, D. & Kishony, R. An equivalence principle for the incorporation of favorable mutations in asexual populations. Science 311, 1615–1617 (2006).

15. Lang, G. I. et al. Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations. Nature 500, 571–574 (2013).

16. Tenaillon, O. et al. Tempo and mode of genome evolution in a 50,000-generation experiment. Nature 536, 165–170 (2016).

17. Levy, S. F. et al. Quantitative evolutionary dynamics using high-resolution lineage tracking. Nature 519, 181–186 (2015).

18. Bollback, J. P., York, T. L. & Nielsen, R. Estimation of 2Nes from temporal allele frequency data. Genetics 179, 497–502 (2008).

19. Malaspinas, A.-S., Malaspinas, O., Evans, S. N. & Slatkin, M. Estimating allele age and selection coefficient from time-serial data. Genetics 192, 599–607 (2012).

20. Mathieson, I. & McVean, G. Estimating selection coefficients in spatially structured populations from time series data of allele frequencies. Genetics 193, 973–984 (2013).

21. Feder, A. F., Kryazhimskiy, S. & Plotkin, J. B. Identifying signatures of selection in genetic time series. Genetics 196, 509–522 (2014).

22. Lacerda, M. & Seoighe, C. Population genetics inference for longitudinally-sampled mutants under strong selection. Genetics 198, 1237–1250 (2014).

23. Foll, M., Shim, H. & Jensen, J. D. WFABC: a Wright–Fisher ABC–based approach for inferring effective population sizes and selection coefficients from time-sampled data. Mol. Ecol. Resour. 15, 87–98 (2015).

24. Ferrer-Admetlla, A., Leuenberger, C., Jensen, J. D. & Wegmann, D. An approximate Markov model for the Wright–Fisher diffusion and its application to time series data. Genetics 203, 831–846 (2016).

25. Taus, T., Futschik, A. & Schlötterer, C. Quantifying selection with Pool-Seq time series data. Mol. Biol. Evol. 34, 3023–3034 (2017).

26. Illingworth, C. J. R. & Mustonen, V. Distinguishing driver and passenger mutations in an evolutionary history categorized by interference. Genetics 189, 989–1000 (2011).

27. Illingworth, C. J. R., Fischer, A. & Mustonen, V. Identifying selection in the within-host evolution of influenza using viral sequence data. PLoS Comput. Biol. 10, e1003755 (2014).

28. Terhorst, J., Schlötterer, C. & Song, Y. S. Multi-locus analysis of genomic time series data from experimental evolution. PLoS Genet. 11, e1005069 (2015).

29. Sohail, M. S., Louie, R. H. Y., McKay, M. R. & Barton, J. P. MPL resolves genetic linkage in fitness inference from complex evolutionary histories. GitHub https://github.com/bartonlab/paper-MPL-inference (2020).

30. Sohail, M. S., Louie, R. H. Y., McKay, M. R. & Barton, J. P. MPL resolves genetic linkage in fitness inference from complex evolutionary histories. Code Ocean https://doi.org/10.24433/CO.1795728.v1 (2020).

31. Mustonen, V. & Lässig, M. Fitness flux and ubiquity of adaptive evolution. Proc. Natl Acad. Sci. USA 107, 4248–4253 (2010).

32. Illingworth, C. J. R., Parts, L., Schiffels, S., Liti, G. & Mustonen, V. Quantifying selection acting on a complex trait using allele frequency time series data. Mol. Biol. Evol. 29, 1187–1197 (2011).

33. Schraiber, J. G. A path integral formulation of the Wright–Fisher process with genic selection. Theor. Popul. Biol. 92, 30–35 (2014).

34. Ewens, W. J. Mathematical Population Genetics 1: Theoretical Introduction (Springer Science & Business Media, 2012).

35. Iranmehr, A., Akbari, A., Schlötterer, C. & Bafna, V. CLEAR: Composition of likelihoods for evolve and resequence experiments. Genetics 206, 1011–1023 (2017).

36. Liu, M. K. P. et al. Vertical T cell immunodominance and epitope entropy determine HIV-1 escape. J. Clin. Invest. 123, 380–393 (2013).

37. Moore, P. L. et al. Multiple pathways of escape from HIV broadly cross-neutralizing V2-dependent antibodies. J. Virol. 87, 4882–4894 (2013).

38. Doria-Rose, N. A. et al. Developmental pathway for potent V1V2-directed HIV-neutralizing antibodies. Nature 509, 55–62 (2014).

39. Liu, Y. et al. Selection on the human immunodeficiency virus type 1 proteome following primary infection. J. Virol. 80, 9519–9529 (2006).

40. Neher, R. A. & Leitner, T. Recombination rate and selection strength in HIV intra-patient evolution. PLoS Comput. Biol. 6, e1000660 (2010).

41. Batorsky, R. et al. Estimate of effective recombination rate and average selection coefficient for HIV in chronic infection. Proc. Natl Acad. Sci. USA 108, 5661–5666 (2011).

42. Wang, S. et al. Manipulating the selection forces during affinity maturation to generate cross-reactive HIV antibodies. Cell 160, 785–797 (2015).

43. Liao, H.-X. et al. Co-evolution of a broadly neutralizing HIV-1 antibody and founder virus. Nature 496, 469–476 (2013).

44. Ganusov, V. V. et al. Fitness costs and diversity of the cytotoxic T lymphocyte (CTL) response determine the rate of CTL escape during acute and chronic phases of HIV Infection. J. Virol. 85, 10518–10528 (2011).

45. Ganusov, V. V., Neher, R. A. & Perelson, A. S. Mathematical modeling of escape of HIV from cytotoxic T lymphocyte responses. J. Stat. Mech.: Theory Exp. 2013, P01010 (2013).

46. Kessinger, T., Perelson, A. & Neher, R. Inferring HIV escape rates from multi-locus genotype data. Front. Immunol. 4, 252 (2013).

47. Pandit, A. & de Boer, R. J. Reliable reconstruction of HIV-1 whole genome haplotypes reveals clonal interference and genetic hitchhiking among immune escape variants. Retrovirology 11, 11–56 (2014).

48. Leviyang, S. & Ganusov, V. V. Broad CTL response in early HIV infection drives multiple concurrent CTL escapes. PLoS Comput. Biol. 11, e1004492 (2015).

49. Beerenwinkel, N., Günthard, H. F., Roth, V. & Metzner, K. J. Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data. Front. Microbiol. 3, 329 (2012).

50. Turajlic, S., Sottoriva, A., Graham, T. & Swanton, C. Resolving genetic heterogeneity in cancer. Nat. Rev. Genet. 20, 404–416 (2019).

51. Good, B. H., McDonald, M. J., Barrick, J. E., Lenski, R. E. & Desai, M. M. The dynamics of molecular evolution over 60,000 generations. Nature 551, 45–50 (2017).

52. Kouyos, R. D., Althaus, C. L. & Bonhoeffer, S. Stochastic or deterministic: what is the effective population size of HIV-1? Trends Microbiol. 14, 507–511 (2006).

53. Cocco, S., Feinauer, C., Figliuzzi, M., Monasson, R. & Weigt, M. Inverse statistical physics of protein sequences: a key issues review. Rep. Prog. Phys. 81, 032601 (2018).

54. Socolich, M. et al. Evolutionary information for specifying a protein fold. Nature 437, 512–518 (2005).

55. Weigt, M., White, R. A., Szurmant, H., Hoch, J. A. & Hwa, T. Identification of direct residue contacts in protein–protein interaction by message passing. Proc. Natl Acad. Sci. USA 106, 67–72 (2009).

56. Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293–E1301 (2011).

57. Russ, W. P., Lowery, D. M., Mishra, P., Yaffe, M. B. & Ranganathan, R. Natural-like function in artificial WW domains. Nature 437, 579–583 (2005).

58. Ferguson, A. L. et al. Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity 38, 606–617 (2013).

59. Mann, J. K. et al. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS Comput. Biol. 10, e1003776 (2014).

60. Figliuzzi, M., Jacquier, H., Schug, A., Tenaillon, O. & Weigt, M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol. Biol. Evol. 33, 268–280 (2015).

61. Barton, J. P. et al. Relative rate and location of intra-host HIV evolution to evade cellular immunity are predictable. Nat. Commun. 7, 11660 (2016).

62. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).

63. Louie, R. H. Y., Kaczorowski, K. J., Barton, J. P., Chakraborty, A. K. & McKay, M. R. Fitness landscape of the human immunodeficiency virus envelope protein that is targeted by antibodies. Proc. Natl Acad. Sci. USA 115, E564–E573 (2018).

64. Quadeer, A. A., Louie, R. H. Y. & McKay, M. R. Identifying immunologically-vulnerable regions of the HCV E2 glycoprotein and broadly neutralizing antibodies that target them. Nat. Commun. 10, 2073 (2019).

65. Quadeer, A. A., Barton, J. P., Chakraborty, A. K. & McKay, M. R. Deconvolving mutational patterns of poliovirus outbreaks reveals its intrinsic fitness landscape. Nat. Commun. 11, 377 (2020).

66. Kimura, M. Diffusion models in population genetics. J. Appl. Probab. 1, 177–232 (1964).

67. Tataru, P., Bataillon, T. & Hobolth, A. Inference under a Wright-Fisher model using an accurate beta approximation. Genetics 201, 1133–1141 (2015).

68. He, Z., Beaumont, M. & Yu, F. Effects of the ordering of natural selection and population regulation mechanisms on Wright-Fisher models. G3: Genes, Genomes, Genetics 7, 2095–2106 (2017).

69. Tataru, P., Simonsen, M., Bataillon, T. & Hobolth, A. Statistical inference in the Wright-Fisher model using allele frequency data. Syst. Biol. 66, e30–e46 (2017).

70. Risken, H. The Fokker–Planck Equation: Methods of Solution and Applications 2nd edn (Springer, 1989).

71. Gaschen, B., Kuiken, C., Korber, B. & Foley, B. Retrieval and on-the-fly alignment of sequence fragments from the HIV database. Bioinformatics 17, 415–418 (2001).

72. Korber, B. et al. in Human Retroviruses and AIDS (eds Korber, B. et al.) 102–111 (Los Alamos National Laboratory, 1998).

73. Zanini, F., Puller, V., Brodin, J., Albert, J. & Neher, R. A. In vivo mutation rates and the landscape of fitness costs of HIV-1. Virus Evol. 3, vex003 (2017).

## Acknowledgements

We thank A.K. Chakraborty, C.J.R. Illingworth, B. Lee and J.G. Schraiber for helpful discussions and comments on the manuscript. The work of M.S.S., R.H.Y.L. and M.R.M. was supported by the Hong Kong Research Grants Council under grant number 16234716. M.S.S. and M.R.M. were also supported by the Hong Kong Research Grants Council under grant number 16201620, while R.H.Y.L. was also supported by Australia’s National Health and Medical Research Council under grant number APP1121643. The work of J.P.B. reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award R35GM138233.

## Author information

### Contributions

All authors designed research, developed methods, analyzed data, interpreted results and wrote the paper.

### Corresponding authors

Correspondence to Matthew R. McKay or John P. Barton.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data

### Extended Data Fig. 1 MPL accurately recovers selection coefficients from complex simulated evolutionary trajectories.

a, Trajectories of mutant allele frequencies over time exhibit complex dynamics in a Wright–Fisher (WF) simulation with a simple fitness landscape. b, Separate views of individual trajectories for beneficial, neutral, and deleterious mutants (left panel) and inferred selection coefficients (right panel) for a single simulation run. Note that many neutral mutations exhibit temporal variation similar to beneficial or deleterious mutations. MPL estimates the underlying selection coefficients used to generate these trajectories, presented as mean values ± one theoretical standard deviation, and distinguishes between beneficial, neutral, and deleterious mutations, using Eq. (11). Dashed lines mark the true selection coefficients. c, Distributions of selection coefficient estimates across n = 100 replicate simulations with identical parameters in the special case of perfect sampling. MPL is also robust to finite sampling constraints, accurately classifying beneficial (d) and deleterious (e) mutants even when the number of sequences sampled per time point $$n_s$$ is low and the spacing between time samples Δt is large. Simulation parameters. L = 50 loci with two alleles at each locus (mutant and WT): ten beneficial mutants with s = 0.025, 30 neutral mutants with s = 0, and ten deleterious mutants with s = −0.025. Mutation probability $$\mu = 10^{-3}$$, population size $$N = 10^{3}$$. Initial population composed of approximately equal numbers of three random founder sequences, evolved over T = 400 generations.
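The simulation setup in this caption can be illustrated with a minimal Python sketch. This is not the code used for the paper (that is in the repository listed under Code availability); the function name `simulate_wf`, the additive fitness form, and the symmetric per-locus mutation step are assumptions made for illustration only.

```python
import numpy as np

def simulate_wf(s, N=1000, mu=1e-3, T=400, n_founders=3, seed=0):
    """Evolve N binary genomes for T generations.

    Returns mutant allele frequencies with shape (T + 1, L).
    """
    rng = np.random.default_rng(seed)
    L = len(s)
    # Assumption: founders are random binary sequences, assigned to
    # individuals uniformly at random (approximately equal numbers).
    founders = rng.integers(0, 2, size=(n_founders, L))
    pop = founders[rng.integers(0, n_founders, size=N)]
    freqs = [pop.mean(axis=0)]
    for _ in range(T):
        # Assumption: additive fitness f = 1 + sum_i s_i g_i.
        fitness = 1.0 + pop @ s
        # Selection and drift: parents are drawn in proportion to fitness.
        parents = rng.choice(N, size=N, p=fitness / fitness.sum())
        pop = pop[parents]
        # Assumption: each locus flips state with probability mu.
        flips = rng.random(pop.shape) < mu
        pop = np.where(flips, 1 - pop, pop)
        freqs.append(pop.mean(axis=0))
    return np.array(freqs)

# Ten beneficial, 30 neutral and ten deleterious mutants, as in the caption.
s = np.concatenate([np.full(10, 0.025), np.zeros(30), np.full(10, -0.025)])
x = simulate_wf(s)
```

Each row of `x` holds the 50 mutant allele frequencies at one generation, which is the kind of trajectory shown in panel a.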

### Extended Data Fig. 2 MPL improves selection inference for simulated data sets.

In Fig. 2, we showed the performance of MPL and existing methods on simulated test data, averaged over n = 100 replicate simulations with identical parameters. Here we show the improvement of MPL over existing methods for the classification of beneficial (a) and deleterious (b) mutations, and for the error in the estimated selection coefficients (c), for each individual simulation. Selection is more difficult to infer in some simulated data sets, but results from MPL show better agreement with the true parameters in the vast majority of simulations. Simulation parameters. L = 50 loci with two alleles at each locus (mutant and WT): ten beneficial mutants (with s = 0.1 for complex, s = 0.025 for simple), 30 neutral mutants (s = 0 for both scenarios), and ten deleterious mutants (s = −0.1 for complex, s = −0.025 for simple). Mutation probability $$\mu = 10^{-4}$$, population size $$N = 10^{3}$$. For the complex case, the initial population is composed of equal numbers of five random founder sequences, evolved over T = 310 generations. The recorded trajectory used for inference begins at generation 10. For the simple case, the initial population begins with all WT sequences, evolved over T = 1000 generations.

### Extended Data Fig. 3 MPL performs well in the presence of recombination.

a, Classification performance of MPL is robust to variation in the per-locus recombination probability, r. Results are shown for n = 100 independent Monte Carlo runs. The lower and upper edges of each box correspond to the 25th and 75th percentiles, the bar corresponds to the median, and the top and bottom whiskers show the maximum and minimum values within 1.5× the interquartile range from the box. Linkage effects in the data decrease as the recombination probability increases. As a measure of the linkage disequilibrium in the data, we plot histograms (b) of the covariance ($$x_{ij} - x_i x_j$$) of mutant allele frequencies integrated over time (300 generations) for a range of recombination probabilities. The number of mutant pairs with strong pairwise covariance values decreases with increasing values of r, indicating lower linkage disequilibrium. Simulation parameters. Same as those of the simple scenario used in Fig. 2, that is, L = 50 loci with two alleles at each locus (mutant and WT): ten beneficial mutants (s = 0.025), 30 neutral mutants (s = 0), and ten deleterious mutants (s = −0.025). Mutation probability $$\mu = 10^{-4}$$, population size $$N = 10^{3}$$, $$r = \{0, 10^{-5}, 10^{-4}, 10^{-3}\}$$. The initial population begins with all WT sequences, evolved over T = 300 generations.
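The linkage measure in panel b can be sketched as follows: at each generation, the pairwise covariance is the frequency at which two mutants co-occur minus the product of their individual frequencies, and these values are summed over time. The function name `integrated_covariance`, the input layout (one binary genotype matrix per generation) and the constant time step are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def integrated_covariance(populations, dt=1.0):
    """Sum over time of the allele frequency covariance x_ij - x_i x_j.

    populations: iterable of (N, L) binary genotype arrays, one per generation.
    dt: spacing between consecutive generations (assumed constant here).
    """
    total = None
    for pop in populations:
        pop = np.asarray(pop, dtype=float)
        x_i = pop.mean(axis=0)               # single-locus mutant frequencies
        x_ij = pop.T @ pop / pop.shape[0]    # pairwise co-occurrence frequencies
        cov = x_ij - np.outer(x_i, x_i)
        total = cov * dt if total is None else total + cov * dt
    return total

# Toy example: two loci that always carry the mutant allele together.
pop = np.array([[1, 1], [1, 1], [0, 0], [0, 0]])
C = integrated_covariance([pop, pop])
```

In this toy example the two loci are perfectly linked, so the off-diagonal entry of `C` equals the diagonal entries. For loci that vary independently, the off-diagonal entries are zero.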

### Extended Data Fig. 4 Performance of MPL on data with HIV-1-like sampling profiles.

a, The number of sequences per time point $$n_s$$ is drawn from a binomial distribution with n = 1000 and p = 0.0139, with the same mean as that of the HIV data. b, The time between samples is drawn from a mixture of two gamma distributions f(x;k,θ), where k and θ are the shape and scale parameters. The mixture distribution has the form w1 × (f(x;k1,θ1) + m1) + w2 × (f((k2θ2 + m2x);k2,θ2) + m2) where m1 = 0, m2 = 120, are constants added to shift the mean, k1 = 3.5, k2 = 3, θ1 = 8.4, θ2 = 2, while w1 = 0.87, and w2 = 0.13 are the mixing weights. The parameters were chosen to mimic the distribution of the time between samples of the HIV data analyzed in the manuscript (Supplementary Table 1). c, The number of generations used for inference is also drawn from a mixture of two gamma distributions, having the form given above and with parameters k1 = 5.5, k2 = 15, θ1 = 7.2, θ2 = 8, m1 = 5, m2 = 143, w1 = 0.21, and w2 = 0.79. The parameters were chosen to mimic the distribution of the trajectory lengths of the HIV data analyzed in the manuscript (Supplementary Table 1). d, A typical sampled trajectory of allele frequencies: beneficial (red), deleterious (blue) and neutral (gray). Dashed lines indicate the sampling time points. e, The AUROC performance of identifying beneficial and deleterious selection coefficients under perfect and heterogeneous sampling scenarios. Results are evaluated for those sites that are polymorphic in the heterogeneous sampling case. Results are shown for n = 100 independent Monte Carlo runs. The lower and upper edges of each box correspond to the 25th and 75th percentiles, the bar corresponds to the median, and the top and bottom whiskers show the maximum and minimum values within 1.5× the interquartile range from the box. Simulation parameters. Population size N = 1000, L = 50 loci with two alleles at each locus (mutant and WT): ten beneficial mutants with selection coefficients s uniformly distributed over the range [0.075, 0.125], 30 neutral mutants with s = 0, and ten deleterious mutants with selection coefficients uniformly distributed over the range [−0.125, −0.075]. Mutation probability per site per generation $$\mu = 10^{-4}$$, recombination probability per site per generation $$r = 10^{-4}$$.
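The finite sampling step in panel a can be sketched in Python as below. Only the binomial draw for the number of sequences (n = 1000, p = 0.0139) comes from the caption; the function name `sample_frequencies`, the guard against empty samples and the toy population are assumptions for illustration. The gamma mixture for sampling times is not reproduced here.

```python
import numpy as np

def sample_frequencies(pop, rng, n=1000, p=0.0139):
    """Draw n_s ~ Binomial(n, p), then return n_s and sampled allele frequencies."""
    n_s = int(rng.binomial(n, p))
    # Guard (assumption): keep at least one and at most N sequences.
    n_s = min(max(n_s, 1), pop.shape[0])
    # Sample sequences without replacement from the current population.
    idx = rng.choice(pop.shape[0], size=n_s, replace=False)
    return n_s, pop[idx].mean(axis=0)

rng = np.random.default_rng(1)
# Toy population of 1000 sequences with 50 loci; not the HIV data.
pop = (rng.random((1000, 50)) < 0.3).astype(int)
n_s, x_hat = sample_frequencies(pop, rng)
```

With these parameters the expected number of sequences per time point is n × p = 13.9, so the sampled frequencies `x_hat` are noisy estimates of the true population frequencies.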

### Extended Data Fig. 5 Most genetic variants have little effect on inferred selection at other sites, but a small minority have strong effects.

After computing the pairwise effects $$\Delta \hat s_{ij}$$ of each variant i on the inferred selection coefficient for each other variant j, referred to as the target, we summed the absolute value of the $$\Delta \hat s_{ij}$$ values over all target variants j to quantify the influence of each variant i on selection at other sites. One histogram is shown for each sequencing region, for each individual. For the vast majority of variants, the total effect on selection at other sites is near zero. However, a small minority have strong effects. We defined a variant to be ‘highly influential’ if the sum of the absolute values of the $$\Delta \hat s_{ij}$$ over all targets j was larger than 0.4 (=40%).
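The influence score defined above amounts to a row sum of absolute values followed by a threshold. A minimal Python sketch, assuming the pairwise effects are stored in a square array `delta_s` with `delta_s[i, j]` the effect of variant i on target j (the function name and array layout are hypothetical):

```python
import numpy as np

def highly_influential(delta_s, threshold=0.4):
    """Return each variant's total influence and a flag for 'highly influential'."""
    d = np.abs(np.asarray(delta_s, dtype=float))
    # Only effects on other variants count, so exclude the diagonal (j == i).
    np.fill_diagonal(d, 0.0)
    influence = d.sum(axis=1)
    return influence, influence > threshold

# Toy values for three variants; not taken from the HIV-1 data.
delta_s = [[0.00, 0.30, -0.20],
           [0.01, 0.00, 0.00],
           [0.00, -0.05, 0.00]]
influence, flags = highly_influential(delta_s)
```

Here only the first variant exceeds the 0.4 cutoff, because its effects on the other two variants sum to 0.5 in absolute value.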

### Extended Data Fig. 6 Variants that strongly influence inferred selection at other sites often act across large genomic distances.

Plot of all linkage effects on inferred selection coefficients $$\Delta \hat s_{ij}$$ for which |$$\Delta \hat s_{ij}$$| > 0.004. One plot is shown for each sequencing region, for each individual. These strong effects of linkage on inferred selection coefficients can act at long range across the genome. Approximately 40% of highly influential variants, characterized by strong effects on inferred selection at other sites, lie within identified CD8+ T cell epitopes. The 5′ region for individual CH607 is not shown because no $$\Delta \hat s_{ij}$$ values are larger than the cutoff.

### Extended Data Fig. 7 For most variants, effects on inferred selection coefficients for other variants, and linkage disequilibrium, are stronger at smaller genomic distances.

a, Histogram of the absolute value of linkage effects on inferred selection coefficients for other variants |$$\Delta \hat s_{ij}$$|, divided into subgroups based on the distance along the genome between variant i and target variant j. Consistent with intuition, the large effects on inferred selection coefficients occur most frequently for different variants that occur at the same site on the genome (that is, distance equal to zero). ‘Interactions’ between such variants are necessarily perfectly competitive because only a single nucleotide is allowed at each position in the genetic sequence. For most variants, stronger linkage effects on inferred selection coefficients are more frequently observed for other variants within a distance of ten base pairs (bp). Large linkage effects for pairs of variants within a distance of 30 bp, the approximate length of a linear T cell epitope, occur appreciably more frequently than for pairs of variants at greater genomic distances. However, there is little difference in the distribution of linkage effect sizes for pairs of variants that are between 31 bp and 100 bp apart compared to pairs of variants that are more than 100 bp apart. Nonetheless, some strong linkage effects on inferred selection are observed at long genomic distances (see Fig. 4 and Supplementary Fig. 5). b, Linkage disequilibrium, measured by the absolute value of the off-diagonal entries of the integrated allele frequency covariance matrix, Cint. Like the |$$\Delta \hat s_{ij}$$|, linkage decays along with the distance between variants along the genome. However, we note that linkage disequilibrium values in general appear to be more long-ranged.

### Extended Data Fig. 8 Estimates of selection coefficients in a simple example of clonal interference.

a, Two escape mutations arise in the TW10 epitope targeted by individual CH58 and compete for dominance. b, MPL infers that both TW10 escape variants are positively selected. Estimates based only on the trajectories of individual variants infer substantial positive selection solely for the 1514A variant, which fixes. The magnitude of selection inferred with the independent model is also smaller than that inferred by MPL. c, Inferred selection in the HIV-1 5′ half-genome sequence for CH58. Inferred selection coefficients are plotted in tracks. Coefficients of transmitted/founder nucleotides are normalized to zero. Tick marks denote polymorphic sites. Inner links, shown for sites connected to the TW10 epitope, have widths proportional to matrix elements of the inverse of the integrated covariance. Linked sites affect selection estimates within the epitope.

### Extended Data Fig. 9 Estimates of selection coefficients in a complex example of clonal interference.

a, Multiple escape variants for the Nef epitope EV11, targeted by individual CH131, interfere with one another over the course of nearly one year. Here we have omitted the trajectories for transient variants with a deletion at sites 8988a–8988c, which are insertions with respect to the HXB2 reference sequence. b, MPL infers that all nonsynonymous EV11 escape variants are positively selected. Variants 9000C and 9006T are both synonymous, and are inferred to be nearly neutral by MPL. As in previous examples, inferences based only on the trajectories of individual variants infer substantial positive selection solely for variants that are polymorphic at the final time point, or where the transmitted/founder (TF) allele at the same site appears strongly selected against. In the latter case, positive selection is inferred because all selection coefficients are normalized such that the selection coefficient for the TF variant is zero. This is why the independent model infers 8988T to be beneficial despite its low frequency at the final time point. Note that the independent model also infers the synonymous mutation 9000C to be beneficial. c, Inferred selection in the HIV-1 3′ half-genome sequence for CH131. Inferred selection coefficients are plotted in tracks. Coefficients of TF nucleotides are normalized to zero. Tick marks denote polymorphic sites. Inner links, shown for sites connected to the EV11 epitope, have widths proportional to matrix elements of the inverse of the integrated covariance. Linked sites affect selection estimates within the epitope.

### Extended Data Fig. 10 Inferred selection coefficients across patients using different conventions for data processing.

Inferred selection coefficients are highly similar following different choices for processing the sequence data. Pearson $$R^2$$ values between inferred selection coefficients range from 0.97 to 1.00, with an average of 0.99. Data processing conventions. Reference: current data processing conventions. Max Δt = 200/400: remove time points that are more than 200/400 days beyond the last included time point (reference: 300 days). Max gap freq. = 80%/99%: remove sites where >80%/99% of observed variants are gaps (reference: 95%). Max gap num. = 50/500: remove sequences with >50/500 gaps in excess of subtype consensus (reference: 200). Min seqs. = 2/6: remove time points with <2/6 available sequences (reference: 4). Remove ambiguous: remove sequences that contain ambiguous nucleotides if any other nucleotide variation is observed at the same site. LTR, long terminal repeat.

## Supplementary information

### Supplementary Information

Supplementary Figs. 1 and 2, Supplementary Table 1 and Supplementary Text.

## Rights and permissions

Sohail, M.S., Louie, R.H.Y., McKay, M.R. et al. MPL resolves genetic linkage in fitness inference from complex evolutionary histories. Nat Biotechnol 39, 472–479 (2021). https://doi.org/10.1038/s41587-020-0737-3
