Introduction

Variation in multiple host factors, both genetic and non-genetic, can influence the genetic composition of infecting pathogens and their subsequent evolutionary trajectory within a host. Examples, include human leukocyte antigen (HLA) restriction of epitopes and subsequent escape in HIV-1 and other viruses1,2,3,4,5,6, drug-induced selection pressure and appearance of drug-resistance mutations in viruses, bacteria and eukaryotic pathogens7,8,9,10,11,12,13 and interactions between polymorphic red-blood cell types and malarial disease14,15. Consequently, the molecular mechanisms underlying diverse pathogen-related processes including infection, invasion, immune-response and drug resistance, can potentially be uncovered by studying the association between host factors and the genetic composition of pathogens16,17,18,19,20,21,22.

However, while it is now feasible to collect large-scale data on pathogen genomic variation and host parameters, reliable hypothesis-free detection of biologically meaningful associations between host factors and pathogen diversity is challenging for several reasons; the greatest of which is population structure. Host genetic variation often has a strong spatial structure arising from historical patterns of isolation and gene flow. Because of their commensal nature and mode of transmission, most pathogens are likely to share some of this structure, leading to non-causal association between host and pathogen genetic variation. For the same reason, geographical heterogeneity among host factors that influence pathogens causally (e.g., local variation in treatment protocols) may also lead to indirect correlation between host and pathogen genetics.

A second major challenge is statistical power. Consider searching genome-wide for associations between host genomic factors and pathogen genomic factors, in which the number of tests carried out could be in the billions. Naïve correction for multiple testing is likely to eliminate power for anything except the strongest associations. Consequently, there is a need for approaches to association testing that enable prior information about the likely structure of association to be used in the search for signal.

To date, various approaches to testing for association between pathogen genetic variation and host factors, both genetic, as in the case of classical HLA loci, and non-genetic, as in the case of drug resistance, have been developed16,17,18,23,24. Standard association tests, which suffered the problems of stratification described above, were superseded by methods that utilise an inferred phylogeny of pathogen samples to correct for relatedness17,22,25,26,27,28. Moreover, additional power can be obtained by explicit modelling of the processes of escape and reversion in the context of HLA restriction of HIV-117,22,23. However, such approaches have a number of limitations. For example, they do not consider recombination in pathogen genomes, they do not make use of all the data available, they typically do not infer the strength of host-induced selection or combine information across nearby sites in the context of epitope mapping and the more sophisticated approaches are often computationally prohibitive for very large samples.

To address these limitations we have developed a model-based approach to inferring the effect of host factors on pathogen genomic variation. The approach is motivated by the presence of extremely large databases of pathogen genomic data. Given the size of the databases (e.g., the Los Alamos database on HIV-1 has over 150,000 sequences encoding a portion of reverse transcriptase), it is likely that there are sequences closely related to those that infected the individuals within a particular study in question. Our approach aims to infer the most likely ancestral infecting sequence (which may be a mosaic of those in the database) for each individual and therefore to identify the pathogen evolution that has resulted directly from exposure to the current host. Moreover, we model correlations in evolution among adjacent sites within the pathogen that arise through being a shared target of selection (e.g., within a restricted HLA epitope or within a region of a protein where drug resistance and consequent compensatory mutations can evolve). A diagram of the underlying process and our modelling framework is shown in Fig. 1. Our methods and inference regime are detailed in the Methods and Supplementary Methods. We show that the approach has substantially improved power to detect sites under selection and, through applications to the evolution of drug resistance and escape from HLA-drive immune-restriction in HIV-1, how the method can deliver new insight into important biological processes.

Results

Simulations and methods comparison

To validate the method and compare its performance to alternative approaches for detecting host-induced selection we carried out three simulation studies. First, we simulated data under the fitted model to evaluate power and accuracy. We used HIV-1 protease sequence data from three major public databases as a reference (Table 1; $$n = 162,901$$), and simulated 100 replicates of a data set of 460 study sequences with six different HLA-induced selection profiles, four of which were common ($$n = 100$$ each) and two of which were rare ($$n = 30$$ each); Supplementary Fig. 1. We performed simulations with recombination ranging between 0 and 0.01 (a rate of 0.01 meaning that on average the ancestor of each study sample copies 100 contiguous codons from a reference sample before recombination). We performed joint inference on the HLA-associated selection parameters across the viral sequence and examined how the size of the sample-specific reference panel (range: 10–100) affects estimates. Motivation behind the use of sample-specific reference panel and our approach is given in the Methods subsection: Restricting DB: sample-specific reference data sets. In the absence of recombination, parameter estimates are largely unbiased and accurate, although the rate of reversion has high variance (Fig. 2; Supplementary Fig. 2). Estimates for rare alleles perform similarly to those for common alleles (Fig. 2b). We find little impact of the size of the sample-specific reference data set and the posterior distributions are well calibrated (Supplementary Fig. 3). Recombination leads to a downward bias in estimates of recombination and selection intensity and upward bias in the rate of reversion (Fig. 2c; Supplementary Fig. 4) and consequently poorer posterior calibration at sites under selection (Supplementary Fig. 4). Nevertheless, the inferred profiles of host-dependent and host-independent selection pressures remain strongly correlated to the truth. We also considered the impact of error in choosing the sample-specific reference panel by forcing inclusion of the sequences actually copied. We find that forcing the true ancestors to be included improves accuracy by only a small margin (Supplementary Fig. 5) indicating that 100 potential ancestors chosen through Hamming distance is typically sufficient.

In the second set of simulations we assessed robustness by simulating data under a birth–death model (without recombination), setting the current infected population size at 1,000,000 and sampling proportion at 10% to define DB and sample 1,460 query sequences with associated host HLA information (see Methods subsection: Simulation study 2: simulating a sampled birth–death process for details). We find some attenuation of the selection signal (and over-estimation of the reversion parameter) but strong correlation between the inferred and estimated strengths of host-dependent and host-independent selection (Fig. 2c; Supplementary Fig. 6).

The birth–death simulations also enable comparison of our approach with five alternative methods for identifying sites under host factor specific selection (Supplementary Methods subsection: Simulation study 2: methods comparison): Fisher’s exact test (as in Moore et al.16); a phylogenetically corrected Fisher’s exact test (as in Bhattacharya et al.22); an approximate escape rate estimate (as in Fryer et al.18); a ‘Phylogenetic dependency networks’ approach—PhyloD (as in Carlson et al.25); and PhyloD OR (as in Carlson et al.21). In each case, we assume that the wild-type consensus strain is correctly defined and set any non-synonymous difference from consensus as a candidate for escape. For each method, we obtained p values or parameter estimates that provide a metric for the strength of selection conditional on each HLA type and generated receiver operating characteristic curves for each simulation (Fig. 2e). We find that our approach dramatically increases sensitivity for a given false-positive rate (FPR). For example, at FPR = 0.01, our sensitivity is 0.61, compared to the second best-performing method, PhyloD, at 0.13. To assess whether the difference between methods decreases with sample size, we repeated the analysis with 3000 query sequences. Our method achieves a sensitivity of 0.81 (at an FPR of 0.01), compared to the next best-performing method, PhyloD, at 0.22 (Supplementary Fig. 7). We conclude that augmenting study data with a large reference data set and modelling the escape process explicitly provides a substantial gain in the ability to identify sites under host factor specific selection.

To examine the impact of a reduced reference sequence data set, we randomly subsampled 10 and 1% of DB from Simulation study 2. We also considered a leave-one-out (LOO) strategy, where the reference data are augmented with study sequences, with the exception of the sample under consideration. We find that larger reference panels achieve greater accuracy. For example, at FPR = 0.01, sensitivity ranges from 0.61 with the full reference to 0.49 with 10% and 0.32 with 1%. The LOO strategy only boosts power if the reference panel is of the order of the sample size. However, in the absence of any reference panel, our method with a LOO strategy still achieves considerably greater power (sensitivity of 0.34 at FPR = 0.01) than any competing method (Supplementary Fig. 8).

In the third simulation study (Methods subsection: Simulation study 3: the effect of population differentiation), we used an empirical bootstrap approach to assess robustness to population structure and the impact of both differentiation between D and DB and relatedness within D, taking the parameters estimated from empirical data (see Results: HLA-associated selection). We used viral sequence data from Botswana ($$n = 343$$, protease only) to simulate ancestors and the empirical distribution of HLA genotype frequencies from this sample. We measured the accuracy of estimates (both in terms of bias and calibration of posteriors) for different protease reference panels, ranging from the ideal panel (i.e., the ancestors used in the simulation), to that used in this study, to one that considered only the 57,969 sequences known not to come from Botswana. The latter panel maximises differentiation and also introduces the potential problem of sequences within D typically being more closely related to each other than to members of DB. Results are summarised in Supplementary Figs. 911. In terms of bias (summarised by the frequency-weighted root mean square error (RMSE) between true and inferred HLA-associated selection profiles) we find relatively little differences between panels, with the most diverged reference panel being only 3% worse than the best. The estimator was well calibrated for each reference panel used: 95% and 50% credible intervals contained the true value at between 91–98% and 45–57% of sites across bootstrap replicates, respectively. Typically, the LOO strategy had little effect on estimator performance. We conclude that the approach gives substantial robustness to population structure and within-sample relatedness, enabling the integration of highly diverse data sets, though note that very highly diverged reference data sets (for example, consisting of a separate subtype) will perform poorly.

Drug-associated selection

To provide empirical validation of our approach for detecting host factor dependent selection we analysed HIV-1 data from the Stanford drug resistance database9 in which viral sequences are linked to antiretroviral drug treatment history of the patient. We therefore aim to learn drug-treatment specific evolution, while other selective factors (e.g., selection due to cytotoxic-T-lymphocyte (CTL) pressure, the antibody response, or the APOBEC3G response) will be captured by $${\boldsymbol{\omega }}$$. We analysed protease and reverse transcriptase independently and excluded integrase due to lack of data9. Priors on parameters are given in Supplementary Table 1. We defined the collection of sequences in hosts not receiving any therapy at the time of sequencing as DB, and viral sequences in hosts receiving any treatment (coupled with their drug regime data) as D. To guarantee the presence of variation in drug selection we introduce a null class by randomly assigning 2,000 sequences from DB to D. Details of data preparation are given in the Supplementary Methods subsection: Query and reference data set preparation.

We estimated $$\gamma _{h,i}$$ for protease and reverse transcriptase inhibitors prescribed to at least 10 individuals in D (Fig. 3a; Supplementary Fig. 12), and identified sites with very strong evidence for drug-induced escape (median estimate of selection factor >2 and at least 97.5% of the posterior >1; top-tier) and moderate evidence (median estimate of selection factor >1.5 and at least 90% of the posterior >1; second-tier). We compared the enrichment of sites identified to known (experimentally validated) major drug resistance mutations (DRMs)9 (Fig. 3b). For some drugs, DRM data was lacking because the drug is no longer commonly used (DDC and DLV), was an experimental drug that failed (αAPA, and ADV) or is now used in small amounts with other drugs (RTV). We observed strong, statistically significant and consistent enrichment of DRMs at sites identified as selected; with elevated enrichment (as measured by odds ratio) in the strongest DRMs and sites with strongest evidence for selection. For example, of the 31 strongest DRMs in reverse transcriptase, 11 are found in the top-tier selected sites and a further 6 in the second-tier. Of the 41 apparent false positive sites, 17 are described as being selected for in the literature (but not classified as a major DRM in the Stanford database); for example mutation at codon 11 of reverse transcriptase is known to be associated with minor reductions in APV susceptibility29, but is not classified as a major DRM9. A further three apparent false positive are DRMs for treatment cocktails including the drug, nine are sites of DRMs for drugs in the same class but where more specific information is not provided, seven are DRMs for different drugs in the same class and only five appear to have no support in the literature (Supplementary Table 2). False negatives are caused by sites where the selected codon is more than one nucleotide change away from the consensus (five mutations), one is an insertion, eight have in vitro support for drug resistance but are not documented as being selected for in vivo, seven have no compelling evidence in the literature and three are genuinely missed (Supplementary Table 3)9. Further evidence for the biological validity of the inferred profiles is given by the co-clustering of related drugs in trees constructed from the selection intensity profiles (Supplementary Fig. 13). In summary, we find that the method can identify sites selected by specific drug treatments in vivo, either directly through resistance or indirectly through compensation for resistance mutations that reduce fitness.

HLA-associated selection

While the extent to which host HLA alleles can influence patterns of escape and reversion in HIV-1 has been studied extensively, the methods developed here provide additional power and accuracy as well as provide control against factors such as population stratification and recombination. Moreover, the methody also provides a framework in which to combine data from multiple previous studies. We have assembled nearly 3000 HIV-1 sequences from patients with known HLA class I genotypes from six studies, of European and African ancestry representing a mixture of subtype B and C sequences (Table 2). We augmented the data with HIV-1 genome data taken from three sources, representing 250,000–280,000 sequences depending on the gene (Table 1). We analysed the data in two ways, first by considering escape as deviation from subtype B and also by considering escape from subtype C. We expect these analyses to yield similar results, except at sites where subtypes differ systematically. We analysed data from protease and reverse transcriptase and inferred the impact of alleles at HLA class I loci (at 2-digit resolution) jointly. We restricted analysis to HLA alleles with at least 10 representatives in the data. We truncated to 2-digit HLA resolution to boost power as the majority of alleles are rare (61% of 4-digit HLA alleles have <10 copies, compared to 30% at 2-digit resolution). Further, 27.3% of the individuals for whom we have HLA information were typed to the 2-digit level. Full details of data preparation are given in the Supplementary Methods subsection: Query and reference data set preparation. We define ‘top-tier’ HLA-associated candidate sites as those where the median $$\gamma _{h,i} \ > \ 2$$ and the lower 2.5% quantile > 1, and ‘second-tier’ candidate sites as those where the 10% quantile of $$\gamma _{h,i} \ > \,1$$.

To illustrate the value of our approach, we first considered the B*51 restricted epitope TAFTIPSI in reverse transcriptase (Fig. 4). We find a strong signal of selection within the epitope, with the escape site (position 135) experiencing the strongest rate elevation. Interestingly, a variant at site 129 which is known to abrogate CTL recognition in vitro30, is highly conserved in vivo and shows no signal for selection, presumably due to other fitness consequences. In contrast, we also see strong evidence for B*51 associated escape around codons 173 and 195, which have not been reported previously. Other known allele–epitope combinations that we recover include, for example, B*18 associated selection around codon 13831,32,33.

To compare to experimental and previous work on epitope restriction we classify documented epitopes into an A-list, which represents the best-defined experimentally determined HIV-1 CTL epitopes, updated yearly34 and a B-list, which refers to the entire collection of epitopes reported in the literature35. Among the A-list epitopes, some have documented CTL escape variants, either in vivo or in vitro. See Supplementary Tables 4 and 5 for details and Supplementary Notes. We also consider an in silico set of predictions for strongly-binding anchor residues generated by Motif-Scan36. We compare these sites to our estimates of HLA-associated selection across protease and reverse transcriptase (Fig. 5, Supplementary Figs. 913).

To assess overlap between sites identified we measured the enrichment of sites identified as under selection and used a permutation strategy to assess significance (Methods subsection: Testing overlap with A-list and B-list epitopes). Across protease and reverse transcriptase we find strong enrichment of top-tier signals of selection at A-list epitopes for HLA-B (Table 3). For example, top-tier sites have an odds ratio of 63 for enrichment in A-list epitopes ($$p = 0.003$$). As the confidence in the epitope collection or the strength of selection decreases, so does the enrichment. We find no evidence for enrichment of selection at computationally predicted epitopes and little evidence for overlap of HLA-A associated sites, though in general we note that HLA-A alleles show much weaker evidence for selection in general (Supplementary Figs. 14 and 17). We also find very little evidence for HLA-C driven selection at all (Supplementary Figs. 16 and 18).

Comparing our results to those in Carlson et al.37, focusing on the well-studied HLA-B associations, we find that 100% of the sites that are both in our top-tier and $$q \ < \ 0.2$$ in Carlson et al.37 reside in A-list or B-list epitopes (Supplementary Tables 6 and 7). Subsets of method specific HLA associations also lie in A-list and B-list epitopes, which, given the overlap in subtype C viral samples (100% of subtype C viral sequence data with associated host HLA information was present in Carlson et al.37) suggest that the approaches are complementary, each likely detecting distinct portions of the true underlying signal. For example, our notion of escape means that we cannot detect HLA-associated selection to the consensus codon.

Overall, only a small number of sites identified here as being HLA-associated have been described and experimentally validated previously; four in reverse transcriptase (Table 4), and one in protease (Table 5). Conversely, not all previously reported escape mutation sites are identified here. In some cases this may be due to low-sample size in this study. However, it is also possible that earlier studies failed to account adequately for linkage disequilibrium between HLA alleles. We also identified potentially novel signals of HLA-driven epitope escape. For example, B*45 shows multiple signals of selection around codon 200 of reverse transcriptase (Fig. 5), yet there are no reported epitopes. This may reflect a historical bias towards studies being carried out in European-ancestry populations with subtype B viruses where some HLA alleles and viral epitopes are rare. However, we also note that results between the subtype B and subtype C consensus analyses are highly concordant, with a few interesting exceptions. For example, B*58 shows evidence for inducing strong selection away from the subtype B consensus around codon 122–123 of reverse transcriptase but not away from the subtype C consensus. Interestingly, this is a position of divergence between subtype B and subtype C viruses and B*58 is typically more common in populations with HIV-1 subtype C viruses, suggesting that B*58-induced selection pressure may have driven the fixation of the difference between subtypes. More generally, we found that average selection intensity away from the subtype C consensus in individuals carrying subtype B virus is significantly stronger than the average selection intensity away from the subtype B consensus (Supplementary Fig. 20: protease $$p = 2.8 \,\times 10^{ - 5}$$, reverse transcriptase $$p = 2.0 \,\times 10^{ - 6}$$; see Methods subsection: The impact of HLA on HIV-1 sequence evolution for details), though only reverse transcriptase was significant in the inverse setting ($$p = 0.0058$$, protease $$p = 0.70$$). Moreover, we estimate that the combined contribution of HLA-associated diversifying selection to non-synonymous viral sequence change represents 59 and 77% of selection from subtype B and C consensus in protease, and 61 and 78% in reverse transcriptase (Methods subsection: Testing the impact of HLA on HIV-1 subtype differentiation, and Supplementary Table 8). In summary, we estimate that HLA-driven selection accounts for more than half of all HIV-1 coding changes and has contributed to diversification between subtypes.

To assess the relationship between sequence similarity between HLA alleles and the inferred HLA-induced selection profiles, we measured the concordance between dendrograms inferred from pairwise differences of classical HLA alleles protein sequences and dendrograms inferred from the estimated selection profiles (Supplementary Methods; Dendrograms of selection profiles and comparing topologies). We find that closely related HLA alleles have closely related selection profiles for HLA-A (odds ratio = 2.1; permutation p value = 0.017) and HLA-B (odds-ratio = 3.1; permutation p value = 0.014), but not HLA-C (odds ratio = 0.71; permutation p value = 0.71). However, deeper structure within the inferred trees shows little concordance. These findings provide biological validation for our results and suggest that future work to incorporate relationships between HLA allele would be valuable.

Finally, as noted recently38, a number of sites both show evidence of HLA-driven selection as well as being associated with evolution in response to drug treatment (for example, around codon 70 in reverse transcriptase for B*45). Depending on whether the two types of selection act in the same or different directions, such pressures could either speed up or delay the origin of drug resistance. These results suggest that HLA genotype could be a predictor of the response to drug treatment.

Discussion

Differences between hosts, such as in their immune system, or treatment received, can lead to host-specific selection regimes and subsequent adaptation by the pathogen. If within-host adaptation is at a cost to intrinsic fitness, then, on further transmission, selection may favour reversion to the fitter, ancestral state. To learn about such forces, the ideal experiment is to compare the genetic composition of the original infecting pathogen strain and that present sometime after successful infection across a large number of hosts that differ only in the factors of interest. While only strictly possible in experimental systems, observational approaches aim to learn about the same processes, but, to do so, have to make assumptions about the (unobserved) pool of pathogens to which infected individuals were exposed and the distribution of potential confounding factors. If assumptions are not met, power to detect true associations will be reduced and there is a risk of false positive results. The introduction of phylogenetic methods for association testing in pathogen genomics17,20,21,22,25,26,39 provided a partial solution to many issues, by using genome-wide relatedness between pathogens as a proxy for correlation in unobserved confounding (such as population structure), similar to the use of principal components in the analysis of human genetic association40. Moreover, by developing explicit models of molecular evolution in response to host factors, it is possible to learn relative or even actual rates of escape and reversion17. However, phylogenetic methods have limitations. For example, they can be computationally expensive, meaning they are hard to apply to huge data sets; assumptions of homogeneity in rates over time and space typically are necessary; and, more fundamentally, recombination is widespread among pathogens.

To address these shortcomings, we have developed a model-based approach to association testing that exploits the availability of extremely large reference data sets on pathogen variation (but where there is no relevant metadata concerning the factors of interest), which are likely to contain genomes closely related to the samples of interest (with relevant metadata). Moreover, we argue that when identifying host-driven selection, most of the information lies near the tips of the trees as deeper comparisons have to integrate over many transmission events. By utilising the Li and Stephens41 haplotype model, combined with previous work on modelling adaptive evolution and reversion17,18,42 we have developed an approach that both dramatically increases power to detect associations and scales to huge data sets. Further, the framework can be extended in many ways, for example, to model the effects of covariates on recombination profile or intensity, or to incorporate hierarchical correlation structure between the selective effects of HLA types to boost power. Importantly, within the data analysed here, we estimate that between 7% (protease) and 17% (reverse transcriptase) of viral samples having undergone recombination at least once since their common ancestor with sequences in the reference data set.

The analysis of in vivo patterns of evolution provides a complementary approach to the in vitro study of drug resistance and immune evasion. The evolutionary response reflects the combination of fitness gains of escape and the fitness loss through modification. Hence, sites with strong resistance may have too strong a fitness cost to be typically selected for and weak resistance changes that have little fitness impact may be strongly selected for. Moreover, compensatory changes in the pathogen genome can potentially counteract fitness loss and hence appear as part of the resistance response. Our analysis of drug-associated evolution using data from the Los Alamos database identifies the majority of major DRMs; sites not identified typically result from failings of the model (variants that are more than one change away from the consensus codon, insertions and deletions). However, we also identify novel sites with compelling evidence for drug-associated selection, likely reflecting minor DRMs with small fitness loss and compensatory changes.

The analysis of HLA-associated selection presented here combines data from across six studies and nearly 3,000 HIV-1 sequences with linked patient HLA class I data. We show the method can recover well-known epitopes and associated escape mutations and that there is strong overlap between curated lists of epitopes and those identified here. However, we also found that simple in silico epitope prediction fails to provide a good predictor of the HIV-1 evolutionary response and multiple cases where curated epitopes do not appear to lead to a substantial selective pressure. We also showed a preponderance of signals of escape associated with HLA-B alleles, as opposed to HLA-A and HLA-C and identified a number of sites where differences between subtype consensus sequences collocate with sites selected by particular HLA alleles. Furthermore, by contrasting the aggregate HLA-associated selection in hosts of subtype B viruses with hosts of subtype C viruses, we observed differential signals of selection at sites distinguishing subtype B and C, supporting the hypothesis that differences in HLA allele frequency between populations could have contributed to the differentiation of HIV-1 subtypes43. As field-based sequencing of pathogens becomes feasible, so does the potential to accumulate vast data sets that can track spatial and temporal shifts in selection pressures affecting pathogens and their responses. Such studies will require new approaches to studying genetic association. While the method developed here is aimed at detecting within-host evolutionary responses, future extensions could include adaptation at a population level, for example through exploiting longitudinal sampling and/or spatial heterogeneity in treatment.

Methods

Motivation, justification and overview

Our goal is to estimate the location and nature of host-associated selection upon the pathogen genome sequence through statistical analysis of association between host factor and pathogen genomic variation. We desire our method to have four properties: the ability to use all available pathogen sequence data, irrespective of whether host factors have been measured; the ability to measure the evidence for selection across the genome analysed in a hypothesis-free manner; the ability to combine information across neighbouring positions where appropriate; and the ability to account for confounding factors such as recombination and population structure. To meet these requirements we have developed a Bayesian model-based approach in which we use an approximation to the coalescent with recombination and codon-level selection42. In extension to the earlier work, we enable host-specific factors to influence patterns of variation and consider the case of two data sets. The first, D, represents the collection of pathogen sequences for which host factor data (e.g., HLA genotype or drug treatment) is also available. The second, DB, represents a much larger data sets of pathogen sequences for which host factor data is not available. We use DB to model the set of potential pathogen sequences that a host can be infected with (allowing for recombination) and assume that genetic differences between this reference panel and D largely reflects evolution within the host. Thus host-induced selection will result in an association between host factors and the evolutionary changes observed in D. A cartoon of the underlying process and our model is shown in Fig. 1.

The justification of the approach is that if $$|{\mathbf{D}}_{\mathrm{B}}| \ll |{\mathbf{D}}|$$ then members of D will typically coalesce very recently and approximately independently with members of DB. Thus the prior on D can be modelled with the Li and Stephens imperfect mosaic hidden Markov model (HMM), utilising a modified NY98 codon model44. To incorporate host-induced selection, we define a consensus codon $${\cal{C}}_i$$ at site i and model selection as a scaled increase in the rate of non-synonymous change away from $${\cal{C}}_i$$. Conversely, we model reversion as a scaled increase in the codon substitution rate towards $${\cal{C}}_i$$. To obtain emission probabilities we integrate over the distribution of coalescent time between a member of D and DB assuming a coalescent model.

The HMM formulation enables efficient computation of the probability of observing a given sequence in D given current parameter values. The product over all members of D is used to approximate the joint probability $$P\left( {{\mathbf{D}}|{\mathbf{D}}_{\mathrm{B}},{\cal{H}},{\mathbf{\Theta }}} \right)$$, where $${\cal{H}}$$ is host factor data for members of D and $${\mathbf{\Theta }}$$ represents the parameters of recombination and codon substitution. To speed up computation, we a priori identify a subset of DB for each member of D that is used to model its ancestor. The model parameters we aim to infer are: the synonymous transition rate, μ, the $$\frac{{dN}}{{dS}}$$ ratio at each site, $$\omega _i$$, the recombination probabilities between neighbouring sites, $$r_i$$, the rate of reversion $$\zeta _i$$, and the host factor dependent scaling of escape rate at each site, $$\gamma _{h,i}$$. We impose a piecewise constant prior on $${\boldsymbol{\gamma }}_h$$ to enable rapid exploration of $$\gamma _{h,i}$$ and regularisation of the inference. While motivated by the size distribution of epitopes, the piecewise constant prior is extremely flexible, for example encompassing a small number of discrete sites with high selection parameters. Other parameters (transition–transversion rate and fraction of codon substitutions more than one nucleotide change away) are estimated from external data. We use Markov chain Monte Carlo (MCMC) to sample from the posterior distribution for all parameters (See Supplementary Methods for details and MCMC moves). To guard against overfitting, we use shrinkage priors on the $$\frac{{dN}}{{dS}}$$ ratio and host factor associated selection coefficients. Similarly, the window structure on $${\boldsymbol{\gamma }}_h$$ borrows information across neighbouring sites, further guarding against overfitting in the absence of signal42.

Statistical and computational details

In order to calculate the likelihood of the collection of parameters, $${\mathbf{\Theta }}$$, governing an evolutionary model, given a collection of sequence data D, it is important to account for the non-independence of samples by considering the underlying genealogy G. For our purposes, we are not interested in estimating G. We, therefore, treat the genealogy as a nuisance parameter and integrate over it

$$P({\mathbf{D}}|{\bf{\Theta }}) = \mathop {\smallint }\nolimits_G {\kern 1pt} P({\mathbf{D}}|{\mathbf{\Theta }},G)P(G)dG.$$
(1)

Here, P(G) is our prior on the genealogy G. $$P({\mathbf{D}}|{\mathbf{\Theta }},G)$$ is the probability of the data under some parameter values $${\mathbf{\Theta }}$$ and genealogy G, which can be evaluated using Felsenstein’s peeling algorithm45. Since the space of genealogies is very large, evaluating this integral is problematic. However, without recombination it is possible to estimate using a numerical summation approximation, as employed by BEAST46 and MrBayes47, for example. However, analysis is limited to hundreds or thousands of sequences depending on the properties of the analysed sequences. In the presence of recombination, G becomes an ancestral recombination graph and so the state space of G becomes far larger, making it impossible to use this approach in all but the simplest of models, and only for very small sequence data sets.

The solution that we use is to estimate the integral using an approximation to the coalescent with recombination41. The model we present has two components:

1. 1.

An approximation of the coalescent with recombination.

2. 2.

A model of codon substitution.

By combining these two components, we are able to approximate the probability of observing a given sequence in the presence of host factor dependent selection, and recombination. We make the assumption that each pathogen sequence is from a distinct individual. In practice, when applying the model to real data, we perform a number of data filtering steps to ensure this assumption is reasonable. We use the Li and Stephens imperfect mosaic HMM to approximate the coalescent with recombination. In the Supplementary Methods we describe the Li and Stephens model and the use of the forwards and backwards algorithms in evaluating likelihoods under this approximation.

In our inference problem we have two data sets. One query data set; D for which we have associated host factor information and another larger reference data set; DB without host factor information. Recall from our model summary that we make two important assumptions:

1. 1.

All selection along the lineages connecting a member of D to DB occurred within the host of the member of D.

2. 2.

The reference data set DB is a good approximation of the distribution of pathogens an individual can be infected with (which is reasonable when $$|{\mathbf{D}}_{\mathrm{B}}| \gg 0$$).

These two assumptions allow us to apply the Li and Stephens approximation. In the Li and Stephen’s HMM we require emission probabilities. For our model these will be probabilities of codon substitutions in the presence of some host factor. Making assumption 1 allows us to approximate $$P({\mathbf{D}}|{\mathbf{D}}_{\mathrm{B}},{\mathbf{\Theta }})$$. This is because we have host factor information for the members of $${\mathbf{D}} = \left\{ {D_1,D_2, \ldots ,D_n} \right\}$$ which we can use to determine the relevant emission probabilities. I.e., The first assumption allows us to use the sequence data in DB:

$$\begin{array}{*{20}{l}} {P({\mathbf{D}}|{\mathbf{D}}_{\mathrm{B}},{\mathbf{\Theta }})} \hfill & = \hfill & {P(D_1|{\mathbf{D}}_B,{\mathbf{\Theta }})P(D_2|{\mathbf{D}}_B,D_1,{\mathbf{\Theta }})} \hfill \\ {} \hfill & {} \hfill & { \ldots P(D_n|{\mathbf{D}}_B,D_1,D_2 \ldots ,D_{n - 1},{\mathbf{\Theta }}).} \hfill \end{array}$$
(2)

Using assumption 2, we may write

$$\begin{array}{*{20}{l}} {P({\mathbf{D}}|{\mathbf{D}}_{\mathrm{B}},{\mathbf{\Theta }})} \hfill & \approx \hfill & {P(D_1|{\mathbf{D}}_B,{\mathbf{\Theta }})P(D_2|{\mathbf{D}}_B,{\mathbf{\Theta }}) \ldots P(D_n|{\mathbf{D}}_B,{\mathbf{\Theta }})} \hfill \end{array}$$
(3)
$$\begin{array}{*{20}{l}} {} \hfill & \approx \hfill & {\hat \pi (D_1|{\mathbf{D}}_B,{\mathbf{\Theta }})\hat \pi (D_2|{\mathbf{D}}_B,{\mathbf{\Theta }}) \ldots \hat \pi (D_n|{\mathbf{D}}_B,{\mathbf{\Theta }}).} \hfill \end{array}$$
(4)

Here, $$\hat \pi$$ is the function describing the Li and Stephens approximation (described in the Supplementary Methods) with our model of recombination and codon evolution. In other words, assumption 2 results in the approximation that D is generated by independent realisations of recombination and mutation through DB. This avoids any need for averaging over orderings of D, allowing rapid evaluation of the approximate likelihood.

Throughout the remainder of the Methods section we will refer to HLA-associated selection, but note that the method is general and extends to any host factor associated selection on pathogen sequence.

Codon model of substitution

We use a codon model of substitution model to describe sequence change. Importantly for our purposes, a codon model allows us to detect selection through non-synonymous nucleotide changes. At each site we are interested in detecting if these types of nucleotide changes away from some wild-type consensus sequence are enriched in a particular HLA background. We first write down the Nielsen and Yang (NY98) codon model (without HLA-associated selection)44 and then extend it to incorporate the HLA types of the host transmitted to (a given member of D). This will allow us to infer HLA-dependent selection along the pathogen sequence. Note that in Eqs. (5)–(24) we consider codon substitutions at a single site (and drop any site subscript). However, when we perform inference we will allow our codon model to be parameterised by distinct collections of parameters at each site.

In the NY98 codon model44, the substitution rate from codon i to codon j is

$$q_{i,j} = \mu \left\{ {\begin{array}{*{20}{l}} \kappa \hfill & {{\mathrm{if}}\,i\,{\mathrm{and}}\,j\,{\mathrm{differ}}\,{\mathrm{by}}\,{\mathrm{a}}\,{\mathrm{synonymous}}\,{\mathrm{transition}}} \hfill \\ 1 \hfill & {{\mathrm{if}}\,i\,{\mathrm{and}}\,j\,{\mathrm{differ}}\,{\mathrm{by}}\,{\mathrm{a}}\,{\mathrm{synonymous}}\,{\mathrm{transversion}}} \hfill \\ \omega \hfill & {{\mathrm{if}}\,i\,{\mathrm{and}}\,j\,{\mathrm{differ}}\,{\mathrm{by}}\,{\mathrm{a}}\,{\text{non{\hbox{-}}synonymous}}\,{\mathrm{transversion}}} \hfill \\ {\kappa \omega } \hfill & {{\mathrm{if}}\,i\,{\mathrm{and}}\,j\,{\mathrm{differ}}\,{\mathrm{by}}\,{\mathrm{a}}\,{\text{non{\hbox{-}}synonymous}}\,{\mathrm{transition}}} \hfill \\ 0 \hfill & {{\mathrm{otherwise}},} \hfill \end{array}} \right.$$
(5)

where $$\kappa$$ is the relative rate of transitions to transversions and $$\omega$$ is the ratio of non-synonymous base changes to synonymous base changes assuming equal codon usage. The original model44 weighted by the frequency of the codon switched to at that position to ensure reversibility. This is not a requirement for us. In fact, the process we want to model is not reversible, as we will include a boost in movement away from the consensus codon in the presence of a given HLA, and assume a distinct boost towards consensus independent of HLA (to model reversion). Finally μ is the scaled mutation rate parameter for the rate of synonymous transversions measured in units of N generations (where N is the effective population size). We note that while the original model was used to model fixed substitutions within species, here (as in many cases) the model is actually of how sequence diversity is generated within a population sample.

The exact probability of a substitution from codon i to codon j in time t cannot be solved analytically, though we may estimate it by numerically evaluating the matrix exponential of the 64 × 64 instantaneous rate matrix described by Eq. (5) 42. This is computationally expensive so we look for an approximation.

If the expected time until the first coalescence of a given lineage with all other sampled lineages t, is small (reasonable if the number of sampled sequences in DB is very large), we can assume that the vast majority of codon changes occur through single-base changes (as the probability of observing more than one nucleotide change in a codon is low). This allows us to avoid matrix exponentiation and reduce computational cost. Consider a switch from codon C1 to codon C2. The total rate of substitution out of codon C1 is

$$\left( {\kappa \beta _S + \beta _V + \omega \left( {\kappa \alpha _S + \alpha _V} \right)} \right)\mu = :{\mathrm{\Lambda }}\mu ,$$
(6)

where βS, βV, αS and αV are the counts of the number of synonymous transitions, synonymous transversions, non-synonymous transitions, and non-synonymous transversions from C1, respectively.

If the overall movement away from a particular codon C1 is $${\mathrm{\Lambda }}\mu$$, then the probability a change has occurred in time t is

$$P({\mathrm{leave}}\,C_1|{\mathbf{\Theta }},t) = \mathop {\smallint }\nolimits_0^t {\kern 1pt} {\mathrm{\Lambda }}\mu {\kern 1pt} {\mathrm{exp}}\left( { - {\mathrm{\Lambda }}\mu t} \right)dt = 1 - {\mathrm{exp}}\left( { - {\mathrm{\Lambda }}\mu t} \right).$$
(7)

Here, $${\mathbf{\Theta }}$$ is the collection of parameters governing codon substitution. We can make an approximation by splitting up $$P(C_1 \to C_2|{\mathbf{\Theta }},t)$$. We consider two classes of moves from $$C_1 \to C_2$$: those which result from a single nucleotide change in time t (e.g., AAA → AAG) and those that result from multiple nucleotide changes in time t (e.g., AAA → ATA → AAA → AAG). For $$C_1 \,\ne\, C_2$$,

$$\begin{array}{*{20}{l}} {P(C_1 \to C_2|{\mathbf{\Theta }},t)} \hfill & = \hfill & {\mathop {\sum }\limits_{i = 1}^\infty {\kern 1pt} P(C_1 \to C_2|i\,{\mathrm{changes}},t)P(i\,{\mathrm{changes}}|{\mathbf{\Theta }},t)} \hfill \end{array}$$
(8)
$$\begin{array}{*{20}{l}} {} \hfill & = \hfill & {P({\mathrm{leave}}\,C_1|{\mathbf{\Theta }},t){\kern 1pt} \mathop {\sum }\limits_{i = 1}^\infty {\kern 1pt} P(C_1 \to C_2|{\mathrm{leave}}\,C_1,i\,{\mathrm{changes}},t)} \hfill \\ {} \hfill & {} \hfill & { \times P(i\,{\mathrm{changes}}|{\mathbf{\Theta }},t)} \hfill \end{array}$$
(9)
$$\begin{array}{*{20}{l}} {} \hfill & \approx \hfill & {\left( {1 - {\mathrm{exp}}\left( { - {\mathrm{\Lambda }}\mu t} \right)} \right)\left( {P\left( {C_1 \to C_2|{\mathrm{leave}}\,C_1,1\,{\mathrm{change}},t} \right)\phi } \right.} \hfill \\ {} \hfill & {} \hfill & {\left. { + \,P\left( {C_1 \to C_2|{\mathrm{leave}}\,C_1, > 1\,{\mathrm{change}},t} \right)\left( {1 - \phi } \right)} \right)} \hfill \end{array}$$
(10)
$$\begin{array}{*{20}{l}} {} \hfill & \approx \hfill & {\left( {1 - {\mathrm{exp}}\left( { - {\mathrm{\Lambda }}\mu t} \right)} \right)} \hfill \\ {} \hfill & {} \hfill & { \times \left( {\phi P\left( {C_1 \to C_2|{\mathrm{leave}}\,C_1,1\,{\mathrm{change}},t} \right) + \left( {1 - \phi } \right)\pi _{C_2}} \right)} \hfill \end{array}{,}$$
(11)

where $$\phi$$ is the probability of undergoing a 1 step change from C1 to C2 and $$\pi _{C_2}$$ is the probability of observing codon C2 at that position in the sequence.

Thus, for each codon C1, we generate the following collection of probabilities for state changes at each site

$$\begin{array}{l}1\,{\mathrm{step}}\,{\text{non{\hbox{-}}synonymous}}\,{\mathrm{transition}}\\ P(C_1 \to C_2|{\mathbf{\Theta }},t) = \left( {1 - {\mathrm{exp}}( - {\mathrm{\Lambda }}\mu t)} \right)\left( {\phi \frac{{\kappa \omega }}{{\mathrm{\Lambda }}} \,+ (1 - \phi )\pi _{C_2}} \right)\end{array}$$
(12)
$$\begin{array}{l}1\,{\mathrm{step}}\,{\mathrm{synonymous}}\,{\mathrm{transition}}\\ P(C_1 \to C_2|{\mathbf{\Theta }},t) = \left( {1 - {\mathrm{exp}}( - {\mathrm{\Lambda }}\mu t)} \right)\left( {\phi \frac{\kappa }{{\mathrm{\Lambda }}} \,+ (1 - \phi )\pi _{C_2}} \right)\end{array}$$
(13)
$$\begin{array}{l}1\,{\mathrm{step}}\,{\text{non{\hbox{-}}synonymous}}\,{\mathrm{transversion}}\\ P(C_1 \to C_2|{\mathbf{\Theta }},t) = \left( {1 - {\mathrm{exp}}( - {\mathrm{\Lambda }}\mu t)} \right)\left( {\phi \frac{\omega }{{\mathrm{\Lambda }}} \,+ (1 - \phi )\pi _{C_2}} \right)\end{array}$$
(14)
$$\begin{array}{l}1\,{\mathrm{step}}\,{\mathrm{synonymous}}\,{\mathrm{transversion}}\\ P(C_1 \to C_2|{\mathbf{\Theta }},t) = \left( {1 - {\mathrm{exp}}( - {\mathrm{\Lambda }}\mu t)} \right)\left( {\phi \frac{1}{{\mathrm{\Lambda }}} \,+ (1 - \phi )\pi _{C_2}} \right)\end{array}$$
(15)
$$\begin{array}{l} \ge 2\,{\mathrm{step}}\,{\mathrm{change}}\\ P(C_1 \to C_2|{\mathrm{\Theta }},t) = (1 - {\mathrm{exp}}( - {\mathrm{\Lambda }}\mu t)){\kern 1pt} \left( {(1 - \phi )\pi _{C_2}} \right)\end{array}$$
(16)

if $$C_1 \ne \,C_2$$, and

$$\begin{array}{l}{\mathrm{No}}\,{\mathrm{change}}\,{\mathrm{in}}\,{\mathrm{codon}}\,{\mathrm{at}}\,{\mathrm{the}}\,{\mathrm{site}}\,1 - ((12) + \ldots + (16)) = (17)\\ P(C_1 \to C_2|{\mathbf{\Theta }},t) = {\mathrm{exp}}\left( { - {\mathrm{\Lambda }}\mu t} \right) + \left( {1 - {\mathrm{exp}}\left( { - {\mathrm{\Lambda }}\mu t} \right)} \right)\left( {1 - \phi } \right)\pi _{C_2}\end{array}{,}$$
(17)

if $$C_1 = C_2$$. We estimate πC (for each codon C) and $$\phi$$ empirically using the Durban data set48,49,50.

Codon model of substitution: adding HLA-associated selection

In order to account for HLA-associated selection we may choose to incorporate transmission and HLA proportions in the same way as the existing models17,18. Alternatively, we can consider a simpler model in which we are only concerned with the HLA type (h, say) of the sampled host in the smaller population. We explore the latter approach. For movement away from the consensus codon $${\cal{C}}$$ at a given site, $${\mathrm{\Lambda }}$$ is modified to $${\mathrm{\Lambda }}^\prime$$ in the presence of HLA type h

$${\mathrm{\Lambda }}\prime = \kappa \beta _S + \beta _V + \omega \left( {\kappa \alpha _S + \alpha _V} \right)\gamma _h.$$
(18)

So, as an example, for a non-synonymous transition from the consensus (which is predefined, in practice, we set it as either the consensus subtype B, or subtype C codon at the position) codon $${\cal{C}}$$ to C2, explicitly writing down the HLA dependence and using the approximation from Eqs. (13)–(17)

$$P({\cal{C}} \to C_2|h,t) = \left( {1 - {\mathrm{exp}}( - {\mathrm{\Lambda }}\prime \mu t)} \right)\left( {\phi \frac{{\kappa \omega \gamma _h}}{{{\mathrm{\Lambda }}\prime }} \,+ (1 - \phi )\pi _{\cal{C}}} \right),$$
(19)

γh is a scaling of the non-synonymous/synonymous substitution ratio $$\omega$$ associated with HLA type h, away from $${\cal{C}}$$. This weighting provides extra selection away from the consensus codon $${\cal{C}}$$ if $$\gamma _h \ > \ 1$$ and less if $$\gamma _h \ < \ 1$$ to create a new total rate of movement out of the consensus codon $${\cal{C}}$$ in the presence of HLA type h, $${\mathrm{\Lambda }}\prime$$. Thus an association between a particular HLA type and non-synonymous change from consensus at the site would result in $$\gamma _h \ > \ 1$$. Notice that no selection is applied to synonymous codon changes.

Extending to an HLA profile of a host $${\mathbf{H}} = \{ h_1,h_2, \ldots ,\ h_n\}$$

$$P({\cal{C}} \to C_2|{\mathbf{H}},{\mathbf{\Theta }},t) = \left( {1 - {\mathrm{exp}}\left( { - {\mathrm{\Lambda }}\prime \mu t} \right)} \right)\left( {\phi \frac{{\kappa \omega {\kern 1pt} \mathop {\prod }\nolimits_{h \in {\mathbf{H}}} {\kern 1pt} \gamma _h}}{{{\mathrm{\Lambda }}\prime }} \,+ (1 - \phi )\pi _{\cal{C}}} \right),$$
(20)

We also wish to incorporate reversion into our codon model. Given the state of the codon under consideration, there is some set of nucleotide substitutions resulting in $${\cal{C}}$$ at that position. It is these substitutions which we wish to scale through reversion parameters. We let $$\zeta$$ denote the scaling of selection due to reversion at a particular site. We assume that each $$\zeta$$ is independent of the host’s HLA profile, H. By doing so, we are assuming that in the absence of HLA pressure there is a selection towards the consensus codon $${\cal{C}}$$ at each site. Note that reversion only acts on non-synonymous changes that result in the consensus codon $${\cal{C}}$$. Consequently, the parameterisation is identifiable and we are able to distinguish $$\omega _i$$ and $$\zeta _i$$ in the product $$\omega _i\zeta _i$$.

Consider the case in which a non-synonymous transition results in the consensus codon, $${\cal{C}}$$, from some codon C1. Then $${\mathrm{\Lambda }}$$ is modified to $${\mathrm{\Lambda }}\prime$$:

$${\mathrm{\Lambda }}\prime = \kappa \beta _S + \beta _V + \omega \left( {\kappa \left( {\zeta + (\alpha _S - 1)} \right) + \alpha _V} \right).$$
(21)

Reversion is therefore modelled by

$$P(C_1 \to {\cal{C}}|{\mathbf{H}},{\mathbf{\Theta }},t) = \left( {1 - {\mathrm{exp}}\left( { - {\mathrm{\Lambda }}\prime \mu t} \right)} \right)\left( {\phi \frac{{\kappa \omega \zeta }}{{{\mathrm{\Lambda }}\prime }} + (1 - \phi )\pi _{\cal{C}}} \right).$$
(22)

$$\zeta$$ is a scaling of the non-synonymous/synonymous substitution ratio $$\omega$$ towards $${\cal{C}}$$. This weighting provides extra selection towards the consensus codon $${\cal{C}}$$ if $$\zeta \ > \ 1$$ and less if $$\zeta \ < \ 1$$. We assume that the same $$\zeta$$ scales all the potential 1 step non-synonymous base changes to $${\cal{C}}$$.

We have now created a codon model which accounts for reversion and HLA-associated selection within a host, and can now determine emission probabilities; $$\varepsilon _{j,i}$$. We define $$\varepsilon _{j,i}$$ as the probability of copying the codon at position i from sequence j (see Supplementary Methods for background details). Therefore, by evaluating

$$P(C_1 \to C_2|{\mathbf{H}},{\mathbf{\Theta }}) = \mathop {\smallint }\nolimits_0^\infty {\kern 1pt} P(C_1 \to C_2|{\mathbf{H}},{\mathbf{\Theta }},t)P(t)dt{,}$$
(23)

which is the probability that a change from codon C1 to C2 has occurred in the time to coalescence with the reference set, we can substitute in the relevant codon change ($$C_1 \to C_2$$) and determine $$\varepsilon _{j,i}$$. We integrate out t assuming the coalescent with fixed population size. In the standard coalescent with a fixed population size, the distribution of the divergence time between a sequence and its closest relative from among k other sequences may be approximated by an exponential distribution with rate $$\frac{k}{2}$$ $$\left( {P(t) = \frac{k}{2}{\mathrm{exp}}\left( { - \frac{k}{2}t} \right)} \right)$$42 (see Supplementary Notes for a derivation of the mean).

This is simply a case of substituting in and integrating. For example, for a non-synonymous transition from the consensus,

$$P({\cal{C}} \to C_2|{\mathbf{H}},{\mathbf{\Theta }}) = \left( {\phi \frac{{\kappa \omega {\kern 1pt} \mathop {\prod }\nolimits_{h \in {\mathbf{H}}} {\kern 1pt} \gamma _h}}{{{\mathrm{\Lambda }}\prime }} + (1 - \phi )\pi _{C_2}} \right)\left( {1 - \frac{k}{{k + {\mathrm{\Lambda }}\prime \mu }}} \right).$$
(24)

Notice that in evaluating this integral we are focused on sequence evolution between a member of D and its nearest neighbour in DB. Only considering nearest neighbours in the genealogy has two clear advantages:

1. 1.

We only use the parts of the genealogy that contain the most information about HLA-dependent selection within a sampled host: terminal branches.

2. 2.

The resulting reduction in computation time allows us to use far more sequence data. Using more sequence data will shorten terminal branches and so increase our power to detect HLA-associated selection.

Using very large data sets is important for our method. As terminal branches shorten through increased sampling, assumption 1 (that selection occurs in the member of D along the lineage connecting each member of D to its nearest neighbour in DB) becomes more reasonable as we may assume that no or very few transmissions occur along the lineage connecting sequences in D to their nearest neighbour in DB.

In Eqs. (18)–(24) we considered a single host with an HLA type h or collection of HLA types H. We can trivially extend this to obtain the desired approximation of the likelihood given in Eq. (3). Let $${\cal{H}}$$ be the collection of HLA profiles of the associated hosts for sequences in D. Let Hj denote the HLA profile of host j. Then

$$P({\mathbf{D}}|{\mathbf{D}}_{\mathrm{B}},{\cal{H}},{\mathbf{\Theta }}) \approx \mathop {\prod }\limits_j {\kern 1pt} \hat \pi (D_j|{\mathbf{D}}_{\mathrm{B}},{\mathbf{H}}_j,{\mathbf{\Theta }}).$$
(25)

We have now created a model which accounts for recombination using the Li and Stephens41 approximation and incorporates both reversion and HLA-associated selection. We perform inference using MCMC.

Restricting DB: sample-specific reference data sets $${\mathbf{D}}_{{\mathrm{B}}_j}$$

In order to determine the posterior probability of a new state within the MCMC quickly, we keep track of two large arrays for each query sequence (as well as other quantities required to evaluate the likelihood). This requires storing two $$|{\mathrm{codon}}\,{\mathrm{sequence}}| \times |{\mathbf{D}}| \times |{\mathbf{D}}_{\mathrm{B}}|$$ arrays of doubles. When evaluating the likelihood of observing ~2,000 query sequences from a collection of ~60,000 reference protease sequences from the Stanford drug resistance database, over 300 GB of RAM was required! We require a method to reduce the memory required to perform inference.

We restrict DB to a different subset $${\mathbf{D}}_{{\mathrm{B}}_j}$$ for each $$D_j \in D$$. We make the approximation

$$P({\mathbf{D}}|{\mathbf{D}}_{\mathrm{B}},{\cal{H}},{\mathbf{\Theta }}) \approx \mathop {\prod }\limits_j {\kern 1pt} \hat \pi (D_j|{\mathbf{D}}_{\mathrm{B}},{\mathbf{H}}_j,{\mathrm{\Theta }}) \approx \mathop {\prod }\limits_j {\kern 1pt} \hat \pi (D_j|{\mathbf{D}}_{{\mathrm{B}}_j},{\mathbf{H}}_j,{\mathbf{\Theta }}).$$
(26)

If the sequences we choose for each $${\mathbf{D}}_{{\mathrm{B}}_j}$$ are similar to the true ancestors of each of the Dj, then this will be a good approximation.

The simple approach we use to restrict DB is by considering the closest n sequences to Dj by Hamming distance (the Hamming distance between two strings is the total number of differences between the strings, giving each difference equal weighting). A drawback of using Hamming distance to restrict DB is that in the presence of high recombination rates, one can imagine sequences which should be included in $${\mathbf{D}}_{{\mathrm{B}}_j}$$ but are excluded on the basis this metric.

For the small amount of overhead, this sample-specific restriction reduce the computational time and memory footprint required by the programme to the extent that a run of 1,500,000 updates analysing ~1,000 query sequences over a codon sequences of length ~100, considering the closest 100 reference sequences according to some metric may be performed in ~24 h on an Intel i7 desktop machine using <1 GB of RAM.

MCMC inference regime

The structure of the model we have presented lends itself to inference using MCMC to sample from the posterior. For our MCMC implementation, we use the Metropolis Hastings algorithm. We fix $$\kappa$$, $$\phi$$ and πC for each codon C at empirical estimates. We perform MCMC moves on the following collection of parameters:

• The recombination probability ri between neighbouring sites i and i + 1.

• The non-synonymous/synonymous substitution ratio in the absence of reversion or HLA-associated selection at each site, $$\omega _j$$.

• Scaling of selection due to reversion at each site, $$\zeta _i$$.

• HLA-associated scalings of selection at each site, $$\gamma _{h,i}$$.

In our sampling scheme we allow $$\zeta _i$$ and $$\gamma _{h,i}$$ to vary via piecewise constant functions across the codon sequence, which we call selection windows. This is analogous to the ‘block-like’ model used by Wilson and McVean42. Imposing this window structure allows information about HLA-dependent selection (and reversion) to be combined across sites. This makes sense in a biological setting where sites close together in the coding sequence may result in amino acids lying in the same epitope or close together in the protein structure. The window model will act to smooth across sites, but will be overwhelmed if a given site is subject to strong HLA-dependent selection42. How strict the window model is depends on the expected number of windows, which is controlled by the parameter pw (see the merge and split moves in Supplementary Methods MCMC moves subsection, and Wilson and McVean42). If $$p_w = 1$$, the number of selection windows is equal to the number of sites, if $$p_w = 0$$ then there is just one selection window. The piecewise constant prior is extremely flexible. Using a piecewise constant prior in combination with MCMC moves to shrink/expand and merge/split these blocks allows us to explore the space of HLA-associated selection rapidly while maintaining a far smaller number of parameters than considering HLA-associated selection site by site. See Supplementary Methods for MCMC move details.

Simulation study 2: simulating a sampled birth–death process

For a second large simulation study, we consider a more realistic generative process for creating our reference and query data sets. Rather than using an existing large reference data set and simulating query sequences under our model, we generate an instance of a sampled birth–death process. We then simulate sequence data down the resultant sampled birth–death tree.

We simulate the sampled tree backwards in time together with unseen transmission events using the following procedure:

Let birth rate be λ and the death rate be δ. Thus, backwards in time λ is a death rate, and δ a birth rate. Let N and M denote the total number of infected individuals and sampled infected individuals respectively. Setting r as the sampling proportion for extinct lineages, we consider a collection of competing Poisson processes.

Set $$t = 0$$, and let the total number of infected individuals and sampled infected individuals at the present be $$N = \tilde N$$ and $$M = \tilde M$$, respectively.

While $$M \ > \ 0$$, sample the time of the next event $$t \to t + \tilde t$$ where $$\tilde t \sim Exp(N(\lambda + \delta ))$$. First, we determine whether this event is a birth or a death. Sample $$u \sim U[0,1]$$.

• Birth. If $$u \ < \ \frac{\lambda }{{\lambda + \mu }}$$, then $$N \to N - 1$$ and there are three possibilities here with the following probabilities.

• An unseen lineage coalesces with unseen lineage with probability $$\left( {1 - \frac{M}{N}} \right)^2$$.

• An unseen lineage coalesces with a seen lineage with probability $$\left( {1 - \frac{M}{N}} \right)\frac{M}{N}$$, and we record the time t.

• A seen lineage coalesces with a seen lineage with probability $$\left( {\frac{M}{N}} \right)^2$$, $$M \to M - 1$$, and we record the time t.

• Death. Else $$\left( {u \ge \frac{\lambda }{{\lambda + \mu }}} \right)$$, and $$N \to N + 1$$.

• If $$\hat u \sim U[0,1] \ < \ q$$ then the death event is sampled $$M \to M + 1$$, and we store the time of the event t.

Given that birth and death events are lineage independent, we may just use the time information to then generate the a tree. Code is available at https://github.com/astheeggeggs/mcqueen. We note that this generative process is based on the equations laid out in the Supplementary Materials of Palmer et al.17 and Frost and Volz51, and that it is similar to the backwards episodic birth–death process algorithm outlined by Stadler52, except that we have a fixed birth and death rate over the entirety of the tree, store unseen transmissions, and do not have mass extinction events.

We now have a sampled birth–death tree. Given our knowledge of all the transmission events along the lineages of this sampled birth–death tree, we may assign HLA types to each node by passing the information down the tree from the root. Note that at each coalescence event, one of the daughter lineages will be in the same host, whilst the other will represent the transmission to a new host and HLA environment. All unseen transmission events which we count are transmissions to a new host, so also coincide with a change in HLA environment.

We then simulate sequence change down this tree conditional on all the HLA information, using the codon substitution model outlined in the methods subsection: Codon model of substitution, again by passing a sequence at the root and tracking changes down the tree. Finally, we sample m of these sequences at the leaves to be our query sequence set with associated host HLA information, and the remainder to define our reference sequence data set, after throwing away the host HLA information. In this simulation study, we set N = 1,000,000, M = 100,000, q = 0.1, and m = 1,460, and use the same selection profiles, but increase the number of flat selection profiles to 16, so that the total frequency of a given HLA allele more closely resembles that seen in the data, which is important for this tree based simulation of HLA-associated evolution.

Simulation study 3: the effect of population differentiation

Differentiation between the query and reference data sets can potential lead to reduced performance of estimators due to two factors: confounding between covariates that are non causal (for example differences in both HLA allele frequency and viral genetic diversity due to genetic drift) and relatedness within the query population leading to poorly calibrated estimates. To assess the impact of population differentiation on performance of our estimator we carried out a parametric bootstrap simulation, using viral sequence data from the Botswana study (n = 343, protease only) to simulate ancestors infecting a set of hosts with HLA allele frequencies drawn from the Botswana population. We used the evolutionary parameters estimated within this study (including the HLA-associated selection profiles, $$\frac{{dN}}{{dS}}$$, and recombination probabilities) and performed 100 independent simulations of 1,500 query sequences with associated HLA information. For each simulation we then performed inference using a range of reference data sets and also considered a leave-one-out (LOO) strategy in which the reference data are augmented by all sequences from the query data set except for the sequence under consideration. These will typically be the most closely related to the query data. The reference data sets we consider are:

1. a.

DB = The sequences actually simulated from, under the model (Botswana sequence data). This represents a gold-standard reference data set.

2. b.

DB = All sequence data available in public databases for which no HLA information is available (n = 162,901).

3. c.

DB = The simulated sequences, using a LOO strategy for each query.

4. d.

DB = a) + b), using a LOO strategy.

We display plots of inference results for a common and rare HLA type for each HLA class I molecule (A–C) in Supplementary Figs. 911.

To compare the impact of relatedness on our ability to perform inference, we compare the mean RMSE of the median HLA-associated selection estimates across sites, weighted by the frequency of the HLA types in our simulated data sets. Results are displayed in Supplementary Table 9.

As expected, the gold standard DB, a) is the most accurate. We find that the reference data set that we have aggregated, b) performs almost as well, with the addition of the simulated sequences to the reference data set performing worse. This is likely due to the approximate nature of the model used to simulate the sequence data, but nevertheless is encouraging to see that our current approach recapitulates parameters almost as well as if we had access to the best possible query data set, a).

We also examined the effect of restricting the reference data set to the subset Los Alamos reference data (where we had sampling country information). We then checked to see if removal of all Botswana samples ($$n = 73$$) from this Los Alamos reference set dramatically reduced our accuracy. We saw similar results for inference using these two reference sets.

These results indicate that our method is robust to population differentiation and relatedness within the query data set. Importantly, the effect of removing all reference data from Botswana has only a marginal impact on accuracy. The LOO strategy does not typically improve performance, though clearly enables use of the method when no (or only very distantly related) reference data is available. Users of our software (available at https://github.com/astheeggeggs/mcqueen) can set DB to be created using a LOO approach by using the -q or –separate_reference_fasta flags.

Testing overlap with A-list and B-list epitopes

We permute the labels of HLA types for a class I gene 1,000,000 times, and count the number of overlaps with the putative epitope sets for each permutation. This leads to an estimate of an alternative null distribution (by shuffling we account for possibility that parts of the region may be epitope rich across HLA types). We then compare this distribution of counts of overlapping sites to the observed number of epitope overlaps with our putative selected sites to obtain p values, by determining the proportion of the shuffled sets with at least as many overlaps as the real data.

The impact of HLA on HIV-1 sequence evolution

To investigate the contribution of diversifying selection due to HLA pressure, we first evaluate the average HLA-associated selection away from consensus at each site i, for each individual j in the sample using the median estimates from the MCMC analysis

$$A_{j,i} = \omega _i\left( {\kappa \alpha _{S,i} + \alpha _{V,i}} \right)\mathop {\prod }\limits_{h \in {\mathbf{H}}_j} {\kern 1pt} \gamma _{h,i},$$
(27)

where αS,i and αV,i are the number of non-synonymous transitions and non-synonymous transversions from the consensus codon at site i, respectively.

We then determine the collection of Aj,i for which there is increased selection away from consensus due to HLA-associated selection: $$\left( {\mathop {\prod}\nolimits_{h \in {\mathbf{H}}_j} {\gamma _{h,i}} } \right) \ > \ 1$$. We then evaluate the fraction of the total mutation rate away from consensus that this represented

$$\frac{{\mathop {\sum }\nolimits_{i,j:\left( {\mathop {\prod }\nolimits_{h \in {\mathbf{H}}_j} {\kern 1pt} \gamma _{h,i}} \right) \ > \ 1} {\kern 1pt} A_{j,i}}}{{\mathop {\sum }\nolimits_{i,j} {\kern 1pt} \left( {\omega _i\left( {\kappa \alpha _{S,i} + \alpha _{V,i}} \right){\kern 1pt} \mathop {\prod }\nolimits_{h \in {\mathbf{H}}_j} {\kern 1pt} \gamma _{h,i} + \left( {\kappa \beta _{S,i} + \beta _{V,i}} \right)} \right)}},$$
(28)

and the fraction of the mutation rate that resulted in an amino-acid substitution

$$\frac{{\mathop {\sum}\nolimits_{i,j:\left( {\mathop {\prod}\nolimits_{h \in {\mathbf{H}}_j} {{\kern 1pt} \gamma _{h,i}} } \right) \ > \ 1} {{\kern 1pt} A_{j,i}} }}{{\mathop {\sum}\nolimits_{i,j} {\left( {\omega _i\left( {\kappa \alpha _{S,i} + \alpha _{V,i}} \right){\kern 1pt} \mathop {\prod}\nolimits_{h \in {\mathbf{H}}_j} {{\kern 1pt} \gamma _{h,i}} } \right)} }}.$$
(29)

We find that this represents 18 and 28% of all mutations (59% and 77% of non-synonymous substitutions) away from subtype B and C consensus in protease, respectively. In reverse transcriptase this represented 18% and 32% of all mutations (61% and 78% of non-synonymous substitutions) away from subtype B and C consensus, respectively.

Testing the impact of HLA on HIV-1 subtype differentiation

We wish to examine whether amino-acid differences between HIV subtype B and C may have been driven by differences in HLA frequency distributions. To do this, we first use RIP53 to determine, for each query sample, the viral subtype within each of the genomic regions analysed. At each site we then calculate (using the inferred evolutionary parameters) the average selection away from each subtype consensus in those hosts harbouring the subtype B and C viruses separately. In this way, we obtain an approximation of the average HLA-associated selective pressure away from the subtype B/C consensus in the geographic regions in which subtype B and subtype C viral sequences predominantly reside. For example, letting SC be the set of subtype C viral samples, Hj be the HLA profile of the host in which viral sequence j resides, and $$\gamma _{h,i}^B$$ denote HLA h associated selection away from subtype B at site i

$$A_{i,C}^B: = \frac{1}{{|S_C|}}{\kern 1pt} \mathop {\sum }\limits_{j \in S_C} {\kern 1pt} \left( {\omega _i\left( {\kappa \alpha _{S,i} + \alpha _{V,i}} \right){\kern 1pt} \mathop {\prod }\limits_{h \in {\mathbf{H}}_j} {\kern 1pt} \gamma _{h,i}^B} \right)$$
(30)

is the selective effect away from subtype B consensus at site i of HLA alleles averaged over regions in which subtype C predominates. Results are shown in Supplementary Fig. 20.

To determine if there is an elevated HLA-associated selective effect on sites that differ between subtypes B and C, we performed permutation tests. For example, consider selection away from subtype B consensus. We evaluate

$$\mathop {\sum }\limits_{i \in D} {\kern 1pt} \left( {A_{i,B}^B - A_{i,C}^B} \right),$$
(31)

where D is the set of sites at which subtype B and subtype C differ at the amino-acid level. We then randomly shuffled site labels and re-evaluated this quantity 1,000,000 times to obtain a p value. Resultant p values for protease and reverse transcriptase are shown in Supplementary Table 8.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.