Introduction

Hepatitis C virus (HCV) is the major cause of liver-associated disease and liver cancer, affecting more than 180 million people worldwide1. Fortunately, effective drug treatments, using direct-acting antiviral agents (DAAs), are available that achieve sustained virological response (SVR) in more than 95% of patients2,3. However, drug-resistant mutations (DRMs) in the HCV proteins targeted by drugs greatly affect the treatment outcome and often lead to drug failure2,4. Numerous DRMs have been identified, corresponding to specific amino acid substitutions capable of negatively affecting the activity of DAAs either in vitro or in vivo in treated patients4,5,6,7. The extensive use of drugs for treating HCV imposes selective pressure that may lead to DRM-enriched viruses becoming prevalent in the population, which eventually would limit the efficacy of the available drugs. Interestingly, many HCV DRMs are known to be individually deleterious for the virus8,9,10. Thus, it is important to understand the evolutionary factors facilitating the emergence of DRMs in HCV.

One such evolutionary factor is epistasis: a phenomenon in which the phenotypic effect of a mutation at one residue is dependent on mutations elsewhere in the protein sequence. Epistasis has been suggested to play an important role in HCV evolution11,12,13,14. In the case of HIV, a chronic-disease-causing RNA virus similar to HCV, drug resistance has been suggested to be mediated by epistasis15,16,17, with the fitness cost incurred by DRMs compensated by other mutations in the drug-targeted protein18,19,20. For HCV, preliminary data suggests that DRMs may be involved in epistatic interactions9,10,21; however, a comprehensive understanding of the role played by epistasis in the evolution of DRMs in HCV is still lacking.

In this study, we investigate the role of pair-wise epistatic interactions in the evolution of drug resistance in the NS3 protein, one of the main targets of HCV drugs22. Using the globally prevalent HCV genotype 1a sequence data23, we infer an in-silico model for the fitness landscape of HCV NS3, which takes into account the effect of both individual mutations and epistatic interactions between pairs of mutations. Our inferred model correlates strongly with multiple experimental data sources. Consistent with past studies on fitness landscapes of HCV proteins12,13,14,24, we find that epistatic interactions are important contributors to HCV fitness. Applying the fitness landscape model to study DRMs associated with NS3-targeting drugs (all known to be protease inhibitors4), we reveal that specific DRMs, referred to as “SC-DRMs”, are associated with strong compensatory epistatic interactions and are enriched in almost all NS3 drugs. We further show, by integrating the fitness landscape with an in-host evolutionary model, that under selective pressure from drugs it is relatively easy for HCV to incur SC-DRMs compared to other DRMs. In addition, we find that the number of SC-DRMs seems to negatively correlate with the efficacy of each NS3-specific drug. Overall, our results suggest an important role of epistasis in the emergence of NS3-specific DRMs. Accounting for epistatic interactions might therefore be critical for studying resistance to current HCV NS3 drugs as well as for developing new drugs targeting the NS3 protein.

Results

Importance of epistatic interactions in predicting HCV NS3 fitness

To study the role of epistatic interactions in the development of drug resistance for HCV NS3, we first inferred a fitness landscape for the HCV NS3 protein using available sequences for genotype 1a. This inference involved determining a prevalence landscape, an estimate of the probability of observing an NS3 protein sequence among naturally occurring HCV populations, using a “least-biased” maximum entropy probabilistic model (Methods). In this model, the probability of observing a sequence \(\mathbf{x}=\left[x_{1},\, x_{2},\ldots,\, x_{N}\right]\) is given by

$$P_{\mathbf{h,J}}(\mathbf{x})=\frac{e^{-E_{\mathbf{h,J}}(\mathbf{x})}}{\sum\limits_{\mathbf{x}'}e^{-E_{\mathbf{h,J}}(\mathbf{x}')}},\quad\text{where}\quad E_{\mathbf{h,J}}(\mathbf{x})=\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}J_{ij}\left(x_{i},\, x_{j}\right)+\sum_{i=1}^{N}h_{i}\left(x_{i}\right), \tag{1}$$

where the parameters \(h_{i}\left(x_{i}\right)\) represent the effect of mutations at individual residues i and the parameters \(J_{ij}\left(x_{i},\, x_{j}\right)\) account for the effect of epistatic interactions between mutations at two different residues i and j. \(E_{\mathbf{h,J}}(\mathbf{x})\) represents the energy of sequence \(\mathbf{x}\), which is inversely related to its prevalence. Similar models have been developed previously to predict direct residue contacts25,26,27. Here, we observed a strong negative correlation (\(\overline{r}=-0.79\), Fig. 1; see “Methods” section for details) between the sequence energy predicted from the inferred model and the infectivity measurements for 45 sequences obtained from experimental studies9,10,21,28,29 (listed in Supplementary Data 1). This suggests that the fitness landscape model serves as a good proxy for the intrinsic fitness landscape of HCV NS3. This result was consistent with maximum-entropy-based fitness landscapes inferred in the past for other HCV proteins (NS5B24 and E212,13), and proteins from HIV17,30,31,32,33 and poliovirus34. We also noted that the correlation obtained for the inferred model was much stronger than the correlation achieved by a model that ignores pair-wise epistatic interactions (\(\overline{r}=-0.55\), Fig. 1 inset; see Methods for details). Based on a bootstrapping procedure, the difference in the correlation obtained for the two models was also found to be statistically robust (Supplementary Fig. 1). This result suggests that considering epistatic interactions is important for reliably predicting the intrinsic fitness of HCV NS3.
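As a minimal sketch of how Eq. (1) can be evaluated, the Python snippet below computes the energy and the normalized probability for a toy model with three residues and a two-letter alphabet. The parameter values, the dictionary layout, and the function names are illustrative assumptions and not the inferred NS3 parameters. Setting the consensus residue's contribution to zero is one possible gauge choice, assumed here for readability. The normalization is done by brute-force enumeration, which is only feasible for toy sequence lengths.

```python
import itertools
import math
from typing import Dict, List, Tuple

# Hypothetical parameter containers (not the inferred NS3 model):
#   h[i][a]            field for amino acid a at residue i
#   J[(i, j)][(a, b)]  coupling for amino acids a, b at residues i < j
# Entries that are absent are treated as zero (assumed gauge).
Fields = List[Dict[str, float]]
Couplings = Dict[Tuple[int, int], Dict[Tuple[str, str], float]]


def sequence_energy(x: str, h: Fields, J: Couplings) -> float:
    """E(x) = sum_{i<j} J_ij(x_i, x_j) + sum_i h_i(x_i), as in Eq. (1)."""
    n = len(x)
    field_term = sum(h[i].get(x[i], 0.0) for i in range(n))
    coupling_term = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            coupling_term += J.get((i, j), {}).get((x[i], x[j]), 0.0)
    return field_term + coupling_term


def sequence_probabilities(seqs: List[str], h: Fields, J: Couplings) -> List[float]:
    """Boltzmann weights exp(-E), normalized over the listed sequences."""
    weights = [math.exp(-sequence_energy(s, h, J)) for s in seqs]
    z = sum(weights)
    return [w / z for w in weights]


# Toy values chosen for illustration only: "A" is the consensus, "B" a mutant.
TOY_H: Fields = [{"B": 1.0}, {"B": 2.0}, {"B": 0.5}]
TOY_J: Couplings = {(0, 1): {("B", "B"): -1.5}}  # negative J: compensatory pair

if __name__ == "__main__":
    all_seqs = ["".join(p) for p in itertools.product("AB", repeat=3)]
    for s, p in zip(all_seqs, sequence_probabilities(all_seqs, TOY_H, TOY_J)):
        print(s, round(sequence_energy(s, TOY_H, TOY_J), 3), round(p, 4))
```

In this toy setting the double mutant at the first two residues has a lower energy than the sum of the two single-mutant energies, which is the signature of a compensatory (negative \(J_{ij}\)) coupling.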

Fig. 1: Correlation between the sequence energy obtained from inferred NS3 fitness landscape and in-vitro infectivity measurements.

Normalized energy values computed from the inferred model correlate strongly with the experimental fitness measurements. In contrast, a conservation-only model provided a much lower correlation (inset). The legend shows references for fitness/infectivity measurements9,10,21,28,29. Normalization of fitness measurements and predicted model energies was performed by subtracting the mean from each data set and dividing by its standard deviation. Source data are provided as a Source Data file.

Association of DRMs with compensatory interactions

Many of the DRMs in HCV NS3 are known to be individually deleterious8,9,10. The emergence of DRMs may suggest that the fitness cost of DRMs is compensated by mutations elsewhere in the protein. To investigate if the residues involved in known NS3 DRMs4,5,6,7 (listed in Table 1) were associated with compensatory interactions, we studied the model parameters \(J_{ij}\). Large positive values of \(J_{ij}\) in Eq. (1), which increase sequence energy and thereby decrease its predicted fitness, represent strong antagonistic interactions or negative epistasis between residues i and j. Negative epistasis reduces the fitness of the double mutants and limits acquisition of additional mutations35. In contrast, large negative values of \(J_{ij}\) in Eq. (1) represent strong compensatory interactions or positive epistasis between residues i and j. Positive epistasis boosts the reproduction capability of double mutants, allowing viruses to acquire and retain drug resistance36. For HIV protease, positive epistasis among DRMs predicted by maximum entropy modeling (similar to ours) was shown to be consistent with the results of deep mutational scanning experiments18. In the case of HCV NS3, we found that pairs of mutations with large negative values of \(J_{ij}\) were more likely to involve DRMs compared to random expectations (Supplementary Fig. 2). This suggests the enrichment of positive epistasis in residues associated with HCV NS3 DRMs.

Table 1 List of NS3 drugs4,5,6,7 and the associated DRMs

Focusing on the top 10/100/300 pairs of mutations with large values of \(-J_{ij}\), we observed that some specific DRMs were associated with particularly strong compensatory interactions (Fig. 2a). Henceforth, we refer to the DRMs enriched in the top 300 pairs (\(p=4.8\times 10^{-61}\); two-sided Fisher’s exact test) as strongly coupled DRMs, “SC-DRMs”. The residues associated with the majority of SC-DRMs were involved in a sparse network of interactions (residues 36, 41, 54, 55, 71, 168, and 170), while a few had dense interaction networks (residues 80 and 122). Among the pairs involving SC-DRMs, two comprised both DRMs: residues 41 and 168 (ranked 1st), and residues 54 and 55 (ranked 3rd). The identified pairs involving SC-DRMs contain multiple residues that are in contact in the resolved NS3 protein structure, supporting the possibility that these residues may be interacting (Fig. 2b). Similar results have been reported for strongly coupled pairs of residues in maximum entropy models for multiple protein families25.
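The ranking step described above can be sketched as follows. This is a simplified illustration, assuming that couplings are stored per pair of mutations and that a DRM counts as strongly coupled when its residue appears in any of the top-k pairs; the enrichment test used to define SC-DRMs is not reproduced here. The residue numbers, coupling values, and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Mutation = Tuple[int, str]          # (residue number, mutant amino acid)
Pair = Tuple[Mutation, Mutation]    # two mutations at different residues


def top_compensatory_pairs(couplings: Dict[Pair, float], k: int) -> List[Pair]:
    """Return the k pairs with the largest -J, i.e. the most negative J."""
    return sorted(couplings, key=lambda pair: couplings[pair])[:k]


def strongly_coupled_drm_residues(
    couplings: Dict[Pair, float], drm_residues: Set[int], k: int
) -> Set[int]:
    """DRM residues that take part in at least one of the top-k pairs."""
    found: Set[int] = set()
    for (res_a, _), (res_b, _) in top_compensatory_pairs(couplings, k):
        for res in (res_a, res_b):
            if res in drm_residues:
                found.add(res)
    return found


# Made-up couplings for six hypothetical residues (illustration only).
TOY_COUPLINGS: Dict[Pair, float] = {
    ((1, "R"), (5, "E")): -2.0,
    ((2, "K"), (3, "S")): -1.2,
    ((4, "A"), (6, "C")): -0.9,
    ((1, "R"), (6, "C")): 0.7,
}
TOY_DRM_RESIDUES: Set[int] = {2, 5, 6}

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, sorted(strongly_coupled_drm_residues(TOY_COUPLINGS, TOY_DRM_RESIDUES, k)))
```

Varying k in this way mirrors the 10/100/300 thresholds in the text: a larger k admits weaker couplings and can only add residues to the strongly coupled set.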

Fig. 2: Identification of SC-DRMs and their significance.

a Network of interactions between the top 10/100/300 ranked pairs of mutations (ranked by the values of \(-J_{ij}\) from Eq. (1)). Interactions linking at least one DRM are shown in orange, and links between non-DRMs are shaded in gray. b Pairs of interacting residues involving SC-DRMs that are in contact based on the crystal structure of the NS3 protein (PDB ID: 4B6E [https://doi.org/10.2210/pdb4b6e/pdb]). The carbon-alpha atoms of each pair of residues are shown as colored spheres, and the distance between each pair is also labeled. Two residues were assumed to be in contact if their carbon-alpha atoms were <8 Å apart. c Inferred NS3 sectors and their association with SC-DRMs. Sectors (listed in Supplementary Table 1) were inferred using the GUI implementation of the robust co-evolutionary analysis approach, RocaSec37,90. The statistical significance (p-values) was determined using one-sided Fisher’s exact test. In addition to the set of residues associated with DRMs and SC-DRMs, the following known NS3 biochemical regions (listed in Supplementary Table 2) were provided to RocaSec: (i) NS3-NS4A-Pro-Act: NS3-NS4A interface for protease activation40; (ii) NS3-NS4A-Mem-Asso: NS3-NS4A membrane association41; (iii) NS5A-Hyper-Phos: NS5A hyper-phosphorylation42,43; (iv) NS3-Motif-Enz-Heli: motif important for enzymatic and helicase activities in NS344; and (v) NS3-Intra-Dimer-Int: intra-dimer interface in NS3 helicase45. Source data are provided as a Source Data file.

To explore if the SC-DRMs identified by our model are also associated with groups of residues known to mediate different NS3 functions, we applied a robust co-evolutionary analysis approach that we developed previously37. Distinct from maximum-entropy-based fitness landscape models, this approach identifies collective groups of co-evolving residues (called sectors), rather than pair-wise interactions. Such sectors, for HIV and HCV, have been shown to distinctly associate with protein functional or structural domains37,38,39. Applying the co-evolutionary analysis on the NS3 data considered in this work, we found that SC-DRMs were enriched in multiple inferred sectors (Fig. 2c; for details of inferred sectors, see Supplementary Table 1). (A similar enrichment to SC-DRMs was noted for DRMs but with slightly weaker statistical significance.) Of the sectors with known biochemical associations40,41,42,43,44,45 (for details of the experimentally-defined biochemical domains, see Supplementary Table 2), SC-DRMs were particularly enriched in the sectors linked with the NS3-NS4A interface known to mediate serine protease activity of NS340. While there is no overlap between the SC-DRM-associated residues and the residues known to be critical for NS3 protease activity, inference of a sector encompassing both of these sets of residues suggests that they may be co-evolving. Interestingly, SC-DRMs were not enriched in the inferred sector associated with the NS3 helicase activity. This is in line with the fact that none of the approved HCV NS3 drugs are helicase inhibitors4.

Model predictions correlate with known NS3 DRM compensation data

Experimental data derived from in vivo or in vitro studies offers the most direct evidence for compensatory mutations associated with SC-DRMs. However, such data is currently limited. In vivo evidence is available for the SC-DRM Q80K which has been reported to co-occur with the A91S mutation among individuals who experience HCV treatment failure10. This compensatory interaction has also been observed in vitro in the H77 strain10. Experimental evidence of SC-DRM D168E being compensated by Q41R9 has also been reported for the H77 strain. These two pairs of compensatory mutations were both associated with large values of −Jij; Q80K and A91S were ranked 60th, and D168E and Q41R were ranked 1st (Fig. 2a).

We further investigated the mutational interactions predicted by our model for these two SC-DRMs, Q80K and D168E. We specifically examined the energy changes in the H77 strain bearing the D168E or Q80K mutants (denoted H77D168E and H77Q80K, respectively) upon introducing all possible mutations. A negative energy change indicates increased fitness, whereas a positive change suggests a fitness reduction. Strikingly, our model predicted that the Q41R and A91S mutations yielded the second-most negative energy change compared to all other mutations in the H77D168E and H77Q80K strains, respectively (Fig. 3). This outcome is consistent with the documented compensatory roles of these mutations for DRMs D168E and Q80K, and points to the specificity of our model in describing epistatic compensatory pathways.
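The scan over all single mutations in a DRM-carrying background can be illustrated with a short Python sketch. It assumes the energy function of Eq. (1) with absent parameters treated as zero; the toy background, alphabet, parameter values, and function names are hypothetical and stand in for the H77-derived sequences and the inferred NS3 model.

```python
from typing import Dict, List, Tuple

Fields = List[Dict[str, float]]
Couplings = Dict[Tuple[int, int], Dict[Tuple[str, str], float]]


def energy(x: str, h: Fields, J: Couplings) -> float:
    """Energy of Eq. (1); parameters that are not listed count as zero."""
    e = sum(h[i].get(a, 0.0) for i, a in enumerate(x))
    for i in range(len(x) - 1):
        for j in range(i + 1, len(x)):
            e += J.get((i, j), {}).get((x[i], x[j]), 0.0)
    return e


def single_mutant_energy_changes(
    background: str, alphabet: str, h: Fields, J: Couplings
) -> Dict[Tuple[int, str], float]:
    """Map (position, new residue) to E(background + X) - E(background)."""
    e0 = energy(background, h, J)
    changes: Dict[Tuple[int, str], float] = {}
    for i, current in enumerate(background):
        for aa in alphabet:
            if aa == current:
                continue
            mutant = background[:i] + aa + background[i + 1:]
            changes[(i, aa)] = energy(mutant, h, J) - e0
    return changes


# Toy example: position 0 carries a costly mutation "B" (a stand-in for a DRM),
# and "B" at position 1 is coupled to it with a negative (compensatory) J.
TOY_H: Fields = [{"B": 1.0}, {"B": 2.0}, {"B": 0.5}]
TOY_J: Couplings = {(0, 1): {("B", "B"): -2.5}}

if __name__ == "__main__":
    changes = single_mutant_energy_changes("BAA", "AB", TOY_H, TOY_J)
    for mutation, delta in sorted(changes.items(), key=lambda kv: kv[1]):
        print(mutation, delta)
```

Sorting the changes from most negative to most positive gives the ranking used to ask where a candidate compensatory mutation falls among all alternatives. In this toy case the coupled mutation lowers the energy even though it is costly on its own.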

Fig. 3: Model predictions of mutational interactions for SC-DRMs Q80K and D168E.

Histogram of the change in energy observed by all single mutations X in the H77 strain carrying (a) the D168E mutant and (b) the Q80K mutant. Energy(H77D168E) and Energy(H77Q80K) are the predicted energy for the H77 strain carrying the D168E and Q80K mutant. The predicted energy for the H77 strain carrying the D168E mutant and an additional single mutation X, as well as for the H77 strain carrying the Q80K mutant and an additional single mutation X, are given by Energy(H77D168E+X) and Energy(H77Q80K+X), respectively. Source data are provided as a Source Data file.

Extending the analysis to predict compensatory mutations associated with SC-DRMs in different sequence backgrounds as opposed to only H77 (see “Methods” section for details), we found that 168E and 41R were compensatory (for each other) for all sequence backgrounds, while 91S compensated for 80K in ~23% of sequence backgrounds (Supplementary Table 3). We also identified potential compensatory mutations for SC-DRMs 36L, 55A, 122C/G, and 177V. These identify specific targets for future experimental studies.

Enrichment of SC-DRMs in NS3 drugs

Currently, there are nine known NS3-targeting drugs used for treating HCV genotype 1a infections4,5,6,7. These drugs can be divided into two classes: NS3-specific DAAs that exclusively target NS3 (telaprevir, boceprevir, simeprevir, vaniprevir, and danoprevir) and multi-protein DAAs that target NS3 together with other proteins (paritaprevir, grazoprevir, voxilaprevir, and glecaprevir). For all of these drugs, a total of 20 NS3-specific DRMs have been identified, ranging from 3 to 15 DRMs per drug. Each NS3 drug, irrespective of the drug class, was found to comprise at least two identified SC-DRMs (Table 1). This association reached statistical significance (“Methods” section) for most drugs (5/9; two from NS3-specific DAAs and three from multi-protein DAAs; Fig. 4) and was robust to the number of top-coupled pairs of mutations used for defining SC-DRMs (Supplementary Fig. 3). In contrast, the remaining DRMs (non-SC-DRMs) were generally not significantly enriched in drugs (1/9, Supplementary Fig. 4). The enrichment of the identified SC-DRMs in NS3-targeting DAAs suggests that they play a significant role in conferring resistance to both classes of drugs. This observation suggests that epistasis is an important factor contributing to the acquisition of resistance to NS3 drugs.
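A one-sided enrichment p-value of the kind reported in Fig. 4 can be sketched with a hypergeometric tail. This assumes a null model in which a drug's DRMs are drawn at random, without replacement, from the pool of all known NS3 DRMs; the exact procedure is described in Methods and may differ. The pool sizes (20 DRMs, 9 of them SC-DRMs) are taken from the text, whereas the per-drug counts in the example and the function name are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population: int, successes: int, draws: int, observed: int) -> float:
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    population: total number of DRMs in the pool
    successes:  number of SC-DRMs in the pool
    draws:      number of DRMs associated with the drug
    observed:   number of those that are SC-DRMs
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favorable / total


if __name__ == "__main__":
    # Hypothetical drug with 5 DRMs, 4 of which are SC-DRMs.
    print(hypergeom_upper_tail(population=20, successes=9, draws=5, observed=4))
```

Because the tail only counts outcomes at least as extreme as the observed one, a drug with few DRMs needs nearly all of them to be SC-DRMs before the p-value becomes small.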

Fig. 4: SC-DRMs are enriched in NS3 drugs.

Statistical significance of the identified number of SC-DRMs associated with each drug. The p-value measures the probability of observing by random chance at least the observed number of SC-DRMs among all DRMs associated with a drug (one-sided test; see “Methods” section for details). Results with p-value < 0.05 are marked with a star on the top of each bar. Source data and exact p-values are provided as a Source Data file.

Some DRMs have been reported to disrupt drug binding while having minimal effect on the NS3 protease function7,46. Thus, we investigated whether SC-DRMs may also be enriched in binding residues of the drugs. Structures for four drugs in complex with the NS3 protein are available (PDB ID: [3M5L] for danoprevir, [3SU3] for vaniprevir, [3SV6] for telaprevir, and [3SUD] for grazoprevir). Based on these, we identified binding residues for each drug as those NS3 residues that are within 5 Å of a drug atom46. While, for each drug, not all drug-specific DRMs are located within the binding residues (Fig. 5a), DRMs were found to be statistically significantly enriched within them (Fig. 5b). The same was also true for SC-DRMs of danoprevir and grazoprevir, but not for vaniprevir and telaprevir (Fig. 5b). This suggests that DRMs associated with these drugs, and SC-DRMs in the case of danoprevir and grazoprevir, may confer resistance by directly affecting drug binding.
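The 5 Å criterion for binding residues can be sketched as a simple distance filter. The snippet assumes that atomic coordinates (in Å) have already been extracted from a structure file; parsing of PDB entries is omitted, and the coordinates, residue labels, and function name below are hypothetical.

```python
from math import dist
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def binding_residues(
    residue_atoms: Dict[int, List[Coord]],
    drug_atoms: List[Coord],
    cutoff: float = 5.0,
) -> List[int]:
    """Residues with at least one atom within `cutoff` Å of any drug atom."""
    return sorted(
        res
        for res, atoms in residue_atoms.items()
        if any(dist(a, d) <= cutoff for a in atoms for d in drug_atoms)
    )


# Made-up coordinates for three hypothetical residues and a two-atom ligand.
TOY_RESIDUES: Dict[int, List[Coord]] = {
    1: [(0.0, 0.0, 0.0)],
    2: [(10.0, 0.0, 0.0)],
    3: [(3.0, 4.0, 0.0), (20.0, 0.0, 0.0)],
}
TOY_DRUG: List[Coord] = [(0.0, 0.0, 0.0), (6.0, 4.0, 0.0)]

if __name__ == "__main__":
    print(binding_residues(TOY_RESIDUES, TOY_DRUG))
```

A residue qualifies as soon as any one of its atoms is close enough, so long side chains can place a residue in the binding set even when its backbone is farther away.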

Fig. 5: SC-DRMs appear to impact binding of NS3 drugs through direct interactions.

a Binding residues of drugs shown on the crystal structure of the NS3 protein-drug complexes (PDB ID: 3M5L [https://doi.org/10.2210/pdb3m5l/pdb] for danoprevir, 3SU3 [https://doi.org/10.2210/pdb3su3/pdb] for vaniprevir, 3SV6 [https://doi.org/10.2210/pdb3sv6/pdb] for telaprevir, and 3SUD [https://doi.org/10.2210/pdb3sud/pdb] for grazoprevir). The carbon-alpha atoms of the identified drug-binding residues are shown as colored spheres. The drug-binding residues associated with DRMs and SC-DRMs for each drug are shown in green and blue, respectively, while those that are not associated with DRMs are shown in gray. Drugs in each structure are shown as black sticks. The NS3 residues within 5 Å of drug atoms were considered as drug-binding residues. b, c Statistical significance of the number of (b) drug-specific DRMs/SC-DRMs and (c) all DRMs/SC-DRMs in binding residues of each of the four considered drugs. Here, drug-specific DRMs/SC-DRMs are listed in Table 1 for each of the four drugs, while all DRMs/SC-DRMs refer to the DRMs/SC-DRMs known for all drugs. The p-value measures the probability of observing by random chance at least the observed number of DRMs or SC-DRMs among all binding residues for each drug (one-sided test; see “Methods” section for details). Results with p-value < 0.05 are marked with a star on the top of each bar. Source data and exact p-values are provided as a Source Data file.

The known DRMs for each drug have been identified either via limited in-vitro experiments based on their adverse impact on DAA activity, or in-vivo in a few treated patients in clinical trials4. Hence, information on DRMs available for each NS3 drug may not be complete. This, in addition to the observation that several DRMs are shared across multiple NS3 drugs (Table 1), motivates analysis of the enrichment of the collective set of DRMs (i.e., associated with all NS3 drugs) in the binding residues of the four drugs with available structures (detailed in Supplementary Table 4). In this case, the enrichment of DRMs was statistically more significant (Fig. 5c) than that observed for drug-specific DRMs (Fig. 5b). SC-DRMs were also now statistically significantly enriched in all four drugs (Fig. 5c). This analysis identified numerous DRMs and SC-DRMs that have been determined for specific drugs and which lie within the binding footprints of other drugs, but have not yet been reported as conferring resistance for those drugs. Hence, these may correspond to putative DRMs or SC-DRMs that have yet to be observed in-vitro or in-vivo. SC-DRMs at residues 41, 55, and 168 were common among the binding residues of all four drugs, with two of these SC-DRMs (41 and 168) known to be involved in compensatory interactions via ex-vivo experiments9. Collectively, this analysis suggests that DRMs may facilitate resistance by disrupting the binding of drugs to the NS3 protease. In the case of SC-DRMs, this resistance is further facilitated through compensatory interactions of networked mutations, wherein the deleterious effect of an SC-DRM might be compensated by another mutation in the network.

Our analysis can be used to predict mutations that could potentially confer drug resistance. Specifically, we identified 25 binding residues for drugs with known structures (PDB ID: [3M5L] for danoprevir, [3SU3] for vaniprevir, [3SV6] for telaprevir, and [3SUD] for grazoprevir), out of which 14 residues were not previously associated with any known DRMs (Supplementary Table 5). However, based on our model, we found that mutations at four of these 14 residues (residues 78, 79, 123, and 159) were associated with strong compensatory interactions. Furthermore, at least two of these four residues were present in the binding residues of all four drugs considered, suggesting that mutations at these residues may potentially confer resistance to the drugs.

SC-DRMs provide an easy escape from drug-induced selection pressure

In general, the initiation of drug treatment alters the in-host environment in which HCV replicates. Selective pressure exerted by a drug may promote mutations that simultaneously resist the drug and maintain replicative capacity. Since we observed that SC-DRMs were statistically significantly enriched in most drugs (Fig. 4) while the remaining DRMs (non-SC-DRMs) were generally not (Supplementary Figure 4), we investigated whether it is easier for these SC-DRMs to accumulate in the viral population than other DRMs. We integrated the inferred fitness landscape in an in-host evolutionary model to quantify the average time, termed “escape time”, that the virus takes to escape from selective pressure targeting the residues involved in DRMs (see Methods for details). This Wright-Fisher-like model47 accounts for the complex stochastic dynamics involved in in-host evolution of HCV quasispecies, including host-virus and virus-virus interactions, and multiple pathways that HCV may employ to escape from selective pressure exerted by a drug. Similar evolutionary models have been employed previously by us and others for determining the average immune escape time associated with residues in HCV E212,13 and HIV Gag32.
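To make the notion of escape time concrete, the following Python sketch simulates a heavily simplified Wright-Fisher-like process. It is not the in-host model described in Methods: it assumes a binary alphabet (0 = consensus, 1 = mutant), a fixed population size, replicative fitness proportional to exp(−E), and drug pressure modeled as a hypothetical energy penalty on sequences that are not mutated at the targeted residue. Escape is declared when mutants at the target residue form a majority. All parameter values and function names are assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Tuple

Seq = Tuple[int, ...]  # binary encoding: 0 = consensus residue, 1 = mutant


def energy(x: Seq, h: List[float], J: Dict[Tuple[int, int], float]) -> float:
    """Binary-alphabet form of the energy in Eq. (1)."""
    e = sum(h[i] for i, s in enumerate(x) if s)
    e += sum(v for (i, j), v in J.items() if x[i] and x[j])
    return e


def escape_time(h, J, target: int, penalty: float, pop_size: int,
                mu: float, max_gen: int, rng: random.Random) -> int:
    """Generations until mutants at `target` exceed half of the population.

    Returns max_gen if no escape occurs (a censored value).
    """
    pop: List[Seq] = [tuple([0] * len(h))] * pop_size
    for gen in range(1, max_gen + 1):
        # Mutation: flip each site independently with probability mu.
        pop = [tuple(1 - s if rng.random() < mu else s for s in seq) for seq in pop]
        # Selection: resample in proportion to exp(-E); sequences that are
        # unmutated at the target residue pay the assumed drug penalty.
        weights = [
            math.exp(-(energy(seq, h, J) + (penalty if seq[target] == 0 else 0.0)))
            for seq in pop
        ]
        pop = rng.choices(pop, weights=weights, k=pop_size)
        if 2 * sum(seq[target] for seq in pop) > pop_size:
            return gen
    return max_gen


def mean_escape_time(h, J, target, penalty, pop_size, mu, max_gen,
                     runs: int, seed: int = 0) -> float:
    """Average escape time over independent runs with a fixed seed."""
    rng = random.Random(seed)
    times = [escape_time(h, J, target, penalty, pop_size, mu, max_gen, rng)
             for _ in range(runs)]
    return sum(times) / runs


if __name__ == "__main__":
    h = [3.0, 1.0]  # toy fitness costs of mutating residue 0 and residue 1
    for label, J in (("no coupling", {}), ("compensatory", {(0, 1): -2.0})):
        t = mean_escape_time(h, J, target=0, penalty=2.0, pop_size=100,
                             mu=0.01, max_gen=200, runs=5)
        print(label, t)
```

Running the two toy configurations side by side is one way to explore how a compensatory coupling changes the escape time; the printed values depend on the seed and on the assumed parameters and should not be read as model predictions.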

Contrasting the escape times of residues associated with SC-DRMs against those for the residues associated with the remaining NS3 DRMs revealed that the former set of residues carries shorter escape times (p = 0.0018, Mann-Whitney test; Fig. 6a). Investigating the escape time of the residues associated with individual NS3 DRMs showed that almost all residues associated with SC-DRMs had a shorter escape time than the remaining NS3 DRMs (Fig. 6b). These results suggest that SC-DRMs provide relatively easy pathways, enabled via epistatic interactions, for HCV to escape drug-induced selective pressure. This provides a rationale for the enrichment of SC-DRMs in NS3 drugs (Fig. 4).

Fig. 6: Escape time of residues involved in NS3 DRMs.

a Comparison between escape time of residues involved in SC-DRMs and the remaining residues involved in DRMs. In each box plot, the middle line indicates the median, the edges of the box represent the first and third quartiles, and whiskers extend to 1.5 times the interquartile range from the edges. The reported p-value was calculated using the two-sided Mann–Whitney test (n1 = 9 SC-DRMs and n2 = 11 remaining DRMs). b Individual escape time of residues involved in DRMs of the NS3 protein. SC-DRMs are shown in blue and the remaining DRMs in orange. Source data are provided as a Source Data file.

Accumulation of SC-DRMs appears to impact the efficacy of NS3 drugs

We further explored whether there exists any (inverse) relation between the number of SC-DRMs associated with a drug and the expected efficacy of the drug. Efficacy data was collected from multiple clinical studies48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69 (listed in Supplementary Table 6). (Multi-protein DAAs have generally been reported to achieve much higher efficacy than NS3-specific DAAs.) We observed a strong negative correlation between the number of SC-DRMs and the efficacy of NS3-specific DAAs (r = −0.67; Fig. 7a) as well as for multi-protein DAAs (r = −0.77; Fig. 7b). While the limited number of drugs did not allow these tests to reach statistical significance, the observed strong negative correlation is suggestive of the potential impact that abundance of SC-DRMs can have on the efficacy of HCV drugs.
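The efficacy comparison can be sketched in two steps: pooling SVR rates per drug and rank-correlating the pooled efficacy with the number of SC-DRMs. The snippet below assumes that each study is weighted by its number of patients, which is a plausible but unconfirmed choice of weights, and implements Spearman correlation as the Pearson correlation of average ranks. All numbers in the example are invented and the function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def weighted_svr(trials: Sequence[Tuple[float, int]]) -> float:
    """Patient-weighted mean SVR from (svr_rate, n_patients) pairs."""
    total = sum(n for _, n in trials)
    return sum(rate * n for rate, n in trials) / total


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average
        pos = end + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Invented example: number of SC-DRMs and pooled SVR for four drugs.
    n_sc_drms = [2, 3, 5, 6]
    svr = [
        weighted_svr([(0.90, 100), (0.80, 50)]),
        weighted_svr([(0.85, 200)]),
        weighted_svr([(0.70, 80), (0.75, 120)]),
        weighted_svr([(0.72, 60)]),
    ]
    print(spearman(n_sc_drms, svr))
```

With only four or five drugs per class, a rank correlation can take only a limited set of values, which is one reason such a test has limited statistical power.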

Fig. 7: Correlation between the drug efficacy and the number of SC-DRMs associated with each drug.

Efficacy of a drug is represented by the weighted average of sustained virological response (SVR) rate reported for the drug. Results are shown for (a) NS3-specific DAAs and (b) multi-protein DAAs. SVR rates for each drug were curated from the literature48,69 (listed in Supplementary Data 1; see Supplementary Table 6 for details). Source data are provided as a Source Data file. The p-values measure the two-sided significance level of the Spearman correlation.

Discussion

Emergence of DRMs is a common phenomenon observed in patients undergoing HCV drug therapy, which often negatively impacts the treatment outcome. Studies indicate that a notable proportion of patients who fail DAA treatment have DRMs, with prevalence rates ranging from 20-90% depending on the specific DAA used70,71,72. The accumulation of DRMs in such patients carries clinical and public health concern due to the limited treatment options available73 and the potential transmission of drug-resistant strains to other individuals74. In addition, the widespread use of HCV DAAs could lead to the prevalence and dominance of DRMs in the future, similar to the observed increase in HIV DRMs with the use of antiretroviral therapy (ART)75. Thus, it is important to understand the evolutionary factors that contribute to emergence of HCV DRMs.

Many of the DRMs are known to be individually deleterious. Therefore, other evolutionary factors, such as epistasis, may be facilitating their emergence. To investigate this aspect, we first inferred a fitness landscape for the NS3 protein (one of the main proteins targeted by HCV drugs) considering both the effects of individual mutations and interactions between mutations at different residues. Predictions obtained from the inferred model correlated well with multiple experimental data sources. The analysis of model parameters capturing pair-wise epistatic interactions (couplings) showed that certain DRMs, namely SC-DRMs, were associated with strong compensatory interactions, were seemingly involved in mediating protease function, and were prevalent among NS3 drugs. Upon integrating the inferred fitness landscape into an evolutionary model, we found that SC-DRMs were associated with a relatively easy escape from drug-induced selection pressure. The number of SC-DRMs also appeared to correlate inversely with the efficacy of NS3-targeting drugs. Overall, our results suggest that epistatic interactions associated with SC-DRMs provide easy pathways that contribute to drug resistance.

The inference of the fitness landscape from population-level sequence data for NS3 is complex given the selective pressure from host immune responses and recent use of DAAs. The high correlation between prevalence and fitness of NS3 (Fig. 1) may be surprising, but a similar relationship has been reported for other HCV proteins12,13,24, as well as several HIV proteins17,30,31,32,33. The mechanistic rationale for this correspondence has been previously proposed for HIV proteins, with three key factors identified76: (i) a diverse and largely ineffective immune response due to host genetic diversity, (ii) reversion to the ancestral (fitter) sequence upon transmission to a new host, and (iii) the absence of robust and effective natural or vaccine-induced herd memory responses, which would shift the virus away from the steady state. Although HCV differs from HIV, it shares several similarities and may also involve these factors. Specifically, given the lack of a functional vaccine, the fact that most NS3 sequences were sampled from chronic patients, and the fact that NS3 is a target of T cells77, it is likely that NS3 experiences diverse and ineffective immune responses in such patients. Reversion to the consensus amino acid upon HCV transmission to a new host has also been documented78.

DAAs could potentially lead to population-level selective pressure that may bias our data; however, such effects are not expected to be strong. This is because DAAs are currently only available to a limited fraction of HCV-infected individuals (less than 20%79). To examine this more explicitly for the NS3 data set that we used (comprising 7370 sequences), we investigated the 58 papers from which these sequences were reported. This analysis revealed that the large majority of the sequences (5877 sequences) were indeed from drug-naïve patients. Comparing statistical properties of the complete dataset (7370 sequences) with those of the drug-naïve subset (5877 sequences) revealed a strong correlation (r > 0.9, Supplementary Fig. S5a) between the mutation frequencies and pair-wise mutation frequencies in both datasets. We also constructed a maximum entropy model using only the drug-naïve sequences and found that the predicted sequence energies from the drug-naïve model exhibited a correlation (r = −0.70, Supplementary Fig. S5b) with the in-vitro fitness measurements, which was comparable to the correlation observed with the complete dataset.

It is noteworthy that 36 out of the 45 fitness measurements compiled from different experimental studies were associated with DRMs. The high correlation between our model’s predictions (inferred using the complete dataset) and the experimental fitness measurements (Fig. 1) supports the view that our model can accurately capture the intrinsic effect of DRMs, despite being trained mainly on drug-naïve sequences. This is because DRMs have been observed in drug-naïve patients as well80,81. We further evaluated the correlation between our model predictions and the 36 fitness measurements exclusively associated with DRMs and found it also to be high (r = −0.72, Supplementary Fig. 6), providing additional evidence for the ability of our model to capture the effect of NS3 DRMs.

Among the pairs of residues involving SC-DRMs (Fig. 2a), two pairs (pairs 41–168 and 80–91) have been demonstrated to be involved in compensatory interactions using ex-vivo experiments9,10. For the remaining pairs, based on the available NS3 protein structure, a few pairs were found to be in contact (pairs 36–45, 54–55, 66–71, and 170–174) (Fig. 2b), further suggesting that epistatic interactions may exist between these pairs. Such pairs of residues involving SC-DRMs provide directions for future experimental studies investigating compensatory interactions. These may include experiments that quantify the change in replicative fitness, protein folding, or protease enzymatic function upon mutating these pairs of residues individually and simultaneously.

Our analysis revealed an inverse correlation between the number of SC-DRMs and the efficacy of NS3-targeting drugs (Fig. 7). However, this relationship may be influenced by various confounding factors. These include, for instance, differences in drug dosage and duration, whether the drug was used in combination with interferon and/or ribavirin, whether peg-interferon therapy was administered prior to DAA treatment, as well as different host-specific factors of patients (cirrhotic status, HCV RNA level, and HLA composition). These factors could not be explicitly accounted for in our analysis. Further research with more detailed data on these confounding factors is needed to fully understand their influence on drug efficacy and to provide more robust conclusions.

While we focused on epistasis within the NS3 protein, inter-protein interactions may also play a role in conferring resistance to drugs. For instance, interactions between NS5A and NS5B – the two other proteins targeted by multi-protein DAAs – are known to be critical for HCV RNA replication82, and hence, interactions between these proteins might also affect the emergence of DRMs. Moreover, we observed a marginally significant (p = 0.07, Mann–Whitney test) difference in the number of DRMs of NS3-specific DAAs (4–10) and multi-protein DAAs (8–16). This suggests that the resistance profile of these two classes of drugs may be different, thereby motivating further investigation into inter-protein interactions between different HCV proteins for conferring drug resistance.

We predicted SC-DRMs to be associated with shorter escape time from drug-induced selective pressure (Fig. 6a), which supports their enrichment in each NS3 drug (Fig. 4). In addition, this analysis revealed that DRMs at residues 138 and 158 were associated with much higher escape time compared to other DRMs (Fig. 6b). This suggests that it may be hard for HCV to escape from drug-induced selective pressure by incurring mutations at these NS3 residues. Thus, preferentially targeting such residues may be desirable for designing robust next-generation HCV drugs and vaccines.

The high genetic variability among different HCV genotypes and subtypes makes them differently susceptible to the development of NS3 DRMs83. Thus, certain drugs only work for specific genotypes/subtypes. For instance, telaprevir, boceprevir, and vaniprevir only work for genotype 1 infections, while simeprevir, paritaprevir, and danoprevir can be used for treating both genotypes 1 and 4 infections4. In line with this, we have also shown in a previous study that evolutionary constraints are different across HCV subtypes13, which may also be a contributing factor in the observed difference in efficacy of drugs against different genotypes/subtypes. Thus, while we focused here only on NS3 subtype 1a, extending this analysis to study different HCV genotypes/subtypes might be helpful to understand genotype-specific differences in drug efficacies.

Our study also has several limitations. First, we focused on pair-wise epistatic interactions only. Higher-order epistatic interactions may also contribute to viral fitness, and these are not captured by our model. Inferring such higher-order effects is challenging and requires larger data sets with higher sequence variability. Second, we are unable to systematically study the resistance profile of multi-protein DAAs. This would require a joint model considering multiple proteins together. Again, data limitations currently preclude the development of such joint multi-protein models. Third, while our analysis shows an inverse relation between the efficacy of a given drug and the number of associated SC-DRMs, there are multiple factors that may potentially confound a relative analysis of reported drug efficacies. These include differences among drugs with respect to dosage, duration, and administration, along with differences among characteristics of patients. Each of these factors may influence the efficacy of drugs, and they could not be explicitly accounted for in our analysis. More detailed data related to the effect of these confounding factors are required to deconvolve the impact of SC-DRMs on drug efficacy.

Methods

Sequence data preprocessing

We downloaded 9683 NS3 genotype 1a aligned protein sequences (coverage ≥ 99%) from the HCV-GLUE database (http://hcv.glue.cvr.ac.uk; refs. 5,6). We conducted principal component analysis (PCA) of the pair-wise similarity matrix (9683 × 9683) constructed from the sequence data84 to remove 148 outlying sequences. Briefly, a sequence was considered an outlier if it lay more than 3 scaled median absolute deviations away from the median of either the first or second PC85. The scaled median absolute deviation is given by \(c\times \mathrm{median}\left(\mathrm{abs}\left(A_{i}-\mathrm{median}(A)\right)\right)\), where A is the first or second PC, \(A_{i}\) is the ith element of that PC, \(c=-1/(\sqrt{2}\times \mathrm{erfcinv}(3/2))\approx 1.482\), and erfcinv() is the inverse complementary error function. To avoid unnecessary patient bias that can compromise model predictive ability (Supplementary Figure 7), we excluded 2167 sequences that were not associated with any patients. These filtering procedures resulted in M = 7370 sequences (accession numbers listed in Supplementary Data 2) from W = 4773 patients. Next, we excluded from these data 116 fully conserved residues, i.e., residues where no mutation was observed in any sequence. This excluded residue 156 from our analysis as it was fully conserved in our data, and therefore, DRMs associated with it were not investigated in our work. The final multiple sequence alignment (MSA) comprised M = 7370 sequences and N = 515 residues.
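
The outlier rule described above can be illustrated with a minimal sketch. It is not the code used in this study: it assumes the first and second PC scores have already been computed, the function names `scaled_mad` and `outlier_indices` are hypothetical, and the toy scores are made up. The constant c is obtained from the equivalent expression 1/Φ⁻¹(3/4), where Φ⁻¹ is the inverse standard normal cumulative distribution function.

```python
from statistics import NormalDist, median

def scaled_mad(values):
    """Scaled median absolute deviation: c * median(|A_i - median(A)|)."""
    # c = -1 / (sqrt(2) * erfcinv(3/2)) is the same number as 1 / Phi^{-1}(3/4).
    c = 1.0 / NormalDist().inv_cdf(0.75)
    med = median(values)
    return c * median(abs(v - med) for v in values)

def outlier_indices(pc1, pc2, n_mads=3.0):
    """Indices lying more than n_mads scaled MADs from the median of PC1 or PC2."""
    flagged = set()
    for scores in (pc1, pc2):
        med = median(scores)
        cutoff = n_mads * scaled_mad(scores)
        flagged.update(k for k, s in enumerate(scores) if abs(s - med) > cutoff)
    return sorted(flagged)

# Made-up PC scores for six sequences (illustration only).
pc1 = [0.0, 0.1, -0.1, 0.2, -0.2, 10.0]
pc2 = [0.3, -0.3, 0.1, -8.0, 0.0, -0.1]
print(outlier_indices(pc1, pc2))  # -> [3, 5]
```

A sequence is flagged if it is extreme along either PC, matching the "either the first or second PC" criterion stated above.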

Inference of HCV NS3 fitness landscape

We constructed a least-biased maximum entropy model for the NS3 protein that can reproduce the single and double mutant probabilities of the MSA that are given by

$$\begin{array}{c}{f}_{i}(a)=\frac{1}{W}\mathop{\sum }\limits_{k=1}^{M}{w}_{k}\delta ({x}_{i}^{k},\, a)\\ {f}_{ij}(a,\, b)=\frac{1}{W}\mathop{\sum }\limits_{k=1}^{M}{w}_{k}\delta ({x}_{i}^{k},\, a)\delta ({x}_{j}^{k},\, b).\end{array}$$
(2)

Here, \({x}_{i}^{k}\) is the ith amino acid in the sequence k, wk is the reciprocal of the number of MSA sequences from the patient that sequence k was obtained from, and δ(a, b) is the Kronecker delta function. As described in Eq. (1), the maximum entropy model assigns a sequence \({{{{{{{\bf{x}}}}}}}}=\left[{x}_{1},\, {x}_{2},\ldots,\, {x}_{N}\right]\) the probability

$${P}_{{{{{{{{\bf{h,J}}}}}}}}}{{{{{{{\bf{(x)}}}}}}}}=\frac{{e}^{-{E}_{{{{{{{{\bf{h,J}}}}}}}}}{{{{{{{\bf{(x)}}}}}}}}}}{Z},\,{{\mbox{where}}}\,{E}_{{{{{{{{\bf{h,J}}}}}}}}}{{{{{{{\bf{(x)}}}}}}}}=\mathop{\sum }\limits_{i=1}^{N-1}\mathop{\sum }\limits_{j=i+1}^{N}{J}_{ij}\left({x}_{i},\, {x}_{j}\right)+\mathop{\sum }\limits_{i=1}^{N}{h}_{i}\left({x}_{i}\right),$$

where h is the set of all fields that represent the effect of mutations at a single residue, and J is the set of all couplings that represent the effect of interactions between mutations at two different residues. \(Z=\sum_{\mathbf{x}}e^{-E_{\mathbf{h,J}}(\mathbf{x})}\) is a normalization factor, and \(E_{\mathbf{h,J}}(\mathbf{x})\) represents the energy of sequence x. The fields h and couplings J are chosen such that the single and double mutant probabilities obtained from the model match \(f_{i}(a)\) and \(f_{ij}(a,\, b)\), respectively (Eq. (2)), i.e.,

$$\begin{array}{c}\mathop{\sum}\limits_{{{{{{{{\bf{x}}}}}}}}}\delta \left({x}_{i},\, a\right){P}_{{{{{{{{\bf{h,\, J}}}}}}}}}{{{{{{{\bf{(x)}}}}}}}}={f}_{i}(a)\\ \mathop{\sum}\limits_{{{{{{{{\bf{x}}}}}}}}}\delta \left({x}_{i},a\right)\delta \left({x}_{j},\, b\right){P}_{{{{{{{{\bf{h,J}}}}}}}}}{{{{{{{\bf{(x)}}}}}}}}={f}_{ij}(a,\, b).\end{array}$$
(3)
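
To make the energy function and the matching conditions in Eq. (3) concrete, the following sketch evaluates the model on a toy system that is small enough to enumerate. The three-residue system, the two-letter alphabet, and all parameter values are hypothetical and chosen only for illustration; for the real NS3 alignment the normalization factor Z cannot be enumerated in this way.

```python
import math
from itertools import product

def energy(x, h, J):
    """E(x) = sum_{i<j} J_ij(x_i, x_j) + sum_i h_i(x_i)."""
    n = len(x)
    e = sum(h[i][x[i]] for i in range(n))
    for i in range(n - 1):
        for j in range(i + 1, n):
            # Missing couplings are treated as zero.
            e += J.get((i, j), {}).get((x[i], x[j]), 0.0)
    return e

def boltzmann(h, J, alphabet):
    """Exact P(x) = exp(-E(x)) / Z by enumerating all sequences (toy sizes only)."""
    states = list(product(alphabet, repeat=len(h)))
    weights = [math.exp(-energy(x, h, J)) for x in states]
    z = sum(weights)
    return {x: w / z for x, w in zip(states, weights)}

def single_marginal(p, i, a):
    """Left-hand side of the first line of Eq. (3): sum_x delta(x_i, a) P(x)."""
    return sum(prob for x, prob in p.items() if x[i] == a)

# Hypothetical parameters: "A" is the consensus, "V" a costly mutant; the
# negative coupling between V at residues 0 and 1 is compensatory.
h = [{"A": 0.0, "V": 2.0}, {"A": 0.0, "V": 2.0}, {"A": 0.0, "V": 1.0}]
J = {(0, 1): {("V", "V"): -1.5}}
p = boltzmann(h, J, "AV")
print(energy(("V", "V", "A"), h, J))  # 2.0 + 2.0 - 1.5 = 2.5
print(single_marginal(p, 0, "V"))
```

Inference amounts to adjusting h and J until such model marginals agree with the observed frequencies; the sketch only evaluates the forward direction.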

The problem of inferring the model parameters can be cast as the following convex optimization problem24

$$\left({{{{{{{{\bf{h}}}}}}}}}^{*},\, {{{{{{{{\bf{J}}}}}}}}}^{*}\right)=\begin{array}{c}\,{{{{{{\mathrm{arg}}}}}}\; {{{{{\mathrm{min}}}}}}}\,\\ {{{{{{{\bf{h}}}}}}}},{{{{{{{\rm{J}}}}}}}}\end{array}{{{{{{{\rm{KL}}}}}}}}\left({P}_{0}| | {P}_{{{{{{{{\bf{h,J}}}}}}}}}\right)=\begin{array}{c}\,{{{{{{\mathrm{arg}}}}}}\; {{{{{\mathrm{min}}}}}}}\,\\ {{{{{{{\bf{h}}}}}}}},{{{{{{{\rm{J}}}}}}}}\end{array}\mathop{\sum}\limits_{{{{{{{{\bf{x}}}}}}}}}{P}_{0}({{{{{{{\bf{x}}}}}}}})\ln \frac{{P}_{0}({{{{{{{\bf{x}}}}}}}})}{{P}_{{{{{{{{\bf{h,J}}}}}}}}}{{{{{{{\bf{(x)}}}}}}}}}\,$$
(4)

where \({{{{{{{\rm{KL}}}}}}}}\left(\cdot | | \cdot \right)\) denotes the Kullback-Leibler divergence between probability distributions, and

$${P}_{0}({{{{{{{\bf{x}}}}}}}})=\frac{1}{W}\mathop{\sum }\limits_{k=1}^{M}{w}_{k}\delta \left({{{{{{{{\bf{x}}}}}}}}}^{k},\, {{{{{{{\bf{x}}}}}}}}\right)$$

is the patient-weighted probability of observing strain x in the MSA.
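
The patient weighting that enters Eq. (2) and P0 can be sketched as follows. This is an illustration under stated assumptions rather than the MPF-BML implementation: the function names, the toy alignment, and the patient labels are all hypothetical.

```python
from collections import Counter, defaultdict

def patient_weights(patient_ids):
    """w_k = 1 / (number of MSA sequences from the patient of sequence k)."""
    counts = Counter(patient_ids)
    return [1.0 / counts[p] for p in patient_ids]

def single_site_freqs(msa, patient_ids):
    """f_i(a) = (1/W) * sum_k w_k * delta(x_i^k, a), with W the number of patients."""
    weights = patient_weights(patient_ids)
    n_patients = len(set(patient_ids))
    freqs = [defaultdict(float) for _ in range(len(msa[0]))]
    for seq, wk in zip(msa, weights):
        for i, a in enumerate(seq):
            freqs[i][a] += wk / n_patients
    return freqs

def pair_freq(msa, patient_ids, i, a, j, b):
    """f_ij(a, b) = (1/W) * sum_k w_k * delta(x_i^k, a) * delta(x_j^k, b)."""
    weights = patient_weights(patient_ids)
    n_patients = len(set(patient_ids))
    hits = sum(wk for seq, wk in zip(msa, weights) if seq[i] == a and seq[j] == b)
    return hits / n_patients

# Toy alignment: two sequences from patient p1, one each from p2 and p3.
msa = ["AV", "AV", "AA", "VV"]
patients = ["p1", "p1", "p2", "p3"]
f = single_site_freqs(msa, patients)
print(f[0]["A"], pair_freq(msa, patients, 0, "A", 1, "V"))
```

With this weighting each patient contributes a total weight of one, so a heavily sampled patient does not dominate the statistics.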

To obtain the fields h and couplings J such that the inferred model reproduces the single and double mutant probabilities of the MSA, we used the GUI realization of MPF-BML86, an efficient inference framework introduced in ref. 33. This framework has been used previously to infer the fitness landscape of the HIV envelope protein33 and the HCV E2 protein12,13. The MPF-BML inference framework comprises the following three steps:

  1. The first step in the inference framework is to prevent overfitting of our model and reduce the computational time. To achieve this, we employ a process that retains only the top \(k_i\) most frequent mutants out of the total \(q_i\) mutants. The remaining \(q_i - k_i\) mutants are grouped together such that the entropy associated with the grouping accounts for a certain fraction ϕ of the entropy without grouping. For a specific residue i, we need to find the smallest integer value of \(k_i\) that satisfies the following condition:

    $${S}_{i}\left({k}_{i}\right)\, \ge \, \phi {S}_{i}\left({q}_{i}\right),$$

    where

    $${S}_{i}\left({k}_{i}\right)=-\mathop{\sum }\limits_{a=1}^{{k}_{i}}\, {f}_{i}(a)\ln {f}_{i}(a)-{\overline{f}}_{i}\ln {\overline{f}}_{i},$$

    and

    $${\overline{f}}_{i}=\mathop{\sum }\limits_{a={k}_{i}+1}^{{q}_{i}}{f}_{i}(a).$$

    ϕ is chosen such that the mean of

    $${\beta }_{i}(\phi )=\frac{\mathop{\sum }\nolimits_{a=1}^{{q}_{i}}{\left({f}_{i}(a)-{\overline{f}}_{i}(a)\right)}^{2}}{\mathop{\sum }\nolimits_{a=1}^{{q}_{i}}\frac{{f}_{i}(a)\left(1-{f}_{i}(a)\right)}{W}}$$

    is approximately one, and

    $${\overline{f}}_{i}(a)=\left\{\begin{array}{ll}{f}_{i}(a)\quad &\,{{\mbox{if}}}\,a \, < \, {k}_{i}+1\\ {\overline{f}}_{i}\hfill &\,{{\mbox{if}}}\,a \,=\, {k}_{i}+1 \\ 0\hfill &\,{{\mbox{if}}}\,a \, > \, {k}_{i}+1\end{array}\right..$$

    The idea is to balance the bias (numerator) against the variance of the estimated amino acid frequencies (denominator) so that the two are approximately equal. Specifically, each amino acid at the ith residue is encoded using \(q_i\) binary digits, where \(q_i = k_i + 1\), and \(k_i\) represents the number of mutants after combining. The jth most frequent amino acid is then represented by a \(q_i\)-bit binary code with the value of \(2^{j-1}\). Consequently, the consensus sequence is represented by an all-zero vector. We define a binary matrix based on the amino acid matrix as Y, with the ith row denoted by \(\mathbf{y}_i\).

  2. Because the normalization factor Z in Eq. (1) is intractable, the second step, called the minimum probability flow (MPF) method, alleviates this computational burden by replacing \(P_{\mathbf{h,J}}(\mathbf{x})\) with an alternate probability mass function (PMF), obtained by considering a continuous-time Markov chain whose states correspond to the \(B=\prod_{i=1}^{N}\left(q_{i}+1\right)\) possible sequences. The master equation describing this Markov chain is given by

    $$\frac{d}{dt}P_{t}(\mathbf{y}_{i} \mid \mathbf{h},\, \mathbf{J})=\sum_{j=1,\ j\ne i}^{M}\Gamma_{ij}P_{t}(\mathbf{y}_{j} \mid \mathbf{h},\, \mathbf{J})-\sum_{j=1,\ j\ne i}^{M}\Gamma_{ji}P_{t}(\mathbf{y}_{i} \mid \mathbf{h},\, \mathbf{J}),$$
    (5)

    where \(P_{t}(\mathbf{y}_{i} \mid \mathbf{h},\, \mathbf{J})\) denotes the probability of observing \(\mathbf{y}_{i}\) at time t, and when t = 0, we have \(P_{t}(\mathbf{h},\, \mathbf{J}) = P_{0}\). We can derive the solution to Eq. (5) as

    $$P_{t}(\mathbf{y}_{i} \mid \mathbf{h},\, \mathbf{J})={\left[\exp (t\boldsymbol{\Gamma})P_{0}\right]}_{i},$$

    where \([\mathbf{a}]_{i}\) denotes the ith element of the vector a. The matrix Γ is the B × B transition rate matrix with (i, j)th element \(\Gamma_{ij}\) given such that

    $$\lim_{t\to \infty}P_{t}(\mathbf{y}_{i} \mid \mathbf{h},\, \mathbf{J})=P(\mathbf{y}_{i} \mid \mathbf{h},\, \mathbf{J}).$$

    The details of matrix Γ can be found in ref. 87. The idea is that, regardless of the initial values of h and J, the PMF can evolve towards \({P}_{t}\left({{{{{{{\bf{h,\; J}}}}}}}}\right)\) as time increases. Then we choose a t to make this problem tractable. After replacement, Eq. (4) expands as a Taylor series around t = 0, and can be written as

    $${{{{{{{\rm{KL}}}}}}}}({P}_{0}\!\parallel \! {P}_{t}({{{{{{{\bf{h}}}}}}}},\, {{{{{{{\bf{J}}}}}}}}))=tK({{{{{{{\bf{J}}}}}}}},\, {{{{{{{\bf{h}}}}}}}})+o(t),$$

    where

    $$K({{{{{{{\bf{J}}}}}}}},\, {{{{{{{\bf{h}}}}}}}})=\mathop{\sum }\limits_{b=1}^{M}\mathop{\sum }\limits_{i=1}^{N}\mathop{\sum }\limits_{a=1}^{{q}_{i}}\exp \left(\frac{1}{2}\left(\left(2{y}_{b,(i-1)N+a}-1\right)\mathop{\sum }\limits_{j=1}^{N}\mathop{\sum }\limits_{c=1}^{{q}_{j}}{y}_{b,(j-1)N+c}\, {J}_{ij}(a,\, c)-{h}_{i}(a)\right)\right)$$

    and \(y_{b,n}\) stands for the (b, n) entry of matrix Y. The parameters can then be estimated by minimizing K(J, h) plus L1 and L2 regularization terms, which can be written as

    $$\left({{{{{{{{\bf{J}}}}}}}}}^{{{{{{{{\rm{MPF}}}}}}}}},\, {{{{{{{{\bf{h}}}}}}}}}^{{{{{{{{\rm{MPF}}}}}}}}}\right)= \begin{array}{c}\,{{\mbox{arg min}}}\,\\ {{{{{{{\bf{h}}}}}}}},{{{{{{{\rm{J}}}}}}}}\end{array}\left(K(\, {{{{{{{\bf{J}}}}}}}},{{{{{{{\bf{h}}}}}}}})+{\lambda }_{1}\mathop{\sum }\limits_{i=1}^{N}\mathop{\sum }\limits_{a=1}^{{q}_{i}}\mathop{\sum }\limits_{j=i+1}^{N}\mathop{\sum }\limits_{b=1}^{{q}_{j}}\left|\, {J}_{ij}(a,b)\right|\right. \\ \left.+{\lambda }_{2}\mathop{\sum }\limits_{i=1}^{N}\mathop{\sum }\limits_{a=1}^{{q}_{i}}\mathop{\sum }\limits_{j=i+1}^{N}\mathop{\sum }\limits_{b=1}^{{q}_{j}}{J}_{ij}{(a,\, b)}^{2}\right),$$
    (6)

    where λ1 and λ2 are the coefficients of the L1 and L2 regularization factors respectively and are chosen manually based on the third step.

  3. The third step uses a set of couplings and fields that satisfy Eq. (6) to initialize a gradient descent algorithm in which the gradient is approximated using Markov chain Monte Carlo (MCMC) simulations. The gradient descent employs a modified RPROP algorithm88 for each parameter set. This step is referred to as the Boltzmann machine-learning (BML) method; it refines the parameters obtained in the previous MPF step to achieve a more accurate model fit. During the BML process, the couplings that were forced to zero by L1 regularization in Eq. (6) remain fixed at zero in each iteration. Finally, we adopt the parameter set described in ref. 89, such that

    $${\epsilon }_{1}=\frac{1}{W}\mathop{\sum }\limits_{i=1}^{W}\mathop{\sum }\limits_{a=1}^{{q}_{i}}\frac{{\left({f}_{i}^{{{\mbox{model}}}}\left(a;{\lambda }_{1},\, {\lambda }_{2}\right)-{f}_{i}\left(a,\, {\phi }^{*}\right)\right)}^{2}}{\frac{1}{W}{f}_{i}\left(a,\, {\phi }^{*}\right)\left(1-{f}_{i}\left(a,\, {\phi }^{*}\right)\right)}\, \approx \, 1$$
    (7)
    $${\epsilon }_{2}=\frac{1}{\mathop{\sum }\limits_{k=1}^{W}{q}_{k}\mathop{\sum }\limits_{l=k+1}^{W}{q}_{l}}\mathop{\sum }\limits_{i=1}^{W}\mathop{\sum }\limits_{a=1}^{{q}_{i}}\mathop{\sum }\limits_{j=i+1}^{W}\mathop{\sum }\limits_{b=1}^{{q}_{j}}\frac{{\left({f}_{ij}^{{{\mbox{model}}}}\left(a,\, b;{\lambda }_{1},\, {\lambda }_{2}\right)-{f}_{ij}\left(a,\, b,\, {\phi }^{*}\right)\right)}^{2}}{\frac{1}{W}{f}_{ij}\left(a,\, b,\, {\phi }^{*}\right)\left(1-{f}_{ij}\left(a,\, b,\, {\phi }^{*}\right)\right)}\, \approx \, 1,$$
    (8)

    where \({f}_{i}^{{{{{{{\mathrm{model}}}}}}}\,}\left(a;{\lambda }_{1},\, {\lambda }_{2}\right)\) and \({f}_{ij}^{{{{{{{{\rm{model}}}}}}}}}\left(a,\, b;{\lambda }_{1},\, {\lambda }_{2}\right)\) are the single and double mutant probabilities obtained from the model, while \({f}_{i}\left(a,\, {\phi }^{*}\right)\,{{\mbox{and}}}\,{f}_{ij}\left(a,\, b,\, {\phi }^{*}\right)\) are the single and double mutant probabilities of the MSA after grouping with combining factor ϕ*. λ1 and λ2 are chosen to balance between overfitting and underfitting in the single and double mutant probabilities.
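
The entropy-based grouping in the first step above can be sketched as follows. This is a minimal illustration, not the MPF-BML code: the function names are hypothetical, the input is assumed to be the list of frequencies of the \(q_i\) variants observed at one residue, the example frequencies are made up, and the choice of ϕ through \(\beta_i(\phi)\) is not implemented.

```python
import math

def _xlogx(p):
    """p * ln(p), with the convention 0 * ln(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0

def grouped_entropy(freqs, k):
    """S_i(k): entropy after keeping the k most frequent entries and pooling the rest."""
    ranked = sorted(freqs, reverse=True)
    pooled = sum(ranked[k:])
    return -sum(_xlogx(f) for f in ranked[:k]) - _xlogx(pooled)

def smallest_k(freqs, phi):
    """Smallest k with S_i(k) >= phi * S_i(q_i), where q_i = len(freqs)."""
    q = len(freqs)
    target = phi * grouped_entropy(freqs, q)
    for k in range(q + 1):
        if grouped_entropy(freqs, k) >= target:
            return k
    return q

# Made-up frequencies at one residue (illustration only).
freqs = [0.90, 0.06, 0.03, 0.01]
print(smallest_k(freqs, 0.9))
```

Because pooling entries can only lower the entropy, \(S_i(k)\) is non-decreasing in k, so the first k that meets the threshold is the smallest one.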

The MPF-BML software requires an input comprising the MSA and the patient weight of each sequence in the MSA. For model inference, all parameters were set to their default values in the MPF-BML software, except for the L1 and L2 regularization parameters, which were set to \(\lambda_1 = 10^{-4}\) and \(\lambda_2 = 150\), respectively. The inferred model accurately reproduced the statistics of the NS3 MSA (Supplementary Figure 8). These included statistics used to train the model (i.e., single mutant probabilities and double mutant probabilities) as well as statistics predicted by the model (e.g., connected correlations and the distribution of the number of mutants per sequence).

Fitness verification

Ex-vivo experimental infectivity measurements were compiled from the literature9,10,21,28,29 to check if our inferred NS3 prevalence landscape model is capable of capturing the underlying protein fitness landscape. We used our model to compute energies of the NS3 sequences (Eq. (1)) and compared them with their corresponding reported infectivities. Since the energy of a sequence is inversely related to its prevalence, a strong negative correlation between model-based energy and infectivity would indicate that the inferred prevalence landscape model is a good proxy of the intrinsic NS3 fitness landscape. The details of the specific fitness measurements (listed in Supplementary Data 1) from each study are presented in Supplementary Table 7. As experiments were conducted under different lab settings, we considered the weighted average of Spearman correlation coefficients from different experiments. This can be written as

$$\overline{r}=\frac{{\sum }_{i=1}^{{q}_{\exp }}{Q}_{i}{r}_{i}}{{\sum }_{i=1}^{{q}_{\exp }}{Q}_{i}},$$

where ri is the Spearman correlation coefficient between model predictions (energies) and infectivity measurements reported in experiment i, Qi is the number of measurements for experiment i, and \({q}_{\exp }\) is the total number of experiments.

Conservation-only model

To compare our model with a model that ignores all interactions between residues, we defined a conservation-based maximum entropy model that is parametrized only by the “fields” h as follows

$${h}_{i}(a)=\ln \frac{1-{f}_{i}(a)}{{f}_{i}(a)},\quad i=1,\, 2,\ldots,\, N.$$
(9)

Here fi(a) is the frequency of observing amino acid a at residue i.
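
Eq. (9) can be evaluated directly from the single-residue frequencies, as in the sketch below. The function names and the toy frequencies are hypothetical, and clipping the frequency away from 0 and 1 is an assumption added here only to keep the logarithm finite; it is not described in the text above.

```python
import math

def conservation_field(f, eps=1e-9):
    """h_i(a) = ln((1 - f_i(a)) / f_i(a)); f is clipped to [eps, 1 - eps] (assumption)."""
    f = min(max(f, eps), 1.0 - eps)
    return math.log((1.0 - f) / f)

def conservation_energy(seq, site_freqs):
    """Energy of a sequence under the fields-only model: sum_i h_i(x_i)."""
    return sum(conservation_field(site_freqs[i].get(a, 0.0)) for i, a in enumerate(seq))

# Made-up frequencies for a two-residue toy alignment.
site_freqs = [{"A": 0.9, "V": 0.1}, {"A": 0.5, "V": 0.5}]
print(conservation_energy("AA", site_freqs), conservation_energy("VA", site_freqs))
```

Rare amino acids receive large positive fields and therefore high energies, so this model ranks sequences purely by how conserved each residue is, with no interaction terms.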

Acquisition of drug-resistant mutations for NS3-specific drugs

A total of 21 residues with DRMs from nine NS3-specific drugs used for treating HCV genotype 1a infections (listed in Table 1) were obtained from the GLUE database (http://hcv.glue.cvr.ac.uk)5,6, as well as from other relevant studies4,7. An NS3 DRM is defined by an amino acid substitution at a specific residue of NS3 that is able to adversely impact the activity of a DAA in-vitro and/or in-vivo in treated patients.

Identification of sectors using robust co-evolutionary analysis

We employed the robust co-evolutionary analysis (RoCA) method to identify ‘sectors’, or co-evolving groups of residues, in the NS3 protein (ref. 37). RoCA achieves this by performing an eigenvector-based spectral analysis on the MSA correlation matrix, followed by a data-driven random-matrix-based clustering procedure. We used the GUI-based implementation of the RoCA method, RocaSec90, to predict NS3 sectors. Note that we chose not to use the well-known Statistical Coupling Analysis (SCA) method91 due to its limited ability to resolve co-evolutionary structures in viral proteins, as has been demonstrated in our previous study37.

Visualization of interactions between top-coupled pairs of mutations

For visualizing the interactions between top-coupled pairs of mutations, we used a Circos plot. NS3 residues were distributed evenly along the circumference of the circle in Fig. 2. Residue numbering started from 1 at the 3 o’clock position and increased in the counter-clockwise direction. Only residues involving DRMs were labeled. Each link within the circle represents a pair of top-coupled mutations (ranked by the values of -J from Eq. (1)). Links involving at least one DRM are shown in orange, while those between non-DRMs are shaded in gray.

Visualization of protein crystal structures

All NS3 protein crystal structures (PDB ID: [4B6E], [3M5L], [3SU3], [3SV6], [3SUD]) were obtained from the Protein Databank (https://www.rcsb.org). The PyMOL software (https://www.pymol.org) was used for computing the distance between atoms in each protein structure and for drawing the structural figures.

Prediction of compensatory mutations associated with SC-DRMs in different sequence backgrounds

To predict compensatory mutations connected to SC-DRMs in various sequence backgrounds (as opposed to only H77; the sequence background considered in Fig. 3), we introduced each SC-DRM into all MSA sequences lacking that mutation and computed the inferred energy change upon introducing all associated strongly coupled mutations in each selected sequence. We repeated this process for all SC-DRMs. Mutations that compensated for an SC-DRM in at least 10% of the selected sequences are listed in Supplementary Table 3.

Statistical significance testing

We calculated the statistical significance of the number of SC-DRM residues (identified by our model) associated with a specific drug using a p-value. For each drug and a given number of top-coupled pairs of mutations inferred by our model (e.g., 10, 100, or 300), the p-value represents the probability that, given j total DRMs associated with a specific drug, we would identify at least i of them as SC-DRMs purely by chance. Let n represent the number of residues involved in the top-coupled pairs, which is a subset of the N total residues in the NS3 protein. In our case, N = 515, with 116 fully conserved residues removed. Note that in this calculation, residues involved in multiple pairs of mutations were counted only once. This p-value is computed as:

$$p=\mathop{\sum }\limits_{q=i}^{\min (j,n)}\frac{\left(\begin{array}{c}j\\ q\end{array}\right)\left(\begin{array}{c}N-j\\ n-q\end{array}\right)}{\left(\begin{array}{c}N\\ n\end{array}\right)}.$$
(10)

The above equation sums the probabilities of observing i or more SC-DRMs associated with a drug using our model. If p < 0.05, we reject the null hypothesis that the SC-DRMs associated with a drug were observed by random chance.
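
Eq. (10) is the upper tail of a hypergeometric distribution and can be evaluated exactly with integer binomial coefficients, as in the sketch below. The function name is hypothetical, and the values of i, j, and n in the second call are made up for illustration; only N = 515 is taken from the text.

```python
from math import comb

def sc_drm_pvalue(i, j, n, N):
    """Probability that at least i of j DRM residues fall among n residues drawn from N."""
    numerator = sum(comb(j, q) * comb(N - j, n - q) for q in range(i, min(j, n) + 1))
    return numerator / comb(N, n)

# Small case that can be checked by hand: 1 - C(3,2)/C(5,2) = 0.7.
print(sc_drm_pvalue(1, 2, 2, 5))  # -> 0.7

# Hypothetical drug with j = 8 DRM residues, i = 4 of them among n = 30 top-coupled residues.
print(sc_drm_pvalue(4, 8, 30, 515))
```

Keeping the sum in exact integers and dividing once at the end avoids overflow or rounding problems with the very large binomial coefficients that arise for N = 515.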

Evolutionary simulation

We considered a Wright-Fisher-like viral evolutionary model47 to quantify the relative ease of escape from selective pressure targeting each residue involved in the DRMs known for HCV NS3-targeting DAAs4,5,6,7 (listed in Table 1). Similar evolutionary models have been shown to be representative of the relative ease of escape from immune selective pressure for HCV E2 genotypes 1a and 1b12,13, and are informative of protein structures92,93,94. As in refs. 12,13, we adopted the “escape time” metric to represent the number of generations required for mutations at a residue under selective pressure to reach a frequency of >0.5 in a fixed-size virus population.

The model set-up can be summarized as follows. The fixed virus population size was set to \(M_e = 2000\), in accordance with the estimated HCV effective population size in in-host evolution95. For a given NS3-DRM-associated residue i, we started the simulation with a homogeneous population comprising copies of a randomly selected sequence from the MSA having the consensus amino acid (i.e., the most frequent amino acid) at residue i. In each generation of the virus population, sequences undergo the following three steps.

  1. Mutation. Each nucleotide in the sequences is randomly mutated to another nucleotide with a fixed probability \(\mu = 10^{-4}\), in accordance with the HCV mutation rate reported in refs. 96,97.

  2. Selection. Each sequence in the viral population survives with a probability calculated based on its fitness predicted from the inferred landscape (see ref. 12 for details). In addition, fitness of all sequences having the consensus amino acid at residue i is decreased by a fixed value b. This models the selective pressure exerted by a drug at residue i and provides a selective advantage to the sequences having a mutation at this residue. b was set as the largest value of the field parameter h in the inferred fitness landscape.

  3. Random sampling. A standard multinomial sampling process, parameterized by the survival probabilities calculated in the previous step and \(M_e\), is performed to generate the next generation of the virus population.

The above three steps (mutation, selection, and random sampling) are repeated until the frequency of sequences having a mutation at residue i exceeds 0.5 in the population, and the corresponding number of generations is recorded. This number is considered the time (in generations) it took for the virus to escape from selective pressure at residue i. We ran this procedure 100 times using the same initial sequence, and repeated it for 25 distinct initial sequences, yielding a total of 2500 values. The mean number of generations recorded over all these simulation runs represented the escape time associated with residue i (listed in Supplementary Data 3).
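
The mutation, selection, and sampling loop can be illustrated with a heavily simplified sketch. It is not the simulation used in this study and departs from it in several labeled ways: residues are binary (0 for consensus, 1 for mutant) and mutate at the amino acid level rather than the nucleotide level, the survival weight is assumed to be proportional to exp(−E), the starting population is all-consensus, and the landscape parameters, population size, and mutation rate are small made-up values chosen so that the example runs quickly.

```python
import math
import random

def energy(x, h, J):
    """Toy energy: fields for mutated residues plus couplings for co-mutated pairs."""
    e = sum(h[i] for i, s in enumerate(x) if s)
    e += sum(c for (i, j), c in J.items() if x[i] and x[j])
    return e

def escape_time(h, J, target, b, pop_size=200, mu=1e-3, max_gen=10000, rng=None):
    """Generations until mutants at `target` exceed frequency 0.5 (None if never)."""
    rng = rng or random.Random()
    pop = [(0,) * len(h)] * pop_size  # homogeneous all-consensus start (assumption)
    for gen in range(1, max_gen + 1):
        # 1. Mutation: flip each binary residue with probability mu.
        pop = [tuple(s ^ (rng.random() < mu) for s in x) for x in pop]
        # 2. Selection: weight exp(-E), with penalty b for consensus at the target.
        weights = [math.exp(-energy(x, h, J) - (b if x[target] == 0 else 0.0))
                   for x in pop]
        # 3. Random sampling: multinomial resampling of a fixed-size population.
        pop = rng.choices(pop, weights=weights, k=pop_size)
        if sum(x[target] for x in pop) > 0.5 * pop_size:
            return gen
    return None

# Made-up landscape: a costly mutation at residue 0, compensated by residue 1.
h = [3.0, 1.0, 1.0]
J = {(0, 1): -2.0}
runs = [escape_time(h, J, target=0, b=5.0, rng=random.Random(seed)) for seed in range(20)]
print(sum(runs) / len(runs))
```

Averaging over independent runs mirrors the way the escape time is defined above as a mean over repeated simulations; comparing the average with and without the coupling in `J` is one way to see how a compensatory interaction can shorten escape.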

Acquisition of efficacy data for NS3-specific drugs

A total of 22 studies reporting efficacy data of nine NS3-specific drugs for treating patients infected with HCV genotype 1 were included (listed in Supplementary Table 6; refs. 48–69). In each study, efficacy of a drug was reported as the proportion of patients with SVR for 12 or 24 weeks after the end of the treatment. We used the weighted average of SVR rates associated with a drug to represent its aggregated efficacy.
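
The aggregation step can be sketched as below. The weights are assumed here to be the number of patients in each study, which is not stated explicitly in the text; the function name and the SVR values are made up for illustration.

```python
def aggregated_efficacy(studies):
    """Weighted mean SVR rate; studies is a list of (svr_rate, n_patients) pairs."""
    total = sum(n for _, n in studies)
    return sum(rate * n for rate, n in studies) / total

# Two hypothetical studies of the same drug.
studies = [(0.80, 100), (0.90, 300)]
print(aggregated_efficacy(studies))
```

Under this assumption, larger studies contribute proportionally more to the aggregated efficacy of a drug.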

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.