Introduction

Breast cancer is the most common cancer affecting women worldwide1. Several risk factors have been identified, including family history of breast cancer, dense breast tissue, female reproductive factors, alcohol or tobacco use, body mass index, and genetics (e.g. BRCA gene)2,3; however, nearly half of breast cancers develop in women in the absence of these risk factors3, suggesting that additional factors likely contribute to breast cancer risk.

The role of viruses in breast cancers is increasingly recognized and is likely underestimated4. As documented elsewhere, human herpes viruses (HHV) including Epstein Barr virus (EBV, HHV4) and cytomegalovirus (CMV, HHV5), mouse mammary tumor virus (MMTV), high-risk human papillomaviruses (HPVs), bovine leukemia virus (BLV), human polyomavirus JC virus (JCV), and human endogenous retrovirus K (HERVK) have been implicated in human breast cancer4,5,6,7,8,9. Several viruses (e.g., MMTV, HPV, EBV, and BLV) have been identified and shown to co-exist in human breast cancer cells10,11,12, and in benign breast biopsies taken 1–11 years before cancer developed11. They have also been identified in normal breast tissue samples and in milk of normal lactating women, albeit to a lesser extent10,11. Indeed, viruses linked to human cancers are ubiquitous, yet only a small proportion of infected individuals develop cancer, one of many reasons why it has been challenging to identify causal relations between viruses and cancer13. One factor that may moderate the association between viruses and breast cancer is variation in host immunogenetics related to human leukocyte antigen (HLA).

HLA genes, located on chromosome 6, code for two main classes of cell-surface proteins involved in the immune response to foreign antigens including viruses and cancer neoantigens14,15. HLA-I molecules of the classical genes A, B and C are expressed on all nucleated cells; they bind and present small peptides (8–10 amino acid residues16) from proteolytically degraded foreign antigens to CD8+ cytotoxic T cells, signaling cell destruction. HLA-II molecules of the DPB1, DQB1 and DRB1 genes are expressed on lymphocytes and professional antigen presenting cells; they present larger peptides (12–22 amino acid residues17) derived from endocytosed exogenous antigens to CD4+ T cells, facilitating antibody production and adaptive immunity. Each individual carries two alleles of each HLA gene, for a total of 12 classical HLA alleles. The HLA region is the most highly polymorphic region of the human genome18, with most of the variation located in the binding groove. This variation amounts to tremendous individual variability in the ability to bind and eliminate viruses and other foreign antigens. Specific HLA alleles have been associated with breast cancer protection or susceptibility19,20,21,22,23,24,25,26,27,28. This association is captured in the breast cancer-HLA immunogenetic profile, which contains the correlations between the prevalence of breast cancer and HLA allele frequency19.

Given the documented involvement of several viruses in breast cancer, discussed above, we investigated in this study the possible elimination of viruses by the HLA system as a mechanism for preventing their oncogenic effect. More specifically, we focused on 7 viruses that have been found in breast cancer tissue (HHV4, HHV5, HPV, JCV, MMTV, BLV, HERVK) and estimated in silico their binding affinity with respect to 69 common HLA-I alleles of the 3 classical genes (A, B, C) and 58 common HLA-II alleles of the 3 classical genes (DPB1, DQB1, DRB1). Since binding is a critical initial step in foreign antigen elimination, it is reasonable to assume that high binding affinity would make virus elimination more effective, and low binding affinity less effective. Thus the objectives of this study were (a) to estimate in silico the predicted binding affinity of specific viruses with respect to specific alleles using the Immune Epitope Database (IEDB) NetMHCpan (ver. 4.1) tool29,30, (b) to identify those viruses whose binding affinities were associated with the breast cancer-HLA immunogenetic profile, and (c) to test the hypothesis that the predicted binding affinity of this set of viruses is lower than that of the viruses unassociated with the breast cancer-HLA profile.

Results

General

Predicted Binding Affinity (PBA) scores varied substantially among virus proteins (Table 1), HLA-I alleles (Table 2), and HLA-II alleles (Table 3). All HLA-I PBA values (N = 69 alleles × 7 viruses = 483) were positive (indicating high affinity: lowest percentile rank (LPR) < 1, PBA > 0), whereas for HLA-II, 12.8% of values (52 out of N = 58 alleles × 7 viruses = 406) were negative (indicating low affinity: LPR > 1, PBA < 0).

Table 1 Viral proteins used.
Table 2 Predicted binding affinities (PBA, see “Methods”) and Breast Cancer-HLA P/S scores19 for all 69 HLA-I alleles and 7 viruses studied.
Table 3 Predicted binding affinities (PBA, see “Methods”) and Breast Cancer-HLA P/S scores19 for all 58 HLA-II alleles and 7 viruses studied.

The overall design of our analyses is depicted in the schematic diagram of Fig. 1. Details of the analyses are provided in each of the sections to follow.

Figure 1

Schematic diagram of the analyses performed. 1, N = 14 prevalences of breast cancer in 14 countries; 2, N = 127 HLA-I and -II alleles; 3, N = 7 virus proteins; 4, N = 127 HLA allele sequences provided by, and used in, the Immune Epitope Database (IEDB) tool29,30; 5, N = 127 (BC-HLA P/S score, last column in Tables 2 and 3); 6, N = 127 HLA alleles × 7 viruses = 889 HLA-I (Table 2) and HLA-II (Table 3) estimated binding affinities.

Effect of virus and HLA class on PBA

The effects on PBA of Virus, HLA Class, and their interaction were evaluated using a repeated-measures analysis of variance (ANOVA), where the 7 viruses comprised the “Within-Subjects” Virus factor and the 2 HLA classes comprised the “Between-Subjects” fixed Class factor. We found the following: (a) the effect of Virus was highly significant (P < 0.001, Greenhouse–Geisser test), with JCV and BLV having lower average PBA scores (Fig. 2); (b) the effect of HLA Class was also highly significant (P < 0.001, F-test), with HLA-I scores 2.5× higher than HLA-II scores (Fig. 3A); and (c) the Virus × Class interaction term was also highly significant (P < 0.001, Greenhouse–Geisser test) (Fig. 3B). This interaction seems to be due mainly to the fact that the PBA scores for JCV and BLV are disproportionately lower in HLA-II than in HLA-I and are substantially lower than those of the other viruses.

Figure 2

Mean (± SEM) predicted binding affinities of the 7 viral proteins used across all 127 HLA alleles.

Figure 3

(A) Mean (± SEM) predicted binding affinities for HLA-I (N = 69) and HLA-II (N = 58). (B) Same for each virus studied.

Effect of HLA-I and HLA-II genes on PBA

Given the significant Virus × HLA Class interaction above, the effects of Virus and Gene on PBA were evaluated separately for each HLA class using 2 repeated-measures ANOVAs, one for each HLA Class, where Virus was the Within-Subjects factor as above, and Gene was the Between-Subjects factor comprising the 3 genes of HLA-I (A, B, C) or the 3 genes of HLA-II (DPB1, DQB1, DRB1). These analyses also evaluated the effect of Virus within each HLA Class. We found the following: (a) there was a significant effect of Virus (P < 0.001, Greenhouse–Geisser test) for both HLA-I (Fig. 4, left panel) and HLA-II (Fig. 4, right panel); (b) for HLA-I, there was a marginally significant effect of Gene (P = 0.024, F-test), with higher PBA values for gene C (Fig. 5A), whereas the Virus × Gene effect was not statistically significant (P = 0.166, Greenhouse–Geisser test) (Fig. 5B); (c) for HLA-II, there was a significant effect of Gene (P = 0.011, F-test), with higher PBA values for gene DQB1 (Fig. 6A), and the Virus × Gene effect was highly significant (P < 0.001, Greenhouse–Geisser test) (Fig. 6B).

Figure 4

Mean (± SEM) predicted binding affinities for HLA-I and HLA-II for each virus studied.

Figure 5

(A) Mean (± SEM) predicted binding affinities for HLA-I A, B and C genes (N = 69). (B) Same for each virus studied.

Figure 6

(A) Mean (± SEM) predicted binding affinities for HLA-II DPB1, DQB1 and DRB1 genes (N = 58). (B) Same for each virus studied.

Association between PBA and breast cancer: HLA immunogenetic profile

As depicted in the schematic diagram of Fig. 1, our analyses culminated in hierarchical tree clustering, which we applied to the HLA-I and HLA-II data shown in Tables 2 and 3, respectively. This analysis yielded 2 dendrograms, one for each class. In both cases, there were 2 clusters, as follows. For HLA-I, BC-HLA immunogenetic scores were grouped with BLV, JCV and MMTV (Fig. 7A). Remarkably, the average PBA scores of the viruses in this BC-associated group (red in Fig. 7A,B) were significantly lower than those in the other, non-BC group (blue in Fig. 7A,B) (P < 0.001, paired-samples t-test). For HLA-II, BC-HLA immunogenetic scores were grouped on a sub-branch with JCV and BLV; MMTV, HERVK, and HPV were grouped on the other sub-branch of the same cluster (Fig. 8A). As with HLA-I, the average PBA scores of the 5 viruses in the BC-associated group (red in Fig. 8A,B) were significantly lower than those in the other, non-BC group (P = 0.038, paired-samples t-test). Altogether, these results document the grouping of the PBA of certain viruses with the BC-HLA immunogenetic profile, and the lower predicted binding affinity of those viruses compared to the viruses not grouped with BC-HLA.
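
The group comparison above can be sketched as a paired t statistic computed over alleles. This is a minimal illustration only: pairing by allele is our reading of the test, the function name paired_t is hypothetical, the numbers are made up, and the reported P-values were obtained with SPSS rather than with this code.

```python
import math
from statistics import mean, stdev


def paired_t(group_a, group_b):
    """Return (t, degrees of freedom) for a paired-samples t-test.

    group_a[i] and group_b[i] must refer to the same unit (here assumed to be
    the same HLA allele), e.g. the mean PBA of the BC-associated viruses and
    the mean PBA of the remaining viruses for that allele.
    """
    if len(group_a) != len(group_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(group_a, group_b)]
    n = len(diffs)
    # Standard error of the mean difference (sample SD, n - 1 denominator).
    se = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / se, n - 1


# Made-up per-allele mean PBA values, NOT the study data.
bc_group = [1.0, 2.0, 3.0, 4.0]
non_bc_group = [2.0, 2.5, 4.5, 5.0]
t_stat, dof = paired_t(bc_group, non_bc_group)
print(t_stat, dof)
```

A negative t here means the first group has the lower mean; converting t to a P-value requires the t distribution with the returned degrees of freedom.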

Figure 7

Hierarchical tree clustering results for HLA-I and BC-HLA profile. (A) Dendrogram of the 7 viruses’ predicted binding affinities and BC-HLA profile. (B) Mean (± SEM) of predicted binding affinities of the viruses in the two color-coded groups of the dendrogram. N = 69 HLA-I alleles. See text for details.

Figure 8

Hierarchical tree clustering results for HLA-II and BC-HLA profile. (A) Dendrogram of the 7 viruses’ predicted binding affinities and BC-HLA profile. (B) Mean (± SEM) of predicted binding affinities of the viruses in the two color-coded groups of the dendrogram. N = 58 HLA-II alleles. See text for details.

Discussion

In light of separate lines of evidence linking both HLA and viruses to breast cancer, we first evaluated the predicted binding affinity of specific viruses implicated in breast cancer with regard to 127 common HLA alleles and then examined the associations between those viral protein binding affinities and a population-derived breast cancer-HLA profile19. With regard to the former, the overall results documented that HLA-I and HLA-II mediated immunity to viral proteins implicated in breast cancer varied by virus, by HLA gene, and across alleles within each gene. Specifically, our findings documented (a) higher predicted binding affinities of HLA-I alleles (than those of HLA-II), (b) higher binding affinities of gene C of HLA-I and gene DQB1 of HLA-II, and (c) overall lower binding affinities for JCV in both HLA-I and HLA-II.

With respect to viruses, it is worth noting that all 7 viruses investigated here have been implicated in breast cancer4,5,6,7,8,9. This study focused on immunogenetic aspects of these viruses, both with respect to their predicted binding affinities to HLA-I and II alleles and their grouping with the breast cancer-HLA immunogenetic profile. A major finding of the latter analysis was the grouping of specific viruses with breast cancer (shaded red in Figs. 7A and 8A), fewer for HLA-I (3/7 viruses, Fig. 7A) than for HLA-II (5/7 viruses, Fig. 8A). This grouping enabled us to test the hypothesis that this association of specific viruses with breast cancer immunogenetics may be due to lower virus binding affinity to HLA molecules, thus delaying the elimination of virus directly (via HLA-I/CD8+ engagement leading to death of the infected cell) and/or indirectly (via HLA-II/CD4+ engagement leading to antibody production). Indeed, the predicted binding affinities of the grouped viruses were lower for both HLA Classes (Figs. 7B, 8B). It is worth pointing out that the present findings do not preclude possible involvement of viruses in breast cancer via other mechanisms that remain to be identified and investigated.

In summary, the present study provides a novel contribution implicating HLA-mediated virus immunogenicity in breast cancer. Still, several limitations must be considered. First, the analyses included 127 common HLA-I and II alleles and 7 viruses. The HLA region, however, is highly polymorphic, and the binding affinities of the viruses with other, less common HLA alleles were not investigated. Second, although the present analyses focused on 7 viruses that have previously been implicated in breast cancer, other viruses not included here may be involved in breast cancer. Third, we evaluated the binding affinities of HLA molecules to representative proteins of the 7 viruses, all of which are involved in viral entry into the host cell; still, other proteins may have different binding affinities. For example, several hundred types of HPV exist, several of which are associated with high risk for cancer31; here, we evaluated HPV16, one of the most common types of HPV involved in cancer risk32, yet other types of HPV may have different binding affinities. Finally, the breast cancer-HLA profile was based on population prevalence of breast cancer in general19; specific types of breast cancer may have a different HLA profile. Despite these limitations, the current findings provide novel insights regarding the interaction of virus exposure and host immunogenetics with regard to breast cancer.

Materials and methods

HLA alleles

We obtained the population frequency in 2019 of 69 common HLA-I alleles and 58 common HLA-II alleles that occurred in at least 9 of 14 Continental Western European Countries (Austria, Belgium, Denmark, Finland, France, Germany, Greece, Italy, Netherlands, Portugal, Norway, Spain, Sweden, and Switzerland) at frequencies ≥ 0.01, as described previously33.

Breast cancer: HLA protection/susceptibility (P/S) scores

These scores are correlations (Fisher z-transformed) between the prevalence of breast cancer in the 14 countries above and the population frequency of each one of the 127 HLA alleles in the same countries. The scores have been published19 and are given in Tables 2 and 3.
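
As a minimal sketch of how such a score can be computed for one allele, the snippet below takes the Pearson correlation between prevalence and allele frequency across countries and applies the Fisher z-transformation (the inverse hyperbolic tangent of r). The function name ps_score and the five-country toy values are illustrative assumptions; the published scores19 were derived from 14 countries.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = math.fsum(x) / n, math.fsum(y) / n
    sxy = math.fsum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = math.fsum((a - mx) ** 2 for a in x)
    syy = math.fsum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ps_score(prevalence, allele_frequency):
    """Fisher z-transformed correlation, z = atanh(r).

    Undefined when |r| == 1 (math.atanh raises ValueError).
    """
    return math.atanh(pearson_r(prevalence, allele_frequency))


# Toy values for 5 hypothetical countries, NOT the study data.
prevalence = [1.0, 2.0, 3.0, 4.0, 5.0]
allele_frequency = [2.0, 1.0, 4.0, 3.0, 5.0]
print(ps_score(prevalence, allele_frequency))
```

The transformation stretches correlations near ±1 so that the resulting scores are closer to normally distributed, which is convenient for the downstream analyses.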

Virus proteins

We estimated in silico the predicted binding affinities (for each one of the 127 HLA alleles) of proteins of 7 viruses that have been implicated in breast cancer, namely HHV4, HHV5, HPV, JCV, HERVK, BLV, and MMTV. Details of the proteins analyzed are given in Table 1 and their amino acid (AA) sequences are given in the Appendix (Supplementary Materials).

In silico determination of predicted binding affinity of HLA-I and HLA-II alleles

Predicted binding affinities were obtained for viral protein epitopes using the Immune Epitope Database (IEDB) NetMHCpan (ver. 4.1) tool29,30. More specifically, we used the sliding window approach34,35,36 to test exhaustively all possible linear 9-mer (for HLA-I predictions) and 15-mer (for HLA-II predictions) AA residue epitopes of the 7 viral proteins analyzed (Table 1). The method is illustrated in Figs. 9 and 10 for the JCV protein. For each epitope-HLA molecule pair tested, this tool outputs the percentile rank of the binding affinity of the HLA molecule and the epitope among the predicted binding affinities of the same HLA molecule to a large number of different peptides of the same AA length; the smaller the percentile rank, the better the binding affinity. Given a protein N amino acids long and an epitope length of k AA, there are N − k + 1 overlapping epitopes and hence N − k + 1 percentile ranks. Of these predictions, for each viral protein and HLA molecule tested, we retained the lowest percentile rank (LPR) as the best possible binding affinity of the protein-HLA molecule pair. We then applied two transformations to LPR. First, we took its inverse, so that higher values mean better binding affinities, for more intuitive interpretation:

$$LPR^{\prime}=\frac{1}{LPR}$$
(1)

Figure 9

Illustration of the 9-mer sliding window approach for the in silico estimation of the predicted binding affinity for HLA-I alleles.

Figure 10

Illustration of the 15-mer sliding window approach for the in silico estimation of the predicted binding affinity for HLA-II alleles.
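
The sliding window step and the selection of the lowest percentile rank can be sketched as follows. This is an illustration under stated assumptions: percentile_rank is a hypothetical callable standing in for the NetMHCpan output for one HLA molecule, and the peptide and rank values are made up. The sketch does not call the IEDB tool.

```python
def sliding_windows(protein, k):
    """Return every contiguous k-mer of `protein`, in order.

    A protein of length N yields N - k + 1 windows (none if N < k).
    """
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


def lowest_percentile_rank(protein, k, percentile_rank):
    """Return the lowest (best) percentile rank over all k-mers.

    `percentile_rank` is a hypothetical function mapping a peptide to its
    predicted percentile rank for one fixed HLA molecule.
    """
    windows = sliding_windows(protein, k)
    if not windows:
        raise ValueError("protein is shorter than the epitope length k")
    return min(percentile_rank(peptide) for peptide in windows)


# Toy 12-residue sequence and made-up ranks, NOT real predictions.
toy_protein = "ACDEFGHIKLMN"
toy_ranks = {
    "ACDEFGHIK": 12.0,
    "CDEFGHIKL": 0.4,
    "DEFGHIKLM": 3.5,
    "EFGHIKLMN": 27.0,
}
print(sliding_windows(toy_protein, 9))
print(lowest_percentile_rank(toy_protein, 9, toy_ranks.__getitem__))
```

For HLA-II the same functions would be used with k = 15; only the scoring callable and the window length change.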

The \(LPR^{\prime}\) distribution was heavily right-skewed (Fig. 11A), resembling an exponential distribution and deviating substantially from a normal distribution (Fig. 11B). Therefore, \(LPR^{\prime}\) values were (natural) log transformed to normalize the distribution for quantitative analyses (Fig. 12A,B), yielding the Predicted Binding Affinity (PBA):

$$PBA=\ln(LPR^{\prime})$$
(2)

Figure 11

(A) Frequency distribution of raw (untransformed) predicted binding affinities (\(LPR^{\prime}\)) to illustrate their deviation from a symmetric distribution. (B) Probability-probability plot of the data in (A). Data from JCV virus.

Figure 12

(A) Frequency distribution of log-transformed predicted binding affinities (PBA). (B) Probability-probability plot of the data in (A). Data from JCV virus.

Given the logarithmic transformation above, PBA > 0 indicates \(LPR^{\prime}>1\) (i.e., LPR < 1), whereas PBA < 0 indicates \(LPR^{\prime}<1\) (i.e., LPR > 1).
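
The two transformations combine into PBA = ln(1/LPR) = −ln(LPR). The short sketch below makes the sign convention concrete; the function name predicted_binding_affinity is a hypothetical helper and the input values are arbitrary examples, not study data.

```python
import math


def predicted_binding_affinity(lpr):
    """Return PBA = ln(1 / LPR) for a lowest percentile rank LPR > 0."""
    if lpr <= 0:
        raise ValueError("LPR must be positive")
    lpr_inverse = 1.0 / lpr       # Eq. (1): higher means better binding
    return math.log(lpr_inverse)  # Eq. (2): natural-log transform


# Arbitrary example ranks: below 1, equal to 1, above 1.
for lpr in (0.5, 1.0, 2.0):
    print(lpr, predicted_binding_affinity(lpr))
```

An LPR below 1 gives a positive PBA, an LPR of exactly 1 gives zero, and an LPR above 1 gives a negative PBA.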

Statistical analyses

General

Standard statistical methods, including analysis of variance (ANOVA) and t-tests, were used to analyze the data with the IBM-SPSS statistical package (version 29). All P-values reported are 2-sided. In addition, hierarchical tree clustering was performed on standardized data (Z-scores), with Ward linkage as the method and squared Euclidean distance as the interval.
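
To make the clustering step concrete, the sketch below standardizes each variable to Z-scores and merges clusters with Ward's criterion via the Lance-Williams update on squared Euclidean distances. It is a small pure-Python illustration, not the SPSS implementation used in the study; the function names, variable labels, and values are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def zscore(values):
    """Standardize a sequence to mean 0 and SD 1 (assumes non-constant input)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def ward_clusters(profiles, n_clusters=2):
    """Agglomerate the variables in `profiles` until `n_clusters` remain.

    `profiles` maps a variable name (e.g. a virus, or the BC-HLA score) to
    its values across alleles. Returns a list of sets of variable names.
    """
    z = {name: zscore(vals) for name, vals in profiles.items()}
    size = {frozenset([name]): 1 for name in z}  # cluster -> member count
    # Initial dissimilarity: squared Euclidean distance between Z-scored variables.
    dist = {
        frozenset([frozenset([a]), frozenset([b])]):
            sum((x - y) ** 2 for x, y in zip(z[a], z[b]))
        for a, b in combinations(z, 2)
    }
    while len(size) > n_clusters:
        pair = min(dist, key=dist.get)  # closest pair under Ward's criterion
        i, j = tuple(pair)
        d_ij = dist.pop(pair)
        n_i, n_j = size.pop(i), size.pop(j)
        merged = i | j
        # Lance-Williams update for Ward linkage on squared distances.
        for k, n_k in size.items():
            d_ki = dist.pop(frozenset([k, i]))
            d_kj = dist.pop(frozenset([k, j]))
            dist[frozenset([k, merged])] = (
                (n_i + n_k) * d_ki + (n_j + n_k) * d_kj - n_k * d_ij
            ) / (n_i + n_j + n_k)
        size[merged] = n_i + n_j
    return [set(cluster) for cluster in size]


# Made-up values across 6 hypothetical alleles, NOT the study data.
toy = {
    "BC_PS":   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "virus_A": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],
    "virus_B": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "virus_C": [5.9, 5.2, 4.1, 2.8, 2.0, 0.9],
}
print(ward_clusters(toy, n_clusters=2))
```

Note that the variables (viruses and the BC-HLA score), not the alleles, are the objects being clustered here, which mirrors the dendrograms of Figs. 7A and 8A.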

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.