Introduction

Currently, statistics on morbidity and mortality associated with the circulating influenza viruses are alarming and indicate the viruses represent a threat to public health. Annually around the globe, influenza viruses cause about 3 to 5 million cases of morbidity, and 250 to 500 thousand deaths1, with almost incalculable socio-economic costs. Among the three types of the circulating influenza viruses, namely A, B, and C, the former two types are the most prevalent and cause epidemic outbreaks nearly every year, whereas the latter type is less predominant and only causes a mild infection2.

Type A influenza viruses are divided into subtypes according to the combination of their two surface glycoproteins: Hemagglutinin (HA) and Neuraminidase (NA) proteins. At present, 18 and 11 variants of HA and NA, respectively, are found in animal species3,4. A convenient nomenclature for influenza A subtypes is based on the combination of these two proteins found on a given virus. Examples include H1N1, H3N2, H5N1 and H9N2.

Influenza viruses evolve continuously through rapid amino acid mutations of HA protein, which enables them to circumvent neutralization by antibodies that are generated by the immune system of the host as results of infections or vaccinations4,5. This mutational process is known as antigenic drift2, and produces new variants that are immunologically different from the parent viruses. Thus, antibodies or vaccines developed during previous infections or vaccinations do not fully recognize the new variants. Another less frequent but more malignant mutational process has also been found to be causative of antigenic variation; namely the re-assortment process, a phenomenon called antigenic shift, in which two or more distinct influenza viruses, possibly from different species, interchange their genetic material and generate a new variant that is completely novel to the pre-existing immunological surveillances, and that can lead to pandemic situations2,6.

The genome of influenza A virus comprises 8 genetic segments4, two of which encode for the important trans-membrane envelope proteins HA and NA. The HA protein itself is divided into two domains: HA1 (of 329 amino acids long for H3N2) and HA2 (175 amino acids for H3N2)7. These two domains are interconnected via covalent disulfide bounds8. HA1 contains five antibody-binding regions or epitopes (labelled A, B, C, D, and E) on its globular head that is used to inject the virus into the host cells through the sialic acid receptor binding sites6. These epitopes have been found to be the primary targets of neutralizing antibodies and regularly mutate5. The NA protein also undergoes some mutations but this has limited impact on virus antigenicity9.

Prediction of antigenic relationships (similar or variant) among influenza viruses is useful in designing (or updating the existing) effective influenza vaccines, provides important insights into the evolutionary mechanisms that underpin their antigenic variations due to the pressure of natural selection imposed by the neutralizing antibodies10, and helps to understand the epidemiology of these pathogens.

Two biochemical assays are commonly used to characterize antigenic relationships among influenza viruses; namely, the Hemagglutinin-Inhibition (HI) assay, which measures the ability of an antibody that has been raised against one variant to inhibit agglutination of red blood cells by another variant, and the virus micro-neutralization (MN) assay2,11. These serological assays are confronted with a multitude of issues including that they are labour-intensive and time-consuming; they produce variable outputs12,13 especially when the experiment is carried out under different conditions, such as using different concentrations of virus and of red cell14; they are unsuitable for large quantitative assays15; their outputs are contaminated by noise14; and they require a high level of biosafety when analyzing some viruses (e.g. H5)16.

A large amount of genomic data related to influenza viruses has been generated from projects like Influenza Virus Resource17, obtained from the high-throughput assays of complete genome sequencing made possible by new biotechnologies. Therefore, computational methods have been introduced to analyse the viral mutations. Several sophisticated computer models, arising from different theoretical perspectives, have been proposed which combine high-throughput data and low-throughput antigenic data (e.g. from HI assays) to study virus antigenicity. Some models infer antigenic relationships among influenza variants (predictive models) from the amino acid mutations from large numbers of sequences on HA1 domains, while others make predictions about the flu variants that will dominate the next season (forecasting models) from the surveillance data of the current season18. However, the prediction step precedes the forecasting step because the vaccine only has to be updated if the newly emerging variant differs antigenically from the current vaccine strain. Also, from the results of a prediction model, antigenic cartography- grouping influenza viruses into distinct clusters- and phylogenetic tree of influenza viruses could be established, which facilitate visualization and interpretation of antigenic relationships. Through construction of a 2D antigenic map, using a multidimensional scaling technique, Smith et al.15 characterized the antigenic evolution history for influenza A/H3N2 viruses from 1968 to 2003. Likewise but using a 3D instead of a 2D antigenic map, Barnett et al.19 developed an online antigenic cartography resource to determine antigenic drifts and shifts for influenza viruses. Liu et al.20 combined phylogenetic trees with a Naïve Bayesian Model to map the antigenic patterns of influenza A/H1N1 viruses in China. Based on phylodynamics analysis Hadfield et al. developed an online application called Nextstrain for real-time tracking and visualization of the evolution of influenza and other epidemiological viruses such as Zika and Ebola21.

The prediction models that correlate genetic mutation and antigenic data were developed based on various inference techniques, including information theory12 and data-driven machine-learning approaches13. Liao and his co-workers7 proposed a number of bioinformatics models, which are based on scoring models and regression methods including multiple regression, logistic regression and Support Vector Machine (SVM), for predicting antigenic variants of influenza A/H3N2 viruses. Moreover, Lees et al.10 developed several linear models accounting for different regions on HA1 sequence including the five epitopes, N-linked glycosylation, and other non antigenic regions, for predicting the antigenicity of A/H3N2 strains. Yin et al.22 developed a staking model that draws a consensus conclusion from the outcomes of set of classifiers. Furthermore, some prediction methods do not rely solely on the number of amino acid changes between two variant sequences, but also incorporate additional properties such as amino acids substitution metrics13 that account for certain features of amino acid residues like BLOSUM62, or physiochemical differences like the amino acid volume and electrostatic charge8,23, in order to improve the predictive performance. Zhou et al.24 proposed a Context-Free Encoding Scheme (CFreeEnS) prediction method that allows to be integrated with a large number of different substitution matrices for protein sequences.

These computational methods have already demonstrated their usefulness, helped to uncover many characteristics of virus evolution, and provided a promising approach for efficient vaccine strains selection2. However, a majority of these prediction models are data driven and lack a concrete and mechanistic theoretical basis.

In this study we present a simple biophysical model that can predict antigenic relationships among influenza A viruses using both genomic amino acid sequences and experimental antigenic data. Antigenicity of an influenza virus has two remarkable features: sparsity- a few critical positions on the HA1 protein undergo antigenically consequential genetic mutation, and co-evolution- certain positions tend to co-mutate jointly25. Therefore, we enhance our model by carefully choosing a regularization term that actualizes both features, namely the elastic-net lasso. We demonstrate the applicability of the model using benchmark datasets of four subtypes of influenza A viruses (H1N1, H3N2, H5N1, and H9N2) and report on its performance profiles. Furthermore, we checked the effectiveness of model on a large, novel (unseen) dataset of influenza A/H1N1 viruses and found that model achieved a high AUC value (an established measure of prediction performance) of 0.86.

Materials and Method

Materials

The model has been developed using both antigenic data and HA1 protein sequences of influenza viruses. The antigenic data that measures the relationship between pairs of influenza viruses are given in the form of reciprocal normalized Hemagglutinin Inhibition-(HI) values, denoted NHT14. The smaller the NHT value the closer the antigenic similarity between the two viruses. A pair of viruses is considered to be antigenically similar if their corresponding log-transformed NHT value is ≤log(4), and otherwise they are said to be antigenically different25. We obtained a total of 1557 pairs of influenza A viruses with measured antigenic relationships, spanning 4 subtypes, from the study of Peng et al.12, which is publicly available and has been assembled from the relevant literature and documents published by the collaborating networks of the World Health Organization’s (WHO) global influenza surveillance network26 (for more information, see the Supplementary Material of12). These datasets have been partially used in the study of other computational methods in this research area8,10. We choose these particular datasets because of their epidemiological importance and also we would be able to readily make comparisons with other models that have used them. The datasets contained information for 291 unique influenza A viruses.

We downloaded the HA1 protein sequences of the 291 viruses belonging to the 4 subtypes from the Influenza Virus Resource17. For each subset of HA1 proteins belonging to a particular subtype, we performed multiple sequence alignment analysis using the msa package in R27. Additionally, 131 important amino acid positions found on or nearby the five canonical epitopes of the H3N2 viruses, which are the primary targets of the neutralizing antibodies, were obtained from the literature5,8,10,12. These positions are given in Table S1 in the Supplementary Materials. Table 1 below provides an overview of the datasets used to build the model.

Table 1 Overview of antigenic datasets used to develop the model.

Model

The mathematics of statistical mechanics defines the relationship between the energy of a system such as the binding of a virus to an antiserum, and its thermodynamic quantities, which in our case are represented by NHT measurements. If the functional forms specifying the energy (Hamiltonians) are known, we can derive the thermodynamic measurements, and vice-versa. In complex systems, like the virus-antiserum mixtures we deal with here, it is difficult to identify the Hamiltonians precisely, but they could be approximated from thermodynamic measurements. Here we take the inverse approach: from the available empirical measurements we approximate the Hamiltonians that govern the interactions between the virus and antiserum, through a machine learning approach. In fact, recently, several problems in computational biology have been addressed by means of such an inverse biophysical approach28,29,30,31, with very meaningful results.

This inverse approach cannot be applied directly in the case of antigenic similarity among influenza viruses because three elements are involved in the process of determining the HI value: the two viruses, [the homologous virus (νi) and the heterologous virus (νj)], and the antiserum (Y). However, analogous to the steps of serological assays, the quantification of the HI value can be viewed as follows: A concentrated solution of the homologous virus νi and red cells is mixed up with the antiserum Y under standard conditions. This mixture of virus-antiserum attains an equilibrium state14. Suppose that its energy is given by E(νi). Likewise, under the same conditions, suppose a concentrated solution of the heterologous virus νj is mixed up with the same red cells and antiserum, and reaches the equilibrium state with energy E(νj). Then one could define the antigenic relation between the two viruses as a function of the difference between the two energies, ΔEij = E(νi) − E(νj). Thus, we compute the antigenic dissimilarity or “distance” between two strains, say νi and νj, of influenza viruses as follows:

$${d}_{ij}={e}^{{\rm{\Delta }}{E}_{ij}}\Rightarrow {\rm{\Delta }}{E}_{ij}=\,\mathrm{log}({d}_{ij})$$
(1)

Now, assuming that energy is additive, let us assign a Hamiltonian function for each amino acid type at each position of the sequence of HA1 protein for the particular virus (refer to the Supplementary Materials for full details about the formulation of the model) such that ΔEij is the sum of changes in energy between the two strains and it is computed as follows:

$${\rm{\Delta }}{E}_{ij}=\sum _{s=1}^{n}\,\delta H({a}_{s(k)}^{i},{b}_{s(\bar{k})}^{j}),$$
(2)

where \(\delta H({a}_{s(k)}^{i},{b}_{s(\bar{k})}^{j})\) is the function or Hamiltonian that assigns an amount of energy required when amino acid a of type k on virus νi changed to amino acid b of type \(\bar{k}\) on virus νj, both at position s. k and \(\bar{k}\) could be any of the 20 standard amino acids, and n is the length of the HA1 protein. This definition of antigenic similarity can potentially tolerate sub-optimal matching between pairs of sequences being compared.

Notice that the definition of the antigenic relatedness in Eq. (1) is consistent and well-defined; because dij = 0 for perfectly similar viruses i.e. νi = νj, and the dij increases as the two viruses become more dissimilar/divergent. The Hamiltonians vector δH represents the model’s parameters and is learned from the experimental measurements (NHT), by minimizing the following mean square error (MSE):

$${MSE}=\frac{1}{N}\sum _{n=1}^{N}\,{[{D}_{ij}^{(n)}-{d}_{ij}^{(n)}]}^{2}$$
(3)

where Dij is the target value between viruses i and j, derived from the log-transformed NHT measurements, log(HIij). Equation (3) is subject to the following constraints:

$${P}_{\alpha }(t)=\sum _{j=1}^{d}\,[(1-\alpha )\delta {H}_{j}^{2}+\alpha |\delta {H}_{j}|]\le t$$
(4)

for regularization parameters t > 0 and 0 ≤ α ≤ 1. This constraint in Eq. (4) represents the penalty term of the model; it is a combination of L1 and L2, which is known as elastic-net lasso32. L2 is good at detecting correlations among the features. In this context we devise it to capture the co-evolving positions. Meanwhile L1 induces sparsity into the model, consistent with empirical expectations. Also the regularization term Pα(t) avoids data over-fitting and improves the model’s performance on a novel dataset. t is the meta-parameter that governs the whole penalization term. α is a trade-off between L1 and L2; the larger the value of α the greater the emphasis on L1, and the smaller the value of α the greater the emphasis on L2.

Equation (3) together with the penalization term of Eq. (4) form a non-linear but convex function, and we solved it via an iterative, cyclic coordinate descent and adaptive Gauss-Seidel method. Details of this optimization algorithm are found in33.

Results

Predictive performance of the model

We evaluated the predictive performance of the model on datasets for the four influenza A subtypes considered in this study (Section 1.1), using a five-fold cross-validation test to avoid data over-fitting. In the cross-validation process, an antigenic dataset of a particular subtype was randomly divided into five quasi equal-sized subsets. We held-out one subset, while the remaining four subsets were combined and used to train the model. The held-out subset was used to test model performance. This process was repeated for each subset, and the test results from all the five subsets were combined in order to assess the overall performance of the model subject to that particular subtype.

We used several statistical metrics to measure the model’s performance, including Area Under the Curve (AUC) of the Receiver-Operator Characteristic (ROC), accuracy, sensitivity, specificity, Pearson correlation, and root mean square error (RMSE). The AUC is a measure that takes both true positive rate (TPR) and false positive rate (FPR); its value ranges from 0 to 1. The higher the AUC value, the better the performance, and the AUC value of = 0.5 is equivalent to the performance of a random model. The accuracy metric assesses the ability of the model to correctly identify the antigenic similarity between two influenza viruses.

Table 2 shows the relative performance of the model based on these metrics, and Fig. 1 depicts ROC curves for the four influenza A subtypes.

Table 2 Five-fold cross-validation results of the model.
Figure 1
figure 1

ROC Curves.

The model has an excellent predictive performance for subtype H5N1 (AUC = 0.90, accuracy = 0.89, and correlation = 0.86) and a very good performance for the other three subtypes. We compared the results of the current model to the results of the residue-based model PREDAV-FluA12, which was built using the same datasets. We observed that the current model outperformed the PREDAV-FluA model for both H3N2 (accuracy = 0.90 to 0.86) and H5N1 (accuracy 0.89 to 0.86) subtypes. The latter model had a better performance for the H1N1 subtype (accuracy 0.81 to 0.83) (see Fig. 2).

Figure 2
figure 2

Comparing with the PREDAV-FLUA model.

Energy variation across the canonical epitopes of H3N2

Analysis of the model’s parameters also revealed that all five immunodominant epitopes contributed varying amounts of energy to the antigenic properties of each of the four considered influenza A subtypes. This is consistent with the long and well-established observation that correlates the antigenic variation to concurrent mutations in multiple distinct regions of the HA1 sequence10. Strikingly, for Influenza A H3N2 subtype, the high efficiency neutralization epitopes, A, B, and D5, which are located within a close spatial proximity of receptor-binding sites (see Table S3 in the Supplementary Materials) and are the most preferred targets of antibodies, contribute relatively higher amounts of energy compared to low efficiency neutralization epitopes, i.e. epitopes E and C (Fig. 3). This finding is in agreement with other previous studies5,13,25.

Figure 3
figure 3

Energy distribution over epitopes.

Moreover, we found that most of the energy contributions, approximately 67% of the total energy, actually arise from the five epitopes. The rest of it distributed between residues that has been found to be targets of monoclonal antibodies5, receptor binding (RB) sites, and other consensus sites (Fig. 4).

Figure 4
figure 4

Normalized energy contribution from overall residues.

Influenza A subtypes undergo similar mutations

It is generally considered that genomic variation in the HA1 protein drives the evolution of influenza viruses and allows them to escape antibody neutralization. From the analysis of the model’s parameters we found that the four influenza A subtypes (H1N1, H3N2, H5N1, H9N2) exhibit similar patterns of energy fluctuations over HA1 protein positions. As shown in Fig. 5, the four subtypes share very similar energy peaks over certain positions along the HA1 sequence. This observation indicates the existence of a common evolution mechanism among influenza A viruses, and supports the conclusions of other previous studies13,34,35. The energy contribution of each position was calculated by summing up the residue-residue Hamiltonians associated with that position (See Eq. 1 in the Supplementary Materials).

Figure 5
figure 5

Energy distribution over residue positions.

Notice that the high peaks in Fig. 5 reflect strong repulsive energies, i.e mutations that can cause antigenic drift events (positively selected positions), while the low peaks reflect attractive energies, i.e. mutations that cause viruses to become antigenically more similar.

Most positions (64%, 76%, 64% and 71% of H1N1, H3N2, H5N1 and H9N2 positions, respectively) are silent, i.e. with zero energy contributions, consistent with the empirical fact that only minority of positions undergo antigenically consequential mutations.

In addition, we also found strong correlations among influenza A subtype Hamiltonians; varying from 0.87 between H1N1 and H3N2, with p-value = 0.026); to 0.67 between H1N1 and H95N2, with p-value = 0.01 (see Table S2 in the Supplementary Materials).

Predictions on a validation dataset

To examine the effectiveness of the model presented here and to assess how well it can be extrapolated on a novel (unseen) dataset, we tested the predictive performance of the model on a dataset of epidemics and pandemics of influenza A/H1N1 which we obtained from ref.22. Similar to the dataset described in Section 1.1, the validation dataset also has two parts, antigenic data based on hemagglutination inhibition (HI) assay and genomic sequences. The HI antigenic data were compiled from various sources of flu data, including the Francis Crick Institute (FCI), European Centre for Disease Prevention and Control (ECDC), World Health Organization (WHO), U.S Food and Drug Administration (FDA), and others. The genomic sequences were obtained from Influenza Virus Resource (IVR)36 and Global Initiative on Sharing All Influenza Data (GISAID)37. The original dataset was large, but after excluding the A/H1N1 viral pairs that occurred in the dataset used to build the model and removing duplications, we ended up with a total of 642 viral pairs and 168 sequences (Table 3). We choose this dataset for a number of important reasons. Firstly, it is publicly available, organized in a chronological order, and contains antigenic data related to the last five epidemics and pandemics caused by A/H1N1 viruses, from 1921 to 2016. Secondly, other computational methods have greatly neglected influenza A/H1N1, to such a degree that there is insufficient knowledge about it22. Table 3 gives per period summary of the dataset, the cleaned antigenic data and genomic sequences are provided in Supplementary Files S1 and S2, respectively.

Table 3 The validation dataset.

The predictive performance of the model on the validation dataset was assessed using two different strategies. In the first strategy, we utilized the validation dataset as a novel dataset and measured the performance using the model’s parameters learned using from the A/H1N1 data described in Section 1.1. The model achieved an AUC value of 0.86 (blue curve in Fig. 6). In the second strategy, we performed a five-fold cross-validation test using the validation dataset itself. In this case the model achieved an AUC value of 0.94 (red curve in Fig. 6). In the later case, the model has an improved performance because of the larger size of the dataset. These results demonstrate the reliability of the model in predicting antigenic relationships.

Figure 6
figure 6

Energy distribution over residue positions.

General model

The model given by Eq. (3) together with the penalization term of Eq. (4) is a subtype model; thus it was applied for each individual influenza subtype. But, we thought of a general model that predicts the antigenicity of all influenza subtypes, based on the simple fact that influenza viruses share a common evolutionary origin. In this section, we show that a model can be upgraded to become a general model.

From the available crystallographic structures of influenza A viruses, ten antigenic regions on the HA1 proteins; called artificial sites and largely govern the antigenic variations, has been identified10,12. These regions; denoted by E1 to E10, cover almost the entire HA1 protein and each one contains group of amino acid residues according to the distance from the globular head of HA in a descending order. We utilized such antigenic regions to generalize our model so that it could predict antigenic similarities for all subtypes of influenza A. Thus, we changed the definition of ΔEij in Eq. (2) between two strains νi and νj as follows:

$${\rm{\Delta }}{E}_{ij}=\sum _{k=1}^{10}\,\delta H(k,a,b),$$
(5)

such that δH(k, a, b) accounts for the energy of replacing amino acid a in strain i by amino acid b in strain j at the antigenic region k. The rest of equations remain unchanged.

We evaluated the performance of the general model on the dataset described in Section 1.1 of all the four influenza A subtypes considered in this study, using five-fold cross-validation. It scored AUC value = 0.78 (Fig. 7), accuracy = 0.80, sensitivity = 0.65, specificity = 0.92, and correlation = 0.77.

Figure 7
figure 7

Performance of general model.

Discussion

Accurate and reliable prediction of antigenic similarity among influenza viruses is necessary for optimal vaccine strain selection. In this work we have presented a new approach for predicting antigenic relatedness of influenza variants based on biophysical ideas.

Many factors related to the HA1 protein influence the antigenicity of influenza viruses, spanning from structural conformations23 to physiochemical features8 like hydrophobicity; however, these factors are not independent as has been demonstrated by the work of Yuhua et al.13. The predictive performance of their model improved when they integrated the model with combination of a few substitution metrics that reflected some of those factors, especially structure-based substitution matrices. But from the properties of thermodynamic and statistical mechanics, energy is a universal currency38. Thus, it is expected that the energy would reflect the aggregated net-effect of all the factors associated with the HA1 protein that might influence the antigenicity. This extremely useful inclusion of all factors, in a very simple way, is the power of our approach over other sequence-based methods which try to incorporate a few selective factors through some amino acids substitution metrics.

Sparsity is a hallmark property of cellular processes; for example, in this context, a few amino acids mutations drive most of the antigenic drift events for influenza viruses. Therefore, our model accounts for the sparsity, with the aid of an elastic-net penalization term.

From their cartography model that characterizes the evolutionary dynamics of H3N2 viruses, Smith et al.15 identified a list of amino acid substitutions that are associated with antigenic drift events between 10 clusters of H3N2- called cluster-difference substitutions- using influenza surveillance data collected between 1968 and 2003. We further investigated the consistency of our model’s parameters with that of15. As expected, we found that most of these amino acid substitutions corresponded to Hamiltonians of repulsive energy. Of the 67 cluster-difference amino acid mutations, 46 (approximately About 70%) were found to have non-zero energetic contributions (Table 4). Furthermore, most of these cluster transitions are the result of several concurrent cluster-difference substitutions; for example, transition from England 1972 (EN72) to Victoria 1975 (VI75) characterized by 12 amino acid substitutions (Table 4). In such cases we found some of the substitutions contribute a positive amount (repulsive) of energy, while some of them contribute a negative amount (attractive) of energy. A few cluster transitions are the result of single amino acid mutation; for example, the two transitions from Sichuan 1987 (SI87) to Beijing 1989 (BE89) and Beijing 1992 (BE92) to Wuhan 1995 (WU95), are both characterized by only single mutation N145K (at position 145 of HA1 protein, the amino acid asparagine (N) changed to amino acid lysine (K)). In these cases, we found that such substitutions always have strong repulsive energy (see Table 4). Overall, the net energy contribution for any cluster transition is always positive.

Table 4 Relative energy contributions for important amino acid mutations that drive clusters transition.

The model can be readily incorporated into current surveillance systems for influenza  -e.g. by supporting surveillance labs to interpret the antigenic and sequence data they routinely collect. The logical next step of this work is to extrapolate the structural similarities among influenza A viruses and develop a more accurate general purpose and pan-specific computational model that could predict the antigenicity of all the subtypes. Such a model would reconcile both structural inconsistencies among influenza viruses and experimental artefacts associated with antigenic data in order to obtain more accurate results. A further study is needed to produce a forecasting version of the model based on the same biophysical principles which would be helpful in recommending influenza vaccine strains.