Predicting Antigenicity of Influenza A Viruses Using biophysical ideas

Antigenic variations of influenza A viruses are induced by genomic mutation in their trans-membrane protein HA1, eliciting viral escape from neutralization by antibodies generated in prior infections or vaccinations. Prediction of antigenic relationships among influenza viruses is useful for designing (or updating the existing) influenza vaccines, provides important insights into the evolutionary mechanisms underpinning viral antigenic variations, and helps to understand viral epidemiology. In this study, we present a simple and physically interpretable model that can predict antigenic relationships among influenza A viruses, based on biophysical ideas, using both genomic amino acid sequences and experimental antigenic data. We demonstrate the applicability of the model using a benchmark dataset of four subtypes of influenza A (H1N1, H3N2, H5N1, and H9N2) viruses and report on its performance profiles. Additionally, analysis of the model’s parameters confirms several observations that are consistent with the findings of other previous studies, for which we provide plausible explanations.

www.nature.com/scientificreports www.nature.com/scientificreports/ that underpin their antigenic variations due to the pressure of natural selection imposed by the neutralizing antibodies 10 , and helps to understand the epidemiology of these pathogens.
Two biochemical assays are commonly used to characterize antigenic relationships among influenza viruses; namely, the Hemagglutinin-Inhibition (HI) assay, which measures the ability of an antibody that has been raised against one variant to inhibit agglutination of red blood cells by another variant, and the virus micro-neutralization (MN) assay 2,11 . These serological assays are confronted with a multitude of issues including that they are labour-intensive and time-consuming; they produce variable outputs 12,13 especially when the experiment is carried out under different conditions, such as using different concentrations of virus and of red cell 14 ; they are unsuitable for large quantitative assays 15 ; their outputs are contaminated by noise 14 ; and they require a high level of biosafety when analyzing some viruses (e.g. H5) 16 .
A large amount of genomic data related to influenza viruses has been generated from projects like Influenza Virus Resource 17 , obtained from the high-throughput assays of complete genome sequencing made possible by new biotechnologies. Therefore, computational methods have been introduced to analyse the viral mutations. Several sophisticated computer models, arising from different theoretical perspectives, have been proposed which combine high-throughput data and low-throughput antigenic data (e.g. from HI assays) to study virus antigenicity. Some models infer antigenic relationships among influenza variants (predictive models) from the amino acid mutations from large numbers of sequences on HA1 domains, while others make predictions about the flu variants that will dominate the next season (forecasting models) from the surveillance data of the current season 18 . However, the prediction step precedes the forecasting step because the vaccine only has to be updated if the newly emerging variant differs antigenically from the current vaccine strain. Also, from the results of a prediction model, antigenic cartography-grouping influenza viruses into distinct clusters-and phylogenetic tree of influenza viruses could be established, which facilitate visualization and interpretation of antigenic relationships. Through construction of a 2D antigenic map, using a multidimensional scaling technique, Smith et al. 15 characterized the antigenic evolution history for influenza A/H3N2 viruses from 1968 to 2003. Likewise but using a 3D instead of a 2D antigenic map, Barnett et al. 19 developed an online antigenic cartography resource to determine antigenic drifts and shifts for influenza viruses. Liu et al. 20  The prediction models that correlate genetic mutation and antigenic data were developed based on various inference techniques, including information theory 12 and data-driven machine-learning approaches 13 . Liao and his co-workers 7 proposed a number of bioinformatics models, which are based on scoring models and regression methods including multiple regression, logistic regression and Support Vector Machine (SVM), for predicting antigenic variants of influenza A/H3N2 viruses. Moreover, Lees et al. 10 developed several linear models accounting for different regions on HA1 sequence including the five epitopes, N-linked glycosylation, and other non antigenic regions, for predicting the antigenicity of A/H3N2 strains. Yin et al. 22 developed a staking model that draws a consensus conclusion from the outcomes of set of classifiers. Furthermore, some prediction methods do not rely solely on the number of amino acid changes between two variant sequences, but also incorporate additional properties such as amino acids substitution metrics 13 that account for certain features of amino acid residues like BLOSUM62, or physiochemical differences like the amino acid volume and electrostatic charge 8,23 , in order to improve the predictive performance. Zhou et al. 24 proposed a Context-Free Encoding Scheme (CFreeEnS) prediction method that allows to be integrated with a large number of different substitution matrices for protein sequences.
These computational methods have already demonstrated their usefulness, helped to uncover many characteristics of virus evolution, and provided a promising approach for efficient vaccine strains selection 2 . However, a majority of these prediction models are data driven and lack a concrete and mechanistic theoretical basis.
In this study we present a simple biophysical model that can predict antigenic relationships among influenza A viruses using both genomic amino acid sequences and experimental antigenic data. Antigenicity of an influenza virus has two remarkable features: sparsity-a few critical positions on the HA1 protein undergo antigenically consequential genetic mutation, and co-evolution-certain positions tend to co-mutate jointly 25 . Therefore, we enhance our model by carefully choosing a regularization term that actualizes both features, namely the elastic-net lasso. We demonstrate the applicability of the model using benchmark datasets of four subtypes of influenza A viruses (H1N1, H3N2, H5N1, and H9N2) and report on its performance profiles. Furthermore, we checked the effectiveness of model on a large, novel (unseen) dataset of influenza A/H1N1 viruses and found that model achieved a high AUC value (an established measure of prediction performance) of 0.86.

Materials and Method
Materials. The model has been developed using both antigenic data and HA1 protein sequences of influenza viruses. The antigenic data that measures the relationship between pairs of influenza viruses are given in the form of reciprocal normalized Hemagglutinin Inhibition-(HI) values, denoted NHT 14 . The smaller the NHT value the closer the antigenic similarity between the two viruses. A pair of viruses is considered to be antigenically similar if their corresponding log-transformed NHT value is ≤log(4), and otherwise they are said to be antigenically different 25 . We obtained a total of 1557 pairs of influenza A viruses with measured antigenic relationships, spanning 4 subtypes, from the study of Peng et al. 12 , which is publicly available and has been assembled from the relevant literature and documents published by the collaborating networks of the World Health Organization's (WHO) global influenza surveillance network 26 (for more information, see the Supplementary Material of 12 ). These datasets have been partially used in the study of other computational methods in this research area 8,10 . We choose these particular datasets because of their epidemiological importance and also we would be able to readily www.nature.com/scientificreports www.nature.com/scientificreports/ make comparisons with other models that have used them. The datasets contained information for 291 unique influenza A viruses.
We downloaded the HA1 protein sequences of the 291 viruses belonging to the 4 subtypes from the Influenza Virus Resource 17 . For each subset of HA1 proteins belonging to a particular subtype, we performed multiple sequence alignment analysis using the msa package in R 27 . Additionally, 131 important amino acid positions found on or nearby the five canonical epitopes of the H3N2 viruses, which are the primary targets of the neutralizing antibodies, were obtained from the literature 5,8,10,12 . These positions are given in Table S1 in the Supplementary Materials. Table 1 below provides an overview of the datasets used to build the model.

Model.
The mathematics of statistical mechanics defines the relationship between the energy of a system such as the binding of a virus to an antiserum, and its thermodynamic quantities, which in our case are represented by NHT measurements. If the functional forms specifying the energy (Hamiltonians) are known, we can derive the thermodynamic measurements, and vice-versa. In complex systems, like the virus-antiserum mixtures we deal with here, it is difficult to identify the Hamiltonians precisely, but they could be approximated from thermodynamic measurements. Here we take the inverse approach: from the available empirical measurements we approximate the Hamiltonians that govern the interactions between the virus and antiserum, through a machine learning approach. In fact, recently, several problems in computational biology have been addressed by means of such an inverse biophysical approach 28-31 , with very meaningful results.
This inverse approach cannot be applied directly in the case of antigenic similarity among influenza viruses because three elements are involved in the process of determining the HI value: the two viruses, [the homologous virus (ν i ) and the heterologous virus (ν j )], and the antiserum (Y). However, analogous to the steps of serological assays, the quantification of the HI value can be viewed as follows: A concentrated solution of the homologous virus ν i and red cells is mixed up with the antiserum Y under standard conditions. This mixture of virus-antiserum attains an equilibrium state 14 . Suppose that its energy is given by E(ν i ). Likewise, under the same conditions, suppose a concentrated solution of the heterologous virus ν j is mixed up with the same red cells and antiserum, and reaches the equilibrium state with energy E(ν j ). Then one could define the antigenic relation between the two viruses as a function of the difference between the two energies, ΔE ij = E(ν i ) − E(ν j ). Thus, we compute the antigenic dissimilarity or "distance" between two strains, say ν i and ν j , of influenza viruses as follows: Now, assuming that energy is additive, let us assign a Hamiltonian function for each amino acid type at each position of the sequence of HA1 protein for the particular virus (refer to the Supplementary Materials for full details about the formulation of the model) such that ΔE ij is the sum of changes in energy between the two strains and it is computed as follows: is the function or Hamiltonian that assigns an amount of energy required when amino acid a of type k on virus ν i changed to amino acid b of type k on virus ν j , both at position s. k and k could be any of the 20 standard amino acids, and n is the length of the HA1 protein. This definition of antigenic similarity can potentially tolerate sub-optimal matching between pairs of sequences being compared.
Notice that the definition of the antigenic relatedness in Eq. (1) is consistent and well-defined; because d ij = 0 for perfectly similar viruses i.e. ν i = ν j , and the d ij increases as the two viruses become more dissimilar/divergent. The Hamiltonians vector δH represents the model's parameters and is learned from the experimental measurements (NHT), by minimizing the following mean square error (MSE): where D ij is the target value between viruses i and j, derived from the log-transformed NHT measurements, log(HI ij ). Equation (3) is subject to the following constraints: www.nature.com/scientificreports www.nature.com/scientificreports/ for regularization parameters t > 0 and 0 ≤ α ≤ 1. This constraint in Eq. (4) represents the penalty term of the model; it is a combination of L 1 and L 2 , which is known as elastic-net lasso 32 . L 2 is good at detecting correlations among the features. In this context we devise it to capture the co-evolving positions. Meanwhile L 1 induces sparsity into the model, consistent with empirical expectations. Also the regularization term P α (t) avoids data over-fitting and improves the model's performance on a novel dataset. t is the meta-parameter that governs the whole penalization term. α is a trade-off between L 1 and L 2 ; the larger the value of α the greater the emphasis on L 1 , and the smaller the value of α the greater the emphasis on L 2 .
Equation (3) together with the penalization term of Eq. (4) form a non-linear but convex function, and we solved it via an iterative, cyclic coordinate descent and adaptive Gauss-Seidel method. Details of this optimization algorithm are found in 33 .

Results
Predictive performance of the model. We evaluated the predictive performance of the model on datasets for the four influenza A subtypes considered in this study (Section 1.1), using a five-fold cross-validation test to avoid data over-fitting. In the cross-validation process, an antigenic dataset of a particular subtype was randomly divided into five quasi equal-sized subsets. We held-out one subset, while the remaining four subsets were combined and used to train the model. The held-out subset was used to test model performance. This process was repeated for each subset, and the test results from all the five subsets were combined in order to assess the overall performance of the model subject to that particular subtype.
We used several statistical metrics to measure the model's performance, including Area Under the Curve (AUC) of the Receiver-Operator Characteristic (ROC), accuracy, sensitivity, specificity, Pearson correlation, and root mean square error (RMSE). The AUC is a measure that takes both true positive rate (TPR) and false positive rate (FPR); its value ranges from 0 to 1. The higher the AUC value, the better the performance, and the AUC value of = 0.5 is equivalent to the performance of a random model. The accuracy metric assesses the ability of the model to correctly identify the antigenic similarity between two influenza viruses. Table 2 shows the relative performance of the model based on these metrics, and Fig. 1 depicts ROC curves for the four influenza A subtypes.
The model has an excellent predictive performance for subtype H5N1 (AUC = 0.90, accuracy = 0.89, and correlation = 0.86) and a very good performance for the other three subtypes. We compared the results of the current model to the results of the residue-based model PREDAV-FluA 12 , which was built using the same datasets. We observed that the current model outperformed the PREDAV-FluA model for both H3N2 (accuracy = 0.90 to 0.86) and H5N1 (accuracy 0.89 to 0.86) subtypes. The latter model had a better performance for the H1N1 subtype (accuracy 0.81 to 0.83) (see Fig. 2).  Table 2. Five-fold cross-validation results of the model. www.nature.com/scientificreports www.nature.com/scientificreports/ Energy variation across the canonical epitopes of H3N2. Analysis of the model's parameters also revealed that all five immunodominant epitopes contributed varying amounts of energy to the antigenic properties of each of the four considered influenza A subtypes. This is consistent with the long and well-established observation that correlates the antigenic variation to concurrent mutations in multiple distinct regions of the HA1 sequence 10 . Strikingly, for Influenza A H3N2 subtype, the high efficiency neutralization epitopes, A, B, and D 5 , which are located within a close spatial proximity of receptor-binding sites (see Table S3 in the Supplementary Materials) and are the most preferred targets of antibodies, contribute relatively higher amounts of energy compared to low efficiency neutralization epitopes, i.e. epitopes E and C (Fig. 3). This finding is in agreement with other previous studies 5, 13,25 .

Metrics
Moreover, we found that most of the energy contributions, approximately 67% of the total energy, actually arise from the five epitopes. The rest of it distributed between residues that has been found to be targets of monoclonal antibodies 5 , receptor binding (RB) sites, and other consensus sites (Fig. 4).

Influenza A subtypes undergo similar mutations.
It is generally considered that genomic variation in the HA1 protein drives the evolution of influenza viruses and allows them to escape antibody neutralization. From the analysis of the model's parameters we found that the four influenza A subtypes (H1N1, H3N2, H5N1, H9N2) exhibit similar patterns of energy fluctuations over HA1 protein positions. As shown in Fig. 5, the four subtypes share very similar energy peaks over certain positions along the HA1 sequence. This observation indicates the existence of a common evolution mechanism among influenza A viruses, and supports the conclusions  Notice that the high peaks in Fig. 5 reflect strong repulsive energies, i.e mutations that can cause antigenic drift events (positively selected positions), while the low peaks reflect attractive energies, i.e. mutations that cause viruses to become antigenically more similar.
In addition, we also found strong correlations among influenza A subtype Hamiltonians; varying from 0.87 between H1N1 and H3N2, with p-value = 0.026); to 0.67 between H1N1 and H95N2, with p-value = 0.01 (see Table S2 in the Supplementary Materials).

Predictions on a validation dataset.
To examine the effectiveness of the model presented here and to assess how well it can be extrapolated on a novel (unseen) dataset, we tested the predictive performance of the model on a dataset of epidemics and pandemics of influenza A/H1N1 which we obtained from ref. 22 . Similar to the dataset described in Section 1.1, the validation dataset also has two parts, antigenic data based on hemagglutination inhibition (HI) assay and genomic sequences. The HI antigenic data were compiled from various sources of flu data, including the Francis Crick Institute (FCI), European Centre for Disease Prevention and Control (ECDC), World Health Organization (WHO), U.S Food and Drug Administration (FDA), and others. The genomic sequences were obtained from Influenza Virus Resource (IVR) 36 and Global Initiative on Sharing All Influenza Data (GISAID) 37 . The original dataset was large, but after excluding the A/H1N1 viral pairs that occurred in the dataset used to build the model and removing duplications, we ended up with a total of 642 viral pairs and 168 sequences (Table 3). We choose this dataset for a number of important reasons. Firstly, it is publicly available, organized in a chronological order, and contains antigenic data related to the last five epidemics and pandemics caused by A/H1N1 viruses, from 1921 to 2016. Secondly, other computational methods have greatly neglected influenza A/H1N1, to such a degree that there is insufficient knowledge about it 22 . Table 3 gives per period summary of the dataset, the cleaned antigenic data and genomic sequences are provided in Supplementary Files S1 and S2, respectively.
The predictive performance of the model on the validation dataset was assessed using two different strategies. In the first strategy, we utilized the validation dataset as a novel dataset and measured the performance using the model's parameters learned using from the A/H1N1 data described in Section 1.1. The model achieved an AUC value of 0.86 (blue curve in Fig. 6). In the second strategy, we performed a five-fold cross-validation test using the validation dataset itself. In this case the model achieved an AUC value of 0.94 (red curve in Fig. 6). In the later case, the model has an improved performance because of the larger size of the dataset. These results demonstrate the reliability of the model in predicting antigenic relationships.
General model. The model given by Eq. (3) together with the penalization term of Eq. (4) is a subtype model; thus it was applied for each individual influenza subtype. But, we thought of a general model that predicts the antigenicity of all influenza subtypes, based on the simple fact that influenza viruses share a common evolutionary origin. In this section, we show that a model can be upgraded to become a general model.
From the available crystallographic structures of influenza A viruses, ten antigenic regions on the HA1 proteins; called artificial sites and largely govern the antigenic variations, has been identified 10,12 . These regions; denoted by E 1 to E 10 , cover almost the entire HA1 protein and each one contains group of amino acid residues according to the distance from the globular head of HA in a descending order. We utilized such antigenic regions to generalize our model so that it could predict antigenic similarities for all subtypes of influenza A. Thus, we changed the definition of ΔE ij in Eq. (2) between two strains ν i and ν j as follows:    Table 3. The validation dataset. www.nature.com/scientificreports www.nature.com/scientificreports/ We evaluated the performance of the general model on the dataset described in Section 1.1 of all the four influenza A subtypes considered in this study, using five-fold cross-validation. It scored AUC value = 0.78 (Fig. 7), accuracy = 0.80, sensitivity = 0.65, specificity = 0.92, and correlation = 0.77.

Discussion
Accurate and reliable prediction of antigenic similarity among influenza viruses is necessary for optimal vaccine strain selection. In this work we have presented a new approach for predicting antigenic relatedness of influenza variants based on biophysical ideas.    www.nature.com/scientificreports www.nature.com/scientificreports/ Many factors related to the HA1 protein influence the antigenicity of influenza viruses, spanning from structural conformations 23 to physiochemical features 8 like hydrophobicity; however, these factors are not independent as has been demonstrated by the work of Yuhua et al. 13 . The predictive performance of their model improved when they integrated the model with combination of a few substitution metrics that reflected some of those factors, especially structure-based substitution matrices. But from the properties of thermodynamic and statistical mechanics, energy is a universal currency 38 . Thus, it is expected that the energy would reflect the aggregated net-effect of all the factors associated with the HA1 protein that might influence the antigenicity. This extremely useful inclusion of all factors, in a very simple way, is the power of our approach over other sequence-based methods which try to incorporate a few selective factors through some amino acids substitution metrics.
Sparsity is a hallmark property of cellular processes; for example, in this context, a few amino acids mutations drive most of the antigenic drift events for influenza viruses. Therefore, our model accounts for the sparsity, with the aid of an elastic-net penalization term.
From their cartography model that characterizes the evolutionary dynamics of H3N2 viruses, Smith et al. 15 identified a list of amino acid substitutions that are associated with antigenic drift events between 10 clusters of H3N2-called cluster-difference substitutions-using influenza surveillance data collected between 1968 and 2003. We further investigated the consistency of our model's parameters with that of 15 . As expected, we found that most of these amino acid substitutions corresponded to Hamiltonians of repulsive energy. Of the 67 cluster-difference amino acid mutations, 46 (approximately About 70%) were found to have non-zero energetic contributions (Table 4). Furthermore, most of these cluster transitions are the result of several concurrent cluster-difference substitutions; for example, transition from England 1972 (EN72) to Victoria 1975 (VI75) characterized by 12 amino acid substitutions (Table 4). In such cases we found some of the substitutions contribute a positive amount (repulsive) of energy, while some of them contribute a negative amount (attractive) of energy. A few cluster transitions are the result of single amino acid mutation; for example, the two transitions from Sichuan 1987 (SI87) to Beijing 1989 (BE89) and Beijing 1992 (BE92) to Wuhan 1995 (WU95), are both characterized by only single mutation N145K (at position 145 of HA1 protein, the amino acid asparagine (N) changed to amino acid lysine (K)). In these cases, we found that such substitutions always have strong repulsive energy (see Table 4). Overall, the net energy contribution for any cluster transition is always positive.
The model can be readily incorporated into current surveillance systems for influenza -e.g. by supporting surveillance labs to interpret the antigenic and sequence data they routinely collect. The logical next step of this work is to extrapolate the structural similarities among influenza A viruses and develop a more accurate general purpose and pan-specific computational model that could predict the antigenicity of all the subtypes. Such a model would reconcile both structural inconsistencies among influenza viruses and experimental artefacts associated with antigenic data in order to obtain more accurate results. A further study is needed to produce a forecasting version of the model based on the same biophysical principles which would be helpful in recommending influenza vaccine strains.