MAIVeSS: streamlined selection of antigenically matched, high-yield viruses for seasonal influenza vaccine production

Vaccines are the main pharmaceutical intervention used against the global public health threat posed by influenza viruses. Timely selection of optimal seed viruses with matched antigenicity between vaccine antigen and circulating viruses and with high yield underscore vaccine efficacy and supply, respectively. Current methods for selecting influenza seed vaccines are labor intensive and time-consuming. Here, we report the Machine-learning Assisted Influenza VaccinE Strain Selection framework, MAIVeSS, that enables streamlined selection of naturally circulating, antigenically matched, and high-yield influenza vaccine strains directly from clinical samples by using molecular signatures of antigenicity and yield to support optimal candidate vaccine virus selection. We apply our framework on publicly available sequences to select A(H1N1)pdm09 vaccine candidates and experimentally confirm that these candidates have optimal antigenicity and growth in cells and eggs. Our framework can potentially reduce the optimal vaccine candidate selection time from months to days and thus facilitate timely supply of seasonal vaccines.

maintain these properties through the production pipeline, and have a high growth phenotype.Timely selection of an effective high-yielding CVV is critical for optimal seasonal influenza vaccine manufacturing.In the 2003-2004 influenza season, an A/Fujian/411/2002-like virus was preferred as the recommended A(H3N2) vaccine virus, but the inability to identify CVVs led to a vaccine mismatch 3 .During the A(H1N1)pdm09 pandemic, vaccine supply was delayed due to the poor yield of initial CVVs 4 , and a global vaccine campaign was not initiated until better yielding viruses were obtained which was after the second pandemic wave 5 .
Efforts to generate high-yield CVVs for vaccine manufacturing in embryonated chicken eggs or cultured cells continue year-round even before recommended vaccine viruses are finalized for the upcoming influenza season.The conventional strategy to achieve high yield often involves additional passages in eggs or cells 6 , and genetic approaches rely on reassortment with a donor strain that exhibits high yield traits in eggs or cells 7 .Both approaches may take up to 6 months as well as have limitations.Egg or cell adaptation can result in undesired antigenic changes due to additional mutations in HA and reassortment strategies do not always lead to substantial improvements in yield [8][9][10][11] .Therefore, identifying influenza vaccine viruses with high-yield phenotypes directly from sequences obtained from clinical samples would be ideal and could potentially accelerate vaccine production timelines.
Over the past few years, several computational models [12][13][14] have been developed to identify influenza antigenic variants using genomic sequences (See the discussion section in Supplementary Information (SI)).However, none of these models can be used to directly identify antigenic match and high-yield viruses based on genetic sequences.
In our study, we propose an approach to overcome challenges in influenza vaccine strain selection.We introduce MAIVeSS, a framework that employs machine learning algorithms, to predict antigenic and yield phenotypes using viral genomic sequences from clinical samples.We validated MAIVeSS by screening A(H1N1)pdm09 for ideal CVVs and confirmed the high yield nature of the identified CVVs in both cells and eggs.Our results show that that MAIVeSS can facilitate the selection of naturally circulating influenza vaccine strains with matching antigenicity and high-yield as seed viruses for influenza vaccine production.

Machine-learning assisted influenza vaccine strain selection framework
This study aimed to design MAIVeSS to learn genetic features associated with three key influenza virus biological properties: antigenicity, growth, and receptor-binding (Fig. 1).We implemented and compared a set of machine learning models within MAIVeSS and found that the multi-task learning group-guided sparse learning model (MTL-GGSL) outperformed other state-of-the-art models for predicting antigenicity and glycan binding, while the generalized hierarchical sparse model (GHSM) outperformed other models for assessing growth (see Supplementary Data 1-3).
Using the features learned, MAIVeSS scores CVVs using a query HA protein sequence based on two properties: (1) antigenic properties, and (2) yield properties in eggs and/or cells (HY cell , high-yield in cells; This model has been developed to select high-yield vaccine seeds that match the antigenicity of circulating influenza strains.To achieve this, a library of viruses with mutations in the receptor binding sites of the HA protein is generated and analyzed for their serological reactivity, replication efficiency in eggs and cells, and glycan profiling using microarrays.A sparse learning model will then be applied to identify genetic features that correlate with these phenotypes.Based on these features, a predictive model will be developed to quantify antigenic distances from the currently used vaccine strains and the yield ability in eggs and cells using HA protein sequences as input data.This strategy can be adapted for other subtypes of influenza viruses, and the model can also be modified to incorporate NA sequences.Part of this figure was created with BioRender.com. HY egg , high-yield in eggs; HY both , high-yield in both cells and eggs; LY cell , low-yield in cells; LY egg , low-yield in eggs; LY both , low-yield in both cells and eggs).In this study, the WT virus is defined as a reassortant with wild-type HA and NA genes from A/California/04/2009(H1N1pdm09) (CA/04) and six internal genes from A/Puerto Rico/8/1934(H1N1) (PR8).High-yield is defined as a > 10-fold increase in TCID 50 /mL compared to the WT on the same substrate.By leveraging these predictive models, MAIVeSS can rapidly identify influenza viruses that are both antigenically matched and high-yielding from genome sequences obtained during surveillance.MAIVeSS is accessible through both GitHub (https://github.com/FluSysBio/MAIVeSS)and a webserver (http:// sysbio.missouri.edu/software/MAIVeSS).
In this study, we demonstrated the effectiveness of our machine learning models using A(H1N1)pdm09 viruses as an exemplary application, but the same principles can be readily applied to other subtypes of influenza viruses.

Development of an A(H1N1)pdm09 mutant library for machine learning
To enhance the reliability of feature selection for high-yield viruses, we established a random mutant virus library that targets the HA receptor binding site (RBS) of CA/04 15 .All the mutants were subjected to antigenic analyses using hemagglutination inhibition (HAI) assays, yield analyses in both MDCK cells and embryonated chicken eggs, and receptor-binding profiling through glycan microarrays.The phenotypic data collected were then used as training and testing data in MAIVeSS to identify the molecular features associated with antigenicity and yield and to establish predictive models.
In total, we generated 822 HA-containing plasmids, each carrying one to seven random mutations within or near the HA RBS (residues 119-241, H1 numbering; 126-244, H3 numbering).Using these mutant plasmids, we then generated corresponding mutant viruses via reverse genetics.Rescued mutant viruses had an HA from the mutant pool, a NA gene from CA/04, and the remaining 6 gene segments from PR8.To increase the likelihood of successful virus rescue, the transfection products underwent three passages.If none of the three passages yield positive results, we will conclude that the mutant cannot be rescued.
As a result, a total of 189 mutant viruses bearing unique amino acid substitutions with different biochemical properties were successfully generated (Supplementary Fig. 1).Of the mutant viruses that were successfully rescued, 96 had substitutions within the 119-190 region of the HA, 15 had substitutions within the 190-241 region, and 78 had substitutions in both regions.The positions of the substitutions overlapped with the reported HA antigenic sites Sa (n = 11), Sb (n = 9), Ca1 (n = 7), and Ca2 (n = 7).For consistency, specific amino acid positions in the mutants described in the following context are numbered according to the H3 numbering system.Eighty of the mutant viruses had a single substitution, 80 had two, and 29 had three or more.Interestingly, all substitutions present in the rescued viruses were located outside of the receptor binding site (RBS) and did not directly interact with the receptor molecule (Fig. 2A).In contrast, virus rescue was unsuccessful when substitutions were present within the RBS pocket.

Most mutations did not alter antigenic properties
To determine the antigenic properties of the mutant viruses generated, we performed HAI assay using post-infection ferret antisera.Out of 189 mutant viruses, only 5 mutants had a ≥ 4-fold reduction in their HAI titers relative to the antisera's homologous virus (Supplementary Data 4).These 5 antigenically distinct mutants had at least one substitution in known HA antigenic sites, with other substitutions mostly present within or close to the Ca1, Ca2, Sa, or Sb regions [16][17][18][19] .Of note, ferret antisera generated against WT CA/04 were unable to neutralize the triple mutant HA D131E-S193T-A198S.
We integrated the serological data of the 189 mutants with archived public data for seasonal A(H1N1) (1977-2009) and A(H1N1) pdm09 viruses (2009-2016) 20 .By using these integrated HA sequence and serological data, we utilized MAIVeSS to identify residues that were associated with antigenic changes.The results showed that 30 residues were associated with the antigenicity of A(H1N1) viruses (Table 1 and Supplementary Data 5), and most of these residues were located within or close to the antigenic sites, particularly Ca1, Ca2, Sa, and Sb (Fig. 2B).Of these 30 residues, only position 225 has been reported in the literature to potentially harbor an egg-adapted substitution for 2009 H1N1 viruses (Table 1 and Supplementary Information).
Amino acid substitutions near the HA RBS can result in highyield traits in both cells and eggs We next assessed how the amino acid substitutions introduced affected virus yield in both cells and eggs by measuring the infectious titers (TCID 50 ) for each mutant after growth.We identified 14 HY cell mutants that showed at least a 10-fold increase in virus yield compared to the WT virus as well as 29 LY cell mutants that showed at least a 10-fold decrease (Supplementary Fig. 2 and Supplementary Data 4).The highest yield was observed in the HA N159D-K166I mutant, with a yield of 1.52 × 10 7 TCID 50 /mL, which was about 100-fold higher than WT.Additionally, 33 HY egg mutants and 19 LY egg mutants were identified.The HA D131E-S193T-A198S, HA N159D-K166I, and HA I169F-D225G mutants had the highest titers in eggs, which were approximately 800-fold higher than the WT virus.Of note, these three mutants also exhibited high-yield traits in cells and were thus designated as HY both .
By using the HA protein sequence and their associated yield data, we applied MAIVeSS and identified 38 residues were associated with virus yield (Table 2 and Supplementary Data 6).The majority of these residues were located on the surface of the HA trimer and in close proximity to the RBS pocket (Fig. 2C).Interestingly, we observed that different substitutions at the same position could lead to different outcomes.For example, a change from small polar amino acids to nonpolar amino acids at residue 142 was predicted to enhance virus yield, whereas substitution to polar/charged amino acids at the same position was predicted to reduce virus yield in eggs and cells (Table 2).

Diversified glycan binding facilitates virus replication in cells and eggs
To investigate if the high-yield trait correlates with glycan substructure binding properties, we analyzed the receptor-binding properties of the 189 mutant viruses using glycan microarrays comprising of 75 glycoforms (Supplementary Fig. 3).The binding signals to these glycan isoforms varied widely among the mutants (Supplementary Fig. 4).Notably, all mutants exhibited strong binding avidity to glycans that were terminated with SA2,6 Gal, as we had expected.
By employing biolayer interferometry analyses (BLI) for glycan binding profiling, we confirmed the broadened binding specificity of the HY both mutant HA D131E-S193T-A198S.Specifically, we demonstrated that this mutant not only binds to 6′SLN, but also to 3′ SLN and sLeX (Supplementary Fig. 6).
We used MAIVeSS (see Online Methods) to identify the glycan substructures associated with yield traits in cells and eggs associated with binding preference to these glycan substructures (Supplementary Data 7-11).Our analysis revealed several glycan terminal substructures that were significantly associated with high-yield traits, including 6′SLN, 3′SLN, sLe X , and Neu5Gcα2-6Galβ1-4GlcNAc. Additionally, we found that certain internal substructures, such as core lactose, GlcNAcb1-2, and Galα1-4Galβ1-4GlcNAc, had a significant impact on glycan binding.

A subset of antigenically matched A(H1N1)pdm09 epidemic viruses were high-yield in both cells and eggs
We utilized MAIVeSS to predict both yield in eggs and cells and antigenic properties of sequenced A(H1N1)pdm09 viruses (2009-2020, n = 11,424) (Supplementary Data 12).Using the antigenic distance matrix generated by MAIVeSS, we created a sequence-based antigenic cartography map, which revealed two antigenic clusters, CA/09 and WI/19 (Fig. 3).The acquisition of N159K in the Sa antigenic site was predicted to cause an antigenic drift of the A/Wisconsin/588/ 2019(H1N1) (WI/19) HA from that of CA/09, which was consistent with those reported in other studies 21,22 .
Using MAIVeSS as the prediction tool, a total of 155 viruses were identified as potential HY egg , 433 as HY cell , and 761 as HY both .Of the 1349 viruses identified as high-yield variants, 897 had HA sequences that were directly generated from clinical swabs, while the remaining sequences were generated from viruses grown either in cells (n = 331) and eggs (n = 83).The virus source for HA sequence was unclear for the remaining 38 high-yield variants.
Among predicted as HY both , 294 were CA/09-like viruses (38.6%), while 467 were WI/19-like viruses (61.4%).These high-yield strains were not geographically clustered and were scattered across the phylogenetic trees, without clear association with any particular HA lineages (Fig. 3B).However, the number of HY both strains increased significantly after the emergence of WI/19-like variants (Fig. 3D).Specifically, 256 out of 2,198 (11.65%) viruses in 2019 and 386 out of 895 (43.13%) viruses in 2020 were estimated to be HY both strains.MAIVeSS analysis predicted that the vaccine strain WI/19 has an increase of approximately 105-fold and 23-fold in virus yield in cell and eggs, respectively compared to CA/04.The key residues associated with antigenicity are located in five HA antigenic sites: Sa (in purple), Sb (in raspberry), Ca1/Ca2 (in green), and Cb (in blue), as well as outside these antigenic sites (in orange).C The key residues associated with virus yield in cells and eggs are located near the receptor binding site or outside the receptor binding site.The residues are shown in two views of the structures, and the overlapped residues are boxed.The HA protein structure was modeled using PyMOL with the CA/04HA protein structure from the Protein Databank (PDB) 3LZG as the template.
Multiple amino acid substitutions associated with high yield were observed in the HYboth strains, but the specific amino acid substitutions varied across influenza seasons and even within the same season, depending on the strain (Supplementary Data 13).However, after the 2018-2019 influenza season, viruses with HA K133aN, N159K/D/S, K166Q, S206T, and/or K214R were more likely to be high-yield strains (Fig. 4A).
To validate the predictions of our model, we synthesized the HA and NA genes of 4 viruses predicted as having desirable features, and subsequently generated 4 reassortant viruses (i.e.rgSP/16, rgCQ/17, rgBRU/19 and rgMAS/20) with PR8 as the backbone and determined their antigenic and yield phenotypes. Antigenically, 2 reassortant viruses were experimentally confirmed to be CA/04-like and the other 2 WI/19-like (Fig. 4B).All 4 reassortant viruses grew to final titers of >10 8 TCID 50 /mL in both eggs and cells, which were at least 100-fold higher than the WT virus, which is a reassortant with wild-type HA and NA genes from CA/04 and six internal genes from PR8. (Fig. 4C).These results corroborated both antigenicity and yield predicted by the MAIVeSS model.
Taken together, our findings indicate that the high-yield trait of A(H1N1)pdm09 viruses was distributed across different genetic clusters and has become more prevalent since 2018.Our experimental results confirm MAIVeSS's ability to identify antigenic matches and high-yield vaccine strains for A(H1N1)pdm09 viruses.

Diversifying influenza virus glycan binding profile facilitates the acquisition of high-yield properties
It is well-documented that CA/04 exhibits an exclusive binding preference for 6′SLN and does not bind to 3′SLN 23 .Here we hypothesized that a small portion of naturally circulating A(H1N1)pdm09, such as those we predicted as CVVs, acquired high-yield properties by binding to additional sialylated glycan receptors, particularly 3′SLN, or by increasing their glycan binding avidity to 6′SLN.To test this, we conducted BLI for 6A(H1N1)pdm09 variants, including LY both MI/15 and HY both WI/19, as well as 4 HY both vaccine candidates predicted by MAI-VeSS.The results showed that 3 of the MAIVeSS predicted CVVs, rgSP/ 16, rgCQ/17 and rgMAS/20, bound to both 3′SLN and 6′SLN, whereas MI/15, WI/19, and one predicted CVV, rgBRU/19, bound only to 6′ SLN (Fig. 5A).
Of the residues linked to the high-yield traits of WI/19 and the four MAIVeSS selected viruses, only about half were conserved (Fig. 4A).However, HA N159K, K166Q, and S206T were consistently present in most of the naturally occurring high-yield strains (Supplementary Data 12) and high-yield mutants from our mutagenesis study (Supplementary Data 4).
We further investigated the effect of HA N159K, K166Q, and S206T on glycan binding affinity by conducting structural modeling based on the crystal structure of CA/04 HA complexed with 6′SLN and 3′SLN (Fig. 5B).In both complex structures, N159 was substituted with K, and energy minimization was performed to detect any possible allosteric structural changes that could affect ligand binding.In the HA:3′SLN structure, the sidechain of K159 flips toward the 190-helix, forming hydrogen bonds with the sidechains of both Q196 and Q192.This could cause a shift or tilt in the orientation of the 190-helix, resulting in a more compact receptor binding pocket and stronger binding with 3′ SLN.In contrast, 6'SLN in the HA:6′SLN complex closely interacts with the 190-helix even in the original CA/04 structure, so the K159 mutation does not significantly enhance the binding of 6'SLN binding to HA.Additionally, K166Q may impact the conformation of the 130-loop, while S206T substitution has the potential to modify the structural conformation of the 220-loop (Supplementary Fig. 7), thus affecting the binding of HA to glycan receptors.Therefore, our modeling analysis supported that these three substitutions in HA can substantially increase the binding affinity of 3′SLN without major impact on the binding of 6′SLN.
In summary, diversity at the HA RBS of A(H1N1)pdm09 viruses can enhance virus yields in both cell and egg substrates by increasing sialylated glycan binding avidity or diversifying virus binding to different sialylated glycan receptors.

Discussion
In this study, we developed MAIVeSS, a machine learning based framework, that can accurately predict both antigenicity and growth phenotypes based on HA protein sequences.The training dataset consisted of a library of 189 mutant viruses generated by epPCR-based reverse genetics targeting residues 126-244 (H3 numbering).We observed that acquisition of HA N159K, a key marker for antigenic drift according to our model, led to changes in antigenicity from CA/09 to WI/19, as determined by post-infection ferret antisera, consistent with published reports 21,22 and facilitated acquisition of the high-yield trait in a significant proportion of A(H1N1)pdm09 epidemic strains during recent influenza seasons (Fig. 3E).While our current model focuses on HA, it is important to note that antigenic drift of neuraminidase (NA) has also been well-documented in A(H1N1) and A(H3N2) influenza viruses 24,25 .In addition, it's worth noting that the antigenicity data used in our model training were derived from ferret antisera generated from native ferrets.While the antigenicity data reflected by sera generated from native ferrets and human sera without virus priming (such as infants) are generally comparable, it's important to note that they can differ from those generated from individuals with pre-existing The position in which a potential egg adapted amino acid substitution L194I was reported to affect antigenic properties 60 .
Article   The substitutions associated with HY both or LY both were in bold. d The positions reported to potentially harbor an egg adapted amino acid substitution, of which D131E and D225G/N were reported to affect antigenic properties 60 .e "−" denotes it is unknown how this residue will affect yield. f The bootstrap values were derived from 100 independent experiments, each with 80% of the training data. g The amino acid substitutions highlighted in bold are those anticipated to be associated with each specific yield phenotype.
immunity, particularly in adults.This is because adults commonly have a complicated influenza priming history with multiple infections and/ or vaccinations, which can significantly affect their immune response 26 .As such, our ongoing efforts are aimed at expanding the MAIVeSS prediction capacity to include both HA and NA proteins, as well as human serological data for training antigenicity analyses.
To determine if the high-yield trait correlates with glycan substructure binding properties, we analyzed the receptor-binding properties of all 189 mutant viruses.The glycan profiling analysis conducted on 43 high-yield mutants suggested that diversifying glycan binding profiles could enhance virus replication in both eggs and cells.Specifically, increased binding avidities to SA2-6Gal results in higher virus yield in mammalian cells, while broadening glycan binding capabilities to SA2-3Gal or sLe X improves virus yield in eggs (Supplementary Data 4).Our studies indicate that a small subset of A(H1N1)pdm09 epidemic viruses naturally bind to both SA2-6Gal and SA2-3Gal, allowing them to replicate efficiently in both cells and eggs without adaptation.On the other hand, similar to CA/04 and MI/15, some highyield strains (e.g.WI/19) bind only to 6′SLN but not 3′SLN (Fig. 5A), indicating that other glycan substructures present in eggs and/or cells may be involved in the high-yield trait for these epidemic viruses.It should be noted that both virus and host factors, such as innate immune responses and virus fitness, in addition to virus-receptor binding can impact virus replication in mammalian cells.Further studies are needed to investigate these possibilities.
Both SA2-6Gal and SA2-3Gal receptors are expressed in MDCK cells and chicken embryonated eggs.However, SA2-3Gal receptors are predominantly expressed in eggs while MDCK cells contain a similar amount of SA2-6Gal and SA2-3Gal 27 .In addition to SA2-3Gal and SA2-6Gal, neutral glycans such as high-mannose glycans and glycans terminated with Gal and GalcNAc are also commonly found in eggs.Mass spectrometry analyses showed some glycans in eggs are fucosylated 28 .CA/04, the prototype A(H1N1)pdm09 virus which showed poor replication in both MDCK cells and eggs, had a strong binding preference for SA2-6Gal and did not bind to SA2-3Gal 29 .In humans, there is no direct selection pressure to increase cell-based or egg-based replication efficiency.Thus, our findings suggested that ad hoc substitutions at the HA RBS across A(H1N1)pdm09 viruses likely enabled a subset of these variants to expand their binding preference from SA2-6Gal to both SA2-6Gal and SA2-3Gal, resulting in the acquisition of a high-yield trait.This study illustrates the feasibility of selecting HA sequences from naturally circulating strains as high yield candidates for recombinant vaccine development, by eliminating the need for further engineering.In summary, the data from the proof-of-concept experiments in this study confirmed that MAIVeSS enables rapid selection of antigenically matched and high-yielding influenza strains directly from clinical isolates as potential seed viruses to accelerate vaccine production and facilitate timely supply of seasonal vaccines.

Ethics statement
Animal study protocols were reviewed and approved by the Institutional Animal Care and Use Committee at Mississippi State University (#14-039) and University of Missouri-Columbia (#9656).All animal experiments were performed in an animal biosafety level 2 (ABSL2) research facility at Mississippi State University.Standard operating procedures for work with infectious influenza viruses were approved by the Institutional Biosafety Committee at Mississippi State University (#16-09 and #022-16) and University of Missouri-Columbia (#19-09) and performed under BSL2 conditions.

Machine-learning assisted influenza vaccine strain selection framework (MAIVeSS)
Machine learning models have been shown to be effective in identifying antigenicity associated features in protein sequences from different subtypes of influenza A viruses [30][31][32][33][34][35] , We developed machine learning models to identify the specific sequence features in HA proteins that determine three important phenotypes: antigenicity, yield in cells and eggs, and receptor binding.To achieve this, we trained our models on large datasets of HA protein sequences and associated phenotype information.We also developed a quantitative function that allows us to measure the distances between sequences based on their phenotypic characteristics.Our ultimate goals for these machine learning models are to identify: 1) mutations in the HA RBS that affect virus antigenicity; 2) mutations in the HA RBS that increase or decrease virus yields in cells and/or eggs; and 3) specific glycan substructures (glycan motifs) on the surface of cells or eggs that are associated with increased yields of influenza virus.By achieving these goals, we hoped to gain a better understanding of the molecular determinants of these important viral phenotypes and to identify potential targets for the development of improved influenza vaccines.
We approached the problem of identifying genetic features associated with influenza virus phenotypes using a sparse learning model.Mathematically, this model involves a linear regression loss function with regularization, which allows us to determine the most relevant genetic features associated with a given phenotype.The sparse learning model combines a least squares loss with a regularized   term and takes into account genetic distance matrices among HA proteins or glycan sequences (denoted as X), phenotypic differences (denoted as y), and sample numbers (denoted as N).This approach enables us to identify the key genetic features that contribute to different influenza virus phenotypes, such as antigenicity, yield, and receptor-binding.
The objective of our sparse learning model is to solve: where LðX ,y,wÞ is the loss function, λ is a pre-defined regularization parameter, RðwÞ denotes the regularization term, and w denotes the numerical weights of individual features (either a single residue or a group of neighboring residues).Absolute values of the weights indicate the impact of each mutation of a specific feature to phenotypes (i.e., antigenic, yield, and receptor-binding properties).The larger the absolute weight, the greater the impact.
Based on the features learned from sparse learning, we developed a predictive model to assess antigenic or yield properties given HA sequences.Specifically, where ŷ is the predicted phenotypic distance (either antigenicity or yield) between the two viruses; x is the feature distance vector; and w is the weight vector for those features, which can be associated with either antigenicity or yield.

Multi-task learning group-guided sparse learning (MTL-GGSL) model
To address the challenges associated with integrating serological data generated from different platforms (e.g., turkey and guinea red blood cells), we utilized a Multi-Task Learning (MTL) approach with Group Graphical Sparse Learning (GGSL) to analyze antigenicity.This approach allowed us to consider both N-linked glycosylation and amino acid features when analyzing the data.MTL allows us to learn multiple related tasks (i.e., analyzing antigenicity from different serological platforms) simultaneously, while GGSL considers the dependencies between different groups of features to improve the accuracy of the analysis.By utilizing MTL-GGSL, we were able to overcome the challenges associated with integrating data from different platforms and provide a more comprehensive analysis of antigenicity 20,36 .One advantage of using the group LASSO regularization in MTL-GGSL for antigenicity analyses is that it encourages multiple predictors from related tasks to share a subset of features.This is in contrast to the LASSO regularization, which may lead to sparse solutions where only a few features are selected for each task independently.Our previous study has shown that incorporating information on N-linked glycosylation can improve the performance of sparse learning models in predicting antigenic properties of influenza viruses 20 .By adopting MTL-GGSL, we are able to integrate information on both glycosylation and amino acid sequences from serological data generated using different platforms, which can further enhance the accuracy of our predictive models for influenza antigenicity.
Specifically, we define and the model is formulated as: where λ 1 , λ 2 , and λ 3 are regularization parameters, j is the subscript for feature, p is the total number of features, G l denotes feature group, q is the number of feature groups, α l = p m l is the weight of feature group G l ; W j denotes the weights for the j-th feature among different tasks, and W G l ,t as the weight for feature group G l of the t-th task.Alternating Direction Method of Multipliers (ADMM) 37 was employed to solve the optimization problem.

The generalized hierarchical sparse model (GHSM)
To consider synergistic effects of multiple features on the phenotypes, we adopt GHSM 38 in this study.The GHSM model aims to minimize: GHSM model solves the following objective: where λ and α are two regularization parameters controlling the sparsity and the decay in the coefficients for interactions of different orders, z ðkÞ <i 1 ,ÁÁÁ,i k > = x i 1 x i2 Á Á Á x ik denotes a data vector for the k-th order interaction corresponding to <i 1 , Á Á Á ,i k >, an interaction index <i 1 , Á Á Á ,i k >, where i 1 < Á Á Á <i k , is an index to uniquely indicate the interaction among the covariates i 1 , Á Á Á ,i k , W denotes the set of parameters fw ðkÞ g Þ ! with w ðkÞ <i 1 ,ÁÁÁ,i k > as its element corresponding to the index <i 1 , Á Á Á ,i k >, jj Á jj 2 denotes l 2 norm of a vector and denotes the element wise product of two vectors.The constrains associated with each covariate i have a chain of ðK À 1Þ inequality constraints, and there is a total of d chains.The application of these models for antigenicity, yield, receptor-binding are detailed as below.

Antigenicity analyses
In this study, we used eight individual tasks, each corresponding to an individual HAI dataset, including those for seasonal A(H1N1) viruses (1977-2009), A(H1N1)pdm09 viruses (2009-2020), swine A(H1N1) viruses, and mutants generated from this study.In each task, the lowrank matrix completion algorithm was used to minimize data noise and the challenges derived from low reactors and missing values in the HAI datasets, and antigenic cartography was then used for antigenic distance calculation 20,39 .Two groups of features (i.e., amino acid mutations and N-glycosylation sites) were used in the model to quantify influenza virus antigenic distances.We defined 327 residue features and 6 N-glycosylation site features.GETAREA software (http:// curie.utmb.edu/getarea.html)was used to predict whether these residues were on HA's surface.The A(H1N1)pdm09 three-dimensional HA structure (Protein Data Bank [PDB] identifier [ID] 3LZG) was used as the template.A total of 138 residues were predicted to be located at the HA protein's surface (Supplementary Data 14).All amino acid residues, with a variant rate >10%, were considered as non-conserved sites and included in the machine learning model.Finally, a total of 86 residues with 4 N-glycosylation sites were used as features in the machine learning model.

Yield analyses
In this study, we analyzed the yield of 189 mutants compared with the parent WT CA/04, which is 6:2 reassortant virus, in both cells and eggs.To analyze the data, we utilized two groups of features: amino acid substitutions and N-glycosylation sites.Furthermore, we employed the GHSM approach to identify synergistic amino acid substitutions associated with virus yield in eggs or cells.Specifically, to ensure the feasibility of our analyses, we constrained the highest order to 3. It's worth noting that there were 3468 second-order interactions and 12,337 third-order interactions in our GHSM analyses.

Glycan binding analyses
In this study, we used a glycan microarray with 75 glycoforms (Supplementary Fig. 1), which were grouped based on their internal and terminal substructures and linkers into a matrix of 27 glycan substructure features.We then utilized the Multi-Task Learning with Group Graphical Sparse Learning (MTL-GGSL) approach to determine the substructures associated with yield traits in cells and eggs.In the model, we employed three groups of features, including terminal substructures (n = 16), internal substructures (n = 8), and base substructures linked to the array (n = 3).

Model comparison, parameter optimization, and bootstrapping analyses
In order to ensure the robustness of our analyses on antigenicity, yield, and glycan binding, we compared MAIVeSS with five other sparse learning models, including the L1-norm regularized method (LASSO) 40 , the L2-norm regularized method (RIDGE) 41 , the sparse group LASSO method (SGL) 42 , the L1-and L2-norm regularized method 43 , and the L1and L∞-norm Composite Absolute Penalties method (iCAP) 44 (Supplementary Data 1-3).The latter two models incorporate both L1-and L2-norm regularization.
For antigenicity analyses, our comparison additionally included three primary machine learning models mentioned in the literature, along with three deep learning approaches.The conventional machine learning methods consist of Support Vector Machine (SVM) [45][46][47] , Naïve Bayes 14,[48][49][50] , and Random Forest 35,51,52 .The deep learning methods include Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Transformer.As the source codes for these models from the literature are not publicly accessible, we utilized the machine learning models available in MATLAB package (R2023a) for comparison.
To develop and evaluate our models, we allocated 90% of our data for training and validation, and the remaining 10% for testing.The testing dataset was excluded from parameter optimization to avoid potential overfitting.Within the combined training and validation dataset, we employed 10-fold cross-validation by segregating the data, 90% for training and the remaining 10% for validation, to fine-tune our parameters and evaluate the training performance (Supplementary Figs.8-13).As the results, we set λ 1 equals 2, λ 2 equals 0.01, and λ 3 equals 0.01 as optimal parameter for MTL-GGSL model, and λ equals 0.0001 and α equals 10 for GHSM model.
To evaluate the performance between models in MAIVeSS and the previously mentioned comparison models, we assessed their RMSE and Pearson correlation coefficient between predicted values and ground truth for both training (using 10-fold cross-validation) and testing datasets for antigenicity, yield, and glycan binding analyses.For antigenicity analyses, we also recast it as an antigenic distance classification problem to assess the model's efficacy in identifying antigenic variants.A virus pair is classified as an antigenic variant if its paired antigenic distance is 4-fold or greater; otherwise, it is not 53 .We included accuracy, precision, recall, specificity, and F1 score as performance metrics for both training and testing (detailed in the Supplementary Information).
To investigate the effect of amino acid substitutions on both yield and glycan binding phenotype, we employed a grouping method for amino acids 54 .Each amino acid was assigned to one of three groups based on its biophysical properties: nonpolar (V, L, I, M, C, F, W, and Y), small polar (G, A, and P), and polar/charged (S, T, N, Q, H, D, E, K, and R).HA protein sequence was encoded into a vector X i by comparing to a wild-type sequence and if a mutation occurred in residue j (e.g., nonpolar to small polar), we encoded the j-th element of X i to 1; otherwise, we encoded it to 0. To evaluate the directionality of amino acid substitutions on both yield and glycan binding phenotype, we used three different sparse models (LASSO, RIDGE, and SGL) and performed bootstrap analyses (detailed in Supplementary Data 5-7).In brief, we selected all features with a bootstrap value cutoff of 80 from 100 independent runs.

Predictive model
In this study, a predictive model was developed to estimate the antigenic distance between two viruses based on their genetic sequences.The model was defined as follows: where x is the genetic distance vector between the two viruses, y ̂is the predicted antigenic distance between them, w global is the global weight representing the average of weights across different tasks, w local indicates the weights from each individual task, and μ is set to 0.4 to balance the global and local weights.
In addition, a scoring function was proposed to measure yield differences between two viruses based on their amino acid sequences.The scoring function is defined as follows: Here, w and z were the weight and feature matrices used in the GHSM approach mentioned above.The detailed prediction results for both the antigenic distance and yield differences are presented in Supplementary Data (Supplementary Data 12).

Cells and viruses
Human embryonic kidney (293T) cells and Madin-Darby canine kidney (MDCK) CCL-34 cells were obtained from the American Type Culture Collection (Manassas, VA).The cells were maintained in Dulbecco's Modified Eagle Medium (GIBCO/BRL, Grand Island, NY; catalog number 11965-092) supplemented with 5% fetal bovine serum (Atlanta Biologicals, Lawrenceville, GA; catalog number S12450H) and penicillin-streptomycin (Invitrogen, Carlsbad, CA; catalog number 15140122) at 37 °C with 5% CO2.The HA gene of CA/04 was cloned into the vector pHW2000 and used as a template to construct the mutant library.The viruses generated by reverse genetics were propagated in MDCK cells and cultured at 37 °C with 5% CO2 in Opti-MEM medium (GIBCO/BRL, Grand Island, NY; catalog number 11058-021) supplemented with 1 μg/ml of TPCK (N-tosyl-L-phenylalanine chloromethyl ketone)-Trypsin (Sigma-Aldrich, St. Louis, MO; catalog number T1426) and penicillin-streptomycin (Invitrogen, Carlsbad, CA; catalog number 15140122).The virus titers were determined by TCID 50 in MDCK cells.

Sequence and serological Data
Serologic data for A(H1N1) viruses were collected from data described elsewhere 13,55,56 , and included HAI titers generated between 1,015 viruses and 194 serum samples (Supplementary Data 15).A total of 11,424 A(H1N1)pdm09 HA protein sequences from 2009 to 2020 were obtained from GISAID (https://gisaid.org).

Construction of plasmid library, gene synthesis, and rescue of mutants
The mutant plasmid library with random mutations in the HA RBS was generated using the epPCR strategy 15 .Four primers were used to generate the HA-pHW2000 RBS mutant library: 1) 130loop_F: 5'-TCA TGG CCC AAT CAT GAC TCG AAC-3'; 2) 190helix_F: 5'-TGG GGC ATT CAC CAT CCA TCT ACT-3'; 3) 190helix_R: 5'-AAC ATA TGT ATC TGC ATT CTG ATA-3'; and 4) 220loop_R: 5'-TAG TGT CCA GTA ATA GTT CAT TCT-3'.The epPCR product (2 μl) was transfected into XL1-Blue Supercompetent Cells (Agilent Technologies, Santa Clara, CA; catalog number 200236).The transformed cells were directly inoculated onto LB (Luria Bertani) agar plates, and the clones were propagated in 5 ml of LB media.The clones generated from the RBS mutant library were confirmed by Sanger sequencing using the sequencing primer 5'-GAA CGT GTT ACC CAG GAG ATT-3'.Mutant viruses were rescued by plasmid-based reverse genetics with the NA genes from CA/04 and six internal genes from PR8, as described elsewhere 57 .To compare the phenotypes of the predicted vaccine candidates, we also generated the WT virus, a parent prototype 6:2 reassortant virus ith wild-type HA and NA genes from CA/04 and six internal genes from PR8, by using reverse genetics.To confirm the lack of any undesired egg or cell-adapted amino acid changes, each mutant's HA genes were confirmed by using Sanger sequencing post-rescue and propagation.

Evaluation of viral yield
To evaluate the effect of mutations on viral yield, we performed cell culture assays and embryonated egg assays.For the cell culture assays, we inoculated MDCK cells with each influenza virus at a multiplicity of infection of 0.001 and incubated the cells at 37 °C with 5% CO2 for 1 h.After incubation, the inocula were removed, and the cells were washed twice with phosphate-buffered saline (PBS).Then, the cells were incubated with Opti-MEM I (GIBCO, Grand Island, NY; catalog number 11058-021) containing TPCK-trypsin (Sigma-Aldrich, St. Louis, MO; catalog number T1426) (1 µg/ml) at 37 °C with 5% CO2.After 48 h, 200 µl of supernatants were collected, aliquoted, and stored at −80 °C until use.For the embryonated egg assays, 9-day-old specific pathogen-free chicken eggs were inoculated with 200 TCID 50 of each virus and incubated at 37 °C for 72 h, and allantoic fluid were collected.The viral titers in the samples from both the MDCK cells and the embryonated eggs were determined using TCID50 assays in MDCK cells.
Conventional methods for quantifying yields of inactivated influenza vaccines rely on the HA protein, typically determined by SDS-PAGE gel following PNGase treatment.However, this procedure is labor-intensive, preventing us from quantifying the yields of all 196 mutants propagated in both eggs and cells.We quantified the total proteins obtained from ultracentrifugation purification of supernatants from virus-infected cell or egg cultures and analyzed their correlations with the viral titration TCID50.Pearson correlation analysis showed that the total protein yields are positively correlated with TCID50 (Supplementary Fig. 14).Additional SDS-PAGE analyses indicated that approximately 40% of the total proteins are HA proteins (Supplementary Figs. 15 and 16).Consequently, in this study, we used TCID50 titers to assess vaccine yield.

Virus concentration and purification
Viruses for the glycan microarray analysis were purified as described elsewhere 23 .Briefly, viruses were purified from the cell supernatant or allantoic fluid by low-speed clarification (2482 × g, 30 min, 4 °C) to remove debris and then followed by ultracentrifugation through a cushion of 30-60% sucrose in a 70Ti Rotor (Beckman Coulter, Fullerton, CA) (100,000 × g, 3 h, 4 °C).The virus pellet was re-suspended in 100 μl of PBS and stored at −80 °C until use.

Glycan microarray
To identify unique substructures bound specific sets of mutants, a glycan microarray with 75 glycoforms were printed on N-hydroxysuccinimide (NHS)-derivatized slides 23 .The 75 glycans were selected to represent four different glycan categories, including N-glycans, Asn-linked N-glycans, Gangliosides, Thr-linked O-mannosyl glycans (Supplementary Fig. 3).These glycans on the microarray have the same base structures and spacer arms but different terminal structures.The glycans were printed in replicates of four in a subarray, and sixteen subarrays were printed on each glass slide.All glycans were prepared at a concentration of 100 mM in phosphate buffer (100 mM sodium phosphate buffer, pH 8.5).The slides were fitted with a 16-chamber adapter to separate the subarrays into individual wells for assay.The unreacted NHS groups on the slides are blocked with 50 mM ethanolamine in 50 mM sodium borate buffer (pH 9.2) at 4 °C for 1 h and then the slides are rinsed with water.Before the assay, slides were rehydrated for 5 min in TSMW buffer (20 mM Tris-HCl, 150 mM NaCl, 0.2 mM CaCl 2 , and 0.2 mM MgCl 2 , 0.05% Tween).Viruses are purified by sucrose density gradient ultracentrifugation and titrated to about 32,000 hemagglutination units/ml.Then 10 μl of 1.0 M sodium bicarbonate (pH 9.0) was added to 80 μl of virus, and the virus was incubated with 10 μg of Alexa Fluor 488 NHS Esters (Succinimidyl Esters; Invitrogen, Carlsbad, CA; catalog number A20100) for 1 h at 25 °C.After overnight dialysis to remove excess Alexa 488, viruses HA titer were checked and then bound to glycan array.Labeled viruses were incubated on the slide at 4 °C for 2 h, washed, and centrifuged briefly before being scanned with an InnoScan 1100 AL fluorescence imager (Innopsys, Carbonne, France).

Haemagglutination and HAI assays
Haemagglutination and HAI assays were performed by using 0.5% turkey erythrocytes as described by the WHO Global Influenza Surveillance Network Manual for the Laboratory Diagnosis and Virological Surveillance of Influenza.Turkey erythrocytes were obtained from Lampire Biological Products (Everett, PA; catalog number 7209403).The turkey erythrocytes were washed three times with 1 × PBS (pH 7.2) before use and then diluted to 0.5% in 1 × PBS (pH 7.2).Ferret antisera used in the HAI assays were produced by infecting influenza seronegative ferrets (see details in Supplementary Information) or obtained from BEI Resources (https://www.beiresources.org) or International Reagent Resource (https://www.internationalreagentresource.org).

Fig. 1 |
Fig.1| Machine-learning Assisted Influenza VaccinE Strain Selection framework (MAIVeSS).This model has been developed to select high-yield vaccine seeds that match the antigenicity of circulating influenza strains.To achieve this, a library of viruses with mutations in the receptor binding sites of the HA protein is generated and analyzed for their serological reactivity, replication efficiency in eggs and cells, and glycan profiling using microarrays.A sparse learning model will then be

Fig. 2 |
Fig. 2 | Characterization of mutant viruses targeting the receptor-binding site of the HA protein.The HA of A/California/04/2009(H1N1) (CA/04) was used as the template to develop the error-prone PCR mutant library.A The regions in the HA receptor-binding site where viable mutant viruses were obtained are shown in green, while the regions where no viable mutant viruses were obtained are shown in red.The receptor binding pocket includes the 130-loop, 150-loop, 190-helix, and 220-loop.B The key residues associated with antigenicity are located in five HA 6804 (0.2477)The location of these residues in the HA protein structure are illustrated in Fig. 2C. a ABS antibody binding site, RBS receptor binding site.b Each amino acid is assigned to one of the three groups: nonpolar (NP) (V, L, I, M, C, F, W, and Y), small polar (SP) (G, A, and P), and polar/charged (P) (S, T, N, Q, H, D, E, K, and R) based on their biophysical properties.c

Table 1 |
Antigenicity associated residues selected by MAIVeSS for the A(H1N1)pdm09 viruses Global weight (w global ) learned from the MTL-GGSL model in the MAIVeSS, and the absolute local weight for each individual task (w local ) is available from Supplementary TableS5.
a ABS antibody binding sites, RBS receptor binding site, Gly N-linked glycosylation.bThe bootstrap values were derived from 100 independent experiments, each with 80% of the training data.cd

Table 2 |
Amino acid substitutions associated with yield traits selected by MAIVeSS for the A(H1N1)pdm09 viruses

Table 3 |
Hemagglutination inhibition (HAI) titers of the vaccinate candidates selected by MAIVeSS for 2009 H1N1 virusesThe HAI assays were performed in triplicate using 0.5% turkey red blood cells.The homologous HAI titers are highlighted in bold.Ferret antisera were produced by infecting influenza seronegative ferrets (see details in Supplementary Information) or obtained from BEI Resources (https://www.beiresources.org) or International Reagent Resource (https://www.internationalreagentresource.org).