Artificial intelligence predicts the immunogenic landscape of SARS-CoV-2 leading to universal blueprints for vaccine designs

The global population is at present suffering from a pandemic of Coronavirus disease 2019 (COVID-19), caused by the novel coronavirus Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The goal of this study was to use artificial intelligence (AI) to predict blueprints for designing universal vaccines against SARS-CoV-2, that contain a sufficiently broad repertoire of T-cell epitopes capable of providing coverage and protection across the global population. To help achieve these aims, we profiled the entire SARS-CoV-2 proteome across the most frequent 100 HLA-A, HLA-B and HLA-DR alleles in the human population, using host-infected cell surface antigen presentation and immunogenicity predictors from the NEC Immune Profiler suite of tools, and generated comprehensive epitope maps. We then used these epitope maps as input for a Monte Carlo simulation designed to identify statistically significant “epitope hotspot” regions in the virus that are most likely to be immunogenic across a broad spectrum of HLA types. We then removed epitope hotspots that shared significant homology with proteins in the human proteome to reduce the chance of inducing off-target autoimmune responses. We also analyzed the antigen presentation and immunogenic landscape of all the nonsynonymous mutations across 3,400 different sequences of the virus, to identify a trend whereby SARS-COV-2 mutations are predicted to have reduced potential to be presented by host-infected cells, and consequently detected by the host immune system. A sequence conservation analysis then removed epitope hotspots that occurred in less-conserved regions of the viral proteome. Finally, we used a database of the HLA haplotypes of approximately 22,000 individuals to develop a “digital twin” type simulation to model how effective different combinations of hotspots would work in a diverse human population; the approach identified an optimal constellation of epitope hotspots that could provide maximum coverage in the global population. By combining the antigen presentation to the infected-host cell surface and immunogenicity predictions of the NEC Immune Profiler with a robust Monte Carlo and digital twin simulation, we have profiled the entire SARS-CoV-2 proteome and identified a subset of epitope hotspots that could be harnessed in a vaccine formulation to provide a broad coverage across the global population.

Each identified hotspot is a candidate element of a vaccine. Each candidate vaccine element is associated with a cost , while a total budget b is available for including elements in the vaccine. The description of the budget and costs depend on the vaccine platform.
Some vaccine platforms are mainly restricted to a fixed number of vaccine elements; in this case, each cost will be 1, and the budget will indicate the total number of elements which can be included.
Some other vaccine platforms are restricted to a maximum length of included elements. In this case, each cost will be the length of the vaccine element, and the budget will indicate the maximum length of elements which can be included.
Step 2. Create a set of "digital twin" citizens Our approach is based on simulating a set of "digital twin" citizens. In this work, we focus on vaccine elements whose effects are determined, in part, by the HLAs of each citizen. Thus, each digital twin corresponds to a set of HLA alleles.
It is known 1 that citizens from different regions of the world tend to have different sets of HLA alleles; further, some combinations of HLA alleles are more common than others. We use full HLA genotypes from actual citizens available from high-quality samples in the Allele Frequency Net Database 2 (AFND) to accurately model these relationships.
Creating a distribution over genotypes for each region. In particular, AFND assigns each sample to a region based on where the sample came from (e.g., "Europe" or "Sub-Saharan Africa"). In a first step, we create a posterior distribution over genotypes in each region based on the observations and an uninformative (Jeffreys) prior distribution.
Specifically, we collect all genotypes observed at least once across all regions; we assign an index g to each genotype, and we call the total number of unique genotypes as G. Second, we specify a prior distribution over genotypes. We use a symmetric Dirichlet distribution with concentration parameter of 0.5 because this distribution is uninformative in an information theoretic sense and does not reflect strong prior beliefs that any particular genotypes are more likely to appear in any specific region. For each region, we then calculate a posterior distribution over genotypes as a Dirichlet distribution as follows.
where is the (prior) concentration parameter for the ℎ genotype (always 0.5 here) and is the number of times the ℎ genotype was observed in the region.
We can now use this distribution to sample genotypes from a region using a two-step process.
where n is the desired number of genotypes to sample from the region, and 1 , … , are the counts of each genotype in the sample.
Creating a set of "digital twin" citizens. We create a set of digital twin citizens using a twostep approach. Our method must be given the population size p, as well as a distribution over regions. Concretely, the input is a Dirichlet distribution over the regions, as well as p.
(We note that this Dirichlet is completely independent of those over genotypes discussed in the previous section.) The number of citizens from each region is sampled using the same two-step sampling process described above.
Second, the genotypes for each region are sampled using the posterior distributions over genotypes discussed above.
Step 3. Create a tripartite graph We next use the vaccine elements and digital twins to construct a tripartite graph that will form the basis of the optimization problem for vaccine design. The graph has three sets of nodes: 1. All candidate vaccine elements identified in Step 1 2. All HLA alleles in all digital twin genotypes

All digital twins
The graph also has two sets of weighted edges: 1. An edge from each vaccine element to each HLA allele . The weight of this edge is log ( = −| , ), that is, the likelihood of no response for the allele from that particular vaccine element. (We describe below an approach for calculating this value for short peptides.) 2. An edge from each allele to each citizen which has that allele in its genotype. The weight of these edges is always 1.
As an intuition, we call the edges from a vaccine element to an allele (and, then, from the allele to each patient with that allele) as "active" when the vaccine element is selected. Then, the log likelihood of response for a citizen is the sum of all active incoming edges. That is, the flow from selected vaccine elements to the citizens gives the likelihood of no response for that citizen.
This definition does not include V in the conditioning set of the likelihood. Thus, it does not account for interactions among vaccine elements, such as immunodominance.
Calculating the likelihood of no response for a given digital twin and vaccine elements. We now describe example approaches for calculating log ( = −| , ) for three types of vaccine elements. Our vaccine design approach is applicable for any approach which assigns a value for log ( = −| , ).

Short peptide sequences.
Most short peptide prediction engines 3 compute some sort of a score that a peptide will result in some immune response (e.g., binding, presentation, cytokine release, etc.), and this score generally takes into account a specific HLA allele. In some cases, this is already a probability, and in others, it can be converted into a probability using a transformation function, such as a logistic function.

Long peptide sequences. Longer peptide sequences may include multiple short
peptide sequences with different scores from the prediction engine. An example approach to calculate log ( = −| , ), where v is the long peptide sequence, is to take the minimum (i.e., best) log ( = −| , ), where p is any short peptide contained in .
3. Longer amino acid sequences. Longer amino acid sequences may contain even more short peptide sequences, and the same approach used for long peptide sequences can be used here.

Step 4. Selecting a set of vaccine elements
Finally, we pose the vaccine design problem as a type of network flow problem through the graph defined in Step 3. In particular, the minimization problem can be posed as an integer linear program (ILP); thus, it can be provably, optimally solved using conventional ILP solvers.
Handling the minimax problem. As previously described, our goal is to choose the set of vaccine elements which minimize the log likelihood of no response for each patient. In practice, we can remove V from the conditioning set. Thus, the terms inside the summation are exactly those calculated in Step 3 as the weights on the edges in the graph.
Standard ILP solvers cannot directly solve this minimax problem; however, we use the standard approach of a set of surrogate variables to address this problem. In particular, we define to be the log likelihood of no response for citizen . That is, ≔ ∑ ∑ log ( = −| , ) ∈ ( ) ∈ . Further, we define ≔ max ∈ ; that is, z is the maximum log likelihood that any citizen does not respond to the vaccine (or, alternatively, the minimum log likelihood that any citizen will respond to the vaccine). Finally, then, our aim is to minimize z. ILP formulation. Our ILP formulation consists of three types of variables: • : one binary indicator variable for each vaccine element which indicates whether it is included in the vaccine for the given population. We usually index vaccine elements with i.

•
: one continuous variable for each citizen in the population which gives the log likelihood of no response for that citizen. We always index citizens with j.

•
: one continuous variable for each HLA allele which gives the log likelihood of no response for that allele. We always index alleles with k.

•
: one continuous variable which gives the maximum log likelihood that any citizen does not respond to the vaccine. (Our goal will be to minimize this value.) Additionally, the ILP uses the following constants: • , : the log likelihood that vaccine element does not cause a response for allele k.
• : the maximum cost of vaccine elements which can be selected.
Finally, the ILP uses the following constraints: • = ∑ , ⋅ : one constraint for each allele which gives the log likelihood that at least one selected peptide results in a positive response for that allele • = ∑ ∈ ( ) : one constraint for each citizen which gives the log likelihood that at least one selected peptide results in a positive response for at least one allele for that citizen. (That is, this is the likelihood of a positive response for this citizen.) • ≥ ∑ ⋅ : the vaccine elements we select cannot exceed the budget • ≥ : as discussed above, we use z as an approach to solve the minimax problem. These constraints imply that z is the minimum log likelihood that any individual patient will respond to the vaccine.
Objective: The objective of the ILP is to minimize z.
The setting of the binary variables corresponds to the optimal choice of vaccine elements for the given population.
Relationships to max-flow and other problems with provably efficient solutions. This is highly-related to a number of efficiently solvable network flow problems. Our problem is essentially a min-flow problem with multiple sinks, where each citizen is a sink; however, our aim is to minimize the flow to each individual sink rather than the flow to all sinks. In particular, rather than the "sum" operator typically used to transform multiple sink flow problems into a single-sink problem, we would need a (non-linear) "min" operator. Thus, efficient min-flow formulations are not applicable in this setting.