Epigraph: A Vaccine Design Tool Applied to an HIV Therapeutic Vaccine and a Pan-Filovirus Vaccine

Epigraph is an efficient graph-based algorithm for designing vaccine antigens to optimize potential T-cell epitope (PTE) coverage. Epigraph vaccine antigens are functionally similar to Mosaic vaccines, which have demonstrated effectiveness in preliminary HIV non-human primate studies. In contrast to the Mosaic algorithm, Epigraph is substantially faster, and in restricted cases, provides a mathematically optimal solution. Epigraph furthermore has new features that enable enhanced vaccine design flexibility. These features include the ability to exclude rare epitopes from a design, to optimize population coverage based on inexact epitope matches, and to apply the code to both aligned and unaligned input sequences. Epigraph was developed to provide practical design solutions for two outstanding vaccine problems. The first of these is a personalized approach to a therapeutic T-cell HIV vaccine that would provide antigens with an excellent match to an individual’s infecting strain, intended to contain or clear a chronic infection. The second is a pan-filovirus vaccine, with the potential to protect against all known viruses in the Filoviridae family, including ebolaviruses. A web-based interface to run the Epigraph tool suite is available (http://www.hiv.lanl.gov/content/sequence/EPIGRAPH/epigraph.html).

For the aligned-sequence problem, the nodes of our graph will be associated with distinct (t, e) values. In order to align sequences, one has to deal with insertions and deletions, and this introduces gaps into the aligned sequences. For example, the sequences ACDEGHI and ADEFGHI are better aligned as ACDE-GHI and A-DEFGHI. The gap character is treated differently from an amino acid character:
1. For a given sequence s, if s[t] is not a gap character, then the epitope associated with position t consists of the first k non-gap characters, beginning with the character s[t]. For example, if k = 9 and s = GNF--RNQRK-IVKCFNCGK..., then the PTE associated with t = 2 is NFRNQRKIV.
2. If s[t] is the gap character, then we make a "placeholder epitope" whose first character is the gap character, and whose subsequent characters are the next k − 1 non-gap characters. For these epitopes, we set f(e) = 0. In the example above, the PTE associated with t = 4 is the placeholder epitope -RNQRKIVK.
The rules for connecting edges to consistent epitope pairs are also modified. The first rule is that edges are only supplied for adjacent positions; a directed edge can only connect a node at position t with another at position t + 1. As with ungapped sequences, two adjacent epitopes are considered consistent if the last k − 1 characters of the first epitope agree with the first k − 1 characters of the second epitope. But we also consider the pair consistent if the second epitope begins with a gap character, and the remaining k − 1 characters match the last k − 1 characters of the first epitope. For example: ACDEFGHIK and -CDEFGHIK are consistent; -CDEFGHIK is consistent with itself; and -CDEFGHIK and CDEFGHIKL are consistent.
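The two gap rules and the modified edge rule can be illustrated with a short Python sketch. This is not the Epigraph source code; the function names `epitope_at` and `consistent` are hypothetical, positions are taken to be 1-based as in the examples above, and positions too close to the end of the sequence to yield a full k-mer are assumed to return no node.

```python
GAP = "-"

def epitope_at(s, t, k):
    """Node label for 1-based position t of aligned sequence s (hypothetical helper).

    Returns None when too few non-gap characters remain to form a full epitope.
    """
    i = t - 1  # convert to a 0-based index
    if s[i] == GAP:
        # Rule 2: placeholder epitope = gap character + next k-1 non-gap characters.
        tail = [c for c in s[i + 1:] if c != GAP][: k - 1]
        return GAP + "".join(tail) if len(tail) == k - 1 else None
    # Rule 1: first k non-gap characters, beginning with s[t].
    chars = [c for c in s[i:] if c != GAP][:k]
    return "".join(chars) if len(chars) == k else None

def consistent(e1, e2):
    """Edge rule for two nodes at adjacent positions t and t + 1."""
    if e2.startswith(GAP):
        # Second epitope is a placeholder: its last k-1 characters must
        # match the last k-1 characters of the first epitope.
        return e2[1:] == e1[1:]
    # Ungapped rule: last k-1 of the first equals first k-1 of the second.
    return e1[1:] == e2[:-1]

s = "GNF--RNQRK-IVKCFNCGK"
print(epitope_at(s, 2, 9))  # NFRNQRKIV
print(epitope_at(s, 4, 9))  # -RNQRKIVK
```

With these definitions, the nodes generated at successive positions of any single aligned sequence are always pairwise consistent, so each input sequence traces a valid path through the graph.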

Inexact matches
To evaluate the inexact-match coverage, we introduce a set $H_d(e)$ that includes all epitopes within Hamming distance $d$ of the epitope $e$. If $E$ is a set of epitopes, then we write $H_d(E)$ as the set of all epitopes that are within distance $d$ of some epitope in $E$. That is, $H_d(E) = \bigcup_{e \in E} H_d(e)$. Similar to Eq. (1), we can define the inexact-match coverage in terms of the frequencies $f(e')$ of all the epitopes $e' \in H_d(E)$ that approximately match the epitopes in the vaccine. Note that if the $H_d(e)$ were disjoint for all $e \in E$, then we would be able to write

$$\sum_{e' \in H_d(E)} f(e') = \sum_{e \in E} \tilde{f}(e), \quad \text{where} \quad \tilde{f}(e) = \sum_{e' \in H_d(e)} f(e'), \tag{S-4}$$

which suggests that optimization based on $\tilde{f}(e)$, in place of $f(e)$, would optimize off-by-$d$ coverage. Unfortunately, however, the $H_d(e)$ are not in general disjoint, so $\tilde{f}(e)$ in general overestimates the "value" of $e$ (i.e., the contribution of $e$ to the coverage). This is what prevents us from using this scheme to optimize the inexact-match coverage for unaligned sequences. The same problem holds, in principle, for aligned sequences, but the overlap of the Hamming sets is much smaller, and we find that we can use this scheme for aligned sequences. We follow the idea in Eq. (S-4), and define $\tilde{f}(t, e) = \sum_{e' \in H_d(e)} f(t, e')$. In our experiments, we actually used a slight variant of this expression,

$$\tilde{f}(t, e) = \lambda f(t, e) + \sum_{e' \in H_d(e)} f(t, e'), \tag{S-5}$$

where $\lambda = 0.1$. This mostly optimizes the number of inexact matches, but the $\lambda f(t, e)$ term provides a small bonus for exact matches.
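As a rough illustration of the position-specific inexact-match weights, the following Python sketch sums the frequencies of all observed epitopes at the same position that lie within Hamming distance d, and adds the λ-weighted exact-match bonus. The function name `inexact_weights` and the dictionary layout keyed by (t, e) are assumptions made for this example, not part of the Epigraph code. Only epitopes that actually occur at position t are enumerated, which is sufficient because unobserved epitopes have zero frequency.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of mismatched characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def inexact_weights(f, d=1, lam=0.1):
    """Off-by-d node weights (hypothetical helper).

    f maps (t, e) -> frequency of epitope e at aligned position t.
    Returns a dict with the same keys holding
    lam * f(t, e) + sum of f(t, e') over observed e' within distance d of e.
    """
    by_pos = defaultdict(list)
    for t, e in f:
        by_pos[t].append(e)
    weights = {}
    for (t, e), fe in f.items():
        # H_d(e) includes e itself, since its distance to e is 0.
        neighbours = sum(f[(t, e2)] for e2 in by_pos[t] if hamming(e, e2) <= d)
        weights[(t, e)] = neighbours + lam * fe
    return weights

# Toy 3-mers at a single position.
f = {(1, "AAA"): 0.5, (1, "AAB"): 0.3, (1, "ABB"): 0.2}
for node, w in sorted(inexact_weights(f).items()):
    print(node, round(w, 3))
```

In this toy example the Hamming sets overlap (AAB is a neighbour of both AAA and ABB), so the weights sum to more than the total frequency, which is the overestimate discussed above.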
The notion extends straightforwardly to polyvalent vaccines, though the bookkeeping is a little trickier. If $E_o$ is the set of epitopes exactly matched by the first $m$ antigens in a vaccine, then $H_d(E_o)$ is the set of epitopes that are covered in an off-by-$d$ sense. To account for the fact that these epitopes are already covered, they need to be excluded from the sum that defines $\tilde{f}(t, e)$. In particular, write $H'(e) = H_d(e) \setminus H_d(E_o)$ as the set of epitopes that are in $H_d(e)$ but not in $H_d(E_o)$. Then

$$\tilde{f}(t, e) = \sum_{e' \in H'(e)} f(t, e'), \tag{S-6}$$

and the next antigen ($m + 1$) is obtained by finding the path through the graph that optimizes $\sum_t \tilde{f}(t, e_t)$.
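The sequential construction described here can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not taken from the source: coverage is tracked per position (an epitope at position t counts as covered only by an earlier antigen's epitope at the same t), the toy data are ungapped so only the ungapped consistency rule is used, positions are consecutive, at least one path spans every position, and the names `residual_weights` and `best_path` are hypothetical. The path search is a plain dynamic program over position layers.

```python
from collections import defaultdict

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def consistent(e1, e2):
    # Ungapped rule only, to keep the sketch short.
    return e1[1:] == e2[:-1]

def residual_weights(f, covered, d=1):
    """Sum f(t, e') over e' within distance d of e, skipping e' that are
    already within distance d of an epitope in `covered` at the same position."""
    obs, cov = defaultdict(list), defaultdict(list)
    for t, e in f:
        obs[t].append(e)
    for t, e in covered:
        cov[t].append(e)

    def is_covered(t, e):
        return any(hamming(e, c) <= d for c in cov[t])

    return {
        (t, e): sum(f[(t, e2)] for e2 in obs[t]
                    if hamming(e, e2) <= d and not is_covered(t, e2))
        for t, e in f
    }

def best_path(w):
    """Highest-scoring path with one node per position (edges only t -> t+1)."""
    layers = defaultdict(list)
    for t, e in w:
        layers[t].append(e)
    ts = sorted(layers)
    # best[(t, e)] = (score of best path ending at this node, predecessor epitope)
    best = {(ts[0], e): (w[(ts[0], e)], None) for e in layers[ts[0]]}
    for prev, t in zip(ts, ts[1:]):
        for e in layers[t]:
            cands = [(best[(prev, p)][0], p) for p in layers[prev]
                     if (prev, p) in best and consistent(p, e)]
            if cands:
                score, p = max(cands)
                best[(t, e)] = (score + w[(t, e)], p)
    last = ts[-1]
    e = max((x for x in layers[last] if (last, x) in best),
            key=lambda x: best[(last, x)][0])
    path = []
    for t in reversed(ts):  # backtrack from the last position to the first
        path.append(e)
        e = best[(t, e)][1]
    return path[::-1]

# Toy example with k = 3: two variants, ABCDE (frequency 0.6) and AXCDE (0.4).
f = {(1, "ABC"): 0.6, (1, "AXC"): 0.4,
     (2, "BCD"): 0.6, (2, "XCD"): 0.4,
     (3, "CDE"): 1.0}
first = best_path(residual_weights(f, set(), d=0))
covered = {(t + 1, e) for t, e in enumerate(first)}
second = best_path(residual_weights(f, covered, d=0))
print(first, second)
```

With d = 0 the second path picks up the minority variant, as expected for a complementary antigen. With d = 1 every epitope in this toy example is already within one mismatch of the first antigen, so all residual weights vanish.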

Filovirus alignment
We created a master input sequence alignment, using the Los Alamos Filovirus database [33]. This alignment includes 34 sequences (a single representative sequence for every human outbreak that has at least one full-length genomic sequence available, as well as a representative of RESTV and LLOV, which have not been isolated from humans) to capture the extent of known Filoviridae diversity, while weighting the sampling towards recurrent outbreak strains. Hence the alignment includes 10 EBOV sequences, 7 SUDV, 2 BDBV, 1 TAFV, 1 RESTV, 1 LLOV, 3 RAVV, and 9 MARV. Within human outbreaks, sequences are highly similar, and so each outbreak is represented only once. To select a single representative that approximated the index case of each outbreak, we chose a sequence from the earliest sample in the outbreak, when temporal data were available. If multiple isolates were sequenced from that sample, we picked a natural sequence that was either identical or closest to the consensus from the first time point. We then translated each gene, including the full-length Glycoprotein GP, but not the secreted forms, sGP and ssGP, and we concatenated the proteins into a full proteome alignment of all 7 Filovirus proteins. This served as a baseline for vaccine design. We then created two subsets of this data for staged vaccine design exploration. The first included only 8 sequences, one representative each from EBOV, SUDV, BDBV, TAFV, RESTV, LLOV, RAVV, and MARV, so we could explore the vaccine design outcome if all species were weighted equally. The second contained only a single representative virus from each of the 5 species in the Ebolavirus genus.

We see in each case that the optimal coverage, as evaluated for k-mer epitopes, is achieved by the solution for which $k_o = k$. But we also see that the differences are not substantial.

Table 2. Compare 9-mer coverage for Mosaic and Epigraph. Epigraph results are based on five runs with different random number seeds; the best and the median coverage fractions are reported. (Epigraph is much faster to run than Mosaic, so it is feasible to run it five times and keep the best solution.) The main thing to notice is that the coverage fractions are so similar, with the difference typically in the fourth decimal place. Mosaics were generated using 10-hour run times on a 48-core AMD Opteron cluster via the HIV database portal, and used population sizes of 400 (http://www.hiv.lanl.gov/content/sequence/MOSAIC/). Epigraph usually has the slight advantage; in only 7 of the 48 cases did the Mosaic solution have the best coverage. Even the median Epigraph solution outperformed Mosaic most of the time.

Here, m = 1 + 1 corresponds to the sequential cocktail solution (optimize for m = 1, fix that solution, and find the optimal complementary sequence), and m = 2 refers to iterative refinement and multiple random starts.

Table 5. Summary statistics for Tailored Therapeutic Vaccines. Shown are PTE Coverage (fraction of 9-mers in a natural strain that were perfectly matched by a 9-mer in the vaccine), and cost in terms of Extras (9-mers in the vaccine that are not found in the natural strains). In (a), these values were calculated for each of 189 sequences included in the post-2005 US B clade alignment; in (b) for a set of 199 post-2005 C clade sequences from southern Africa; and in (c) a larger set of 4596 sequences was used, spanning the full M group. In these vaccine designs, n is the number of antigens delivered and m is the number of antigens manufactured (and from which the best n of m were chosen for each sequence, individually). For the straight Epigraph vaccines, n = m. For the Tailored vaccines, we propose manufacturing m = 6 antigens, and delivering n = 2 or n = 3. Coverage increases with increasing n or m, but the number of Extras depends mostly on n.
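The Coverage and Extras statistics, and the per-sequence choice of the best n of m manufactured antigens, can be illustrated with a minimal Python sketch. The function names are hypothetical, and three choices here are assumptions rather than details confirmed by the caption: distinct k-mers are counted (no multiplicity), Extras are counted relative to the single natural strain being evaluated, and the best subset is found by exhaustive enumeration, which is practical for small m such as m = 6.

```python
from itertools import combinations

def kmers(seq, k=9):
    """Set of distinct k-mers in a sequence, ignoring alignment gaps."""
    seq = seq.replace("-", "")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def coverage_and_extras(natural, antigens, k=9):
    """Coverage: fraction of the natural strain's k-mers found in the vaccine.
    Extras: number of vaccine k-mers absent from the natural strain."""
    nat = kmers(natural, k)
    vac = set().union(*(kmers(a, k) for a in antigens))
    return len(nat & vac) / len(nat), len(vac - nat)

def tailor(natural, manufactured, n, k=9):
    """Choose the n of the m manufactured antigens that best cover one strain."""
    return max(combinations(manufactured, n),
               key=lambda subset: coverage_and_extras(natural, subset, k)[0])

# Toy example with k = 3 instead of 9.
natural = "ABCDEF"
manufactured = ["ABCDXX", "XXCDEF", "QQQQQQ"]
chosen = tailor(natural, manufactured, n=2, k=3)
print(chosen, coverage_and_extras(natural, chosen, k=3))
```

In the toy example the two antigens that each match half of the natural strain are selected together, giving full coverage at the cost of four extra 3-mers.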
We also observed that using the conserved p24 instead of the full Gag protein dramatically increases the coverage and reduces the number of "Extras", but p24 spans only 231 of Gag's full 500 amino acids (based on the HIV reference strain HXB2), reducing the number of potential epitopes that could be targeted by over half.

Table 7. Ebola coverage: A summary of coverage statistics for a subset of Ebola Epigraph sequence options we explored. Solution A is the best single Epigraph using the 5 Ebola species set (Epigraph a). B is based on fixing "a", and complementing it with an Epigraph that will give the best coverage of the 34 outbreak sequence set (Epigraphs a+b). C is the 2 Epigraph solution, simultaneously solved, for the 34 outbreak set (Epigraphs c1+c2). D fixes the two Epigraphs in B, and adds a third complementary sequence to improve coverage of the 34 outbreak sequences (Epigraphs a+b+d). E fixes Epigraph a, and then simultaneously solves for 2 more complementary sequences that maximize coverage of the 34 outbreak sequences, for a total of 3 antigens (Epigraphs a+e1+e2). F is the simultaneously solved 3 Epigraph set that best covers the 34 outbreak sequences (Epigraphs f1+f2+f3). E is the strategy that we preferred, because it improves coverage of diverse sequences in rare species while maintaining excellent coverage of recurrent outbreak forms. Note that E will not perfectly match the optimal solution for the 34 outbreak sequences, solution F, but we preferred solution E as it provided better coverage of rarer species of Ebolavirus, with negligible loss of EBOV or SUDV coverage.

Table 8. Computation. Run times, in seconds, for the Epigraph algorithm on a modern laptop computer. The basic run (m = 1) creates a graph from the sequence data, removes cycles, and then uses the dynamic programming scheme in Eq. (3) to find the best path. Because the dynamic programming step is so fast, there is very little marginal cost in finding a second path, using the sequential approach, without (m = 1 + 1) or with (m = 2) iterative refinement. The last column shows the cost of computing new pairs of antigens using iterative refinement with T = 100 random initializations for the first path.

Table 9. Compare unaligned and aligned epigraphs. This table contains no new information, but combines results from Table 2 and Table 3 to display side-by-side the performance of unaligned and aligned epigraphs. In general the differences are small, with the largest difference, for the m = 2 Nef C case, just over 0.01. For Gag, Nef, and Env, we see that the unaligned performance is almost always better. For Pol, the aligned performance is often better, though it bears remarking that for Pol, the differences between the two are particularly small, always less than 0.0005. Asterisks indicate larger values.