The shared neoantigen landscape of MSI cancers reflects immunoediting during tumor evolution

The immune system can recognize and attack cancer cells, especially those with a high load of mutation-induced neoantigens. Such neoantigens are particularly abundant in DNA mismatch repair (MMR)-deficient, microsatellite-unstable (MSI) cancers. MMR deficiency leads to insertion/deletion (indel) mutations at coding microsatellites (cMS) and to neoantigen-inducing translational frameshifts. The abundance of mutational neoantigens renders MSI cancers sensitive to immune checkpoint blockade. However, the neoantigen landscape of MMR-deficient cancers has not yet been systematically mapped. In the present study, we used a novel tool to monitor neoantigen-inducing indel mutations in MSI colorectal and endometrial cancer. Our results show that MSI cancers share several highly immunogenic neoantigens that result from specific, recurrent indel mutation events. Notably, the frequency of such indel mutations was negatively correlated to the predicted immunogenicity of the resulting neoantigens. These observations suggest continuous immunoediting of emerging MMR-deficient cells during tumor evolution.

One sentence summary
Quantitative indel mutation analysis reveals evidence of immune selection in mismatch repair-deficient cancers


cMS mutation frequencies in MSI CRC and EC
Short-read NGS approaches are not ideally suited for mutational and neoantigen profiling of MSI cancers (19)(20)(21), as they show high variability in the detected mutation frequencies of different cMS candidates such as TGFBR2, SLC35F5 or TFAM (Table S1). Importantly, cMS repeats of increased length, which are most susceptible to mutations and therefore encompass the most important mutational targets during MMR-deficient tumorigenesis, are missed by the NGS technology that is in common use today (3,5,19,22,23,24,25,26). To fill this gap and precisely quantify cMS mutation patterns and their resulting neoantigen frames in MMR-deficient cancers, we developed a novel algorithm based on fragment length analysis, the current gold standard for the detection of MSI. ReFrame, our REgression-based FRAMEshift quantification algorithm, allows unbiased quantitative detection of indel mutations by solving a linear system of equations to remove stutter band artifacts, which result from polymerase slippage events during PCR amplification and produce nucleotide gains and losses similar to MSI-induced indels (Fig. S1). We used ReFrame in a series of MSI colorectal cancers (MSI CRCs; n=139) (Table S2) to screen for mutations in 41 cMS residing in 40 target genes derived from the first comprehensive cMS database (Seltarbase) (27). Additionally, we investigated mutation profiles in a cohort of MSI endometrial cancers (MSI ECs; n=14). In agreement with previous reports (23,27), our results show that the load of indels at cMS in MSI CRC and EC is high and that multiple concomitant indels at several cMS in the same tumor are very common. Although most CRC and EC were distinguishable based on the cMS mutation patterns, a large set of cMS mutations was shared by the majority of MSI CRC and/or MSI EC (Fig. 1, Fig. S2).
Moreover, we observed a significant variation of the number of mutations per tumor, ranging from 8 to 29 (median=20) out of 41 analyzed cMS in MSI CRC and from 8 to 25 in MSI EC (median=18). The observed variation suggests potential differences in the neoantigen load of MSI tumors. Potential clinical consequences, e.g. for the sensitivity towards ICB, should be assessed in future clinical studies (28)(29)(30). ReFrame is not only able to quantify mutation frequency, but also to distinguish mutation types, which is crucial for the prediction of the frame of the resulting neoantigen. As the translation of nucleotide into amino acid sequences is based on three base codons, every mutation in a homopolymer region can either result in a simple deletion or insertion of amino acids or in two entirely different neoantigen reading frames: Deletions of one nucleotide (further referred to as minus 1 or m1) or insertions of two nucleotides (plus 2, p2) will result in a shift to a frame here referred to as "minus-one" (M1), while deletions of two nucleotides (minus 2, m2) or insertion of one nucleotide (plus 1, p1) will result in a shift to a frame referred to as "minus-two" (M2) (Fig. 2, Fig. S3, Table S3). The results demonstrate that m1 mutations, resulting in M1 reading frames, were the predominant mutation type (77% in MSI CRC, Fig. 2). The M1/M2 distribution varied significantly across distinct cMS, with significantly elevated numbers of M2 mutation in BANP, TAF1B and ELAVL3, whereas in ACVR2A, HPS1, SLC35F5 and TCF7L2 there were significantly more mutations leading to an M1 frameshift than expected by chance (Bonferroni corrected binomial test, p<0.05; Table  S4 and S5).
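The mapping from indel length to reading frame described above follows from codon arithmetic and can be sketched in a few lines of Python. This is an illustrative sketch, not part of ReFrame: the function names `reading_frame` and `mutation_label` are hypothetical, and the treatment of 3- and 4-nucleotide indels follows the frame assignments given in the figure legend (m4 and p2 join M1, p1 and p4 join M2, and wt, m3 and p3 leave the frame unchanged, here labelled M0).

```python
def mutation_label(indel: int) -> str:
    """Return the paper-style label for an indel: m1 = 1-nt deletion, p2 = 2-nt insertion, wt = no change."""
    if indel == 0:
        return "wt"
    return f"m{-indel}" if indel < 0 else f"p{indel}"


def reading_frame(indel: int) -> str:
    """Map an indel length (negative = deletion, positive = insertion) to M0, M1 or M2.

    A net loss of one nucleotide modulo 3 gives the M1 frame,
    a net loss of two nucleotides modulo 3 gives the M2 frame,
    and multiples of three leave the reading frame intact (M0).
    """
    return f"M{(-indel) % 3}"


# Frame assignment for every indel in the analysed range m4 ... p4.
frames = {mutation_label(k): reading_frame(k) for k in range(-4, 5)}
```

The modulo formulation makes explicit why m1 and p2 (or m2 and p1) lead to the same neoantigen frame even though they are distinct mutation events.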

Epitope landscape of MSI neoantigens
Following the detection of shared indel mutations in MSI colorectal and endometrial cancers, we evaluated the possible immunogenic potential of the frameshift neoantigens and associated neopeptides resulting from antigen processing. We used NetMHCpan 4.0, a state-of-the-art HLA binding prediction tool based on artificial neural networks, to predict neopeptides that are possibly presented as epitopes by HLA class I antigens encoded by the most important HLA supertypes (9,31,32). Applying commonly accepted IC50 thresholds, we distinguished between three classes of peptides with high (IC50 < 50 nM), low (50 nM < IC50 < 500 nM) and very low (500 nM < IC50 < 5000 nM) predicted HLA binding affinity (31,33). As a first step, we analyzed all possible frameshift neoantigen sequences derived from the M1 and M2 frameshifts of the 41 cMS. We then complemented this set to cover all possible FSP neoantigens (n=524) derived from 264 cMS with a length of 8 or more nucleotides published in Seltarbase (Data S1) (27). Our results indicate multiple FSPs resulting from M1 or M2 frameshift mutations that are potentially recognized by the immune system. We detected a wide range of variability with regard to the number of predicted putative epitopes maximally contained within a defined neoantigen. The highest number of predicted putative high-affinity epitopes within a neoantigen was 23, for the M1 frame of P4HB (low affinity: 92 predicted putative epitopes in M1 SPINK5; very low affinity: 375 predicted putative epitopes in M1 P4HB). Other cMS mutation-induced neoantigens showed a complete lack of predicted epitopes (Fig. 3, Data S2 and S5). For HLA-A*02:01, the most common HLA allele in the USA European Caucasian population (34), one or more high-affinity peptides were predicted for 19.8% of the FSP neoantigens. HLA-A*02:01 epitopes with lower affinity were present in 39.5% (≤ 500 nM) and 59.8% (≤ 5000 nM) of candidates (Fig. S4, Table S6, Data S3).
To make the potential impact of certain cMS candidates more tangible and to identify frameshift neoantigens with potentially highest relevance for immune recognition, we defined a "general epitope likelihood score" (GELS; see method section "Computation of immunological scores"). GELS accounts for HLA binding prediction and the prevalence of the respective HLA allele in a defined population, as the latter influences the probability of a neopeptide to be an epitope recognized by the immune system in a patient of this population (9,34). We calculated GELS for all FSP neoantigens using HLA allele frequencies for USA European Caucasians (calculations for additional ethnic groups are provided in Data S3). Accounting for a potential relation between immunogenicity and mutation frequency, we noticed that the most commonly mutated cMS, located in the ACVR2A gene, showed a very low GELS (mutation frequency = 91%, GELS = 5.1%), whereas candidates with very high GELS seemed to be associated with a low mutation frequency (e.g. TMEM97, mutation frequency = 27%, GELS = 91.1%; SPINK5, mutation frequency = 26%, GELS = 91.1%; RUFY2, mutation frequency = 16%, GELS = 90.4%; p_binding = 50% in USA European Caucasian population; Data S3). Hierarchical clustering of cMS candidates on all tumor samples revealed the existence of three distinct populations of cMS (Fig. 4A), which was retained in B2M-wild type, but not B2M-mutant tumors (Fig. 4B).

Immunoselection during MSI carcinogenesis
In order to systematically evaluate whether these observations may result from immunoediting, i.e. counterselection of emerging cancer cell clones that harbor highly immunogenic cMS mutations (high GELS neoantigens), we analyzed potential differences between the observed and expected distribution of cMS mutations. We observed a significant inverse correlation between GELS and mutation frequency, with Pearson's r = −0.45, p = 0.0078 at n = 41 cMS for endometrial tumors and r = −0.42, p = 0.0149 for colon tumors, using a conservative estimate of predicted HLA binding probability of p_binding = 50%, indicating that a high GELS was related to lower mutation frequency (Fig. 4C). The correlation remained significant even at the lowest epitope fidelity level of p_binding = 10%, with p = 0.0145 for endometrial and p = 0.0031 for colon cancers, respectively. The observation suggests that emerging tumor cell clones with highly immunogenic neoantigens are counterselected (Fig. 5), showing for the first time that immunoediting leaves its traces in neoantigen/cMS mutation patterns in MSI cancers (35)(36)(37)(38). Interestingly, the significant inverse correlation was only detected among B2M-wild type tumors. In B2M-mutant tumors, in which immune selection on the basis of HLA class I antigen presentation should not apply, only a trend was observed (Fig. 4D), which possibly reflects effects of immune surveillance prior to B2M mutation (see Data S4 for detailed test parameters). We ruled out a potential influence of cMS length, a well-known factor influencing the likelihood of indel mutations, on the observed mutation frequency (Fig. S54) (23,27,39), further supporting the concept of immunosurveillance-induced negative selection. Despite the statistically significant negative correlation between GELS and mutation frequency, we also observed some outliers (Fig. 4C).
We hypothesize that these outliers may reflect distinct effects that potentially influence the probability of a certain cell clone harboring a defined mutation to survive and thrive during tumor evolution. In addition to potential enhancement of immunogenicity, cMS mutations in tumor suppressor genes are predicted to lead to a growth advantage, at least in cancer or pre-cancer cell clones not directly under attack of the immune system. Such cMS candidates with high GELS and mutation frequencies should be of great relevance for the interaction between the immune system and MMR-deficient tumor cells. The presence of a neoantigen-inducing mutation is a prerequisite for presentation of corresponding neoepitopes that can be recognized by the host's immune system. To simultaneously account for mutation frequency and GELS as factors influencing the likelihood of the neoantigen being presented to the immune system, we defined an "immune relevance score" (IRS), which combines GELS with the mutation frequency in tumors computed via ReFrame (see Materials and Methods section "Computation of immunological scores"). The M1 FSP neoantigen derived from TGFBR2, the first described cMS driver mutation in MSI cancer and also the first ever FSP neoantigen characterized for its immunological properties in MSI cancer in pioneering studies (18,40,41), displays the highest IRS (28.57%). In addition to this well-characterized FSP neoantigen, our study uncovered various novel candidates with predicted importance for the immune biology of MMR-deficient cancers. The candidates LTN1, SLC22A9, SLC35F5, CASP5, TTK, TCF7L2, MYH11, MARCKS (all M1) and BANP (M2) all displayed an IRS above 10% (Fig. 4C, Data S3). The spatial distribution of predicted HLA-binding peptides within these high-IRS FSP neoantigens is visualized in Fig. S6. 
Interestingly, candidate genes with a possible tumor suppressor function were common among the high-IRS genes: CASP5 (apoptosis induction; IRS: 17.15%), TTK (maintenance of chromosomal stability; IRS: 12.38%), TCF7L2 (beta-catenin signaling; IRS: 11.32%), MYH11 (cell structure and proliferation; IRS: 11.11%) and BANP (migration and invasiveness; IRS: 10.73%) were all previously reported in the literature (42)(43)(44)(45)(46)(47)(48)(49)(50). This observation may suggest that highly immunogenic neoantigens are 'tolerated' preferentially if the cells gain a compensatory survival advantage from the mutation by switching off a tumor-suppressive pathway, supporting their role of propelling MSI tumor evolution (Fig. 5).
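The correlation test underlying this section relates one GELS value to one mutation frequency per cMS candidate. As a minimal sketch of the statistic itself, Pearson's r can be computed as below. The function name `pearson_r` and the four example value pairs are hypothetical illustrations and are not data from this study; the significance test (p value) is not included.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical toy values: one (GELS, mutation frequency) pair per cMS candidate.
toy_gels = [0.05, 0.30, 0.60, 0.90]
toy_mutation_freq = [0.90, 0.55, 0.40, 0.20]
toy_r = pearson_r(toy_gels, toy_mutation_freq)  # negative: high GELS, low frequency
```

A negative r, as in the toy example, corresponds to the pattern interpreted here as counterselection of highly immunogenic frameshift mutations.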

Discussion
MMR-deficient tumors, due to their well-defined mechanism of genomic instability, represent an ideal tumor type to study the evolution of solid cancer development and the role of the immune system during this process. By analyzing a broad spectrum of cMS-encompassing genes that are susceptible to mutation in MMR-deficient cells, we were able to identify recurrent mutations and neoantigens, and to provide first evidence for immunoediting during MSI cancer development. The results of our study (Fig. 5) demonstrate that, in contrast to neoantigens in many other cancer types, which typically differ between tumors or even occur as 'private' mutational neoantigens, MMR-deficient cancers share a large pool of FSP neoantigens. Most of the alterations are of the M1 type, resulting from one-basepair deletions (m1), with several candidates displaying a high likelihood of immunogenicity. This observation points towards a common evolutionary pathway of MSI tumorigenesis. The apparent dominance of m1 mutations emphasizes that MMR-deficient cancers not only share similar sets of genes inactivated by MMR deficiency-induced mutations, but also precisely the same FSP neoantigens resulting from these mutations, allowing the definition of a shared neoantigen set for MMR-deficient cancers. Using NetMHCpan 4.0, we identified a plethora of potential MHC binding peptides in FSP neoantigens. This number may even increase when using looser prediction thresholds, as recommended in a recent study evaluating the performance of MHC ligand prediction tools (51). Although many FSP neoantigens do not encompass such peptides for any of the common HLA types, our calculations demonstrate that the vast majority of MSI cancers are predicted to generate one or more neoantigens potentially recognizable by the host's immune system.
This hypothesis is supported by the observation of common FSP neoantigen-specific T cell responses in patients with MSI cancer and Lynch syndrome mutation carriers (8). As demonstrated by previous studies, even very low-affinity peptides may encompass relevant epitopes (52,53). Moreover, several of the FSP neoantigens derived from common cMS mutation encompass "hot spot sequences" for which multiple HLA-binding peptides have been predicted (indicated by dark colors in Fig. 3), suggesting that these might be of increased interest for further evaluation (52,53). Our study has the following limitations. The list of neoantigens analyzed with ReFrame is not exhaustive, as additional frameshift mutations resulting from shorter, less frequently mutated cMS can occur in MSI cancers. In addition, we can only propose an atlas of predicted potential neoepitopes in MSI cancers. Although previous studies evaluated a few of the predicted candidates (18,54), supporting the general validity of the in silico predictions, functional validation of individual predicted epitopes will be required to demonstrate that they can in fact be processed by tumor cells and recognized by immune cells. By combining quantitative cMS mutation analysis with a neoantigen-specific immune score that accounts for the prevalence of the epitope-binding HLA molecules in the population, we for the first time are able to provide evidence that the cMS mutation patterns in MSI cancers show signs of immune selection: Candidates that encompass immunogenic epitopes predicted to bind to common HLA types tend to occur less frequently in manifest MSI cancers. This observation supports the concept that immune surveillance is a major force shaping the natural course of MMR-deficient cancer development (4,25,26,37,55). Depletion of expressed neoantigens, similar to what our data suggest, has recently been reported in lung cancer (56).
Other studies failed to detect evidence for negative selection of immunogenic, neoantigen-inducing mutations in cancer and thereby immunoediting (57,58). This discrepancy may in part be related to the fact that our approach specifically compares individual cMS mutations based on their immunological consequences, accounting not only for the presence of predicted epitope sequences, but also for the population frequency of the respective HLA type, to which the predicted epitope is supposed to bind. In addition, the detectability of specific counterselection events is supported by three specific features of MMR deficiency: first, MMR-deficient cancers in contrast to other tumors share precisely the same mutations, because the location of a cMS within a gene determines its susceptibility for indel mutations in MMR-deficient cells; second, MMR-deficient cancers due to the dramatically elevated rate of somatic mutations per cell division are expected to harbor a significantly higher proportion of MMR deficiency-induced mutations compared to age-related mutations that have occurred prior to tumor initiation, thus enhancing the "visibility" of negative selection events; third, counterselection against FSP neoantigens may be particularly pronounced, as MMR deficiency-induced mutations often lead to generation of long neoantigens with potentially multiple epitopes, against which no central immune tolerance exists (59). The observation of immunoediting during the development of MMR-deficient cancers also implies that a person's HLA genotype should have a significant influence on the immune environment during MSI tumor evolution. Given the existence of immune-relevant FSP neoantigens that may be bound only by a certain type of HLA molecules, it is reasonable to assume that HLA genotype may be a modifier of cancer risk.
This may also explain possible variations of Lynch syndrome penetrance or different rates of MMR deficiency previously suspected between distinct populations (60). Future studies on the natural course of Lynch syndrome should account for this factor. The shared neoantigen landscape encourages cancer-preventive vaccines against MSI cancers, particularly in the setting of Lynch syndrome. If we are able to enhance the abundance of T cells recognizing FSP neoantigens by an FSP neoantigen vaccine, we may shift the balance towards elimination of emerging cancer cells, thereby reducing the likelihood of escape variants leading to outgrowth of clinically manifest tumors. The safety and immunological efficacy of such an FSP neoantigen-based vaccine has already been demonstrated in a first clinical phase I/IIa trial (https://clinicaltrials.gov/show/NCT01461148). If the immune system can be specifically sensitized towards FSP neoantigens resulting from driver mutations which inactivate tumor suppressor genes, such as the ones we evaluated in this study, tumor evolution should be influenced in a way that outgrowth of 'dangerous' MSI cancer cell clones should become significantly less likely.
In conclusion, mutational landscapes in MSI cancers suggest negative selection of mutations that give rise to highly immunogenic FSP neoantigens. This supports the validity of the immunoediting concept in non-viral human tumors. Neoantigen-based vaccination approaches for the prevention of MMR-deficient cancers should account for the natural immune surveillance during their development and focus on strengthening the host's immune response against neoantigens that are related to essential driver mutation events.

Tumor specimens
Formalin-fixed, paraffin-embedded (FFPE) archival tissue blocks were collected from 139 MSI colorectal carcinomas and 14 MSI endometrial carcinomas. Pseudonymized clinical data of each tumor patient is summarized in Table S1. Tumors were obtained from the Department of Applied Tumor Biology, University Hospital Heidelberg in frame of the German HNPCC Consortium, the Finnish Lynch syndrome registry, and Leiden University Medical Center. The study was approved by the Institutional Ethics Committee, University Hospital Heidelberg. Informed consent was obtained from all patients.

Tissue workup and DNA isolation
FFPE tumor sections (5 µm) were deparaffinized and stained with hematoxylin and eosin according to standard protocols. DNA was isolated from tissue sections after separate microdissection of normal and tumor tissue. Only samples with a tumor cell content of more than 80% were used for the analysis. Genomic DNA was isolated using the Qiagen DNeasy Tissue Kit (Cat.No. 69506, Qiagen, Hilden, Germany) according to the manufacturer's instructions.

MSI analysis
The tumors were characterized for their MSI status using the NCI/ICG-HNPCC five microsatellite marker panel supplemented with additional mononucleotide markers BAT40 and CAT25 (61). Tumors displaying instability in more than 30% of the analyzed markers were classified as MSI.

Analysis of frameshift mutations in coding microsatellites (cMS)
In order to amplify the coding microsatellite loci, primers were either obtained from the Seltarbase (http://www.seltarbase.org) (27) or designed using primer3 software (Primer3web version 4.0.0, http://primer3.ut.ee/), with one primer of each primer set carrying a 5' fluorescent (FITC) label. Primers were designed to generate amplicons in the range of 100 to 150 nucleotides for robust PCR amplification (Table S7). PCR was performed in a total volume of 5 µl containing 0.5 µl 10x reaction buffer (Invitrogen, Karlsruhe, Germany), 1.5 mM MgCl2, 200 mM dNTP mix, 0.3 mM of each primer, 0.1 U Taq DNA polymerase (Invitrogen), and 10 ng of genomic DNA, using the following protocol: initial denaturation at 94°C for 5 min; 36 cycles of denaturation at 94°C for 30 s, annealing at 58°C for 45 s and primer extension at 72°C for 1 min; final extension step at 72°C for 7 min. PCR fragments were separated on an ABI3130xl genetic analyzer (Applied Biosystems, Darmstadt, Germany). Generated raw data were analyzed using GeneMapper™ Software version 4.0 (ThermoFisher, Waltham, USA). Peak height profiles were extracted and processed using ReFrame based on R version 3.4.3. The R script is available as Supplementary Material 1.

Microsatellite allele distributions analyzed using Regression-based Frameshift quantification (ReFrame)
In general, PCR amplification of microsatellite loci generates fragments that can vary in length, either due to indel mutations in MMR-deficient cells or due to polymerase slippage during amplification (stutter band artifacts). These two phenomena cause overlays of peak patterns and hamper data interpretation. We developed ReFrame, a REgression-based FRAMEshift quantification algorithm, to allow quantitative analysis of microsatellite mutations by removing stutter band artifacts. We obtained main-peak fractions as a function of microsatellite length, to which a logistic function was then fitted. For each microsatellite in question, an effective length was computed using that fit. We then determined stutter fractions for each gene by calculating the ratios of additional fragments occurring at each microsatellite locus in MMR-proficient control samples (n = 20) to establish baseline reference values. For each cMS, we computed the expected relative contributions of each insertion/deletion, in the range from a deletion of four nucleotides (−4) to an insertion of four nucleotides (+4), to each band in the data. We used these relative contributions to set up a linear system relating the observed peak sizes to the true peak sizes without stutter contributions. Resulting allele profiles were imported into a database for further analysis. Validation of ReFrame was performed in three steps: First, DNA of colonic normal tissue was used to determine baseline deviations of the method in negative controls (Fig. S6c). Additionally, microsatellite-stable cell line DNA (HT29) was used as a control. Finally, two cell line DNAs with differing mutation states (HT29, displaying a wild type peak pattern, and LS180, displaying a mutant peak pattern) were mixed in 10% steps and expected allele distributions were compared to the ReFrame results (Fig. S6d).
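The idea of removing stutter by solving a linear system can be illustrated with a small, self-contained Python sketch. This is not the ReFrame implementation (which is an R script with regression-derived, locus-specific stutter fractions): the stutter profile `STUTTER`, all function names, and the use of a plain Gaussian elimination solver are assumptions made for illustration only. Each true allele is assumed to spread its signal over neighbouring bands according to a fixed stutter profile, so that observed = S · true, and the true allele fractions are recovered by solving for them.

```python
SHIFTS = list(range(-4, 5))  # bands m4 ... wt ... p4
# Hypothetical stutter profile: fraction of a true allele's signal seen at each band offset.
STUTTER = {-2: 0.05, -1: 0.20, 0: 0.70, 1: 0.05}


def stutter_matrix(stutter=STUTTER, shifts=SHIFTS):
    """S[i][j] = fraction of signal from true allele shifts[j] observed at band shifts[i]."""
    n = len(shifts)
    matrix = [[0.0] * n for _ in range(n)]
    for j, allele in enumerate(shifts):
        for offset, fraction in stutter.items():
            band = allele + offset
            if band in shifts:  # signal outside the analysed window is dropped
                matrix[shifts.index(band)][j] = fraction
    return matrix


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def apply_stutter(true_profile, stutter=STUTTER, shifts=SHIFTS):
    """Forward model: the peak pattern expected for a given true allele profile."""
    s = stutter_matrix(stutter, shifts)
    return [sum(s[i][j] * t for j, t in enumerate(true_profile)) for i in range(len(shifts))]


def remove_stutter(observed, stutter=STUTTER, shifts=SHIFTS):
    """Estimate true allele fractions from observed peaks; clip negatives and renormalise."""
    estimate = [max(v, 0.0) for v in solve_linear(stutter_matrix(stutter, shifts), observed)]
    total = sum(estimate)
    if total == 0:
        raise ValueError("no signal left after stutter removal")
    return [v / total for v in estimate]


# Simulated tumor: 40% m1 alleles and 60% wild type alleles.
true_profile = [0.0, 0.0, 0.0, 0.4, 0.6, 0.0, 0.0, 0.0, 0.0]
observed = apply_stutter(true_profile)
recovered = remove_stutter(observed)
```

In the simulated example the observed pattern shows signal at m3 and m2 although no such alleles exist; solving the system attributes that signal back to the m1 and wild type alleles.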

Code availability
The source code of all used algorithms can be accessed on https://github.com/atb-data/neoantigen-landscape-msi

Selection of coding microsatellites and frameshift peptide sequences
For HLA class I binding prediction, 524 FSP neoantigen sequences from 262 mononucleotide changes were retrieved from the Selective Targets in Human MSI-H Tumorigenesis Database (Seltarbase, http://www.seltarbase.org) (27). All cMS with a length of at least eight bases were included. In particular cases other cMS representing putative driver genes, as well as genes which give rise to FSPs with predicted high-affinity binding epitopes according to the literature were also added to the study. In order to also assess potential epitopes located at the junction between N-terminal wild type and C-terminal mutant peptide sequences, the tested peptide sequences all comprised 8 wild type amino acids directly located upstream of the FSP neoantigen sequence to encompass possible fusion epitopes. The whole list of used FSP neoantigens is depicted in Table S5.

HLA binding predictions
For HLA binding prediction, the neoantigen sequences derived from each of the M1 and M2 mutated alleles were analyzed for the presence of binders using the publicly available prediction tool NetMHCpan 4.0 (www.cbs.dtu.dk/services/NetMHCpan/) (9), whose performance has been evaluated to be one of the best of the available tools (51). As m1-induced and p2-induced M1 neoantigens (akin to m2-induced and p1-induced M2 neoantigens) are identical, except for one additional amino acid at the transition between wild type and neo-sequence, we only used M1/m1 and M2/m2 neoantigens for HLA binding prediction. Predicted epitopes were subdivided into three classes based on commonly used thresholds. While the first class included epitopes with a predicted IC50 below 50 nM, referred to as high-affinity binders, the second class included all predicted binders with an IC50 below 500 nM (low-affinity binders). The last class contained all putative epitopes with an IC50 below 5000 nM (very low-affinity binders). All potential HLA binders with a predicted IC50 above 5000 nM were discarded. The peptide length of interest was set to 8mer to 14mer peptides (31,32). A list of all chosen cMS and FSP sequences was submitted to a Python driver script operating NetMHCpan 4.0 (9) to predict putative HLA binding peptides. The prediction results were processed using a Python script applying the above-mentioned IC50 thresholds to all predicted peptides, yielding three datasets of peptides with potential very low, low and high HLA binding affinity. The resulting datasets were then used to generate figures visualizing the predicted epitopes using matplotlib (62). To that end, predicted epitopes were counted and mapped for each HLA type, neoantigen candidate and the respective epitope class (high-, low- or very low-affinity binder). The results of that analysis were used to generate heatmaps per candidate and HLA type using another Python script (see Suppl. Material 1 for all scripts).
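The threshold step described above can be sketched as a small Python function. This is an illustration rather than the published script: the function name `classify_ic50`, the example peptide labels and their IC50 values are hypothetical, and the sketch assigns each peptide to exactly one class using the disjoint ranges given in the Results (IC50 < 50 nM, 50 to 500 nM, 500 to 5000 nM). The handling of values lying exactly on a threshold is an assumption.

```python
HIGH_NM = 50.0       # high-affinity binders: IC50 < 50 nM
LOW_NM = 500.0       # low-affinity binders: IC50 < 500 nM
VERY_LOW_NM = 5000.0  # very low-affinity binders: IC50 < 5000 nM


def classify_ic50(ic50_nm: float):
    """Return the affinity class for a predicted IC50 in nM, or None if discarded."""
    if ic50_nm < 0:
        raise ValueError("IC50 must be non-negative")
    if ic50_nm < HIGH_NM:
        return "high"
    if ic50_nm < LOW_NM:  # assumption: a value of exactly 50 nM counts as low affinity
        return "low"
    if ic50_nm < VERY_LOW_NM:
        return "very low"
    return None  # weaker than 5000 nM: not considered a binder


# Hypothetical predictions: peptide label -> predicted IC50 in nM.
predictions = {"peptide_1": 12.0, "peptide_2": 230.0, "peptide_3": 4100.0, "peptide_4": 20000.0}
classes = {name: classify_ic50(value) for name, value in predictions.items()}
```

Cumulative sets such as "all binders below 500 nM" can be derived from these disjoint classes by taking the union of the high and low classes.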

Selection of HLA allele frequency data
HLA allele frequency data sets were selected from the Allele Frequency Net Database (34) by taking the largest datasets of each ethnicity with at least 10000 data points and sufficient resolution in HLA alleles. These were further processed together with epitope and mutation data to compute the immunological scores.

Computation of immunological scores
For all candidate FSP neoantigens, measures of probable immunological relevance were computed based on the above described predicted IC50 values and mutation frequencies. A hierarchy of probabilities for the given candidates to produce immune reactions was computed, comprising an epitope likelihood score (ELS) per HLA type, a generalized epitope likelihood score (GELS) comprising all HLAs under consideration, and an immunological relevance score (IRS). The ELS was defined to describe the probability of a given neoantigen to be effective across a population, relative to a single HLA. Here, h ∈ H is a given HLA, c ∈ C is a given FSP neoantigen, f_h the allele frequency of the HLA allele h, p_binding the probability that a given predicted epitope is actually bound, that is, the true positive rate of the prediction algorithm, and E(h, c) the set of all epitopes predicted for a given HLA and neoantigen. Taken together, ELS(h, c) constitutes the probability of a given candidate c having at least one true binding epitope for an HLA h and a random person from a given population having at least one allele of h. Consequently, the GELS gives the probability of a candidate having at least one binding epitope among all HLAs for which the given HLA is also present in a randomly selected individual, with H_l denoting the set of HLA types considered for locus l.
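The verbal definitions above can be written out as formulas. The following is one reading that is consistent with those definitions, not necessarily the exact published expressions. It assumes that the two HLA alleles of an individual are drawn independently with population frequency f_h, that each predicted epitope in E(h, c) binds independently with probability p_binding, and that contributions of different HLA types are independent. The symbols h, c, f_h, E(h, c), H_l and l are notation chosen here for illustration.

```latex
% Probability that a random individual carries at least one allele of h
% AND candidate c has at least one true binder for h:
\mathrm{ELS}(h, c) =
  \Bigl(1 - (1 - f_h)^{2}\Bigr)\,
  \Bigl(1 - (1 - p_{\mathrm{binding}})^{\lvert E(h, c)\rvert}\Bigr)

% Probability that this holds for at least one HLA type across all loci l:
\mathrm{GELS}(c) =
  1 - \prod_{l}\,\prod_{h \in H_l} \bigl(1 - \mathrm{ELS}(h, c)\bigr)
```

Under this reading, a candidate without any predicted epitope for an HLA type h contributes ELS(h, c) = 0, and GELS increases both with the number of predicted epitopes and with the population frequency of the HLA types that bind them.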
Finally, the IRS gives the joint probability of a given FSP and its underlying cMS mutation being present in an individual and at least one predicted binder existing for an HLA present in that individual, assuming independence between the presence of HLA alleles and present FSPs. ELS and GELS were computed for all candidate FSPs and HLAs considered using Python on the three output classes of epitope prediction, where binding probabilities were incremented from 0% to 90% in steps of 10%. HLA allele frequencies were obtained from the Allele Frequency Net Database (34). Immunological relevance scores were computed for all candidates with available mutation frequency data.
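A minimal Python sketch of the score hierarchy is given below. It is an interpretation of the verbal definitions in this section, not the published code: it assumes that the probability of carrying at least one allele of an HLA type with frequency f is 1 − (1 − f)², that each predicted epitope binds independently with probability p_binding, that HLA types contribute independently to GELS, and that the IRS is the product of mutation frequency and GELS (the joint probability under the stated independence assumption). All function names, HLA frequencies and epitope counts in the example are hypothetical.

```python
def els(allele_freq: float, n_epitopes: int, p_binding: float = 0.5) -> float:
    """Epitope likelihood score for one HLA type and one FSP neoantigen (assumed form)."""
    carrier = 1.0 - (1.0 - allele_freq) ** 2         # at least one of two alleles is this HLA
    binder = 1.0 - (1.0 - p_binding) ** n_epitopes   # at least one predicted epitope truly binds
    return carrier * binder


def gels(hla_table: dict, p_binding: float = 0.5) -> float:
    """Generalized score over all HLA types; hla_table maps HLA name -> (allele_freq, n_epitopes)."""
    none_effective = 1.0
    for allele_freq, n_epitopes in hla_table.values():
        none_effective *= 1.0 - els(allele_freq, n_epitopes, p_binding)
    return 1.0 - none_effective


def irs(mutation_freq: float, gels_value: float) -> float:
    """Immune relevance score: mutation present AND at least one effective epitope (assumed product)."""
    return mutation_freq * gels_value


# Hypothetical candidate: allele frequencies and epitope counts are illustrative only.
candidate = {"HLA-A*02:01": (0.27, 3), "HLA-B*07:02": (0.12, 0)}
candidate_gels = gels(candidate, p_binding=0.5)
candidate_irs = irs(0.40, candidate_gels)
```

Varying `p_binding` between calls mirrors the sensitivity analysis described above, in which the binding probability was stepped in increments of 10%.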

Cluster analysis of mutation patterns
Frameshift mutation abundances (m4 to p4) for each gene and tumor sample were filtered for missing data. For all subsequent clustering experiments, missing values were replaced by the dataset mean. Abundances of frameshift mutations were summarized by their respective reading frame (M2, M1, wt), providing the features used for all subsequent analyses. Resulting features were grouped by tumor sample and candidate cMS respectively. Hierarchical clustering using Ward's minimum variance linkage (63) was performed for both feature-sets grouped by cMS and tumors for all tumor samples considered, as well as for cMS features considering only B2M wildtype and mutated tumors respectively. Three clusters of candidate cMS were extracted from hierarchical clustering both for features considering all tumor samples and features considering B2M wildtype tumors only.   Fig. S3 for complete dataset). Each row constitutes one analyzed tumor sample with its related allele ratios. For each cMS, tumors were sorted by the proportion of wild type alleles top to bottom. The number of samples analyzed for a certain candidate is indicated below each candidate's figure. Color indicates the resulting reading frames: magenta indicates the M1 frame, corresponding to m1, m4 and p2 mutations; green indicates the M2 frame, corresponding to m2, p1, p4 mutations. Because wt, m3 and p3 mutations do not result in translational frameshifts, they are shown in black (M0) (see also magnification, right panel). Intensities represent ReFrame-calculated ratios from white (0%) to full intensity magenta/green/black (100%) according to the resulting reading frame of the column. The annotated solid lines (first horizontal line top down) show the end of the nonmutated tumor samples while the dotted lines (second horizontal line top down) mark the beginning of tumors being more than 50% mutated, associated with biallelic hits within the respective sample. 
(D) Calculated mutation frequencies and mean allele ratios of the most common mutation types (m3 to p1) resulting from ReFrame analysis in 10 representative cMS (see Tab. S2 for complete dataset). The table gives an overview of cMS with mutation frequencies above 50% in MSI CRC or MSI EC, showing the mutation frequencies (%mut), the proportion of samples with biallelic hits, indicated by a proportion of wt alleles below 50% of all detected signals (%wt<0.5), and the mean mutational pattern for the cMS candidates sorted by their length. Allele ratios are shown only for wild-type (wt), one to three base pair deletions (m1 to m3) and one base pair insertions (p1), as m4 and p2 to p4 mutations occurred only rarely or were completely absent. The same color code as in (C) was used.

The full hierarchy is displayed as a dendrogram for both feature sets, with the threshold dissimilarity for clustering indicated by a red line. Here, B2M-mutated features show no clustering at the given dissimilarity threshold. The same data are shown again using RBF kernel PCA with two principal components. While the wildtype data show the same clusters as all tumors combined, the clustering is lost in the case of mutated B2M.

(C) The mutational frequency of FSPs resulting from one base pair deletions (m1) is shown on the y-axis against the GELS of the resulting M1 FSP neoantigens (x-axis). For the calculation of the GELS, all predicted epitopes (IC50 < 500 nM) were taken into account, with an assumed probability of p_binding = 50% for a predicted binder to be a true positive. Each bubble represents one candidate. The color intensity of the bubbles shows the IRS, with white representing a low IRS and dark red a high IRS. All candidates with an IRS of 10% or higher are annotated.
(D) Correlation between the number of predicted epitopes in cMS mutation-induced FSP neoantigens and the frequency of the respective cMS mutations in MSI colorectal cancer, separated by B2M mutation status. Pearson's r from the correlation test is shown on the y-axis, while the different groups of tumors are shown on the x-axis. Whiskers indicate 95% confidence intervals. A significant inverse correlation was observed (r = -0.42, p = 0.0149, n = 41 candidates) for 99 MSI colorectal cancers with wild type B2M, using a conservative estimate of predicted epitope fidelity of p_binding = 50%, indicating that high epitope likelihood was related to lower mutation frequency.
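A correlation coefficient with a 95% confidence interval of the kind shown in panel (D) can be obtained as sketched below. The legend does not state how the intervals were derived; the Fisher z-transformation used here is a common choice and is an assumption, as are the function names and the toy data.

```python
from math import atanh, sqrt, tanh


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transformation.

    Requires n > 3 and |r| < 1; z_crit is the two-sided 95% normal quantile.
    """
    z = atanh(r)
    se = 1.0 / sqrt(n - 3)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


# Toy data (hypothetical): predicted epitope counts vs. mutation frequencies.
epitope_counts = [1, 2, 3, 4, 5]
mutation_freqs = [5, 3, 4, 2, 1]
r_toy = pearson_r(epitope_counts, mutation_freqs)

# Interval for the reported r and n under the assumed method.
ci_low, ci_high = fisher_ci(-0.42, 41)
```

Under this assumed method, an interval that lies entirely below zero is consistent with a significant inverse correlation.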

Fig. 5. Implications of immune selection during tumor evolution in MSI cancers.
(MULTI CELL) Inactivation of the MMR system results in the accumulation of a high number of somatic cMS mutations during cell division. These cMS mutation events depend on the likelihood of polymerase slippage at the microsatellite loci, i.e. on microsatellite length, but are random with regard to the functional consequences of the mutations, which results in a random distribution of cMS mutations in the initiated cell population. During progression, driver mutations promoting cell survival and proliferation are favored, while highly immunogenic mutations are disfavored due to immune surveillance. As such, the distribution of cMS mutations across a cell population is shaped by both driver effects and immune surveillance. Upon abrogation of cellular antigen presentation, e.g. due to B2M mutation-induced loss of HLA class I stability, the immunogenicity of neoantigens resulting from cMS mutations is expected to become irrelevant for the selection of cell clones. The distribution of cMS mutations is then no longer shaped by the immune system and depends only on driver effects.
(SINGLE CELL) Insertions and deletions due to polymerase slippage in cMS result in two equivalence classes of frameshift neopeptides with M1 or M2 frameshifts. Survival of a given cell with cMS mutations then depends on the binding behavior of these neopeptides to the cell's HLA class I complexes. If neoantigens contain HLA-binding peptides, they can be recognized as foreign by T cells, which may result in T cell-mediated induction of cell death. In contrast, neoantigens not containing HLA-binding peptides are neutral and do not impair cell survival. Destabilization of HLA class I by B2M mutation leads to a general lack of peptide-containing HLA class I complexes on the cell surface, theoretically corresponding to a complete lack of HLA class I binders.
(WORKFLOW) The distribution of cMS mutations across tumor samples was quantified using ReFrame, which deconvolves the observed frameshift sequence abundances, including stutter contributions, to recover the true abundance of each frameshift sequence. NetMHCpan 4.0 predicts the IC50 of putative epitopes for all cMS-derived FSPs, identifying potentially highly immunogenic FSPs by their number of predicted low-IC50 epitopes. This information is combined into a hierarchy of immunogenicity scores (ELS, GELS, IRS) that integrate the probabilities of HLA class I binding, presence of the matching HLA types and presence of cMS mutations. The 10 FSPs with the highest IRS are selected as possible candidates for vaccination.
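The deconvolution step of the workflow can be illustrated with a deliberately simplified model. The sketch below assumes that a fixed fraction of every allele's signal appears as a stutter peak one base pair shorter, builds the corresponding mixing matrix and solves the linear system for the true allele ratios. The stutter model, its parameter and all names are hypothetical; ReFrame itself is regression-based, and its actual model is not reproduced here.

```python
def stutter_matrix(n_alleles, stutter_minus1):
    """Column j describes how a pure allele j spreads over observed peaks.

    Alleles are ordered from shortest to longest. Assumption: a fraction
    `stutter_minus1` of each allele's signal shifts to the next shorter peak;
    the shortest allele in the window is left unchanged for simplicity.
    """
    S = [[0.0] * n_alleles for _ in range(n_alleles)]
    S[0][0] = 1.0
    for j in range(1, n_alleles):
        S[j][j] = 1.0 - stutter_minus1
        S[j - 1][j] = stutter_minus1
    return S


def mat_vec(A, x):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


# Hypothetical example with alleles ordered (m1, wt, p1).
true_ratios = [0.3, 0.7, 0.0]
S = stutter_matrix(3, stutter_minus1=0.2)
observed = mat_vec(S, true_ratios)      # simulated peak pattern with stutter
recovered = solve_linear(S, observed)   # deconvolved allele ratios
```

In this toy case, stutter from the wild-type allele inflates the apparent m1 signal, and solving the system restores the original ratios; with noisy measurements, a least-squares fit would replace the exact solve.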