Main

Personalized cancer vaccines comprising neoepitope peptides generated from somatic mutations have shown potential as targeted immunotherapies1,2,3. Other types of aberrant peptides, including cancer germline antigens generated from genes that are transcriptionally silent in adult tissues, have been shown to act as tumor neoepitopes in immune rejection4,5. Dysregulation of RNA splicing through intron retention, which is common in tumor transcriptomes6,7, represents another potential source of tumor neoepitopes, but has not been previously explored. Intron retention is caused by splicing errors that lead to inclusion of an intron in the final mRNA transcript. Retained intron (RI) transcripts are translated and degraded by the nonsense-mediated decay pathway, which generates peptides for endogenous processing, proteolytic cleavage and presentation on MHC type I8,9,10.

We developed a computational approach to detecting intron retention events from tumor RNA-seq data (Fig. 1a and Online Methods). Intron fragments likely to be translated on the basis of their position downstream of a translated exon and upstream of an in-frame stop codon were identified. Predicted binding affinities between RI peptide sequences and the products of sample-specific HLA class I alleles were calculated to identify candidate RI neoepitopes. We filtered and thresholded preliminary results to exclude artifacts. This process (Online Methods) generated a robust list of putative RI neoepitopes for each sample.

Figure 1: Computationally predicted RI neoepitopes detected in clinical patient cohorts.

(a) An in silico pipeline detects intron retention events from transcriptome sequencing, determines open reading frames extending into introns, and identifies putative HLA-specific neoepitopes. ORF, open reading frame; WES, whole exome sequencing. (b) Distribution of total RI load, neoepitope-yielding RI load, and RI neoepitope load in patient cohorts (n = 27 Hugo samples, n = 21 Snyder samples). Box plots show the median, first and third quartiles, whiskers extend to 1.5 × the interquartile range, and outlying points are plotted individually. (c) Somatic and RI neoepitope load by patient. Within each cohort, patients are sorted by total neoepitope load. Neoepitope counts (y-axis values) are represented in natural log format.

We applied this method to tumor sequencing data from two cohorts of melanoma patients treated with checkpoint inhibitors11,12 to identify putative RI neoepitopes (n = 48 melanomas; Supplementary Tables 1 and 2). Apart from one outlier, both cohorts had comparable levels of intron retention and predicted RI neoepitopes (Fig. 1b). Slight variation in RI neoepitope load between cohorts was expected given differences in RNA sequencing run, depth, and quality13. The total predicted neoepitope load included RI neoepitopes, as well as somatic mutation neoepitopes derived computationally using published methods (Supplementary Fig. 1, Supplementary Table 1 and Online Methods). Most patients showed substantially augmented total neoepitope loads with the additional consideration of RI neoepitopes. Mean somatic neoepitope load was 2,218 and mean RI neoepitope load was 1,515, yielding a 0.7-fold increase in mean total neoepitope load with the addition of RI neoepitopes (Fig. 1c). Excluding one outlier sample with a vastly higher level of somatic neoepitopes than the rest, incorporation of RI neoepitopes roughly doubled the total neoepitope load. There was no significant correlation between somatic neoepitope load and RI neoepitope load (ordinary linear regression P = 0.63; Supplementary Fig. 2).
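The fold increase quoted above follows from the two reported cohort means. As a worked check (the symbols are introduced here for illustration only and values are rounded):

```latex
\[
\frac{\overline{N}_{\mathrm{RI}}}{\overline{N}_{\mathrm{somatic}}}
  = \frac{1{,}515}{2{,}218} \approx 0.68,
\qquad
\frac{\overline{N}_{\mathrm{somatic}} + \overline{N}_{\mathrm{RI}}}{\overline{N}_{\mathrm{somatic}}}
  = \frac{3{,}733}{2{,}218} \approx 1.68 .
\]
```

That is, adding RI neoepitopes raises the mean total load to roughly 1.7 times the mean somatic load, which corresponds to the stated 0.7-fold increase.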

To demonstrate that RI neoepitopes are processed and presented on MHC I, we predicted RI neoepitopes from six human tumor cell lines and used mass spectrometry to detect neoepitopes complexed to MHC I (Supplementary Table 3). In melanoma cell line MeWo, the predicted RI neoepitopes EVYAAGKYV and YAAGKYVSF from KCNAB2 (chr1:6142308–6145287) were experimentally discovered in complex with MHC I via mass spectrometry with high confidence (Fig. 2a). We identified RI neoepitopes in another melanoma cell line, SK-MEL-5 (AMSDVSHPK and LAMSDVSHPK from SMARCD1), in B cell lymphoma cell lines CA46 (FRYVAQAGL from LRSAM1) and DOHH-2 (TLFLLSLPL and FLLSLPLPV from CYB561A3), and in leukemia cell lines HL-60 (SVLDDVRGW from TAF1) and THP-1 (LTSQGKSAF from ZCCHC6) (Fig. 2b and Supplementary Fig. 3). When the same method was applied to somatic mutation–derived neoepitopes, a comparable percentage of predicted neoepitopes was detected by mass spectrometry (Supplementary Table 4). The mass spectrometric detection, in complex with MHC I, of peptides whose sequences match RI neoepitopes predicted by our pipeline provides direct evidence of the processing and presentation of RI neoepitopes through the MHC I pathway.

Figure 2: Predicted RI neoepitopes from human cancer cell lines are identified by mass spectrometry bound to MHC class I.

(a) Two RI neoepitopes identified in the MeWo cell line originating from gene KCNAB2 were both predicted in silico and found by mass spectrometry in the MeWo immunopeptidome. Integrative Genomics Viewer (IGV) sashimi plot indicating RNA-seq read depth (RI expression in TPM = 5.13, percent-spliced-in (PSI) value = 1.07%) and mass spectra. Experiments were repeated five times with independent measurements for cell line MeWo. Neoepitopes shown had one peptide-to-spectrum match (PSM) and were identified in one replicate within a 1% false discovery rate. CCLE, Cancer Cell Line Encyclopedia. (b) Predicted RI neoepitopes were found to have mass spectrometric evidence supporting their presentation in complex with MHC I using the same methodology in additional tumor cell lines: SK-MEL-5, CA46, DOHH-2, HL-60 and THP-1.

Given that somatic neoepitope burden is a known correlate of checkpoint inhibitor response in melanoma14, we next examined whether RI neoepitope load might be similarly associated with response. However, there was no association between RI neoepitope load and clinical benefit from checkpoint inhibitor therapy, nor was there correlation with expression of the canonical markers of immune cytolytic activity CD8A, GZMA or PRF1 (ref. 15), or with clinical covariates (Pearson correlation P > 0.05 for all; Supplementary Figs. 4–6). Rather, there was a nonsignificant trend toward association between high RI neoepitope load and lack of benefit (two-sided Mann–Whitney U test, P = 0.29 in the Snyder cohort (ref. 12) and P = 0.61 in the Hugo cohort (ref. 11)). Tumors with high RI neoepitope load and tumors unresponsive to checkpoint inhibitors, with only 38% overlap, shared common transcriptional programs consistent with cell cycle and DNA damage repair activity (Supplementary Fig. 7 and Supplementary Table 5).

Here we demonstrate that tumor-specific RI neoepitopes can be identified computationally in both patient- and cell-line-derived samples and a subset can be validated as presented in complex with MHC I. These data support the hypothesis that aberrant splicing results in intron retention, which generates abnormal transcripts that are translated into immunogenic peptides, loaded on MHC I and presented to the immune system, underscoring their relevance in patients receiving immunotherapy. Further studies will be necessary to clinically validate the immunogenicity of specific RI neoepitopes in patients, including identification of T cells specific to predicted RI neoepitopes.

Furthermore, we found that RI neoepitope load was not associated with checkpoint inhibitor response and discovered that samples from patients with high RI neoepitope load are transcriptionally similar to those whose tumors did not respond to immunotherapy: both patient groups have enrichment of cell cycle and DNA damage repair–related gene sets. Intron retention has been shown to regulate the cell cycle in both nonmalignant16 and malignant cells17. These findings warrant further investigation and experimental validation, given the emerging synergistic relationship between cell cycle inhibition and immune checkpoint blockade therapies18,19,20.

Identification of a wider array of tumor neoepitopes, including those derived from somatic mutation, aberrant gene expression and splicing dysregulation, will contribute to a more complete understanding of the tumor immune landscape. Additional work dissecting the relationship between the prediction, processing and presentation, and ultimate immunogenicity of neoepitopes derived from different sources will be required to ensure clinical relevance of this approach. It has been shown that melanoma in particular may feature certain shared epitopes across patients that are derived from incomplete splicing processes, which may render these cancers more susceptible to RI-derived neoepitopes21,22. Similar approaches across different tissues will provide further clarity on the role of RI neoepitopes in tumor immunity across cancer contexts. Currently, our findings are limited by the availability of clinically annotated cohorts with high-quality RNA sequencing and matched normal tissue. Incorporation of matched normal tissue will improve exclusion of RIs that represent normal gene expression and may help increase precision of our filtering approach. Prediction of patient-specific RI neoepitopes has the potential to contribute to the development of personalized cancer vaccines.

Methods

Clinical cohorts.

Analysis was conducted on published cohorts of melanoma patients treated with immune checkpoint inhibitors. The Hugo et al. cohort included samples from 27 melanoma patients (26 before treatment, 1 on treatment) treated with the PD-1 inhibitor pembrolizumab11. Patient outcomes were classified as responding to therapy (R) (n = 14) or not responding to therapy (NR) (n = 13), as described in the original publication. These samples were sequenced from fresh-frozen tissue using a standard, poly(A)-selecting protocol. The Snyder et al. cohort included post-treatment samples for 21 melanoma patients treated with ipilimumab (anti-CTLA-4 therapy)12,23. Outcomes were classified as receiving long-term clinical benefit (LB) (n = 8) or not receiving clinical benefit (NB) (n = 13), as described in the original publication. RNA sequencing of the Snyder cohort was performed on fresh-frozen tissue using a standard, poly(A)-selecting protocol.

RI neoepitope pipeline.

Raw RNA-seq FASTQ files were pseudoaligned to an augmented hg19 (GENCODE Release 19, GRCh37.p13)24 transcriptome index containing both exonic and intronic transcript sequences, and transcript expression was quantified via kallisto25. The KMA algorithm26, implemented as a suite of Python scripts within an R package, was used to identify the genomic loci of expressed intron retention events with limited false positives. Using these RI loci, the UCSC Table Browser27 database was queried via its public MySQL server to obtain the nucleotide sequences corresponding to the intronic regions and fragments of the preceding exonic sequences, as well as the open reading frame orientation at the start of the intron. RI peptide sequences of 9 or 10 amino acids, with at least 1 intronic amino acid, were generated by translating open reading frames into intronic sequences until reaching an in-frame stop codon. These peptides, along with sample HLA class I alleles identified via the POLYSOLVER algorithm28, were assessed for putative peptide–MHC I binding affinity via NetMHCpan v3.0 (ref. 29). A threshold of rank < 0.5% was used to identify putative RI neoepitopes.
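The translation step described above can be illustrated with a minimal sketch. It is not the published pipeline code: the function name `ri_peptides`, its parameters, and the choice to count an amino acid as intronic when at least one base of its codon lies in the intron are assumptions made for illustration.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def ri_peptides(exon_tail, intron, frame_offset=0, lengths=(9, 10)):
    """Return the set of 9- and 10-mers that contain an intronic amino acid.

    exon_tail    -- nucleotides of the upstream exon, ending at the
                    exon-intron boundary (hypothetical input format)
    intron       -- nucleotides of the retained intron
    frame_offset -- number of leading bases of exon_tail to skip so that
                    translation starts in frame (assumed to be known)
    """
    coding = (exon_tail + intron).upper()[frame_offset:]
    boundary = len(exon_tail) - frame_offset  # index of first intronic base

    # Translate codon by codon until an in-frame stop (or an unknown codon).
    residues = []
    for i in range(0, len(coding) - 2, 3):
        aa = CODON_TABLE.get(coding[i:i + 3])
        if aa is None or aa == "*":
            break
        residues.append(aa)
    protein = "".join(residues)

    # First residue whose codon includes at least one intronic base.
    first_intronic = boundary // 3

    peptides = set()
    for length in lengths:
        for start in range(len(protein) - length + 1):
            if start + length - 1 >= first_intronic:
                peptides.add(protein[start:start + length])
    return peptides
```

In this sketch, windows that lie entirely within the exonic portion are discarded, and translation stops at the first in-frame stop codon, so an intron whose first in-frame codon is a stop yields no peptides.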

Several filters were applied at various steps throughout the pipeline to eliminate likely false positive RIs and RI neoepitopes. After expression quantification, RIs expressed at a level ≤1 transcript per million, likely artifactual, were eliminated from the analysis. Additional expression-based filters were applied within the KMA algorithm: RIs that did not reach a level of at least 5 unique counts in at least 25% of samples in a cohort and whose neighboring exons did not reach a level of at least 1 transcript per million in at least 25% of samples in a cohort were eliminated as false positives26. Owing to the absence of matched normal RNA-seq data for our melanoma clinical cohorts, a 'panel of normals' approach was taken in an attempt to filter out introns commonly retained in normal skin tissue, which would not produce immunogenic peptides as a result of likely host immune tolerance. RIs were identified in six normal skin samples (three individuals, two samples per individual: subject ERS326932 with samples ERR315339 and ERR315376, subject ERS326943 with samples ERR315372 and ERR315460, and subject ERS327007 with samples ERR315401 and ERR315464) from the Human Protein Atlas. RNA-seq paired-end FASTQ files for each sample were downloaded from the following open-access link: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1733/samples/. All normal sample retention profiles were highly concordant, both within and across individuals (Supplementary Fig. 8a). The final filter set of 7,050 normal RIs was obtained by intersecting the sets of RIs shared by each unique combination of one sample per individual—eight groups total (Supplementary Fig. 8b and Supplementary Table 6). These RIs were eliminated from downstream tumor sample analyses. 
In addition, RI peptides with amino acid sequences present in the normal proteome, derived from the UniProt human reference proteome version 2017_03, downloaded on 5 July 2017, were filtered out because of likely host immune tolerance30. Finally, a set of RIs that were flagged because of abnormally high expression values and found upon manual review in the Integrative Genomics Viewer31 to be erroneously annotated in either the reference transcriptome or the Table Browser database was eliminated from the analysis (Supplementary Fig. 9a–d and Supplementary Table 6).
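The 'eight groups' in the panel-of-normals filter arise from choosing one of two samples for each of three individuals (2 × 2 × 2 = 8). The sketch below enumerates those groups and computes the RIs shared within each one. The sample and subject identifiers are those listed above, whereas the function name and the toy RI labels (`ri_1`, `ri_2`, `ri_3`) are hypothetical. The sketch stops at the per-group shared sets and does not reproduce how they were combined into the final filter list.

```python
from itertools import product


def shared_ris_per_group(samples_by_individual, ris_by_sample):
    """Map each one-sample-per-individual combination to its shared RIs."""
    groups = {}
    for combo in product(*samples_by_individual.values()):
        # RIs detected in every sample of this combination.
        groups[combo] = set.intersection(
            *(set(ris_by_sample[sample]) for sample in combo)
        )
    return groups


# Subjects and samples as listed in the text.
SAMPLES_BY_INDIVIDUAL = {
    "ERS326932": ["ERR315339", "ERR315376"],
    "ERS326943": ["ERR315372", "ERR315460"],
    "ERS327007": ["ERR315401", "ERR315464"],
}

# Hypothetical toy RI calls, for illustration only.
TOY_RIS = {
    "ERR315339": {"ri_1", "ri_2", "ri_3"},
    "ERR315376": {"ri_1", "ri_2"},
    "ERR315372": {"ri_1", "ri_3"},
    "ERR315460": {"ri_1", "ri_2", "ri_3"},
    "ERR315401": {"ri_1", "ri_2", "ri_3"},
    "ERR315464": {"ri_1", "ri_3"},
}

GROUPS = shared_ris_per_group(SAMPLES_BY_INDIVIDUAL, TOY_RIS)
```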

Clinical cohort somatic neoepitope analysis.

Putative somatic neoepitopes were identified in silico for each sample as described in Van Allen et al. 2015 (ref. 14). Briefly, BAM files from each cohort underwent sequencing quality control to ensure concordance between tumor and matched normal sequences and adequate depth of sequencing coverage. Single nucleotide variants were called using MuTect32 and insertions and deletions were called using Strelka33. Annotation of identified variants was done using Oncotator (http://www.broadinstitute.org/cancer/cga/oncotator). Sequences of 9- or 10-amino acid peptides with at least one mutant amino acid were generated. These peptides, along with HLA class I alleles called with POLYSOLVER, were analyzed using NetMHCpan v3.0 to identify HLA–peptide binding interactions28,29. For each patient, all peptides with predicted binding rank ≤2.0% for at least one patient HLA Class I allele were called somatic neoepitopes.

Cell line analyses.

Raw RNA-seq data from published34 cell lines CA46, DOHH-2, HL-60, THP-1, MeWo and SK-MEL-5 were obtained from the Cancer Cell Line Encyclopedia35 via the NCI Genomic Data Commons and run through our computational pipeline as described above, with minor adaptations as follows. HLA class I alleles for each cell line were used as enumerated in the publication. A threshold of predicted binding rank ≤ 2.0% for at least one HLA class I allele was used to distinguish cell line RI neoepitopes. All pipeline filters applied to patient data described above were implemented on the cell line data, except that RI neoepitopes expected to be retained in normal tissue were not filtered, because these experiments were focused on presentation of RI neoepitopes rather than immune system stimulation once presented.

Mass spectrometric data from Ritz et al.34, as well as previously unpublished data for cell lines MeWo, DOHH-2 and SK-MEL-5, were searched against a database consisting of 93,250 sequences of the human reference proteome downloaded from UniProt on 7 July 2017 concatenated with putative retained intron sequences (TPM > 1), or concatenated with 133,811 intron sequences with TPM < 1 (not retained) as negative control. Fragment mass spectra were searched with SEQUEST and filtered to a 1% false discovery rate with Percolator to identify high confidence events.

Gene set enrichment analysis.

Gene expression was quantified in patient samples using kallisto25. Gene set enrichment analysis (GSEA) was run to compare both patients in the top quartile vs. bottom quartile of RI load and patients whose tumors responded to immunotherapy vs. those whose did not. Initially, 50 Hallmark gene sets were tested36. GSEA analyses of the Founders gene sets underlying the Hallmark gene sets that were significantly enriched in both of the above comparisons were subsequently performed. All statistical values reported are Benjamini–Hochberg false discovery rate q values corrected for multiple hypothesis testing.
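For reference, the Benjamini–Hochberg adjustment mentioned above can be written in a few lines. The sketch is a generic implementation of the procedure, not the code used in the analysis, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q values in the input order."""
    m = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues
```

Each p value is scaled by the number of tests divided by its rank, and the running minimum ensures that q values never decrease as p values increase.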

Statistical analyses.

Assessment of difference in means or medians for a continuous variable between two clinical response groups (i.e., clinical benefit vs. no clinical benefit) was performed using the two-sided nonparametric Mann–Whitney U test for non-normally-distributed variables (for example, RI neoepitope burden). All statistical analyses were conducted in the R statistical software environment (v.3.3.1).
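To make the test concrete, the following sketch computes the Mann–Whitney U statistic with a two-sided P value from the normal approximation. It is a simplified illustration and not the R routine used in the analysis: it omits the tie correction to the variance and the continuity correction, and for small groups an exact test would be preferable. The function name is hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided P) using the normal approximation."""
    # Pool the observations, remembering the group of origin (0 = x, 1 = y).
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n = len(pooled)

    # Sum the ranks of group x, giving tied values their average rank.
    rank_sum_x = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_x += average_rank * sum(
            1 for k in range(i, j) if pooled[k][1] == 0
        )
        i = j

    n1, n2 = len(x), len(y)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)

    # Normal approximation (no tie or continuity correction in this sketch).
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p
```

A call such as `mann_whitney_u(benefit_loads, no_benefit_loads)`, with two lists of per-patient RI neoepitope loads (hypothetical variable names), would return the statistic and its approximate P value.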

Life Sciences Reporting Summary.

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability.

Pipeline code is publicly accessible on GitHub at https://github.com/vanallenlab/retained-intron-neoantigen-pipeline and as Supplementary Software.

Data availability.

Raw RNA-seq data for the Snyder et al. 2014 patient cohort are available on dbGaP under accession code phs001038.v1.p1 and for the Hugo et al. 2016 (ref. 11) cohort on the Sequence Read Archive under accession code SRP070710.
