A large peptidome dataset improves HLA class I epitope prediction across most of the human population


Prediction of HLA epitopes is important for the development of cancer immunotherapies and vaccines. However, current prediction algorithms have limited predictive power, in part because they were not trained on high-quality epitope datasets covering a broad range of HLA alleles. To enable prediction of endogenous HLA class I-associated peptides across a large fraction of the human population, we used mass spectrometry to profile >185,000 peptides eluted from 95 HLA-A, -B, -C and -G mono-allelic cell lines. We identified canonical peptide motifs per HLA allele, unique and shared binding submotifs across alleles and distinct motifs associated with different peptide lengths. By integrating these data with transcript abundance and peptide processing, we developed HLAthena, providing allele-and-length-specific and pan-allele-pan-length prediction models for endogenous peptide presentation. These models predicted endogenous HLA class I-associated ligands with 1.5-fold improvement in positive predictive value compared with existing tools and correctly identified >75% of HLA-bound peptides that were observed experimentally in 11 patient-derived tumor cell lines.
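The comparison in positive predictive value (PPV) can be made concrete with a small sketch. It assumes PPV is measured as the fraction of true HLA-bound peptides among the top-ranked predictions, with the cutoff defaulting to the number of true hits; the abstract does not spell out this protocol, so the definition, the function names `ppv_at_top_k` and `fold_improvement`, and the toy scores are illustrative assumptions rather than the authors' evaluation code.

```python
from typing import Optional, Sequence


def ppv_at_top_k(scores: Sequence[float], labels: Sequence[int],
                 k: Optional[int] = None) -> float:
    """Fraction of true hits (label 1) among the k highest-scoring peptides.

    Assumption: when k is not given, it defaults to the number of true hits,
    so a perfect ranking yields a PPV of 1.0.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if k is None:
        k = sum(labels)
    if k <= 0:
        raise ValueError("k must be positive (need at least one hit)")
    # Rank peptides from highest to lowest predicted presentation score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def fold_improvement(ppv_new: float, ppv_baseline: float) -> float:
    """Ratio of two PPV values, e.g. a new model versus an existing tool."""
    if ppv_baseline <= 0:
        raise ValueError("baseline PPV must be positive")
    return ppv_new / ppv_baseline


# Toy example: two hits (label 1) hidden among two decoys (label 0).
labels = [1, 0, 1, 0]
baseline_scores = [0.9, 0.8, 0.7, 0.1]   # ranks one decoy above a hit
improved_scores = [0.9, 0.2, 0.7, 0.1]   # ranks both hits on top
baseline_ppv = ppv_at_top_k(baseline_scores, labels)
improved_ppv = ppv_at_top_k(improved_scores, labels)
print(baseline_ppv, improved_ppv, fold_improvement(improved_ppv, baseline_ppv))
```

In a realistic evaluation the decoys would vastly outnumber the hits, which is why PPV at a strict cutoff is a more demanding metric than overall accuracy.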

Fig. 1: Mass spectrometric characterization of peptides eluted from HLA proteins in mono-allelic cell lines.
Fig. 2: Identification of shared motifs and submotifs amongst HLA-A, -B, -C and -G alleles.
Fig. 3: Mono-allelic data uncover length-specific HLA binding preferences.
Fig. 4: Proteasomal and peptidase shaping of the HLA-associated peptidome.
Fig. 5: Generation and evaluation of allele-and-length-specific and pan-allele-pan-length MS-based models on mono-allelic data.
Fig. 6: Integrative MS-informed models more accurately predict peptides directly observed on primary tumor cells.

Data availability

The original mass spectra for 79 of 95 mono-allelic datasets generated for this study, the protein sequence database and tables of peptide spectrum matches for all 95 alleles have been deposited in the public proteomics repository MassIVE (https://massive.ucsd.edu) and are accessible at ftp://massive.ucsd.edu/MSV000084172/. MS data for the 16 previously published mono-allelic datasets in MassIVE can be downloaded at ftp://massive.ucsd.edu/MSV000080527. Datasets for the patient samples are accessible at ftp://massive.ucsd.edu/MSV000084442/. B721.221 RNA-seq data for HLA-C (C*04:01, C*07:01) are deposited under GEO: GSE131267. Melanoma RNA-seq data are deposited in dbGaP (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001451.v1.p1) (ref. 15). Glioblastoma bulk RNA-seq data are available through dbGaP (https://www.ncbi.nlm.nih.gov/gap) with accession number phs001519.v1.p1 (ref. 26). All other data are available from the corresponding authors upon reasonable request.

Code availability

Code used to generate plots characterizing allele-specific preferences (for example, logo plots, entropy plots, peptide projection and clustering, overlap with IEDB data and so on) as well as code to build a sample neural network prediction model is provided as Supplementary Code. The HLAthena predictors are available to use online for research purposes only at http://HLAthena.tools. For commercial usage inquiries please contact the authors or the Broad Institute.
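As an illustration of the kind of analysis behind an entropy plot, the sketch below computes per-position Shannon entropy for a set of equal-length peptides. It is not the Supplementary Code; the function name `positional_entropy` and the toy peptides are hypothetical, and the sketch assumes peptides of a single length are analyzed together. Positions with low entropy are dominated by a few residues, which is the typical signature of an anchor position in a binding motif.

```python
import math
from collections import Counter
from typing import List, Sequence


def positional_entropy(peptides: Sequence[str]) -> List[float]:
    """Shannon entropy (in bits) of the residue distribution at each position.

    Assumption: all peptides have the same length, so position i of every
    peptide can be compared directly.
    """
    if len({len(p) for p in peptides}) != 1:
        raise ValueError("expected a non-empty set of equal-length peptides")
    entropies = []
    # zip(*peptides) yields one tuple of residues per peptide position.
    for column in zip(*peptides):
        total = len(column)
        counts = Counter(column)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
        entropies.append(entropy)
    return entropies


# Toy 9-mers (hypothetical): position 2 and the C terminus are conserved.
toy_peptides = ["ALDKGTSYV", "KLNEPVLLV", "GLYDGMEHV", "SLFRAVITV"]
for position, value in enumerate(positional_entropy(toy_peptides), start=1):
    print(position, round(value, 3))
```

A logo plot conveys the same information graphically by scaling residue letters according to how far each position falls below the maximum possible entropy.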


References

  1. Lefranc, M.-P. et al. IMGT®, the international ImMunoGeneTics information system® 25 years on. Nucleic Acids Res. 43, D413–D422 (2015).
  2. Robinson, J. et al. The IPD and IMGT/HLA database: allele variant databases. Nucleic Acids Res. 43, D423–D431 (2015).
  3. Jurtz, V. et al. NetMHCpan-4.0: improved peptide–MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J. Immunol. 199, 3360–3368 (2017).
  4. Abelin, J. G. et al. Mass spectrometry profiling of HLA-associated peptidomes in mono-allelic cells enables more accurate epitope prediction. Immunity 46, 315–326 (2017).
  5. O’Donnell, T. J. et al. MHCflurry: open-source class I MHC binding affinity prediction. Cell Syst. 7, 129–132.e4 (2018).
  6. Gfeller, D. et al. The length distribution and multiple specificity of naturally presented HLA-I ligands. J. Immunol. 201, 3705–3716 (2018).
  7. Bulik-Sullivan, B. et al. Deep learning using tumor HLA peptide mass spectrometry datasets improves neoantigen identification. Nat. Biotechnol. 37, 55–63 (2018).
  8. Nielsen, M. & Andreatta, M. NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets. Genome Med. 8, 33 (2016).
  9. Rajasagi, M. et al. Systematic identification of personal tumor-specific neoantigens in chronic lymphocytic leukemia. Blood 124, 453–462 (2014).
  10. de Kruijf, E. M. et al. HLA-E and HLA-G expression in classical HLA class I-negative tumors is of prognostic value for clinical outcome of early breast cancer patients. J. Immunol. 185, 7452–7459 (2010).
  11. Zhang, R.-L. et al. Predictive value of different proportion of lesion HLA-G expression in colorectal cancer. Oncotarget 8, 107441–107451 (2017).
  12. Dawson, D. V., Ozgur, M., Sari, K., Ghanayem, M. & Kostyu, D. D. Ramifications of HLA class I polymorphism and population genetics for vaccine development. Genet. Epidemiol. 20, 87–106 (2001).
  13. Gragert, L., Madbouly, A., Freeman, J. & Maiers, M. Six-locus high resolution HLA haplotype frequencies derived from mixed-resolution DNA typing for the entire US donor registry. Hum. Immunol. 74, 1313–1320 (2013).
  14. Solberg, O. D. et al. Balancing selection and heterogeneity across the classical human leukocyte antigen loci: a meta-analytic review of 497 population studies. Hum. Immunol. 69, 443–464 (2008).
  15. Ott, P. A. et al. An immunogenic personal neoantigen vaccine for patients with melanoma. Nature 547, 217–221 (2017).
  16. Pearson, H. et al. MHC class I-associated peptides derive from selective regions of the human genome. J. Clin. Invest. 126, 4690–4701 (2016).
  17. Vita, R. et al. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 43, D405–D412 (2015).
  18. Sette, A. & Sidney, J. HLA supertypes and supermotifs: a functional perspective on HLA polymorphism. Curr. Opin. Immunol. 10, 478–482 (1998).
  19. Robinson, J., Malik, A., Parham, P., Bodmer, J. G. & Marsh, S. G. E. IMGT/HLA database—a sequence database for the human major histocompatibility complex. Tissue Antigens 55, 280–287 (2000).
  20. Parham, P. & Moffett, A. Variable NK cell receptors and their MHC class I ligands in immunity, reproduction and human evolution. Nat. Rev. Immunol. 13, 133–144 (2013).
  21. Nielsen, M. et al. NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known sequence. PLoS ONE 2, e796 (2007).
  22. Rist, M. J. et al. HLA peptide length preferences control CD8+ T cell responses. J. Immunol. 191, 561–571 (2013).
  23. Maenaka, K. et al. Nonstandard peptide binding revealed by crystal structures of HLA-B*5101 complexed with HIV immunodominant epitopes. J. Immunol. 165, 3260–3267 (2000).
  24. Kaur, G. et al. Structural and regulatory diversity shape HLA-C protein expression levels. Nat. Commun. 8, 15924 (2017).
  25. Celik, A. A., Simper, G. S., Hiemisch, W., Blasczyk, R. & Bade-Döding, C. HLA-G peptide preferences change in transformed cells: impact on the binding motif. Immunogenetics 70, 485–494 (2018).
  26. Keskin, D. B. et al. Neoantigen vaccine generates intratumoral T cell responses in phase Ib glioblastoma trial. Nature 565, 234–239 (2019).
  27. Javitt, A. et al. Pro-inflammatory cytokines alter the immunopeptidome landscape by modulation of HLA-B expression. Front. Immunol. 10, 141 (2019).
  28. Di Marco, M. et al. Unveiling the peptide motifs of HLA-C and HLA-G from naturally presented peptides and generation of binding prediction matrices. J. Immunol. 199, 2639–2651 (2017).
  29. Liepe, J. et al. A large fraction of HLA class I ligands are proteasome-generated spliced peptides. Science 354, 354–358 (2016).
  30. Faridi, P. et al. A subset of HLA-I peptides are not genomically templated: evidence for cis- and trans-spliced peptide ligands. Sci. Immunol. 3, eaar3947 (2018).
  31. Mylonas, R. et al. Estimating the contribution of proteasomal spliced peptides to the HLA-I ligandome. Mol. Cell. Proteom. 17, 2347–2357 (2018).
  32. Rolfs, Z., Solntsev, S. K., Shortreed, M. R., Frey, B. L. & Smith, L. M. Global identification of post-translationally spliced peptides with neo-fusion. J. Proteome Res. 18, 349–358 (2018).
  33. Rolfs, Z., Müller, M., Shortreed, M. R., Smith, L. M. & Bassani-Sternberg, M. Comment on ‘A subset of HLA-I peptides are not genomically templated: evidence for cis- and trans-spliced peptide ligands’. Sci. Immunol. 4, eaaw1622 (2019).
  34. Bassani-Sternberg, M. et al. Direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry. Nat. Commun. 7, 13404 (2016).
  35. Schuster, H. et al. The immunopeptidomic landscape of ovarian carcinomas. Proc. Natl Acad. Sci. USA 114, E9942–E9951 (2017).
  36. Girdlestone, J. Regulation of HLA class I loci by interferons. Immunobiology 193, 229–237 (1995).
  37. Chong, C. et al. High-throughput and sensitive immunopeptidomics platform reveals profound interferonγ-mediated remodeling of the human leukocyte antigen (HLA) ligandome. Mol. Cell. Proteom. 17, 533–548 (2018).
  38. Kidera, A., Konishi, Y., Oka, M., Ooi, T. & Scheraga, H. A. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J. Protein Chem. 4, 23–55 (1985).
  39. Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis I: dimensional reduction, visualization and prediction of MHC binding using amino acid principal components and regression approaches. Immunome Res. 6, 7 (2010).
  40. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at arXiv:1802.03426 [stat.ML] (2018).
  41. Harndahl, M. et al. Peptide binding to HLA class I molecules: homogenous, high-throughput screening, and affinity assays. J. Biomol. Screen. 14, 173–180 (2009).
  42. Bassani-Sternberg, M., Pletscher-Frankild, S., Jensen, L. J. & Mann, M. Mass spectrometry of human leukocyte antigen class I peptidomes reveals strong effects of protein abundance and turnover on antigen presentation. Mol. Cell. Proteom. 14, 658–673 (2015).
  43. Hunt, D. F. et al. Characterization of peptides bound to the class I MHC molecule HLA-A2.1 by mass spectrometry. Science 255, 1261–1263 (1992).
  44. Rammensee, H. G., Friede, T. & Stevanović, S. MHC ligands and peptide motifs: first listing. Immunogenetics 41, 178–228 (1995).
  45. Rammensee, H., Bachmann, J., Emmerich, N. P., Bachor, O. A. & Stevanović, S. SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 50, 213–219 (1999).
  46. Kim, Y., Sidney, J., Pinilla, C., Sette, A. & Peters, B. Derivation of an amino acid similarity matrix for peptide:MHC binding and its application as a Bayesian prior. BMC Bioinformatics 10, 394 (2009).

Acknowledgements


We acknowledge technical assistance from K. Pelton, S. Santagata, O. Spiro, L. Elagina, B. Knisbacher, S. Shukla, J. Brugge and A. Apffel. We further express gratitude for constructive input from M. Rooney, J. Abelin and Z. Hu. We acknowledge support from the National Institutes of Health: grant nos. NCI-1RO1CA155010-02 (to C.J.W.), NHLBI-5R01HL103532-03 (to C.J.W.), NIH/NCI U24 CA224331 (to C.J.W.), NIH/NCI R21 CA216772-01A1 (to D.B.K.), NCI-SPORE-2P50CA101942-11A1 (to D.B.K.), NHGRI T32HG002295 and NIH/NCI T32CA207021 (to S.S.), NCI 5T32CA009172-41 (to D.A.B.), NIH/NCI U24-CA210986 and NIH/NCI U01 CA214125 (to S.A.C.). This work was supported in part by The G. Harold and Leila Y. Mathers Foundation and the Bridge Project, a partnership between the Koch Institute for Integrative Cancer Research at MIT and the Dana-Farber/Harvard Cancer Center. D.A.B. is supported in part by the John R. Svenson Fellowship. C.J.W. is a scholar of the Leukemia and Lymphoma Society, and is supported in part by the Parker Institute for Cancer Immunotherapy. S.K. is a Cancer Research Institute/Hearst Foundation fellow.

Author information

D.B.K., C.J.W., N.H. and S.A.C. directed the overall study design. S.S. performed computational analyses and developed predictive models. S.K., C.R.H., H.K. and K.R.C. generated the MS data and performed data analysis. D.B.K. and G.L.Z. selected the HLA alleles for analysis. D.B.K., P.M.L. and L.W.L. generated the single-HLA allele cell lines and performed data generation. D.B.K., G.O., K.L.L., D.A.B., P.M.L. and L.W.L. developed the patient-derived tumor cell lines. I.K.Z. and J.M.R. generated and provided cells from an ovarian cancer PDX model. P.B. provided CLL samples for analysis. W.Z. provided expert technical assistance. T.E. generated RNA-seq data for mono-allelic cell lines. T.O. and T.L. generated and quantified ribosome profiling data. J.S. and W.J.L. performed HLA typing and validation of all cell lines. S.J. performed HLA binding validation assays. S.S., S.K., N.H., C.J.W. and D.B.K. wrote the manuscript, with contributions from all co-authors.

Correspondence to Nir Hacohen, Steven A. Carr, Catherine J. Wu or Derin B. Keskin.

Ethics declarations

Competing interests

D.B.K. has previously advised Neon Therapeutics, and owns equity in Aduro Biotech, Agenus Inc., Armata Pharmaceuticals, Biomarin Pharmaceutical Inc., Bristol–Myers Squibb Com., Celldex Therapeutics Inc., Editas Medicine Inc., Exelixis Inc., Gilead Sciences Inc., IMV Inc., Lexicon Pharmaceuticals Inc. and Stemline Therapeutics Inc. D.A.B. has received consulting fees from Octane Global, Defined Health, Dedham Group, Adept Field Solutions, Slingshot Insights, Blueprint Partnership, Charles River Associates, Trinity Group and Insight Strategy, and is a member of the RCC translational medicine advisory board of Bristol–Myers Squibb. K.L.L. owns equity in and is a founder of Travera LLC and is an advisor to Bristol–Myers Squibb Com. and Rarecyte. S.A.C. is a member of the scientific advisory boards of Kymera, PTM BioLabs and BioAnalytix and a scientific advisor to Pfizer and Biogen. C.J.W. and N.H. are founders of Neon Therapeutics and members of its scientific advisory board. N.H. is also an advisor for IFM therapeutics. W.J.L. is a member of the scientific advisory board of CareDx. All other authors have no competing interests. Patent applications have been filed on aspects of the described work, entitled ‘HLA single allele lines’ and ‘Methods for identifying neoantigens’.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Materials

Supplementary Figs. 1–6 and Notes 1–7.

Reporting Summary

Supplementary Data 1

Peptide exports for mono-allelic samples

Supplementary Data 2

Peptide exports for multi-allelic samples

Supplementary Table 1

Characteristics of HLA alleles and mono-allelic data

Supplementary Table 2

Allele similarity and submotifs derived from mono-allelic data

Supplementary Table 3

Mono-allelic data reveal length-based preferences

Supplementary Table 4

HLA presentation of IFN-γ response genes increases after treatment

Supplementary Table 5

Cross-validated model evaluation results on mono-allelic data

Supplementary Table 6

Cross-validated model evaluation results on multi-allelic data

Supplementary Code

Sample scripts for reproducing analysis and models.

Cite this article

Sarkizova, S., Klaeger, S., Le, P.M. et al. A large peptidome dataset improves HLA class I epitope prediction across most of the human population. Nat Biotechnol (2019). https://doi.org/10.1038/s41587-019-0322-9
