Introduction

The immune system is crucial in monitoring cancer and identifying neoantigens produced by tumor cells that can trigger cellular immune responses1. But tumor cells have developed strategies to evade immune surveillance2. To address this, cancer immunotherapy was developed, aiming to stimulate the immune system or create lab-engineered substances that restore the ability to recognize and eliminate cancer cells. Immunotherapy options include immune checkpoint inhibitors (ICI), cancer vaccines, adoptive cellular therapies (ACT), cytokine, tumor-infecting viruses, targeted antibodies, and adjuvants. While immunotherapy has significantly improved patient outcomes, its effectiveness is confined to a small and unpredictable subset of patients within a given cancer diagnosis3, and immune-related adverse events (irAEs) may occur4. Therefore, precise identification of a patient’s tumor microenvironment (TME) and of the ability to predict its immunotherapy response are essential for enhancing overall immunotherapy effectiveness.

Current prediction of immunotherapy response relies on biomarkers such as immune-cell infiltration5, tumor mutational burden (TMB)6, PD-1/PD-L1 expression7, CTLA-4 expression8, mismatch repair (MMR) and microsatellite instability (MSI)9. However, existing clinical practices based on simplistic threshold-based methods lack accuracy. In this context, machine learning (ML) technologies have emerged as valuable tools, offering the potential to refine the precision of immunotherapy response prediction. By harnessing sophisticated algorithms and analyzing extensive datasets, ML models can discern intricate patterns and interactions among various molecular biomarkers, providing a more nuanced understanding of the complex immunotherapy tumor microenvironment. These state-of-the-art ML models not only capture subtle relationships between individual biomarkers but also adapt to the dynamic nature of immune responses, offering a more comprehensive and adaptable approach than traditional threshold-based methods. Indeed, ML-based approaches have shown capacity in various oncology applications, including early diagnosis10, cancer type classification11, the complexity and plasticity of TME and immune system deciphering12, response and prognosis prediction13, and potential neoantigen detection14.

This review summarizes the application of ML in molecular analyses related to immunotherapy, including prediction of immunotherapy responses, identification of response-associated biomarkers, and analysis of the TME (Fig. 1). It also explores ML approaches developed to optimize the identification of crucial neoantigens in personalized immunotherapy. Additionally, this review discusses the challenges and opportunities encountered in current research endeavors, aiming to enhance understanding and recognition of the significant contribution of ML in advancing cancer immunotherapy.

Fig. 1: Genomic landscape of machine learning in tumor immunotherapy.
figure 1

We provide an overview of machine learning methodologies applicable to different aspects of tumor immunotherapy including prediction of immunotherapy responses, identification of response-associated biomarkers, and analysis of the TME. ML machine learning, DEG differentially expressed gene, RFE recursive feature elimination, UAF univariate association filtering, LASSO the least absolute shrinkage and selection operator, LR logistic regression, SVM support vector machine, RF random forest, FCNN fully-connected neural network, CNN convolutional neural network, RNN recurrent neural network, PFS progression-free survival, TME tumor microenvironment.

Employing machine learning for predicting immunotherapy response and identifying biomarkers associated with response

While immunotherapy has brought significant benefits to cancer treatments, its effectiveness remains confined to a small and unpredictable subset of patients with a given cancer diagnosis15. Moreover, the treatment process often imposes substantial financial, physical and mental care burden on patients. Acknowledging these challenges, researchers are increasingly directing their efforts toward identifying valuable molecular biomarkers capable of predicting immunotherapy outcomes and improving its overall utility16. Considering the complex omics space, conducting extensive sampling through experimental methods is impractical. Consequently, in silico approaches, including those employing ML algorithms, provide an opportunity to address this critical need.

Tumors are caused by the accumulation of various genetic variations that regulate the way cells growth and multiplication17,18,19. In light of this, recent studies have turned on ML models to predict a patient’s response to immunotherapy by leveraging his genomic biomarkers and clinical features (Table 1, Fig. 2). Somatic mutations, including single-nucleotide variants (SNVs), insertions, and deletions, provide direct evidence documenting the driving forces behind tumorigenesis and tumor cell proliferation20. These mutations have demonstrated their ability in predicting immunotherapy responses. For example, Peng et al.21 used SNV data and convolutional neural network (CNN) model to classify anti-PD-1/PD-L1 therapy response from metastatic non-small-cell lung cancer (NSCLC) patients. Nonsynonymous mutations can alter transcription, subsequently impacting pathway activations and gene functions. Leveraging the distinct changes in gene expression levels, particularly in oncogenes and tumor suppressor genes, ML models can subtly predict immunotherapy response. According to our survey, RNA-based features, including bulk RNA sequencing22,23,24,25,26,27,28,29,30,31, single cell RNA sequencing (scRNA-seq)32,33, flow cytometry34, and circulating cell-free microRNA sequencing35,36, have been widely implemented in immunotherapy response prediction. Furthermore, the availability of numerous accessible RNA-seq datasets, coupled with the outstanding performance of models utilizing RNA-seq data, has been instrumental in advancing research. From RNA-seq data, many advanced features can be extracted from bioinformatic or ML tools to better characterize tumor genomic profiles, such as tumor-infiltrating lymphocytes (TILs)32,33, pathway activity28 and cell-cell communication23. It is worth noting that some studies have leveraged these features to extract high-level features, thereby enhancing predictive performance. For instance, Wang et al.22 utilized TMB information based on SNV data, gene expression information, and support vector machines-recursive feature elimination (SVM-RFE) algorithm to select gene features. Subsequently, they used least absolute shrinkage and selection operator (LASSO) logistic regression classifier to predict responses of urothelial carcinoma patients treated with the PD-L1 inhibitor atzolizumab using selected gene features. Their approach achieved an AUC of 93% in the test dataset. Additionally, they utilized generalized linear models (GLMs) to derive a TMB-related LASSO score (TLS) from the LASSO regression results. The TLS can serve as an effective indicator for immunotherapy response prediction like TMB. Lapuente-Santana et al.23 employed regularized multi-task linear regression (RMTLR) to identify interpretable biomarkers in relation to immune cells markers, intracellular networks, and intercellular networks for predicting immunotherapy response. On the other hand, Zeng et al.28 implemented a joint nonnegative matrix factorization (NMF) to decompose gene expression matrix, molecular phenotype matrix, and immunotherapy response matrix. This approach aims to identify pivotal genes correlated with immunotherapy response. Similar to RNA-seq data, Shang et al.37 and Filipski et al.38 have successfully employed DNA methylation profiles (CpG sites) for ICI response prediction in NSCLC37 and metastatic melanoma38 patients. Apart from these conventional biomarkers, clinical information39 and Raman spectroscopy data40 have shown promise as reliable biomarkers of ICI response prediction. In a separate study, Sidhom et al.41 integrated human leukocyte antigen (HLA) and T cell receptor (TCR) sequencing to predict ICI response in melanoma. Their approach involved employing a multiple-instance learning model that incorporated HLA into the featurization of the TCR sequences to provide a representation of a joint TCR-HLA antigenic latent space. The contextualization of TCR-HLA was then trained on multihead attention networks to learn the attention weights, which were used to predict the final immunotherapy response.

Table 1 Publications relevant to machine learning in immunotherapy response prediction
Fig. 2: An overview of machine learning techniques for immunotherapy response prediction.
figure 2

Advancements in sequencing technologies have paved the way for exploring diverse approaches in immunotherapy response prediction. To improve efficiency and mitigate overfitting, dimensionality reduction and feature selection techniques are performed prior to model training. Multimodal models offer the flexibility to train on data with multiple modalities. These models first utilize sub-models to extract unimodal features from each data modality. Subsequently, a data fusion step transforms each extracted unimodal feature into a compact multimodal representation. Finally, a classification sub-model is implemented to infer response based on the integrated features.

Accompanied by the advancement of sequencing technologies, recent studies have focused on developing complex ML models incorporating multi-omics datasets for immunotherapy prediction42,43,44,45,46,47. Compared to single omics approaches, the integration of multiple omics data can provide a more comprehensive scope of tumor profile, from the original cause of tumors (genetic, environmental, or developmental) to the functional consequences48,49, and consequently leads to improved performance in immunotherapy response prediction47. In a recent approach50, researchers integrated RNA-seq data with somatic mutations, copy number alterations and protein expression alterations to comprehensively investigate various subcohorts within TME using a sparse hierarchical clustering model. By employing this method, they have successfully identified distinct subcohorts within the TME, each exhibiting unique responses to different cancer treatments, including immunotherapy. This model holds significant potential in guiding precise decision-making for combination therapy strategies. However, integrating and training multi-omics data, usually accrued from different platforms, is more challenging than training unimodal features. Addressing this challenge, a recent study by Vanguri et al.47 developed a dynamic deep attention-based multiple-instance learning model that integrates radiology, pathology, and multiomics data to predict the response of NSCLC patients treated with anti-PD-1/PD-L1 blockade. Comparisons demonstrated that the multimodal approach, integrating these features, enables higher accuracy than unimodal approach in the prediction of immunotherapy response. Notably, their multimodal model can also handle redundant information and missing values in combination with data from different modalities. In addition to ICI prediction, ML models have been utilized to predict the chimeric antigen receptor (CAR) T therapy response. Daniels et al.51 developed a DL model to utilize signaling motifs to evaluate the antitumor efficacy of a given CAR. Their DL framework takes the motif sequence of the CAR as the input and propagated the encoded sequence through two CNN layers, a long short-term memory (LSTM) network layer and seven fully connected neural network (FCNN) layers. This approach enables directly prediction of tumor stemness and cytotoxicity based on the motif combinations designed for CAR T cells, thereby guiding the design of the engineering of CAR signaling domains in CAR T therapies.

In our survey, to improve the computational efficiency and reduce noise and complexity of ML models, most studies utilized statistical or ML algorithms, or both to identify a subset of gene markers for model training. ML techniques for gene selection can handle high-dimensional data and identify patterns that may not be apparent through manual inspection or traditional analyses. These ML approaches, including LASSO29, random forest (RF)21,40, SVM-RFE22,26, NMF28, and logistic regression (LR)27, automatically assess the importance of each gene in relation to the immunotherapy response prediction. Using these extracted biomarkers, various algorithms, including LASSO, LR, RF, XGBoost, naive Bayes (NB), SVM, decision tree (DT) and NN, have demonstrated their ability to accurately predict immunotherapy responses. By focusing on this refined set of features, researchers can also enhance the interpretability and generalizability of the models, fostering a more effective integration of machine learning into the complex landscape of immunotherapy research.

Employing machine learning as a supplementary tool for the identification of biomarkers in the tumor microenvironment for immunotherapy

The TME refers to the intricate cellular landscape surrounding tumors, including immune cells, cancer cells, stroma cells, the inflammatory cytokines and chemokines, metabolites, acidity, cytokines and hypoxia52. It plays a critical role in supporting tumorigenesis and progression, and immunotherapy effectiveness53,54. Extensive studies have elucidated the complex interactions within TME, driving functions like angiogenesis55, metastasis56 and immunosuppression57. Although obtaining accurate datasets for TME factors such as hypoxia and low pH poses challenges, the integration of tumor omics data and the implementation of ML models have enabled the identification of TME characteristics directly or indirectly associated with cancer immunotherapy (Table 2, Fig. 3).

Table 2 Application of machine learning technologies in immunotherapy-related tumor microenvironment analyses
Fig. 3: Machine learning offers promising strategies for evaluating tumor microenvironment.
figure 3

Various machine learning models have been developed to effectively identify biomarkers and comprehend the relationship between tumor microenvironment and immunotherapy, including risk, development and treatment. These models aim to improve the efficiency of immunotherapies by providing valuable insights and understanding. ML machine learning, TME tumor microenvironment, TIL tumor-infiltrating lymphocytes, CAF cancer-associated fibroblast, CSC cancer stem-like cell, MSI microsatellite instability, TMB tumor mutational burden.

Microsatellite instability and tumor mutational burden

MSI and TMB are FDA-approved biomarkers that predict immunotherapy response. While not directly related to the TME, MSI and TMB reflect genetic alterations occurring within tumor cells. MSI indicates microsatellite length polymorphism due to mismatch repair deficiency, while TMB represents the number of somatic mutations per million bases in the exome region58. Tumors with higher MSI or TMB tend to produce neoantigens recognized by the immune system, rendering them more responsive to immunotherapy. Given the high cost of large-scale genomic sequencing, using ML models to assess MSI or TMB based on a panel with limited genes can offer a more cost-effective approach. Zhou et al.59 successfully identified a 54-microsatellite-site biomarker using an RF classifier, allowing accurate classification of microsatellite instability-high (MSI-H) and microsatellite stable (MSS) tumors. In a similar vein, Lu et al.60 implemented LASSO regression to identify a gene-targeted panel capable of accurately assessing TMB levels. Recently, many deep learning (DL) models have been developed to use whole-slide images (WSIs) to predict MSI61,62 and TMB status63,64, which enables a more cost-effective means of predicting immunotherapy response without relying on genomic data.

Cancer stem-like cell

Cancer stem-like cells (CSCs) are a small population of cancer cells that can reconstitute and propagate tumors. They have been implicated in metastasis, relapse, and resistance to cancer therapies65,66. Numerous studies have focused on identifying and categorizing CSCs within tumor cell populations. Researchers such as Wei et al.67 and Wang et al.68 employed LASSO regression to identify stemness features in tumor samples using RNA-seq data. These identified stemness features have shown a strong correlation with the prognosis of immunotherapy and can serve as valuable biomarkers for predicting immunotherapy response.

Cancer-associated fibroblast

Cancer-associated fibroblasts (CAFs), also a critical component of TME, can modulate cancer metastasis through signaling interactions with cancer cells. They can also influence leukocyte infiltration, drug access and therapy responses69. To identify CAF related genes, Wang et al.70 applied ML models to classify tumor samples as either CAF-enriched (CAF+) or CAF-absent (CAF−). Their analysis revealed that the CAF− subtype was associated with longer overall survival and higher immune cell infiltration compared to the CAF+ subtype. These findings provide valuable insights for predicting the response to immunotherapy. Similarly, Tian et al.71 used LASSO regression to obtain six CAF-related genes that can be used to predict the response to anti-PD-1 therapy in melanoma patients. These studies demonstrate the utility of ML models in elucidating the role of CAFs and their associated genes in immunotherapy response prediction.

Tumor-infiltrating lymphocyte

TILs are highly specific immunological reactive lymphocytic cell populations that can recognize and kill tumor cells72. Their presence is crucial in mediating response to cancer therapy, and a higher abundance of TILs is often associated with better clinical outcomes after immunotherapy73,74,75. Currently, ML models have been broadly implemented to quantify various TIL-based biomarkers for immunotherapy response prediction. These biomarkers encompass RNA-seq data and somatic mutation features76, including protein-protein interaction (PPI) networks77, tumor-infiltrating immune cell-associated lncRNAs78,79, T cell signatures80,81, B cell signatures82 and immunophenotype-related DNA methylation signatures (iPMS)83. A general approach adopted in these models involves clustering tumor samples based on the tumor immune microenvironment, such as immunoactivity, disease stages, and survival outcomes. ML models are then employed to extract significant biomarkers for cluster classification. Subsequently, another ML model was then utilized for predicting immunotherapy response to validate the selected biomarkers for each cluster. Typically, TILs comprise both mononuclear and polymorphonuclear immune cells, including T cells, B cells, natural killer cells, macrophages, neutrophils, dendritic cells, mast cells, eosinophils, and basophils. Accurately assessing the abundance of each immune cell type within tumor tissues is essential for treatment decision-making and evaluating drug response. To this end, ML models have been developed to automatically estimate the abundance of these immune cells82,84,85,86,87, enabling precise deconvolution. In a recent study, Fernández et al.84 used their deconvolved proportions of 22 immune cells as the input feature, which could accurately predict the response of patients treated with ICI therapy.

Metabolism

Metabolism refers to the changes observed in cellular metabolic pathways in tumor cells. Typically, oncogenic transformation can induce cancer cells to adopt a well-characterized metabolic phenotype that can profoundly influence the TME88. Increasing evidence has highlighted the role of metabolism in tumor immunosuppressive responses and resistance to immunotherapy89,90. For instance, tumor cells can alter metabolism by increasing glucose uptake and fermentation of glucose to lactate, promoting tumor growth, survival, proliferation, and long-term maintenance91. To improve immunotherapy efficacy, researchers have proposed employing ML models to identify metabolic TME subtype that respond favorably to immunotherapy. Ge et al.92 conducted an analysis of lipid metabolism genes and immune-related genes of lung adenocarcinoma (LUAD) patients and identified two distinct subtypes, namely “metabolism phenotype” and “immunoactive phenotype”, using Cox regression. The “metabolism subtype” exhibited reduced sensitivity and poorer prognosis to immunotherapy. The identified metabolic features hold promise as potential biomarkers to predict immunotherapy response.

Neoantigen

Neoantigens are novel peptides that form in tumor cells due to certain somatic mutations. Neoantigens have the potential to be recognized by immune cells, triggering immune responses against tumor cells93,94. Immunogenic neoantigens have been identified as crucial for developing personalized neoantigen-targeted cancer immunotherapies95,96, including vaccines and adoptive T-cell therapies94. However, the process of neoantigen discovery and validation remains a daunting question that must be addressed before neoantigen-based immunotherapies can become prominent in cancer treatment97. For example, many tumor peptides lack immunogenicity, highlighting the importance and complexity of accurately identifying which neoantigens can effectively stimulate immune cell responses.

Recently, novel pipelines and state-of-the-art ML algorithms have been developed to identify T-cell neoantigens through major histocompatibility complex (MHC) class I and II presentations (Table 3, Fig. 4). Pipelines utilize genomics data, usually derived from whole-genome sequencing (WGS) or whole-exome sequencing (WES), obtained from tumor samples to infer the mutated peptides based on the somatic non-synonymous SNVs. To facilitate neoantigen prediction, researchers have conducted The Immune Epitope Database (IEDB), which provides experimentally characterized T cell epitopes and a comprehensive set of MHC-binding and MHC eluted ligand (EL) data for humans98. These resources significantly enhance the convenience and accuracy of neoantigen prediction. Based on our review, the majority of ML models focus on identifying class I MHC alleles, which have the ability to bind peptides derived from intracellular proteins and present them on the cell surface to CD8 + T cells. Some studies employ ML models to predict neoantigens by estimating the binding affinity between a given mutated peptide and a class I MHC molecule, known as peptide-MHC (pMHC) binding affinity99,100,101,102,103,104,105,106. These models can be categorized into two groups based on their output. The first group predicts a score representing the relative binding affinity between a peptide and MHC99,100,101,102,103. Among these models, NetMHC99 and NetMHCpan100 used the FCNN framework. While NetMHC was trained solely on binding affinity datasets and can predict peptide binding to specific MHC alleles, NetMHCpan integrated information from both binding affinity data and mass spectrometry (MS) EL data, allowing it to predict binding for a wider range of MHC molecules with high accuracy. Different from NetMHC and NetMHCpan, MHCflurry101 added two one-dimensional convolutional layers before fully connected layers, resulting in higher accuracy. EDGE102, on the other hand, used three peptide-extrinsic features (RNA abundance, flanking sequence, per-gene coefficients) captured from MS data as the input, propagating them into three locally connected layers respectively before merging them into fully connected layers for binding affinity prediction. This approach extracts more information than using a single input, resulting in higher positive predictive values (PPV) compared to benchmarked models. Another model, MHCRoBERTa103, built a transfer learning model by pre-training on the UniprotKB/Swiss-Prot dataset and fine-tuning on IEDB dataset. This strategy allows the model to maintain high accuracy and efficiency simultaneously. The second form of binding affinity prediction in these models involves providing a binary classification result to determine whether a given peptide is a binder104,105,106. In most studies, a threshold of <500 nM of the IC50 value is used to define candidate peptides that are likely to bind to MHC. Notably, to improve performance, ForestMHC105 considered six different sequence-related features and their combinations as input features to select the optimal feature subset. Similarly, Anthem106 collected five published sequence scoring functions that can calculate a binding probability based on sequence information. These scoring functions, along with their combinations, were used as input features to select the optimal subset of scoring functions for binder classification. Considering the distinct advantages and limitations of each binding affinity model, Gartner et al.107 built a random forest-based model that integrates known class I candidate human tumor neoantigens predicted by other models and next-generation sequencing (NGS) data from individuals with metastatic cancers. This model can rank the candidate neoantigens, providing a ranked list that can serve as therapeutic targets and facilitate studies aimed at developing more effective immunotherapies. Recently, increasing evidence indicates that CD4 + T cells can recognize cancer-specific antigens and control tumor growth. As a result, MHC class II neoantigen prediction has become important in immunotherapies like vaccine design and targeted therapy development. However, unlike MHC class I molecules that are highly specific and bind a limited set of peptides of a narrow length distribution108, MHC class II molecules are highly polymorphic and the size of the peptides presented are promiscuous109, making it more challenging for neoantigen prediction. In response to this challenge, several models have been established to predict the binding affinity of pMHC class II complexes110,111,112,113,114. Compared with class I binding affinity prediction models, the MHC class II prediction models were generally trained on more complicated datasets, such as the IEDB MHC class II cell surface receptor (IEDB MHC-DR) restricted peptide-binding dataset. In particular, MHC class II prediction models need to consider longer or even variable peptide lengths as their inputs.

Table 3 Application of machine learning technologies in neoantigen and immunogenicity prediction
Fig. 4: Identification of tumor neoantigens using machine learning models.
figure 4

The identification of tumor neoantigens involves the elution of MHC epitopes from tumor cells and the extraction of somatic mutations from sequencing data. Machine learning algorithms are then processed to model the binding affinity between mutant peptides and MHC proteins, allowing for the prediction of candidate neoantigens. To improve performance, some models incorporate TCR sequencing data to screen for candidate neoantigens with high proportions that interact with TCR and induce T cell responses. MHC major histocompatibility complex, TCR T-cell receptor, APC antigen-presenting cell, pMHC peptide-MHC, WES whole-exome sequencing, WGS whole-genome sequencing, LC/MS liquid chromatography/mass spectrometry.

Typically, identifying binding affinity between MHC and peptides alone is insufficient for accurate neoantigen predictions with high confidence. To overcome the limitation, some studies have focused on assessing the immunogenicity of the predicted binding molecules115,116,117,118. Immunogenicity refers to the ability of protein products to provoke an immune response, and it depends on several factors, including protein expression, peptide-MHC binding affinity and stability, peptide competition for MHC binding, and more94,119. Among these models, DeepHLApan117 designed a multi-task neural network model consisting of three layers of bidirectional Gated Recurrent Unit (BiGRU) with an attention layer. This model can simultaneously predict the binding affinity and the immunogenicity. Similar to DeepHLApan, DeepNetBim116 used a CNN model with an attention layer to predict binding affinity and binary immunogenic categories. In comparison to DeepHLApan, DeepNetBim incorporated an additional layer to merge the two independent outputs together, namely the binding affinity and the binary immunogenic category, in order to calibrate the final immunogenicity prediction. Seq2Neo118 took a different approach by developing an end-to-end software that directly utilize raw sequencing data (WES/WGS, RNA) in FASTQ, SAM and BAM formats to predict the immunogenicity directly through a CNN-based model. In contrast, Charoentong et al.120 did not focus on developing a state-of-the-art DL model like most approaches. Instead, they designed a comprehensive biomarker consisting of 127 features, including somatic mutation features, class-I and class-II MHCs, immune inhibitory and stimulatory genes, adaptive immunity cells and innate immunity cells from integrated WES, RNA-seq and clinical data. Their results demonstrated that proper feature extraction could achieve a high accuracy for tumor immunogenicity prediction using only an RF classifier. In addition to assessing the immunogenicity of the predicted binding molecules, some studies have explored the integration of TCR sequence to predict the likelihood of peptide-TCR interaction for neoantigen prediction. Besser et al.121 proposed using CD8 + T cell responses as the task of their models to detect neoantigens. To accomplish this, they employed an additional step in their ML models, training them on the Tantigen dataset122, a comprehensive database of tumor T cell antigens. Through this step, they were able to learn the changes in key parameters and features associated with T cell response, enabling them to predict whether a given MHC class I peptide was positive for inducing CD8 + T cell response. Likewise, iTTCA-Hybrid123 utilized the tumor T cell antigen dataset from Tantigen122 and non-tumor T cell antigen dataset from IEDB98 to train an ensemble model capable of classifying tumor and non-tumor T cell antigens. More recently, DLpTCR124 and pMTnet125 suggested that assessing the propensity of CD8 + TCR to recognize the pMHC complex is crucial for neoantigen prediction, as most in silico predicted antigen peptides fail to elicit immune responses in vivo. Both models take peptide and TCR sequences as input data, and their output is a binary classification of whether the TCR-pMHC has an interaction. To achieve better performance, DLpTCR124 designed an ensemble strategy based on three deep learning models: FCNN, LeNet-5 and ResNet. On the other hand, pMTnet125 utilized an autoencoder and an LSTM network to obtain the hidden encoding of the TCR sequence and peptide sequence, respectively. These encodings were then fed into an FCNN classifier for final prediction.

It is worth noting that peptide sequence encoding plays a crucial role in neoantigen prediction. Two commonly employed methods for encoding are one-hot encoding and BLOcks SUbstitution Matrix (BLOSUM) encoding (Table 3). Among them, BLOSUM is more prevalent as it offers insights into the homologies between protein sequences. In addition, personalized sequencing encoding techniques utilizing ML algorithms have also gained popularity. These include byte pair encoding103, skip-gram encoding104, principal component analysis (PCA) encoding124 and physicochemical properties (PCP) encoding124.

In conclusion, ML has emerged as a promising approach for evaluating TME, identifying TME related biomarkers and unraveling the intricate relationship between TME and immunotherapy. The biomarkers derived from ML approaches hold great potential for predicting clinical outcomes of immunotherapy and enhancing personalized immunotherapy strategies, thereby facilitating the advancement and wider application of immunotherapy in cancer treatment.

Challenges and opportunities

Despite the extensive application of ML in immunotherapy studies, several challenges remain to be addressed. These challenges pertain to gaining a mechanistic understanding of how immunotherapies target and eradicate tumor cells126 and the neoantigens that can be recognized by immune cells127. Whether and how ML models prompt the progression of immunotherapy will depend on how these challenges, as discussed below, are met in the future.

Insufficient amount of available data

Immunotherapy has emerged as a promising cancer treatment, driving numerous clinical trials worldwide128. Nevertheless, current clinical trials have primarily focused on PD-1/PD-L1 therapy, result in limited data for other treatment like CTLA-4 and CAR T therapy (Table 1). This data scarcity poses a significant barrier for developing ML models, particularly DL models that require substantial training data to avoid overfitting and enhance model performance129. To mitigate the limitations, the generation of pseudo databases has emerged as a potential solution. State-of-the-art generative models, such as generative adversarial network (GAN)130 and diffusion models131, have shown promise in computer vision and can generate synthetic data to supplement training datasets, mitigating overfitting issues. Likewise, Sové et al.132 developed a model using an ML approach to capture interpatient diversity in clinical trials, allowing the simulation of virtual patients. By leveraging these virtual patients, it becomes possible to mimic a virtual clinical trial scenario to quantitatively assess the efficacy of ICI treatments in a controlled environment.

Multi-omics data integration and analysis

The advent of multi-omics technologies has revolutionized our understanding of the biological mechanisms of driving immunotherapy. However, analyzing these large multi-omics data, particularly those from single-cell-based133 and spatial-based134 technologies, has brought new computational challenges. One challenge is the batch effects, resulting from diverse platforms used for data generation. To ensure accurate downstream analyses, removing platform-specific noise is crucial. Recently, ML models, particularly joint dimension reduction algorithms such as negative matrix factorization (NMF), PCA, singular value decomposition (SVD), canonical correlation analysis (CCA), have emerged as powerful tools for encoding data from diverse platforms into a shared latent space, thereby enabling effective batch effect removal135. Additionally, the training data often exhibit distinct statistical modalities. To tackle this challenge, multimodal learning with specialized modelling strategies has gained attention for integrating diverse data modalities, such as medical imaging and genomics41,47. By harnessing the strengths of multiple modalities, multimodal learning models offer the potential to address immunotherapy-related questions.

Meta-analysis

In the field of immunotherapy response prediction, the definitions of “response” vary across studies. For example, Vanguri et al.47 and Chowell et al.42 employed Response Evaluation Criteria in Solid Tumors (RECIST)136 as their criterion for defining response, whereas Filipski et al.38 utilized survival (defined as the time from start of ICI treatment to date of decease) to characterize response. The disparate use of these distinct criteria underscores the considerable variability in how the concept of “response” is operationalized across studies, posing a challenge to the synthesis of studies and the establishment of a standardized framework for meta-analysis. Standardization the definition and harmonization data are necessary to achieve a consensus on common criteria or thresholds for defining immunotherapy response.

Neoantigen prediction

With ongoing developments of new algorithms, the field of cancer neoantigen identification holds promise for immunotherapies94. Given the uniqueness of the neoantigen landscape to each individual, the accurate targeting of neoantigens establishes a solid foundation for conducting systematic studies in precision medicine and providing clinical decision support for cancer immunotherapy. Computational models, especially ML algorithms, are commonly used for immunogenic neoantigen prediction. However, comparative studies have revealed that, thus far, none of the existing studies have achieved accurate identification of immunogenic neoantigens127. Factors such as tumor heterogeneity, diversity within the TCR repertoire, and the absence of true labeled data contribute to this inaccuracy. Future studies should focus on developing more comprehensive models integrating both pMHCs and TCR sequencing data to improve predictive performance of neoantigen identification. It is worth noting that certain studies have explored targeting tumor-specific gene fusion137 and MHC gene loss of heterozygosity (LOH)138 to improve immune recognition in neoantigen identification. Incorporating these factors could augment neoantigen predictions and contribute to higher accuracy in future studies.

Model generalizability and interpretability

While numerous ML models have been developed for immunotherapy response prediction, they often struggle to adapt well to unseen data. Their performance on new data is often moderate or deficient, indicating a lack of generalizability. Moreover, these models typically employ ML or statistical approaches to select marker genes. However, the selected marker genes vary between studies and may have limited effectiveness within specific datasets. To address these challenges, recent studies have employed transfer learning algorithms for immunotherapy response prediction. By leveraging pre-trained models and applying them to train on new, similar datasets39, this approach can enhance the efficiency and robustness139. In addition, the interpretability of ML models in immunotherapy remains a persistent concern, ML algorithms often function as black boxes, making it difficult to understand the decision-making process and the underlying biological rationale behind their predictions. To improve the generalizability, researchers are exploring feature insights and interactions through explainable AI (XAI) models140. XAI approaches can provide global and local explanations, enabling a deeper understanding of predictions and facilitating effective fine-tuning on new data.

Models in handling continual incremental datasets with real-time adaptation

In our studies we reviewed, almost all models applied for immunotherapy analyses are traditional batch learning approaches. These methods utilized entire datasets simultaneously for training, deploying the trained model for inference without frequent updates. However, they usually encounter high retraining cost when adapting to new training data141. With the growing of clinical and genomics data during the patient treatment, there is a need to develop models with the capacity to conduct incremental datasets and adapt in real-time to new information. Online learning emerges as a scalable and efficient approach that learn to continuously updates the model based on feedback on its decisions in the form of a sequence of examples141,142,143, demonstrating premium performance in clinical applications144. This approach holds the potential to significantly assist clinicians via providing diagnoses or making management decisions.

Clinical translation

While numerous ML models have been developed for predicting immunotherapy outcomes, our review reveals that almost none of these models have undergone clinical testing. Furthermore, contemporary ML-based clinical decision support systems, such as IBM Watson Health145 and Google DeepMind Health146, encounter obstacles hindering the smooth transition of models from research settings to standard clinical practice. This discrepancy underscores the critical necessity for rigorous clinical validation to evaluate the real-world efficacy and reliability of these predictive models. The complexity of the immune system, the dynamic nature of immunological responses, the lack of data quality and standardization, and the absence of highly reliable biomarkers all contribute to the challenges impacting the performance of these models. Conducting comprehensive clinical trials and validation studies is crucial to bridging the gap between theoretical concepts and practical applications in the field of immunotherapy.

Opportunities

Despite the limited number of databases, there are still a growing number of resources available for immunotherapy research. The Cancer Genome Atlas (TCGA)147 is a prevalent curated database containing genomic, epigenomic, transcriptomic, proteomic and whole slide imaging data across 33 cancer types. Among them, a significant number of patients were treated with immunotherapy, and these samples have been widely used in training ML models as reviewed in this study. In addition, the medical images (MRI, CT, digital histopathology, etc.) of some of these patients can be downloaded from The Cancer Imaging Archive (TCIA)148 database, enabling the multi-modality analysis of immunotherapy studies. Tumor Immunotherapy Gene Expression Resource (TIGER)149 and ICBatlas150 are comprehensive resources for integrative analysis of the transcriptome profiles related to tumor immunology. The Cancer Immunome Atlas120 is a web-accessible database that characterizes the intratumoral immune landscapes and the cancer antigenomes of 20 solid cancers. This database has also developed an immunophenoscore to quantify tumor immunogenicity from genomic features, which helps inform cancer immunotherapy and facilitate the development of precision immuno-oncology. To ensure safe cancer treatment, Wang et al.151 developed an irAE data resource consisting of a total of 893 irAEs. They also performed comparative analyses on these irAEs, making it more intuitive to identify and understand how off-targets of ICIs are involved in irAEs. In addition to clinical resources, there are datasets available for other immunotherapy-related collections. IEDB98 and Tantigen122 provide a comprehensive set of data related to antibody, B and T cell epitopes for humans, along with tools to assist in the prediction and analysis of neoantigens for immunotherapy. In summary, these resources and databases have facilitated the generation of new research tools, diagnostic techniques, vaccines and therapeutics that were previously used in immunotherapy studies.

Conclusions

Immunotherapy holds promise for cancer treatment, but the rapid accumulation of immunotherapy-related data has raised challenges. This review summarizes the use of ML approaches in addressing these challenges. Conventional ML algorithms (LR, RF, SVM, LASSO, XGBoost) have demonstrated their versatility in handling various omics datasets, including mutations, CNVs, methylation profiles, and expression profiles, to predict immunotherapy responses. ML models also analyze TME to identify biomarkers and subcohorts with distinct immunotherapy responses. Unsupervised clustering algorithms are typically utilized for subcohort identification, while LASSO regression is employed to identify subcohort biomarkers. Notably, DL approaches are extensively implemented for handling the sequencing data in neoantigen prediction. Natural language processing-related models, including word-to-vector models, are broadly used for sequence encoding, whereas recurrent neural networks-based models or transformers are commonly utilized for task training. Moreover, we highlight the prevailing challenges, emphasizing the need for ML models to handle multi-modal data to facilitate the rapid accumulation of imaging and omics data. Ultimately, this review aims to inspire cutting-edge ML research in maximizing the potential of immunotherapies.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.