Interpretable deep learning for improving cancer patient survival based on personal transcriptomes

Sun, Bo; Chen, Liang

doi:10.1038/s41598-023-38429-7

Download PDF

Article
Open access
Published: 13 July 2023

Interpretable deep learning for improving cancer patient survival based on personal transcriptomes

Scientific Reports volume 13, Article number: 11344 (2023) Cite this article

1367 Accesses
2 Citations
1 Altmetric
Metrics details

Subjects

Abstract

Precision medicine chooses the optimal drug for a patient by considering individual differences. With the tremendous amount of data accumulated for cancers, we develop an interpretable neural network to predict cancer patient survival based on drug prescriptions and personal transcriptomes (CancerIDP). The deep learning model achieves 96% classification accuracy in distinguishing short-lived from long-lived patients. The Pearson correlation between predicted and actual months-to-death values is as high as 0.937. About 27.4% of patients may survive longer with an alternative medicine chosen by our deep learning model. The median survival time of all patients can increase by 3.9 months. Our interpretable neural network model reveals the most discriminating pathways in the decision-making process, which will further facilitate mechanistic studies of drug development for cancers.

Autosurv: interpretable deep learning framework for cancer survival analysis incorporating clinical and multi-omics data

Article Open access 05 January 2024

Deep generative neural network for accurate drug response imputation

Article Open access 19 March 2021

Advances in AI and machine learning for predictive medicine

Article Open access 29 February 2024

Introduction

Deep learning techniques take many forms and have shown great promise in various fields^1,2,3,4,5. There has been significant interest in applying deep learning to advance biological or biomedical sciences. For example, AlphaFold⁶ computationally predicts the protein structure at an unprecedented accuracy with an average error of approximately 1.6 angstroms. DCell⁷ predicts the impact of genetic mutations on cellular growth response. DrugCell⁸ embeds chemical structures of drugs into neural networks to predict the drug response of human cancer cell lines based on DNA point mutations. Besides their applications to cancer prognostics and therapeutics^9,10, deep neural networks (DNN) have also been applied to cancer survival analysis^11,12, focusing on integrating multimodal data¹³ and interpretability^14,15. However, no deep-learning method has yet considered drug treatment information in survival prediction. By incorporating the relationship between drug treatment, transcriptome, and survival, we can advance the goal of delivering personalized medicine and improving cancer prognosis.

Patients diagnosed with the same cancer and driver mutations frequently show distinct clinical features and rarely have identical responses to treatments. Genetic, transcriptomic, and other clinicopathological parameters may affect patients' survival. For some tumor types, the most significant contribution was reported from the transcriptome¹⁶. How genotypes and phenotypes are intertwined in cancer clinicopathology remains unclear, but the effect of genetic and cytogenetic alterations ought to be reflected in gene expression. At the same time, environmental impacts are also reflected in transcriptomes^17,18. With deep learning techniques, global gene expression profiling appears to be the most powerful predictor of clinical outcomes. Unlike DrugCell predicting relative cell growth from point mutations, we moved beyond the cell lines and aimed to predict the clinical survival time directly based on patients’ gene expression profiles and the chemical structures of their administered drugs. Our deep-learning model enables the optimal drug choice based on individual transcriptome data.

In this work, we performed interpretable deep learning on cancer data from The Cancer Genome Atlas (TCGA)¹⁹ involving 33 primary tumor types. Specifically, we obtained the gene expression of primary tumor patients from TCGA and their clinical records logged in the Genomic Data Commons (GDC) database, such as vital status, survival time, and prescribed clinical drugs. We developed a DNN and named it CancerIDP (Cancer Interpretable neural network based on Drug prescriptions and Personal transcriptomes). The system was trained to embed drug chemical structures and gene expression profiling in the feature space and then to predict the survival time based on those embeddings. The network architecture of CancerIDP is based on the Gene Ontology (GO) structure in human cells, similar to DrugCell.

Our model is highly predictive regarding cancer patient survival: the prediction is accurate for the hold-out testing data. We report a 96% accuracy for binary classification of short-lived or long-lived patients. Pearson’s R (i.e., Pearson correlation coefficient) is 0.937 between survival time prediction and the actual clinical record. More importantly, the separate transcriptome encoder and drug encoder empower an in-silico optimal drug selection. Through a computational brute-force search, each patient is paired with every drug to test whether an alternative drug other than the actual prescribed one works better concerning survival time. In summary, our interpretable deep learning model enables precision medicine and has great potential to improve patient health.

Results

TCGA cancer genomics and clinical data

To predict the survival time of cancer patients based on their transcriptome profiles and drug treatments, we obtained the transcriptome and clinical data from TCGA. A total of 4311 patients have both transcriptome data and clinical information such as drug prescriptions and survival time, covering 178 drugs and 33 cancer types. The most abundant samples are from breast and ovarian tumors (Fig. 1a), and the gender distribution is 63.4% females and 36.6% males (Fig. 1b). Based on the vital status (Fig. 1c), 1645 patients (38.2%) were deceased. The distribution of their days-to-death is shown in Fig. 1d. For the other 2666 (61.8%) alive or censored patients, we used the days-to-the-last-follow-up (distribution shown in Fig. 1e) to infer whether a patient was long-lived. We trained a Gene Ontology (GO)-guided deep learning network to predict the impact of different drugs on the survival time given the patient’s gene expression profile. The actual number of patients used in our network training was slightly less, depending on whether the Morgan fingerprint of the drug is available.

Two-stage training with knowledge transfer to fully utilize data

Transfer learning was initially devised to transfer the learner trained in similar but different domains to the target domain²⁰. Recent studies show that two-stage training can improve deep learning models in the data-scarce setting^21,22. Here the majority of available TCGA data are censored. Therefore, to harness the power of machine learning in personalized medicine, we devised a two-stage training schema in our CancerIDP with knowledge transfer.

In the first stage, the network weights are randomly initialized, and we aim to utilize as many patients as training samples as possible. To do that, we included both deceased and qualified censored patients to create binary classification supervision signals: long-lived or short-lived. Positive samples are long-lived patients known to be alive after 1,200 days, including patients with either days-to-death or days-to-the-last-follow-up greater than 1200. Negative samples are short-lived patients who died within 1000 days (i.e., days-to-death < 1000). The first stage's goal is to exploit as much data as possible to abstract deep representations from gene expression and drug structures for the vital status prediction (more straightforward binary classification), which will be used as a starting point in the second stage for fine-grind survival time prediction.

In the second stage, the network structure remains the same (Fig. 2), and the weights are initialized as those in the best-performing model on the validation set from the first stage. The assumption is that the feature extractor learned in the binary survival status classification can be a good starting point for a more fine-grind survival time prediction. Only deceased patients with exact days-to-death records are used in the second stage. The binary network head from the first stage is discarded, and a new regression head is trained from scratch to predict the days-to-death given a patient’s gene expression profile and the prescribed drug. In the actual training, the days-to-death was transformed to months-to-death on a log scale. Figure 3a shows that the best-performing model achieves 96% classification accuracy in the first stage on the hold-out data. In the second stage, Pearson’s R equals 0.937 between the ground truth months-to-death and the prediction (Fig. 3b).

The memorization pitfall when predicting survival responses from drug treatments

Deep learning models have impressive expressiveness, but in practice, we noticed a potential pitfall that comes with the ability of the deep network to emulate any function. We called this the “memorization effect” in our survival prediction. In our earlier experiments, we included all drugs in the deep learning model and prepared the training and the hold-out validation sets by the random split. However, we observed that for some rarely prescribed drugs with only one or two supporting data points (i.e., 1–2 patients) in the training set, the network tends to simply “memorize” such drugs instead of generalizing meaningful embedding. And then, in the testing phase, the network just outputs the survival time of the patients that it has memorized (observed) in the training phase. Such effect is demonstrated in Fig. 3c: some of the predicted months-to-death values (log scale) are the same since the model only observed one patient with this specific drug during the training, and the model output the survival time of that patient in the testing when fed into this particular drug.

To counter the memorization effect, we eliminated drugs with less than five supporting data points in the training set. Although the number of training samples decreases slightly, the network does not suffer from the memorization effect but learns meaningful embeddings and is more generalizable. Pearson’s R improves from 0.834 to 0.937 after eradicating the memorization effect in the months-to-death prediction (Fig. 3c,d).

Improving patients’ survival based on personalized medicine enabled by CancerIDP

Distinct from other deep-learning-based cancer survival time predictions, our model not only predicts survival time but also enables optimal drug selection for each patient based on the expression encoder and the drug encoder. From the existing clinical data, we know that patient X with transcriptome T is prescribed drug D, and the survival time is S. This triplet (T, D, S) forms a training data point for our network. With the predictive ability achieved through training, the network has learned to encode transcriptomes and drugs as predictive deep embeddings. We can then predict the drug response in silico by pairing each patient with each medication and identifying the optimal drug (D*) with the longest survival time for patient X.

In this TCGA cancer genomics dataset, based on our in-silico drug selection, we can find a drug different from the actual prescription for 27.4% of patients, with which the predicted survival time is longer than the observed survival time (Fig. 4a). Figure 4b compares the actual survival time for the prescribed drugs and the predicted survival time if paired with the deeply-learned optimal drugs. Deep-learning-enabled personalized medicine can significantly improve patient survival by increasing the median of months-to-death from 27.1 to 31.0. Noticeably, 23.4% of patients can have a relative improvement in survival time by at least 5% (Fig. 4c, colored in orange, the size of the mark is proportional to the relative improvement). The most significant improvement is 2.97-fold, where the months-to-death is 24.7 for the prescribed drug and 73.4 for the optimal drug suggested in-silico.

To support our in-silico drug discovery, we randomly selected a patient (TCGA-ZH-A8Y4, patient 1 in Supplementary Fig. S1) and identified nine more patients with the most similar transcriptome profiling according to the cosine similarity between transcriptomes (normalized to unit vectors). Three of the ten patients had their predicted optimal drugs matching their actual prescribed drugs. Five drugs selected by CancerIDP were actually prescribed to at least one of the ten patients. Particularly, the drug Halichondrin B (canonical SMILE: CC1CC2CCC3C(=C)CC(O3)CCC45CC6C(O4)C7C(O6)C(O5)C8C(O7)CCC(O8)CC(=O)OC9C(C3C(CC4C(O3)CC3(O4)CC4C(O3)C(CC3(O4)CC(C4C(O3)CC(O4)C(CC(CO)O)O)C)C)OC9CC(C1=C)O2)C, PubChem CID: 44581725) was prescribed to patients 3 and 9, and predicted to be the optimal drug for patients 4, 5, and 9. After being prescribed Halichondrin B, patient 3 survived for 24.69 months, and patient 9 survived for 34.12 months. In contrast, patients 4 and 5 only survived for 18.21 and 29.45 months under different drug treatments. As predicted by CancerIDP, prescribing Halichondrin B could be beneficial for patients 4 and 5, as it may lead to longer survival times of 22.21 and 33.35 months, respectively, which is similar to its effect on patient 9. Thus, information obtained from CancerIDP could be used as input for an algorithmic expert system to aid clinical decision-making.

To further support our in-silico drug discovery, we also checked the newly approved cancer drugs since 2010 (summarized in²³). These newly approved drugs usually increase cancer patients’ survival time. Our CancerIDP model included 46 patients involving these recently approved drugs. Of these, 35 patients had predicted drugs matching their actual prescribed newly approved drugs. Our analysis using CancerIDP showed that prescribing these newly approved drugs could increase the survival times of seven patients who were receiving other drugs. In contrast, CancerIDP predicted other drugs for only four patients who were prescribed these new drugs. These results may suggest that our CancerIDP predictions are more aligned with the recommendations of medical companies. Thus, deep representation learning provides great promise here. As the system is fed with more abundant data, such data-driven personalized drug prescriptions will become more valuable and trustworthy.

Learn transcriptome embeddings that are predictive of months-to-death

The final prediction of the network is based on two components: the transcriptome embedding learned from the GO-guided encoder, and the drug embedding from the autoencoder. The drug autoencoder is simply a black-box model that learns representations from drug chemical structures expressed in the Morgan fingerprints. We are currently unable to interpret the mechanisms. However, our transcriptome embedding is based on the hierarchy structure of GO terms and has greater interpretability of how genes act together to predict drug survival responses.

We inspected the transcriptome embedding learned from all patients using t-SNE and visualized the top two principal components (Fig. 5a–c). The transcriptome embedding is informative about months-to-death, as shown in Fig. 5a. Patients with similar transcriptome embeddings share similar months-to-death values. The expression of some marker genes such as ESR1 (implicated in hormone resistance and anti-estrogen therapies for breast cancer, Fig. 5b) or KIFC3 (encodes a member of the kinesin-14 family of microtubule motors, Fig. 5c) is consistent with the spatial pattern of the transcriptome embedding, indicating the contribution of these genes to the transcriptome embedding for drug survival responses.

Discover the most discriminating GO components for a cancer patient

Cellular responses of drugs depend on both transcriptomes and drug chemical structures²⁴. In CancerIDP, we encode the expression information and drug information via transcriptome GO encoder and drug autoencoder (details in “Methods”). Since the neural network structure of the transcriptome GO encoder is designed according to the GO system structure in human cells, it is the key to the interpretability of CancerIDP. Specifically, we build the network based on the GO system. Each GO term (node) is represented by a multi-layer perceptron (MLP). These network nodes take inputs from child GO terms and send outputs to parent GO terms. The state of each GO term is represented by the activation of MLP, and the connectivity weights between nodes are determined through end-to-end network training. We applied the attention mechanism^25,26 to learn the most important GO terms for survival time prediction (details in “Methods”). Attention weight assigned to each GO node indicates the relative importance of that GO term to the prediction task at hand. After training, when we forward pass a patient’s gene expression profile, we interpret the nodes with the highest attention weights as the most distinguished GO terms for survival determination for that patient.

Figure 5d shows an example of patient TCGA-57-1582 with ovarian cancer. The most discriminating GO terms are highlighted in red, including catabolic processes like carbohydrate catabolic process (GO:0016052), macromolecule catabolic process (GO:0009057), T-cell co-stimulation (GO:0031295), and some signaling pathways involved in phagocytosis. Therefore, such a GO-guided encoder informs the activated pathway at the individual level, paving the way for the mechanistic study of personalized treatment.

Similarly, we forward passed each patient’s gene expression to the trained GO encoder, and obtained attention weights of GO terms for each patient. We then summarized the most common GO terms with large average attention weights across patients. The discriminating GO terms with average attention weights across patients ≥ 0.15 are provided in Supplementary Table S1. Their positions in the pathway hierarchy are shown in Supplementary Fig. S2. Important GO terms include T cell mediated immune response to tumor cell (GO:0002424), negative regulation of G2/M transition of mitotic cell cycle (GO:0010972), extrinsic apoptotic signaling pathway via death domain receptors (GO:0008625), and so on.

Discussion

Personalized medicine has great potential to significantly improve patient outcomes by taking into account an individual’s genomics information. Here, we developed an interpretable deep learning model CancerIDP on TCGA cancer genomics and clinical data. The data, covering patients diagnosed with 33 different tumor types and their treatment drugs, are publicly available from the National Cancer Institute’s (NCI) Genomic Data Commons (GDC) database. With proper training techniques, we proved that cancer survival time is highly predictable from high-dimensional transcriptomic data. A 96% accuracy can be achieved to distinguish between long- or short-lived patients. Pearson’s correlation coefficient on exact months-to-death between predictions and actual observations is 0.937. Our transcriptome encoder was designed to simulate the GO hierarchy in human cells. Instead of considering DNA point mutations from fixed cell lines like in DrugCell, our model takes gene expression data from individuals. Transcriptomes are dynamic and heterogenous, reflecting differences in genetics, environment, and lifestyle. They capture cellular changes caused by disease progression and therapy responses, while genetic data cannot. The application of transcriptomics in personalized medicine has rapidly emerged and made important impact to diseases such as cancers, cardiovascular diseases, and neurodevelopmental disorders²⁷. Our incorporation of transcriptomes into the deep learning model further harnesses their power to yield transformative personalized medicine contributions.

While much progress has been made in improving the interpretability of DNNs, many of them focus on how different features of the input contribute to the output of the model^28,29. Such techniques may be used to figure out which input genes are important for survival prediction. However, it is equally important or even more important to understand how these genes act together and contribute to patients’ survival time. The interpretability of our CancerIDP relies on the GO hierarchy in the model architecture design. This is totally different from attribution analysis or saliency analysis. The activating neurons point to the most discriminating GO pathways for each individual and inform us which molecular systems in a cell play a vital role in the patient’s survival prediction. When the system finishes training with high predicting capability, we presume the network forms an internal understanding of the relationship between patients’ survival, their transcriptomes, and drug chemical structures.

Based on the satisfactory prediction performance of CancerIDP, we hence propose an in-silico drug selection to search for a more desirable treatment. This in-silico drug optimization, combined with the interpretability of the GO structured expression encoder, provides data-driven precision medicine guidance for clinical practice. However, the implementation of personalized medicine is complex and multifaceted. More research involving collaborations between researchers and health providers is needed to fully realize its potential. Our work is the proof-of-concept of a drug selection algorithm powered by deep learning and computation. With more data becoming available in the future, more comprehensive frameworks can be developed. Even purely computational, the more advanced networks trained with more data will provide value in at least guiding drug design. It should be mentioned that in cases where a combination of drugs was used as a treatment, we treated each drug as an individual treatment. This limitation was due in part to the relatively small sample size of our training data. However, modeling the combination of drugs in-silico could be an interesting avenue for future research. Given that complex diseases such as cancer typically require combination therapies, this would be an important area to explore further.

In our cancer survival analysis, the survival time is the time from drug treatment to patient death. Real-life databases often record incomplete information and lack the exact survival time of some patients. When the survival time is unknown for a patient who lost clinical follow-up, such an observation is referred to as a censored data point. The last available time point is known as censoring time, usually days-to-the-last-follow-up. Censoring observations, however, contain useful information for cancer survival modeling as censoring time provides a lower bound for survival time. Such characteristic is common to real-life data: more data contain coarse information than precise information. This calls for advanced models to utilize the data fully.

Recently, two-stage (pre-train + fine-tuning) approaches^30,31 were proposed to harness the deep learning models on data with such characteristics. In general, much data with coarse labeling are gathered to train simple classifiers in the pre-train stage. And then, knowledge obtained in the pre-train stage is transferred by sharing the feature extractor weights learned from the simple classifier. Finally, the learner is fine-tuned by training samples with fine-grind labeling. In the first stage of our model, we propose a binary classification such that patients known to be deceased within 1000 days of treatment are labeled short-lived (y = 0) and patients with a survival time lower bound (days-to-death or days-to-the-last-follow-up) greater than 1200 days are labeled as long-lived (y = 1). In doing so, some censored patients' cancer transcriptomics and clinical data can be incorporated into the network training. Thereby, we utilize the lower-bound information contained in the censored patients to its full potential. Then in the second stage, by sharing encoder weights, we transfer the learned knowledge for transcriptome and drug encoding and train a new network head for exact survival time prediction. In the second stage, only deceased patients with exact survival time labels are included. In our ablation study, we found that the Pearson's correlation coefficient for months-to-death prediction was 0.891 when trained with a single stage, and increased to 0.937 in the two-stage training paradigm.

Previous deep-learning models for survival analysis were applied for a single or small number of cancer types and did not consider drug and transcriptome data. These include colorectal cancer survival prediction based on histologic images and other clinical variables¹⁵, lung cancer survival prediction based on clinicopathologic variables¹², and prediction for ten cancer types based on histopathologic images³². Our work considers all 33 primary cancer tumor types. This raises the question of whether the tissue type should be distinguished before training since tumor tissue type can be predictive to some extent for cancer survival³³. We compared and found that when incorporating additional clinical information of the biopsies (sex, ethnicity, tissue of origin, and age at diagnosis) as additional features, the performance is on par with gene expression and drug embedding only. This indicates that the transcriptomic heterogeneity provides enough prognostic information for deep-learning-based survival prediction and that the embedding learned from high-dimensional expression data embraces human-interpretable knowledge, at least for common features such as genders and tissues. This makes our model innately simple: it requires no extra feature engineering or data processing efforts and favors end-to-end training. The only preprocessing we performed was a cross-sample normalization for the expression matrix. The end-to-end system also eliminates potential problems for clinical data since many patient data can be missing, and how to impute missing values in a deep-learning system post significant challenges for network training^34,35.

In summary, our CancerIDP demonstrates the great potential of deep learning models to improve cancer patient survival based on personal transcriptomes in precision medicine. Furthermore, with unprecedented amounts of data available in the near future, deep learning models will further enhance the predictive performance and derive essential values from the data to benefit health through such personalized approaches.

Methods

Clinical drug data

Both clinical data and transcriptome profiling RNA-seq data used in this paper for TCGA were obtained from the GDC data portal (https://portal.gdc.cancer.gov/repository). Clinical supplement files for the human cancer dataset in BCR biotab format were downloaded from the GDC portal. Drug prescription, if available, for each TCGA patient was logged in those clinical supplements. Raw data cover 12,571 patients and 384 drugs. We first removed patients without clinical information and drugs without matched standard names, resulting in 4431 patients and 202 unique drugs. We standardized drugs by querying the GDISC database³⁶, which was established by querying PubChem entries for each compound. Among the filtered drugs, 178 of them have at least five survival data points. For patients (103 such patients) that had a tandem usage (different drugs prescribed chronologically) of multiple drugs or prescribed with a combination of drugs at the same time, we pair the patient’s transcriptome profile with each drug signature to be used as a training data point. To further convert drug data to a quantitative representation that can be used in the network encoder, each drug is firstly converted to a simplified molecular-input line-entry system (SMILES) representation of its chemical structure via PubChem (https://pubchem.ncbi.nlm.nih.gov) and then encoded as a numerical vector with a fixed length (n = 2048) of the Morgan fingerprint. The conversion of the SMILES representation to the Morgan fingerprint of drugs was performed with the open-source Python library RDKit.

Expression data

Transcriptome profiling via RNA-seq was used as expression data. We used Fragments Per Kilobase of transcript per Million mapped reads (FPKM) expression measurements for samples with “Sample Type” being “Primary tumor.” We focus on the 3008 genes identified from the top 15% most frequently mutated genes in Cancer Cell Line Encyclopedia (CCLE)³⁷ that were annotated to Gene Ontology (GO)³⁸ terms. This list of genes is also referred to as “DrugCell genes”⁸. Input normalization is extensively used in deep learning models to facilitate gradient learning and speed up convergence. Compared to raw FPKM values, standardized FPKM values across individuals (subtracting its mean and dividing by its standard deviation) speed up training and convergence and improve the final performance by 1%. The network we described in the main text was trained with the standardized FPKM.

Survival data

Based on the vital status of the patients, we designed two experiments. The first was designed as a binary classification, with patients known to be deceased within 1000 days as short-lived (y = 0) and patients known to be alive after 1200 days as long-lived (y = 1). A total of 2399 patients were included (1121 short-lived patients and 1278 long-lived patients). The second experiment is a more fine-grind model, wherein we used all deceased patients (a total of 1631 patients) to train the model to predict the clinical months-to-death data recorded for each patient.

In silico optimal drug selection

To perform the in silico optimal drug selection for every patient, we fed in each of the drugs to our CancerIDP. Thus, no matter whether the patient had ever taken the drug or not, we paired the drug’s chemical structure and the patient’s transcriptome expression and fed into the model. The predicted survival time was recorded. The drug corresponding to the longest predicted survival time was selected as the optimal drug for the considered patient.

Neural network CancerIDP construction

The neural network is composed of two branches. The transcriptome GO encoder takes the log-transformed expression values as input and produces expression embedding. The drug autoencoder takes the Morgan fingerprints of drugs and learns the deep representation of drugs. The transcriptome GO encoder is built by mimicking the directed acyclic graph (DAG) formed by GO terms with the 3008 cancer genes as leaf nodes. The GO term architecture is based on DrugCell⁸. From the leaf nodes (gene expression values) to the root formed by the GO hierarchy, each internal node (GO term) is modeled with k (k = 6) neurons through a linear layer and a batch normalization layer³⁹. And the input to the linear layer is the concatenation of all child states of the internal nodes. The drug autoencoder has no prior information injected and is realized by a multilayer perceptron with three layers and the number of neurons for each layer is 64, 32, and 4. Finally, the network head combines the gene expression embedding and deeply-learned drug representation to predict the binary long- or short-lived status or the months-to-death via a non-linear classifier.

Identification of the most discriminating GO terms

The gene expression encoder is structured according to the GO hierarchy in human cells. The internal state of each term is finally brought down to a single neuron value with a neuron activation ReLU⁴⁰. A standard attention mechanism was applied to the child nodes (genes or child terms in the GO hierarchy) when computing the hidden representation for parent nodes (leaf or internal GO terms). This approach helps us track and interpret the important child nodes in relation to calculating the parent value. Specifically, for the root and depth-1 terms (where each node has many children), we consider the child nodes with an attention weight > 0.05 as the most discriminating GO terms. For depth-2 terms, we consider an attention weight > 0.1 as most discriminating, and for depth-3 to leaf terms, an attention weight > 0.2 is considered most discriminating.

Data availability

All data analyzed in this study are described in the article. The source codes and detailed analysis steps are available at https://github.com/bsun0802/CancerIDP.

References

He, K., Zhang, X., Ren, S. & Sun, J. In Proceedings of the IEEE International Conference on Computer Vision. 1026–1034.
Liu, Z. et al. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022.
Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28, 91–99 (2015).
Google Scholar
Sun, B., Li, B., Cai, S., Yuan, Y. & Zhang, C. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7352–7362.
Sun, B., Yan, J., Zhou, X. & Zheng, Y. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 84–93.
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. https://doi.org/10.1038/s41586-021-03819-2 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Guo, C. et al. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication. 75–86.
Kuenzi, B. M. et al. Predicting drug response and synergy using a deep learning model of human cancer cells. Cancer Cell 38, 672-684.e676. https://doi.org/10.1016/j.ccell.2020.09.014 (2020).
Article CAS PubMed PubMed Central Google Scholar
Rafique, R., Islam, S. M. R. & Kazi, J. U. Machine learning in the prediction of cancer therapy. Comput. Struct. Biotechnol. J. 19, 4003–4017. https://doi.org/10.1016/j.csbj.2021.07.003 (2021).
Article CAS PubMed PubMed Central Google Scholar
Tran, K. A. et al. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med. 13, 152. https://doi.org/10.1186/s13073-021-00968-x (2021).
Article PubMed PubMed Central Google Scholar
Hao, L., Kim, J., Kwon, S. & Ha, I. D. Deep learning-based survival analysis for high-dimensional survival data. Mathematics 9, 1244 (2021).
Article Google Scholar
Wang, J. et al. SurvNet: A novel deep neural network for lung cancer survival analysis with missing values. Front. Oncol. 10, 588990. https://doi.org/10.3389/fonc.2020.588990 (2020).
Article PubMed Google Scholar
Vale-Silva, L. A. & Rohr, K. MultiSurv: Long-term cancer survival prediction using multimodal deep learning. medRxiv, 2020.2008. 2006.20169698 (2020).
Hao, J., Kim, Y., Mallavarapu, T., Oh, J. H. & Kang, M. Interpretable deep neural network for cancer survival analysis by integrating genomic and clinical data. BMC Med. Genomics 12, 189. https://doi.org/10.1186/s12920-019-0624-2 (2019).
Article PubMed PubMed Central Google Scholar
Wulczyn, E. et al. Interpretable survival prediction for colorectal cancer using deep learning. NPJ Digit. Med. 4, 71 (2021).
Article PubMed PubMed Central Google Scholar
Gerstung, M. et al. Combining gene mutation with gene expression data improves outcome prediction in myelodysplastic syndromes. Nat. Commun. 6, 5901. https://doi.org/10.1038/ncomms6901 (2015).
Article ADS CAS PubMed Google Scholar
Choi, J. K. & Kim, S. C. Environmental effects on gene expression phenotype have regional biases in the human genome. Genetics 175, 1607–1613 (2007).
Article CAS PubMed PubMed Central Google Scholar
Tung, J. & Gilad, Y. Social environmental effects on gene regulation. Cell Mol. Life Sci. 70, 4323–4339. https://doi.org/10.1007/s00018-013-1357-6 (2013).
Article CAS PubMed PubMed Central Google Scholar
Tomczak, K., Czerwinska, P. & Wiznerowicz, M. The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. 19, A68-77. https://doi.org/10.5114/wo.2014.47136 (2015).
Article Google Scholar
Zhuang, F. et al. A comprehensive survey on transfer learning. Proc. IEEE 109, 43–76 (2020).
Article Google Scholar
Gupta, A., Thadani, K. & O’Hare, N. In Proceedings of the 28th International Conference on Computational Linguistics. 1061–1066.
Wang, Y. et al. A tree ensemble-based two-stage model for advanced-stage colorectal cancer survival prediction. Inf. Sci. 474, 106–124 (2019).
Article Google Scholar
Kunnumakkara, A. B. et al. Cancer drug development: The missing links. Exp. Biol. Med. 244, 663–689. https://doi.org/10.1177/1535370219839163 (2019).
Article CAS Google Scholar
Turner, R. M., Park, B. K. & Pirmohamed, M. Parsing interindividual drug variability: An emerging role for systems pharmacology. Wiley Interdiscip. Rev. Syst. Biol. Med. 7, 221–241. https://doi.org/10.1002/wsbm.1302 (2015).
Article PubMed PubMed Central Google Scholar
Jaegle, A. et al. In International Conference on Machine Learning. 4651–4664 (PMLR).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
Google Scholar
Mubarak, G. & Zahir, F. R. Recent major transcriptomics and epitranscriptomics contributions toward personalized and precision medicine. J. Pers. Med. https://doi.org/10.3390/jpm12020199 (2022).
Article PubMed PubMed Central Google Scholar
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4765–4774 (2017).
Google Scholar
Ribeiro, M. T., Singh, S. & Guestrin, C. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
Li, H., Baucom, B. & Georgiou, P. Linking emotions to behaviors through deep transfer learning. PeerJ Comput. Sci. 6, e246 (2020).
Article PubMed PubMed Central Google Scholar
Mendes, A., Togelius, J. & Coelho, L. D. S. Multi-stage transfer learning with an application to selection process. arXiv preprint arXiv:2006.01276 (2020).
Wulczyn, E. et al. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS ONE 15, e0233678. https://doi.org/10.1371/journal.pone.0233678 (2020).
Article CAS PubMed PubMed Central Google Scholar
Iorio, F. et al. A Landscape of pharmacogenomic interactions in cancer. Cell 166, 740–754. https://doi.org/10.1016/j.cell.2016.06.017 (2016).
Article CAS PubMed PubMed Central Google Scholar
Cheng, C. Y., Tseng, W. L., Chang, C. F., Chang, C. H. & Gau, S. S. A deep learning approach for missing data imputation of rating scales assessing attention-deficit hyperactivity disorder. Front. Psychiatry 11, 673. https://doi.org/10.3389/fpsyt.2020.00673 (2020).
Article PubMed PubMed Central Google Scholar
Emmanuel, T. et al. A survey on missing data in machine learning. J. Big Data 8, 1–37 (2021).
Article Google Scholar
Spainhour, J. C. G., Lim, J. & Qiu, P. GDISC: A web portal for integrative analysis of gene–drug interaction for survival in cancer. Bioinformatics 33, 1426–1428. https://doi.org/10.1093/bioinformatics/btw830 (2017).
Article CAS PubMed PubMed Central Google Scholar
Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607. https://doi.org/10.1038/nature11003 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
Ashburner, M. et al. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29. https://doi.org/10.1038/75556 (2000).
Article CAS PubMed PubMed Central Google Scholar
Ioffe, S. & Szegedy, C. In International Conference on Machine Learning. 448–456 (pmlr).
Nair, V. & Hinton, G. E. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.

Download references

Funding

This work was partially supported by NIH grants R01GM137428, R01NS15276 and R01NS104041.

Author information

Authors and Affiliations

Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA, 90089, USA
Bo Sun & Liang Chen

Authors

Bo Sun
View author publications
You can also search for this author in PubMed Google Scholar
Liang Chen
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

B.S. and L.C. conceived and designed the work. B.S. conducted formal analyses. L.C. supervised the study. B.S. and L.C. wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Liang Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Sun, B., Chen, L. Interpretable deep learning for improving cancer patient survival based on personal transcriptomes. Sci Rep 13, 11344 (2023). https://doi.org/10.1038/s41598-023-38429-7

Download citation

Received: 05 August 2022
Accepted: 07 July 2023
Published: 13 July 2023
DOI: https://doi.org/10.1038/s41598-023-38429-7

This article is cited by

From understanding diseases to drug design: can artificial intelligence bridge the gap?
- Anju Choorakottayil Pushkaran
- Alya A. Arabi
Artificial Intelligence Review (2024)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.