Introduction

Cancer is a complex disease, and its progression involves diverse processes in the patient’s body1. Consequently, the cancer research community generates massive amounts of molecular and phenotypic data to study cancer hallmarks as comprehensively as possible. The rapid accumulation of omics data catalysed by breakthroughs in high-throughput technologies has given rise to the notion of ‘big data’ in cancer, which we define as a dataset with two basic properties: first, it contains abundant information that can give novel insights into essential questions, and second, its analysis demands a large computer infrastructure beyond equipment available to an individual researcher — an evolving threshold, as computational resources grow exponentially following Moore’s law. A model example of such big data is the dataset collected by The Cancer Genome Atlas (TCGA)2. TCGA contains 2.5 petabytes of raw data — an amount 2,500 times greater than the storage of a modern laptop in 2022 — and requires specialized computers for storage and analysis. Further, between its initial release in 2008 and March 2022, at least 10,242 articles and 11,054 NIH grants cited TCGA according to a PubMed search, demonstrating its transformative value as a community resource that has markedly driven cancer research forward.

Big data are not unique to the cancer field, and play an essential role in many scientific disciplines, notably cosmology, weather forecasting and image recognition. However, datasets in the cancer field differ from those in other fields in several key aspects. First, the size of cancer datasets is typically markedly smaller. For example, in March 2022, the US National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database3 — the largest genomics data repository to our knowledge — contained approximately 1.1 million samples with ‘cancer’ as a keyword, whereas ImageNet, the largest public repository for computer vision, contains 15 million images4. Second, cancer research data are typically heterogeneous and may contain many dimensions measuring distinct aspects of cellular systems and biological processes. Modern multi-omics workflows may generate genome-wide mRNA expression, chromatin accessibility and protein expression data on single cells5, together with a spatial molecular readout6. The comparatively limited data size in each modality and the high heterogeneity among them necessitate the development of innovative computational approaches for integrating data from different dimensions and cohorts.

The subject of big data in cancer is of immense scope, and it is impossible to cover everything in one review. We therefore focus on key big-data analyses that have led to conceptual advances in our understanding of cancer biology and have impacted disease diagnosis and treatment decisions. Further, we cite relevant reviews in the pertinent sections to direct interested readers to additional resources. We acknowledge that our limited selection of topics and examples may omit important work, for which we sincerely apologize.

In this Review, we begin by describing major data sources. Next, we review and discuss data analysis approaches designed to leverage big datasets for cancer discoveries. We then introduce ongoing efforts to harness big data in clinically oriented, translational studies, the primary focus of this Review. Finally, we discuss current challenges and future steps to push forward big data use in cancer.

Common data types

There are five basic data types in cancer research: molecular omics data, perturbation phenotypic data, molecular interaction data, imaging data and textual data. Molecular omics data describe the abundance or status of molecules in cellular systems and tissue samples. Such data are the most abundant type generated in cancer research from patient or preclinical samples, and include information on DNA mutations (genomics), chromatin or DNA states (epigenomics), protein abundance (proteomics), transcript abundance (transcriptomics) and metabolite abundance (metabolomics) (Table 1). Early studies relied on data from bulk samples and well-designed computational approaches to provide insights into cancer progression, tumour heterogeneity and tumour evolution7,8,9,10. Following the development of single-cell technologies and decreases in sequencing costs, current molecular data can be generated at multisample and single-cell levels11,12 and reveal tumour heterogeneity and evolution at a much higher resolution. Furthermore, genomic and transcriptomic readouts can include spatial information13, revealing clonal evolution within distinct tumour regions and gene expression changes associated with clone-specific aberrations. Although more limited in resolution, conventional bulk analyses are still useful for analysing large patient cohorts as the generation of single-cell and spatial data is costly and often feasible for only a few tumours per study.

Table 1 Common molecular omics data types in cancer research

Perturbation phenotypic data describe how cell phenotypes, such as cell proliferation or the abundance of marker proteins, are altered following the suppression or amplification of gene levels14 or drug treatments15,16. Common phenotyping experiments include perturbation screens using CRISPR knockout17, interference or activation18; RNA interference19; overexpression of open reading frames20; or treatment with a library of drugs15,16. As a limitation, the generation of perturbation phenotypic data from clinical samples is still challenging due to the requirement of genetically manipulable live cells.

Molecular interaction data describe the potential function of molecules through their interactions with diverse partners. Common molecular interaction data types include data on protein–DNA interactions21, protein–RNA interactions22, protein–protein interactions23 and 3D chromosomal interactions24. Similar to perturbation phenotypic data, molecular interaction datasets are typically generated using cell lines as their generation requires a large quantity of material that often exceeds that available from clinical samples.

Clinical data such as health records25, histopathology images26 and radiology images27,28 can also be of considerable value. The boundary between molecular omics and image data is not absolute as both can include information of the other type, for example in datasets that contain imaging scans and information on protein expression from a tumour sample (Table 1).

Data repositories and analytic platforms

We provide an overview of key data resources for cancer research organized in three categories. The first category comprises resources from projects that systematically generate data (Table 2); for example, TCGA generated transcriptomic, proteomic, genomic and epigenomic data for more than 10,000 cancer genomes and matched normal samples, spanning 33 cancer types. The second category describes repositories presenting processed data from the aforementioned projects (Table 3), such as the Genomic Data Commons, which hosts TCGA data for downloading. The third category includes Web applications that systematically integrate data across diverse projects and provide interactive analysis modules (Table 4). For example, the TIDE framework systematically collected public data from immuno-oncology studies and provided interactive modules to study pathways and regulation mechanisms underlying tumour immune evasion and immunotherapy response29.

Table 2 Large-scale projects generating cancer genomic datasets
Table 3 Data repositories hosting cancer genomics data
Table 4 Web applications that enable interactive analysis of cancer datasets

In addition to cancer-focused large-scale projects enumerated in Table 2, many individual groups have deposited genomic datasets that are useful for cancer research in general databases such as GEO3 and ArrayExpress30. Curation of these datasets could lead to new resources for cancer biology studies. For example, the PRECOG database contains 166 transcriptomic studies collected from GEO and ArrayExpress with patient survival information for querying the association between gene expression and prognostic outcome31.

Integrative analysis

Although data-intensive studies may generate omics data on hundreds of patients, the data scale in cancer research is still far behind that in other fields, such as computer vision. Cross-cohort aggregation and cross-modality integration can markedly enhance the robustness and depth of big data analysis (Fig. 1). We discuss these strategies in the following subsections.

Fig. 1: Considerations for using big data in translational applications and basic research.
figure 1

Clinical decisions, basic research and the development of new therapies should consider two orthogonal dimensions when leveraging big-data resources: integrating data across many data modalities and integrating data from different cohorts, which may include the transfer of knowledge from pre-existing datasets.

Cross-cohort data aggregation

Integration of datasets from multiple centres or studies can achieve more robust results and potentially new findings, especially where individual datasets are noisy, incomplete or biased by certain artefacts. A landmark example of cross-cohort data aggregation is the discovery of the TMPRSS2–ERG fusion and the less frequent TMPRSS2–ETV1 fusion as oncogenic drivers in prostate cancer. A compendium analysis across 132 gene-expression datasets representing 10,486 microarray experiments first identified ERG and ETV1 as highly expressed genes in six independent prostate cancer cohorts32; further studies then identified their fusions with TMPRSS2 as the cause of ERG and ETV1 overexpression. Another example is an integrative study of tumour immune evasion across many clinical datasets that revealed that SERPINB9 expression consistently correlates with intratumoural T cell dysfunction and resistance to immune checkpoint blockade29. Further studies found SERPINB9 activation to be an immune checkpoint blockade resistance mechanism in cancer cells29 and immunosuppressive cells33.

A general approach for cross-cohort aggregation is to obtain public datasets that are related to a new research topic or have similar study designs to a new dataset. However, use of public data for a new analysis is challenging because the experimental design behind each published dataset is unique, requiring labour-intensive expert interpretation and manual standardization. A recent framework for data curation provides natural language processing and semi-automatic functions to unify datasets with heterogeneous meta-information into a format usable for algorithmic analysis34 (Framework for Data Curation in Table 3).

Although data aggregation may generate robust hypotheses, batch effects caused by differences between laboratories, individual researchers’ techniques, platforms or other non-biological factors may mask or reduce the strength of the signals uncovered35, and correcting for these effects is therefore a critical step in cross-cohort aggregation36,37. Popular batch effect correction approaches include the ComBat package, which uses empirical Bayes estimators to compute corrected data36, and the Seurat package, which creates integrated single-cell clusters anchored on similar cells between batches38. Despite the availability of batch correction methods, analysis of both original and corrected data is essential to draw reliable conclusions as batch correction can introduce false discoveries39.
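To illustrate the core idea behind location-and-scale batch adjustment, the following minimal Python sketch standardizes each gene within a batch and rescales it to the global distribution. This is only a simplified illustration: ComBat additionally shrinks the per-batch estimates with empirical Bayes, and the names used here (`expr`, `batch_labels`) are placeholders rather than the interface of any specific package.

```python
# Minimal sketch of location-scale batch adjustment on a genes x samples
# expression matrix. ComBat additionally applies empirical Bayes shrinkage
# to the per-batch estimates; only the core standardization step is shown.
import numpy as np
import pandas as pd

def simple_batch_adjust(expr: pd.DataFrame, batch_labels: pd.Series) -> pd.DataFrame:
    """expr: genes x samples; batch_labels: batch assignment per sample (assumes >1 sample per batch)."""
    adjusted = expr.copy()
    grand_mean = expr.mean(axis=1)
    grand_std = expr.std(axis=1).replace(0, 1.0)
    for batch in batch_labels.unique():
        cols = batch_labels[batch_labels == batch].index
        sub = expr[cols]
        batch_mean = sub.mean(axis=1)
        batch_std = sub.std(axis=1).replace(0, 1.0)
        # standardize each gene within the batch, then rescale to the global distribution
        adjusted[cols] = (sub.sub(batch_mean, axis=0)
                             .div(batch_std, axis=0)
                             .mul(grand_std, axis=0)
                             .add(grand_mean, axis=0))
    return adjusted
```

In practice, dedicated implementations (for example, ComBat in the sva R package) would be used, and, as noted above, conclusions should be checked against both the original and the corrected data.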

Cross-modality data integration

Cross-modality integration of different data types is a promising and productive approach for maximizing the information gained from data as the information embedded in each data type is often complementary and synergistic40. Cross-modality data integration is exemplified by projects such as TCGA, which provides genomic, transcriptomic, epigenomic and proteomic data on the same set of tumours (Table 2). Cross-modality integration has led to many novel insights regarding factors associated with cancer progression. For example, the phosphorylation status of proteins in the EGFR signalling pathway — an indicator of EGFR signalling activity — is highly correlated with the expression of genes encoding EGFR ligands in head and neck cancers but not with receptor expression, copy number alterations, protein levels or phosphorylation status41, suggesting that patients should be stratified to receive anti-EGFR therapies on the basis of ligand abundance instead of receptor status.

A recent example of cross-modality data integration used single-cell multi-omics technologies that allowed genome-wide transcriptomics and chromatin accessibility data to be measured together with a handful of proteins of interest42. The advantages of using cross-modality data were clear: during cell lineage clustering, CD8+ T cell and CD4+ T cell populations could be clearly separated in the protein data but were blended when the transcriptome was analysed42. Conversely, dendritic cells formed distinct clusters when assessed on the basis of transcriptomic data, whereas they mixed with other cell types when assessed on the basis of cell-surface protein levels. Chromatin accessibility measured by assay for transposase-accessible chromatin using sequencing (ATAC-seq) further revealed T cell sublineages by capturing lineage-specific regulatory regions. For each cell, the study first identified neighbouring cells through similarities in each data modality. Then, the study defined the weight of each data modality in the lineage classification as its accuracy in predicting the molecular profile of the target cell from the profiles of the neighbouring cells. The resulting cell clustering, using the weighted distance averaged across single-cell RNA, protein and chromatin accessibility data, was then shown to improve cell lineage separation42.
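A minimal sketch of this weighting idea is shown below: for each cell, a modality receives more weight when the cell's own profile is well predicted by the average of its nearest neighbours in that modality, and pairwise distances are then combined as a weighted average across modalities. This is a simplified illustration in the spirit of weighted nearest-neighbour integration, not the published method itself, and the function names and parameters are illustrative.

```python
# Simplified sketch of per-cell modality weighting for multimodal integration:
# a modality is up-weighted for a cell when that cell's profile is well predicted
# by the mean profile of its nearest neighbours in that modality.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import cdist

def modality_weights(modalities, k=20):
    """modalities: list of (cells x features) numpy arrays for the same cells."""
    n_cells = modalities[0].shape[0]
    errors = np.zeros((len(modalities), n_cells))
    for m, X in enumerate(modalities):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)                   # idx[:, 0] is the cell itself
        neighbour_mean = X[idx[:, 1:]].mean(axis=1)
        errors[m] = np.linalg.norm(X - neighbour_mean, axis=1)
    inv = 1.0 / (errors + 1e-8)                     # lower prediction error -> higher weight
    return inv / inv.sum(axis=0, keepdims=True)     # modalities x cells, columns sum to 1

def combined_distance(modalities, weights):
    """Weighted average of per-modality pairwise distances (returns cells x cells)."""
    n_cells = modalities[0].shape[0]
    D = np.zeros((n_cells, n_cells))
    for m, X in enumerate(modalities):
        Dm = cdist(X, X)
        w = (weights[m][:, None] + weights[m][None, :]) / 2   # average the two cells' weights
        D += w * Dm
    return D
```

The combined distance matrix can then be passed to any standard clustering routine to define cell lineages.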

Another common type of multimodal data analysis involves integrating molecular omics data and data on physical interaction networks (typically those involving protein–protein or protein–DNA interactions) to understand how individual genes interact with each other to drive oncogenesis and metastasis43,44,45,46. For example, an integrative pan-cancer analysis of TCGA detected 407 master regulators organized into 24 modules, partly shared across cancer types, that appear to canalize heterogeneous sets of mutations47. In another study, an analysis of 2,583 whole-tumour genomes across 27 cancers by the Pan-Cancer Analysis of Whole Genomes Consortium revealed rare mutations in the promoters of genes with many interactions (such as TP53, TLE4 and TCF4), and these mutations correlated with low downstream gene expression45. These examples of integrating networks and genomics data demonstrate a promising way to identify rare somatic mutations with a causal role in oncogenesis.

Knowledge transfer through data reuse

Existing data can be leveraged to make new discoveries. For example, cell-fraction deconvolution techniques can infer the composition of individual cell types in bulk-tumour transcriptomics profiles48. Such methods typically assemble gene expression profiles of diverse cell types from many existing datasets and perform regression or signature-enrichment analysis to deconvolve cell fractions49 or lineage-specific expression50,51 in a bulk-tumour expression profile.
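The core of such regression-based deconvolution can be expressed in a few lines: given a signature matrix of reference cell-type expression profiles, non-negative least squares estimates the mixing coefficients for a bulk profile, which are then normalized to fractions. The sketch below is a minimal illustration rather than any specific published method, which would add feature selection, noise modelling and statistical testing; the names `bulk` and `signature` are placeholders.

```python
# Minimal regression-based deconvolution sketch: estimate non-negative mixing
# coefficients of reference cell-type profiles and normalize them to fractions.
import numpy as np
from scipy.optimize import nnls

def deconvolve(bulk: np.ndarray, signature: np.ndarray) -> np.ndarray:
    """bulk: (genes,) expression vector; signature: genes x cell_types reference matrix."""
    coef, _ = nnls(signature, bulk)          # non-negative least squares fit
    total = coef.sum()
    return coef / total if total > 0 else coef   # cell fractions summing to 1
```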

Other data reuse examples come from single-cell transcriptomics data analysis. As single-cell RNA sequencing (scRNA-seq) data contain a high number of zero counts (dropouts)52, analyses based on a limited number of genes may lead to unreliable results53, and genome-wide signatures from bulk data can therefore complement such analyses. For example, the transcriptomic data atlas collected from cytokine treatments in bulk cell cultures has enabled the reliable inference of signalling activities in scRNA-seq data34. Further, single-cell signalling activities inferred through bulk data have been used to reveal therapeutic targets, such as FIBP, to potentiate cellular therapies in solid tumours and molecular programmes of T cells that are resilient to immunosuppression in cancer54. In another example, the analysis of more than 50,000 scRNA-seq profiles from 35 pancreatic adenocarcinomas and control samples revealed edge cells among non-neoplastic acinar cells, whose transcriptomes have drifted towards those of malignant pancreatic adenocarcinoma cells55; TCGA bulk pancreatic adenocarcinoma data were then used to validate the edge-cell signatures inferred from the single-cell data.
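As a simple illustration of transferring bulk-derived knowledge to single cells, the sketch below scores each cell by the correlation between its expression profile and a bulk-derived response signature over the shared genes. Published frameworks use more elaborate regularized models; the function and variable names here are placeholders.

```python
# Illustrative transfer of a bulk-derived signature to single cells: score each
# cell by its Pearson correlation with the signature over shared genes.
import numpy as np
import pandas as pd

def signature_scores(sc_expr: pd.DataFrame, signature: pd.Series) -> pd.Series:
    """sc_expr: genes x cells matrix; signature: per-gene weights derived from bulk data."""
    genes = sc_expr.index.intersection(signature.index)
    X = sc_expr.loc[genes]
    s = signature.loc[genes]
    Xc = X.sub(X.mean(axis=0), axis=1)        # centre each cell across genes
    sc_ = s - s.mean()                        # centre the signature
    scores = Xc.mul(sc_, axis=0).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((sc_ ** 2).sum()) + 1e-12
    )
    return scores                             # one correlation-based score per cell
```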

Data reuse can assist the development of new experimental tests. For example, existing tumour whole-exome sequencing data were used to optimize a circulating tumour DNA assay by maximizing the number of alterations detected per patient, while minimizing gene and region selection size56. The resulting circulating tumour DNA assay can provide a comprehensive view of therapy resistance and cancer relapse and metastasis by detecting alterations in DNA released from multiple tumour regions or different tumour sites57.
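The panel-optimization trade-off described above can be illustrated with a generic greedy set-cover heuristic: iteratively add the genomic region that covers the most additional patient mutations per kilobase until a size budget is reached. This sketch is only illustrative and is not the published assay-design procedure; the data structures are hypothetical.

```python
# Greedy illustration of panel selection: maximize newly covered patient
# mutations per kilobase of added panel size, under a total size budget.
def greedy_panel(regions, size_budget_kb):
    """regions: dict mapping region name -> (size_kb, set of (patient, mutation) tuples)."""
    regions = dict(regions)                    # work on a copy
    selected, covered, used_kb = [], set(), 0.0
    while regions and used_kb < size_budget_kb:
        # pick the region with the largest number of newly covered mutations per kb
        name, (size_kb, muts) = max(
            regions.items(),
            key=lambda item: len(item[1][1] - covered) / item[1][0],
        )
        if not (muts - covered) or used_kb + size_kb > size_budget_kb:
            break
        selected.append(name)
        covered |= muts
        used_kb += size_kb
        del regions[name]
    return selected, covered
```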

Although the data scale in cancer research is typically much smaller than in other fields, the number of input features, such as genes or imaging pixels, can be extremely high. Training a machine learning model with a high number of input dimensions (a large number of features) and a small data size (a small number of training samples) is likely to lead to overfitting, in which the model learns noise from the training data and cannot generalize to new data58. Transfer learning, a form of data reuse, is a promising way of addressing this disparity. These approaches involve training a neural network model on a large, related dataset, and then fine-tuning the model on the smaller, target dataset. For example, most cancer histopathology artificial intelligence (AI) frameworks start from architectures pretrained on ImageNet — an image database containing 15 million images with detailed hierarchical annotations4 — and then fine-tune the framework on new, smaller imaging datasets. As a further example of this approach, a few-shot learning framework enabled the prediction of drug response using data from only several patient-derived samples and a model pretrained using in vitro data from cell lines59. Despite these successful applications, transfer learning should be used with caution as it may produce mostly false predictions when data properties are markedly different between the pretraining set and the new dataset. Training a lightweight model60 or augmenting the new dataset61 are alternative solutions.
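The fine-tuning pattern described above can be sketched in PyTorch as follows: load an ImageNet-pretrained backbone, replace its classification head and train the new head (or the whole network at a lower learning rate) on the smaller target dataset. The class count, learning rate and freezing strategy are placeholder choices rather than recommendations from any of the cited studies.

```python
# Hedged sketch of transfer learning: reuse an ImageNet-pretrained backbone and
# fine-tune a new classification head on a smaller target dataset.
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes: int = 2) -> nn.Module:
    # ImageNet weights; older torchvision versions use pretrained=True instead
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for param in model.parameters():           # optionally freeze the backbone
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # new task-specific head
    return model

model = build_finetune_model(num_classes=2)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
# a standard training loop over the (smaller) target dataset would follow here
```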

Data-rich translational studies

Many clinical diagnoses and decisions, such as histopathology interpretations, are inherently subjective and rely on the interpreter’s experience or the availability of standardized diagnostic nomenclature and taxonomy. Such subjective factors may introduce interpretive errors62,63,64 and diagnostic discrepancies, for example when seniority has an undue influence on diagnostic decisions — the so-called big-dog effect65. Big-data approaches can provide complementary options that are systematic and objective to guide diagnosis and clinical decisions.

Diagnostic biomarkers trained from data cohorts

A major focus of translational big-data studies in cancer has been the development of genomics tests for predicting disease risk, some of which have already been approved by the US Food and Drug Administration (FDA) and commercialized for clinical use66. Distinct from biomarker discoveries through biological mechanisms and empirical observations, big data-derived tests analyse genome-scale genomics data from many patients and cohorts to generate a gene signature for clinical assays67. Such predictors mainly help clinicians determine the minimal therapy aggressiveness needed to minimize unnecessary treatment and side effects. The success of such tests depends on their high negative predictive value — the proportion of negative tests that reflect true negative results — so as not to miss patients who need aggressive therapy options66.
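For clarity, the negative predictive value mentioned above is the fraction of negative test results that are truly negative (NPV = TN / (TN + FN)); the numbers in the short example below are hypothetical.

```python
# Negative predictive value from confusion-matrix counts: NPV = TN / (TN + FN).
def negative_predictive_value(true_negatives: int, false_negatives: int) -> float:
    return true_negatives / (true_negatives + false_negatives)

# Hypothetical example: 930 correctly identified low-risk patients and
# 20 missed high-risk patients give an NPV of about 0.979.
print(negative_predictive_value(930, 20))
```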

Some early examples of diagnostic biomarker tests trained from big data include prognosis assays for patients with oestrogen receptor (ER)- or progesterone receptor (PR)-positive breast cancer, such as Oncotype DX68,69, MammaPrint67,70, EndoPredict71 and Prosigna72. These tests are particularly useful as adjuvant endocrine therapy alone can provide sufficient clinical benefit for patients with ER- or PR-positive, HER2-negative early-stage breast cancer73. Thus, patients stratified as being at low risk can avoid unnecessary additional chemotherapy. Predictors for other cancer types include Oncotype DX biomarkers for colon cancer74 and prostate cancer75 and Pervenio for early-stage lung cancer76.

In the early applications discussed above, large-scale data from genome-scale experiments served in the biomarker discovery stage but not in the clinical implementation of the tests. Owing to the high cost of genome-wide experiments and patent issues, the biomarker tests themselves still need to be performed through quantitative PCR or NanoString gene panels. However, the rapid decline of DNA sequencing costs in recent years could allow therapy decisions to be informed directly by genomics data and bring notable advantages over conventional approaches77. Gene alterations relevant to therapy decisions can take diverse forms, including single-nucleotide mutations, DNA insertions, DNA deletions, copy number alterations, gene rearrangements, microsatellite instability and tumour mutational burden78,79,80. These alterations can be detected by combining hybridization-based capture and high-throughput sequencing. The MSK-IMPACT81 and FoundationOne CDx82 tests profile 300–500 genes and can use DNA from formalin-fixed, paraffin-embedded tumour specimens to detect oncogenic alterations and identify patients who may benefit from various therapies.

Variant interpretation in clinical decisions is still challenging as the oncogenic impact of each mutation depends on its clonality83, zygosity84 and co-occurrences with other mutations85. Sequencing data can uncover tumorigenic processes (such as DNA repair defects, exogenous mutagen exposure and prior therapy histories81) by identifying underlying mutational signatures, such as DNA substitution classes and sequence contexts86. Future computational frameworks for therapy decisions should therefore consider many dimensions of variants and inferred biological processes, together with other clinical data, such as histopathology data, radiology images and health records.

Data-rich assays that complement precision therapies currently focus on specific genomic aberrations. However, epigenetic therapies, such as inhibitors that target histone deacetylases87, have a genome-wide effect and are typically combined with other treatments, and therefore current genomics assays may not readily evaluate their therapeutic efficacy. We could not find any clinical datasets of histone deacetylase inhibitors deposited in the NCBI GEO database when writing this Review, indicating there are many unexplored territories of data-driven predictions for this broad category of anticancer therapies.

Clinical trials guided by molecular data

Genome-wide and multimodal data have begun to play a role in matching patients in prospective multi-arm clinical trials, particularly those investigating precision therapies. For example, the WINTHER trial prospectively matched patients with advanced cancer to therapy on the basis of DNA sequencing (arm A, through FoundationOne assays) or RNA expression (arm B, comparing tumour tissue with normal tissue through Agilent oligonucleotide arrays) data from solid tumour biopsies88. Such therapy matches by omics data typically lead to off-label drug use. The WINTHER study concluded that both data types were of value for improving therapy recommendations and patient outcomes. Furthermore, there were no significant differences between DNA sequencing and RNA expression with regard to providing therapies with clinical benefits88, which was corroborated by a later study89.

Other, similar trials have demonstrated the utility of matching patients for off-label use of targeted therapies on the basis of genome-wide genomics or transcriptomics data89,90,91,92 (Fig. 2). In these studies, the fraction of enrolled patients who had therapies matched by omics data ranged from 19% to 37% (WINTHER, 35%88; POG, 37%89; MASTER, 31.8%92; MOSCATO 01, 19.2%90; CoPPO, 20%91). Among these matched patients, about one third demonstrated clinical benefits (WINTHER, 25%88; POG, 46%89; MASTER, 35.7%92; MOSCATO 01, 33%90; CoPPO, 32%91). Except for the POG study, all studies used the end point defined by the Von Hoff model, which compares progression-free survival (PFS) for the trial (PFS2) with the PFS recorded for the therapy preceding enrolment (PFS1) and defines clinical benefit as a PFS2/PFS1 ratio of more than 1.3 (ref.93).
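For concreteness, the Von Hoff end point can be expressed as a simple ratio test, as in the sketch below; the PFS values shown are hypothetical.

```python
# Von Hoff end point: clinical benefit is declared when PFS on the trial therapy
# (PFS2) exceeds 1.3x the PFS on the therapy received immediately before enrolment (PFS1).
def von_hoff_benefit(pfs1_months: float, pfs2_months: float, threshold: float = 1.3) -> bool:
    return pfs2_months / pfs1_months > threshold

cohort = [(3.0, 5.1), (4.0, 4.2), (2.5, 6.0)]   # hypothetical (PFS1, PFS2) pairs
benefit_rate = sum(von_hoff_benefit(p1, p2) for p1, p2 in cohort) / len(cohort)
print(benefit_rate)                              # fraction of patients with clinical benefit
```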

Fig. 2: Prospective clinical studies guided by omics data to use off-label drugs.
figure 2

Recent umbrella clinical trials88,89,90,91,92 have focused on multi-omics profiling of the tumours of enrolled patients by generating and analysing genome-wide data — including data from DNA sequencing, gene expression profiling, and copy number profiling — to prioritize treatments. After multi-omics profiling, a multidisciplinary molecular tumour board led by clinicians selects the best therapies on the basis of the current known relationships between drugs, genes and tumour vulnerabilities. For each therapy, the relevant altered vulnerabilities could include direct drug targets, genes in the same pathway, indirect drug targets upregulated or downregulated by drug treatment, or other genes interacting with the drug targets through physical or genetic interactions. This process then results in patients being treated with off-label targeted therapies. The end points for evaluating clinical efficacy include the ratio of the progression-free survival (PFS) associated with omics data-guided therapies (PFS2) and the PFS associated with previous therapy (PFS1), or differences in survival between patients treated with omics data-guided therapies and patients treated with therapies guided by physician’s choice alone.

A recent study demonstrated the feasibility and value of an N-of-one strategy that collected multimodal data, including immunohistochemistry data for multiple protein markers, RNA levels and genomics alterations in cell-free DNA from liquid biopsies94 (Fig. 2). A broad multidisciplinary molecular tumour board (MTB) then made personalized decisions using these multimodal omics data. Overall, patients who received MTB-recommended treatments had significantly longer PFS and overall survival than those treated by independent physician choice. Similarly, another study also demonstrated overall survival benefits brought by MTB recommendations95.

With these initial successes, emerging clinical studies aim to collect additional data beyond bulk-sample sequencing — such as tumour cell death responses following various drug treatments96 or scRNA-seq data collected on longitudinal patient samples — to study therapy response and resistance mechanisms97. Besides omics data generated from tumour samples, cross-modality data integration is a potential strategy to improve therapy recommendations. One such promising direction involves the study and application of synthetic lethal interactions98,99,100,101,102,103,104, which, once integrated with tumour transcriptomic profiles, can accurately score drug target importance and predict clinical outcomes for many anticancer treatments, including targeted therapies and immunotherapies98. We foresee that new data modalities and assays will provide additional ways to design clinical trials.

Artificial intelligence for data-driven cancer diagnosis

Genomics datasets, such as gene expression levels or mutation status, can typically be aligned to each other on gene dimensions. However, data types in clinical diagnoses, such as imaging data or text reports, may not directly align across samples in any obvious way. AI approaches based on deep neural networks (Fig. 3a) are an emerging method for integrating these data types for clinical applications105.

Fig. 3: Data-driven artificial intelligence to support cancer diagnosis.
figure 3

a | A common artificial intelligence (AI) framework in cancer detection uses a convolutional neural network (CNN) to detect the presence of cancer cells in a diagnostic image. CNNs use convolution (a weighted sum over a region patch) and pooling (summarizing the values in a region into one value) to encode image regions into low-dimensional numerical vectors that can be analysed by machine learning models. The CNN architecture is typically pretrained with ImageNet data, which is much larger than any cancer biology imaging dataset. To increase the reliability of the AI framework, the input data can be augmented through rotation or blurring of tissue images to increase the data size. The data are separated into non-overlapping training, tuning and test sets to train the AI model, tune hyperparameters and estimate the prediction accuracy on new inputs, respectively. False-positive predictions are typically essential data points for retraining the AI model. b | An example of the application of AI in informing clinical decisions, as per the US Food and Drug Administration-approved AI test Paige Prostate. From one needle biopsy sample, the pathologist can decide whether cancer cells are present. If the results are negative (‘no cancer’) or if the physician cannot make a firm diagnosis (‘defer’), the Paige Prostate AI can analyse the image and prompt the pathologist with regard to potential cancer locations if any are detected. The alternative procedure involves evaluating multiple biopsy samples and performing immunohistochemistry tests on prostate cancer markers, independently from the AI test185.
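The augmentation and data-splitting steps outlined in the figure can be sketched as follows; the tile directory, augmentation choices and split proportions are placeholders, and in real pipelines splits are made at the patient or slide level, with deterministic preprocessing for the tuning and test sets.

```python
# Sketch of image augmentation and non-overlapping training/tuning/test splits
# for a tile-based histopathology classifier. Paths and proportions are placeholders.
import torch
from torchvision import transforms, datasets

augment = transforms.Compose([
    transforms.RandomRotation(degrees=90),      # rotate tiles to enlarge the effective dataset
    transforms.GaussianBlur(kernel_size=3),     # mild blurring as a further augmentation
    transforms.ToTensor(),
])

# "tiles/" is a placeholder directory with one subfolder per class label
full_set = datasets.ImageFolder("tiles/", transform=augment)
n = len(full_set)
n_train, n_tune = int(0.7 * n), int(0.15 * n)
train_set, tune_set, test_set = torch.utils.data.random_split(
    full_set, [n_train, n_tune, n - n_train - n_tune],
    generator=torch.Generator().manual_seed(0),   # reproducible split
)
# NOTE: splitting individual tiles like this can leak patient information across sets;
# in practice the split should be defined per patient or per slide.
```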

The most popular application of AI for analysing imaging data involves clinical outcome prediction and tumour detection and grading from tissue stained with haematoxylin and eosin (H&E)26. In September 2021, the FDA approved the use of the AI software Paige Prostate106 to assist pathologists in detecting cancer regions from prostate needle biopsy samples107 (Fig. 3b). This approval reflects the accelerating momentum of AI applications on histopathology images108 to complement conventional pathologist practices and increase analysis throughput, particularly for less experienced pathologists. The CAMELYON challenge for identifying tumour regions provided 1,399 manually annotated whole-slide H&E-stained tissue images of sentinel lymph nodes from patients with breast cancer for training AI algorithms109. The top performers in the challenge used deep learning approaches, which achieved similar performance in detecting lymph node metastasis as expert pathologists110. Other studies have trained deep neural networks to predict patient survival outcomes111, gene mutations112 or genomic alterations113, on the basis of analysing a large body of H&E-stained tissue images with clinical outcome labels or genomics profiles.

Besides histopathology, radiology is another application of AI imaging analysis. Deep convolutional neural networks that use 3D computed tomography volumes have been shown to predict the risk of lung cancer with an accuracy comparable to that of predictions by experienced radiologists114. Similarly, convolutional neural networks can use computed tomography data to stratify the survival duration of patients with lung cancer and highlight the importance of tumour-surrounding tissues in risk stratification115.

AI frameworks have started to play an important role in analysing electronic health records. A recent study evaluating the effect of different eligibility criteria on cancer trial outcomes using electronic health records of more than 60,000 patients with non-small-cell lung cancer revealed that many patient exclusion criteria commonly used in clinical trials had a minimal effect on trial hazard ratios25. Dropping these exclusion criteria would only marginally decrease the overall survival and result in more inclusive trials without compromising patient safety and overall trial success rates25. Besides images and health records, AI trained on other data types also has broad clinical applications, such as early cancer detection through liquid biopsies capturing cell-free DNA116,117 or T cell receptor sequences118, or genomics-based cancer risk predictions119,120. Additional examples of AI applications in cancer are available in other reviews40,121.

New AI approaches have started to play a role in biological knowledge discovery. The saliency map122 and class activation map123 can highlight essential portions of input images that drive predicted outcomes. Also, in a multisample cohort, clustering data slices on the basis of deep learning-embedded similarities can reveal human-interpretable features associated with a clinical outcome. For example, clustering similar image patches related to colorectal cancer survival prediction revealed that high-risk survival predictions are associated with a tumour–adipose feature, characterized by poorly differentiated tumour cells adjacent to adipose tissue124. Although the molecular mechanisms underlying this association are unclear, this study provided an example of finding imaging features that could help cancer biologists pinpoint new disease mechanisms.

Despite the promising results described above, few AI-based algorithms have reached clinical deployment owing to several limitations26. First, the performance of most AI predictors deteriorates when they are applied to test data generated in a setting different from that in which their training data were generated. For example, the performance of top algorithms from the CAMELYON challenge dropped by about 20% when they were evaluated on the basis of data from other centres108. Such a gap may arise from differences in image scanners (if imaging data are being evaluated), sample collection protocols or study design, emphasizing the need for reliable data homogenization. Second, supervised AI training requires a large amount of annotated data, and acquiring sufficient human-annotated data can be challenging. In imaging data, if a feature for a particular diagnosis is present in only a fraction of image regions, an algorithm will need many samples to learn the task. Furthermore, if features are not present in the training data, the AI will not make meaningful predictions; for example, the AI framework of AlphaFold2 can predict wild-type protein structures with high accuracy, but it cannot predict the impact of cancer missense mutations on protein structures because the training data for AlphaFold2 do not contain the altered structures of these mutated proteins125.

Many studies of AI applications that claim improvements lack comparisons with conventional clinical procedures. For example, the performance study of Paige Prostate evaluated cancer detection using an H&E-stained tissue image from one needle biopsy sample126. However, the pathologist may make decisions on the basis of multiple needle biopsy samples and immunohistochemistry stains for suspicious samples instead of relying on one H&E-stained tissue image (Fig. 3b). Therefore, rigorous comparison with conventional clinical workflows is necessary for each application before the advantage of any AI framework is claimed.

New therapy development aided by big-data analysis

Developing a new drug is costly and time-intensive and has a high failure rate127. The development of new therapies is therefore a promising direction for big-data applications. To our knowledge, no FDA-approved cancer drugs have been developed primarily through big-data approaches; however, some big data-driven preclinical studies have attracted the attention of the pharmaceutical industry for further development and may soon make impactful contributions in the clinic128.

Big data have been used to aid the repurposing of existing drugs to treat new diseases129,130 and the design of synergistic combinations131,132,133,134. By creating a network of 1.2 billion edges among diseases, tissues, genes, pathways and drugs through the mining of more than 40 million documents, one study revealed that the combination of vandetanib and everolimus could inhibit ACVR1, a kinase recurrently mutated in diffuse intrinsic pontine glioma, as a potential therapy for this disease135.

Recent studies have combined pharmacological data and AI to design new drugs (Fig. 4). A deep generative model was used to design new small molecules inhibiting the receptor tyrosine kinase DDR1 on the basis of information on existing DDR1 inhibitors and compound libraries, with the lead candidate demonstrating favourable pharmacokinetics in mice136. Deep generative models are neural networks with many layers that learn complex characteristics of specific datasets (such as high-dimensional probability distributions) and can use them to generate new data similar to the training data137. For each specific drug design application, such a framework can encode distinct data into the neural network parameters and thus naturally incorporate many data types. A network aiming to find novel kinase inhibitors, for example, may include data on the structure of existing kinase inhibitors, non-kinase inhibitors and patent-protected molecules that are to be avoided136.

Fig. 4: Design of new kinase inhibitors using a generative artificial intelligence model.
figure 4

The variational autoencoder, trained with the structures of many compounds, can encode a molecular structure into a latent space of numerical vectors and decode this latent space back into the compound structure. For each target, such as the receptor tyrosine kinase DDR1, the variational autoencoder can create embeddings of compound categories, such as existing kinase inhibitors, patented compounds and non-kinase inhibitors. Sampling the latent space for compounds that are similar to existing on-target inhibitors and not patented compounds or non-kinase inhibitors can generate new candidate kinase inhibitors for downstream experimental validation. Adapted from ref.136, Springer Nature Limited.
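The encode/sample/decode idea in the figure can be illustrated with a minimal variational autoencoder. For simplicity, the sketch below assumes that compounds are represented as fixed-length binary fingerprints; published generative chemistry models typically operate on SMILES strings or molecular graphs and add substantially more machinery (property predictors, reinforcement learning and candidate filtering). All dimensions and names are placeholders.

```python
# Minimal variational autoencoder sketch over binary molecular fingerprints,
# illustrating the encode -> sample latent space -> decode loop.
import torch
import torch.nn as nn

class MoleculeVAE(nn.Module):
    def __init__(self, n_bits: int = 2048, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bits, 512), nn.ReLU())
        self.to_mu = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, n_bits)
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.to_mu(h), self.to_logvar(h)

    def reparameterize(self, mu, logvar):
        # sample z ~ N(mu, sigma^2) in a differentiable way
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar

def vae_loss(logits, x, mu, logvar):
    # reconstruction term plus KL divergence from the standard normal prior
    recon = nn.functional.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# After training, candidate molecules would be obtained by sampling latent vectors
# near known on-target inhibitors (and away from patented compounds or non-inhibitors)
# and decoding them, followed by experimental validation.
```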

AI can also be used for the virtual screening of bioactive ligands on target protein structures. Under the assumption that biochemical interactions are local among chemical groups, convolutional neural networks can comprehensively integrate training data from previous virtual screening studies to outperform previous docking methods based on minimizing empirical scores138. Similarly, a systematic evaluation revealed that deep neural networks trained using large and diverse datasets composed of molecular descriptors and drug biological activities could predict the activity of test-set molecules better than other approaches139.

Big data in front of narrow therapeutic bottlenecks

During dynamic tumour evolution, cancers generally become more heterogeneous and harbour a more diverse population of cells with different treatment sensitivities. Drug resistance can eventually evolve from a narrow bottleneck of a few cells140. Furthermore, the window between a treatment dose with antitumour effects and one with toxicity that leads to clinical trial failure or treatment cessation is narrow66. These two challenges are common reasons for anticancer therapy failure, as adding drugs to a combination to target rare cancer cells quickly leads to unacceptable toxic effects. An essential question is whether big data can bring solutions to overcome heterogeneous tumour evolution towards drug resistance while avoiding intolerable toxic effects.

Ideally, well-designed drug combinations should target various subsets of drug-tolerant cells in tumours and induce robust responses. Computational methods have been developed to design synergistic drug pairs131,141; however, drug synergy may not be predictable for certain combinations even with comprehensive training data. A recent community effort assessed drug synergy prediction methods trained on AstraZeneca’s large drug combination dataset, consisting of 11,576 experiments from 910 combinations across 85 molecularly characterized cancer cell lines134. The results showed that none of the methods evaluated could make reliable predictions for approximately 20% of the drug pairs whose targets independently regulate downstream pathways.

There may be a theoretical limit to the power of drug combinations in killing heterogeneous tumour cells while avoiding toxic effects on normal tissues. A recent study mining 15 single-cell transcriptomics datasets revealed that inhibition of four cell-surface targets is necessary to kill at least 80% of tumour cells while sparing at least 90% of normal cells in tumours142. However, a feasible drug-target combination may not exist to kill a higher fraction of tumour cells while sparing normal cells.

An important challenge accompanying therapy design efforts is the identification of genomic biomarkers that could predict toxicity. A community evaluation demonstrated that computational methods could predict the cytotoxicity of environmental chemicals on the basis of the genotype data of lymphoblastoid cell lines143. Further, a computational framework has been used to predict drug toxicity by integrating information on drug-target expression in tissues, gene network connectivity, chemical structures and toxicity annotations from clinical trials144. However, these studies were not explicitly designed for anticancer drugs, which are challenging with regard to toxicity prediction due to their extended cytotoxicity profiles.

Challenges and future perspectives

While many big-data advancements are encouraging and impressive, considerable challenges remain regarding big-data applications in cancer research and the clinic. Omics data often suffer from measurement inconsistencies between cohorts, marked batch effects and dependencies on specific experimental platforms. Such a lack of consistency is a major hurdle towards clinical translation. Consensus on the measurement, alignment and normalization of tumour omics data will be critical for each data type35. Besides these technical challenges, structural and societal challenges also exist and may impede the progress of the entire cancer data science field. We discuss these in the following subsections.

Less-than-desirable data availability

A key challenge of cancer data science is the insufficient availability of data and code. A recent study found that machine learning-based studies in the biomedical domain compare poorly with those in other areas regarding public data and source code availability145. Sometimes, the clinical information accompanying published cancer genomics data is not provided or is incomplete, even when security and privacy issues are resolved. One possible reason for this bottleneck relates to data release policies and data stewardship costs. Although many journals require the public release of data, such requirements are often met by the deposition of data into repositories that require author and institutional approval of access requests owing to intellectual property and various other considerations. Furthermore, deposited data may be missing critical information, such as cell barcodes for single-cell sequencing data or high-resolution images in the case of histopathology data.

In our opinion, the mitigation of these issues will require the enforcement of policies regarding public data availability by funding agencies and additional community efforts to examine the fulfilment of open data access. For example, a funding agency may suspend a project if community members report violations of data release agreements upon publication of articles. The allocation of grant budgets for patient de-identification upon manuscript submission and financial incentives for checking data through independent data stewardship services upon paper acceptance could markedly help facilitate data and code availability. One notable advance in data availability through industry–academia alliances has come in the form of data-sharing initiatives; specifically, making large repositories of patient tumour sequencing and clinical data available for online queries by researchers in partner institutions146. Such initiatives typically involve query-only access (that is, without allowing downloads), but are an encouraging way to expand the collaborative network between academia and industry entities that generate massive amounts of data.

Data-scale gaps

As mentioned earlier, the datasets available for cancer therapeutics are substantially smaller than those available in other fields. One reason for such a gap is that the generation of medical data depends on professionally trained scientists. To close the data-scale gap, more investments will be required to automate the generation of at least some types of annotated medical data and patient omics data. Rare cancers especially suffer from a lack of preclinical models, clinical samples and dedicated funding147. Moreover, the usability of biomedical data is typically constrained by the genetic background of the population. For example, the frequency of actionable mutations may differ among East Asian, European and American populations148.

A further reason for the data-scale gap is a lack of data generation standards in cancer clinical and biology studies. For example, most clinical trials do not yet collect omics data from patients. With the exponential decrease in sequencing cost, collection of omics data in clinical trials should, in our opinion, be markedly expanded, and possibly be made mandatory as a standard requirement. Further, current data repositories, such as ClinicalTrials.gov and NCBI GEO, do not have common metalanguage standards, whose incorporation would markedly improve the development of algorithms applied to their analysis. Although semi-automated frameworks are becoming available to homogenize metadata34, the foundational solution should be establishing common vocabularies and systematic meta-information standards in critical fields.

Conclusion

Data science and AI are transforming our world through applications as diverse as self-driving cars, facial recognition and language translation, and in the medical world, the interpretation of images in radiology and pathology. We already have available tumour data to facilitate biomedical breakthroughs in cancer through cross-modality integration, cross-cohort aggregation and data reuse, and extraordinary advancements are being made in generating and analysing such data. However, the state of big data in the field is complex, and in our view, we should acknowledge that ‘big data’ in cancer are not yet so big. Future investments from the global research community to expand cancer datasets will be critical to allow better computational models to drive basic research, cancer diagnostics and the development of new therapies.