Pan-cancer analyses that examine commonalities and differences among various cancer types have emerged as a powerful way to obtain novel insights into cancer biology. Here we present a comprehensive analysis of genetic alterations in a pan-cancer cohort including 961 tumours from children, adolescents, and young adults, comprising 24 distinct molecular types of cancer. Using a standardized workflow, we identified marked differences in terms of mutation frequency and significantly mutated genes in comparison to previously analysed adult cancers. Genetic alterations in 149 putative cancer driver genes separate the tumours into two classes: small mutation and structural/copy-number variant (correlating with germline variants). Structural variants, hyperdiploidy, and chromothripsis are linked to TP53 mutation status and mutational signatures. Our data suggest that 7–8% of the children in this cohort carry an unambiguous predisposing germline variant and that nearly 50% of paediatric neoplasms harbour a potentially druggable event, which is highly relevant for the design of future clinical trials.
Cure rates for childhood cancers have increased to about 80% in recent decades, but cancer is still the leading cause of death by disease in the developed world among children over one year of age1,2. Furthermore, many children who survive cancer suffer from long-term sequelae of surgery, cytotoxic chemotherapy, and radiotherapy, including mental disabilities, organ toxicities, and secondary cancers3. A crucial step in developing more specific and less damaging therapies is the unravelling of the complete genetic repertoire of paediatric malignancies, which differ from adult malignancies in terms of their histopathological entities and molecular subtypes4. Over the past few years, many entity-specific sequencing efforts have been launched, but the few paediatric pan-cancer studies thus far have focused only on mutation frequencies, germline predisposition, and alterations in epigenetic regulators4,5,6.
We have carried out a broad exploration of cancers in children, adolescents, and young adults, by incorporating small mutations and copy-number or structural variants on somatic and germline levels, and by identifying putative cancer genes and comparing them to those previously reported in adult cancers by The Cancer Genome Atlas (TCGA)7. We have also examined mutational signatures and potential drug targets. The compendium of genetic alterations presented here is available to the scientific community at http://www.pedpancan.com.
This integrative analysis includes 24 types of cancer and covers all major childhood cancer entities, many of which occur exclusively in children8 (Fig. 1, Supplementary Table 1). Ninety-five per cent of the patients in this study were diagnosed during childhood or adolescence (aged 18 years or younger) and 5% as young adults (up to 25 years) (Extended Data Fig. 1a). This study is biased towards central nervous system tumours, and is complemented by an additional study of a non-overlapping paediatric cohort with mainly leukaemias and extracranial solid tumours9.
We compiled paired-end Illumina-based sequencing data for 961 tumours (914 individual patients) from previous cancer-type specific studies (see Methods and Supplementary Note 1) including 547 whole-genome sequences (WGS, median coverage 37×) and 414 whole-exome sequences (WES, 121×) partially complemented by low-coverage whole genomes (Supplementary Tables 1, 2). Tumour and matched germline samples were processed with standardized pipelines to detect single nucleotide variants (SNVs), short insertions and deletions (indels), copy-number variants (CNVs) and other structural variants. Secondary (relapse) tumours (n = 82, including 47 matched to primaries) were analysed separately from the main primary cohort (n = 879).
Mutation frequencies across cancer types
Coding somatic SNV (93%) and indel (7%) counts correlated across all samples (n = 879) (R = 0.27, P = 9.1 × 10−5; Extended Data Fig. 1b, c). Mutation frequencies varied between cancer types (0.02–0.49 mutations per Mb) and were overall 14 times lower than in adult cancers7 (0.13 versus 1.8 mutations per Mb, TCGA data; Fig. 1, Extended Data Fig. 1c, Supplementary Table 3). Relapse tumours harboured significantly more mutations than primary tumours (P = 0.0015, excluding highly mutated tumours; Extended Data Fig. 1d).
Tumours with more than 10 mutations per Mb have been referred to as ‘hypermutators’, and are often related to deficiencies in mismatch repair (MMR)10,11. In this cohort, hypermutation occurred exclusively in H3.3 or H3.1 K27-wildtype (K27wt) high-grade gliomas with biallelic germline mutations in MSH6 or PMS2, with an extremely high mutational burden similar to the highest among adult tumours (in POLE- or POLQ-mutated carcinomas)7,12 (Fig. 1). Some paediatric tumours had a mutational burden below this threshold, but markedly above average (2–10 mutations per Mb, referred to as ‘paediatric highly mutated’), including several K27wt high-grade gliomas with monoallelic germline variants in MSH2, MSH6 or PMS2 (Fig. 1). Whether these highly mutated tumours respond to immune checkpoint inhibitors, as described for paediatric glioblastoma, should be of clinical interest13.
As in previous reports, the somatic mutation burden increased with patient age (R = 0.39, P = 2.9 × 10−6), except in Burkitt’s lymphoma (immunoglobulin hypermutation) and tumours with ‘kataegis’ events of localized hypermutation at double-stranded breakpoints14,15 (Extended Data Fig. 1e, f). Both SNVs (R = 0.37, P = 1.0 × 10−5) and indels (R = 0.27, P = 5.4 × 10−4) correlated with patient age overall, although within some cancers (for example, acute lymphoblastic leukaeumia (ALL), Ewing’s sarcoma, and rhabdomyosarcoma), we observed almost random mutational loads (R < 0.2). Rhabdomyosarcomas were largely dominated by embryonal tumours with more mutations than the few alveolar cases (median 0.27 versus 0.12 mutations per Mb, P = 0.002).
Mutational processes in childhood cancers
Most cancer types predominantly harboured C > T transitions (≥30% of SNVs in two-thirds of cancer types) linked to mutational signature 1, whose previously described age-association occurred in some paediatric brain tumours15,16 (P < 0.05; Extended Data Figs 1g, 2a–c). Mutational signatures, possibly reflecting biochemical cellular processes, have previously been investigated for many, mainly adult, cancers15. In this paediatric cohort (WGS, n = 503), we found evidence for major contributions of 16 out of 30 published signatures and also identified one new signature15 (Fig. 2, Extended Data Fig. 2a, Supplementary Table 4). This ‘signature P1’, which is distinct from any previously documented signatures and harbours elevated C > T mutations in a CCC/CCT context, occurred in several atypical teratoid rhabdoid tumours (ATRTs) and one ependymoma (Fig. 2, Extended Data Fig. 2d, Supplementary Table 5). Its activity correlated with ‘multiple nucleotide variants’ (MNVs; R = 0.87, P = 1.1 × 10−12), but no particular loci or genes were mutually altered in the affected tumours (Extended Data Fig. 2d). Notably, all ATRTs with signature P1 were in the recently defined subgroup ‘SHH’, and even within one proposed methylation subset of these17 (P = 0.003, Wilcoxon rank-sum test; Extended Data Fig. 2d). Signatures 16 and 18 were heterogeneously represented within several cancer types, with signature 16 being most prominent in pilocytic astrocytomas, and signature 18, previously proposed to be associated with oxidative DNA damage and related to C > A transversions, in neuroblastomas, rhabdomyosarcomas, and other tumours with multiple structural variants15,18 (Extended Data Figs 1g, 2a, c, 3a).
Signature 3, the ‘canonical’ double-stranded break signature linked to mutations in BRCA1 or BRCA2 or to a ‘BRCAness’ phenotype, and signatures 8 (recently linked to BRCA2 or PALB2 germline mutations in medulloblastomas; S. M. Waszak et al., personal communication) and 13 were linked to chromothripsis and TP53 mutations. This was particularly true for TP53 germline-mutated SHH medulloblastomas, and similarly for adrenocortical carcinomas and rhabdomyosarcomas (Extended Data Fig. 3b, c). Overall, signatures 3, 8, and 13 were more pronounced in cancer types with higher genomic instability (that is, structural variants; Extended Data Fig. 2e).
Germline variants in cancer predisposition genes
A recent study of more than 1,000 patients estimated that about 8% of children with cancer harbour a hereditary predisposition5. Accordingly, in our cohort (n = 914 individual patients, about 25% of samples overlapping with the previous study), 7.6% of samples were determined as being likely to be associated with a pathogenic germline variant5,19 (162 genes investigated; Supplementary Tables 6, 7). No general age-of-onset bias was observed in patients with a predisposition; however, onset was later in germline MMR-deficient patients (P = 0.0001), even within the high-grade glioma sub-cohort (P = 0.001).
Hereditary predisposition was most common in adrenocortical carcinomas (50%) and hypodiploid B-ALL (28%), followed by K27wt high-grade gliomas, ATRTs, SHH medulloblastomas, and retinoblastomas (15–25% each; Fig. 3a). Compared to the previous study, LZTR1, TSC2, and CHEK2 emerged as new putative predisposition genes, and possible new associations, such as SDHA with medulloblastoma, were detected5 (Fig. 3b).
Most germline variants were related to DNA repair genes from mismatch (MSH2, MSH6, PMS2) and double-stranded break (TP53, BRCA2, CHEK2) repair (Fig. 3b, c). Both groups are clinically relevant: patients with constitutional MMR deficiency could be candidates for immune checkpoint inhibition13 (Figs 1, 3b, c). Carriers of TP53 germline mutations (Li–Fraumeni syndrome), here most common in adrenocortical carcinomas, hypodiploid B-ALL, SHH medulloblastomas, and K27wt high-grade gliomas, are at a 50% risk for early-onset cancer compared to 1% overall, and are susceptible to treatment-induced secondary oncogenesis2,20,21,22 (Fig. 3b). Correcting the predisposition frequency of 7.6% in this cohort for the relative incidence of cancer types as a whole, we find that approximately 6% of all childhood cancer patients may carry a causative germline variant (Fig. 3d).
Significance analysis identifies cancer driver genes
Genome-wide analysis for significant mutation clusters (n = 538, WGS excluding hypermutators) identified non-coding mutations in the TERT promoter in 2.5% of tumours (Extended Data Fig. 4a, b, Supplementary Table 8). Further high-confidence clusters corresponded to coding mutations in frequently mutated genes (TP53, H3F3A, CTNNB1), and to localized hypermutation at the rearranged MYC locus in Burkitt’s lymphoma, while the bulk were classified as likely technical artefacts23 (Extended Data Fig. 4b).
MuSiC identified 77 significantly mutated genes (SMGs), which were ranked according to their pan-cancer mutation frequency24 (Fig. 4, Supplementary Tables 9, 10). Most SMGs were mutually exclusively mutated across cancer types, demonstrating specificity of single putative driver genes in childhood cancers as compared to more frequent co-mutation in adult cancers in the TCGA study7 (Extended Data Fig. 4c–e). None of the SMGs showed a bias towards samples with higher mutation frequencies. The allele frequencies of mutations in SMGs were higher than in non-SMGs, and ranked higher in individual tumours, suggesting an early clonal occurrence of these likely driver events (Extended Data Fig. 4f). Two additional SMGs emerged from analysis of the relapse tumours (n = 82): PRPS1 and NT5C2, both of which have been previously implicated in disease progression and chemotherapy resistance25,26 (Extended Data Fig. 4g).
Genes linked to epigenetic modification emerged as the most common (25% of tumours, 23 of 24 cancer types) and the largest (20%) group of SMGs (Extended Data Fig. 5a). Compared to a previous study6, for example, we also detected ARID1A and BCOR. Transcriptional regulators and MAP-kinase-associated genes accounted for 12–15% of SMGs. TP53 was the only DNA repair gene among somatic SMGs, in contrast to the multiple DNA repair-related germline mutations, and also in contrast to adult cancers (9% of SMGs, TCGA)7. PI3K-associated SMGs are the most commonly altered (31%) genes in adult cancers, compared to only 3% in paediatric cancers, which could be related to their often late occurrence in the evolution of multi-hit adult cancers27 (Extended Data Fig. 5a).
Forty-seven per cent of paediatric tumours harboured at least one SMG mutation, with most tumours (57%) having only one. SMG mutations were rare (<15%) in ependymomas, hepatoblastomas, Ewing’s sarcomas (driven by EWSR1 fusions instead of by point mutations28), and pilocytic astrocytomas, and common (>90%) in K27M high-grade gliomas, WNT medulloblastomas, and Burkitt’s lymphomas. By contrast, 93% of adult cancers harbour at least one mutation in an (adult cancer-related) SMG and 76% in multiple SMGs7 (Extended Data Fig. 5b). In line with the accompanying paediatric pan-cancer study9, only around 30% of paediatric SMGs overlapped with adult SMGs (Extended Data Fig. 5c). On the basis of incidence-normalized mutation frequencies, TP53 is predicted to be the most common somatically mutated gene (4% of childhood tumours), followed by KRAS, ATRX, NF1, and RB1 (1–2% of tumours); in adult cancers, with similarly normalized data, TP53 is also the most commonly mutated gene, albeit ten times more frequently (Extended Data Fig. 5d).
Assessment of high functional impact mutations (OncodriveFM)29 revealed well-known tumour suppressor genes (TSGs) such as TP53, ATRX, SMARCA4, and RB1, and further putative TSGs, including FMR1 in SHH/WNT medulloblastomas and MALRD1 (also known as C10orf112) in rhabdomyosarcomas (Extended Data Fig. 6a). Locally clustered ‘hotspot mutations’ (OncodriveClust)29,30 identified known oncogenes, such as CTNNB1, PIK3CA, KRAS, and BRAF, proposed oncogenes (ACVR1, KBTBD4, TBR1), and possible new candidates, such as SF3B1, in Group 4 medulloblastomas (Extended Data Fig. 6b).
Recurrent structural and copy-number variants
The degree of genomic instability (that is, the number of structural variants, including insertions, deletions, translocations, and inversions), varied substantially (median 1–434 structural variants) across cancer types (WGS, n = 539), with more than 1,000 structural variants in individual samples of adrenocortical carcinoma and osteosarcoma (Fig. 5a, Supplementary Table 11). Genomic instability correlated with germline (P = 3 × 10−15) and somatic (P = 2 × 10−4) TP53 mutations across all samples, but differed markedly between cancer types—again suggesting cancer type-specific effects of DNA repair (Fig. 5b, Extended Data Figs 3b, 7a).
Genomically unstable cancers were also more often hyperdiploid31 (Supplementary Table 12). Twelve per cent of tumours had a ploidy of four or more, 72% retained a near-diploid state (ploidy 1.5–2.5), and hypodiploidy was observed mainly in hypodiploid B-ALLs (Extended Data Fig. 7b). Hyperdiploidy was associated with somatic (P = 0.005) and germline (P = 0.003) TP53 mutations, in line with a role for mutant TP53 in the bypassing of the G1 tetraploidy checkpoint32 (Extended Data Fig. 7c–e). Chromothripsis was also often observed in hyperdiploid cancers and co-occurred with somatic (P = 2.3 × 10−10) and germline TP53 (P = 5 × 10−8) mutations in 50% and 66% of these tumours, compared to 8% in TP53 wild-type tumours33,34,35 (Extended Data Fig. 7f–h, Supplementary Table 13).
Thirty-four regions recurrently altered by copy-number changes (17 amplified, 17 deleted) were identified using GISTIC2.0 (WGS, n = 516)36; candidate driver genes were assigned to each based on known cancer genes and literature review (Fig. 5c, Extended Data Fig. 8a, b, Supplementary Tables 14–17). Alterations per cancer type are summarized in Extended Data Fig. 9.
Recurrently amplified regions contained known oncogenes, including MYC, MYCN, or GLI2, with 11 regions involving high-level amplifications (at least 5-fold gain) (Extended Data Fig. 8b). Further interesting regions included 17q11.2 with 61 genes, containing NCOR1 as a potential candidate, and a region on 12q24.31 near (~0.1 Mb) the proposed oncogene KDM2B37,38. Recurrently deleted regions were predominantly associated with epigenetic or cell cycle regulators, most commonly TP53, PTEN, SETD2, and CDKN2A or CDKN2B. Further potential tumour suppressors included RAD51D on 17q12 and FOXF1 on 16q24.1, with significant loss across the cohort39.
As evidenced by recurrent structural variation outside genes (based on breakpoint clusters in 10-kb windows), rearrangements linked to enhancer hijacking were also found, involving GFI1B and DDX31 in medulloblastomas and TERT in neuroblastomas40,41. Together with genes directly affected by breakpoints, in total 70 structural variant-related putative cancer genes were found, many associated with cell cycle or growth (for example, the tumour suppressor PTPRD) or epigenetic regulators (such as SUZ12)42,43 (Extended Data Fig. 8c, Supplementary Tables 18, 19). Cancer type-specific events that occurred together with high expression (data derived from Northcott et al.44) included alterations of RIMS245.
The analysed genomic alterations were combined into 166 ‘likely functional events’ (LFEs) affecting 149 genes, classified as M-(mutation)-type or as SC-(structural/copy-number variant)-type (Extended Data Fig. 10a, Supplementary Table 20). Along the ‘cancer genome hyperbola’, individual tumours (WGS, n = 539) differentiated between an M-class (more M-type LFEs) and an SC-class (more SC-type LFEs)46 (Extended Data Fig. 10b, Supplementary Table 21). Fifty-five per cent of tumours were exclusive to one class, 27% were mixed but dominated by one type of LFE, 8% were ambiguous, and 10% had no LFEs (which may be of particular interest in assessing other tumour-driving events at the epigenetic or transcriptomic level). Germline MMR mutations were enriched in the M-class, and germline TP53 mutations in the SC-class (P = 0.0003 and P = 0.05, respectively, Fisher’s exact test; Extended Data Fig. 10c). Individual cancer types displayed varying relative distributions of mutation classes (Extended Data Fig. 10d).
Drug targets in childhood cancers
To assess the status of druggability of childhood cancers, the cohort (n = 675 with full genomic information; WES-only, n = 39; see Methods) was screened for potentially druggable events19(PDEs, that is, alterations in 179 genes with a directly or indirectly targeted treatment currently available or under development; Supplementary Table 22). This analysis revealed 453 PDEs in 59 genes, including 3% germline events (Supplementary Table 23). Most cancer types had tumours with PDEs related to both M- and SC-type (Fig. 6a). Most commonly, PDEs occurred in Burkitt’s lymphomas and pilocytic astrocytomas, while none were detected in ependymomas or hepatoblastomas (although the latter lacked information regarding CNVs or structural variants). Associated pathways included RTK/MAPK signalling, transcriptional regulation, cell cycle control, and DNA repair (Fig. 6a).
When the data are normalized for relative cancer incidence, 52% of all primary paediatric tumours may harbour a PDE (Fig. 6b); this might be an underestimate, given that some structural variants may not have been detected by this approach (for example, the common MYC translocations in Burkitt’s lymphoma)23. After incidence adjustment, MAPK signalling and cell cycle control were most commonly affected. Notably, the PDEs often varied between primary and relapse tumours from one patient (n = 41): only 37% of primary tumours with PDEs retained these upon progression, while most of them partially or completely gained or lost events. This highlights the need for profiling of the current tumour when considering personalized therapy.
Our analysis of this pan-cancer compendium outlines the landscape of genomic alterations across multiple childhood cancer types. Although some alteration types and rarer entities are still under-represented and significance analyses are probably limited, this dataset of nearly 1,000 tumours (which can be explored at http://www.pedpancan.com) provides an unprecedented data resource for paediatric cancer research, further complemented by the accompanying pan-cancer study9 (https://pecan.stjude.org/proteinpaint/study/pan-target). The multiple differences found compared to previous studies of adult tumours emphasize the need to consider paediatric cancers separately, further demonstrating a need for mechanism-of-action driven drug development for paediatric indications47.
The predicted frequency of pathogenic germline variants in 6% of patients, together with previous findings, demonstrates the relevance of genetic predisposition in childhood cancer5. Germline TP53 variants, which are clinically highly important, are estimated for 1.5% of children with cancer, and for more than 10% within individual cancer types. Genetic counselling should thus be systematically considered, particularly for patients with indicated high-risk entities.
Although stratified targeted treatment is currently incorporated only rarely into first-line therapy for paediatric cancer patients, our finding that nearly 50% of primary childhood tumours harbour a potentially targetable genetic event is encouraging. It also highlights the need for personalized profiling for each patient, both to increase diagnostic accuracy and to exploit the potential for potentially more effective and less harmful precision therapies. This may also transcend the direct targeting of genes or pathways, for example, through immune checkpoint inhibition in hypermutated tumours13 or through PARP inhibition in genomically unstable (‘BRCAness’) tumours48. It is hoped that ongoing personalized medicine approaches for patients at relapse will give initial information on the use and effectiveness of such targeted drugs (for example, in the clinical trials pedMATCH-NCT03155620; eSMART-NCT02813135; INFORM19). Additional longitudinal monitoring, for example using serial liquid biopsies, may further improve our understanding of tumour biology and the development of resistance mechanisms, and shed light on therapeutic challenges such as tumour heterogeneity.
In summary, this multi-faceted pan-cancer analysis provides a valuable resource for assessing genomic alterations across the spectrum of paediatric tumours. While there are undoubtedly more discoveries to come in terms of expanded cohorts and whole-genome and transcriptome analysis, we believe that this study provides a strong basis for functional follow-up and investigation of potential therapeutic targets in this specific patient population.
The cohort analysed in this study is a compilation of individual sequencing datasets from various sources: the International Cancer Genome Consortium (ICGC) – Pedbrain Tumor and MMML-seq (http://www.icgc.org), the German Cancer Consortium (DKTK) (https://dktk.dkfz.de/en/home), the Pediatric Cancer Genome Project (PCGP) (http://explore.pediatriccancergenomeproject.org/), the Heidelberg Institute for Personalized Oncology (HIPO) (http://www.dkfz.de/en/hipo), the Individualized Therapy For Relapsed Malignancies in Childhood (INFORM) registry (www.dkfz.de/en/inform), and other previously published datasets (listed below). For all included tumours, matched germline control tissue was available. Ninety-five per cent of the patients were under 18 years of age (or age unspecified but confirmed age group paediatric), but available data were included for patients up to 25 years, as these were considered relevant for cancer types that typically peak at a young age. All centres have approved data access and informed consent had been obtained from all patients.
External data were downloaded from the European Genome-Phenome Archive (EGA; https://www.ebi.ac.uk/ega/home) using the accession numbers EGAD00001000085, EGAD00001000135, EGAD00001000159, EGAD00001000160, EGAD00001000161, EGAD00001000162, EGAD00001000163, EGAD00001000164, EGAD00001000165, EGAD00001000259, EGAD00001000260, EGAD00001000261, EGAD00001000268, and EGAD0000100026949,50,51,52,53,54,55,56,57,58,59,60,61,62; internal datasets are related to previous PMIDs 27748748, 27479119, 26923874, 25670083, 25253770, 24972766, 24553142, 25135868, 26632267, 26179511, 24651015, 28726821, 23817572, 25962120, 2629472517,19,44,63,64,65,66,67,68,69,70,71,72,73,74 (Supplementary Note 1).
The final cohort included 914 individual patients of no more than 25 years of age including primary tumours for 879 patients with 47 matched relapsed tumours, and an additional 35 independent relapsed tumours (Supplementary Tables 1, 2). Deep-sequencing (~30×) whole-genome data (WGS) were available for 547 samples with matched control, whole-exome sequencing (WES) for 414, and low-coverage whole-genome sequencing (lcWGS) for an additional 54 germline and 186 tumour samples. Depending on the requirements of each sub-analysis, we used WES and WGS, WGS only (excluding Ewing’s sarcoma, Wilms tumour, hepatoblastoma, and T-ALL), or WES, WGS and lcWGS (germline excluding Ewing’s sarcoma, Wilms tumour and hepatoblastoma; tumours excluding Ewing’s sarcoma and hepatoblastoma) were used (Supplementary Table 24). ‘Subgroups’ of cancer types were considered as separate entities if there was considerable evidence of differences in terms of clinical and molecular behaviours, if sub-cohort sizes were substantial, and if full annotation of all samples was available. All samples had been sequenced using Illumina technology and 99% of samples were paired-end sequences with 100 bp read length. Ninety-eight per cent of exome sequences are covered with at least 30×, 94% with at least 60×, and the total median exome coverage is 121×. The whole-genome sequenced samples have a median coverage of 37× and 94% of samples are covered with at least 30×. Information on coverage and other metrics for all samples are provided in Supplementary Table 2.
Cancer type incidence
Information on incidence of cancer types in the population was derived from the SEER database (Surveillance, Epidemiology, and End Results program)8; further detailed information on different subgroups of cancer types (central nervous system tumours and subgroups of medulloblastoma, ependymoma, and ALL) was transferred from cancer type-specific publications75,76,77,78,79. Survival data are based on information from the German Childhood Cancer Registry80. Incidence rates of adult cancers were taken from information in the German GEKID database (http://www.gekid.de/, 2003–2012).
All data were processed using a standardized alignment and variant calling pipeline, which was developed in the context of the ICGC Pan-Cancer project (https://dockstore.org/containers/quay.io/pancancer/pcawg-dkfz-workflow)81.
Datasets were available in either raw FASTQ or aligned BAM format. To allow standardized processing for all included samples, BAM files were sorted by read name using sambamba (v.0.4.6) and converted to a raw-like FASTQ format using SamToFastq (v.1.61). Reads were then aligned to the phase II reference human genome assembly of the 1000 Genomes Project including decoy sequences (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz) using BWA-MEM (v.0.7.8 using default settings except ‘-T 0’). Matching genotypes of tumour and control samples were confirmed by calculating pairwise DNA sequence similarities at 1,000 reference SNPs (dbSNP v.138)82.
SNVs were called with the previously described samtools-based DKFZ pipeline adjusted for ICGC Pan-Cancer settings, and short indels were called using Platypus (v.0.7.4)74,83. Variants were first identified in the tumour sample and germline or somatic origin was determined based on their presence or absence in the matched control tissue. Functional effects were annotated using ANNOVAR and GENCODE19 (http://www.gencodegenes.org/releases/19.html)84.
Somatic structural variant discovery
Somatic structural variant discovery was pursued across all whole-genome sequenced samples (high-quality structural variants available for n = 539 primary tumours) using the DELLY ICGC Pan-Cancer analysis workflow (https://github.com/ICGC-TCGA-PanCancer/pcawg_delly_workflow)85. A high-stringency structural variant set was obtained by additionally filtering somatic structural variants detected in 1% or more of a set of 1,105 germline samples from healthy individuals belonging to phase I of the 1000 Genomes Project and by removing somatic structural variants present in any of the paediatric germline samples of this study86. High-stringency structural variants were further required to have at least four supporting read pairs with a minimum mapping quality of 20 and were restricted to somatic structural variant sizes from 300 bp to 500 Mb.
Copy numbers were estimated using ACEseq (allele-specific copy-number estimation from sequencing) (K. Kleinheinz et al., unpublished data), using a binned tumour–control coverage ratio and B-allele frequency (BAF). Allele frequencies were obtained for all single nucleotide polymorphism (SNP) positions recorded in dbSNP version 13582. To improve sensitivity with regard to imbalanced and balanced regions, SNP positions in the control were phased with impute287. Additionally, the coverage for 10-kb windows with sufficient mapping quality and read density was recorded and subsequently corrected for GC content and replication timing.
The genome was segmented using the PSCBS package incorporating structural variant breakpoints defined by DELLY88,89. Segments were clustered based on coverage ratio and BAF using k-means and neighbouring segments in the same cluster were joined; focal segments (<9 Mb) were stitched to the more similar neighbour. Tumour cell content and ploidy were estimated by testing how well different combinations of both explain the data. Segments with balanced BAF were assigned to even-numbered copy-number states, whereas unbalanced segments were allowed to match with uneven numbers as well. Finally, estimated tumour cell content and ploidy were used to compute the total and allele-specific copy-number for each segment. High-quality copy-number calls were available for n = 516 of the WGS samples.
The frequency of somatic mutations in coding regions was determined for each sample individually by normalizing the total number of coding mutations for the number of sufficiently covered (≥6×) coding bases to account (determined using MuSiC-bmr) for different data types (WGS/WES) and for different exome target enrichment kits24. Mutation spectra were obtained by categorizing observed SNVs into base substitution types in pyrimidine context. Spearman’s rank correlation test was applied to infer correlations between different types of mutation counts or between mutation counts and age. Generalized linear models were used to fit regression lines. Clusters of localized hypermutation were identified using a previously presented approach adjusted for mutation rates in human paediatric cancers90.
Deciphering mutation signatures
Exome-sequenced tumours, except for hypermutator cases, were excluded from signature analysis owing to their low numbers of mutations. In brief, signatures are represented as probability distributions of substitution types of SNVs in pyrimidine context. Considering the immediate sequence context of each SNV, this results in 96 possible mutation types with directly adjacent mutations (multiple nucleotide variants, MNVs) being excluded, which are counted per tumour to compile its mutational profile.
As proposed by Alexandrov et al.91, the mutational profile of a tumour is expected to reflect a superposition of mutational processes (signatures) acting on its genome, where each mutational process has a different intensity (exposure). For a cohort of tumour genomes, this is modelled as a system of matrices for signatures (P) and exposures (E) defining the observed mutational catalogue (M)91: M ≈ P × E.
De novo deciphering of signatures was done as described91 based on the mutational catalogues of all cancer types and of the pan-cancer cohort. All resulting signatures were compared to published signatures (available in the COSMIC database, http://cancer.sanger.ac.uk/cosmic/signatures) based on their cosine similarity15. Signatures that did not correspond to any of the previously known signatures (cosine similarity <0.85) were further analysed to examine their relevance for modelling the cancer genomes. First, linear independence from the known set of signatures was confirmed. Second, for each potentially novel signature, we examined whether the modelling of mutation profiles improved when compared to having used the set of known signatures: for each sample, the observed mutational profile was compared to the theoretical profiles calculated using the set of known signatures only, and using the extended set including the new candidate signature. Here, only samples with a total number of mutations over 200 were considered. Reconstruction was calculated as the difference between cosine similarity of the modelled profile and the observed profile. On the basis of the resulting distribution of similarities in both alternatives, a signature was considered to have a relevant contribution to the model, and thus a potential new signature, if both of the following conditions were fulfilled: the reconstruction (measured as the difference of similarities) of at least one sample increased by 0.02 and that sample had a reconstruction accuracy of <0.9 based on the known set of signatures only.
This procedure resulted in one new candidate signature, signature P1, which was added to the set of reference signatures. In order to achieve maximum resolution per sample, a sample-wise re-extraction of exposures from the mutational profiles was performed using quadratic programming with the reference signature set used for P and the exposures in E as unknown variables. Samples with a reconstruction accuracy below 0.5 were excluded (resulting in n = 503 tumours with high-quality signature information), as these samples would not be correctly accounted for by the model, which might be due to quality issues or to contributions of unknown signatures that are not present at intensities sufficient to be identified by a de novo approach. The resulting exposures were used for further downstream analyses and visualization. Previously published signatures without validation were first included to model the mutational catalogues as precisely as possible, but then summarized as ‘other’ for representation.
Spearman’s rank correlation and two-sided Kolmogorov–Smirnov tests were used to associate exposure of signatures with numerical and categorical variables, respectively. Exposures to signatures across multiple groups were compared using ANOVA and the post hoc Tukey’s test.
Identifying mutations in genes predisposing to cancers
To identify germline variants with a high likelihood of being implicated in cancer development, we investigated 162 candidate genes adapted from ref. 19 (110 genes regarded as following a dominant inheritance pattern and 52 genes with recessive inheritance) (Supplementary Table 6).
Germline SNVs and indels were subjected to a stepwise filtering approach to eventually classify them into five categories: benign, likely benign, uncertain significance, likely pathogenic, and pathogenic. First, variants reported in both the 1000 Genomes (release November 2010) and dbSNP (v.141) databases were excluded. High-quality variant calls were selected by including only positions with ≥15× coverage, a germline allele frequency of ≥0.2, and a phred-based quality score of ≥10. Variants with a population frequency ≥0.01 reported in additional common databases (esp6500siv2, X1000g2015, and exac03 included in ANNOVAR (http://annovar.openbioinformatics.org)) or with ClinVar (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/) annotations of ‘benign’, ‘likely benign’ or ‘uncertain significance’ were removed.
Furthermore, variants with a phred-scaled CADD score ≥15 (http://cadd.gs.washington.edu/info) and with Mutation Assessor (http://mutationassessor.org/r3/) categories ‘medium’ and ‘high’, or no available annotation, were included. Variants with a dbSNP classification of ‘precious’ were not subject to these two filtering steps. As indel calling is more prone to alignment and calling errors, potentially deleterious indels were manually investigated for artefacts. For recessive tumour genes, variants were included only with an allele frequency of one or with two compound heterozygous mutations of the same gene in the same patient. In total, the filtering steps narrowed down the number of potentially pathogenic mutations to n = 433. Every variant was then manually checked and scored by the use of varied, mainly gene-specific online databases (http://p53.iarc.fr/, http://www.lovd.nl/3.0/home, https://www.ncbi.nlm.nih.gov/clinvar/, and others). Only likely pathogenic and pathogenic mutations were considered as cancer-relevant and used for representation in Fig. 3. Additionally, whole-genome sequenced samples were manually screened for copy-number losses in 13 tumour suppressor genes of the candidate list, which are known to occasionally harbour germline focal deletions (MLH1, MSH2, MSH6, NF1, PMS2, PRKAR1A, PTCH1, PTEN, RB1, SMARCA4, SMARCB1, SUFU, TP53).
Detecting genome-wide mutation clusters
To identify genomic regions with single or clusters of recurrent mutations, the human genome was binned into non-overlapping windows of various sizes (50–500 bp) and compared the observed mutations to a background model (V. A. Rudneva et al., unpublished data) which was estimated using the ‘global’ model: the genome was stratified into 25 evenly sized groups of genomic windows based on the combined vector of five genetic and epigenetic features (replication timing, gene expression level, GC content, H3K9me3, and open versus closed chromatin conformation). For each region an enrichment score, binomial P value, and negative binomial test P value were computed.
Cross-validations were used to determine the significance cut-off that would provide reproducible results (with samples segregated by subgroup). A combination of the window size (500 bp), test statistics (enrichment score, mutational recurrence, binomial test P value, and gamma Poisson test P value), and a cut-off value that ensured high precision and recall values based on the precision-recall analysis (P = 10−20) were chosen (Extended Data Fig. 4a). Recall was calculated as the number of regions that satisfied the cut-off in results obtained on both halves of the dataset; precision was calculated as a fraction of the recalled regions to the total number of regions that satisfied the cut-off in each of the datasets. The chosen parameters were then used to run the pipeline on the complete dataset and then the mutations in the resulting regions were further examined manually for potential false positives in order to identify high-confidence candidate regions (Extended Data Fig. 4b).
Significantly mutated genes
Significantly mutated genes based on somatic SNVs and indels were identified with the SMG module of the MuSiC tools suite24 separately from all cancer types and from the pan-cancer cohort, and then merged.
This kind of significance analysis often produces false positive hits (for example, very large genes), despite normalization procedures, and thus several filters were applied to the raw output30. First, all genes of >30,000 bp exonic length or >10,000 bp with additional replication timing >800 were excluded (Cancer Cell Line Encyclopedia; CCLE)92. Genes that scored significant in three or more cancer types, or that were recurrently mutated at the same position, were manually inspected for artefacts from ambiguous alignments (for example, repetitive sequence regions). Also, genes that are probably not associated with tumour development but rather represent non-neoplastic somatic hypermutation processes in the context of immune function were removed. Furthermore, genes mutated in <2% of the cohort were included only if they had a secondary signal from either functional impact or from localized clustering bias (Intogen modules OncodriveFM and OncodriveClust v. 3.0 beta) or from being among known cancer genes29,93. Mutation needle plots were generated using MutationMapper94. Biological processes were assigned to the significantly mutated genes mostly exclusively, except for a few genes with high relevance for multiple processes, as specified in Supplementary Table 9.
Occurrence of chromothripsis was determined by manual inspection of coverage ratio plots (tumour/control) for WGS samples based on previously proposed guidelines95: at least ten copy-number switches on one chromosome, oscillating copy-number variation (usually with changes of +1 or −1, but also between other levels where additional large-scale copy-number changes interfere), and many more of such copy-number variations in one chromosome or chromosome arm compared to the remaining genome. In samples with an exceptionally high degree of structural variation, several chromosomes could be affected, and some samples showed an ‘amplifier’ type of chromothripsis, which was classified as several high-level focal amplifications on exactly the same copy-number level that are thus likely to be connected to one single event.
Generation of copy-number profiles
Copy-number calls reported by ACEseq were converted to the ‘SEG’ segmentation format, similar to the output of the circular binary segmentation algorithm based on chromosomal segment borders as pseudo marker positions96. All possible marker positions were determined from the whole cohort before assessing sample-wise copy-number profiles per marker in order to achieve identical resolution for all samples. Owing to sparse and highly oscillating sequencing coverage at centromeres, centromeric coordinates (±3 Mb around the centre of annotated centromeres) were excluded from whole-genome segmentation, as were two likely artefact regions on chromosomes 7 and 14 with nonspecific occurrences of relative copy-number gains and losses in 28% and 30% of all analysed samples in 17 of 19 entities (14q11.2, 7p14.1), which were identified using GISTIC2.0 (as described below) with ±1 Mb.
Identifying recurrent copy-number/structural variations
GISTIC2.0 (v.2.0.22, gene-gistic default parameter settings) was applied to the segmented copy-number data (per cancer type and pan-cancer) to identify significant copy-number alterations36. The resulting peaks were filtered for significance (q ≤ 0.1) and size (≤10 Mb). Compared to array-based data, which commonly serve as inputs for copy-number significance analysis, sequencing-based copy-number profiles are more prone to artefact copy-number variations, for example, due to repetitive regions leading to ambiguous alignments. Thus, several filtering steps were used to eliminate false-positive GISTIC peak calls and to discover potentially cancer-relevant copy-number alterations: first, peaks overlapping with common fragile genomic sites were excluded, as these are likely to be consequences of genomic instability rather than cancer-driving events97; next, peaks overlapping within 1 Mb of chromosomal ends were removed, as here sequencing coverage tends to vary frequently; and last, peaks overlapping with copy-number variable regions98 (regions ranked 1–100) were excluded. Additionally, some of the resulting peaks were classified as ‘passengers’ of variable regions that were called as separated peaks from most likely one event, for example, a peak with MYCNOS as passenger peak of MYCN amplification. For overlapping peaks called in multiple entities and/or pan-cancer, the final region was determined based on the analysis with highest significance for each peak, respectively.
Genes with a breakpoint inside the gene borders were assumed to be altered by structural variation and considered as recurrently altered if they had breakpoints in ≥5 samples in total or in ≥2 samples of one cancer type (for samples without chromothripsis). For other samples, genes with breakpoints in ≥5 samples were included as candidates, but these were not used for further downstream analyses. Additionally, recurrent sites of structural variation outside of gene bodies by clustering breakpoints were determined in 10-kb windows.
Scoring of druggable mutations
To identify candidates for targeted therapy, somatic and germline mutations (SNV and indels) were screened for variants in genes that are directly or indirectly involved in pathways with matched drugs either approved or currently being investigated in clinical trials (Supplementary Table 22a, adapted from ref. 19). The mutations were then manually assessed by experts in translational oncology and prioritized according to an internal algorithm taking into account the type of alteration, the mechanism of action of potential drugs within the pathway, the level of evidence for the specific alteration, and its role in the present cancer type (Supplementary Table 22b, adapted from ref. 19). Only alterations scored ‘intermediate’ or ‘high’ were regarded as being relevant in terms of druggability. A clonality analysis was not performed owing to limited sequencing depth in whole-genome-sequenced tumours.
Additionally, copy-number plots of whole-genome-sequenced data (including low-coverage WGS) were used to manually screen 52 druggable genes for amplifications or deletions (Supplementary Table 22a). Only focal CNVs (<10 Mb) with at least 5 copies (log2 ≥ 1.3) in the case of amplifications or the loss of ≥ 1 copy (log2 ≤ −1) for deletions were included and subsequently prioritized as described for the SNVs/indels. The data representation includes all tumours with full genomic information (WES + lcWGS or WGS; n = 675) and, additionally, tumours analysed by WES only for cancer types without any whole-genome-sequenced tumours (T-ALL, Ewing’s sarcoma, HB; n = 39), but the latter were excluded from downstream analyses.
Mutation data have been deposited into commonly used public data portals and are accessible at http://pedpancan.com. They can be explored in and downloaded from the R2 Analysis and Genomics Platform, the PedcBio Portal for Cancer Visualization, and the TARGET Data Matrix. Sequencing data were obtained from previous studies as listed in Supplementary Note 1 and include the following accession codes: RP012816, PRJEB11430 (European Nucleotide Archive); EGAS00001001139, EGAS00001001953, EGAS00001000607, EGAS00001000381, EGAS00001000906, EGAS00001001297, EGAS00001000443, EGAS00001000213, EGAS00001000263, EGAS00001000192, EGAS00001000255, EGAS00001000254, EGAS00001000253, EGAS00001000256, EGAS00001000246, EGAS00001000379, EGAS00001000380, EGAS00001000346, EGAS00001000349, EGAS00001000347, EGAS00001000192 (European Genome-Phenome Archive).
This project was mainly supported and funded by the German Cancer Consortium (DKTK Pediatric Malignancies Joint Funding Project) and German Cancer Aid (#108128) and Deutsche Kinderkrebsstiftung (German Cancer Childhood Foundation) for the INFORM project. Additional support came from the German Ministry for Education and Research (BMBF #01KU1201A) and the German Cancer Aid (#109252) for the ICGC (International Cancer Genome Consortium) PedBrain Tumor Project and the ICGC MMML-Seq Project (within Program for Medical Genome Research Grants #01KU1002A–#01KU1002J), and the BioTOP Project (#01EK1502B). This work was also supported by an ERC starting grant to J.O.K. (#336045), MMML-MYC-SYS (#0316166I), ICGC DE-Mining (#01KU1505G), the Heidelberg Center for Personalized Oncology (DKFZ-HIPO) and the BMBF-funded Heidelberg Center for Human Bioinformatics (HD-HuB) within the German Network for Bioinformatics Infrastructure (de.NBI) (#031A537A, #031A537C). For technical support and expertise we thank the DKFZ Genomics and Proteomics Core Facility, M. Hain from the Division of Molecular Genetics (DKFZ), N. Jaeger and R. Kabbe from the Department of Pediatric Neurooncology (DKFZ), and S. Oelmez from the Data Management Group (DKFZ). We further thank members and technical staff of the ICGC MMML-Seq (International Cancer Genome Consortium Molecular Mechanisms in Malignant Lymphoma by Sequencing) and the European Renal Tumor Study Group (SIOP-RTSG).
Extended data figures
This file contains Supplementary Tables 1-24.