Introduction

Along with DNA mutation, epigenetic reprogramming can enable the acquisition of the hallmarks of cancer1,2. DNA methylation is an epigenetic mark that regulates genome interpretation during development, differentiation, and disease3. CpG islands (CGIs) are short, interspersed DNA sequences that are GC-rich, CpG-rich, and predominantly nonmethylated, with about 70% of gene promoters containing a CGI4. DNA methylation at CpG sites in CGIs of promoters causes stable silencing of genes5. Regarding pediatric brain and central nervous system (CNS) tumors in particular, DNA methylation array profiling as a method for classification represents a valuable tool to combine with histopathology6,7,8,9,10,11. For the approximately 100 known tumor types of the CNS, the high tissue- and cell type-specificity of DNA methylation makes it an ideal choice for developing tumor classifiers according to histologic type3,7. Beyond tumor classification, DNA methylation profiling can be used to catalog the genes epigenetically silenced or unsilenced in cancer cells12,13. Aberrant DNA methylation in cancer can lead to transcriptional silencing of tumor suppressor genes or a loss of gene regulation promoting tumorigenesis2,13.

While DNA methylation may uncover cell type differences or molecular subtypes underlying pediatric brain and CNS histologic types, as carried out by many studies to date3,7,8,9,10, another way to utilize DNA methylation profiling data is to integrate them with DNA mutation data. Mutations impacting genes include noncoding mutations identifiable by whole-genome sequencing (WGS)14. Somatic structural variants (SVs) are rearrangements of large DNA segments, which can include deletions, insertions, inversions, tandem duplications, translocations, and more complex rearrangements15. Severe hypomethylation has been associated with genomic instability2, and somatic SV breakpoints in cancer are enriched in hypomethylated domains16. Consistent with observations in human tumors17,18, experimental studies have demonstrated that DNA repair of double-stranded breaks—which would be involved in genomic rearrangements—can lead to altered CpG methylation at the repair site19,20,21,22,23. Somatic SVs impact gene transcription globally through diverse mechanisms, including copy number alterations, enhancer hijacking, Topologically Associated Domain (TAD) disruption, and altered DNA methylation18,24,25,26,27,28,29. Surveys of adult cancers have cataloged genes with CGIs having DNA methylation alterations consistently associated with nearby somatic SV breakpoints, with many of these genes representing known cancer drivers18,24. Recently, using combined WGS and RNA sequencing (RNA-seq) data from the Children’s Brain Tumor Network (CBTN)26,30,31,32, we cataloged hundreds of genes with altered expression in association with nearby somatic SV breakpoints across pediatric brain tumors of various histologic types26. Compared to adult cancers, pediatric brain tumors involved a different set of genes with SV-associated altered cis-regulation. However, DNA methylation data on the CBTN tumors were not available at the time of our earlier study.

Comprehensive knowledge of the genes underlying human cancers is a critical foundation for developing improved therapeutic approaches33. Through pan-cancer studies involving a range of different cancer types according to tissue of origin, catalogs of non-randomly mutated genes have been established at various levels, including within-gene single nucleotide variants (SNVs)33,34, DNA copy alterations35, and non-coding mutations36,37. Pan-cancer analyses have also used aberrant DNA methylation patterns in cancer to identify genes with associated cis-transcription changes13. A combined multi-omics dataset of WGS, RNA-seq, and DNA methylation in pediatric tumors offers unique opportunities to explore how non-coding mutations impact DNA methylation and related cis-transcription. We hypothesized that somatic SVs in pediatric brain and CNS tumors would show a nonrandom and genome-wide influence on DNA methylation, leading to recurrently altered expression of a critical set of genes involved with disease etiology or progression. Some of these genes may have known roles in pediatric brain cancers, while others may be less studied and would have the potential to help further understand disease biology and offer therapeutic opportunities. Global genomic studies of adult cancers18,24 are insufficient for understanding pediatric cancers, as different disease etiologies and associated cancer driver genes would be involved in the pediatric setting37. Like pan-cancer analyses, pediatric brain tumor studies across different histologies can examine commonalities and differences among the various diseases.

CBTN provides perhaps the largest open-access pediatric brain tumor multi-omic dataset annotated with longitudinal clinical and outcome data, representing pediatric brain and CNS tumors across multiple histologic types30. As part of its ongoing X01 phase30, CBTN continues to expand its datasets with additional tumor samples and data platforms, recently adding DNA methylation data for over 1700 tumors. The X01 extension includes additional paired longitudinal tumors for comparing progressive or recurrent with initial tumors. The recent CBTN-led OpenPBTA molecular survey study38 did not include the methylation data or X01 samples. The CBTN X01 molecular datasets offer unique opportunities to identify patterns cutting across histologic types of pediatric brain cancer31, in addition to genomic analyses within individual tumor types. This basic overall approach to pan-histology genomic analysis derives inspiration from pan-cancer studies of adult cancers of various tissues of origin, including large-scale efforts from The Cancer Genome Atlas (TCGA)13,28,39 and the Pan-Cancer Analysis of Whole Genomes (PCAWG)14.

In the present study, we carry out a multi-omic, pan-histology survey of pediatric brain and CNS tumors from CBTN to systematically identify DNA methylation alterations and any coordinate gene expression changes in conjunction with somatic SV patterns. Furthermore, we leverage longitudinal and patient survival data across diverse tumor histologies to help prioritize the widespread DNA methylation changes observed, as well as to define molecular signatures of progressive or recurrent tumors. The results reveal widespread associations of noncoding alterations with aberrantly methylated and expressed genes having cancer roles in pediatric tumors, distinct from the genes involved similarly in adult cancers.

Results

A multi-omic and pan-histology pediatric brain tumor dataset

Our study focused on 2417 pediatric brain and CNS tumors (involving 2127 patients) with data generated by the CBTN26,30,31,32 on at least one of three -omic data platforms (Supplementary Data 1): WGS (involving 1926 tumors from 1736 patients), RNA-seq (involving 1765 tumors from 1567 patients), and DNA methylation arrays (involving 1744 tumors from 1536 patients). Tumor samples in CBTN spanned at least 33 different tumor types based on histology, the most highly represented of which (30 or more tumors for each) included pediatric low-grade glioma/astrocytoma (PLGG, n = 556 tumors), ependymoma (EPMT, n = 298), medulloblastoma (MBL, n = 280), high-grade glioma/astrocytoma (PHGG, n = 250), craniopharyngioma (CRANIO, n = 98), ganglioglioma (GNG, n = 97), diffuse intrinsic pontine glioma (DIPG, n = 81), atypical teratoid rhabdoid tumor (ATRT, n = 77), meningioma (MNG, n = 67), dysembryoplastic neuroepithelial tumor (DNT, n = 53), schwannoma (SCHW, n = 47), choroid plexus papilloma (CPP, n = 39), and neurofibroma/plexiform (NFIB, n = 35). Some tumor types represented in this cohort, including sarcoma (SARCNOS, n = 41), germinoma (GMN, n = 11), Langerhans cell histiocytosis (LCH, n = 9), malignant peripheral nerve sheath tumor (MPNST, n = 9), and neuroblastoma (NBL, n = 10), originate from cell types not specific to the brain, even if the CBTN tumors were obtained from the brain region. As CBTN’s regulatory for CNS tumors intentionally allowed for the broad collection of abnormal cell growth, not necessarily limited to brain-specific cell types, our present study included data from all available tumor types, where even the anomalous tumors would provide useful information that may help compare or contrast the patterns found in tumors originating from brain cell types. Of the 2417 tumors with available data, 854 were part of our previous CBTN SV-expression study26.

By multiple data platforms, global molecular differences in pediatric brain tumors by histologic type were observable across the CBTN cohort. Based on unsupervised clustering of DNA methylation profiling data (a simplistic analytical approach compared to more refined approaches7), tumors broadly segregated according to histologic type, with associated global differences at the DNA methylation level also reflected at the mRNA level (Fig. 1a). Somatic SVs (based on WGS data) also showed differential patterns associating with one or more histologic types (Fig. 1b). For each gene, we tabulated the number of tumors with somatic SV breakpoints falling within 1 Mb of the gene start. A top set of 2277 genes showed significant breakpoint enrichment patterns for at least one of the major histologic types represented in the cohort (p < 0.0001, one-sided Fisher’s exact test, Supplementary Data 2). Most of these genes involved PHGG, DIPG, ATRT, and EPMT tumor types. Many of the observed SV breakpoint enrichment patterns by cytoband region involved copy gain or loss of key cancer genes previously associated with the involved histologic type, including SMARCB1 loss on chromosome 22q11.23 for ATRT40, FGFR1 gain on 8p11.23 for DNT41, MEN1 loss on 11q13.1 for EPMT42, NF2 loss on 22q12.2 for MNG43, and CDKN2A loss on 9p21.3 for PHGG44. Other breakpoint enrichment patterns involved gene fusions, notably KIAA1549-BRAF on 7q34 for PLGG45. Still, other breakpoint enrichment patterns involved genes for which expression or CGI methylation is higher or lower in association with SV breakpoints across the CBTN cohort after correction for tumor histology and gene-level copy, as further explored below.
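The breakpoint enrichment test described above can be illustrated with a one-sided Fisher's exact test on a 2×2 table of tumor counts. The following is a minimal sketch rather than the study's own code: the function name and the example counts (20 of 60 tumors of one histologic type with a nearby breakpoint, versus 40 of 1866 other tumors) are hypothetical, chosen only so that the total matches the 1926 tumors with WGS data.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided (enrichment) Fisher's exact test for the 2x2 table

        [[a, b],     a: tumors of the type WITH a nearby breakpoint
         [c, d]]     b: tumors of the type WITHOUT one
                     c: other tumors WITH a nearby breakpoint
                     d: other tumors WITHOUT one

    Returns P(X >= a) under the hypergeometric null distribution.
    """
    n_type = a + b          # tumors of the histologic type
    n_bp = a + c            # tumors with a nearby breakpoint
    n = a + b + c + d       # all tumors
    upper = min(n_type, n_bp)
    tail = sum(
        comb(n_type, k) * comb(n - n_type, n_bp - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, n_bp)


# Hypothetical counts for one gene and one histologic type (not from the study).
p_enrichment = fisher_one_sided(20, 40, 40, 1826)
is_enriched = p_enrichment < 0.0001  # threshold used in the text
```

Repeating such a test for every gene and histologic type yields a table of enrichment p-values like the one summarized in Fig. 1b.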

Fig. 1: Global molecular differences in pediatric brain tumors by histologic type involving DNA methylation, gene expression, and structural variation.

a Hierarchical clustering of the top 2000 most variable CGI probes, as carried out on the CBTN DNA methylation dataset, involving 1744 tumors and 1536 patients. The differential patterns for the 2000 CGI probes and the top 2000 most variable genes are shown (each respective feature set involving different genes), with features respectively ordered by hierarchical clustering. As expected, the tumors are broadly segregated according to histologic type, although there is some variability in the grouping of tumors of a given histology26. See Methods regarding histology-based tumor type abbreviations. b Somatic SV breakpoint enrichment patterns by brain tumor histologic type. For each gene, the number of tumors with somatic SV breakpoints falling within 1 Mb of the gene start was tabulated, involving 1926 tumors and 1736 patients with WGS data. Represented are 2283 genes with significant breakpoint enrichment patterns for at least one of the histologic types represented (p < 0.0001, one-sided Fisher’s exact test). Represented along the bottom are any genes for which expression or CGI methylation is higher or lower with nearby somatic SV breakpoints (positive and negative correlation with SV breakpoints, respectively, p < 0.01 across all tumors by linear modeling of 1 Mb region by distance metric method18,48, correcting for both histologic type and gene copy number). Source data are provided as a Source Data file.

Impact of somatic SVs on gene expression

We assessed gene-level associations between expression and nearby somatic SV breakpoints across the expanded CBTN tumor cohort, involving 1448 tumors from 1317 patients with combined RNA-seq and WGS data (Supplementary Fig. 1ac). This 1448-tumor cohort represented a considerable expansion from an earlier 854-tumor cohort previously analyzed for somatic SV-expression associations26, allowing us to define SV-expression associations here across additional tumors, for the purposes of downstream comparisons involving related DNA methylation patterns (see below). After corrections for SV-associated copy number alterations28,46, SVs with breakpoints near a gene can lead to altered cis-regulation, e.g., by enhancer hijacking or altered DNA methylation18, and SV breakpoints within a gene could result in gene disruption or a gene fusion28. Given multiple tumors from the same patient (e.g., involving multiple initial tumors or initial compared to recurrent or progressive tumors), we considered each tumor sample separately from the others in our analyses, as different tumors from the same patient tend to demonstrate extensive molecular heterogeneity among them24,26,47 (Supplementary Fig. 2ac). Using our previously demonstrated analytical approaches18,24,26,27,28,48,49, we defined gene-level SV-associated altered expression for each of the following genomic region windows: SV breakpoints within 100 kb upstream of the gene, 100 kb downstream, the gene boundary, or 1 Mb upstream or downstream (Supplementary Data 2). Significant SV-expression associations examined in downstream analyses below represent those that remained significant after correction for tumor histologic type and gene copy numbers, meaning that these covariates alone would not explain the SV-associated altered expression patterns observed (Supplementary Fig. 1b). Consistent with our previous studies in other cohorts18,24,26,27,28,48, many more genes showed positive correlations with SV breakpoints (i.e., expression was higher with nearby breakpoints) than negative correlations (Supplementary Fig. 1b).
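The covariate-corrected association described above can be pictured as an ordinary least-squares fit in which expression is regressed on a breakpoint term together with histologic type and gene copy number. The sketch below is illustrative only: the linear proximity weight (1 at the gene start, falling to 0 at 1 Mb), the function names, and the toy data are assumptions, and the distance metric method of refs. 18 and 48 may differ. Only coefficients are estimated here; significance testing is omitted.

```python
def proximity_weight(distance_bp, window_bp=1_000_000):
    """Hypothetical weight: 1 at the gene start, decreasing linearly to 0
    at the edge of the window; 0 when no breakpoint lies within it."""
    if distance_bp is None or distance_bp >= window_bp:
        return 0.0
    return 1.0 - distance_bp / window_bp


def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X)b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    p = len(X[0])
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        piv = A[col][col]
        A[col] = [v / piv for v in A[col]]
        for r in range(p):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


# Toy data for six tumors (all values hypothetical).
distances = [None, 500_000, 100_000, None, 800_000, None]  # nearest breakpoint
histology = [0, 0, 0, 1, 1, 1]                             # indicator for one type
copy_number = [0.0, 0.1, 1.0, -0.5, 0.3, 0.2]              # gene-level copy
prox = [proximity_weight(d) for d in distances]
expression = [1.0 + 2.0 * p + 0.5 * h + 0.3 * c
              for p, h, c in zip(prox, histology, copy_number)]

# Design matrix: intercept, breakpoint term, histology, copy number.
X = [[1.0, p, h, c] for p, h, c in zip(prox, histology, copy_number)]
intercept, beta_sv, beta_hist, beta_copy = ols(X, expression)
```

Because histology and copy number sit in the same model as the breakpoint term, a nonzero `beta_sv` reflects variation in expression that those covariates alone do not account for.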

Across the different genomic regions examined, hundreds of genes showed altered expression in relation to nearby somatic SV breakpoints in the expanded CBTN cohort (Fig. 2a and Supplementary Fig. 1b and Supplementary Data 2). When considering a 1 Mb region window upstream or downstream of each gene (with the model giving more weight to breakpoints closer to the gene), 271 genes showed positive correlations with SV breakpoints, and 80 genes showed negative correlations with FDR < 10%50, with corrections for histologic type and gene-level copy. There was a strong, significant overlap between the SV-expression associations from the expanded cohort and those from the earlier CBTN cohort of 854 tumors26 (Supplementary Fig. 3a). Of the above 351 genes with FDR < 10%, 199 were significant in the 854-tumor cohort (p < 0.05 by linear model with covariates). Differences between the respective cohorts can involve false positives or negatives and the sparse nature of SV breakpoint patterns. In contrast, there was less overlap (though statistically significant) between the CBTN pediatric tumor SV-expression associations and associations previously identified using TCGA and PCAWG cohorts18 (Supplementary Fig. 3b). We also evaluated significant SV-expression associations separately by tumor histologic type, with many genes being significant for specific tumor types but not reaching significance when analyzing the entire CBTN cohort (Fig. 2b and Supplementary Data 3). Of the above 351 genes with FDR < 10% across the entire cohort, 24 had a well-established cancer association by COSMIC51, a significant overlap (p = 0.003, one-sided Fisher’s exact test). These genes included MET, MYC, MYCN, TERT, CDKN2A, KIAA1549, ATRX, BCOR, TRIM24, and POU5F1 (Supplementary Data 2 and Fig. 2c), with SV events spanning multiple histologic types.
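For readers less familiar with FDR control, the sketch below computes Benjamini-Hochberg adjusted p-values, one common procedure for this purpose. It is offered only as an illustration: the FDR method of ref. 50 used in this study may differ, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest), and a running
    minimum taken from the largest rank downward enforces monotonicity.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical gene-level p-values from the linear models.
gene_pvalues = [0.01, 0.04, 0.03, 0.005]
gene_fdr = benjamini_hochberg(gene_pvalues)
significant = [q < 0.10 for q in gene_fdr]  # FDR < 10%, as in the text
```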

Fig. 2: Genes with altered expression associated with nearby SV breakpoints across 1448 pediatric brain tumors.

a Gene-level SV-associated altered expression was defined for the following genomic region windows: SV breakpoints within 100 kb upstream of the gene, 100 kb downstream, the gene boundary, or 1 Mb upstream or downstream. Significance of SV-associated altered genes (region with smallest FDR) is plotted (y-axis) versus the percentage of tumors with breakpoints (x-axis, not including gene amplification for positive associations). b For 1 Mb region, FDR in the most significant of 13 histologic types analyzed separately (x-axis) versus the FDR for the 1448 tumors analyzed together (y-axis). Genes in upper left quadrant reached significance only in the pan-histology analysis; genes in lower right quadrant reached significance only in one or more single-type analyses. Data point coloring represents the most significant tumor type. For a and b, gene-level significance by linear modeling, correcting for histology and copy. c Boxplots of ATRX, BCOR, TRIM24, and POU5F1 expression by alteration class (amp., amplification). P-values versus other by t-test. Boxplots represent 5%, 25%, 50%, 75%, and 95%. d Fractions of SVs involving TAD disruption and altered expression. e Percentages of SV breakpoint associations involving an enhancer near the gene, as tabulated for all SV breakpoint associations and the subsets involving altered gene expression. Left plot involves translocation SVs and enhancer hijacking. Right plot involves enhancer-spanning duplication SVs. f For translocation (trans.) SVs, left plot represents average methylation change (SV-associated versus unaltered) for altered expression versus all SV-gene associations; right plot, for higher or lower methylation versus all SV-CGI associations (beta difference >0.1 or <−0.1, respectively). For d, e, and f, OE and UE respectively involve higher or lower expression with p < 0.01 (linear model with covariates for 1 Mb region) and expression >0.4 SD or <−0.4 SD from the median for the tumor harboring the breakpoint. Enrichment p-values by chi-squared test. Methylation differences p-value by t-test. Error bars represent standard error. g By gene and histology, tumors involving the rearrangement of a region of low methylation (average beta difference < −0.2), with corresponding decrease in methylation and increase in expression (<−0.1 and >0.4 SD, respectively). Source data are provided as a Source Data file.

Multiple mechanisms would underlie the SV-expression associations mined from CBTN18,24,25,26,28 (Supplementary Data 4). SVs involved in gene over-expression were highly enriched for SVs that disrupt TADs (Fig. 2d). In addition, SV-gene associations (using 1 Mb region) involving both translocation SVs and mRNA over-expression were enriched (p = 0.001, chi-squared test) for potential enhancer hijacking events, where the rearrangement positioned an enhancer within 0.5 Mb of the gene (Fig. 2e). Another mechanism evident in the CBTN cohort involved the duplication of intergenic enhancer regions (Fig. 2e)52, whereby SV-gene associations involving both duplication SVs and mRNA over-expression were enriched (p < 1E-200, chi-squared test) for SVs with breakpoints spanning a nearby enhancer. Genes with higher expression associated with tandem duplication of nearby enhancers included MYC, MYCN, TERT, TRIM24, RCOR2, KIAA1549, and C11orf95/ZFTA (Supplementary Data 4). SV-expression associations also involved altered DNA methylation patterns. Regarding translocation SVs, SV-gene associations involving under-expression showed an increase on average in DNA methylation with the breakpoint mate compared to the unaltered region (Fig. 2f, left, p < 1E-15, t-test). For translocation SVs involving associations between SV breakpoints and CpG islands (CGIs) near the gene, SV-CGI associations involving lower methylation of the CGI array probe also showed a decrease on average in DNA methylation with the breakpoint mate as compared to the unaltered region (Fig. 2f, right, and 2g, p < 1E-90, t-test).

Impact of somatic SVs on DNA methylation

In adult cancers, somatic SVs have been associated with recurrent alterations in DNA methylation of CGIs18,24. The expanded CBTN X01 datasets—with combined DNA methylation and WGS data involving 1343 tumors from 1209 patients (Supplementary Fig. 4a)—provided an opportunity to explore this phenomenon in pediatric brain tumors. At the DNA methylation level, a set of 1780 CGI array probes (out of 133,345 examined) showed SV-associated altered DNA methylation at FDR < 10% significance level (linear modeling correcting for covariates) for any one of four gene region windows examined for SV breakpoints: 100 kb upstream, 100 kb downstream, within the gene, and 1 Mb upstream or downstream (Fig. 3a and Supplementary Fig. 4b and Supplementary Data 5). In the CBTN cohort, CGI probes with SV-associated decreased methylation were highly enriched for gene body CGIs but anti-enriched for promoter-associated CGIs (Fig. 3b, p < 1E-30, chi-squared test), consistent with previous observations in adult cancers18,24. We observed a small but statistically significant overlap between the positive SV-CGI methylation associations in the CBTN cohort and the positive associations previously identified using TCGA cohort18 (Supplementary Fig. 4c). We also evaluated significant SV-CGI methylation associations separately by tumor histologic type in CBTN, with significant numbers of these involving SV-expression associations for the same genes in the same histologic type but not reaching statistical significance in the entire pan-histology cohort (Supplementary Fig. 4d, e and Supplementary Data 6). SV breakpoint events were broadly associated with increased or decreased CGI methylation, not limited to recurrently altered CGI probes (Supplementary Fig. 4f).

Fig. 3: Genes with altered CGI methylation and concordant expression associated with nearby somatic SV breakpoints.

a Heatmap of significance patterns for 1780 CGI probes associated with SV-altered DNA methylation (FDR < 10%, linear model) for any genomic region window examined (SV breakpoints 100 kb upstream of the gene, 100 kb downstream, within the gene, or 1 Mb upstream or downstream, as indicated). Red denotes significant positive correlation; blue, negative correlation. The corresponding significance results for the CGI-associated genes are represented at the mRNA level. COSMIC51 genes listed to the right were significant for both methylation and expression (p < 0.05 for each, linear model). b Top: Fraction of promoter-associated CGIs for those respectively associated with increased or decreased methylation (from part a). Bottom: Breakdown by probe position relative to the gene for CGIs associated with increased or decreased methylation, respectively. TSS, transcription start site; UTR, untranslated region. Meth., methylation. c Overlap between CGI probes with SV-associated altered methylation and nearby genes with corresponding SV-associated altered expression (p < 0.01 for both by linear modeling). Genes with mRNA- and methylation-SV breakpoint associations inverse to each other are listed by name with CGI probe numbers involved. For b and c, enrichment p-values by chi-squared test. In parts a–c, SV associations correct for histology and gene copy, and significant CGI probes are filtered for those with at least one tumor with SV breakpoint involving methylation beta value difference >0.2 from the sample median. d DNA methylation (purple) and mRNA (orange) levels of BCOR, TERT, ATRX, and CDKN2A, corresponding to SV breakpoints in the genomic region surrounding the respective genes. Each point represents a single tumor (closest breakpoint for each tumor). Methylation and mRNA values are respectively normalized to standard deviations from the median (using logit- and log2-transformed values, respectively). Part e indicates the CGI probes used for each gene. Tumor histology is indicated at the top of each plot (color coding in part e). e Boxplots of BCOR, TERT, ATRX, and CDKN2A CGI methylation for indicated probes by alteration class. P-values versus other by t-test on logit-transformed values. Boxplots represent 5%, 25%, 50%, 75%, and 95%. Source data are provided as a Source Data file.

Significant fractions of the SV-CGI methylation associations present in the CBTN cohort also involved SV-expression associations for the gene near the CGI (Fig. 3c). Using a p-value cutoff of <0.01 (1 Mb region, with corrections for covariates), 1177 CGI methylation probes were positively correlated with nearby SV breakpoints, of which 61 probes involved genes negatively correlated between mRNA expression and SV breakpoints (p < 0.01), a significant overlap (p < 1E-30, chi-squared test). The 61 CGI probes included five for CDKN2A and three for ATRX, the associated SVs involving multiple histologic types, including PHGG and PLGG (Fig. 3c–e). Out of 926 CGI probes negatively correlated with nearby SV breakpoints (p < 0.01), 79 involved genes positively correlated between mRNA and SV breakpoints (overlap p < 1E-40, chi-squared test). Genes involving SV-associated lower CGI methylation with concordant mRNA changes included MYC, C11orf95/ZFTA (involving the C11orf95-RELA gene fusion in EPMT53, Supplementary Data 1), RCOR2, TERT, and BCOR (Fig. 3c). BCOR-associated SVs appeared to represent the rare disease entity known as central nervous system tumor with BCOR internal tandem duplication (ITD)54, with the ITD breakpoints but not the other breakpoints being associated with both decreased methylation and increased expression (Fig. 3d, e). Interestingly, while several CGI probes near TERT had SV-associated decreased methylation and increased expression (Fig. 3c), one probe, cg02545192, involving a repressive element located near the TERT core promoter55,56, showed both methylation and expression increased with nearby SV breakpoints (Fig. 3d, e). The SV associations involving cg02545192 in the CBTN cohort align with similar observations in the PCAWG adult cancer cohort27.
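The overlap tests reported above can be framed as a chi-squared test on a 2×2 table that cross-classifies CGI probes by methylation association and by expression association of the nearby gene. The following sketch uses the Pearson statistic without continuity correction, which is an assumption about the test variant. The counts 61 and 1177 come from the text, and 133,345 (the number of CGI probes examined) is assumed here as the probe universe; the number of probes whose gene is negatively correlated with breakpoints (3000) is a hypothetical placeholder, so the resulting p-value does not reproduce the reported one.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom, no continuity
    correction) for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). For 1 degree of freedom the upper tail
    of the chi-squared distribution equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, erfc(sqrt(stat / 2.0))


n_probes = 133_345        # CGI probes examined (assumed universe)
n_meth_pos = 1177         # probes positively correlated with SV breakpoints
n_expr_neg = 3000         # HYPOTHETICAL: probes whose gene is negatively correlated
n_both = 61               # probes in both sets (from the text)

a = n_both
b = n_meth_pos - n_both
c = n_expr_neg - n_both
d = n_probes - n_meth_pos - c
overlap_stat, overlap_p = chi2_2x2(a, b, c, d)
```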

We also looked beyond CGI methylation probes to examine DNA methylation changes involving gene enhancers and nearby somatic SV breakpoints. DNA methylation at distal enhancer regions has been implicated in gene regulation, mainly by interfering with transcription factor binding to enhancer regions57,58,59,60. For each of 30292 enhancers61, we mapped the nearest DNA methylation probe and determined its methylation association with nearby SV breakpoints within 100 kb of the enhancer (Fig. 4a). We found 144 enhancers with an SV-methylation association significant at FDR < 10% (linear modeling with covariates) and with at least one SV breakpoint involving methylation beta value difference>0.2 from sample median (Fig. 4b and Supplementary Data 4). For most of these significant enhancers, SV breakpoints associated with lower methylation and involved less than 5% of tumors in the CBTN cohort (Fig. 4b), though for most of the 144 enhancers, the SV-methylation association remained significant when expanding the SV region window to 1 Mb (Supplementary Data 4). Two enhancers of particular interest involved MYC and MYCN, respectively (Fig. 4d, e). SV breakpoints within 1 Mb of the gene enhancer involved 45 tumors for MYC enhancer and 71 tumors for MYCN enhancer, including PHGG and MNG tumors. Interestingly, when examining all CBTN patients with survival data (with one tumor per patient in the analysis), lower DNA methylation levels near MYC enhancer and MYCN enhancer respectively associated with worse outcomes, even after correcting for survival differences according to histologic type (Fig. 4f). Tumors harboring SVs associating with lower enhancer methylation for MYC or MYCN also tended to have amplification of the respective gene (Fig. 4d), though the DNA methylation associations with survival remained significant (by multivariate Cox) after correcting for amplification status.
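The enhancer-to-probe mapping described above amounts to a nearest-neighbor lookup along a chromosome. A minimal sketch, assuming probe positions are sorted and restricted to a single chromosome, is given below; the function name, the probe identifiers, and the coordinates are hypothetical, and the 20 kb limit follows the Fig. 4a legend.

```python
from bisect import bisect_left


def nearest_probe(enhancer_pos, probe_positions, probe_ids, max_dist=20_000):
    """Return (probe_id, distance) for the methylation probe closest to an
    enhancer, or None if no probe lies within max_dist base pairs.

    probe_positions must be sorted ascending and refer to the same
    chromosome as enhancer_pos.
    """
    i = bisect_left(probe_positions, enhancer_pos)
    best = None
    # Only the probes immediately left and right of the insertion point
    # can be the nearest one.
    for j in (i - 1, i):
        if 0 <= j < len(probe_positions):
            dist = abs(probe_positions[j] - enhancer_pos)
            if dist <= max_dist and (best is None or dist < best[1]):
                best = (probe_ids[j], dist)
    return best


# Hypothetical probes on one chromosome.
positions = [1_000, 50_000, 120_000]
ids = ["probe_A", "probe_B", "probe_C"]
hit = nearest_probe(45_000, positions, ids)
miss = nearest_probe(90_000, positions, ids)
```

The mapped probe for each enhancer can then be tested for association with breakpoints within 100 kb of the enhancer, using the same kind of covariate-adjusted linear model applied to the CGI probes.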

Fig. 4: Altered DNA methylation near enhancers associated with nearby somatic SV breakpoints.

a Schematic of the analysis approach. For each of 30292 enhancers61, we mapped the nearest DNA methylation probe (within 20 kb) and determined its methylation association with nearby SV breakpoints (within 100 kb) by linear modeling correcting for histologic type. b Significance of SV-enhancer altered methylation (involving breakpoints within the region 100 kb upstream or downstream) is plotted (y-axis) versus the percentage of tumors with breakpoints (x-axis). c For top enhancers with significant SV-methylation association (p < 0.01 by linear modeling, involving>9 tumors), the genes within 500 kb of the enhancer were examined for SV-expression association (p < 0.01 by linear modeling with covariates, 1 Mb region). One-sided Fisher’s exact tests compare the number of genes with significant SV-expression association with chance expected. d DNA methylation levels near enhancers associated with MYC (left) or MYCN (right), corresponding to somatic SV breakpoints in the surrounding genomic region. Each point represents a single tumor (the closest SV breakpoint for each tumor). Breakpoints near the respective enhancers tend to show lower methylation for the corresponding probe. e Boxplots of DNA methylation for indicated probes involving MYC enhancer (left) and MYCN enhancer (right) by alteration class. P-values versus other by t-test on logit-transformed values. Boxplots represent 5%, 25%, 50%, 75%, and 95%. f Association with pediatric brain tumor patient survival of DNA methylation levels near MYC enhancer (left) and MYCN enhancer (right). P-values by univariate Cox and log-rank test, correcting for histologic type (n = 1162 patients, one tumor per patient). Source data are provided as a Source Data file.

Somatic SVs and associated genes involving patient survival

To help further prioritize the widespread molecular changes associated with somatic SVs, we leveraged patient survival data from the CBTN and TCGA adult pan-cancer cohorts (the latter involving 33 cancer types)62. As different tumor types based on histology or tissue of origin may involve differences in patient survival over time, we utilized statistical models to correct for tumor type, whereby any associations of molecular features with survival would not be explainable by differences involving tumor type representation alone24,25,39,47. Of the 677 genes with SV-expression associations at p < 0.01 for 1 Mb region (linear model adjusting for covariates) across the CBTN cohort, 126 were also associated with patient overall survival (p < 0.05, Cox incorporating histologic type) in the same direction (e.g., genes both higher with SV breakpoints and higher in patients with worse survival), representing a highly significant overlap (p < 1E-8, chi-squared test, Fig. 5a and Supplementary Data 7). Similarly, out of 2104 CGI probes with SV-methylation associations at p < 0.01 (linear model after filtering), 268 involved genes with expression associated with worse patient outcome (p < 0.05) but in the opposite direction, also a significant overlap (p = 0.0003, chi-squared test, Fig. 5b and Supplementary Data 7). The respective enrichments of SV-associated molecular alterations for genes associated with survival suggest a role for many of these genes in more aggressive disease. Along these lines, genes with lower CGI methylation associated with SV breakpoints and with high expression associated with worse survival included several involving chromosome organization or histone modification (Supplementary Fig. 5a and Supplementary Data 7), including BCOR, CARM1, DAXX, DOT1L, and RCOR2.

Fig. 5: Somatic SV-associated genes involving patient survival.

a Gene-level expression associations with pediatric brain tumor patient overall survival were examined in the CBTN cohort of 1259 patients (correcting for histologic type). Of 677 genes with SV-expression associations in CBTN (p < 0.01, using 1 Mb region, with covariates), 126 had expression associated with worse patient outcome (p < 0.05, Cox with histology correction) in the same direction (significance of overlap p < 1E-8, chi-squared test). b Of 2104 CGI probes with SV-methylation associations in CBTN (p < 0.01, using 1 Mb region, with covariates), 268 involved genes with expression associated with worse patient outcome in the opposite direction (significance of overlap p = 0.0003, chi-squared test). CGI probes are filtered for those with at least one tumor with SV breakpoint involving methylation beta value difference >0.2 from the sample median. c Association of the 268-CGI methylation signature from part b with patient survival in the CBTN pediatric DNA methylation dataset (n = 1162 patients). P-values by log-rank test (patients binned by tertiles) and by Cox, as indicated, both correcting for histologic type. Meth., methylation; sig., signature. d Left, association of the 126-gene expression signature from part a with patient survival in TCGA adult pan-cancer gene expression dataset (n = 10152 patients)82. Right, association of the 268-CGI methylation signature from part b with patient survival in TCGA adult pan-cancer DNA methylation dataset (n = 8833 patients). Statistical tests correct for cancer type unless otherwise noted. e For RCOR2, included in both part a and part b signatures, nearby SV breakpoints associate with decreased methylation (left, t-test on logit-transformed values) and increased expression (middle, t-test), and higher expression associates with worse pediatric brain tumor patient outcome (right, p-values correcting for histologic type, n = 1259 patients). f For PDLIM4, included in the part b signature, association of nearby SV breakpoints with increased CGI methylation (left, t-test on logit-transformed values), association of higher CGI methylation with worse pediatric brain tumor patient outcome (middle, p-values correcting for histologic type, n = 1162 patients), and association of higher expression with better patient outcome (right, p-values correcting for histologic type, n = 1259 patients). Boxplots in parts e and f represent 5%, 25%, 50%, 75%, and 95%.

The SV-associated gene alterations involving patient survival, as uncovered in the CBTN datasets, also had prognostic power in adult cancers in TCGA representing various tissues of origin. This finding indicated that the genes collectively represented information as to the genes and processes underlying more aggressive versus less aggressive cancers, not limited to pediatric brain tumors. We applied the above 268-CGI methylation signature (Fig. 5b) to the CBTN DNA methylation dataset of 1162 patients with pediatric tumors, where we had not used methylation-level survival associations to define the signature. We scored each DNA methylation profile based on its overall level of similarity or dissimilarity to the 268-CGI signature pattern. When stratifying patients into tertiles based on the methylation signature score, we observed major differences in overall survival among the groups, even after correcting for overall survival differences by histologic type (p < 0.0001, stratified log-rank test, Fig. 5c). At one year of follow-up, patients in the bottom third of signature scores had an 83% chance of survival, while patients in the top third had a 37% chance. We similarly applied both the 126-gene mRNA signature (Fig. 5a) and the 268-CGI signature (Fig. 5b) to the RNA-sequencing and DNA methylation datasets, respectively, in TCGA (Fig. 5d). Interestingly, both signatures could stratify adult patients into higher-risk versus lower-risk groups in terms of overall survival, even after correcting for survival differences by cancer type alone. Many factors, not limited to SVs, may underlie tumor expression and DNA methylation patterns in the external datasets.
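Signature scoring followed by tertile binning can be sketched as below. The text does not give the scoring formula, so this sketch assumes a simple score (mean of features expected to be higher minus mean of features expected to be lower); `signature_score`, `tertile_labels`, and the feature lists are hypothetical names for illustration only.

```python
def signature_score(profile, up_features, down_features):
    """Mean of 'up' features minus mean of 'down' features in one profile.

    Assumed scoring rule for illustration; features absent from the profile are skipped.
    """
    up = [profile[f] for f in up_features if f in profile]
    down = [profile[f] for f in down_features if f in profile]
    up_mean = sum(up) / len(up) if up else 0.0
    down_mean = sum(down) / len(down) if down else 0.0
    return up_mean - down_mean

def tertile_labels(scores):
    """Assign each patient to the bottom, middle, or top third by score rank."""
    order = sorted(scores, key=scores.get)
    n = len(order)
    names = ("bottom", "middle", "top")
    return {patient: names[min(3 * rank // n, 2)]
            for rank, patient in enumerate(order)}
```

The resulting tertile labels would then feed a survival comparison stratified by histologic type, which is outside the scope of this sketch.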

Many genes of interest arose from the above survival-related analyses that did not already have a well-established cancer role as denoted by COSMIC51, including RCOR2 and PDLIM4 (Fig. 5e, f). RCOR2, a co-repressor of REST (also known as Neuron-Restrictive Silencing Factor, or NRSF), has a role in regulating the proliferation-differentiation balance in the developing brain63. RCOR2 was part of both mRNA and CGI methylation survival signatures, with nearby SV breakpoints associating with decreased CGI methylation and increased expression and with higher expression associating with worse pediatric brain tumor patient outcome (Fig. 5e). PDLIM4, PDZ and LIM domain protein 4, is repressed by hypermethylation in prostate cancer and suppresses prostate cancer cell growth64, making it a candidate tumor suppressor gene. PDLIM4 was part of the 268-CGI methylation survival signature, with nearby SV breakpoints associating with increased CGI methylation, and with both higher CGI methylation and lower expression associating with worse pediatric brain tumor patient outcome (Fig. 5f).

Molecular signatures of progressive or recurrent tumors

Of the 2417 tumors in our expanded CBTN cohort, 173 (involving 139 patients) were progressive or recurrent tumors for which an initial tumor from the same patient was also sampled (Supplementary Data 8). These data presented an opportunity to explore molecular changes associated with progression or recurrence in pediatric brain tumors, using a paired analysis according to patient. Consistent with previous results obtained from the earlier CBTN cohort26, increased numbers of somatic SVs were detected in recurrent or progressive tumors from a given patient compared to the initial tumor from the same patient (Fig. 6a, p < 1E-7, paired t-test). With the DNA methylation dataset of the expanded X01 cohort, we found that average CGI probe methylation was also increased in recurrent or progressive tumors (Fig. 6a, p = 0.0007, paired t-test). By paired analysis, we observed widespread gene expression and CGI methylation differences in recurrent/progressive tumors at a significance level of FDR < 10%, involving 890 mRNAs and 5716 CGI probes, respectively (Fig. 6b, c and Supplementary Data 8). Interestingly, the CGI probes showing higher methylation in progressive or recurrent tumors overlapped significantly with the probes for genes with higher expression in progressive or recurrent tumors (Fig. 6b).
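The paired comparisons above rest on a paired t-statistic computed from within-patient differences. A minimal sketch follows, assuming the inputs are already transformed (log2 for SV counts or expression, logit for methylation); `paired_t` is a hypothetical helper, and it returns the statistic and degrees of freedom only, since a p-value would additionally require the t distribution.

```python
import math

def paired_t(initial, later):
    """Paired t-statistic and degrees of freedom for matched, pre-transformed values."""
    diffs = [b - a for a, b in zip(initial, later)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the within-patient differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1

# Toy values (hypothetical): one initial and one later tumor per patient.
t_stat, dof = paired_t([2.0, 3.0, 4.0, 5.0], [3.0, 5.0, 6.0, 9.0])
```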

Fig. 6: Gene expression and DNA methylation signatures of progressive or recurrent pediatric brain tumors associate with worse patient outcomes.

a Significant differences between initial tumor and recurrent or progressive tumors for total number of somatic SVs detected (left) and for average CGI probe methylation (right), based on paired analyses. P-values by paired t-test on log2-transformed and logit-transformed data, respectively. Boxplots represent 5%, 25%, 50%, 75%, and 95%. b Overlap between CGI probes with methylation (meth.) higher in progressive or recurrent tumors and CGI probes (FDR < 10%, paired t-test on logit-transformed data, with beta difference >0.1 for ten or more tumors) for which the corresponding gene had higher or lower expression (p < 0.01, paired t-test on log2-transformed data) in progressive or recurrent tumors. Enrichment p-values by chi-squared test. c Top, heat map of 890 genes differentially expressed (FDR < 10%, paired t-test on log2-transformed data) in recurrent or progressive tumors compared to the initial tumor, involving 124 recurrent or progressive tumors from 101 unique patients. Bottom, heat map of 5716 CGI probes differentially methylated (FDR < 10%, paired t-test on logit-transformed data, with beta difference >0.1 for ten or more tumors). Expression or methylation values for each recurrent or progressive tumor represented are centered on the corresponding initial tumor. A subset of genes with both higher methylation and lower expression are listed off to the right. d Left, the gene expression signature from part c was applied to the expression profiles from the remaining CBTN pediatric brain tumor patients not used to define the signature (n = 1139 patients), associating with worse overall survival in these patients. Right, the DNA methylation signature from part c was applied to the methylation profiles from the remaining CBTN pediatric brain tumor patients not used to define the signature (n = 1049 patients), associating with worse survival. 
e The DNA methylation signature from part c was applied to the methylation profiles from TCGA adult pan-cancer patients (n = 8833 patients), associating with worse survival. For parts d and e, survival association p-values correct for histologic or cancer types.

The gene expression and DNA methylation signatures of progressive or recurrent tumors (Fig. 6c) could predict patient survival in validation cohorts, indicating that these signatures reflected more aggressive disease. We applied each signature to the expression profiles and DNA methylation profiles, respectively, from the remaining CBTN pediatric brain tumor patients not used to define the signature, and in each instance the signature could stratify patients by survival after correction for differences by histologic type (Fig. 6d). Interestingly, the gene expression signature of progressive or recurrent tumors from CBTN did not robustly predict survival in TCGA adult cancers, while the DNA methylation signature did, even after correcting for cancer type (Fig. 6e). In contrast to the somatic SV associations explored above, many molecular differences associated with progressive or recurrent tumors likely involve both non-cancer and cancer cells47. Along these lines, genes with lower expression included those related to extracellular exosomes, immune response, and MHC protein complex (Supplementary Fig. 5a and Supplementary Data 8). Previously, we had defined paired expression differences between recurrent or progressive tumors and the corresponding initial tumors using a limited dataset of 63 tumors from 55 patients, where few genes had reached significance after FDR correction, likely due to limited statistical power47. However, highly significant numbers of the nominally significant genes from the earlier cohort overlapped with the significant genes from the additional cohort, which benefitted from increased tumor numbers and power (Supplementary Fig. 5b and Supplementary Data 8), a further indication that these expression differences are reproducible and robust.

Discussion

Our present study systematically catalogs the specific genes and nearby CGIs with SV-associated altered expression and methylation, respectively, across pediatric brain and CNS tumors. Here, we took advantage of the unique molecular profiling datasets offered by CBTN, which included combined DNA methylation and WGS profiling for over 1300 tumors. We identify specific genes of known relevance to pediatric tumors impacted by SV-associated altered methylation involving appreciable numbers of tumors, including fusion partner genes such as KIAA1549 and C11orf95/ZFTA, tumor suppressor genes such as ATRX and CDKN2A, and oncogenes such as MYC, MYCN, and TERT. Understanding how this class of noncoding genomic alterations impacts genes with established cancer roles would have implications for personalized and precision medicine approaches, as these genes would be altered in additional patients whom exome-focused approaches would miss26. Other genes arising in our findings that warrant further study in pediatric cancers include BCOR, RCOR2, and PDLIM4, which could represent additional disease drivers and potential targets and for which there is literature support for roles in other cancer types63,64. Some of our top significant genes would primarily involve one or two of the histologic types studied, though most genes involved tumors spanning multiple histologies in the CBTN cohort, underscoring the rationale for our pan-histology approach. While SV-associated alterations for any gene may involve only a fraction of tumors, the overall impact of SVs across many cancer gene drivers would cumulatively involve an appreciable percentage of patients24,26. Here, we establish that genomic rearrangement has a major role in shaping the cancer DNA methylome of pediatric brain tumors, to be considered alongside commonly accepted mechanisms, including histone modifications and disruption of DNA methyltransferases18. 
While this phenomenon has been examined in adult cancers18,24, the specific genes impacted in pediatric tumors would differ from those observed in adult cancers, with the two broadly representing distinct disease entities26.

Our study approach reveals evidence for diverse mechanisms of noncoding somatic alterations underlying the deregulation of genes with key roles in pediatric tumors. In addition to enhancer hijacking and enhancer duplication events, our present study showed that somatic SVs could involve DNA methylation changes surrounding gene enhancers, impacting the regulation of nearby genes. While the importance of enhancers in MYC-driven cancer initiation and progression has recently become understood65, a role for SV-mediated lower DNA methylation at MYC and MYCN enhancers in cancers such as high-grade pediatric glioma and neuroblastoma may warrant further attention. Somatic SVs underlying altered methylation or expression involve different genomic coordinates in different patients49, often entailing different mechanisms, complicating our studying the functional impact of SVs. While the SV-associated mechanisms explored here are understood to exist in principle, our cataloging of the specific genes involved in a given mechanism in pediatric brain and CNS tumors represents an important exercise, revealing additional genes with potential roles in specific tumor subsets. In certain cancer types, mutation within specific genes such as IDH1 or SETD2 can involve global hyper- or hypomethylation patterns in tumors13, while the results of our study identify somatic SVs involving alterations specifically targeting a more select set of genes. DNA repair of double-stranded breaks can lead to altered CpG methylation at the repair site19,20,21,22,23. In experimental models, genome-wide hypomethylation has been repeatedly observed in structurally unstable cancer genomes66,67. Also, structural transitions defining the boundaries of the unmethylated CpG island in normal cells may induce local changes in chromatin accessibility, which can, in turn, trigger hypermethylation and subsequent gene silencing68. More investigation into mechanisms involving SV-altered methylation is warranted. 
Where global DNA methylation patterns would underlie the cell-of-origin for a given CNS tumor type, somatic SVs would be one way to alter the DNA methylation landscape such that specific genes normally silenced are unsilenced or enhanced while other genes normally expressed are silenced.

Our present study leveraged longitudinal and patient survival data to help prioritize the widespread molecular changes associated with somatic SVs. Our pan-histology approach identified widespread molecular correlates of patient survival across the CBTN patients after correction for differences by histologic type. We identified significant patterns of overlap between gene survival correlates and SV-associated genes, suggesting some of these genes have specific roles in more aggressive disease. The widespread associations with patient survival involving DNA methylation and gene expression that we observed here represent information that could be mined to identify additional potential targets. Different criteria may be factored in when selecting genes from our results for further study, including evidence in the literature for disease-relevant roles based on studies of non-pediatric cancers. Much of the information represented by our molecular signatures of survival would also be relevant beyond pediatric brain tumors, as we could apply DNA methylation signatures from CBTN patients to predict survival in adult cancers of various tissues of origin. Several factors, not limited to SVs, may underlie the expression and DNA methylation patterns in tumors, even where adult tumors may have different SV-expression associations from pediatric tumors. Future genomic datasets involving greater numbers of patients, providing increased statistical power, would considerably aid in further establishing the catalog of recurrently altered genes in pediatric brain and CNS tumors. Additional data would also help define survival associations within each histologic type, as each histologic type would have a unique set of somatic alterations driving the disease. Data generation for >5000 CBTN samples is currently underway30, and these data could be leveraged together with data from other large-scale genomic initiatives. 
Future studies could also define genes involved in patient responses to therapy as such data become available25. As reflected in our study, data integration across multiple platforms of molecular complexity can provide a more complete picture of the key genes involved in patient tumor subsets.

Methods

Patient cohorts

Results are based on data from the Children’s Brain Tumor Network (CBTN), some of which (involving the initial, pre-X01 patient cohort) have been included in previous studies26,38. Patients were consented by one of 32 participating sites and enrolled on a local IRB-approved protocol which includes key language to enable prospective collection of, future research on, and sharing of, de-identified surgical specimens, patient demographics, medical history, diagnoses, treatments, and clinical imaging30. Patient ancestry was self-defined. Patient sex was self-reported. Tumor molecular profiling data were generated through informed consent as part of CBTN efforts and analyzed here per CBTN’s data use guidelines and restrictions. Analysis of patient data was limited to data made publicly available in accordance with the informed consent. Protected CBTN molecular data, including raw sequencing data, are available under restricted access. To gain access to protected data, one can apply to the CBTN and sign a Data Use Agreement.

At the time of this study, 2417 tumors representing 2127 patients had data available on at least one data platform for whole-genome sequence (WGS, at 60x coverage), RNA sequencing (RNA-seq), or DNA methylation (Supplementary Data 1). Combined WGS and RNA-seq data were available for 1448 tumors representing 1317 patients. Combined WGS and DNA methylation data were available for 1343 tumors representing 1209 patients. Tumor samples in CBTN spanned at least 33 different tumor types: APTAD, Adenoma; ATRT, Atypical Teratoid Rhabdoid Tumor; CHDM, Chordoma; CNC, Neurocytoma; CPC, Choroid plexus carcinoma; CPP, Choroid plexus papilloma; CRANIO, Craniopharyngioma; DIPG, Diffuse intrinsic pontine glioma; DNT, Dysembryoplastic neuroepithelial tumor (DNET); EPM, Subependymal Giant Cell Astrocytoma (SEGA); EPMT, Ependymoma; ES, Ewing’s Sarcoma; GMN, Germinoma; GNBL, Ganglioneuroblastoma; GNG, Ganglioglioma; GNOS, Glial-neuronal tumor not otherwise specified (NOS); HMBL, Hemangioblastoma; LCH, Langerhans Cell histiocytosis; MBL, Medulloblastoma; MNG, Meningioma; MPNST, Malignant peripheral nerve sheath tumor; NBL, Neuroblastoma; NFIB, Neurofibroma/Plexiform; ODG, Oligodendroglioma; PBL, Pineoblastoma; PCNSL, Primary CNS lymphoma; PHGG, High-grade glioma/astrocytoma (WHO grade III/IV); PLGG, Low-grade glioma/astrocytoma (WHO grade I/II); PNET, Supratentorial or Spinal Cord primitive neuroectodermal; RMS, Rhabdomyosarcoma; SARCNOS, Sarcoma; SCHW, Schwannoma; TT, Teratoma; and Other/unspecified. The histologic designations of the tumors, as provided by the individual CBTN member institutions contributing the samples, were confirmed by independent pathology review at the CBTN centralized biorepository, with most contributing sites providing representative histology slides.

A subset of CBTN tumors represents multiple tumors taken from the same patient, involving 514 tumor samples from 224 patients. Of the 1448 tumors with combined WGS and RNA-seq data, 236 involved multiple samples for 105 patients; of the 1343 tumors with combined WGS and DNA methylation data, 241 involved multiple samples for 107 patients. As indicated in Supplementary Data 1, multiple tumors from the same patient may entail samples from multiple initial tumors or samples taken at different times, e.g., samples taken initially from the initial tumor and later from a progressive or recurrent tumor. Different tumors from the same patient often demonstrated extensive molecular heterogeneity with respect to each other (Supplementary Fig. 2a), as demonstrated previously24,26,31,47,69. Therefore, each tumor sample was analyzed independently in the integrative analyses, with the numbers of patients and tumors involved with a particular pattern of interest noted where warranted. Alternate analyses taking one tumor per patient yielded similar SV-expression or SV-methylation associations as that of the entire dataset (Supplementary Fig. 2b). One tumor was selected to represent the patient for analyses involving pediatric brain tumor patient survival (see below).

The results here are also based partly on data from The Cancer Genome Atlas (TCGA) Research Network and the International Cancer Genome Consortium (ICGC). Previously, we carried out combined WGS and RNA-seq analysis for 2334 TCGA-ICGC cancer cases in total18, 1892 of which were from TCGA and 1232 of which (including all ICGC cases and 790 TCGA cases) were part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium efforts. Of the 2334 cases, 25 involved patients under the age of 18, of which 21 were ICGC lymphomas. Tumors in TCGA spanned 33 different TCGA projects, each representing a specific tumor type. Tumor molecular profiling data were generated through informed consent as part of previously published studies and analyzed per each original study’s data use guidelines and restrictions. Previously, we carried out combined WGS and DNA methylation analysis for 1482 TCGA cancer cases18.

Molecular profiling datasets

The somatic DNA workflow for DNA variant calling is available in the KidsFirst Github repository (https://github.com/kids-first/kf-somatic-workflow). CBTN used the Manta SV v1.4.0 algorithm for somatic Structural Variant (SV) calls70 based on WGS data. The hg38 reference used for somatic SV calling was limited to canonical chromosome regions. We accessed the somatic SV VCF files through the CBTN Cavatica site (https://cbtn.org) on March 7, 2023, from both CBTN and CBTN-X01 folders. We used only somatic SV calls that passed quality filters in the analyses. The Manta algorithm classified each SV call as one of the following: tandem duplication, insertion, deletion, inversion, or translocation. Gene-level copy number calls, based on CBTN WGS data, were likewise obtained from the CBTN Cavatica site.

We obtained processed RNA-seq data for CBTN tumors from the CBTN Cavatica site on March 7, 2023, from both CBTN and CBTN-X01 folders. As we observed extensive batch effects between the CBTN and CBTN-X01 RNA-seq datasets (due to the data for the CBTN dataset being generated much earlier than the CBTN-X01 dataset26), we generated a batch effect-corrected RNA-seq dataset using the SVA package and ComBat algorithm in R Bioconductor71,72, using histology as the experimental group (with histologies of <20 tumors consolidated into an “other” group), and removing genes with non-zero values in <20 tumors from the batch-correction. The final batch-corrected dataset showed dominant global expression patterns according to histologic type rather than batch (Fig. 1a). After joining the RNA-seq features with the gene-level copy features by WGS, 19640 genes were included in the final dataset for analysis.
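The two preparatory steps described above (consolidating rare histologies and excluding sparsely expressed genes before batch correction) can be sketched in a few lines. This is an illustration only and does not reproduce the ComBat step itself; the function names and the dictionary-based data layout are assumptions.

```python
from collections import Counter

def consolidate_histologies(histology_by_tumor, min_tumors=20, other_label="other"):
    """Relabel histologies represented by fewer than min_tumors tumors as 'other'."""
    counts = Counter(histology_by_tumor.values())
    return {tumor: (hist if counts[hist] >= min_tumors else other_label)
            for tumor, hist in histology_by_tumor.items()}

def genes_for_batch_correction(expression, min_nonzero=20):
    """Keep genes with non-zero values in at least min_nonzero tumors.

    expression: dict mapping gene -> list of expression values across tumors.
    """
    return [gene for gene, values in expression.items()
            if sum(1 for v in values if v != 0) >= min_nonzero]
```

The consolidated labels would then serve as the experimental group passed to the batch-correction routine.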

For CBTN DNA methylation data generated using the Illumina Infinium MethylationEPIC BeadChip array platform (Illumina, San Diego, CA), we obtained raw IDAT image files from the CBTN Cavatica site under the “methylation” folder. We processed the IDAT files using the minfi package in R Bioconductor73, with quantile normalization using the Bioconductor preprocessQuantile function to generate the final methylation beta values. Illumina EPIC probe annotation was based on hg19 coordinates. When overlapping methylation array probe coordinates to SV calls based on hg38, we first used the UCSC Genome Browser LiftOver tool to convert the probe coordinates to hg38.

Integrative analyses between somatic SVs and gene expression

Using SVExpress48, we defined genes with altered expression associated with nearby somatic SV breakpoints. No germline SVs were involved in any analyses. Relative to each gene, genomic region windows considered included the within-gene regions and within 100 kb upstream or 100 kb downstream of the gene. SVExpress constructed a gene-to-sample matrix for the above regions, with entries as 1 if a breakpoint occurs in the specified region for the given gene in the given sample and 0 if otherwise. We also used SVExpress to examine a 1 Mb region surrounding each gene, using the relative distance metric option18, whereby breakpoints close to the gene will have more numeric weight in identifying SV-expression associations, while breakpoints further away but within 1 Mb can have some influence. The 1 Mb window was considered as genome rearrangements may involve the translocation of enhancers, which may impact genes as far as ~1 Mb away18. Gene-level SV-expression association analyses included 19640 unique named genes. Using the gene X sample SV breakpoint matrix, SVExpress assessed the correlation between the expression of the gene and the presence of an SV breakpoint using a linear regression model (with log-transformed expression values), incorporating sample histologic type and gene-level copy number. For the analyses involving within-gene, 100 kb upstream, and 100 kb downstream gene regions, we only considered genes with at least three tumors associated with an SV within the given region when estimating False Discovery Rate (FDR)50.
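The gene-by-sample breakpoint matrix described above can be sketched as follows. This is not the SVExpress implementation; it is a minimal illustration assuming each gene is given by chromosome, leftmost and rightmost coordinates, and strand, and each breakpoint by (sample, chromosome, position). All names are hypothetical.

```python
def region_bounds(gene, region, flank=100_000):
    """Return (low, high) coordinates for 'within', 'upstream', or 'downstream'."""
    start, end, strand = gene["start"], gene["end"], gene["strand"]
    if region == "within":
        return start, end
    # On the plus strand, upstream lies at lower coordinates; on the minus strand, higher.
    upstream_is_left = (strand == "+")
    if (region == "upstream") == upstream_is_left:
        return start - flank, start - 1
    return end + 1, end + flank

def breakpoint_matrix(genes, breakpoints, samples, region):
    """Entries are 1 if a breakpoint falls in the gene's region for the sample, else 0."""
    matrix = {name: {s: 0 for s in samples} for name in genes}
    for name, gene in genes.items():
        low, high = region_bounds(gene, region)
        for sample, chrom, pos in breakpoints:
            if chrom == gene["chrom"] and low <= pos <= high:
                matrix[name][sample] = 1
    return matrix
```

Each row of the matrix can then be entered as the SV term in a per-gene regression alongside histologic type and copy number.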

By SVExpress, a gene shows significant SV-expression associations if the expression and SV breakpoint patterns line up non-randomly with respect to each other across all samples analyzed, after correction for covariates. By design18, our integrative analytical approach does not assume the specific mechanism of altered expression and treats SV breakpoints representing different classes (tandem duplications, insertions, deletions, inversions, and translocations) and insert sizes the same25. Significant SV-gene associations may involve all SV classes, e.g., SV-associated increases or decreases in gene expression would not be respectively limited to duplication SVs or deletion SVs (Supplementary Fig. 1c)18. For the 1 Mb region, if multiple breakpoints occur near the gene, the breakpoint closest to the gene start is used in the breakpoint matrix. In addition to modeling expression as a function of SV events, the model incorporated histologic type (as encapsulated by one of the ~33 CBTN tumor types) as a covariate. Therefore, any significant association between genes and SV breakpoint patterns must rise above any association that would be explainable by histologic type alone. Similarly, the models also incorporated gene-level copy number as a covariate, given the observed associations of copy number alterations with nearby somatic SV breakpoints18,27,28. In the downstream analyses, we explored the genes for which somatic SV associations were significant after correcting for both tumor type and gene-level copy number.
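For the 1 Mb region, the rule that the breakpoint closest to the gene start represents the gene can be sketched as below. The second function shows one possible distance weighting (linear decay from 1 at the gene start to 0 at the window edge); this weighting is an assumption for illustration only and is not claimed to be the relative distance metric implemented in SVExpress.

```python
def closest_breakpoint_distance(gene_start, positions, window=1_000_000):
    """Distance from the gene start to the nearest breakpoint within the window, or None."""
    distances = [abs(p - gene_start) for p in positions
                 if abs(p - gene_start) <= window]
    return min(distances) if distances else None

def relative_distance_weight(distance, window=1_000_000):
    """Assumed linear decay: 1.0 at the gene start, 0.0 at the window edge or if no breakpoint."""
    if distance is None:
        return 0.0
    return 1.0 - distance / window
```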

In addition to analyzing all tumors from the various cancer types (by histology) in the entire cohort, we carried out analyses within tumor subgroups by histologic type, focusing on the 13 histologic types with at least 20 tumors: ATRT, CPP, CRANIO, DIPG, DNT, EPMT, GNG, MBL, MNG, NFIB, PHGG, PLGG, and SCHW. SVExpress generated SV-expression associations by histologic type using the distance metric method and 1 Mb region (including gene-level copy number as a covariate). Here, we only considered genes with at least two tumors associated with an SV within the given region when estimating FDR.
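Restricting FDR estimation to genes with a minimum number of SV-associated tumors can be sketched as follows. The Benjamini-Hochberg procedure is used here as a common stand-in; the FDR method actually used is the one cited in the text, which this sketch does not attempt to reproduce. Function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def fdr_for_recurrent_genes(pvalues, n_tumors_with_sv, min_tumors=2):
    """Estimate FDR only over genes with at least min_tumors SV-associated tumors."""
    kept = [g for g in pvalues if n_tumors_with_sv.get(g, 0) >= min_tumors]
    return dict(zip(kept, benjamini_hochberg([pvalues[g] for g in kept])))
```

Filtering before adjustment reduces the number of tests, so genes with too few SV-associated tumors do not dilute the FDR estimate.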

We found that neither RNA sequencing center nor tumor purity contributed to the significant SV-expression associations observed. For 841 of the 1443 tumors with combined WGS and RNA-seq data in our present study, the sequencing center had been noted in the CBTN OpenPBTA study (the latter study not including X01 samples)38. Of the 641 genes significant at p < 0.01 for 1 Mb region, correcting for cancer type and copy number as described above (limited here to the 841 tumors of the OpenPBTA study), all but two genes remained significant (p < 0.05) when incorporating sequencing center (NantOmics, BGI@CHOP Genome Center, or BGI) as a covariate. Similarly, polyA versus stranded library preparation was not a major confounder (with 839 out of the 841 genes significant with p < 0.05 when incorporating library preparation as a covariate). When analyzing 1419 tumors with tumor fraction (i.e., tumor purity) information in addition to WGS and RNA-seq data, all 641 genes significant at p < 0.01 (1 Mb region, with cancer type and copy number covariates) without tumor fraction were significant at p < 0.02 with tumor fraction added as a covariate.

Integrative analyses using TAD and enhancer genomic coordinates

To identify breakpoints associated with Topologically Associated Domain (TAD) disruption, we used SVExpress48 and published TAD data from the IMR90 cell line74, using the UCSC Genome Browser LiftOver tool to convert TAD coordinates from hg18 to hg38. We defined TAD-disrupting somatic SVs as those somatic SVs for which the two breakpoints did not fall within the same TAD. As compared to all SVs that could be categorized as either TAD-disrupting or TAD-preserving, we examined the fractions of TAD-disrupting SVs involving gene over-expression and under-expression, defined as those genes having p < 0.01 for SV-expression association (linear model with tumor type and gene-level copy for 1 Mb region) and expression >0.4 SD or <−0.4 SD from the median for the tumor harboring the breakpoint.
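The TAD-disruption definition above (two breakpoints not falling within the same TAD) can be sketched as an interval lookup. This sketch assumes non-overlapping TAD intervals per chromosome and treats an SV with either breakpoint outside all annotated TADs as uncategorized; both points, and all names, are assumptions for illustration.

```python
import bisect

def build_tad_index(tads):
    """tads: iterable of (chrom, start, end); returns chrom -> sorted interval list."""
    index = {}
    for chrom, start, end in tads:
        index.setdefault(chrom, []).append((start, end))
    for chrom in index:
        index[chrom].sort()
    return index

def tad_of(index, chrom, pos):
    """Return an identifier for the TAD containing pos, or None if outside all TADs."""
    intervals = index.get(chrom, [])
    # Last interval whose start is <= pos.
    i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
    if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
        return (chrom, i)
    return None

def classify_sv(index, bp1, bp2):
    """bp1, bp2: (chrom, pos) tuples for the two breakpoints of one SV."""
    tad1, tad2 = tad_of(index, *bp1), tad_of(index, *bp2)
    if tad1 is None or tad2 is None:
        return None  # assumption: not categorized
    return "TAD-preserving" if tad1 == tad2 else "TAD-disrupting"
```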

We examined the percentages of SV-gene associations involving an enhancer near the gene, both for all SV-gene associations and for the subsets involving altered gene expression. We cataloged all SV-gene associations for any SV breakpoints within the 1 Mb region from the gene start, with the SV having the breakpoint closest to the gene start assigned to the gene. We utilized the enhancer annotations provided by Kumar et al.61, using the UCSC Genome Browser LiftOver tool to convert enhancer coordinates from hg19 to hg38. We separately examined putative enhancer hijacking events, involving translocation SVs, and duplication of enhancer elements, involving tandem duplication SVs. Using SVExpress, we first tabulated somatic SV breakpoint-to-gene associations for the set of translocation SVs, then determined which of these involved an enhancer translocated within 0.5 Mb of the SV breakpoint in proximity to the gene (based on the orientation of the SV breakpoint mate), where the unaltered gene either had no enhancer within 1 Mb or had an enhancer further away from the gene than the translocated enhancer. In this analysis, we considered only SVs with breakpoints on the distal side of the gene. We separately considered duplication SVs with both breakpoints spanning an enhancer element. For both enhancer hijacking and enhancer duplication events, we considered the percentage involving enhancers for all SV-gene associations and for the subsets involving gene over-expression and under-expression, defined as those genes having p < 0.01 for SV-expression association (linear model with tumor type and gene-level copy for 1 Mb region) and expression >0.4 SD or <−0.4 SD from the median for the tumor harboring the breakpoint.
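The over-expression and under-expression criterion used above (p < 0.01 for the SV-expression association and expression more than 0.4 SD from the median in the tumor harboring the breakpoint) can be sketched as below. The sketch assumes the median and SD are taken across all tumors for the given gene; that reading, and the function name, are assumptions.

```python
import statistics

def expression_call(values_by_tumor, tumor, p_value, threshold=0.4, alpha=0.01):
    """Return 'over', 'under', or 'neither' for one gene in one tumor."""
    if p_value >= alpha:
        return "neither"
    values = list(values_by_tumor.values())
    center = statistics.median(values)
    sd = statistics.stdev(values)
    # Deviation of this tumor from the cohort median, in SD units.
    z = (values_by_tumor[tumor] - center) / sd
    if z > threshold:
        return "over"
    if z < -threshold:
        return "under"
    return "neither"
```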

Translocation of differentially methylated regions

For SV-gene associations involving translocation SVs, we determined the average difference in DNA methylation represented by the breakpoint mate compared to the unaltered region near the gene. For each MethylationEPIC DNA methylation probe, we computed the average methylation beta value by tumor histologic type. For each translocation SV event, we mapped the methylation probe closest to the gene that would be represented by the breakpoint mate (within 1 Mb), and, for the given tumor histologic type, we compared the average methylation beta value represented by the probe associated with the breakpoint mate as compared to the probe closest to the gene in the unaltered scenario. We examined the average methylation change (SV-associated versus unaltered) for all gene-SV associations and for the subsets involving gene over-expression and under-expression, defined as those genes having p < 0.01 for SV-expression association (linear model with tumor type and gene-level copy for 1 Mb region) and expression >0.4 SD or <−0.4 SD from the median for the tumor harboring the breakpoint.
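The comparison described above, between the probe represented by the breakpoint mate and the probe closest to the gene in the unaltered scenario, can be sketched with per-histology averages. The mapping of probes to the mate and gene regions is taken as given here; function names and data layout are hypothetical.

```python
from collections import defaultdict

def mean_beta_by_histology(betas_by_tumor, histology_by_tumor):
    """Average beta value of one probe within each histologic type."""
    totals = defaultdict(lambda: [0.0, 0])
    for tumor, beta in betas_by_tumor.items():
        entry = totals[histology_by_tumor[tumor]]
        entry[0] += beta
        entry[1] += 1
    return {hist: total / count for hist, (total, count) in totals.items()}

def mate_vs_unaltered(mate_probe_means, gene_probe_means, histology):
    """Methylation change represented by the SV: mate probe minus unaltered probe."""
    return mate_probe_means[histology] - gene_probe_means[histology]
```

A negative difference indicates that the translocated region is, on average, less methylated in that histologic type than the region it replaces near the gene.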

We also examined the average methylation change (according to tumor histologic type) involving translocation SVs versus the unaltered region for all CGI-SV associations and for the subsets involving altered DNA methylation. To define all CGI-SV associations represented by the genes in this analysis, we assigned to each gene the MethylationEPIC CGI methylation probe with the lowest p-value for SV-methylation association (by linear modeling and 1 Mb region, see below). For each CGI-SV association, we compared the CGI methylation beta value for the tumor sample with the average CGI methylation for the tumor histologic type. For CGI-SV associations for which the sample versus average CGI methylation difference was >0.1 or <−0.1, the average methylation change represented by the SV (according to tumor histologic type, not limited to CGI probes) was respectively compared with that of all CGI-SV associations. The top translocation SV events represented in Fig. 2g involve those for which the SV mate represented a region of lower methylation (average beta difference from the unaltered region of <−0.2), with the tumor sample CGI methylation also having <−0.1 beta differences from that of the average for the corresponding histologic type and with the associated gene showing higher expression of >0.4 SD from the sample median.

Integrative analyses between somatic SVs and DNA methylation

Using SVExpress, we defined genes with altered DNA methylation involving their associated CGI probe in conjunction with nearby somatic SV breakpoints. We used logit-transformed methylation beta values in the linear regression modeling for DNA methylation data (a common practice for making DNA methylation data better align with linear model assumptions75). Here, the analysis of altered DNA methylation in association with nearby SV breakpoints focused on the 133,345 array probes falling within CGIs. The genes associated with each CGI probe, according to the Illumina platform annotation, were used to construct the CGI probe X sample breakpoint matrices, like those constructed above for the SV-expression associations, using the gene as the reference for the relative breakpoint locations. For genomic region windows 100 kb upstream of the gene, 100 kb downstream, within the gene body, or 1 Mb upstream or downstream (the latter using the distance metric method), SVExpress assessed the correlation between CGI methylation and the presence of an SV breakpoint using linear regression models. For CGI probes with significant SV-methylation associations considered in downstream analyses, we required significant associations to arise after correction for sample histologic type and gene-level copy number. In addition, CGI probes located on the X or Y chromosome needed to remain significant after correction for patient sex as well as sample histologic type and gene-level copy number. For the analyses involving within-gene, 100 kb upstream, and 100 kb downstream gene regions, we only considered genes with at least three tumors associated with an SV within the given region when estimating FDR. Where indicated, we further filtered significant CGI probes for those with at least one tumor with an SV breakpoint involving a methylation beta value difference >0.2 from the sample median.
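As an illustration of the modeling step, the following Python sketch logit-transforms beta values and fits an ordinary least-squares model with a breakpoint indicator plus covariates. It is a minimal stand-in and not SVExpress itself: the clamping constant eps, the function names, and the design-matrix layout are assumptions, and the sketch returns coefficients only (p-values would come from the usual t-tests on the coefficients, for example via lm in R).

```python
import math


def logit_beta(beta, eps=1e-6):
    """Logit-transform a methylation beta value.
    eps (an assumed constant) keeps values away from exactly 0 or 1."""
    b = min(max(beta, eps), 1 - eps)
    return math.log(b / (1 - b))


def ols(X, y):
    """Least-squares coefficients via the normal equations,
    solved by Gauss-Jordan elimination with partial pivoting.
    X: list of rows (include a leading 1 for the intercept); y: list of responses."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]
```

Each row of X would hold, for one tumor, an intercept, the breakpoint term for the CGI probe's gene, indicator columns for histologic type, and the gene-level copy number; the coefficient on the breakpoint term is the quantity of interest.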

In addition to analyzing all tumors from the various cancer types (by histology) in the entire cohort, we carried out SV-CGI methylation analyses within tumor subgroups by histologic type, focusing on the 13 histologic types analyzed above for SV-expression associations. SVExpress generated SV-methylation associations by histologic type using the distance metric method and 1 Mb region (including gene-level copy number as a covariate, as well as patient sex for those probes on X or Y chromosomes). Here, we only considered genes with at least two tumors associated with an SV within the given region for inclusion in the top results. We then crossed the top significant CGI probes for each histologic type (p < 0.01) with the set of associated genes with significant SV-expression associations (p < 0.01, linear model with histologic type and gene-level copy as covariates).

For each of 30,292 enhancers, we mapped the nearest DNA methylation probe (within 20 kb) and determined its methylation association with nearby SV breakpoints (within 100 kb) by linear modeling correcting for histologic type. We utilized the enhancer annotations provided by Kumar et al.61, using the UCSC Genome Browser LiftOver tool to convert enhancer coordinates from hg19 to hg38. We used SVExpress to construct an enhancer-to-sample matrix with entries of 1 if a breakpoint occurs within 100 kb of the given enhancer in the given sample and 0 otherwise. Enhancers located on the X or Y chromosome needed to remain significant after correction for patient sex in addition to sample histologic type in the linear modeling. When estimating FDR, we only considered enhancers with at least three tumors associated with an SV within 100 kb. We further filtered top significant enhancers for those with at least one tumor with an SV breakpoint involving a methylation beta value difference >0.2 from the sample median, involving 14,073 enhancers in all with an SV in at least three tumors and the minimum difference.
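The enhancer-to-sample matrix described above has a simple structure, sketched here in Python. The function names and input layouts are hypothetical and chosen for illustration; SVExpress itself is not reproduced.

```python
def breakpoint_matrix(enhancers, breakpoints, window=100_000):
    """enhancers: {enhancer_id: (chrom, start, end)}
    breakpoints: {sample: [(chrom, position), ...]}
    Returns {enhancer_id: {sample: 1 if any breakpoint lies within
    `window` bp of the enhancer, else 0}}."""
    matrix = {}
    for eid, (chrom, start, end) in enhancers.items():
        row = {}
        for sample, bps in breakpoints.items():
            hit = any(c == chrom and start - window <= p <= end + window
                      for c, p in bps)
            row[sample] = int(hit)
        matrix[eid] = row
    return matrix


def enough_tumors(matrix, min_tumors=3):
    """Enhancers with at least `min_tumors` samples carrying a nearby breakpoint."""
    return {eid for eid, row in matrix.items() if sum(row.values()) >= min_tumors}
```

The second function mirrors the requirement that only enhancers with at least three SV-associated tumors enter the FDR estimation.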

Varying tumor purity levels did not contribute to the significant SV-methylation associations observed. When analyzing 1319 tumors with tumor fraction (i.e., tumor purity) information in addition to WGS and DNA methylation data, all 4209 CGI probes significant at p < 0.01 (1 Mb region, with cancer type and copy number covariates) without tumor fraction were significant at p < 0.02 with tumor fraction added as a covariate.

Survival analyses

We obtained CBTN patient survival data from the PedCBioportal (https://pedcbioportal.kidsfirstdrc.org/) on September 12, 2023. For patients with multiple tumors profiled, one tumor was selected to represent the patient in the survival analyses (Supplementary Data 1), where we favored an initial tumor over a progressive or recurrent tumor, followed by a random selection. We identified gene-level molecular correlates of patient overall survival for both mRNA and nearby SV breakpoints. We utilized the gene X sample relative distance breakpoint matrix generated by SVExpress (1 Mb region) to associate nearby SV breakpoints with patient outcomes. For each gene, we used a stratified Cox (correcting for histologic type and gene-level copy) to associate patient overall survival with the log2-transformed relative distance to the nearest breakpoint for that gene. We also associated mRNA expression of the gene with overall survival using stratified Cox (corrected for histologic type, using as.factor in R). Similarly, we identified CGI-level methylation correlates of patient overall survival using a stratified Cox (correcting for histologic type).
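The rule for choosing one tumor per patient can be written out as follows. This Python sketch is illustrative only: the record fields, the status labels, and the fixed random seed are assumptions rather than details taken from the study.

```python
import random


def pick_representative(tumors, seed=0):
    """tumors: list of dicts with keys 'tumor_id', 'patient', 'status'
    (status labels such as 'initial', 'progressive', 'recurrent' are assumed).
    Returns {patient: tumor_id}, favoring an initial tumor and otherwise
    choosing at random among that patient's tumors."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_patient = {}
    for t in tumors:
        by_patient.setdefault(t["patient"], []).append(t)
    chosen = {}
    for patient, group in by_patient.items():
        initial = [t for t in group if t["status"] == "initial"]
        pool = initial or group  # fall back to all tumors if no initial tumor
        chosen[patient] = rng.choice(pool)["tumor_id"]
    return chosen
```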

We defined both a gene expression signature and a DNA methylation signature involving both somatic SV-associated genes and patient survival in pediatric brain tumors. The gene expression signature was the intersection of genes with SV-expression associations and genes with overall survival associations (by Cox with histology correction), with the respective associations being in the same direction. The DNA methylation signature was the intersection of CGI probes with SV-methylation associations and CGI probes for genes with expression associated with overall survival, with the respective associations being in the opposite direction. We applied the gene expression signature to TCGA RNA-seq profiles to stratify patients according to patient outcome. The TCGA log2-transformed gene expression values were normalized gene-wise to standard deviations from the median across tumors. The gene signature score was derived using our t-score metric76,77,78, comparing the average normalized expression values for the positive signature genes versus the average for the negative signature genes. We respectively applied the DNA methylation signature to both CBTN DNA methylation profiles (where we had not directly used DNA methylation associations with survival to define the original signature) and TCGA 450K DNA methylation profiles to stratify patients according to outcomes. We normalized the DNA methylation logit-transformed CGI methylation values probe-wise to standard deviations from the median across tumors. We derived the DNA methylation signature score using our t-score metric, as we did for the gene expression data. For samples scored by signature, patients were stratified into tertiles by signature score, with survival associations determined by both stratified log-rank test and by Cox stratified by cancer type.
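The signature scoring can be illustrated with a short Python sketch. Two assumptions should be noted: the t-score is computed here as a Welch t-statistic comparing positive and negative signature features within each sample (the cited t-score metric may differ in detail), and the tertile split assigns equal-sized groups by rank. All function names are hypothetical.

```python
import math
from statistics import mean, median, stdev, variance


def normalize_rows(values):
    """values: {feature: [value per sample]}.
    Returns each feature expressed as SDs from its median across samples."""
    out = {}
    for feature, xs in values.items():
        m, s = median(xs), stdev(xs)
        out[feature] = [(x - m) / s for x in xs]
    return out


def t_score(pos, neg):
    """Welch t-statistic (an assumed variant) for pos versus neg values."""
    se = math.sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return (mean(pos) - mean(neg)) / se


def signature_scores(norm, pos_features, neg_features):
    """One score per sample from normalized values of signature features."""
    n_samples = len(next(iter(norm.values())))
    return [t_score([norm[f][i] for f in pos_features],
                    [norm[f][i] for f in neg_features])
            for i in range(n_samples)]


def tertiles(scores):
    """Label each sample 0 (bottom), 1 (middle), or 2 (top) by score rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = min(2, rank * 3 // len(scores))
    return labels
```

The resulting tertile labels would then be the grouping variable in the log-rank and Cox analyses.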

Taking the CBTN RNA-seq and CGI methylation data for progressive or recurrent tumors from patients for whom CBTN profiled the corresponding initial tumor, we carried out a paired t-test between recurrent/progressive and initial tumor groups (using log2-transformed or logit-transformed values). The paired analysis controlled for differences between histologic types, as we evaluated relative differences between each recurrent or progressive tumor and its paired initial tumor reference across the dataset. We applied the gene expression and DNA methylation signatures of progressive or recurrent tumors to the molecular profiles of CBTN patients who were not used to define the signatures, in the same manner described above for the other survival-associated signatures, to stratify patients according to outcome. Similarly, we applied the DNA methylation signature to the methylation profiles from TCGA adult pan-cancer patients.
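For reference, the paired t-statistic for a single gene or CGI probe can be computed as in the Python sketch below. The function name is hypothetical; the inputs are assumed to be already log2- or logit-transformed and ordered so that position i in each list refers to the same patient.

```python
import math
from statistics import mean, stdev


def paired_t(recurrent, initial):
    """Paired t-statistic and degrees of freedom for matched values.
    Each difference is (recurrent or progressive tumor) minus (paired initial tumor)."""
    diffs = [r - i for r, i in zip(recurrent, initial)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

Because each difference is taken within a patient, baseline differences between histologic types cancel out of the statistic.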

Patient age (in years) was not significantly associated with overall survival by univariate Cox analyses (p > 0.05). In multivariate Cox analyses incorporating age in addition to the molecular features and signature scoring highlighted in our study (e.g., MYC/MYCN enhancer methylation, PDLIM4 expression or methylation, RCOR2 expression, Fig. 5b CGI methylation signature, progressive/recurrent expression and methylation signatures, etc.), the significance of the survival association for each molecular variable was unchanged. The key molecular features and signatures highlighted in our study remained significant by univariate Cox (p < 0.05) when using event-free survival (deceased due to disease, second malignancy, recurrence, progression, metastasis) instead of overall survival as the patient outcome. Our study’s goal was not to derive molecular signatures with the best prognostic performance or that would add information on top of existing clinical variables (e.g., to develop an assay for application in the clinical setting). Rather, the survival associations aimed to help further prioritize the widespread molecular changes associated with somatic SVs, with significant survival associations possibly alluding to biological roles for the genes involved in advanced disease.

Statistical analysis

All p-values were two-sided unless otherwise specified. Nominal p-values do not involve multiple comparison adjustments, while FDRs involve p-values adjusted for multiple gene feature comparisons. As described above, we utilized linear regression models to associate gene expression and DNA methylation patterns with nearby SV breakpoints. In the linear models of this study, we used appropriate data transformations to make the data align better with the model assumptions. We performed linear modeling using the lm function in R (version 4.0.3). One-sided Fisher’s exact tests or chi-squared tests determined the significance of overlap between two given feature lists, using the total set of profiled features as the baseline (e.g., the 133,345 CGI probes represented in the DNA methylation dataset for CGI feature lists or the 19,640 genes represented in the expression dataset for expression-based feature lists). The method of Storey and Tibshirani50 estimated FDR for significant genes, using the following formula for each gene: [(nominal p-value) X (total number of genes tested) / (total number of genes that were significant at the given p-value)]. We relied on a stricter FDR cutoff for defining top genes when carrying out global molecular associations for a single analysis (e.g., gene-level SV-mRNA associations). When overlapping different top feature results sets (e.g., gene-level SV-expression associations with SV-CGI methylation associations), we used a more relaxed p-value cutoff to limit false negatives, helping us identify significant overlap patterns. We performed heat map visualization using JavaTreeview79 and matrix2png (version 1.2.1)80. We evaluated enrichment of GO annotation terms within sets of differentially expressed genes using SigTerms software81 and one-sided Fisher’s exact tests. Figures indicate the exact value of n (number of tumors), and the statistical tests used are noted in the Figure legends and next to reported p-values in the Results section.
Boxplots represent 5% (lower whisker), 25% (lower bound of box), 50% (center), 75% (upper bound of box), and 95% (upper whisker). Figures represent biological and not technical replicates.
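Two of the calculations described above, the FDR formula and the one-sided test of overlap between feature lists, are simple enough to sketch in Python. The sketch follows the bracketed formula as written (no separate estimate of the proportion of true nulls); capping the estimate at 1 is an added assumption, and the function names are hypothetical.

```python
from bisect import bisect_right
from math import comb


def fdr_estimates(pvalues):
    """FDR per feature: p * (number tested) / (number with p-value <= p).
    Values are capped at 1 (an assumption made for this sketch)."""
    n = len(pvalues)
    sorted_p = sorted(pvalues)
    return [min(1.0, p * n / bisect_right(sorted_p, p)) for p in pvalues]


def overlap_pvalue(total, size_a, size_b, overlap):
    """One-sided Fisher's exact (hypergeometric) p-value for observing at
    least `overlap` shared features between lists of size_a and size_b
    drawn from `total` profiled features."""
    upper = min(size_a, size_b)
    tail = sum(comb(size_a, k) * comb(total - size_a, size_b - k)
               for k in range(overlap, upper + 1))
    return tail / comb(total, size_b)
```

For CGI feature lists, total would be the 133,345 CGI probes; for expression-based lists, the 19,640 genes.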

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.