The acquisition of molecular drivers in pediatric therapy-related myeloid neoplasms

Pediatric therapy-related myeloid neoplasms (tMN) occur in children after exposure to cytotoxic therapy and have a dismal prognosis. The somatic and germline genomic alterations that drive these myeloid neoplasms in children and how they arise have yet to be comprehensively described. We use whole exome, whole genome, and/or RNA sequencing to characterize the genomic profile of 84 pediatric tMN cases (tMDS: n = 28, tAML: n = 56). Our data show that Ras/MAPK pathway mutations, alterations in RUNX1 or TP53, and KMT2A rearrangements are frequent somatic drivers, and we identify cases with aberrant MECOM expression secondary to enhancer hijacking. Unlike adults with tMN, we find no evidence of pre-existing minor tMN clones (including those with TP53 mutations), but rather the majority of cases are unrelated clones arising as a consequence of cytotoxic therapy. These studies also uncover rare cases of lineage switch disease rather than true secondary neoplasms.

Here, using a comprehensive sequencing approach, we show that Ras/MAPK pathway mutations, alterations in RUNX1 or TP53, and KMT2A rearrangements are frequent somatic drivers in pediatric tMN, and we find that in some cases aberrant MECOM expression is secondary to enhancer hijacking. Additionally, using samples from serial timepoints, we find no evidence of pre-existing minor tMN clones (including those with TP53 mutations) like in adults with tMN [5][6][7] , but rather the majority of cases are unrelated clones arising as a consequence of cytotoxic therapy.
Putative germline variants in pediatric tMN. Fourteen pathogenic or likely pathogenic presumed germline sequence alterations were identified in 13 of 84 patients (15%, 95% exact binomial CI: 8.5-25.0%) ( Table 2 & Supplementary , indicating that germline alterations may be more common in tMN than the published prevalence of 8.5-10% in other groups of children with cancer [19][20][21][22] . This includes 4 patients with germline TP53 mutations. There was also evidence of TP53 mosaicism in the non-tumor tissue in 5 additional patients (Fig. 1e & Supplementary Data 15). Collectively, 15 patients (18%) had somatic (mutation and/or copy number alteration) or germline alterations in TP53 ( Supplementary Fig. 3). There was a significant enrichment of complex cytogenetics in patients with TP53 alterations (11 of 13) versus wild-type TP53 patients when considering those with comprehensive sequencing (n = 62, 85% vs. 12%; Fisher's p < 0.0001) ( Supplementary Fig. 3e). Three other patients had low VAF somatic truncating mutations in exon 6 of PPM1D (Supplementary Fig. 4) 23,24 . Despite the fact that deletions or CN-LOH involving chromosome 7 (del (7)) were the most common copy number alteration (22 of 62, 35%) ( Mutational signatures of pediatric tMN. C > T transitions were the predominant mutation type (Fig. 2b, c). Mutational signature analysis on the 16 WGS cases and 3 WES cases with a sufficient quantity of SNVs (>30) identified drug signatures in 9 cases, including 4 with the cisplatin signature (COSMIC 31 & 35), and 5 with the thiopurine signature 28 , consistent with the prior treatment history (Supplementary Data 17). Eight cases did not have a detectable drug signature but rather clock-like signatures 1, 5, and 40 ( Fig. 2d) 29,30 , while 2 additional patients had a signature similar to one of unknown etiology recently reported in relapsed mismatch repair (MMR)-deficient ALL 31 which we term the "relapse MMR" signature. Both had germline (SJ016519) or somatic (SJ016494) pathogenic PMS2 mutations. The relapse MMR signature bore similarities to the thiopurine signature ( Supplementary Fig. 6), had similar strand bias to the thiopurine signature 28 (Supplementary Fig. 7), and occurred in patients with previous thiopurine exposure, thus suggesting it was a variant of the thiopurine signature that occurs under MMR-deficient conditions. We determined the probability that driver SNVs were caused by each signature as reported previously 28 (Fig. 2d, bottom), and found that 2 TP53 mutations were most likely (>50% probability) induced by cisplatin or thiopurines along with several Ras pathway and other variants. Example calculations showing the probability that specific driver mutations were caused by individual signatures are shown in Supplementary Fig. 8. These calculations are based on the signatures present in each sample and their mutation preference at specific trinucleotide contexts; thus, two KRAS G12D mutations in two different patients (SJ030799 and SJ016494) were likely caused by different

Coding Variants
SJ016482 SJ016463   (Fig. 3c). Two cases involved noncoding regions of chromosome 2 adjacent to ZFP36L2, a gene encoding an RNA binding protein that is highly expressed in hematopoietic cells and is involved in hematopoiesis, and the other involved noncoding regions of chromosome 17 adjacent to MSI2, another gene encoding an RNA binding protein that has been found to be recurrently rearranged in hematological malignancies ( Fig. 3d) [43][44][45][46][47] . The existing ENCODE data and similar studies in human CD34 cells support that these regions of the genome are super-enhancers in hematopoietic cells, suggesting a proximity effect in which these enhancers have been hijacked to drive high levels of MECOM expression (Supplementary Fig. 13) 48,49 . Furthermore, despite the lack of in-frame fusions in the RNA-seq data, these cases demonstrate allelespecific MECOM expression 50 , further suggesting a cis-regulatory element may be driving this aberrant expression (Fig. 3d). WGS also identified a MECOM SV in SJ030441 (SATB1@-MECOM), but elevated MECOM RNA levels were not present in this case (Fig. 3b); however, immunohistochemical studies on the patient material demonstrated high MECOM protein expression in the blasts (Fig. 3e). Similar MECOM protein expression was detected in the other MECOM altered cases 51 , but not in tMN cases without a MECOM SV (Fig. 3e). Contrary to pediatric de novo AML studies, there was not a statistically significant association between higher MECOM expression and disease-related deaths within this pediatric tMN cohort ( Supplementary Fig. 14) 36 .

(N) T (N) T (N) T (N) T (N) T (N) T (N) T (N) T (N) T
Rather, a multivariable analysis shows that the presence of complex cytogenetics does significantly impact disease-related mortality risk (Fine-Gray model HR = 2.17; p = 0.04).
Clonal evolution of pediatric tMN. Finally, using a combination of targeted capture resequencing and a bioinformatic error suppression approach 52 we described the timing of acquisition and evolution of the somatic mutations for 37 cases using samples from interval time points prior to the development of tMN, including 26 cases in which material for the primary malignancy was available for analysis (Supplementary Data 21). We demonstrated that the somatic variants most commonly arose after the introduction of cytotoxic therapy (n = 23 of 26, 88%), and we  15 & 16). Three cases were found to be clonally related to the original malignancy. These included a tMDS that developed 8 months after AML and both were found to harbor a NUP98-NSD1 fusion ( Fig. 4b) with multiple discrete WT1 mut subclones, and 2 cases where the initial lymphoid malignancy (ALL or NHL) and tMN developed from a common clone that subsequently underwent a lineage switch (Fig. 4c- Supplementary Fig. 17).

Discussion
Here we show the results of our comprehensive sequencing of pediatric tMN which reveals that KMT2Ar are the most common driver alterations in our pediatric tMN cohort along with Ras/MAPK pathway mutations. Somatic TP53 alterations were also frequent, but these mutations appeared to arise after chemotherapy, unlike adult tMN 5 . Additionally, we identified MECOM overexpression to be frequent, and in some of these cases the overexpression was driven by enhancer hijacking. Finally, we show that pediatric tMN-defining variants arise most commonly as a consequence of cytotoxic therapy, and that these malignant clones can be identified, on average, >1 year before morphologic evidence of neoplasm. While these studies reflect the experience of a single institution, the findings highlight the diverse nature of genomic alterations in pediatric tMN and suggest that genomic screening approaches may be able to identify at risk patients prior to tMN development.    53 .
Supplementary Data 1 contains clinicopathological information for all samples included in our analyses. Samples were de-identified before nucleic acid extraction and analysis. The study cohort is comprised of 84 total patients (tMDS = 28, tAML = 56). Sixty-two patients had available tumor and normal tissue for characterization, while the remaining 22 lacked sufficient tumor material for comprehensive sequencing (Table 1). For the 62 tumor/normal pairs, flow sorted lymphocytes from the diagnostic tMN samples were used as the source of normal comparator genomic DNA in 53 cases, while bone marrow (n = 4) or peripheral blood (n = 5) from alternate timepoints was used for the remainder. Cryopreserved bulk bone marrow cells were thawed in a 37°C water bath and transferred to 20% FBS in PBS to remove residual DMSO according to standard approaches 54  After library quality and quantity assessment, WGS, WES, or RNASeq samples were sequenced on various Illumina platforms (HiSeq 2500, HiSeq 4000, or NovaSeq 6000). Mapping, coverage, quality assessment, single-nucleotide variant (SNV) and indel detection, and tier annotation for sequence mutations (SNVs discovered by WGS were classified as tier 1, tier 2, tier 3, or tier 4) have been described previously [55][56][57] and briefly described here. DNA reads were mapped using BWA 58,59 (WGS: v0.7.15-r1140; WES: v0.5.9-r26-dev and v0.7.12-r1039 since data were generated over a period of time) to the GRCh37/hg19 human genome assembly. Aligned files were merged, sorted and de-duplicated using Picard tools 1.65 (broadinstitute.github.io/picard/). SNVs and Indels in WGS and WES were detected using Bambino 60 . For WGS data, sequence variants were classified into the following four tiers: (i) tier 1: coding synonymous, nonsynonymous, splice-site and noncoding RNA variants; (ii) tier 2: conserved variants (conservation score cutoff of greater than or equal to 500, based on either the phastConsElements28way table or the phastConsElements17way table from the UCSC Genome Browser) and variants in regulatory regions annotated by UCSC (regulatory annotations included are targetScanS, ORegAnno, tfbsConsSites, vis-taEnhancers, eponine, firstEF, L1 TAF1 Valid, Poly(A), switchDbTss, encodeU-ViennaRnaz, laminB1 and cpgIslandExt); (iii) tier 3: variants in non-repeat masked regions; and (iv) tier 4: the remaining SNVs. Structural variations in whole-genome sequencing data were analyzed using CREST 61 (v1.0). RNA-sequencing was performed using TruSeq Stranded Total RNA library kit (Illumina) and analyzed, as previously described 16,17 . Briefly, RNA reads were mapped using our StrongARM pipeline (internal pipeline, described by Wu et al. 62 ). Paired-end reads from RNAseq were aligned to the following four database files using BWA: (i) the human GRCh37-lite reference sequence, (ii) RefSeq, (iii) a sequence file representing all possible combinations of non-sequential pairs in RefSeq exons and, (iv) the Ace-View database flat file downloaded from UCSC representing transcripts constructed from human ESTs. Additionally, they were mapped to the human GRCh37-lite reference sequence using STAR. The mapping results from databases (ii)-(iv) were aligned to human reference genome coordinates. The final BAM file was constructed by selecting the best of the five alignments. Chimeric fusion detection was carried out using CICERO 63 30.0) was used to retrieve read counts from BAM files for the SNV/Indels called in WES, requiring MAPQ > = 1 and base quality Phred score > = 20. We also performed de novo mutation calling in an attempt to catch canonical low variant allele frequency (VAF) cancer gene mutations missed by WES using VarScan 2 66 (v2.3.5) on the TWIST data with the following criteria: MAPQ > = 1; base quality Phred score > = 20; VAF > = 0.01 and variant call p-value < = 0.05. Selected somatic variants (WES read count <5 and targeted capture read count <10) and all somatic TP53 variants identified via WES were validated by custom amplicon sequencing. PCR primers (Supplementary Data 22) were designed to flank the putative variants. Amplicon sizes were approximately 200 base pairs. PCR was performed using KAPA HiFi HotStart ReadyMix (Roche), 100 nM of each primer (IDT) and 20 ng of gDNA in a 40uL reaction volume. Thermocycling was performed using the following parameters: 95°C for 3 min; 98°C for 20 s, 62°C for 15 s, and 72°C for 15 s for a total of 30 cycles; and 72°C for 1 min. All amplicons were quality checked on a 2% agarose gel. Primers were designed to incorporate Illumina overhang adapter sequences which allowed for indexing using the Nextera XT Index kit (Illumina) following the manufacturer's instructions. Libraries were normalized, pooled, and sequenced on an Illumina MiSeq instrument using a 2 × 150 paired-end version 2 sequencing kit. We used the CleanDeepSeq 52 approach with default settings for error suppression in this ultra-deep amplicon sequencing.
Copy number analysis using NGS data. Copy number analysis of the WGS (n = 4) cases was done using CONSERTING 67 . Copy number analysis of the WES (n = 58) cases was done following these steps: Samtools 68 (v1.2) mpileup command was used to generate an mpileup file from matched normal and tumor BAM files with duplicates removed; VarScan2 66 (v2.3.5) was then used to take the mpileup file to call somatic CNAs after adjusting for normal/tumor sample read coverage depth and GC content; Circular Binary Segmentation algorithm 69 implemented in the DNAcopy R package 70 was used to identify the candidate CNAs for each sample; B-allele frequency info for all high quality dbSNPs heterozygous in the germline sample was also used to assess allele imbalance.
Germline analysis. Whole exome sequencing data were analyzed using internal workflows that were previously described 19 . Briefly, the sequencing data were analyzed for the presence of single-nucleotide variants and small insertions and deletions (Indels) and for evidence of germline mosaicism. Germline copy-number variations and structural variations were identified with the use of the Copy Number Segmentation by Regression Tree in Next Generation Sequencing (CONSERTING) 67 and Clipping Reveals Structure (CREST) 61 algorithms. For all SNPs and Indels, functional prediction (e.g., SIFT, CADD, and Polyphen) scores and population minor allele frequency (MAF) were annotated. In this work, 3 databases were used for population MAF annotation: (i) NHLBI GO Exome Sequencing Project (http://evs.gs.washington.edu/EVS/); (ii) 1000 genomes (http:// www.internationalgenome.org); and (iii) ExAC non-TCGA version (http://exac. broadinstitute.org/). For missense mutations, REVEL (rare exome variant ensemble learner) score was also determined to help predict pathogenicity 71 . A gene list of 631 genes were composed from various resources: (i) literature review of genes that are potentially involved in AML, MDS, inherited bone marrow failure syndromes, as well as other cancer types 5,19,72-74 (ii) genes that were involved in splicing from predefined pathways (e.g., splicing) in KEGG, GeneOntology, Reactome, Gene Set Enrichment Analysis (GSEA), and NCBI (Supplementary Data 14). The following filtering criteria were applied: VAF ≥ 0.2, coverage >20x, ExAC MAF < 0.001 (or not present in ExAC), REVEL score >0.5 (for missense mutations), NHLBI and 1000 genomes MAF < 0.001. One TP53 variant that was lost through this filtering was manually recovered because the patient was clinically diagnosed with Li Fraumeni syndrome. Given this finding, all germline TP53 mutations were manually reviewed and analyzed as described below for mosaicism. Of note, the germline ETV6 p.N386fs in case SJ021960 was previously reported 75  Determination of mosaicism versus tumor-in-normal contamination. Because the normal samples used were hematopoietic specimens (sorted lymphocytes or remission bulk marrow), the mosaic mutations can be a result of incomplete remission. To rule out this possibility, we performed a previously developed statistical analysis that can model residual disease burden 19 . Briefly, we first determined purity (denoted as f) of the tMN tumor sample by clustering allele fractions of somatic SNVs/Indels by using R package "Mclust," where the cluster with the highest mean (denoted as u) center under 0.5 was used to estimate tumor purity (multiplied by 2 to account for diploid status, f = 2*u). To account for clonal evolution, we also calculated tumor purity by using heterozygous loss and copy neutral loss of heterozygosity (CN-LOH) regions with the highest magnitude of scores. For heterozygous loss regions, the purity is estimated as f = 2-2 (log.ratio+1) , while for CN-LOH region the purity is estimated as f = 2*AI where AI = | B-allele fraction -0.5 | . The maximum of the SNV/Indel and CNV/LOH-based purity estimate was used as the final purity estimate (f) for a given tumor. We then defined an SNV/Indel as diploid clonal if its allele fraction is > f*0.5*80% = u*80% and <0.6. The sum of mutant allele counts of these markers was denoted as M, and the sum of depth of these markers as T, thus the tumor-in-normal contamination level of the germline sample is then estimated as c = M/T. The expected allele fraction of TP53 mutation is estimated by considering its local ploidy and contamination level c. In our dataset, the TP53 mutations are either 1-copy loss-LOH or CN-LOH ( Supplementary Data 1, 4, and 16). For 1-copy-LOH, the expected allele fraction of TP53 under contamination is e = c*(2-c) −1 , while for CN-LOH the expected allele fraction of TP53 is simply e = c. We then tested the hypothesis that the observed TP53 allele counts in germline sample are due to contamination by using a binomial test. A significant p value (<0.01), after Bonferroni correction, would indicate that the observed allele counts are unlikely to be explained by contamination. To rule out the possibility of germline inheritance, we also tested the allele counts against inheritance (i.e., e = 0.5). A TP53 mutation with significant p values (<0.01) for both the contamination test and the inheritance test is called a mosaic mutation. For normal only samples, variants with a VAF of ≥0.2 were classified as germline, but variants with a VAF of <0.2 and with a supportive clinical history were classified as mosaic. We are unable to distinguish germline versus somatic mosaicism.
Mutational signature analysis. The trinucleotide context of each somatic SNV was identified using an in-house script, and mutations were assigned to one of each of the 96 trinucleotide mutation types 77 28 , and the relapse MMR signature. We used a required cosine increase of 0.02 or more for a signature to be detected in a single sample, and default parameters otherwise. For exome samples, we likewise tested for these signatures using SigProfilerSingleSample, but excluded from our analysis exome samples that had cosine reconstruction scores of less than 0.9 (comparing the sample's SNV catalog profile with the profile as reconstructed by signatures) or less than 30 SNVs total, or which already had WGS data, resulting in only 3 exome samples with usable signature data. We calculated the probability that individual SNVs were caused by a signature as done by others 79 and as we reported previously 28 . The probability that a variant was caused by a specific signature was calculated as follows. Let s k represent the signature strength vector for a given sample (measured in number of SNVs caused by the signature), where k = 1, 2, …, 8 is one of 8 signatures we identified, such that s 1 equals the number of specific SNVs caused by signature 1 in the sample, and ∑s k equals the total number of SNVs in the sample. Let c = 1, 2, …, 96 represent each of the 96 possible trinucleotide mutation types. Each of the k signatures mutates each of these 96 trinucleotide mutation types c with a probability P c,k (ranging from 0 to 1.0) where the sum of the probabilities for a given signature across all 96 trinucleotide mutation types is 1.0. The probability that a mutation of interest m (at trinucleotide mutation type c) was caused by a specific signature i is calculated as shown in Eq. 1: GRIN analysis. The genomic random interval (GRIN) method 18 was used to evaluate the statistical significance for the prevalence of SNVs, heterozygous deletions, fusion breakpoints, copy-neutral loss-of-heterozygosity, and amplification in each gene. For each gene, a p-value for each of these genomic alterations was computed. Also, for each gene, an overall p-value was computed by finding the minimum p-value across the five lesion types and comparing it to the beta distribution corresponding to the distribution of the minimum of five id uniform (0,1) realizations. For each set of p-values (one for each lesion type and the overall pvalue), a robust method 80 was used to compute false discovery rate estimates, which are reported with the symbol q. A total of 91 genes were identified as statistically significant with an overall q < 0.05. Additionally, MutSigCV 81 analysis was used to determine driver status of SNVs and indels.
Super enhancer analysis in CD34 + cells. H3K27ac ChIP-seq data were downloaded from GEO accession GSE104579 82 . Raw reads were adapter-trimmed and subject to quality filtering using Trim Galore (v0.4.4), retaining reads with a quality score >20. Reads were mapped to the human genome (GRCh37) using BWA (v0.7.12) 58 , converted to bam format, and duplicate reads were marked using bio-bambam2 (v2.0.87) 83 and removed using samtools (v1.10) 68 . H3K27ac peaks were called using macs2 (v2.1.1) 84 in BEDPE mode with a p-value cutoff of 1 × 10 −5 . ROSE was run using the de-duplicated H3K27ac and input bam files and the macs2 peak file with default parameters. For additional visualization of the chromatin landscape in human CD34 + cells, three additional datasets were included in IGV snapshots. The CTCF bigwig file was downloaded from GEO accession GSE104579.
The "CD34 + H3K27ac (Roadmap)" wiggle file was downloaded from GEO accession GSM772885 85 and converted to bigwig. CD34 + ATAC-seq data were downloaded from GEO accession GSE74912 86 and all biological replicates for CD34 + samples were merged into a single bedGraph file and converted to bigwig format for visualization. All RNA-seq tracks are normalized read coverage.
Statistical methods. The Wilcoxon-Mann-Whitney non-parametric test, twotailed, was used to compare means of quantitative variables across two experimental groups or diagnostic groups. The Fisher's exact test was used to compare the frequency of complex karyotype between patients with and without TP53 mutations. Survival analysis of cause-specific death was performed with a Fine-Gray model 87 that accounts for different causes of death as competing events and adjusts for hematopoietic stem cell transplant as a time-dependent outcome predictor variable.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The genomic data generated in this study have been deposited in the European