Genomic and evolutionary portraits of disease relapse in acute myeloid leukemia

Relapse in acute myeloid leukemia (AML) patients remains a clinical challenge. The majority of AML patients who receive induction treatment with combination chemotherapy achieve clinicopathologic remission. However, a significant proportion of these patients will relapse and succumb to chemoresistant disease [1]. The biological mechanisms that contribute to relapsed AML are yet to be fully deciphered. Previous studies investigating genetic contributions to AML disease relapse included small numbers of patient samples and/or focused on a small number of AML subtypes. These studies have suggested that disease relapse is associated with founder clone recurrence, subclonal expansion and/or the occurrence of relapse-specific events (reviewed in [2]). To better understand the somatic genomic changes that drive AML relapse, we analyzed specimens (n= 120) from a clinically annotated adult relapsed AML patient cohort [3] (Supplementary Table S1, Supplementary Fig. S1) for somatic events. The median age of the patient cohort was 50 years. All patients received standard of care combination chemotherapy, achieved complete remission and experienced disease relapse. We first reanalyzed whole exome sequencing (diagnosis, relapse and matched germlines) of 49 patients [3] in order to capture the complete intragenic mutational burden (Fig. 1A, Supplementary Tables S2 and S3). 21 patients had at least one mutation lost at relapse. Twenty-three patients gained at least one mutation at relapse. A subset of recurrent somatic These authors contributed equally: Ross L. Levine, Ari Melnick


Exome capture
Data for exome capture analysis was obtained from dbGap accession number phs001027.v2.p1 (2). Data were aligned using the Burrows-Wheeler Aligner (3) to the human genome (UCSC build hg19) and then pre-processed using GATK per recommended best practices. We performed somatic SNV calling by comparing diagnosis and relapse samples to their matched germline specimens with Mutect (4) , Varscan (5), and somaticSniper (6). Indels were determined using Pindel (7) and GATK SomaticIndelDetector (8). Mutations were annotated with Snpeff (9). Mutations considered for analysis were identified by two callers as previously described (10) or called by a single caller and have already been reported previously (either in (11) or in COSMIC ). Analysis using these tools was performed using an internally-developed pipeline (MSKCC Bioinformatics Core: https://www.mskcc.org/research/ski/core-facilities/bioinformatics). Results are summarized in Supplementary Table 2 and included in Supplementary Table 3.

Targeted resequencing
Targeted resequencing was performed using established protocols. Data from targeted resequencing (111 genes) of 51 diagnostic specimens was obtained ((11); EGA accession number: EGAS00001000275). 12 additional diagnostic and the matching 63 relapsed samples were profiled using an IDT custom capture panel (catalog number 1016302) per manufacturer's recommendations. Briefly, 200ng of DNA was sheared using a Covaris LE220 sonicator (adaptive focused acoustics). DNA fragments were end-repaired, adenylated, ligated to Illumina sequencing adapters, and amplified by PCR (using 10 cycles). Targeted capture was subsequently performed using 1µg of the DNA library and a custom AML capture probe set. Captured libraries were then enriched by PCR (using 10 cycles). Final libraries were evaluated using fluorescent-based assays including PicoGreen (Life Technologies) or Qubit Fluorometer (invitrogen) and Fragment Analyzer (Advanced Analytics) or BioAnalyzer (Agilent 2100), and were sequenced on an Illumina HiSeq2500 sequencer per manufacturer's recommendations using 2 x 125bp cycles. The reads were aligned to the human genome (UCSC build hg19) using the Burrows-Wheeler Aligner (3) with maximal exact matches. We used the Cancer Genome Project pipeline (https://github.com/cancerit) and compared the tumor samples to a standard cancer-free germline following the pipeline recommendations. Snpeff (9) was used to annotate variants with their functional consequences. We filtered out variants that were present in any of the population groups from the ExAC database (12) with a minor allele frequency of more than 1%. We only considered variants that were either present in at least two samples or classified as oncogenic or likely oncogenic according to the criteria outlined in (11). Results are summarized in Supplementary

Validation of somatic mutations
Validation of somatic mutations was performed using a custom targeted amplicon panel and sequenom assay. The custom targeted amplicon panel was performed and data was analyzed as previously described (13) Reads were trimmed of contaminating adapter sequences and low-31 quality bases using Trimmomatic v0.32 (trimmed when median Illumina base quality score < 20 over 6 32 nt sliding window). Overlapping paired end reads were merged into a single long consensus read using AdapterRemoval v241 when at least 12 bp overlap was present. 5. The remaining high-quality reads were aligned to the 1000 Genomes Phase 2 human reference genome and decoy contigs (hs37d5) using BWA MEM. Duplicate marking was performed using SamBlaster v0.1.21 and MarkDupsByStartEnd v0.2.1. 6. Single nucleotide variants (SNVs) and insertions/deletions (indels) were detected using VarDict-Java v1.4.6 in single sample mode with indel realignment. 7. Annotation of variants and their functional impact was performed using Variant Effect Predictor (VEP) v8547 and snpEff v4.1g. 8. To identify somatic variants, filtration based on population allele frequency data was applied so as to enrich somatic variants that are not likely inherited. To this end, variants were classified as probable somatic if exhibiting a dbSNP v142 or ExAC adjusted population allele frequency <= 0.25% or a median variant allele fraction (VAF) of 2.5%. Variants that occured at a VAF of 5% were compared to the exome capture results. The Sequenom assay was performed as previously described (14) to validate IDH1 and IDH2 mutations which occurred above a VAF of 0.05% in a subset of specimens. Results are included in Supplementary Tables 4 and 5.

Copy number aberration data
Copy number aberration data (CNA) was generated and sequenced (75 bp single-end reads) as previously described (15,16) (Supplementary Table 11). The resulting data was first analyzed by FastQC (17) to determine if it was of sufficient quality for further analysis. Reads were trimmed of adapters using Cutadapt (18) and mapped to the human reference genome hg19 with Burrows Wheeler Aligner (3) using the 'mem' command (BWA-MEM). Aligned reads were converted from SAM to BAM format, sorted, and indexed with Samtools (19). Duplicate reads were removed with MarkDuplicates from Picard Tools (20). 25853 genomic bins, each containing 100kb of nonoverlapping mappable sequence, were defined based on the hg19 mappability track (wgEncodeCrg-MapabilityAlign75mer.bigWig, UCSC Genome Browser). Next, reads were excluded from unmappable regions and counted in bins with Bedtools (21). The bin counts were then normalized to copy number estimates as follows: bins were sorted into 100 evenly sized groups according to GC content; the median read count was calculated for each GC group; each bin count was divided by the GC group median and then multiplied by 2. Next, to detect outliers, preliminary segments were defined with Circular Binary Segmentation (CBS) (22). The R package DNAcopy (23), which implements CBS, was used with alpha=0.001, min.width = 5, and undo.SD=1. Outliers were removed using Tukey's Outlier Method on the median of differences between bin value and preliminary segment value. Final segments were defined through a second segmentation by DNAcopy following outlier bin removal. To define thresholds for gains and losses, the distribution of segment copy numbers was inspected. For the paired AML cohort, a distribution of segments with length below the sum of median length plus two median absolute deviations (median + 2*MAD) was used. For the AML patients older than 60, a distribution of segments with bin length below the median length was used. Mixed Gaussian models were fitted to these distributions using R package mixtools (24). Gaussians centered near 2 were chosen to represent the diploid copy number state and were used to determine CNV thresholds. A two-tailed cumulative probability of 0.05 resulted in cutoffs equal to 1.73 (loss) and 2.36 (gain). Tools utilized for the analysis: R version 3.5.0: DNAcopy, mixtools, ggplot2, stringr, dplyr, fastqc version 0.11.5, cutadapt version 1.16, bwa version 0.7.17, samtools version 1.7, bedtools version 2.26.0, picard tools version 2.18.5, and java version 1.8.0. Results were subjected to manual review for confirmation. We assessed the copy number calls at the 5k, 20k and 50k bin resolutions and only considered an event true if it was found in at least one resolution. CNVs identified in both analysis pipelines are reported in Supplementary Table 6.

Analysis & genome versioning
R version 3.4.4 was used for all analyses unless otherwise specified. Reference genome used for annotation in all analyses was hg19.

Clonal evolution analysis
For each sample of the Targeted resequencing cohort, if no oncogenic or likely oncogenic mutation had a VAF > 0.4, we corrected for tumor content by considering that the highest likely oncogenic or oncogenic mutation was clonal (VAF = 0.5) and proportionally adjusting depth of sequencing and VAF for every mutation of that sample. We considered that a mutation was associated with a clone that contracted or expanded if the VAF of the mutation at relapse was lower or higher, respectively, than the VAF of the mutation at diagnosis by at least 0.05. The difference between the two VAFs was also required to associate with a p<0.05 according to a Fisher's exact test. We represented the changes in VAF with line plots where we represent the 95% confidence interval for each VAF using the function "binconf" from the "Hmisc" package (version 4.4-1; CRAN https://cran.r-project.org/web/packages/Hmisc/Hmisc.pdf). We considered that two mutations in the same patient belonged to different subclones if their VAF were different at diagnosis or relapse according to a Fisher's exact test (p<0.05) or they had different directions of evolution (for example, one was contracting and the other one was expanding between diagnosis and relapse). Changes in the clonal architecture in our dataset are represented using fishplots generated using the "fishplot" R package (version 0.4; github https://github.com/chrisamiller/fishplot). VAFs (Supplemental tables 7 and 9) were used to plot the clonal changes. VAFs less than 1 percent in either diagnosis or relapse were considered 0 in the fraction table. All the VAFs are multiplied by 2 to account for heterozygosity and summed up to 100% at each time point . The clone with highest VAF at a given time point was considered the parent clone.

Data deposition
Raw data files were used from previously deposited data into EGA (EGAS00001000275) and dbGap (phs001027.v2.p1). Newly generated sparse whole genome sequencing and targeted resequencing data from all analyses have been deposited into dbGap (phs001027.v3.p1).

Assessing association with clinical parameters
The association of evolutionary models with age , gender , epiallele clusters , treatment type (combination chemotherapy with or without stem cell transplantation), and time to relapse was estimated using the Kruskal Wallis test in R using the "kruskal.test" function.

Supplementary Tables
Supplementary Table 1: Summary of study subject clinical data. Table of clinical characteristics of patients in the study cohort. Included parameters are: Study subject identification numbers (Subject ID) match the assigned identifiers in dbGap accession number phs001027.v2.p1 (Subject_ID assignments). M = male, F = female; Age at diagnosis (years); clinical results of cytogenetic assessments and recurrent cytogenetic events (Cytogenetics at diagnosis) at the diagnosis stage; ELN = European Leukemia Net risk stratification(1); Induction and consolidation treatments each study subject received after diagnosis (Ara-C = cytarabine arabinoside; AlloSCT = Allogeneic stem cell transplant; AutoSCT = Autologous stem cell transplant); Time Until relapse (days between disease diagnosis and relapse); Genomic assays performed on DNA from each study subject (1 = performed; blank = not performed); Reference IDs (specimen IDs for diagnosis specimens subjected to targeted resequencing [EGAS00001000275]) from (11).  Table 3: Mutations identified in the whole exome sequencing cohort and associated changes during AML disease progression. Each line is a single somatic mutation identified in a single study subject. The columns are: study subject identification numbers, the gene name, the amino acid annotation, the chromosome, the genomic location, the reference allele, the alternative allele, the VAF at diagnosis, the VAF at relapse and if this mutation was gained (GA), lost (LO) or stable (ST) at relapse compared to diagnosis.

Supplementary Table 4: Somatic mutations validated in the whole exome sequencing cohort.
Each line is a single mutation identified in a single study subject. The columns are: study subject identifier, if the sample was the diagnosis or the relapse sample, the chromosome name, the genomic location, the reference allele, the alternative allele, the gene name, the HGSVc effect of the mutation, the HGSVp effect of the mutation, the exon of the gene the specified mutation is located in, the allele frequency in ExAC, the DBSNP GMAF allele frequency, the number of times this mutation has been reported in COSMIC, the TCGA mutation ID, the number of reads that contain the alternative allele, the sequencing depth at this position and the variant allele fraction. Table 5: IDH1 and IDH2 mutation validations using a sequenom assay. A subset of specimens were processed using a sequenom assay to detect IDH1-R132H, IDH2-R172K and IDH2-R140Q mutations. WT = wild type. Each line is a single mutation identified in a single study subject. The columns are: study subject identification numbers, the gene name, the amino acid annotation, the mutation significance (YES = oncogenic; ONCOGENIC, LIKELY = likely oncogenic, or UNKNOWN = unknown significance), the chromosome, the genomic location, the reference allele, the alternative allele, the VAF at diagnosis, the depth of coverage at diagnosis, the clone number is the clone that the mutation belongs to in the diagnosis sample (0 denoting a clone that is not present in the diagnosis sample), the VAF at relapse, the depth of coverage at relapse and the clone number is the clone that the mutation belongs to in the relapse sample (0 denoting a clone that is not present in the relapse sample). Gray coloring of the VAF and the depth of a mutation at diagnosis (respectively at relapse) indicates specimens in which the VAF and depth were corrected for low tumor content in the diagnosis (respectively relapse) sample.

Supplementary Table 8: Inferred clonal evolution groups.
Subject ID = study subject identification number. EvolutionModel.group = results of inferred evolution model analysis. EpialleleCluster = annotated epigenetic evolution progression for each subject identified in (2). Table 9: Multi-relapse patient specimen information and results. Targeted panel sequencing was performed on serial specimens from three patients. Sheet ST9a: Shown are the collection days relative to the date on which diagnosis specimens were collected. NA = not applicable. Sheet ST9b: mutations identified in the targeted panel sequencing performed on the multi-relapse specimens. Each line is a single mutation identified in a single study subject. The columns are: study subject identification numbers, the gene name, the amino acid annotation, the mutation significance (YES = oncogenic; ONCOGENIC, LIKELY = likely oncogenic, or UNKNOWN = unknown significance), predicted mutation effect, the chromosome, the genomic location, the reference allele, the alternative allele, the VAF detected, the depth of coverage, and the clone number indicates the clone that the mutation belongs to at first occurrence.  Table 11: Sequencing statistics of the sparse whole genome sequencing assay performed in the study. Specimen ID = study subject identification number; Disease stage indicates if the sequencing results are from the diagnosis or relapse stage; TOTAL = total number of reads; Number of duplicate reads = number of reads removed from analysis due to them being assessed to be duplicate reads; Percent mapped reads = percent of total reads that mapped uniquely to the genome and utilized in analysis; Mean coverage per base in mapped genomic regions = mean coverage per base after alignment of the reads to reference genome.