Whole genome sequencing provides comprehensive genetic testing in childhood B-cell acute lymphoblastic leukaemia

Childhood B-cell acute lymphoblastic leukaemia (B-ALL) is characterised by recurrent genetic abnormalities that drive risk-directed treatment strategies. Using current techniques, accurate detection of such aberrations can be challenging, due to the rapidly expanding list of key genetic abnormalities. Whole genome sequencing (WGS) has the potential to improve genetic testing, but requires comprehensive validation. We performed WGS on 210 childhood B-ALL samples annotated with clinical and genetic data. We devised a molecular classification system to subtype these patients based on identification of key genetic changes in tumour-normal and tumour-only analyses. This approach detected 294 subtype-defining genetic abnormalities in 96% (202/210) patients. Novel genetic variants, including fusions involving genes in the MAP kinase pathway, were identified. WGS results were concordant with standard-of-care methods and whole transcriptome sequencing (WTS). We expanded the catalogue of genetic profiles that reliably classify PAX5alt and ETV6::RUNX1-like subtypes. Our novel bioinformatic pipeline improved detection of DUX4 rearrangements (DUX4-r): a good-risk B-ALL subtype with high survival rates. Overall, we have validated that WGS provides a standalone, reliable genetic test to detect all subtype-defining genetic abnormalities in B-ALL, accurately classifying patients for the risk-directed treatment stratification, while simultaneously performing as a research tool to identify novel disease biomarkers.


Patients and methods a. Whole genome sequencing Patient cohort
Diagnostic and matched post-treatment samples from 210 and 208 childhood B-lineage acute lymphoblastic leukemia (B-ALL) patients, respectively, were prepared for whole genome sequencing (WGS) (Supplementary Table 1). Samples were obtained from the Blood Cancer UK Childhood Leukaemia Cell Bank approved by the South West-Central Bristol Research Ethics Committee. Written informed consent was obtained from patients or parents, in accordance with the Declaration of Helsinki. Patients were divided into two distinct cohorts: 1) with known established chromosomal abnormalities used in risk-stratification for treatment (n=38) (BCR::ABL1, ABL-class fusion, ETV6::RUNX1, near-haploidy, high hyperdiploidy, low hypodiploidy, iAMP21-ALL, KMT2A-r and TCF3::PBX1), to validate the effectiveness of detection by WGS. 2) Patients from a single clinical trial (UKALL2003) with sufficient diagnostic and germline material available that were negative for the above abnormalities based on routine testing by cytogenetic and FISH analyses (according to recommended testing for BCR::ABL1, ETV6::RUNX1 and KMT2A-r in UKALL2003, as previously described (1)). Of note, TCF3::PBX1 patients were identified by chromosomal analysis only in UKALL2003, no TCF FISH was performed, explaining the high number that emerged in cohort 2.
Cohort 2 patients were termed "B-other-ALL" within this study. This cohort included a high number of cases with normal karyotypes (n=50) or failed cytogenetic results (n=23), for which limited material restricted the level of genetic testing at diagnosis in some patients. Patients with CRLF2 rearrangements (CRLF2-r) were identified as previously described.(2, 3) They were underrepresented in this study, because, patients with unknown subtype-defining genetic abnormalities were prioritized, thus known CRLF2-r patients were excluded. Additional FISH tests were performed following WGS to validate key genetic abnormalities in some cases; these data are provided in Supplementary Table 1.

Sample collection and preparation
Diagnostic samples comprised bone marrow cells collected prior to the start of treatment. DNA extracted from bone marrow cells after the start of treatment was used as a matched germline sample in 208 patients. The time point from which the germline sample was taken was available for 203 patients; 11% (23/203) were taken before the end of induction therapy (EoI) (week 5), as indicated in Supplementary Table 1. Induction therapy for B-ALL is highly effective at reducing disease burden; it was previously shown that 94% of B-ALL patients have <1% leukaemic blast cells in their bone marrow at EoI.(4) Of the 23 cases with samples taken before EoI, 13 had leukemic blast counts <0.05%, while six had leukemic blast counts ≥1% (range, 1%-90%), while four samples had missing data, potentially representing germline samples contaminated with leukaemic cells. Patients with detectable levels of leukaemic blasts in their germline samples (minimal residual disease, MRD) were problematic for T-N data analyses, which was particularly relevant among patients with BCR::ABL1 and EBF1::PDGFRB fusions, who tend to be refractory or respond slowly to initial therapy.(5-7) To prepare the sequencing libraries, 1-3ug of high-quality double-stranded DNA (ds-DNA) was extracted from bone marrow aspirates using the DNeasy Extraction kit (Qiagen, Manchester, UK). The Qubit® dsDNA BR (Broad-Range) Assay kit was used to quantify dsDNA concentrations (Thermo Fisher Scientific, California, USA). WGS libraries were prepared using the TruSeq® DNA PCR-Free Library Preparation kit (Illumina) with 600ng input DNA. Libraries were sequenced as paired-ends (2 × 150 cycles) on a HiSeqX platform to a mean depth of >38x for germline and >76x for diagnostic samples. We assessed the effect of sequencing depth on the ability of WGS to detect structural variants from information of clonality (based on FISH results). We were able to detect structural variants when present in <20% of cells (based on FISH data) and with less than 18x in the diagnostic sample.
A small initial subset of samples were prepared with 500ng of input DNA using a pre-released version of the TruSeq PCR free protocol and libraries were sequenced as 100 bp paired-end reads on a Hiseq2500.

WGS data analysis
DNA sequence reads of the diagnostic and remission samples were aligned to human reference genome GRCh38 using the Whole Genome Sequencing pipeline (WGSv8) with the Illumina Isaac aligner (SAAC01325.18.01.29) and the removal of duplicate read-pairs.(8) Genetic variants were detected from the sequencing reads of diagnostic and remission DNA using Strelka for small variants, defined as single nucleotide variants (SNV) and indels <50bp.(9) Manta was employed for the identification of structural variants (SV) >50bp (deletions, duplications, inversions, insertions and translocations) and Canvas was used for the discovery of copy number variants.(10, 11) The Tumour Normal pipeline (TNWv6) used the somatic joint-calling mode in Strelka, Manta and Canvas to call somatic small and structural variants and copy number alterations. The version and description of individual software used in WGSv8 and TNWv6 in this study are summarised in the tables below.

IGH::DUX4
Two D4Z4 macrosatellite arrays are located proximal to the subtelomeric region of chromosomes 4 and 10. Each array contains 11 to more than 100 repeated units. Each repeat unit of D4Z4 is approximately 3.3-kb in length and contains an open reading frame (ORF) encoding the DUX4 gene. These ORFs have >99% sequence similarity within each array and >98% similarity between arrays. Therefore, fragmented reads from DUX4 can be aligned at multiple positions on the reference and due to this uncertainty are frequently assigned alignment MAPQ scores of 0. The IGH locus contains multiple copies of the repetitive IGHV gene-containing segments that range in size (4 to 24-kb) and exhibit high sequence similarity.
A review of samples harbouring ERG deletions, associated with IGH::DUX4 rearrangements in previous studies, was able to identify the characteristic clusters of reads pairs spanning putative IGH::DUX4 breakpoints in some but not all samples.(17-19) An investigation of the samples where putative IGH::DUX4 breakpoints could be detected revealed that within these samples the spanning reads often mapped to different D4Z4 repeat units and to different arrays. The hypothesis was that all ERG deleted samples have multiple reads spanning their IGH::DUX4 breakpoint but in most samples those read pairs are mapped to many different repeats units and the expected cluster of spanning reads is not present. It was tested by identification of all paired reads that mapped to both the IGH locus and the D4Z4 repeat array to determine if: 1) they were in excess in ERG deleted samples, 2) if after local assembly, the longer contigs mapped to a unique set of breakpoints, where one read mapped to a D4Z4 repeat array (chr4:189950000-190214555 or chr10:133614430-133797422 in GRCh38) and its pair mapped to the IGH locus (chr14:105536437-106929844 in GRCh38). This analysis was performed on 210 B-ALL and 208 matched germline samples.
The germline samples were split into a training set (n=100 samples) and the remaining samples in the cohort (n=318) were used as a validation set. The count of all spanning read pairs per sample was normalized to total read count. For each sample, the reads mapping to the IGH locus were placed into 1000bp non-overlapping bins. In the training set, the bin (chr14:1086883000-1086883999) contained ≥5 reads. All read pairs that mapped between these coordinates were excluded from further analysis of all samples. Spanning read pairs from all samples with >10 spanning reads per billion (SRPB) were locally assembled using SOAP21 and k-mers 15, 31, 41, 51.(20) Not all k-mers generated successful assemblies, thus the assembly that generated the longest scaffold was chosen for further analysis. All assembled contigs and scaffolds were aligned to the GRCh38 reference genome using BLASTN2.(21) Any contigs or scaffolds that mapped entirely within the D4Z4 microsatellite array or the IGH locus were excluded. The remaining contigs were used to determine the unique set of breakpoints in the rearrangements between IGH and DUX4.

DUX4-r detection involving other partner genes
Although IGH is the most common partner gene of DUX4, other partner genes have been reported in ALL, such as ERG. (19) A similar approach to IGH::DUX4 detection was used to identify cases that harbour DUX4-r involving other (non-IGH) partner genes. Briefly, all paired reads that mapped to both the D4Z4 repeat array and any other genomic region were identified. The training set of 100 germline samples identified 14x 1000bp bins in five regions of the genome (equivalent to <0.0001% of the genome) that had ≥6 reads mapping between the region and the DUX4 locus. These 14 regions, detailed in the table below, were excluded from analysis, as non-specific mapping may lead to the false positive detection of DUX4-r.
Regions excluded from DUX4-r detection involving other (non-IGH) partner genes. Chromosome and start and end co-ordinate of the 14x 1000bp binned regions is shown.  Figure 2: A complex BCR::ABL1 fusion, formed through two rearrangements, in patient 22722. The mate-pairs of rearrangement 1 (red) map to intron 6 of BCR and downstream of ABL1. While rearrangement 2 (green) had one mate located 418bp away from rearrangement 1 and the mate-pair mapped within intron 1 of ABL1. This complex rearrangement pattern will produce an in-frame BCR::ABL1 fusion (BCR exon 1-6 fused to ABL1 exon 2-11) for which there was weak read support; this is highlighted by three reads with poor mapping quality and duplication of ABL1 exon 2-11, as depicted by the higher read depth (3). (B) Alignment of paired-reads (labelled read 1 and read 2) that were split across genomic regions, demonstrating the BCR::ABL1 fusion between intron 6 of BCR and intron 1 of ABL1 via a number of rearrangements. The sequence in the read that mapped to a specific location in the  The study comprises two cohorts: cohort 1 (patients 1-38) with known chromosomal abnormalities used in risk-stratification for treatment (n=38); cohort 2 (patients 39-210) negative for these abnormalities according to standard-of-care testing (Supplementary Information). Key clinical and genetic data are provided. Abnormal FISH results are described; GAIN and LOSS include centromeric probes (such as CEP4, CEP10, CEPX) and gene-specific probes (such as RUNX1 (chromosome 21), IGH (chromosome 14), CRLF2 (chromosome X)), -r=gene rearrangement. The copy number of each probe [CN=x] and % of cells containing the abnormality [x%] are provided; GAIN is CN=3 and LOSS is CN=1 unless otherwise stated. Abnormal MLPA results: copy number [CN=x] of a gene or exonic region (Ex) of a gene is provided ; GAIN is CN=3 and DEL (deletion) is CN=1 unless otherwise stated. Subgroup classification is based on the detection of primary genetic abnormalities (Table 1). Cases lacking subtype-defining genetic variants are classified as 'Other' with key (recurrent or previously associated with leukemia) genetic abnormalities listed. Some cases had genetic abnormalities that characterise ≥2 genetic subtypes of ALL, for example patient 28956 harboured a BCR::ABL1 gene rearragement and a pattern of chromosome gain characteristic of high hyperdiploidy. WTS subtype data is summarised and the primary subgroup of each case is provided. The blast count at diagnosis and timepoint from which the germline sample was taken is provided where available. Only 12% (25/207) of germline samples were from a timepoint before the end of induction (week 5), at which point >90% of B-ALL patients have <1% leukemic blast count in their marrow [1].