Refined detection and phasing of structural aberrations in pediatric acute lymphoblastic leukemia by linked-read whole-genome sequencing

Structural chromosomal rearrangements that can lead to in-frame gene-fusions are a leading source of information for diagnosis, risk stratification, and prognosis in pediatric acute lymphoblastic leukemia (ALL). Traditional methods such as karyotyping and FISH struggle to accurately identify and phase such large-scale chromosomal aberrations in ALL genomes. We therefore evaluated linked-read WGS for detecting chromosomal rearrangements in primary samples of from 12 patients diagnosed with ALL. We assessed the effect of input DNA quality on phased haplotype block size and the detectability of copy number aberrations and structural variants in the ALL genomes. We found that biobanked DNA isolated by standard column-based extraction methods was sufficient to detect chromosomal rearrangements even at low 10x sequencing coverage. Linked-read WGS enabled precise, allele-specific, digital karyotyping at a base-pair resolution for a wide range of structural variants including complex rearrangements and aneuploidy assessment. With use of haplotype information from the linked-reads, we also identified previously unknown structural variants, such as a compound heterozygous deletion of ERG in a patient with the DUX4-IGH fusion gene. We conclude that linked-read WGS allows detection of important pathogenic variants in ALL genomes at a resolution beyond that of traditional karyotyping and FISH.


Results
We subjected diagnostic samples from 12 children with acute lymphoblastic leukemia (ALL) enrolled on the Nordic Society of Pediatric Hematology and Oncology (NOPHO) protocols during 1998-2008 (Table 1) 8,20 to linked-read WGS (Table S1). The DNA used to prepare linked-read sequencing libraries was obtained from biobanked DNA isolated by a standard column-based method or by freshly prepared HMW DNA extraction. The estimated length of the input DNA was directly correlated to the phase block size (Table S2). The proportion of phased SNPs was 81-99% (mean 92%), and the longest phased blocks ranged from 0.9-18 Mb (mean 7 Mb) ( Table S3). The DNA extracted using the High Molecular Weight protocol yielded the longest haplotype blocks (18 Mb), but the DNA extracted by the standard column-based method allowed for detection of all known SVs even at low sequencing coverage (10×), despite the shorter phase blocks produced (Fig. S1).
For five of the 12 ALL genomes, detailed karyotype information obtained at diagnosis by G-banding or FISH for the subtype-defining genetic aberrations high hyperdiploidy (HeH), t(12;21) and t(9;22) was available and allowed verification of the results from linked-read WGS. The remaining six patients with either T-ALL or B-other subtype had either complex or incomplete karyotype available from ALL diagnosis. Their subtypes were determined in previous studies by a combination of WGS, RNA-sequencing, and/or arrays (Table S1) 11,19 . In all cases, existing karyotype information, newly generated FISH data (when cells were available), and/or a combination of Infinium arrays for copy number estimates and RNA-sequencing validated the findings from linked-read WGS. The results for each patient and subtype are detailed below, and for each case a revised karyotype after linked-read WGS is given in Table 1.
High Hyperdiploidy (HeH). Two patients (ALL_370 and ALL_689) had the classical HeH subtype with 55 chromosomes. Using the linked-read WGS data, we binned the average sequencing coverage in 10 Kb bins across the genome and scanned for CNAs across the 22 autosomal chromosomes (Fig. 1a,b). The linked-read WGS estimates of copy numbers correlated perfectly with that from the karyotypes and array-based CNA for ALL_370 and ALL_689. For a third patient (ALL_47) with suspected HeH subtype 21 , we verified the HeH karyotype in the linked-read WGS data to be 58, XY, +4, +5, +6, +9, +10, +12, +14, +17, +18, −19, +19, +21, +21, +22, which was confirmed by array-based CNA analysis (Table 1; Fig. 1c). The copy neutral loss of chromosome 19 (uniparental disomy) was visible in the linked-read WGS data by an overrepresentation of homozygous SNVs on chromosome 19 (Fig. S2). Translocations t(12;21) and t(9;22). The t(12;21)[ETV6-RUNX1] translocation and associated aberrations were determined in two patients (ALL_386 and ALL_458) (Fig. 2a,b). As anticipated by karyotyping and previous WGS of patient ALL_458 11 , a balanced t(12;21) translocation resulting in the expression of both the canonical ETV6-RUNX1 and the reciprocal RUNX1-ETV6 fusion genes was unambiguously detected at base-pair resolution in the linked-read WGS data (Fig. 2c). A deletion spanning over a 2.1 Mb region that includes the second allele of ETV6 was observed on the other haplotype, thus affecting both alleles of ETV6 (Fig. S3). Besides gain of chromosome 10 and a heterozygous 38 Mb deletion of chromosome 11q22-q25, no other large structural variants were identified in ALL_458.
In contrast, the karyotype for patient ALL_386 suggested a complex series of translocations involving ETV6 and RUNX1 and chromosomes 3, 12, 14 and 21. In a previous study, two in-frame fusion genes were identified in this patient (ETV6-RUNX1 and DCAF5-ETV6) 19 . Linked-read data resolved that the DCAF5-ETV6 fusion gene arose from a translocation between 14q24.1 and 12p13.2 and the ETV6-RUNX1 fusion gene arose from a translocation between 12p13.2 and 21q22.12. The phasing information further resolved a heterozygous 0.15 Mb intragenic deletion in ETV6 (haplotype 1) and that the ETV6-RUNX1 and DCAF5-ETV6 fusion genes originated from the other allele (haplotype 2) of ETV6, thus disrupting both copies of ETV6 in this patient (Fig. 2d). Linked-read WGS resolved the exact breakpoints on chromosomes 3, 12 14 and 21, and identified several additional alterations that were missed by genetic analysis at diagnosis. Of these, DCAF5 (chr14) and the reciprocal RUNX1 (chr21) loci were separated by a 44 Mb insertion of a region originating from chromosome 2q33.1-q37.3 on the derivative chromosome 14q24.1 (Fig. 2e). Furthermore, a 650 Kb region from chromosome 3p21.31 was inverted and inserted into the derivative chromosome 3q21.2 arm where the material from chromosome 12q24.13 was translocated (Fig. S4). All of the derived chromosomes determined by linked-read WGS were subsequently validated by FISH (Fig. S5).
In patient ALL_402 with t(9;22)[BCR-ABL1], linked-read WGS revealed an unexpectedly complex rearrangement that involved the BCR (22q11.23), ABL1 (9q34.12), PRRC2B (9q34.13), SIL1 (5q31.2) and LINC01128 (1p36.33) loci (Fig. S6). In addition to the deletion of chromosome 9p21 reported in the karyotype, we detected a 35 Mb deletion (8p11.23-p23.3) and a gain starting at 8p11.23 and continuing through the entire q-arm of chromosome 8 (Fig. 3a). RNA-sequencing verified that the 5′ end of BCR is fused with the 3′ end of ABL1, the 5′ ends of the reciprocal ABL1 and SIL1 loci form a head to head translocation, resulting in two truncated transcripts, the 5′ end of LINC01128 is fused with the 3′ end of SIL1, whilst the 5′ end of PRRC2B is fused with the reciprocal 3′ end of the BCR gene (Fig. 3b). None of these complex rearrangements were phased in the linked-read WGS data, but phasing information was not required to fully resolve the structure of the breakpoints in this case. B-other group. DUX4 and ZNF384-rearrangements define newly described subtypes of BCP-ALL that were initially detected in large-scale RNA-sequencing studies 17 46, XY, der(7)t(7;9)(q34;q31), t(7;11)(q34;p15), der(9)t(7;9) (q34;q31)del(9)(p21p21), del(9) (p21p21) www.nature.com/scientificreports www.nature.com/scientificreports/ insertion of the DUX4 gene (subtelomeric region of chr4q and 10q), into the enhancer region of the IGH locus (chr14) 23 . With the exception of a 93 Mb deletion on chromosome 6q14.1-q27 in ALL_390, the two patients with DUX4-IGH (ALL_390 and ALL_501) had normal karyotypes typical of this subtype (Fig. S7). Previous short-read WGS of ALL_501 failed to identify the DUX4-IGH rearrangement in this patient 11 . The DUX4-IGH rearrangement was not directly detected in the linked-read data by the longranger software, however with the aid of the Integrated Genome Viewer, we were able to identify split linked-reads supporting the insertion of at least one copy of DUX4 into the IGH locus, thus supporting the rearrangement (Fig. S8). Besides the 6q deletion in ALL_390, the linked-read data revealed a compound heterozygous deletion of ERG transcript variant 1 (NCBI Reference Sequence: NM_182918.3). A large 6.5 Mb phase block on chromosome 21q22 enabled detection of a 9.3 Kb focal deletion of exon 1 on haplotype 1 and a separate 57.2 Kb deletion spanning exons 3-10 on haplotype 2 (Fig. 4a).
One patient with a PAX5-ELN fusion gene (ALL_707) detected by RNA-sequencing and short-read WGS was included 11 . The karyotype indicated two derivative chromosomes (chromosome 7 and 9) as well as a 9p deletion. These aberrations were resolved at a higher resolution with linked-read WGS, which demonstrated a derivative chromosome 9 harboring the PAX5-ELN fusion gene, a truncated chromosome 7, as well as a heterozygous deletion of chromosome 9p13.2 with the breakpoint in the PAX5 locus (Fig. S10). The structure of the resulting derivative chromosomes and their validation by FISH are shown in Fig. 4d-f.

T-ALL.
Based on karyotype, a bi-allelic deletion of chromosome 9p21 and two translocations involving chromosomes 7 and 9 and chromosomes 7 and 11 were expected in ALL_559. The homozygous deletion of chromosome 9p21 was clearly resolved in the linked-read WGS data (Fig. S11). Previous short-read WGS and RNA-sequencing data identified two translocations involving the T-cell receptor beta locus (TRBC2 gene) on chromosome 7, namely t(7;11)(q34;p15)[RIC3-TRBC2] and t(7;9)(q34;q31) resulting in the fusion of TRBC2 with an unannotated transcript expressed on chromosome 9 between the TAL2 and TMEM38B genes 11 . The linked-read WGS data clarified that the two alleles of TRBC2 were involved in independent translocation events. First, the t(7;11)(q34;p15) resulting in expression of RIC3-TRBC2 was a consequence of a balanced translocation of chromosome 7 involving one allele of TRBC2 (Fig. 5a). On the other allele of TRBC2, the t(7;9)(q34;q31) was accompanied by a 0.2 Mb deletion flanked by an inversion of chromosome 7q34 (Fig. 5b-d), a re-arrangement that was missed by both karyotyping and previous short-read WGS 11 . FISH verified the derivative chromosomes determined by linked-read WGS (Fig. 5e,f).

Detection of key diagnostic deletions for ALL.
To further demonstrate that linked-read WGS allows detection of other aberrations than large-scale aneuploidies and translocations, we screened the 12 ALL genomes for focal deletions in a set of relevant genes for ALL, including BTG1, CDKN2A/B, EBF1, ETV6, IKZF1, PAX5, RB1 and ERG 25 (Fig. S12). With the exception of RB1, each of the genes analyzed were deleted in at least one patient based on linked-read WGS. All deletions were verified by array-based CNA analysis. Phasing data www.nature.com/scientificreports www.nature.com/scientificreports/ revealed that both of the t(12;21) cases harbored ETV6 deletions on the allele that was not affected by the translocation, thus resulting in bi-allelic disruption of ETV6. Consistent with previous studies 23, 26 , recurrent BTG1 and IKZF1 deletions were detected in the t(12;21) and DUX4-IGH patients, respectively (Fig. S13).

Discussion
In our study the linked-reads enabled highly accurate resolution of the majority of the genomic aberrations defined by cytogenetic methods and refined or identified new structural rearrangements in 10 of the 12 analyzed ALL genomes. Although the ALL subtypes and numbers of samples were modest, these results show clear proof of principle for linked-read WGS for digital karyotyping in ALL. Studies that have applied linked-read-WGS to other cancer types such as triple negative breast cancer 27 , metastatic gastric tumors 28 , prostate cancer 29 , and cell lines 30,31 have reached similar conclusions.
Linked-read WGS requires long input DNA molecules to gain the most benefit from the technology 3 . However, when working with clinical samples, high molecular weight DNA extraction and handling of HMW DNA is not practical in most clinical settings. In our study we showed that DNA from patient samples with an average size of DNA < 50 kb prepared using a standard column-based DNA extraction method were highly Circos plots for patients ALL_386 and ALL_458. The first (outer) track shows the chromosomes and their banding, the second track shows log R ratios from Infinium arrays, the third track shows copy number determined by linked-read WGS in 10 Kb bins, and the fourth (innermost) track shows copy number calls using the CNVnator software. Red indicates gain and blue indicates deletion. Expressed fusion genes are highlighted within each circos plot, solid lines indicate in-frame fusion genes. (c) Heatmap of overlapping linked-reads supporting a balanced inter-chromosomal translocation t(12;21) resulting in the ETV6-RUNX1 fusion gene in ALL_486. (d) Linked-reads mapped to the two haplotypes at the ETV6 locus in patient ALL_386, which depicts a deletion on haplotype 1 (indicated by the red box) and the breakpoint giving rise to the DCAF5-ETV6 and the ETV6-RUNX1 fusion genes is indicated by a dashed line on the second allele (haplotype 2). (e) Schematic representation of the chromosomal rearrangements resulting in derivative chromosomes as determined by linked-read WGS in ALL_386. The resulting fusion transcripts with breakpoints are drawn alongside the chromosomes involved in the translocations. (2020) 10:2512 | https://doi.org/10.1038/s41598-020-59214-w www.nature.com/scientificreports www.nature.com/scientificreports/ informative for detection of genomic aberrations with linked-read WGS. When we compared HMW DNA to DNA from standard column extractions, and when we compared low-coverage GemCode to Chromium library preparation, the results were concordant. Although HMW DNA may increase the chances of phasing over chromosomal breakpoints, which makes interpretation of the chromosomal structure and organization easier, our data suggest that long DNA molecules and high sequencing depth may not be required for accurate detection of prognostically relevant aberrations present in the major clone of leukemic samples.
Although the genomic structure of most chromosomal rearrangements that are of clinical relevance in ALL were resolved with high precision by linked-read WGS, the recently described DUX4-IGH fusion gene failed to be precisely resolved by this technology. The DUX4-IGH rearrangement is a particularly challenging aberration to resolve due to the location of DUX4 in the complex tandemly repeated region D4Z4 in the subtelomeric region of chr4q and chr10q 32 , and the insertion of DUX4 into the IGH locus. This complexity is likely the reason for the lack of identification of the recurrent DUX4-IGH fusion gene prior to recent RNA-sequencing studies in ALL [17][18][19] . Nonetheless, a guided analysis based on identifying split linked-reads that map to the DUX4 and IGH loci identified support for the insertion of at least one copy of DUX4 into the IGH locus in the linked-read WGS data.
The present study is limited by the fact that we have not compared the linked-reads to other next generation approaches such as standard paired-end WGS, Hi-C, third-generation single-molecule sequencing, or optical mapping technologies, which when used in a multiplatform approach have been demonstrated to be a powerful method for resolving complex structural rearrangements [33][34][35] . Future studies will be required for more formal benchmarking of linked-read WGS and other next generation technologies for digital karyotyping specifically in ALL and for other cancer types.
In summary, we focused on detecting large-scale structural aberrations, which are the most relevant type of aberrations for clinical care in ALL 36 . We generated a detailed view of large-scale chromosomal aberrations in cells from pediatric ALL patients, which reaches beyond the resolution of traditional karyotyping data 11,12,37 . Our data suggests that digital karyotyping by linked-read WGS can replace, or at the least complement traditional clinical diagnostic methods such as G-banding and FISH in the future.

Patients and Methods
Patient samples. Primary ALL samples were collected as described previously 38 . The patients were selected from the NOPHO cohort based on presence of cytogenetic aberrations detected at diagnosis or fusion genes detected by previous WGS or RNA-sequencing studies ( Table S1) 11,19,21 . DNA and RNA were extracted from 2-10 million cells using the AllPrep DNA/RNA Mini Kit, AllPrep DNA/RNA/miRNA Universal Kit, or the MagAttract HMW DNA kit (Qiagen). The DNA concentrations were measured using the Qubit dsDNA Broad Range assay (Invitrogen). The study was approved by the Regional Ethics Review Board in Uppsala, Sweden and was conducted according to the guidelines of the Declaration of Helsinki. The patients and/or their guardians provided written informed consent.  www.nature.com/scientificreports www.nature.com/scientificreports/ XCP orange/green XCyting Chromosome Paints) and subtelomeric probes (Vysis Totelvysion probes) followed by analysis using a fluorescence microscope (Carl Zeiss) and the Isis software (MetaSystems) were used to validate translocations identified by linked-read WGS on metaphase spreads from cultured bone marrow cells.
Linked-read data analysis. Linked-read WGS data was processed and phased using the Long Ranger pipeline from 10x Genomics (v1.2.0 for GemCode and v2.1.6 for Chromium) with the hg19/GRCh37 reference genome. Data were visualized using the Loupe Genome Browser v2.1.1. SVs called by Long Ranger were manually reviewed against karyotype data, CNA data from Illumina Infinium arrays, and fusion genes detected by RNA-sequencing. Genomic copy number levels were estimated by chromosomal segmentation read-depth analysis in 10 Kb windows using the CNVnator software 40 . B-allele frequencies were calculated from VCF files using the VariantAnnotation package and custom scripts in R 41 . Ideograms of derivative chromosomes were drawn to scale with the CyDAS software 42 .

RNA-sequencing.
A RNA-sequencing library was constructed from 300 ng total RNA with the TruSeq stranded total RNA protocol (Illumina) for sample ALL_402. The library was sequenced on a NovaSeq. 6000 instrument with 100 bp paired-end reads. Strand-specific RNA-sequencing data was available from previous studies for all of the remaining patient samples, except from patient ALL_370 where RNA was not