Main

The identification of recurrent cytogenetic findings and single gene defects, such as those occurring in FLT3, NPM1, and CEBPA, forms the diagnostic basis for risk-adapted therapy in leukemia and, in particular, acute myeloid leukemia (AML).1, 2, 3 The 2008 World Health Organization classification of hematopoietic neoplasms recognizes seven such translocations in AML including t(8;21) (q22;q22), inv(16), t(15;17) (q22;q12), t(9;11) (p22;q23), t(6;9) (p23;q34), inv(3), and t(1;22) (p13;q13); and five translocations including t(9;22) (q34;q11.2), 11q23 rearrangements, t(12;21) (p13;q22), t(5;14) (q31;q32), and t(1;19) (q23;p13.3) in acute lymphoblastic leukemia.4 Successful identification of these translocations and their variants, as well as an ever-expanding list of single gene defects, is now required for an accurate classification of leukemia. Although the number of genes and translocations requiring evaluation grows with increasing knowledge of leukemia pathogenesis and treatment response, the methods by which such information is obtained have changed little in the last 10–20 years. Furthermore, as the number and complexity of tests required for each new leukemia diagnosis increase, so do the costs and associated burdens placed on the clinical laboratory.

In the last several years, new DNA sequencing methods collectively known as ‘next generation sequencing’ have greatly increased our knowledge of cancer genomes, yielding rapid and relatively low-cost data sets.5, 6, 7, 8, 9 These studies have uncovered new prognostic markers in AML such as DNMT3A and IDH1, illustrating the importance of next generation sequencing in the research/discovery setting. In the clinical laboratory, however, such mutations are generally identified by more conventional methods such as PCR (and variants thereof), Sanger sequencing, and fluorescence in situ hybridization (FISH). Although these methods are relatively simple to implement, the time required to optimize each makes it difficult to keep up with the rapid pace of research in the field. Commonly tested genes in AML prognosis such as FLT3 and NPM1 tend to have a limited spectrum of mutations, making them amenable to PCR-based testing; however, the broader range of mutations in genes such as CEBPA, DNMT3A, and KIT necessitates the use of direct sequencing, adding to the expense and turn-around time of testing.5, 10, 11, 12 The detection of recurrent, balanced chromosomal translocations is critical to leukemia prognosis and is generally done by FISH and G-banded cytogenetics, relying on direct visualization of DNA. Although FISH offers increased sensitivity over conventional cytogenetics, the identification of novel translocation partners, such as those involving the MLL locus, or variant breakpoints requires the use of multiple probes, again increasing the cost and complexity of testing.13, 14

Although the acquisition of next generation sequencing data is now relatively straightforward (see Mardis15 for an excellent review), its analysis can be extremely complicated and time consuming due not only to the volume of data (often >100 GB/run), but also to the computational difficulty of aligning short reads (Figure 1). Next generation sequencing, as opposed to conventional Sanger sequencing, relies on massive parallelization of the sequencing process to generate large numbers of reads; however, these reads are generally much shorter (36–400 bp) than those obtained by Sanger sequencing. The increased efficiency of next generation sequencing has greatly reduced the cost of sequencing to <0.00001 cents/base (compared with 1–5 cents/base by Sanger sequencing), allowing for the sequencing of whole human genomes for $5000.16 Many software packages for the analysis of next generation sequencing data exist, including both freely available and commercial options. In this project, we relied solely on freely available software for both sequence alignment and downstream analysis. For the first step of sequence analysis we used both the Burrows-Wheeler Aligner (BWA) and Novoalign to perform alignments.17, 18 These software packages perform the computationally intensive task of aligning the short reads to the human reference genome.
To identify sequence variations in the aligned data, we then used the Genome Analysis Tool Kit (GATK) Unified Genotyper software to identify single nucleotide variants (SNVs), and both the GATK and Pindel software to identify small and medium sized insertions and deletions (indels).19, 20 Finally, to find translocations we used the Breakdancer software package, which identifies discordant paired-end reads where one end maps to the targeted chromosome and the other to an alternate chromosome, and Slope, which identifies single end chimeric reads spanning the translocation boundary (Figure 2).21, 22 Although similar data can be obtained by commercial software packages, publicly available options offer a greater degree of customization and scalability.
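The discordant paired-end idea described above can be illustrated with a minimal Python sketch. This is not the Breakdancer implementation; the `ReadPair` record, the helper names, and the toy coordinates are all hypothetical, and a real analysis would read alignments from a BAM file and weigh mapping quality and insert size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical, simplified record of one aligned read pair."""
    name: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def in_target(chrom, pos, targets):
    """True if (chrom, pos) lies inside any (chrom, start, end) target interval."""
    return any(c == chrom and start <= pos <= end for c, start, end in targets)


def discordant_pairs(pairs, targets):
    """Keep pairs whose ends map to different chromosomes, with exactly one end on target."""
    hits = []
    for p in pairs:
        if p.chrom1 == p.chrom2:
            continue  # both ends on the same chromosome: not an inter-chromosomal candidate
        end1_on = in_target(p.chrom1, p.pos1, targets)
        end2_on = in_target(p.chrom2, p.pos2, targets)
        if end1_on != end2_on:  # one end captured, the mate elsewhere
            hits.append(p)
    return hits


# Toy coordinates only; these are not real gene positions.
targets = [("chr9", 1000, 2000)]
pairs = [
    ReadPair("a", "chr9", 1500, "chr22", 500),  # candidate: target end plus mate on chr22
    ReadPair("b", "chr9", 1500, "chr9", 1800),  # concordant pair
    ReadPair("c", "chr1", 10, "chr2", 20),      # neither end on target
]
print([p.name for p in discordant_pairs(pairs, targets)])
```

Only pair `a` survives in this toy input. Many such candidates fall in repeat regions, which is why a second check on chimeric single-end reads is applied.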

Figure 1

Hybrid capture enriched next generation sequencing. Although next generation sequencing encompasses a variety of technologies, each relies on massive parallelization of sequencing to achieve enormous throughput. In this example, Illumina sequencing is depicted. (a) Genomic DNA is sheared into small pieces (typically 300–500 bp) by sonication. (b) Sequencing adapters and sequencing indexes, the latter allowing for the identification of individual samples in pooled multiplexed data, are ligated to the sheared genomic DNA. (c) The prepared DNA (now called a library) is captured using biotinylated cRNA oligomers specific for the region of interest. Following hybridization, this enriched DNA is eluted. (d) The enriched DNA libraries from multiple samples (each with unique index tags) are then loaded into a ‘flow cell’ containing immobilized oligomers with sequences complementary to the ligated library adapters. The library DNA then binds the surface and undergoes ‘bridge amplification’ to produce small colonies containing the amplified DNA library sequence. (e) DNA in the colonies is then sequenced using fluorescent, reversibly blocked nucleotides. Each nucleotide is labeled with a unique fluorophore, and, following incorporation of the complementary labeled base, each colony is scanned by a laser to determine the sequence of the last incorporated base. This process is performed in parallel over millions of colonies and repeated for each base, resulting in reads ranging from 36 to 150 bp depending on the chemistry and instrument used.

Figure 2

Overview of translocation identification by next generation sequencing. (a) Translocations occurring at the DNA level were identified by designing capture probes that tiled 2 × across both the exons (dark green) and introns (light green) of gene partners commonly involved in translocations. In this example ABL1 is captured, but its partner BCR is not. (b) Genomic DNA was then fragmented into 300 bp pieces, library prepped, and captured. Genomic DNA containing sequences complementary to the ABL1 (green)-specific biotin-labeled capture probes (blue), in this example, was enriched. Although most of the captured DNA represented contiguous areas of ABL1, regions with partial homology representing the actual DNA breakpoint (red and green) were also captured. (c) After aligning the sequence data, Breakdancer was used to identify paired reads in which one end of the paired-end read mapped to the targeted area (ABL1) and the other end did not (green/purple and green/red reads). (d) As the paired-end approach is subject to a high false-positive rate due largely to DNA repeat regions, we employed a second level of filtering using Slope. Regions containing possible breakpoints were analyzed to find chimeric single end reads representing the actual translocation boundary (green/red). Chimeric reads mapping to repeat regions were removed based on their low score, and single high scoring hits were identified.

Using targeted next generation sequencing, we propose a new paradigm in leukemia diagnostics in which prognostic information currently obtained by a variety of disparate methods can be acquired on a single platform with improved efficiency and markedly increased scalability. In this study, our goals were to prove that targeted next generation sequencing can be used to detect prognostically significant mutations and translocations, and that such data can be accurately analyzed using freely available software. As proof of concept, we leveraged the sequencing capacity of next generation sequencing to obtain high-fold coverage of genes with prognostic significance in leukemia via hybrid-capture enrichment (without gene-specific PCR amplification). In addition to correctly identifying SNVs, insertions, and deletions in commonly tested genes, we were able to reliably detect the t(15;17), t(9;22), and t(4;11) translocations in the NB4, K562, and MV4–11 cell lines, respectively, at the DNA level. Further, by capturing only one partner gene in a translocation (RARA, ABL1, MLL, etc) this methodology allows for the detection of novel partner genes or breakpoints, giving it a significant advantage over conventional break-apart FISH probes. Finally, we demonstrate the clinical utility of this methodology by identifying mutations and translocations in a patient-derived bone marrow sample.

Materials and methods

Cell Line and Patient Selection

To confirm the validity of this methodology we first sequenced DNA from five previously-characterized cell lines, including K562, NB4, MV4–11, kasumi-1, and OCI-AML3.23, 24, 25, 26, 27, 28 Bone marrow-derived DNA from a single anonymized patient with newly diagnosed AML and a known t(9;11) translocation was used to demonstrate the utility of the methodology on clinical material. The use of remnant patient samples for this study was approved by the University of Utah Investigational Review Board (IRB #7275).

Probe Design

120-bp cRNA Agilent SureSelect probes (Santa Clara, CA, USA) were designed to 2 × tile across genes of interest in leukemia biology and prognosis (Table 1). To aid in the discovery of translocations, both introns and exons were covered by the probes with the exceptions of RUNX1 and MKL1, in which only regions including exons/introns 1–3 and 1–5, respectively, were captured. Repeat masking was not performed to avoid the possibility of missing translocations occurring in repeat areas. The total size of the capture region was 1.0 Mb.
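As an illustration of what 2 × tiling with 120-bp probes implies, the following Python sketch lays out probe intervals over a region. It assumes that "2 ×" means each interior base is covered by two probes, so that consecutive probes start half a probe length (60 bp) apart; the function name and the half-open coordinate convention are hypothetical, and this is not the vendor's probe design tool.

```python
def tile_probes(start, end, probe_len=120, depth=2):
    """Return half-open (probe_start, probe_end) intervals tiling [start, end).

    Assumption: a tiling depth of 2 means probes are offset by probe_len // 2,
    so every interior base is covered by two probes.
    """
    step = probe_len // depth
    probes = []
    pos = start
    while pos + probe_len <= end:
        probes.append((pos, pos + probe_len))
        pos += step
    return probes


# Toy 360-bp region: probes start at 0, 60, 120, 180, and 240.
probes = tile_probes(0, 360)
print(len(probes), probes[:2])
```

In this toy region the first and last 60 bp are covered by a single probe, which is one reason flanking sequence is often added to a design.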

Table 1 Gene capture coordinates (build 37 reference)

Capture and Sequencing

Approximately 2 μg of genomic DNA, extracted from cell lines or bone marrow, was fragmented to segments of between 250 and 500 bp using the Covaris S2 Sonolab (Covaris, Woburn, MA, USA). The resulting DNA was then end repaired and ligated to Illumina adapters (Illumina, San Diego, CA, USA) per the manufacturer's protocol. Sequence indexes were added to the cell line samples to permit all five cell lines to be sequenced in a single flow cell. Small fragments of <100 bp and unligated adapters were removed from the mix by AMPure purification (Agencourt Bioscience, Beverly, MA, USA). Sequencing libraries were then hybridized with SureSelect probes per the manufacturer's instructions. Streptavidin-coated paramagnetic beads were then added and allowed to bind the biotinylated capture probes. An external magnetic field was then applied and unbound DNA removed. The bound, captured DNA was finally eluted from the magnetic beads by digestion of the cRNA capture probes and purified. The enriched DNA was then amplified using universal primers targeting the paired-end adapters, clusters generated, and DNA sequenced on an Illumina HiSeq instrument with 2 × 101 bp reads in the case of cell line DNA, or an Illumina GAIIx with 2 × 60 bp reads for bone marrow-derived DNA.

Data Analysis

Base calls and quality scores were provided by the included Illumina software. The resulting FASTQ files were aligned to build 37 of the human reference genome (hg19) using Novoalign or BWA with default parameters.17 Quality metrics including gene coverage were calculated using BedTools.29 SNVs were then called from the aligned sequence data using the UnifiedGenotyper in the GATK package.30 The list of SNVs was then referenced against dbSNP (build 130) to flag known variants. The SNVs were further filtered by removing all SNVs occurring within non-coding regions. Indel events were identified using both Pindel and the GATK Indel Genotyper V2.0. Translocations were found by first using Breakdancer to identify clusters of paired-end reads in which the two members mapped to different chromosomes; these calls were then verified using Slope to confirm the presence of chimeric single-end reads in the vicinity (10 kb) of the Breakdancer calls.20, 21, 22 Default parameters were used with both programs. The above actions were combined using UNIX shell scripts to create an analysis pipeline (summarized in Figure 3).
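The SNV filtering steps described above (flag known dbSNP variants, then drop non-coding calls) can be sketched in a few lines of Python. This is only an illustration and not the shell-script pipeline used in the study; the dictionary layout, the `known_variants` set, and the `coding_regions` intervals are hypothetical stand-ins for parsed VCF, dbSNP, and gene annotation data.

```python
def filter_snvs(snvs, known_variants, coding_regions):
    """Drop SNVs outside coding regions and flag those already catalogued.

    snvs: list of dicts with 'chrom', 'pos', and 'alt' keys (hypothetical layout).
    known_variants: set of (chrom, pos, alt) tuples, standing in for dbSNP.
    coding_regions: list of (chrom, start, end) inclusive intervals.
    """
    kept = []
    for snv in snvs:
        chrom, pos = snv["chrom"], snv["pos"]
        in_coding = any(
            c == chrom and start <= pos <= end for c, start, end in coding_regions
        )
        if not in_coding:
            continue  # non-coding calls are removed
        key = (chrom, pos, snv["alt"])
        kept.append(dict(snv, known=key in known_variants))  # flag, do not remove
    return kept


# Toy inputs; coordinates are illustrative only.
snvs = [
    {"chrom": "chr4", "pos": 150, "alt": "A"},
    {"chrom": "chr4", "pos": 180, "alt": "T"},
    {"chrom": "chr4", "pos": 900, "alt": "G"},  # outside the coding interval
]
result = filter_snvs(snvs, {("chr4", 180, "T")}, [("chr4", 100, 200)])
print(result)
```

Known variants are flagged rather than discarded in this sketch, so that a reviewer can still see them in the final list.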

Figure 3

Data analysis pipeline. FASTQ files containing sequence and quality scores were output from the Illumina HiSeq and aligned to the human reference genome (build 37/hg19) using either BWA or Novoalign on a server cluster. The aligned data was then stored as a sorted BAM file and analyzed for SNVs, indels, and translocations. SNVs were called using the Unified Genotyper function of the GATK package. SNVs were further filtered by flagging known polymorphisms in dbSNP (build 130) and by removing SNVs occurring in non-coding regions. Small and medium size indels (<100 bp) were identified using Pindel and the GATK Indel Genotyper V2 software packages with default parameters. Indels occurring outside of coding regions or splice sites were ignored. Translocations were identified by first running Breakdancer to identify paired-end reads in which one end mapped to a gene in the capture region and the other did not. As this methodology is subject to considerable noise, largely because of sequence repeats and areas of homology, we then performed a second level of verification using Slope to find chimeric single end reads within the regions identified by Breakdancer. Finally, results from all three branches of the analysis pipeline were merged into a single variant call format (VCF) file.

Results

Quality Metrics

Total capture efficiency (percentage of total reads that mapped to the capture region) ranged from 5.7 to 14.9%. On average, each gene had 150±69 × coverage (range 0–2215) in the five cell lines sequenced (Supplementary Figure 1 and Supplementary Table 1). Genes with the lowest fold coverage included CEBPA (41±12 × average coverage) and NOTCH1 (37±12 × average coverage). These genes contain a higher percentage of regions with increased GC content (>70% GC) compared with the other captured genes, likely reducing the capture or library amplification efficiencies. By comparing GC content to fold coverage across the GC-rich CEBPA gene, we determined that optimal coverage is achieved with a GC content <70% (averaged over 200 bp increments) (Supplementary Figure 2). GC content >70% results in a considerable decrease in fold coverage, quickly leading to areas of zero coverage.
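The GC analysis described above amounts to computing the GC fraction in consecutive 200-bp increments and noting which windows exceed 70%. A minimal Python sketch follows; the function names are hypothetical, the input is a plain sequence string rather than a reference FASTA, and the toy sequence is artificial.

```python
def gc_windows(seq, window=200):
    """Return (window_start, gc_fraction) for consecutive non-overlapping windows."""
    seq = seq.upper()
    out = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]  # the final window may be shorter
        gc = sum(1 for base in chunk if base in "GC")
        out.append((start, gc / len(chunk)))
    return out


def high_gc_windows(seq, window=200, threshold=0.70):
    """Return start positions of windows whose GC fraction exceeds the threshold."""
    return [start for start, frac in gc_windows(seq, window) if frac > threshold]


# Artificial sequence: one all-GC window followed by one all-AT window.
toy = "GC" * 100 + "AT" * 100
print(gc_windows(toy))
print(high_gc_windows(toy))
```

Windows flagged this way could then be compared with per-base coverage to see where capture is likely to fail.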

Identification of SNVs and Indels

We employed a data analysis pipeline composed of freely available software to align and analyze next generation sequencing data. Single base substitutions were called using the GATK Unified Genotyper and then compared with dbSNP to exclude known variants. Results were further filtered to remove SNVs occurring in non-coding regions. To identify indels, we used both Pindel and the GATK Indel Genotyper V2, and found that although the two methods produced identical results in the majority of calls, there were differences. For example, we compared both methods using sequence data from the 97-kb (including introns) FLT3 gene in the MV4–11 AML cell line. Using default parameters, the GATK Indel Genotyper V2 identified 41 indels in the region whereas Pindel identified 66. Of these calls, the majority (35) were shared between the two programs; however, the 30-bp FLT3 internal tandem duplication (ITD) (the only orthogonally validated indel) was identified by Pindel only.
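A comparison of two indel call sets of this kind reduces to set operations. The Python sketch below is illustrative only: it assumes calls can be matched exactly as tuples, which is a simplification because different callers may report the same indel at slightly shifted positions, and the function names are hypothetical. The second function simply applies the arithmetic to the counts reported above.

```python
def compare_calls(calls_a, calls_b):
    """Partition two call sets into shared and caller-specific calls.

    Each call is a hashable tuple such as (chrom, pos, type, length);
    exact matching is an assumption made for this sketch.
    """
    a, b = set(calls_a), set(calls_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


def summarize_counts(n_a, n_b, n_shared):
    """Derive caller-specific and combined counts from totals and overlap."""
    return {
        "only_a": n_a - n_shared,
        "only_b": n_b - n_shared,
        "union": n_a + n_b - n_shared,
    }


# Counts reported for FLT3 in MV4-11: 41 (GATK), 66 (Pindel), 35 shared.
print(summarize_counts(41, 66, 35))
# {'only_a': 6, 'only_b': 31, 'union': 72}
```

Under these counts, 6 calls were specific to the GATK tool and 31 to Pindel, for 72 distinct calls overall.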

By applying this methodology, we correctly identified all published indels and single base mutations (SNVs) occurring within the five cell lines (Table 2). For example, we identified previously published mutations, including the 4-bp type A NPM1 exon 12 insertion in OCI-AML3, the 30-bp exon 14 FLT3 ITD in MV4–11, and the D822K KIT mutation in exon 17 of kasumi-1. Coverage across gene regions of clinical interest such as the NPM1 mutation, FLT3 ITD, and KIT mutations was then determined and averaged across all five cell lines to establish a coverage estimate for the identification of such mutations (Figure 4a and b). Average coverage within a 1-kb region flanking the NPM1 insertion and FLT3 ITD was 174 × and 106 × , respectively. Coverage within the CEBPA gene was too low to permit reliable identification of mutations. Other common mutations such as rs16754 in WT1 and rs11554137 in IDH1 were not observed.31 Additional mutations of unknown clinical significance are included in Supplementary Table 2.

Table 2 Next generation sequencing findings in commonly tested genes
Figure 4

Coverage of clinically important genes and translocation areas. Blue: NB4, red: MV4–11, green: OCI-AML3, purple: K562, and orange: kasumi-1. (a) FLT3 coverage for all cases in the region of the FLT3 ITD. Average coverage was 116-fold. (b) NPM1 coverage in the area surrounding exon 11 that generally harbors insertions. Average coverage was 174-fold. (c) Coverage of 1 kb in the ABL1 gene flanking the translocation site (middle) in K562 cells. Note that coverage in K562 increases dramatically at the breakpoint, likely indicating a copy number change in this region. (d) In contrast, the 1-kb area surrounding the t(15;17) RARA breakpoint in NB4 cells shows no evidence of copy number change.

Identification of Translocations

Recurrent translocations were identified in three of the three cell lines in which one translocation partner was within the targeted capture area. Of the remaining two cell lines sequenced, the t(1;18) (p11;q11) described in OCI-AML3 is not considered a recurrent translocation and was not covered by the capture probes; the RUNX1 breakpoint in kasumi-1 occurred outside of the capture region (the RUNX1 capture region was truncated because of design limitations) and consequently was not sequenced.23, 32 Both the major and reciprocal translocations were identified in the K562 t(9;22) and NB4 t(15;17) cell lines, whereas only the major translocation was identified in the MV4–11 t(4;11) cell line. All translocations identified from next generation sequencing data were then validated by PCR and Sanger sequencing (verified breakpoint contigs are included in Supplementary Table 3).33 Furthermore, primers for the specific translocations were tested on additional AML cases and showed no evidence of non-specific amplification, ruling out the possibility of incorrectly identifying repeat regions as translocations. Coverage in the translocation areas was then calculated to determine the necessary coverage required for finding translocations. In the case of the five successful translocation identifications (in three cell lines) average coverage was 176 × within a 1-kb region of the breakpoint (Figure 4c and d).

By applying a second level of software analysis beyond paired-end read mapping, only a single candidate translocation was identified in each cell line and all candidates proved to be correct. We initially applied Breakdancer to aligned data to find candidate translocation events based on paired-end reads in which one end mapped to a targeted translocation region (ABL1, EVI1, CBFB, MKL1, MLL, NUP214, RARA, and RUNX1) and the mate did not. We then evaluated areas surrounding the coordinates of putative translocations with the Slope software to verify the presence of chimeric single end reads spanning the translocation breakpoint (Figure 5). For each cell line, Breakdancer identified an average of 75 high-scoring (score >90) putative translocations; however, with the application of single-end read verification, this number was reduced to one translocation in MV4–11 and two translocations (major and reciprocal) in K562 and NB4.
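The two-level filtering described above can be sketched as follows in Python. This is a simplified illustration, not the Breakdancer or Slope code: the candidate and read records are hypothetical dictionaries, the score cut-off of 90 and the 10-kb window are the values quoted in the text, and "support" is reduced to the presence of at least one chimeric read near the candidate position.

```python
def verify_candidates(candidates, chimeric_reads, min_score=90, window=10_000):
    """Keep high-scoring paired-end candidates that have nearby chimeric read support.

    candidates: dicts with 'chrom', 'pos', 'score' (hypothetical paired-end calls).
    chimeric_reads: dicts with 'chrom', 'pos' (hypothetical breakpoint-spanning reads).
    """
    verified = []
    for cand in candidates:
        if cand["score"] <= min_score:
            continue  # first level: discard low-scoring paired-end calls
        supported = any(
            read["chrom"] == cand["chrom"]
            and abs(read["pos"] - cand["pos"]) <= window
            for read in chimeric_reads
        )
        if supported:  # second level: require a chimeric single-end read nearby
            verified.append(cand)
    return verified


# Toy data; positions are illustrative only.
candidates = [
    {"chrom": "chr9", "pos": 50_000, "score": 99},   # supported below
    {"chrom": "chr9", "pos": 900_000, "score": 99},  # no chimeric read nearby
    {"chrom": "chr9", "pos": 51_000, "score": 60},   # fails the score cut-off
]
chimeric_reads = [{"chrom": "chr9", "pos": 52_500}]
print(verify_candidates(candidates, chimeric_reads))
```

Only the first candidate passes both levels in this toy input, mirroring how a large list of paired-end hits can collapse to one or two verified events.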

Figure 5

Identification of a t(9;22) translocation in the K562 cell line. (a) As only the ABL1 gene was directly targeted, we first used Breakdancer to identify reads in which one paired read mapped to ABL1 and the other did not. This produced a table of 150 hits (partial list shown), the majority representing matches to repeat regions. Of the 150 hits, 11 (including the actual breakpoint) had maximal scores of 99. (b) Breakdancer coordinates were then passed to Slope to perform single end chimeric read verification. Only a single high-scoring hit was identified using this methodology. (c) These findings were subsequently verified by constructing PCR primers flanking the proposed translocation and then Sanger sequencing the products. To demonstrate specificity we attempted to amplify the same translocation from other cell lines as shown.

Patient Case

To demonstrate that the same methodology can be applied to lower quality clinical diagnostic material, we performed the same capture/sequencing steps using DNA derived from the non-enriched bone marrow of an anonymized patient with newly diagnosed AML. Previous testing showed an MLL rearrangement using break-apart FISH probes, and the partner was subsequently confirmed to be AF9, t(9;11), using a panel of common MLL fusion partner probes. Testing for common AML gene mutations including NPM1, FLT3 (ITDs and D835), CEBPA, and KIT (exons 8 and 17) was performed at ARUP Laboratories and all produced negative results (data not shown). Next generation sequencing of the captured DNA demonstrated no clinically significant mutations (no previously described pathogenic mutations in coding regions), consistent with the findings obtained by conventional methods. We then used the same data analysis pipeline described for the cell line data to identify the t(9;11) MLL-AF9 translocation. A single translocation event occurring within intron 8 (chr11: 118354598) of MLL and intron 23 (chr9: 20355515) of AF9/MLLT3 was identified. This breakpoint was verified by standard PCR and Sanger sequencing, and was not present by PCR in other AML cases. Coverage analysis showed similar findings to the cell line data; genes including KIT, IL3, and RUNX1 showed the highest fold coverage whereas genes such as CEBPA and NOTCH1 exhibited low coverage. Coverage within the MLL gene at the translocation area was 300 × .

Discussion

Here we demonstrate a relatively simple method based on hybrid capture and next generation sequencing for the simultaneous identification of single gene mutations (including SNVs and indels) and translocations in leukemia. Using DNA from cell lines with previously characterized findings, we identified all published mutations occurring within genes on the capture panel, without false positives. We further identified translocations in three of three cell lines and in one patient sample (one of one) by analyzing both paired-end and single-end read data. Coverage data indicated that 177 × coverage was sufficient for reliable translocation detection, whereas slightly lower-fold coverage, 150 × , allowed for detection of SNVs and indels. Together these data demonstrate that targeted next generation sequencing is a viable clinical laboratory method that has the potential to replace a number of conventional methods such as Sanger sequencing, capillary-based sizing, and FISH for the detection of clinically significant DNA mutations.

Targeted clinical next generation sequencing methods offer considerable advantages compared with standard laboratory methods; however, some obstacles remain before clinical next generation sequencing enters mainstream use. We and others have found that next generation sequencing data analysis is the greatest impediment to use.34 For example, the single HiSeq lane containing the five indexed cell lines produced 28 GB of raw data, took 50 CPU hours to align, and required some basic knowledge of bioinformatics to analyze. We relied on freely available, peer-reviewed tools such as GATK, Pindel, Breakdancer, and Slope to identify mutations within aligned data and used Perl scripts to select mutations of clinical interest. In our hands, these methods showed a high sensitivity and specificity in detecting clinically significant SNVs and indels in cell lines where the mutations were present in at least 50% of the sequenced cells. We note that the detection of medium sized indels, especially the FLT3 ITD, is difficult. Although the FLT3 ITD was identified in MV4–11 cells using Pindel, it could not be detected by the more general tools included in the GATK. This fact further highlights that multiple software tools will likely be required to analyze the full spectrum of mutations in clinical next generation sequencing samples. The limit of sensitivity remains to be determined for cases in which pathogenic SNVs and indels in leukemic cells are diluted by a larger population of normal cells, such as in the case of some AML with low-blast counts, or post-chemotherapy samples. In theory, blasts may be enriched by initial flow cytometry sorting before capture, but such methods remain untested.

We used simple 2 × tiling without repeat masking in designing capture probes, which resulted in an average of 9.8% of reads mapping to the targeted reference. This level of enrichment proved adequate for the detection of SNVs and indels in most genes; however, several genes such as CEBPA exhibited poor capture. We found that regions with high-GC content (>70%) exhibited low coverage. Optimal coverage was seen in the range of 20–69% GC content. Clearly the efficient capture of GC rich areas is problematic with current methods. Possible solutions include re-designing probe spacing or sizing to better match melting temperatures among probes and changing PCR conditions used in library construction.

Using this methodology we identified DNA-level translocations by capturing only one partner gene, thereby allowing for the detection of variant translocations that may be difficult to identify by FISH. For example, we captured only the MLL gene, but were able to identify both the t(4;11) translocation in MV4–11 cells as well as a patient t(9;11) translocation with single base accuracy by looking for sequencing reads that spanned non-adjacent segments of the genome. This method essentially exploits the ‘off-target’ reads, or areas of ‘shoulder coverage’, inherent to capture-based methods, but not present in PCR-based enrichment strategies. Although this approach worked successfully in the majority of cases, it should be noted that translocation sites often contain areas of homology or repeats that would make paired end or chimeric reads difficult to align.35 Therefore, even with sufficient coverage this method may not be capable of identifying all translocation events, and additional methodologies such as the inclusion of mate pair libraries may be required.36

Unlike current methods employed in the clinical molecular oncology laboratory, target-capture-based next generation sequencing offers greatly increased scalability, requires less technician labor, and is becoming less expensive. For example, mutations in many genes recently implicated in leukemia prognosis, such as CEBPA and DNMT3A, occur throughout the coding region, and require sequencing of multiple PCR products and/or exons for full evaluation.5, 37 Although cumbersome by Sanger sequencing, obtaining sequence data over large coding areas and across multiple exons is fairly simple using target capture, requiring only the in silico design of sequence-specific capture probes and minimal optimization for most genes (depending on GC content as described earlier). It is also straightforward to add additional genes to a capture panel to accommodate new prognostic and diagnostic markers. Finally, the procedures involved in capture-based next generation sequencing can be highly automated and require minimal technician time for set up and running compared with standard methods.

In summary, we present proof of concept data showing that targeted next generation sequencing can be used in the clinical setting to detect prognostically significant mutations and translocations in leukemia. This methodology has the potential to replace a variety of more labor-intensive methods currently used to detect gene mutations and translocations in the clinical laboratory. In addition, we present a framework for the automated analysis of clinical next generation sequencing data using freely available software tools. Although we present data demonstrating the clinical utility of targeted next generation sequencing in leukemia, similar methods could be applied to solid tumors, for the simultaneous detection of both gene mutations and translocations, such as ERG-TMPRSS2 fusions in prostate cancer or EML4-ALK fusions in non-small cell lung cancer.38, 39 By identifying such rearrangements at diagnosis with single base accuracy from DNA, patient-specific primers can be created to monitor for subsequent disease recurrence by more sensitive real-time PCR methods.40 This method could in theory allow for disease monitoring in solid tumors similar to the use of BCR-ABL1 quantitative PCR in chronic myelogenous leukemia. Further, we have previously used a similar approach to identify viral insertion sites with DNA derived from formalin-fixed paraffin-embedded tissue, suggesting that the methodology described herein may be amenable to archival material.41 Finally, these methods are largely independent of sequencing platform and should be adaptable to other instrumentation depending on the particular needs of a laboratory.