Genetic lesions are crucial for cancer initiation. Recently, whole genome sequencing, using next generation technology, was used as a systematic approach to identify mutations in genomes of various types of tumors including melanoma, lung and breast cancer, as well as acute myeloid leukemia (AML). Here, we identify tumor-specific somatic mutations by sequencing transcriptionally active genes. Mutations were detected by comparing the transcriptome sequence of an AML sample with the corresponding remission sample. Using this approach, we found five non-synonymous mutations specific to the tumor sample. They include a nonsense mutation affecting the RUNX1 gene, which is a known mutational target in AML, and a missense mutation in the putative tumor suppressor gene TLE4, which encodes a RUNX1 interacting protein. Another missense mutation was identified in SHKBP1, which acts downstream of FLT3, a receptor tyrosine kinase mutated in about 30% of AML cases. The frequency of mutations in TLE4 and SHKBP1 in 95 cytogenetically normal AML patients was 2%. Our study demonstrates that whole transcriptome sequencing leads to the rapid detection of recurring point mutations in the coding regions of genes relevant to malignant transformation.
Acute myeloid leukemia (AML) is the most frequent hematological malignancy in adults, with an annual incidence of 3 to 4 cases per 100 000 individuals. Despite the increasing knowledge about the molecular pathology of AML, the prognosis remains poor, with a 5-year survival of only 25–30%. Chromosomal aberrations in tumor cells are found in approximately half of the AML patients, whereas the other half of the patients has a normal karyotype (cytogenetically normal-AML).1 Even though a growing number of submicroscopic genetic lesions is identified in AML, about 25% of cytogenetically normal-AML patients do not carry any of the currently known mutations. The list of frequently affected genes includes the receptor tyrosine kinase FLT3, the transcription factor CEBPA, the human trithorax homolog and histone methyltransferase MLL and nucleophosmin (NPM1).2, 3, 4, 5, 6, 7
So far, most genes found to be mutated in AML were identified through a candidate gene approach, because of their involvement in translocations or in hematopoietic differentiation. For example, CEBPA knockout mice show a block in myeloid differentiation, and both MLL and NPM1 were initially found to be involved in fusion genes that resulted from chromosomal translocations in leukemia patients.5, 6, 7, 8, 9
With the advent of next generation sequencing technology, the unbiased detection of tumor-specific somatic mutations became possible.10, 11, 12, 13, 14, 15 Sequence analysis of an AML genome resulted in the identification of recurring mutations in the gene IDH1, encoding the enzyme isocitrate dehydrogenase 1.11 Metabolite screening of AML samples revealed that the related enzyme IDH2 is another mutational target.16 Despite its technical feasibility, whole genome sequencing is still cost intensive, and therefore several alternative approaches of targeted sequencing have been proposed, like the sequencing of coding regions. Although the size of a diploid human genome is about 6 Gbp, the transcriptome, as defined by the combined length of all mRNAs in a cell, is only 0.6 Gbp in size. This figure is based on the estimate that a cell contains about 300 000 transcripts, with an average length of 2000 bases.17, 18 Sequencing of only a few gigabases of the transcriptome should allow mutation detection in a large proportion of transcribed genes. Here we report that sequencing of an AML tumor and the corresponding remission transcriptome allowed us to analyze approximately 10 000 genes and to identify five tumor-specific somatic mutations.
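The size estimate above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the figures are the ones quoted in the text, and the constant and function names are made up for the example.

```python
# Figures quoted in the text; all names here are illustrative.
TRANSCRIPTS_PER_CELL = 300_000       # estimated mRNA molecules per cell
MEAN_TRANSCRIPT_LENGTH = 2_000       # average transcript length in bases
DIPLOID_GENOME_BP = 6_000_000_000    # about 6 Gbp


def transcriptome_size_bp(n_transcripts: int, mean_length: int) -> int:
    """Combined length of all mRNAs in a cell, in bases."""
    return n_transcripts * mean_length


size_bp = transcriptome_size_bp(TRANSCRIPTS_PER_CELL, MEAN_TRANSCRIPT_LENGTH)
print(f"transcriptome size: {size_bp / 1e9:.1f} Gbp")
print(f"fraction of diploid genome: {size_bp / DIPLOID_GENOME_BP:.0%}")
```

Under these assumptions the transcriptome is roughly one tenth the size of the diploid genome, which is why a few gigabases of sequence can already cover many transcribed genes several times over.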
Materials and methods
A diagnostic bone marrow sample was collected from a 69-year-old patient, diagnosed with AML M1 in May 2008. The patient was included in the AML Cooperative Group clinical trial, and informed consent and ethical approval for scientific use of the sample including genetic studies were obtained. After induction therapy using the sequential high-dose cytosine arabinoside and mitoxantrone (S-HAM) protocol, complete remission was achieved. After leukocyte recovery in July 2008, a remission sample from peripheral blood was taken.
Approximately 50 × 10^6 cells from each sample were used for mRNA extraction using Trizol (Invitrogen, Carlsbad, CA, USA). The sequencing library was prepared using the mRNA-Seq sample preparation kit (Illumina, San Diego, CA, USA). In brief, mRNA was selected using oligo-dT beads (Dynabeads, Invitrogen). The mRNA was then fragmented using metal ion hydrolysis and reverse transcribed using random hexamer primers. Subsequent steps included end repair, adapter ligation, size selection and polymerase chain reaction enrichment.
Short-read alignment and consensus assembly were performed using the BWA (v.0.5.5) sequence-alignment program,19 with the default parameters and interactive trimming of low quality bases at the end of reads (cut-off quality value q=15). We used an expanded reference sequence comprising the human genome assembly (build NCBI36/hg18) and all annotated splice sites extracted from the University of California Santa Cruz (UCSC) genome browser-known gene track. In total, we generated 127 115 919 paired-end reads of 36 bp length for the AML sample, of which 95.08% aligned to the reference sequence, and 187 782 678 paired-end reads for the remission sample, of which 82% aligned to the reference. Read mapping, subsequent assembly and variant calling were performed using the resequencing software packages BWA and SAMtools.19, 20 During alignment, 31.27 and 39.81% apparently duplicated reads were removed from the AML and remission sample, respectively.
Distribution of reads across exonic and non-exonic regions
To determine the success of the RNA library preparation, we calculated the percentage of reads matching known exons from the UCSC genome browser. For the AML sample, ∼63% of reads aligned to exons, ∼28.5% to introns and ∼7.5% to intergenic regions, whereas for the remission sample, ∼73.5% of reads aligned to exons, ∼20.5% to introns and ∼6% to intergenic regions (Figure 1b). The relatively high proportion of intronic reads may stem from unspliced mRNAs. Variable proportions of intronic and exonic reads were observed between different preparations from the same samples, indicating that minor differences in RNA concentration and quality might strongly influence the competitive binding of shorter spliced and longer incompletely spliced mRNAs to oligo-dT beads. The values varied between the different chromosomes, and the number of reads mapping to exons was correlated with overall gene density on the chromosome (Supplementary Figure S1).
Expression values were calculated as RPKM (reads per kilobase of gene model per million mapped reads).21 In brief, the number of uniquely mapping reads (BWA mapping quality >0, ∼75 to 85% of reads for both samples) for each gene was counted and then normalized by gene length and the total number of reads generated in the experiment. As the reference set, we used a non-redundant gene set based on the Ensembl gene annotations by merging all annotated transcripts from the same gene into a single ‘maximum coding sequence’. This set contained 35 876 genes. Exonic regions that were shared by two or more different genes (for instance sense and anti-sense transcripts or non-coding RNAs within exons) were excluded and not used for RPKM calculation, as reads from these regions cannot be unambiguously assigned to single genes.
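The RPKM normalization described above can be written as a single formula, RPKM = 10^9 × C / (N × L), where C is the number of uniquely mapping reads assigned to the gene, L is the length of the gene model in bases and N is the total read count used for normalization. The following is a minimal sketch of that formula, not the pipeline used in the study; the function and parameter names are hypothetical.

```python
def rpkm(read_count: int, gene_length_bp: int, total_reads: int) -> float:
    """Reads per kilobase of gene model per million reads.

    read_count     -- uniquely mapping reads assigned to the gene (C)
    gene_length_bp -- length of the merged gene model in bases (L)
    total_reads    -- total number of reads used for normalization (N)
    """
    if gene_length_bp <= 0 or total_reads <= 0:
        raise ValueError("gene length and total read count must be positive")
    # Dividing by (L / 1e3) and (N / 1e6) is the same as multiplying by 1e9.
    return read_count * 1e9 / (gene_length_bp * total_reads)


# Made-up example: 500 reads on a 2 kb gene model in a library of 10 million reads.
print(rpkm(500, 2_000, 10_000_000))
```

Because the value is scaled by both gene length and library size, it can be compared between genes of different length and between the tumor and remission libraries.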
Spearman's rank correlation coefficient
Spearman's rank correlation coefficient was calculated from the log2 RPKM values of the tumor and remission sample, using the R package for statistical computing.
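For readers who want to reproduce this step outside R, the sketch below computes Spearman's coefficient in plain Python as the Pearson correlation of tie-averaged ranks. It is a stand-in for the R routine, not the code used in the study. How genes with an RPKM of zero were handled is not stated in the text, so the example assumes strictly positive values; all function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up RPKM values for four genes (assumed > 0 so that log2 is defined).
tumor_rpkm = [1.0, 4.0, 16.0, 64.0]
remission_rpkm = [2.0, 3.0, 50.0, 20.0]
log_tumor = [math.log2(v) for v in tumor_rpkm]
log_remission = [math.log2(v) for v in remission_rpkm]
rho = spearman(log_tumor, log_remission)
print(round(rho, 3))
```

Since the coefficient depends only on ranks, the log2 transformation does not change the result for positive values; it mainly matters for plotting.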
Variant calling was performed using the SAMtools package (v.0.1.5c).20 For the variant filter of SAMtools, we used the following settings: minimum read depth=3; maximum read depth=9999; minimum root mean square mapping quality for single-nucleotide polymorphisms (SNPs)=25; minimum mapping quality of gaps=10; minimum indel score for filtering=25; window size around potential indels=10; window size for filtering dense SNPs=10; maximum number of SNPs allowed in window=2.
Subsequently, we applied additional filters. We required each putative SNP to have (i) a median quality value of the variant bases of at least 20, (ii) the variant allele in at least 15% of all reads covering the position and (iii) at least 10% of the reads showing the variant allele coming from opposite strands.
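The three additional filters can be expressed as a simple predicate over per-site read counts. The sketch below is one possible reading, not the authors' script: the `Pileup` record and its field names are hypothetical, and criterion (iii) is interpreted as requiring that the less frequent strand contributes at least 10% of the variant-supporting reads.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Pileup:
    """Hypothetical per-site summary of the reads covering a putative SNP."""
    variant_base_qualities: List[int]  # base qualities of variant-supporting reads
    total_depth: int                   # all reads covering the position
    variant_forward: int               # variant reads on the forward strand
    variant_reverse: int               # variant reads on the reverse strand


def passes_custom_filters(site: Pileup,
                          min_median_quality: float = 20,
                          min_variant_fraction: float = 0.15,
                          min_minor_strand_fraction: float = 0.10) -> bool:
    n_variant = site.variant_forward + site.variant_reverse
    if n_variant == 0 or site.total_depth == 0:
        return False
    # (i) median quality of the variant bases
    if median(site.variant_base_qualities) < min_median_quality:
        return False
    # (ii) fraction of covering reads that show the variant allele
    if n_variant / site.total_depth < min_variant_fraction:
        return False
    # (iii) assumed reading: support from the less frequent strand
    minor_strand = min(site.variant_forward, site.variant_reverse)
    return minor_strand / n_variant >= min_minor_strand_fraction


good = Pileup([30, 28, 25, 22], total_depth=20, variant_forward=3, variant_reverse=1)
strand_biased = Pileup([30] * 10, total_depth=20, variant_forward=10, variant_reverse=0)
print(passes_custom_filters(good), passes_custom_filters(strand_biased))
```

Filters of this kind aim at typical short-read artifacts: low-quality base calls, rare sequencing errors and strand-specific errors.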
Functional analysis of SNPs was performed with custom Perl scripts using data sets from Ensembl and the UCSC genome browser. Known SNP locations, Ensembl and known gene annotations were used as provided by the UCSC genome browser.
To demonstrate the feasibility of this approach, we selected an AML sample (bone marrow aspirate) and a corresponding remission sample (peripheral blood) for transcriptome sequencing. The patient, a 69-year-old female, presented with de novo AML, with blood counts and bone marrow morphology being consistent with the diagnosis of AML without maturation according to the French-American-British classification (FAB AML M1). After induction therapy, complete remission was achieved. One year after initial diagnosis, the patient relapsed and received an allogeneic bone marrow transplant.
Conventional cytogenetic analysis revealed a normal female karyotype (46, XX). An internal tandem duplication of FLT3, an NPM1 mutation and a partial tandem duplication in the MLL gene were excluded in a routine diagnostic screen. We further investigated whether the tumor sample contained somatic copy number variations using the HumanOmni1-Quad chip (Illumina), containing probes for approximately 1 million loci. We found no evidence of somatic loss-of-heterozygosity indicating the presence of a normal diploid genome. A total of 29 copy number changes were present in both the tumor and remission sample. We compared the copy number variations with those contained in the database of genomic variants and 1600 controls from a population-based study. All the copy number variations were present at least once in these cohorts.
We sequenced 4.35 and 5.54 Gbp of the tumor and remission sample, respectively, on an Illumina GA IIx sequencer (Illumina). We used the NCBI36/hg18 genome assembly as reference sequence and compiled a non-redundant mRNA set from the Ensembl transcripts database, resulting in a set of 35 876 genes. Read mapping to the reference genome was performed with the BWA software.19 Approximately 95 and 82% of the reads mapped to the reference, of which 63 and 74% mapped to exonic sequences in the tumor and remission sample, respectively (Figure 1b, Supplementary Figure S1).
The average sequence read depth for every gene was first calculated to obtain the number of genes suitable for mutation detection. The read depth per gene ranged from 0 to over 1000. A total of 10 152 genes had an average read depth of at least sevenfold and 6989 genes had an average read depth of 20 or greater in both samples. These numbers were only slightly higher when the tumor and remission samples were analyzed individually, indicating that the gene expression pattern was comparable even though the tumor sample was a bone marrow aspirate with more than 90% blasts, whereas the remission sample was from peripheral blood with a normal white blood cell count (Figure 1a). The comparability was supported by a high correlation of the gene expression levels between the samples as shown by a Spearman Rank correlation coefficient of 0.82 (Figure 2a).
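Counting the genes that are suitable for mutation detection amounts to requiring a minimum average read depth in both samples. A minimal sketch follows; the gene identifiers and depth values are invented placeholders, and the function name is hypothetical.

```python
from typing import Dict, Set


def genes_covered_in_both(tumor_depth: Dict[str, float],
                          remission_depth: Dict[str, float],
                          min_depth: float) -> Set[str]:
    """Genes whose average read depth reaches min_depth in both samples."""
    return {
        gene
        for gene, depth in tumor_depth.items()
        if depth >= min_depth and remission_depth.get(gene, 0.0) >= min_depth
    }


# Placeholder values for illustration only, not data from the study.
tumor = {"GENE_A": 25.0, "GENE_B": 8.0, "GENE_C": 3.0, "GENE_D": 40.0}
remission = {"GENE_A": 30.0, "GENE_B": 9.0, "GENE_C": 12.0}

print(sorted(genes_covered_in_both(tumor, remission, min_depth=7)))
print(sorted(genes_covered_in_both(tumor, remission, min_depth=20)))
```

A gene that is absent from one sample, or covered too thinly there, is dropped, because a somatic call needs enough reads in the remission sample to rule out a germline variant.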
Single-nucleotide variants (SNV) were called with the SAMtools software package,20 using mainly the default parameters and custom filters applied at later stages. To achieve a low false-positive rate, we required a minimum read depth of 7 in both samples. We set this threshold because there is a detection rate of approximately 70% at this read depth.22 For the same purpose, we quality filtered the SNV set of the tumor sample, but used an unfiltered set of the remission sample for comparison (Figure 2b).
Quality filtering in the tumor resulted in a set of 8978 SNVs in coding regions. This compares favorably with approximately 20 000 SNVs that can be found in the entire coding sequence using exome sequencing.23 In the next step, we excluded all coding SNVs that were present in the dbSNP database version 130 or in the exomes of 8 HapMap samples. The remaining 926 sites contained 612 SNVs that led to an amino acid substitution or disrupted canonical splice sites. These 612 SNVs were then compared with the unfiltered calls of the remission sample at the same positions. We excluded all positions with any indication that the same SNV was also present in the remission sample.
This strategy resulted in the identification of 11 candidate SNVs unique to the tumor sample. Capillary sequencing of genomic DNA from both the tumor and the remission sample confirmed five SNVs, which affected the genes RUNX1, TLE4, SHKBP1, XPO7 and RRP8 (Table 1, Figure 3). Two SNVs were false positives, with the same heterozygous SNVs also being present in the genomic DNA of the remission sample, and four SNVs could not be confirmed in the AML sample.
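The filtering cascade (8978 quality-filtered coding SNVs, 926 novel sites, 612 protein-altering SNVs, then comparison with the remission sample) can be sketched as successive set operations. This is an illustration of the logic only: the `Snv` record, the effect labels and the toy coordinates are all hypothetical and do not come from the study.

```python
from typing import List, NamedTuple, Set, Tuple

Site = Tuple[str, int, str]  # (chromosome, position, variant allele)


class Snv(NamedTuple):
    chrom: str
    pos: int
    alt: str
    effect: str  # hypothetical labels: "missense", "nonsense", "splice_site", "synonymous"


# Assumed set of effects that change the protein or a canonical splice site.
PROTEIN_ALTERING = {"missense", "nonsense", "splice_site"}


def tumor_specific_candidates(tumor_snvs: List[Snv],
                              known_sites: Set[Site],
                              remission_sites: Set[Site]) -> List[Snv]:
    """Apply the three exclusion steps in the order described in the text."""
    candidates = []
    for snv in tumor_snvs:
        site = (snv.chrom, snv.pos, snv.alt)
        if site in known_sites:          # already in dbSNP or the HapMap exomes
            continue
        if snv.effect not in PROTEIN_ALTERING:
            continue
        if site in remission_sites:      # any hint of the SNV in the remission calls
            continue
        candidates.append(snv)
    return candidates


# Toy input, invented for the example.
tumor_calls = [
    Snv("chr1", 100, "A", "missense"),
    Snv("chr1", 200, "T", "synonymous"),
    Snv("chr2", 300, "G", "nonsense"),
    Snv("chr3", 400, "C", "missense"),
]
known = {("chr1", 100, "A")}
remission_unfiltered = {("chr3", 400, "C")}

result = tumor_specific_candidates(tumor_calls, known, remission_unfiltered)
print(result)
```

Comparing filtered tumor calls against unfiltered remission calls is deliberately asymmetric: weak evidence in the remission sample is enough to discard a candidate, which keeps the false-positive rate low at the price of some sensitivity.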
RUNX1 (AML1) carried a heterozygous stop mutation in the Runt domain. RUNX1 is the fusion partner of RUNX1T1 (eight twenty one (ETO)) in the recurring t(8;21) (q22;q22) translocation present in 8–13% of de novo AML cases.24 In addition, point mutations in RUNX1 have recently been described in AML, in particular AML secondary to myelodysplastic syndrome, radiation exposure or chemotherapy, at a frequency of 8–10%.25
TLE4 carried a missense mutation at position 511 (N511S). TLE4 is located on chromosome 9 band q34, which is frequently deleted in AML with t(8;21) translocations, and is therefore a putative tumor suppressor gene. Interestingly, the TLE4 protein interacts with RUNX1, and haploinsufficiency of TLE4 was shown to collaborate with the RUNX1/RUNX1T1 fusion to rescue cells from apoptosis.26
The third tumor-specific SNV resulted in a missense mutation (V89I) in SHKBP1 (also known as SETA binding protein 1, SB1). Through SETA, SHKBP1 interacts with CBL,27 a ubiquitin ligase that regulates the degradation of FLT3. CBL mutations, which result in the increased activity of FLT3, have recently been described in AML and myelodysplastic syndrome.28 Thus, it is likely that SB1 mutations affect FLT3 signaling. SHKBP1 overexpression in cell lines has antiapoptotic effects.29
The fourth and fifth AML-specific mutations were missense mutations in XPO7 (a member of the importin beta superfamily) and RRP8 (a methyltransferase, possibly involved in ribosomal RNA processing).
Although recurring mutations in RUNX1 are known to occur in AML, mutations in TLE4 or SHKBP1 have not been described before. We therefore screened the complete coding sequence of TLE4 and SHKBP1, as well as of RUNX1, in 95 cytogenetically normal-AML patients by capillary sequencing of genomic DNA (Table 2). As expected, we found several patients with RUNX1 mutations (9/95; 9.5%): nine missense mutations (two patients with two mutations each), one nonsense mutation and a 5 bp insertion. We also discovered two missense mutations in TLE4 and two missense mutations in SHKBP1 (Table 2), strongly suggesting that both TLE4 and SHKBP1 are mutational targets in AML at a frequency of about 2%. Mutations in TLE4, SHKBP1 and RUNX1 were mutually exclusive in the cohort of 95 cytogenetically normal-AML patients. TLE4 mutations were found in patients with mutations in NPM1 and CEBPA, whereas SHKBP1 mutations were found in combination with mutations in NPM1 and FLT3 (Table 2).
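The frequencies quoted above follow directly from the patient counts. The short sketch below reproduces them, assuming that the two TLE4 mutations and the two SHKBP1 mutations each occurred in two different patients, which is consistent with the stated frequency of about 2%; the function name is hypothetical.

```python
COHORT_SIZE = 95  # cytogenetically normal-AML patients screened


def mutation_frequency(n_mutated_patients: int, n_screened: int) -> float:
    """Fraction of screened patients carrying at least one mutation."""
    if n_screened <= 0:
        raise ValueError("cohort size must be positive")
    return n_mutated_patients / n_screened


# Patient counts: RUNX1 from the text; TLE4 and SHKBP1 assumed to be two patients each.
for gene, n_patients in [("RUNX1", 9), ("TLE4", 2), ("SHKBP1", 2)]:
    print(f"{gene}: {mutation_frequency(n_patients, COHORT_SIZE):.1%}")
```

With only two affected patients per gene, these estimates carry considerable uncertainty and would need confirmation in larger cohorts.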
Our results demonstrate that whole transcriptome sequencing is an efficient method to discover point mutations in AML. Using stringent filtering criteria, we were able to identify just 11 candidate mutations from a total of almost 10 Gbp of primary transcriptome sequence. Five of these mutations were confirmed by sequencing of genomic DNA. Three of these mutations affect genes in pathways involved in AML pathogenesis (Figure 4). Although RUNX1 mutations are known to occur at a frequency of about 8% in AML patients, we describe for the first time TLE4 and SHKBP1 as recurring mutational targets in AML. In summary, our approach proved to be extremely efficient in identifying recurring mutations with a high likelihood of contributing to the pathogenesis of AML.
Overexpression of TLE4 in the RUNX1/RUNX1T1 (AML1/ETO) fusion-positive Kasumi cell line was reported to cause apoptosis and cell death, suggesting that TLE4 may act as a tumor suppressor gene.26 The missense mutations we identified may diminish the function of TLE4 or even act in a dominant negative fashion. The point mutations in SHKBP1, on the other hand, may result in a gain of function, because the antiapoptotic effects of its overexpression classify SHKBP1 as a putative proto-oncogene.29 In AML, mutations in SHKBP1 may disturb the degradation of the FLT3 tyrosine kinase through the interaction of SHKBP1 with SETA and indirectly with the ubiquitin-ligase CBL.27 Although little is known about the protein structure of both TLE4 and SHKBP1, all point mutations found in the present study affect evolutionarily highly conserved domains encoded by neighboring exons (Table 2). Although biochemical assays are required to test whether these missense mutations influence the protein interactions between TLE4 and RUNX1 or SHKBP1 and CBL, in vivo transformation assays are required to elucidate the potential role of these mutations during the onset and progression of AML. Considering the increasing number of recurring mutations that have been identified in AML, it will be very challenging to understand their complex interplay. Apparently, many subtle genetic changes may contribute to the disease through multiple interactions.
Whereas analysis of the two AML genomes required sequencing of over 120 Gbp for each patient and resulted in the detection of 10 to 12 tumor-specific mutations in the gene coding regions in each case,10, 11 our analysis of an AML transcriptome required only the sequencing of 10 Gbp and resulted in the identification of five tumor-specific mutations in the gene coding regions. Thus, our findings demonstrate that whole transcriptome sequencing might be an order of magnitude faster and more cost effective than whole genome sequencing for the detection of point mutations in coding regions of expressed genes. The main limitation of transcriptome sequencing is the representational bias of transcripts. Considering recent reports of alternative cleavage and polyadenylation of oncogenic transcripts,30 sequencing of reverse-transcribed poly-A selected transcripts may not always correctly reflect the original expression levels in the leukemia cells. Moreover, mutations that lead to increased mRNA decay might be missed in the present study. As only expressed mRNAs are sequenced, non-expressed and extremely rare transcripts are not sequenced at all or are not sequenced to sufficient coverage levels for reliable mutation detection. However, this limitation might not greatly affect the ability of this method to detect activating mutations in oncogenes, as these genes would have to be transcribed and translated to mediate their oncogenic effect. Recently, exon-capture techniques became available that provide a more even read depth across protein coding regions, thus allowing an exhaustive mutation analysis. In contrast to whole exome or genome sequencing, transcriptome sequencing provides valuable additional information on gene expression levels and exon composition of transcripts. Apart from mutation detection, transcriptome sequencing could also be used to detect tumor-specific fusion genes and splice variants.
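The "order of magnitude" comparison can be made explicit with the sequencing volumes given in the text. This is a rough sketch that compares raw sequence output only; it ignores differences in library preparation and per-base cost, and the variable names are illustrative.

```python
# Sequencing volumes in Gbp, as quoted in the text.
GENOME_GBP_PER_PATIENT = 120.0      # "over 120 Gbp" per AML genome study
TUMOR_TRANSCRIPTOME_GBP = 4.35
REMISSION_TRANSCRIPTOME_GBP = 5.54

transcriptome_total = TUMOR_TRANSCRIPTOME_GBP + REMISSION_TRANSCRIPTOME_GBP
ratio = GENOME_GBP_PER_PATIENT / transcriptome_total

print(f"transcriptome total: {transcriptome_total:.2f} Gbp")
print(f"genome / transcriptome volume: {ratio:.1f}x")
```

Since 120 Gbp is stated as a lower bound, the ratio of about 12 is itself a lower bound on the difference in sequencing volume.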
Mrozek K, Marcucci G, Paschka P, Whitman SP, Bloomfield CD. Clinical relevance of mutations and gene-expression changes in adult acute myeloid leukemia with normal cytogenetics: are we ready for a prognostically prioritized molecular classification? Blood 2007; 109: 431–448.
Yamamoto Y, Kiyoi H, Nakano Y, Suzuki R, Kodera Y, Miyawaki S et al. Activating mutation of D835 within the activation loop of FLT3 in human hematologic malignancies. Blood 2001; 97: 2434–2439.
Nakao M, Yokota S, Iwai T, Kaneko H, Horiike S, Kashima K et al. Internal tandem duplication of the flt3 gene found in acute myeloid leukemia. Leukemia 1996; 10: 1911–1918.
Reindl C, Bagrintseva K, Vempati S, Schnittger S, Ellwart JW, Wenig K et al. Point mutations in the juxtamembrane domain of FLT3 define a new class of activating mutations in AML. Blood 2006; 107: 3700–3707.
Pabst T, Mueller BU, Zhang P, Radomska HS, Narravula S, Schnittger S et al. Dominant-negative mutations of CEBPA, encoding CCAAT/enhancer binding protein-alpha (C/EBP alpha), in acute myeloid leukemia. Nat Genet 2001; 27: 263–270.
Yu M, Honoki K, Andersen J, Paietta E, Nam DK, Yunis JJ. MLL tandem duplication and multiple splicing in adult acute myeloid leukemia with normal karyotype. Leukemia 1996; 10: 774–780.
Falini B, Mecucci C, Tiacci E, Alcalay M, Rosati R, Pasqualucci L et al. Cytoplasmic nucleophosmin in acute myelogenous leukemia with a normal karyotype. N Engl J Med 2005; 352: 254–266.
Zhang DE, Zhang P, Wang ND, Hetherington CJ, Darlington GJ, Tenen DG. Absence of granulocyte colony-stimulating factor signaling and neutrophil development in CCAAT enhancer binding protein alpha-deficient mice. Proc Natl Acad Sci USA 1997; 94: 569–574.
Caligiuri MA, Schichman SA, Strout MP, Mrozek K, Baer MR, Frankel SR et al. Molecular rearrangement of the ALL-1 gene in acute myeloid leukemia without cytogenetic evidence of 11q23 chromosomal translocations. Cancer Res 1994; 54: 370–373.
Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 2008; 456: 66–72.
Mardis ER, Ding L, Dooling DJ, Larson DE, McLellan MD, Chen K et al. Recurring mutations found by sequencing an acute myeloid leukemia genome. N Engl J Med 2009; 361: 1058–1066.
Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, Greenman CD et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 2010; 463: 191–196.
Pleasance ED, Stephens PJ, O’Meara S, McBride DJ, Meynert A, Jones D et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature 2010; 463: 184–190.
Stephens PJ, McBride DJ, Lin ML, Varela I, Pleasance ED, Simpson JT et al. Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature 2009; 462: 1005–1010.
Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature 2009; 458: 719–724.
Gross S, Cairns RA, Minden MD, Driggers EM, Bittinger MA, Jang HG et al. Cancer-associated metabolite 2-hydroxyglutarate accumulates in acute myelogenous leukemia with isocitrate dehydrogenase 1 and 2 mutations. J Exp Med 2010; 207: 339–344.
Hurowitz EH, Drori I, Stodden VC, Donoho DL, Brown PO. Virtual Northern analysis of the human genome. PLoS ONE 2007; 2: e460.
Velculescu VE, Madden SL, Zhang L, Lash AE, Yu J, Rago C et al. Analysis of human transcriptomes. Nat Genet 1999; 23: 387–388.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009; 25: 1754–1760.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009; 25: 2078–2079.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008; 5: 621–628.
Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008; 18: 1851–1858.
Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM et al. Exome sequencing identifies the cause of a Mendelian disorder. Nat Genet 2010; 42: 30–35.
Peterson LF, Zhang DE. The 8;21 translocation in leukemogenesis. Oncogene 2004; 23: 4255–4262.
Osato M. Point mutations in the RUNX1/AML1 gene: another actor in RUNX leukemia. Oncogene 2004; 23: 4284–4296.
Dayyani F, Wang J, Yeh JR, Ahn EY, Tobey E, Zhang DE et al. Loss of TLE1 and TLE4 from the del(9q) commonly deleted region in AML cooperates with AML1-ETO to affect myeloid cell proliferation and survival. Blood 2008; 111: 4338–4347.
Borinstein SC, Hyatt MA, Sykes VW, Straub RE, Lipkowitz S, Boulter J et al. SETA is a multifunctional adapter protein with three SH3 domains that binds Grb2, Cbl, and the novel SB1 proteins. Cell Signal 2000; 12: 769–779.
Reindl C, Quentmeier H, Petropoulos K, Greif PA, Benthaus T, Argiropoulos B et al. CBL exon 8/9 mutants activate the FLT3 pathway and cluster in core binding factor/11q deletion acute myeloid leukemia/myelodysplastic syndrome subtypes. Clin Cancer Res 2009; 15: 2238–2247.
Liu JP, Liu NS, Yuan HY, Guo Q, Lu H, Li YY. Human homologue of SETA binding protein 1 interacts with cathepsin B and participates in TNF-induced apoptosis in ovarian cancer cells. Mol Cell Biochem 2006; 292: 189–195.
Mayr C, Bartel DP. Widespread shortening of 3′UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 2009; 138: 673–684.
This work was funded by a Deutsche Krebshilfe grant 109031 to PA Greif and SK Bohlander, and by grants from the German Ministry of Research and Education (BMBF; 01GS0876) and the Deutsche Forschungsgemeinschaft (SFB 684-A6) to SK Bohlander.
The authors declare no conflict of interest.
Supplementary Information accompanies the paper on the Leukemia website
Greif, P., Eck, S., Konstandin, N. et al. Identification of recurring tumor-specific somatic mutations in acute myeloid leukemia by transcriptome sequencing. Leukemia 25, 821–827 (2011) doi:10.1038/leu.2011.19
- acute myeloid leukemia
- point mutations
- transcriptome sequencing