Background

Cancer is caused by DNA mutations resulting in single nucleotide variations (SNVs), structural rearrangements and copy number variations (CNVs)1,2. A SNV frequently occurs on a single allele; the impact of a heterozygous SNV will depend on whether the SNV-containing allele is transcribed to RNA. Indeed, a non-transcribed SNV, non-synonymous or silent, could be phenotypically invisible. Clinical therapy-selection biomarkers often assay mutations using DNA as an analyte, such as KRAS assays designed to identify responders to anti-EGFR monoclonal antibody therapy3. However, if the wild-type allele is selectively transcribed, the mutation may not have therapeutic impact and the merit of using a DNA-based assay for clinical decision-making may be problematic.

Given a heterozygous non-synonymous SNV, the phenotypical impact will depend on whether the SNV-containing allele is transcribed to RNA. A heterozygous SNV in a polyploidy region could be functionally invisible due to multiple mechanisms. The mutation-containing allele could be effectively silenced by epigenetic and chromatin modifying mechanisms. Chromosome X in females, for example, contains silenced and non-silenced alleles. Further, a SNV-containing transcript could activate RNA surveillance mechanisms and cause rapid degradation of the mutation-containing transcript. Nonsense-mediated decay (NMD) surveillance, for example, scans transcripts for the presence of a premature termination codons (PTCs) before the last exon and, when found, initiates degradation of such transcripts4. NMD surveillance has been extensively studied in RNA splicing5, where it removes aberrantly spliced transcripts and in microsatellite instability colorectal cells, where NMD blockage up-regulates genes containing somatic mononucleotide repeat mutations causing frameshift open reading frames6. If and to which extent this happens has not been systematically analyzed.

High throughput profiling technologies, such as next generation sequencing (NGS), have enabled the determination of DNA and RNA mutation allele frequency and copy number. Increased DNA copy number results, in general, in increased gene expression7. The allowable DNA allele frequencies of a mutation are discrete values based on the underlying genomic copy number. In haploid regions, mutations have 100% allele frequency. In diploid regions, mutations have allele frequencies of 50 or 100%, in triploid regions it is 33, 66, or 100% and in tetraploid regions 25, 50, 75 or 100%. Experimentally, genomic copy number can be effectively determined from SNP-monitoring oligonucleotide microarrays and NGS genome and exome profiling8,9. Integration of the allele fraction of heterozygous germline SNPs both enables identification of regions containing imbalanced alleles, including loss of heterozygosity (LOH) and mutant allele-specific DNA amplification10 and determination of tumor cell purity and absolute tumor ploidy11,12. Integration of the DNA mutation allele frequency further improves mutation detection and delineation of tumor evolution13,14,15,16.

Studies have examined imbalanced allele RNA expression17. Imprinted alleles can be identified with NGS RNA-Seq reads18. In tumors, heterozygous SNPs can be transcribed at different abundances than in normal cells19 and the genes with imbalanced SNP expression are associated with higher DNA copy number. This raises the possibility that allele-imbalanced DNA amplification leads to a corresponding imbalance in the RNA levels of SNPs. Indeed, a study examining the DNA and RNA mutation allele frequencies in four genes found high correlation between DNA and RNA frequencies13.

Surprisingly, given the importance to biology and clinical oncology, a genome-wide study of the relationship between DNA and RNA mutation allele frequency in tumor samples has not been done. The question of whether mutated alleles are generally transcribed and, if so, how they are represented at the transcript level – imbalanced, compensated to a diploid setting or proportional to the DNA dosage – has not been systematically analyzed with genome-wide approaches.

To answer these questions, we analyzed the presence of mutations in DNA and their corresponding RNA expression on an allele-specific level with a genome-wide approach. We sequenced the broadly used B16F10, 4T1 and CT26 mouse tumor cell lines. These cell lines have the advantages that they contain large copy number variations and thus varying levels of DNA mutation allele frequency and are homogeneous cells without normal cell contamination. We used NGS to profile DNA and RNA and developed data processing algorithms to identify mutations, define absolute copy number and determine both DNA and RNA mutation allele frequencies. Our findings provide the first systematic genome-wide study of DNA and RNA mutation allele frequency in tumor samples and therewith demonstrate that mutations are both transcribed and they are transcribed in proportion to their DNA allele frequency.

Results and discussion

Using the replicate cancer and germline exome resequencing from three different mouse tumor cell lines, we identified high confidence single nucleotide point mutations (SNVs) and DNA copy number variations (CNVs). There are 3023 SNVs in mouse colorectal carcinoma CT26 cells compared to BALB/cJ, 908 in mouse melanoma B16F10 cells compared to C57BL/6 and 293 in mouse breast tumor 4T1 cells compared to BALB/cJ. These results agree with reports demonstrating that human melanomas and colorectal tumors have among the highest mutation rates of primary tumors while breast tumors average an order of magnitude fewer mutations26.

We calculated the DNA mutation allele frequency using the sequence content of the reads overlapping each SNV. As an example, the gene Eif4g2 (eukaryotic translation initiation factor 4 gamma 2) is highly expressed in CT26 cells and contains a T > G heterozygous somatic mutation. The normalized read counts associated with the Eif4g2 mutation for the two replicates are 141 and 133 in the germline sample and 248 and 247 in the CT26 sample (Figure 1A and B). The G mutation is not found in the germline reads (Figure 1A) whereas it is found in 35% and 36% of the DNA reads from the tumor (Figure 1B).

Figure 1
figure 1

DNA and RNA mutation allele frequency and copy number in the CT26 mouse colon cancer cell line.

(A and B): the sequence content and normalized read counts from the (A) BALB/cJ germline and (B) CT26 tumor DNA reads at the exemplary T > G Eif4g2 mutation locus (chr7, 118,222,833). Four nucleotides are shown on either side of the mutation. The exome of each sample (germline and tumor) was sequenced in duplicate. The normalized total number of reads overlapping the mutation locus is shown for each sample and replicate. The replicates show similar total counts and mutation frequencies. (C) The DNA gene copy number and DNA mutation allele frequency of the 3023 SNVs in CT26. Symbols and colors represent different chromosomes. Black crosses mark allowable mutation allele frequency and copy number combinations. The mutations with 100% allele frequency are homozygous mutations. (D) The sequence content and normalized read counts from the CT26 tumor RNA reads at the T > G Eif4g2 mutated locus. The CT26 transcriptome was sequenced in duplicate; the replicates show similar total counts and mutation frequencies. (E) The DNA and RNA mutation allele frequency for the 697 SNVs covered by at least 10 reads. SNVs are colored according to identified gene DNA copy number (CN), as determined in (C).

Using the read counts and frequencies for all genes and mutations, respectively, we simultaneously determined the allele frequency of each mutation, the absolute DNA copy number of each gene and the mean ploidy for the sample. We found that the CT26 genome has many triploid, tetraploid and pentaploid regions and many regions with homozygous mutations, suggesting multiple LOH events from the onco-transformation or inbreeding (Figure 1C)27. Further, we observed that mutation allele frequencies correctly occur at the distinct frequencies allowed by the DNA copy number (Figure 1C). For example, the mutations in the copy number 4 regions occur at 25, 50, 75 or 100%. With the computational platform working, we determined the DNA mutation allele frequency for each SNV. We found, for example, that the Eif4g2 locus falls in a region with copy number 3. Based on the observed 35% and 36% mutation allele frequencies, the G mutation occurs on one allele and the wild-type T occurs on two alleles.

We determined the RNA mutation allele frequency using the NGS RNA-Seq reads overlapping the identified SNVs. We selected SNVs for which there were at least 10 overlapping RNA reads, which comprised 697 SNVs in CT26. The determination of RNA mutation allele frequencies is sensitive to the read alignment algorithm28,29. We compared multiple methods and selected the STAR algorithm due to its ability to effectively align reads containing mismatches21. Examining the reads overlapping the exemplar Eif4g2 T > G mutation, 34% and 28% contain the mutation (Figure 1D).

Genome-wide analysis using this procedure revealed that of the 697 mutations in CT26 expressed genes, 688 mutations are present in the RNA reads and only 9 are absent. Second, all homozygous DNA mutations are also correctly homozygous at the RNA level. Third, the DNA and RNA mutation allele frequencies correlate remarkably well (Figure 1E, r2 = 0.82). These results show that a) CT26 mutations in expressed genes are transcribed with over 99% likelihood and b) the mutations are transcribed in equal proportion to the underlying DNA mutation allele frequency.

We repeated this analysis for the B16F10 and 4T1 samples (Figures 2 and 3). There were 182 mutations with at least 10 RNA reads in B16F10. 179 of the 182 mutations are expressed (98%) and the RNA and DNA mutation allele frequencies correlate well (r2 = 0.75). In 4T1, there were 101 transcribed mutation loci with at least 10 RNA reads. 100 of the 101 mutations are expressed (99%) and the RNA and DNA mutation allele frequencies correlate very well (r2 = 0.94). Together, 967 of 980 mutations in expressed genes are expressed (99%) and the RNA and DNA mutation allele frequencies correlate highly.

Figure 2
figure 2

The DNA and RNA mutation allele frequency in CT26, B16F10 and 4T1 cells.

The marker color and size are determined by the total number of RNA reads (mutation containing plus wild-type containing reads) that overlap the SNV coordinate.

Figure 3
figure 3

The impact of PTC-containing mutations on the CT26, B16F10 and 4T1 transcriptomes.

Red circles mark PTC-causing mutations in non-last-exons; green squares represent PTC-causing mutations in last exons.

Next, we examined the outliers for evidence of biological processes. We developed a metric for measuring RNA versus DNA mutation allele imbalance:

For most mutations, the RNA and DNA mutation allele fractions are similar and the imbalance is near zero. Highly expressed mutation with more reads, correlate more strongly and have a smaller imbalance variance, as can be seen when comparing the large and small points in Figure 2. Indeed, the mean r2 values are 0.88 and 0.78 and the standard deviations are 8.2 and 14.0 for the subsets of mutations with more than 65 or less than 15 reads coverage, respectively (Figure 4A). This is likely because the RNA allele frequency is more accurate with higher coverage30. 39 of 980 mutations (4%) have an absolute imbalance score greater than 25 and all 39 have relative low expression. We compared the mutation allele frequency imbalance in silent and non-synonymous mutations and found no difference (p = 0.86) (Figure 4B). We expected heterozygous mutations on the X chromosome, having 50% DNA allele frequencies, to have either 0% or 100% RNA allele frequencies due to inactivated X alleles. However, the RNA frequencies of heterozygous X chromosome mutations were centered near 50%. This is likely because 14 of the 15 identified heterozygous X chromosome SNVs were found in the CT26 cells and CT26 cells transcribe from both alleles, having lost the inactivated X allele27. Imprinting causes selective allele transcription18. Across the three cell lines, we found only two SNVs in known imprinted genes31, in genes Atp10a and Plagl1 and neither SNV allele frequency is imbalanced. This suggests that these genes are not imprinted in these tumor samples or that the mutations occurred on amplified, non-imprinted alleles.

Figure 4
figure 4

Mutation frequency imbalance (RNA-DNA).

(A) A comparison of high and low expressed mutations based on mutations with at least 65 read coverage (blue, n = 179) or less than 15 read coverage (green, n = 180). The cutoffs were selected such that the sets have similar number of members. The means are similar but the variance in the low expression group is much higher. (B) The mutation frequency imbalance of non-synonymous (magenta, n = 596) and silent (red, n = 286) mutations. (C) The non-PTC (black, n = 955), PTC not-last-exon (red, n = 19) and PTC last-exon (green, n = 6) mutation sets and p-values.

We found that SNVs causing PTCs in non-last exons were expressed at significantly lower frequencies than predicted from the DNA allele frequencies (Figures 3 and 4C). SNVs causing PTCs in last exons, however, were expressed at the DNA-predicted allele frequencies. The p-value between the “no PTC” (black) and “PTC, not last exon” (red) imbalances is 8e-14, whereas the difference between the “no PTC” and “PTC in last exon” (green) is insignificant (p = 0.9). Both observations are in agreement with the established mechanism of NMD surveillance, in which NMD scans transcripts for PTCs occurring before the last exon and initiates degradation of such transcripts.

Conclusions

Cancer cells contain DNA mutations and the RNA expression of heterozygous non-synonymous mutations has impacts on biological, epigenetic and medical questions. While several studies have examined DNA allele frequency, there has not been a genome-wide examination of the translation of DNA mutations. Thus, the objective of this study was to determine if mutations are, in general, transcribed and, if so, whether they are selectively transcribed.

Unlike somatic mutation detection which requires an inter-sample comparison (tumor versus normal), the determination of mutation allele frequencies is a self-normalizing intra-sample comparison and is thus relatively robust to sample handling and laboratory workflows. However, as the mutation-containing reads contain a mismatch to the reference sequence, they are more difficult to computationally align than wild-type reads. Indeed, we found that different alignment algorithms introduced significant systematic biases in the determination of allele frequencies. We preferred the STAR algorithm for RNA read alignment.

One of our key findings is that 99% of the point mutations in expressed genes are transcribed. Thus, the likelihood that a mutation in a transcribed gene will be transcribed to RNA is high. Second, we found that mutations are transcribed in equal proportion to their DNA allele frequency (up to r2 = 0.94) and thus, RNA and DNA dosages are matched. Third, we identified nonsense-mediated decay of mutations resulting in premature stop codons prior to the last exon as a primary mechanism to introduce imbalances of mutated versus wildtype alleles. Along this line, we did not observe an influence of X-inactivation or imprinting, although this was likely due to the small number of mutations in imprinted genes. We did not find a difference between silent and non-synonymous mutations. Nor did we find surveillance, epigenetic or other compensatory feedback mechanism that selectively transcribes or silences mutation alleles. As investigation of mutations in regulatory regions outside of transcripts, such as in transcription factor binding sites (TFBSs), requires genome instead of exome resequencing, the impact of the potential cis-acting mutations on allele-specific transcription remains unclear.

The results here are consistent with transcription occurring equally from all alleles, mutated and wild-type. Outside of NMD, we did not find evidence for a general surveillance or epigenetic silencing mechanism that acts to degrade or prevent transcription of entire classes of mutation-containing transcripts, suggesting that the default tumor state is to equally transcribe from mutated and wild-type alleles. For cancer patients, this provides support for DNA-based mutation-detection assays for patient stratification, which will gain increasing relevance with mutation-targeting immunotherapies showing pre-clinical proof-of-concept32 and entering clinical trials, such as trial NCT02035956.

Methods

Samples: C57BL/6 and BALB/cJ mice (Charles River) were kept in accordance with legal and ethical policies on animal research. All animal protocols were approved by the government of Rheinland-Palatinate's Animal Care Committee, Koblenz, Germany. Germline BALB/cJ and C57BL/6 DNA samples were extracted from mouse tail. B16F10 melanoma and CT26 colon carcinoma cells were purchased from ATCC (Manassas, USA); 4T1 breast tumor cells were purchased from Caliper Life Sciences (4T1-luc2-tdTomato, product 125669), who derived them from ATCC CRL-2539. B16F10 was originally derived from a C57BL/6 mouse. CT26 was derived from BALB/cJ mouse and 4T1 from a BALB/cfC3H mouse.

NGS sequencing and data processing: replicate exome capture for DNA resequencing was performed using the Agilent Sure-Select mouse whole-exome solution-based capture assay. Oligo (dT) RNA was isolated from each tumor cell line (B16F10, CT26 and 4T1) in replicate and was prepared for gene expression profiling. Libraries were sequenced on an Illumina HiSeq2000. DNA reads were aligned with bwa20 (version 0.5.8c, default options). Ambiguous reads mapping to multiple locations of the genome were removed. RNA reads aligned with STAR21 (version 2.1.4a, default options). Individual exome replicates contained an average of 113 million reads per sample and individual RNA-Seq replicates contained an average of 26 million reads per sample.

Mutation identification: single nucleotide mutations (SNVs) that were identified by all algorithms samtools22, Mutect23 and SomaticSniper24 (all with default options) and found in both replicates were further filtered using binomial filters that eliminated erroneous tumor-only coverage artifacts and thus decreased the likelihood that a mutation was classified as somatic due to lack of coverage in the germline sample.

DNA copy number: absolute allele copy number, tumor purity and mutation allele fraction were simultaneously determined using a novel algorithm that assumes a) that mutation allele fraction can take only discrete values in tumor cells based on allele copy number and b) that the relative tumor to germline number of exome-seq reads mapping to a gene locus is proportional to locus copy number25.

Mutation source: the identified SNVs represent variations between a tumor genome in cells derived from an historic mouse and a modern inbred mouse of the same strain. For example, the CT26 cells were derived from a BALB/c mouse in 1975 and the normal cells sequenced here were from a BALB/cJ mouse in 2011. As such, the SNVs include both somatic mutations associated with the onco-transformation and genetic drift in the inbred mice.

DNA and RNA mutation allele fraction were determined by examining the nucleotide sequence of the reads overlapping each SNVs. The mutation allele fraction was calculated as the number of mutation-containing reads divided by all reads overlapping the SNV. Mutations were considered if at least ten DNA and ten RNA reads overlapped the genomic position. A minimum cutoff of ten was selected to increase the accuracy when determining mutation allele percentages: a higher cutoff would lead to higher accuracy but would decrease the total number of data points. The results obtained when using higher or lower cutoffs are very similar. Frequencies from replicates were combined using error weighted averaging based on uncertainties from the binomial distribution and the number of reads for the frequency measurement. Essentially, frequencies from samples with more reads were given more weight.

Statistics were calculated using Matlab software packages. Correlation coefficients represent Pearson's linear correlation coefficient. P-values were calculated with a two-sample T-test testing whether the hypothesis that the two sample sets come from the same normal distribution be rejected at the 5% significance level.

An additional table contains annotated CT26, B16 and 4T1 mutation lists. NGS fastq files for B16, CT26, 4T1 (4T1-luc2-tdTomato), C57BL/6 and BALB/cJ are available from the European Nucleotide Archive (ENA) as PRJEB5797, PRJEB5791, PRJEB5299, PRJEB5320, PRJEB5312 and PRJEB5321.