Lobular breast cancer is an oestrogen-receptor-positive (ER+, also known as ESR1+) subtype of breast cancer (approximately 15% of all breast cancers). It is usually of low-intermediate histological grade and can recur many years after initial diagnosis. To interrogate the genomic landscape of this class of tumour, we re-sequenced1,2,3,4 the DNA from a metastatic lobular breast cancer specimen (89% tumour cellularity; Supplementary Fig. 1) at approximately 43.1-fold aligned, haploid reference genome coverage (120.7 gigabases (Gb) aligned paired-end sequence; Supplementary Fig. 2, Table 1 and Supplementary Methods). Deep high-throughput transcriptome sequencing (RNA-seq)5 performed on the same sample generated 160.9-million reads that could be aligned (Supplementary Table 1, see also Supplementary Fig. 2 and Supplementary Methods). The saturation of the genome (Table 1) and RNA-seq (Supplementary Table 1) libraries for single nucleotide variant (SNV) detection is discussed in Supplementary Information. The aligned (hg18) reads were used to identify (Supplementary Fig. 2) the presence of genomic aberrations, including SNVs (Supplementary Table 2), insertions/deletions (indels), gene fusions, translocations, inversions and copy number alterations (Supplementary Methods). We examined predicted coding indels and predicted inversions (coding or non-coding; Supplementary Methods); however, all of the events that were validated by Sanger re-sequencing were also present in the germ line (Supplementary Tables 3 and 4). None of the 12 predicted gene fusions revalidated. We also computed the segmental copy number (Supplementary Methods and Supplementary Table 5a) from aligned reads, and revalidated high level amplicons by fluorescence in situ hybridization (FISH) (Supplementary Table 5b), revealing the presence of a new low-level amplicon in the INSR locus (Supplementary Fig. 3).
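As a quick consistency check on the coverage figures quoted above, fold coverage is the aligned sequence divided by the haploid genome length. The sketch below only rearranges the two reported numbers (120.7 Gb aligned, 43.1-fold) to recover the implied reference length. The variable and function names are illustrative and do not come from the paper's pipeline.

```python
# Back-of-envelope check of the reported genome coverage.
# Only the two reported figures are inputs; everything else is derived.

ALIGNED_GB = 120.7      # aligned paired-end sequence, in gigabases (reported)
REPORTED_FOLD = 43.1    # reported haploid fold coverage


def fold_coverage(aligned_gb: float, genome_gb: float) -> float:
    """Mean haploid coverage = aligned bases / haploid genome length."""
    return aligned_gb / genome_gb


# Haploid reference length implied by the two reported numbers.
implied_genome_gb = ALIGNED_GB / REPORTED_FOLD

print(f"implied haploid reference length: {implied_genome_gb:.2f} Gb")
print(f"coverage recomputed: {fold_coverage(ALIGNED_GB, implied_genome_gb):.1f}-fold")
```

The implied length of roughly 2.8 Gb is in line with the mappable portion of a human reference assembly, so the two reported figures agree with each other.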

Table 1 Summary of sequence library coverage

We identified coding SNVs from aligned reads, using a Binomial mixture model, SNVMix (Supplementary Table 2, Methods and Supplementary Appendix 1). From the RNA-seq (WTSS-PE) and genome (WGSS-PE) libraries we predicted 1,456 new coding non-synonymous SNVMix variants (Supplementary Table 2). After the removal of pseudogene and HLA sequences (1,178 positions remaining) and after primer design, we re-sequenced (Sanger amplicons) 1,120 non-synonymous coding SNV positions in the tumour DNA and normal lymphocyte DNA. Some 437 positions (268 unique to WGSS-PE, 15 unique to WTSS-PE, and 154 in common) were confirmed as non-synonymous coding variants. Of these, 405 were new germline alleles and 32 were revealed as non-synonymous coding somatic point mutations (Table 2). Of the 32 somatic mutations, 30 were present in WGSS-PE and/or WTSS-PE, whereas two were detected from the WTSS library sequence alone (Table 2). None of the 32 genes were found in common with the CAN breast genes6, which were discovered from ER- cell lines. Eleven genes appear in the current release of COSMIC7 (CHD3, SP1, PALB2, ERBB2, USP28, KLHL4, CDC6, KIAA1468, RNF220, COL1A1 and SNX4) but with mutations at different positions. We examined the population frequency of the somatic mutation positions for PALB2, ERBB2, USP28, CDC6, CHD3, HAUS3 (previously known as C4orf15), SP1, KIAA1468 and DLG4 in a further 192 breast cancers (Supplementary Methods; 112 lobular, 80 ductal). None of these 192 breast cancers showed identical mutations to those described here; however, 3 out of 192 cases (2 lobular, 1 ductal) contained neighbouring non-synonymous variants/deletions affecting the ERBB2 kinase domain (Supplementary Fig. 4). Interestingly, 2 out of 192 cases (both lobular) contained two different heterozygous truncating variants in HAUS3: chr4:2203685 G>T on minus strand, GAG>TAG (Glu>stop), and chr4:2203483 C>G on minus strand, TCA>TGA (Ser>stop) (Supplementary Fig. 5). 
Notably, HAUS3 is a member of the recently described8,9,10 multiprotein augmin complex, the function of which is required for genome stability mediated by appropriate kinetochore attachment and centrosome morphogenesis.
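The validation counts in this paragraph can be tallied to confirm that they are internally consistent. The short sketch below uses only numbers stated in the text. The derived ratios (confirmation rate, somatic fraction) are simple quotients computed here, not statistics reported by the authors.

```python
# Tally of the Sanger validation funnel, using counts stated in the text.

predicted = 1456        # new coding non-synonymous SNVMix predictions
after_filtering = 1178  # after removing pseudogene and HLA sequences
sanger_tested = 1120    # positions re-sequenced after primer design

# Confirmed positions by library of origin.
wgss_only, wtss_only, both_libraries = 268, 15, 154
confirmed = wgss_only + wtss_only + both_libraries

# Confirmed positions by origin of the allele.
germline, somatic = 405, 32

# Derived quotients (not reported in the text).
confirmation_rate = confirmed / sanger_tested
somatic_fraction = somatic / confirmed

print(f"confirmed: {confirmed} of {sanger_tested} tested "
      f"({confirmation_rate:.1%})")
print(f"somatic: {somatic} of {confirmed} confirmed ({somatic_fraction:.1%})")
```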

Table 2 Somatic coding sequence SNVs validated by Sanger sequencing

To determine how many of the somatic non-synonymous coding sequence mutations were already present at diagnosis 9 years earlier, we next examined genomic DNA from the primary tumour directly, by a single molecule frequency counting experiment (Supplementary Methods)4. Twenty-eight of the 32 mutations yielded amplicons compatible with Illumina sequencing (Supplementary Methods), and two extra mutations were sampled by Sanger sequencing (Supplementary Fig. 5). As controls we selected 36 heterozygous germline SNVs at random. The PCR amplicons for known germline and somatic mutations were sequenced on an Illumina device. After alignment, the observed counts of reference and non-reference bases at the target position were compared using the Binomial exact test. To calibrate the expected mean of the Binomial distribution, we used the non-reference allele frequency from positions -5 to +5 surrounding (but not including) the target position (Supplementary Table 6a, b), where only reference bases should be called. Unequal segmental amplification/deletion in the genome may contribute to a departure from the theoretical ratio of 0.5 for a heterozygous allele. As a result, amplicons from heterozygous germline alleles showed occasional measured frequencies of between 0.2 and 0.8 in both the primary and metastatic tumour DNA (Table 3 and Supplementary Table 7), but with a modal frequency around 0.5, as expected. In the metastatic genomic DNA the somatic mutations showed frequencies of between 0.2 and 0.79 (Table 3). Notably, the somatic coding mutation positions examined in the primary tumour showed three patterns of abundance: prevalent, rare and undetectable (Table 3). Mutations in ABCB11, PALB2 and SLC24A4 were detected at prevalent frequencies for heterozygous mutations (≥0.2, the lowest value seen for known germline alleles) given a 73% tumour content. 
The frequency of the mutation in HAUS3 was 0.79, consistent with it being a prevalent homozygous mutation, also confirmed by Sanger sequencing (Supplementary Fig. 5). Sanger amplicon sequencing showed that the SNX4 somatic mutation was also present in the primary tumour, whereas the KIAA1772 (also known as GREB1L) mutation was not. Six mutations (KIF1C, USP28, MORC1, MYH8, KIAA1468 and RNASEH2A) showed statistically significant (P < 0.01, Binomial exact test) intermediate frequencies of between 1% and 13% (Table 3), suggesting that these mutations were restricted to minor subclones of tumour cells. The remaining 19 out of 30 of the somatic coding mutations were not detected in the primary tumour DNA. Thus, significant heterogeneity in tumour somatic mutation content existed in the primary tumour at diagnosis. In contrast with the recently reported sequence of cytogenetically normal acute myeloid leukaemia (AML) tumour4, significant evolution of coding mutational content occurred between primary and metastasis. It is unknown whether the 19 mutations present in the metastasis, but not detected in the primary, were a consequence of radiation therapy or innate tumour progression.
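The frequency-counting test described above can be sketched as follows: the background non-reference rate is pooled from the flanking positions (-5 to +5, target excluded), and the non-reference count at the target is then compared with that rate using an exact Binomial tail probability. This is a minimal sketch under stated assumptions. The one-sided form of the test, the pooling of flanking positions and all read counts below are hypothetical illustrations, not details or data taken from the paper.

```python
from math import exp, lgamma, log


def background_rate(flank_nonref, flank_depth):
    """Pooled non-reference frequency across flanking positions."""
    return sum(flank_nonref) / sum(flank_depth)


def log_binom_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1.0 - p))


def binom_upper_tail(k, n, p):
    """Exact one-sided P(X >= k): chance of seeing at least k non-reference
    reads in n reads if only background error were present."""
    tail = sum(exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))
    return min(1.0, tail)


# Hypothetical counts: ten flanking positions, each covered by 1,000 reads.
flank_nonref = [2, 1, 0, 3, 1, 2, 0, 1, 2, 1]
flank_depth = [1000] * 10
p_bg = background_rate(flank_nonref, flank_depth)

# Hypothetical target positions, each with 1,000 reads.
p_rare = binom_upper_tail(50, 1000, p_bg)    # 5% non-reference reads
p_absent = binom_upper_tail(1, 1000, p_bg)   # a single non-reference read

print(f"background rate: {p_bg:.4f}")
print(f"5% allele significant at P < 0.01: {p_rare < 0.01}")
print(f"single read significant at P < 0.01: {p_absent < 0.01}")
```

Working in log space keeps the Binomial coefficients from overflowing at the read depths typical of amplicon sequencing.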

Table 3 Frequency of germline and somatic alleles in the metastatic and primary genomes

We also examined how the transfer of information from the nuclear genome to proteins was modified by alternative splicing (Supplementary Table 8 and Supplementary Fig. 6), biased allelic expression (Supplementary Table 9) and RNA editing. At the single nucleotide level, RNA-editing enzymes (which can be regulated by oestrogens11) may also recode transcripts, resulting in a proteome divergent from the genome12,13,14,15. Interestingly, the ADAR enzyme, one of the principal RNA-editing enzymes that mediates A→I(G) edits, was among the top 5% of genes expressed (145.6 reads per base, Supplementary Table 10), and was the only editing enzyme expressed at a high level. We searched for potential editing events (Methods) and found 3,122 candidate edits in 1,637 gene loci (Supplementary Table 11). Some 526 of the 3,122 candidate edits are non-synonymous changes and 232 are synonymous changes (with the remainder affecting untranslated regions). We independently revalidated (Supplementary Methods) 75 editing events in 12 gene loci from the lobular metastasis by Sanger sequencing (Supplementary Table 12). Two genes, COG3 and SRP9 (Fig. 1), showed confirmed high frequency non-synonymous transcript editing, resulting in variant protein sequences. These observations emphasize the importance of integrating RNA-seq data with tumour genomes in assessing protein variation.

Figure 1: RNA editing in COG3 and SRP9.
Sanger sequence traces from the non-synonymous editing positions in COG3 and SRP9. The editing position is arrowed. Top trace is tumour RNA, bottom trace tumour DNA. The editing positions were confirmed with reverse strand reads (not shown).

The coding mutation landscape of breast cancers has, so far, been mostly determined from ER- metastatic cell lines/samples6,16, and has suggested the presence of large numbers of passenger events as well as drivers. Our results show the importance of sequencing samples of tumour cell populations early as well as late in the evolution of tumours, and of estimating allele frequency in tumour genomes. Our observations suggest that the sequencing of primary breast cancers and pre-invasive malignancy may reveal significantly fewer candidates for tumour initiating mutations.

Methods Summary

Paired-end reads were assigned quality scores and aligned to the reference genome (hg18) using Maq17 (Supplementary Methods and Supplementary Fig. 2). For identification of SNVs we used a simple Binomial mixture model, SNVMix (Supplementary Appendix 1), which assigns a probability to each base position as homozygous reference (aa), heterozygous non-reference (ab) and homozygous non-reference (bb), based on the occurrence of reference (hg18) and non-reference bases at each aligned position. This model was calibrated initially, using high confidence allele calls from Affymetrix SNP6.0 hybridization of tumour and normal DNA. We estimated the receiver operating characteristic (ROC) performance (Supplementary Fig. 8) and determined that an SNVMix threshold of P = 0.77 for (ab) or (bb) for a non-reference call would yield a false discovery rate (FDR) of 1%. For the RNA-seq library, a threshold of P = 0.53 was used (Supplementary Fig. 8; FDR = 0.01) to call non-reference positions. Non-reference positions were then filtered for known variants against the sources of germline variation, the single nucleotide polymorphism database (dbSNP) and the completed individual genomes18,19 (Supplementary Table 2). Saturation of the libraries for SNV discovery was determined by random re-sampling (Supplementary Fig. 9 and Supplementary Methods). Segmental copy number was inferred with a hidden Markov model (HMM) method (Supplementary Table 4a, b and Supplementary Methods).
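To illustrate the kind of calculation a Binomial mixture genotype caller performs, the sketch below computes posterior probabilities for the three genotypes (aa, ab, bb) from the reference and non-reference base counts at one position, then applies the 0.77 threshold quoted above. It is a simplified illustration, not the SNVMix implementation. The per-genotype reference-base probabilities and the priors are hypothetical placeholder values (SNVMix's calibrated parameters are given in Supplementary Appendix 1, not here), and summing P(ab) and P(bb) for the call is an assumption about how the threshold is applied.

```python
from math import exp, lgamma, log

# Hypothetical parameters, chosen only for illustration.
# MU[g]: probability that a read shows the reference base under genotype g.
MU = {"aa": 0.99, "ab": 0.50, "bb": 0.01}
# PRIOR[g]: prior probability of genotype g at an arbitrary position.
PRIOR = {"aa": 0.98, "ab": 0.01, "bb": 0.01}


def log_binom_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1.0 - p))


def genotype_posteriors(ref_count, nonref_count):
    """Posterior over aa/ab/bb given read counts at a single position."""
    depth = ref_count + nonref_count
    log_joint = {g: log(PRIOR[g]) + log_binom_pmf(ref_count, depth, MU[g])
                 for g in MU}
    peak = max(log_joint.values())          # subtract max for stability
    weights = {g: exp(v - peak) for g, v in log_joint.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def is_nonreference(posteriors, threshold=0.77):
    """Call a non-reference position when P(ab) + P(bb) meets the threshold."""
    return posteriors["ab"] + posteriors["bb"] >= threshold


het = genotype_posteriors(ref_count=20, nonref_count=20)
ref = genotype_posteriors(ref_count=40, nonref_count=0)
print("balanced counts called non-reference:", is_nonreference(het))
print("all-reference counts called non-reference:", is_nonreference(ref))
```

The same function could be run with a threshold of 0.53 to mirror the RNA-seq setting described in the text.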

We searched for RNA-editing events by examining all very high confidence (P(ab) + P(bb) > 0.9) SNVMix predictions from the RNA-seq library of the metastatic tumour that were not found with extreme confidence (P(aa) > 0.99, derived from the SNVMix receiver operating characteristic curve at FDR = 0.01) at the same positions in the metastatic tumour genome library.
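One way to read the filter above is as a two-sided condition on the genotype posteriors at a single position: the RNA-seq library must support a non-reference genotype with P(ab) + P(bb) > 0.9, while the genome library must support the homozygous reference genotype with P(aa) > 0.99. The sketch below encodes that reading. The function name, the dictionary layout and the example posteriors are hypothetical, and the interpretation of the genome condition is an assumption rather than a confirmed detail of the authors' method.

```python
def is_candidate_edit(rna_post, dna_post, rna_min=0.9, dna_ref_min=0.99):
    """Flag a position as a candidate RNA edit.

    rna_post, dna_post: dicts with keys 'aa', 'ab', 'bb' holding genotype
    posteriors from the RNA-seq and genome libraries at the same position.
    """
    rna_is_variant = rna_post["ab"] + rna_post["bb"] > rna_min
    dna_is_reference = dna_post["aa"] > dna_ref_min
    return rna_is_variant and dna_is_reference


# Hypothetical posteriors for three positions.
edited = is_candidate_edit(
    {"aa": 0.02, "ab": 0.95, "bb": 0.03},      # RNA: confident variant
    {"aa": 0.999, "ab": 0.001, "bb": 0.0},     # DNA: confident reference
)
genomic_variant = is_candidate_edit(
    {"aa": 0.02, "ab": 0.95, "bb": 0.03},      # RNA: confident variant
    {"aa": 0.01, "ab": 0.98, "bb": 0.01},      # DNA: also variant
)
weak_rna = is_candidate_edit(
    {"aa": 0.40, "ab": 0.55, "bb": 0.05},      # RNA: below 0.9
    {"aa": 0.999, "ab": 0.001, "bb": 0.0},
)

print(edited, genomic_variant, weak_rna)
```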