Basal-like breast cancer is characterized by the absence of oestrogen receptor (ER) expression, the lack of ERBB2 gene amplification, and a high mitotic index. The consequent absence of approved targeted therapy options and frequently poor response to standard chemotherapy often result in a rapidly fatal clinical course. The disease also accounts for an elevated percentage of breast cancers in patients with African ancestry1. Clinical progress has been limited by a poor understanding of the genetic events responsible for this tumour subtype and by limited preclinical models to study the disease. Because basal-like breast cancer has a highly unstable genome, a key question is whether the fatal metastatic process is driven by mutations that occur after the tumour cells arrive at the distant site, or whether the primary tumour generates cells with a complete repertoire of somatic mutations required for metastatic growth. The rapid advancement of next-generation sequencing technologies allows comprehensive characterization of genomic changes, facilitating the comparison of multiple samples taken from the same patient to address the genetic basis for tumour progression and metastasis.

Case presentation and previous characterization of samples

A 44-year-old African-American woman was diagnosed with an ERBB2-negative and ER-negative inflammatory breast cancer. She was treated with neoadjuvant dose-dense chemotherapy2, but significant residual tumour was present in the breast and axillary lymph nodes at mastectomy. This indicated chemotherapy resistance and she subsequently underwent radiation therapy. Eight months later she developed a cerebellar metastasis and, despite resection, rapidly succumbed to widely disseminated disease. A transplantable human-in-mouse (HIM) xenograft tumour line was generated from a sample of her primary tumour biopsied before treatment3. The xenograft in the mammary fat pad was locally invasive and produced metastatic deposits in lymph nodes and ovaries. Informed consent for full genome sequencing was obtained and DNA samples were prepared from her peripheral blood, primary tumour, brain metastasis and an early passage xenograft (harvested 101 days after initial engrafting into the mouse host). Application of the PAM50 intrinsic subtype algorithm identified the primary tumour, brain metastasis and xenograft line as basal-like subtype, with high risk of relapse (ROR) scores4.

Sequence coverage and mutation analysis

Using a paired-end sequencing strategy, we generated 130.7, 124.9, 111.8 and 149.2 billion base pairs of sequence data from genomic DNA derived from blood, primary tumour, brain metastasis and xenograft samples, respectively, with corresponding haploid coverages of 38.8×, 29.0×, 32.0× and 23.8× (Supplementary Table 1). These genome-wide coverages were assessed by comparing single nucleotide variants (SNVs) detected by MAQ5 with single nucleotide polymorphisms (SNPs) genotyped using Illumina 1M duo arrays for all tissues except the xenograft, for which array data from the metastasis were used as a surrogate. This comparison confirmed bi-allelic detection of 98.27%, 96.79%, 96.17% and 88.77% of the heterozygous array SNPs in the normal, primary tumour, metastasis and xenograft sequence data sets, respectively (Supplementary Table 1).

The process for selecting somatic mutations is shown in Supplementary Table 2 and is detailed in Supplementary Information. Putative somatic SNVs and indels that overlap with coding sequences, splice sites and RNA genes were included as ‘tier 1’. We combined tier 1 sites identified in all three tumour samples and obtained deep read count data for all four samples from Illumina and/or 454 platforms (Supplementary Information). On the basis of pathology review, the tumour cellularity estimates were 70% for the primary tumour and 90% for both the brain metastasis and xenograft. Using these estimates, we calculated the tumour read counts by proportionally removing the counts derived from the normal tissue reads from the counts obtained from primary tumour and metastasis reads (Supplementary Table 3a). Using the Illumina platform, we also generated 15.6 Gb (4.4× haploid coverage) of sequence data for the NOD/SCID mouse genome used as the host for the xenograft line. The mapping rates of NOD/SCID data to human and mouse C57BL/6 reference sequences were 3.17% and 95.85%, respectively. As the non-malignant contamination in xenograft is largely from murine cells (which do not significantly affect read mapping), no correction was applied for the xenograft data. Adjusted tumour read counts were used to calculate mutant allele frequencies. Somatic changes were validated by comparing mutant allele frequencies in the three tumour genomes against the germline DNA sample, combined with a manual review of ABI 3730 data from PCR products (Supplementary Information).
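The cellularity adjustment described above can be illustrated with a minimal sketch. The exact procedure is given only in the Supplementary Information, so the formula here is an assumption: a fraction (1 - cellularity) of the reads at a site is attributed to normal cells and is split between alleles in the proportions seen in the matched blood sample. The function names `adjust_counts` and `mutant_allele_frequency` are hypothetical.

```python
def adjust_counts(ref, var, normal_ref, normal_var, cellularity):
    """Subtract the expected normal-cell reads from tumour-sample read counts.

    Assumption (not taken from the paper): normal cells contribute a fraction
    (1 - cellularity) of all reads at the site, distributed between the
    reference and variant alleles as observed in the matched normal sample.
    """
    if not 0.0 < cellularity <= 1.0:
        raise ValueError("cellularity must be in (0, 1]")
    total = ref + var
    normal_total = normal_ref + normal_var
    if total == 0 or normal_total == 0:
        # Nothing to correct without reads in both samples.
        return float(ref), float(var)
    normal_reads = (1.0 - cellularity) * total
    # Clamp at zero so that sampling noise cannot produce negative counts.
    adj_ref = max(ref - normal_reads * normal_ref / normal_total, 0.0)
    adj_var = max(var - normal_reads * normal_var / normal_total, 0.0)
    return adj_ref, adj_var


def mutant_allele_frequency(ref, var):
    """Fraction of reads supporting the variant allele."""
    total = ref + var
    return var / total if total else 0.0


# Illustrative numbers only: 70% cellularity, as estimated for the primary tumour.
adj_ref, adj_var = adjust_counts(ref=650, var=350, normal_ref=1000,
                                 normal_var=0, cellularity=0.7)
print(round(mutant_allele_frequency(adj_ref, adj_var), 3))  # 0.5
```

In this made-up example the raw variant fraction of 35% rises to 50% once the normal-cell reads are removed. For the xenograft, the text states that no such correction was applied.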

A total of 50 somatic sites, including 28 missense, 11 silent, 2 splice site, 1 RNA, 1 nonsense, 4 insertions and 3 deletions, were validated in at least one of the three tumour genomes. Of coding point mutations, the observed nonsynonymous/synonymous ratio of 2.64:1 (29:11) is not significantly different from that expected by chance6 (P = 0.51), indicating that the majority of coding mutations do not confer a selective advantage to the basal tumour. This is similar to the nonsynonymous/synonymous ratio reported in the small-cell lung cancer cell line NCI-H209 (ref. 7), but higher than the ratio reported in the melanoma cell line COLO-829 (ref. 8).

Mutation spectrum in basal breast tumour

We investigated the spectrum of DNA sequence changes in this basal tumour and found that 55% (22 out of 40) of coding point mutations represent C·G→T·A transitions. A similar frequency of C·G→T·A transitions (56% (18 out of 32)) was observed in a lobular breast tumour recently reported9 (Fig. 1a). In addition, 15% (6 out of 40) of coding point mutations representing C·G→A·T transversions were detected in the basal tumour, but none was found in the lobular tumour. The statistical significance of these observations should be explored with the comparative analysis of a larger number of basal and lobular breast tumours. Moreover, the observed C·G→T·A transition frequency is notably higher than those observed in a previous breast cancer study10 (P = 0.027; Fig. 1b). A set of extremely high-confidence tier 1–4 mutations (somatic score >55 and average mapping quality >79) was used to explore the genome-wide mutation spectrum. We found that mutations at A·T bases are significantly expanded in the genome-wide set compared to the coding mutations, especially for A·T→G·C transitions (P = 0.0065). This is consistent with the higher A·T content in non-coding sequences than in coding sequences. Comparison to the whole-genome mutation spectrum reported for the melanoma cell line (COLO-829; ref. 8) and a small-cell lung cancer cell line (NCI-H209; ref. 7) indicates that the tumour genome under study shows no sign of tobacco or ultraviolet influence. We then compared the fraction of the three classes of guanine mutations occurring at CpG dinucleotides in primary tumour, brain metastasis and xenograft and found that the frequencies of G→A mutations are 27.54%, 27.60% and 28.05% in each respective tumour, significantly higher than both the genome average of 4.45% (P < 10⁻¹⁰) and the frequency reported in NCI-H209 (P < 10⁻¹⁰; Fig. 1c).
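The six transition and transversion categories used above treat a substitution and its reverse complement as the same event (for example, C→T and G→A both count as C·G→T·A). A minimal sketch of that bookkeeping follows; the function names and the ASCII labels (`C:G>T:A` standing in for C·G→T·A) are illustrative choices, not taken from the authors' pipeline.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mutation_class(ref, alt):
    """Collapse a single-base substitution into one of six strand-symmetric classes."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Report every change with C or A as the reference base by taking the
    # complement of both alleles when the reference is G or T.
    if ref in ("G", "T"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"


def spectrum(mutations):
    """Return the fraction of mutations falling in each class."""
    counts = Counter(mutation_class(ref, alt) for ref, alt in mutations)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Made-up input: two C:G>T:A transitions, one A:T>G:C transition, one C:G>A:T transversion.
print(spectrum([("C", "T"), ("G", "A"), ("A", "G"), ("G", "T")]))
```

Applied to the 40 coding point mutations, the same tally would yield the 22 out of 40 (55%) C·G→T·A fraction quoted in the text.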

Figure 1: Mutational signatures in the basal breast tumour.

a, Fraction of mutations in each of the transition and transversion categories in the metastasis of a lobular breast tumour9, the metastasis of the basal breast tumour under study, and the 11 breast tumours reported previously29 from which 1,104 coding mutations identified in the discovery set were used in the analysis. b, Fraction of mutations in each of the transition and transversion categories in 43 tier 1 mutations and 3,204 tier 1–4 mutations in the metastasis under study. c, Fraction of guanine mutations at CpGs in primary tumour, metastasis, xenograft and NCI-H209 as reported previously7.


Distribution of mutations among tumours

Common mutations detected in three tumour genomes

Of the 50 validated point mutations and small indels, 48 are detectable in all three tumours. We performed a statistical enrichment test that takes the variations of different platforms, experiments and primer pairs into consideration (Supplementary Information). These 48 sites consist of 20 sites with relatively comparable frequencies across tumours, 26 sites significantly enriched (false discovery rate (FDR) ≤ 0.05) in the metastasis and/or xenograft, and two sites with significant enrichment (FDR ≤ 0.05) in the primary tumour (Fig. 2 and Table 1). The affected genes and the likely consequences of these mutations are summarized in Table 1 and Supplementary Table 3b.

Figure 2: Mutant allele frequency from deep read count data.

The mutant allele frequency of each somatic mutation is shown. Mutations were validated using both 454 and Illumina sequencing. Each bar represents the average of the frequency yielded by the two technologies for a single primer pair and the error bars represent the standard deviation. Data were considered only if there were at least 200 reads from Illumina sequencing and at least 20 reads from 454 sequencing. If no error bar exists, then data were only available from a single sequencing platform.
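The rule stated in this legend (per-platform read thresholds, a mean across platforms, and no error bar when only one platform qualifies) can be sketched as follows. The function name and input layout are hypothetical, and the use of the sample standard deviation is an assumption, because the legend does not say which estimator was used.

```python
from statistics import mean, stdev

# Minimum read depth per platform, as stated in the legend.
MIN_READS = {"illumina": 200, "454": 20}


def site_frequency(counts):
    """Combine per-platform mutant allele frequencies for one primer pair.

    counts maps a platform name to (variant_reads, total_reads).
    Returns (mean_frequency, standard_deviation); the deviation is None when
    only one platform passes its threshold, and both values are None when
    no platform does.
    """
    freqs = [variant / total
             for platform, (variant, total) in counts.items()
             if total >= MIN_READS[platform]]
    if not freqs:
        return None, None
    if len(freqs) == 1:
        return freqs[0], None
    # Assumption: sample standard deviation across the two platforms.
    return mean(freqs), stdev(freqs)


# Illustrative counts only.
print(site_frequency({"illumina": (100, 400), "454": (10, 40)}))
print(site_frequency({"illumina": (100, 400), "454": (5, 10)}))
```

The second call drops the 454 measurement because it has fewer than 20 reads, so the result carries no error bar.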


Table 1 Summary of point mutations and small indels

Mutations with comparable frequencies in three tumours

We detected a JAK2 mutation (I166T), residing in the FERM domain, which is different from the previously reported activating mutations in myeloproliferative diseases, often found in the pseudokinase domain11. Screening of an additional 116 breast tumours identified another mutation (R1122P) in the kinase domain of JAK2 from a luminal B-type breast cancer. A splice site mutation (e8-1) was found in IRAK2. We performed a polymerase chain reaction with reverse transcription (RT–PCR) experiment using RNAs from the brain metastasis and xenograft and found that the first 30 nucleotides of exon 8 (IRAK2, NM_001570) were skipped and an internal exonic AG site was used as a splice acceptor, resulting in an in-frame deletion. A missense mutation (A401S) in CSMD1 was found in all three tumours. Loss of CSMD1 expression is associated with poor survival in invasive ductal breast carcinoma12 and it is frequently deleted in colorectal adenocarcinoma and head/neck carcinomas13. We also identified three missense (E608K, T1456R, and Q2204R) and one nonsense (Q3005*) mutations in CSMD1 in four breast cancers out of 116 screened. A binomial test shows that CSMD1 is significantly mutated in breast cancer (P = 0.022 and FDR = 0.197; Supplementary Table 4).

Mutations highly enriched in metastasis and/or xenograft

A missense mutation (A681E) in NRK, a protein kinase involved in activating JNK, was found to be present in all three tumours, but at 8- and 13-fold increased allele frequencies in the metastasis and xenograft, respectively (Fig. 2 and Table 1). Two somatic mutations (S424C and Q521*) in NRK have been previously reported in breast cancer14. The missense mutation (P461L) identified in the carboxy terminus of MAP3K8 was present at a roughly sixfold increase in the xenograft compared to the primary tumour. C-terminal truncation of MAP3K8 has been shown to activate this oncogenic kinase15,16, raising the possibility that this C-terminal substitution (P461L) is an activating mutation.

Another missense mutation (K1017N) in PTPRJ, a protein tyrosine phosphatase, had a mutant allele frequency of 32% in the metastasis and 57% in the xenograft compared with just 1.3% in the primary tumour. This K1017N mutation in PTPRJ is among the most highly enriched mutations in both the metastasis (FDR = 0.00035) and xenograft (FDR = 0.00022). The mutation site is in the juxtamembrane domain (a basic residue motif) and is in close proximity to the tyrosine-protein phosphatase domain (amino acids 1041–1298). Reference 17 reported that the PTPRJ charged peptide (amino acids 1013–1024) is responsible for interaction with its substrates, such as ERK1/2. The K1017N mutation found in the basal tumour and the K1016A mutation described in ref. 17 both change a basic residue to a neutral residue, indicating that these two mutations may be functionally similar. A missense mutation (F299V) in WWTR1, assigned as deleterious by SIFT18, was detected at 28% mutant allele frequency in metastasis, but only at 7% and 10% in primary tumour and xenograft, respectively (Fig. 2 and Table 1). WWTR1, a 14-3-3 binding protein with a PDZ binding motif, has been shown to modulate mesenchymal stem cell differentiation19. Overexpression of WWTR1 has also been implicated in promoting the migration, invasion and tumorigenesis of breast cancer cells20.

Another point mutation (R258Q) was identified in CHGB (chromogranin B) encoding a tyrosine-sulphated secretory protein. A SNP at the same position was reported to dbSNP in January 2009 for a Yoruba sample. It was also assigned as a germline site in another African-American with breast cancer when we genotyped this mutation in 112 additional primary tumours and 73 metastatic tumours of various expression classes (Supplementary Information). To investigate this variant further, 84 cancer-free African-American women with an average age of 71.2 years (low risk for developing breast cancer) and 38 early-onset African-American breast cancer patients with an average age of 35.6 years were genotyped. The results indicated that 8 out of 84 controls and 3 out of 38 cases carried the variant allele, indicating that this variant is unlikely to be a breast cancer susceptibility allele.
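The case-control comparison above can be made concrete with a short calculation of carrier fractions. The text does not say which statistical test, if any, was applied to these counts, so the two-sided Fisher's exact test below is an illustrative choice rather than the authors' method, and both function names are hypothetical.

```python
from math import comb


def carrier_fraction(carriers, total):
    """Proportion of genotyped individuals carrying the variant allele."""
    return carriers / total


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Counts from the text: 8 of 84 controls and 3 of 38 cases carry the variant.
print(round(carrier_fraction(8, 84), 3), round(carrier_fraction(3, 38), 3))
p_value = fisher_exact_two_sided(8, 84 - 8, 3, 38 - 3)
```

The carrier fractions in controls and cases are close to each other, which is in line with the conclusion that the variant is unlikely to be a susceptibility allele.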

Three validated indels were enriched in the metastasis and/or xenograft. One was the 1-bp insertion in exon 4 of the TP53 gene, which creates a frameshift mutation (Q167fs) in the DNA binding domain and results in a truncated protein. We found the TP53 mutation to be significantly enriched in the xenograft, whereas it was present at a relatively constant frequency in primary tumour and metastasis (Fig. 2 and Table 1).

Mutations enriched in the primary tumour

A nonsense mutation (Q2222*) in MYCBP2 and a missense mutation (E576K) in TGFBI, both found in all three tumours, had higher mutant allele frequencies in the primary tumour (88% for MYCBP2 and 89% for TGFBI) than in the metastasis (44% for MYCBP2 and 38% for TGFBI) or the xenograft (37% for MYCBP2 and 18% for TGFBI) (Fig. 2 and Table 1).

De novo mutations identified in the metastasis

Two de novo mutations were discovered in the metastatic tumour, neither of which was detected in the primary or xenograft tumour genomes. One was a missense mutation (T708I) in SNED1, with a mutant allele frequency of 37%; the other was a silent mutation (N2483) in FLNC with a mutant allele frequency of 18% (Fig. 2 and Table 1). Because the xenograft line, without these two mutations, exhibits metastatic lesions in ovarian, lymphoid and subcutaneous tissue (data not shown), it is unlikely that these mutated genes are essential to the metastatic process.

Elevated copy number alterations in metastasis and xenograft

The cnvHMM algorithm (K.C., X.S., E.R.M., L.D. and R.K.W., unpublished) was applied to the aligned sequence reads to detect regions of copy number alterations in all three tumours. Using pathology-based purity estimates for the primary tumour and brain metastasis, we calculated the read depth contributed from the tumour cells alone and then computed the copy number for all genomic positions. Read depth correction was not applied to the xenograft, as stated earlier. We subsequently compared the copy number data from all three tumours with those from peripheral blood to identify genomic segments with significant copy number alterations (CNAs) (Supplementary Information). A total of 516.5 Mb, 640.4 Mb and 754.5 Mb were amplified, whereas 342.5 Mb, 383.1 Mb and 562.5 Mb were deleted, in primary tumour, metastasis and xenograft, respectively (Supplementary Tables 5–7). Moreover, 96.11% and 93.98% of CNA sequences in the primary tumour were also found in CNA segments in the metastasis and xenograft, respectively, indicating that most primary tumour CNAs are preserved during disease progression and engraftment. On the other hand, only 80.65% of metastasis and 61.29% of xenograft CNA sequences overlap with primary tumour CNAs. Furthermore, 155 regions with focal copy number segments (≤2 Mb) were detected in the primary tumour, but only 101 and 97 regions in the metastasis and xenograft, respectively (Supplementary Tables 8–10). Our results also show that 111 (average span = 745,183 bp) and 99 (average span = 799,395 bp) focal copy number segments (≤2 Mb) in the primary tumour overlap with broader copy number segments in the metastasis (average span = 2,245,546 bp) and xenograft (average span = 3,565,456 bp), respectively, indicating possible expansion of primary focal regions or selection of new adjacent events during disease progression and in the mouse host. Sequence depth-based copy number analysis showed the highest overall concordance with other platforms, including array CGH and the Illumina SNP array, and also provided the highest concordance of copy number (correlation coefficients: 0.89–0.97) between primary tumour, metastasis and xenograft (Supplementary Table 11).
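The overlap percentages quoted above are asymmetric: nearly all primary tumour CNA sequence lies within metastasis or xenograft CNAs, but the reverse fraction is lower because the later samples carry more altered sequence in total. A minimal sketch of such a base-pair overlap calculation is shown below. It assumes segments are given as (chromosome, start, end) with half-open coordinates and ignores whether a segment is a gain or a loss, which the authors' analysis may well have taken into account. All names are hypothetical.

```python
from collections import defaultdict


def merge(segments):
    """Group segments by chromosome and merge overlapping or touching intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in segments:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = out
    return merged


def total_length(merged):
    """Total number of bases covered by a merged segment set."""
    return sum(end - start for ivs in merged.values() for start, end in ivs)


def overlap_length(a, b):
    """Bases covered by both merged segment sets (two-pointer sweep per chromosome)."""
    shared = 0
    for chrom, a_ivs in a.items():
        b_ivs = b.get(chrom, [])
        i = j = 0
        while i < len(a_ivs) and j < len(b_ivs):
            start = max(a_ivs[i][0], b_ivs[j][0])
            end = min(a_ivs[i][1], b_ivs[j][1])
            if start < end:
                shared += end - start
            # Advance whichever interval finishes first.
            if a_ivs[i][1] < b_ivs[j][1]:
                i += 1
            else:
                j += 1
    return shared


def fraction_shared(a_segments, b_segments):
    """Fraction of the sequence in a_segments that also lies in b_segments."""
    a, b = merge(a_segments), merge(b_segments)
    length = total_length(a)
    return overlap_length(a, b) / length if length else 0.0


# Toy coordinates, not taken from the paper.
primary = [("chr1", 0, 100), ("chr2", 50, 150)]
metastasis = [("chr1", 0, 200), ("chr2", 100, 300)]
print(fraction_shared(primary, metastasis))   # 0.75
print(fraction_shared(metastasis, primary))   # 0.375
```

The reported figures are internally consistent with this kind of calculation: 96.11% of the 859.0 Mb of primary tumour CNA sequence is about 825.6 Mb, which is 80.7% of the 1,023.5 Mb altered in the metastasis.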

Common and unique structural variations in three tumours

We used BreakDancer21 to detect structural variants in sequencing data from paired end libraries (Supplementary Table 12) and applied a set of thresholds to identify putative somatic structural events.

Deletions, insertions and inversions

Breakpoint-containing contigs from the three tumour samples that were not present in the matched normal genome were successfully assembled for 137 deletions, 15 insertions and 38 inversions using the TIGRA assembler (L.C., K.C., J.W.W., E.R.M., R.K.W., L.D. and G.M.W., unpublished), suggesting that they were putative somatic events. We then re-mapped individual reads to these assembled contigs to screen out germline structural variants and to confirm somatic structural variants (Supplementary Information), resulting in the detection of 59 deletions and 18 inversions. PCR primers were designed successfully to validate 73 out of 77 putative structural variant events and the resulting amplicons were sequenced by either the Roche 454 or ABI 3730 platform. Subsequently, 28 deletions and 6 inversions were validated as somatic events (Table 2). Among them, a 46,462-bp heterozygous deletion in FBXW7 removes the last 10 exons and a portion of the first exon of NM_018315, probably inactivating FBXW7. FBXW7 targets cyclin E and mTOR for ubiquitin-mediated degradation22,23. Numerous cancer-associated mutations in FBXW7 have been previously reported, and loss of FBXW7 function causes chromosomal instability and tumorigenesis24. Two overlapping deletions (538,467 bp and 515,465 bp in length) on chromosome 5, affecting CTNNA1 along with LRRTM2, MATR3, SNORA74A and SIL1, were also validated. This result is consistent with the detection of a focal copy number deletion encompassing this region in both metastasis (copy number = 0.65) and xenograft (copy number = 0.03) (Fig. 3 and Supplementary Tables 9 and 10). Careful examination of this region in the aligned sequence reads for the primary tumour confirms the existence of copy number deletion. Loss of CTNNA1 was shown to result in global loss of cell adhesion in human breast cancer cells25 and increased in vitro tumorigenic characteristics26, indicating that this bi-allelic deletion has functional importance. A 109,563-bp heterozygous deletion on chromosome 8 was assembled and validated in all three tumours. This event removed three exons of NRG1, which encodes a peptide growth factor that binds to ERBB3 and ERBB4. Notably, a 26,919-bp deletion in MECR was only identified, assembled and validated in the metastasis, suggesting its de novo nature in this sample.

Table 2 Validated structural variations

Figure 3: Two overlapping CTNNA1 deletions on chromosome 5 in three tumours.

A graph of sequence depths, read pairs and genes in a 638,468-bp region containing two overlapping deletions. The top four panels display the read depths at each base, and the reads within the region whose mates mapped at an abnormal distance are displayed as blue bars, with matched pairs connected by arcs. Two different shades of blue indicate the two separate allelic deletion events (538,467 bp and 515,465 bp in length). The bottom panel displays genes annotated in this genomic region.

Translocations


Of the 112 assembled putative translocations, 34 passed manual review using Pairoscope graphs (D.E.L., C.C.H., E.R.M., L.D. and R.K.W., unpublished), and 19 with an assembly score greater than our experimentally supported cutoff of 10 were included in Supplementary Table 13. Seven translocations were experimentally validated (Table 2). One validated translocation, t(4;9)(188855443;139022258), assembled in all three tumours, involved a long terminal repeat (LTR) from the ERVL-MaLR family on chromosome 4 and ABCA2 on chromosome 9. The translocation removes the final exon of the ABCA2 gene (NM_001606). Two other validated translocations, identified in all three tumours, are t(1;2)(245548338;64855172) and t(2;6)(64855565;144243116) (Supplementary Fig. 1). Notably, the breakpoints on chromosome 2 for these two translocations are separated by only 393 bp in a TcMar-Tigger repeat. The chromosome 1 breakpoint of t(1;2)(245548338;64855172) is in intron 5 of NM_032752 in ZNF496. We expect the translation of ZNF496 to continue through exon 5 into intron 5 owing to the lack of a splice acceptor site. On the other hand, t(2;6)(64855565;144243116) involves FAM164B on chromosome 6 and the translocation contig retains three exons of XM_928657. We have also validated t(1;6)(245548342;144243110) (not detected by BreakDancer), the breakpoints of which are only 4 bp and 6 bp away from the breakpoints identified on chromosomes 1 and 6 for t(1;2)(245548338;64855172) and t(2;6)(64855565;144243116), respectively (Supplementary Fig. 1). This translocation is found in both the primary tumour and the metastasis, but apparently is lost in the xenograft (Supplementary Fig. 1 and Fig. 4). Sequencing of two PCR products generated using two primer pairs from chromosomes 1 and 6 demonstrated the presence of two forms of genomic fusion: one includes chromosomes 1 and 6 and the other includes chromosomes 1, 2 and 6. The former is only present in the primary tumour and the metastasis.

Figure 4: Circos plots for the primary tumour, metastasis and xenograft genomes.

a–c, Circos30 plots display the validated tier 1 somatic mutations, DNA copy number and validated structural rearrangements in the primary tumour (a), metastasis (b) and xenograft (c). Mutations enriched in the primary tumour are labelled in red in panel a; mutations enriched in the metastasis or xenograft are in red in panels b and c. Mutations and the large deletion unique to the metastasis are in blue (b). Translocations only present in primary tumour and metastasis are in green. All shared events are in black. The copy number difference between the tumour and normal is shown (scale: -4 to 4). No purity-based copy number corrections were used for plotting.

Discussion


Our comprehensive analysis of this sample set identified 50 novel somatic point mutations and small indels in coding sequences, RNA genes and splice sites as well as 28 large deletions, 6 inversions and 7 translocations. In terms of functional annotation, a hierarchy can be suggested. The first level includes somatic changes likely to be functional, such as the small indel in TP53, the large heterozygous deletion in FBXW7 and the bi-allelic deletion in CTNNA1. The second level consists of nonsynonymous mutations in genes previously noted to be targeted for somatic mutation in cancer or found to be recurrently mutated in this study, although the exact mutations are novel and their functional importance requires further investigation (JAK2, PTCH2, CSMD1 and NRK). The third level contains mutations known to be related to signal transduction in the malignant cells and/or found to be enriched during disease progression (MAP3K8, PTPRJ and WWTR1). The final level, by far the largest group, awaits the acquisition of new data. Analysis of germline variants for over 500 classic tumour suppressor genes and oncogenes27 identified a large number of SNPs, none of which was an unequivocal hereditary breast cancer susceptibility allele (data not shown).

The wide range of mutant allele frequencies suggests considerable genetic heterogeneity in the cellular population at the primary site. The mutation frequency range narrowed in brain metastasis and xenograft, indicating that the metastatic and transplantation processes selected for cells carrying a distinct subset of the primary tumour mutation repertoire. The overlap between the mutation frequency changes seen in the metastatic and xenograft samples argues that cellular selection during xenograft formation is similar to that during metastasis. Moreover, it suggests that the changes were not therapy-related, as the xenograft was established before any treatment. GO annotation of enriched mutations suggests that transcription factor activity is possibly selected for in the xenograft (Supplementary Table 14). In contrast to our observation of only two new tier 1 mutations at the metastatic site, sequencing of an indolent metastatic lobular breast tumour showed that the great majority of the mutations detected were completely novel when compared to the primary tumour9. However, in this instance, the metastatic process evolved over 9 years, as opposed to less than 1 year in the case we describe here. Another difference relative to the lobular cancer genome, where no structural variants were validated, was that paired-end sequencing detected 41 structural variations within this basal-like tumour genome. Our study of a primary tumour–metastasis–xenograft trio therefore demonstrates that, although additional somatic mutations, copy number alterations and structural variations do occur during the clinical course of the disease, most of the original mutations and structural variants present in the primary tumour are propagated. The preservation of all primary mutations in the xenograft suggests that early passage xenograft lines are valid for functional and therapeutic studies. However, the altered mutation frequency and elevated degree of copy number alterations suggest caution when interpreting the results of such experiments.

The first completed basal-like breast cancer genome is highly complex, as would be anticipated for a tumour type associated with chromosomal instability and DNA repair defects. Indeed, this cancer genome, in comparison with the two AML (acute myeloid leukaemia) cases published recently27,28, revealed a 3–4-fold increase in high-confidence SNVs genome-wide, suggesting a much greater background mutation rate. Future studies should extend our analysis approach of primary, metastatic and normal tissue trios and include affected individuals with diverse geographic origins to produce a complete catalogue of recurrent somatic and inherited variants associated with the development of this common malignancy.

Methods Summary

Illumina reads from peripheral blood, primary tumour, metastasis and xenograft were aligned to NCBI build 36 using MAQ5 and coverage levels were defined by comparison of SNPs identified by Illumina 1M duo arrays to SNVs called by MAQ. Somatic mutations were identified using our in-house programs glfSomatic and a modified version of the Samtools indel caller. Putative variants were manually reviewed and then validated by Illumina, 3730 or 454 sequencing. Structural variations were identified using BreakDancer21, manually reviewed and validated by a combination of localized Illumina read assembly, PCR and either 3730 or 454 sequencing. A complete description of the materials and methods used to generate this data set and results is provided in the Supplementary Information.