A whole-genome sequence and transcriptome perspective on HER2-positive breast cancers

HER2-positive breast cancer has long proven to be a clinically distinct class of breast cancers for which several targeted therapies are now available. However, resistance to the treatment associated with specific gene expressions or mutations has been observed, revealing the underlying diversity of these cancers. Therefore, understanding the full extent of the HER2-positive disease heterogeneity still remains challenging. Here we carry out an in-depth genomic characterization of 64 HER2-positive breast tumour genomes that exhibit four subgroups, based on the expression data, with distinctive genomic features in terms of somatic mutations, copy-number changes or structural variations. The results suggest that, despite being clinically defined by a specific gene amplification, HER2-positive tumours melt into the whole luminal–basal breast cancer spectrum rather than standing apart. The results also lead to a refined ERBB2 amplicon of 106 kb and show that several cases of amplifications are compatible with a breakage–fusion–bridge mechanism.

T wo main classes of breast cancer (BC) are distinguished by the expression of hormone receptors (HR); namely oestrogen receptor (ER) and progesterone receptor (PR). HR-positive breast cancers have a better prognosis than HRnegative breast cancers. This classification is helpful in clinic but the classes are still greatly heterogeneous. Transcriptomic analyses have identified intrinsic molecular subtypes differing by their expression programs, including luminal A and B, basal and HER2-enriched subtypes 1 . Schematically, luminal breast cancers express HR while basal breast cancers do not. Because of this heterogeneity, due to various cell-of-origins and molecular alterations, the response of breast cancer patients to therapy is variable and difficult to predict 2 . Conversely, molecular alterations represent potential therapeutic targets. Amplification of the human epidermal growth factor receptor ERBB2/HER2 gene, located in chromosomal region 17q12 occurs in around 15% of breast cancers and defines the category of clinical HER2positive breast cancers. Overexpression of the ERBB2/HER2 protein kinase receptor has enabled patients with HER2-positive tumour to benefit from antibody-based (for example, trastuzumab) and anti-kinase-based (for example, lapatinib) therapies that target this receptor [3][4][5][6][7] . These therapies, in use these last 15 years, have completely changed the prognosis of HER2-positive tumours. However, HER2-positive breast cancers are heterogeneous. They may be included in the HER2-enriched or luminal molecular subtypes, depending on whether they express ER, and this has a consequence on their response to targeted therapies. Indeed, several clinical trials have pointed out the variability in efficacy of trastuzumab-containing regimens depending on ER and HER2 status of the tumours, suggesting heterogeneous biological characteristics [8][9][10][11] . Furthermore, the presence of additional molecular alterations, such as mutations of PI3 kinase or PTEN phosphatase, also has an impact on this response [12][13][14][15][16] . Thus, HER2-positive breast cancers vary in their genome alterations, gene expression programs, and cell-of-origin and this impacts on their microenvironment 17,18 , prognosis and response to treatment. A comprehensive analysis of HER2positive tumour genomes should provide a definite basis for understanding this heterogeneity and the natural history of HER2-positive breast cancer and should help progress in the management of patients with HER2-positive tumours.
Here as part of the ICGC Breast Cancer Working Group effort, we have established gene expression profiles of 99 HER2-positive breast tumours and the complete genomes of a subset of 64 tumours were sequenced. A thorough analysis of this data set identifies four expression groups each of which exhibits distinctive genomic features such as mutations, copy-number variations (CNVs) or structural variations. Moreover, our results lead to a refined ERBB2 amplicon of 106 kb and show that some amplifications are compatible with a breakage-fusion-bridge (BFB) mechanism.

Results
Delineating four expression groups. A total of 289 HER2positive (HER2 þ ) BCs with frozen tumour samples identified from the French PHARE/SIGNAL programs 19,20 were analysed. HER2, ER and PR statuses were defined according to ASCO/CAP guidelines 21 (Methods). All cases were reviewed by breast pathologists and corresponding DNA and RNA samples were subjected to extensive quality controls (Methods). A total of 99 selected tumours were analysed for genome-wide expression profiles, out of which 64 tumours and matched normal DNA were subjected to whole-genome sequencing (WGS) (Supplementary Data 1).
An unsupervised hierarchical cluster analysis of the 99 HER2 þ samples delineated four main groups of gene expression, herein referred to as A, B, C and D ( Supplementary Fig. 1), validated on two external data sets 22,23 (Methods). In addition, these 99 samples were assigned to PAM50 intrinsic subtypes 24 , using an external data set 25 of 537 expression profiles that includes the whole spectrum of the BC disease and was hybridized on the same experimental platform to minimize technical biases (Methods).
Groups A and B were mostly composed of ER þ and luminal B tumours while ER-and HER2-enriched tumours were mostly observed in groups C and D (Table 1). Group D encompassed all (n ¼ 6) basal tumours. As expected, ESR1, PGR and ERBB2 genes expression levels were tightly correlated with ER, PR and HER2 statuses, respectively, and decreased from groups A to D. Gene sets specific to the mammary stem cells (MaSCs), luminal progenitor (pLum) and luminal mature (mLum) populations 26 were analysed, showing an increasing expression gradient from groups A to D for pLum and a corresponding decreasing gradient for mLum ( Supplementary Fig. 2a-c).
A total of 24,203 somatic structural variations (SVs) were detected (Methods), with a median of 327 per sample (range 132-952) in agreement with the reported values 36 . The  majority (75%) of these variants were intra-chromosomal ( Supplementary Fig. 7) and composed of 7,438 (41%) inversions, 5,889 (33%) deletions and 4,731 (26%) duplications. The numbers of intra-and inter-chromosomal SVs were correlated (Kendall tau ¼ 0.46, P ¼ 7 Â 10 À 8 , Kendall's test). Intra-chromosomal SVs were more frequent on chromosome 17 with a median of 30.5 SVs than on all other chromosomes (median of 6.0; Po1 Â 10 À 10 , Mann-Whitney U-test). Although no association was observed between the number of SVs on chromosome 17 and ER status, RNA groups A and C displayed more intra-chromosomal SVs on chromosome 17 than groups B and D (P ¼ 4 Â 10 À 3 , Mann-Whitney U-test for AC versus BD; Supplementary Fig. 8).
In several tumours, the CNV patterns as well as the clipped read orientations were consistent with a BFB mechanism. BFB is a DNA amplification mechanism that has been discovered in maize in the late 30s (ref. 38) but later evidenced in tumours [39][40][41] . It occurs when a chromosome undergoes a double-strand break ( Supplementary Fig. 10a) followed by the erroneous fusion of the two loose ends of the sister chromatids during replication. This results in the formation of a bridged dicentric chromosome that is torn apart during the next anaphase ( Supplementary Fig. 10b), inducing a further breakage that will repeat the process. At each cycle, stretches of DNA close to the breakpoints are duplicated head to head leading to the exponential accumulation of palindromic sequences in the region containing breakpoints ( Supplementary Fig. 10c). When mapped to the reference chromosome, these sequences will adopt a typical fold-back structure that is the hallmark of BFB. The process stops when the loose telomere is capped or fused to another chromosome to produce a translocation. In terms of sequence data there are two main hallmarks of BFB process. First the copy-number pattern follows some specific sequences 42,43 (Supplementary Fig. 10d), that is, not all possible discrete values are allowed. Second, when using paired-end sequencing, read pairs with discordant orientations mark the vicinity of breakpoints. In addition, reads spanning breakpoints may be clipped to the right or to the left depending on the direction of the fold. Therefore, the orientation of discordant pairs and clipped reads provide additional clues to characterize a BFB fold. Figure 2a-c gives three examples of such patterns, consistent with a BFB mechanism for ERBB2 amplification 39 . In some cases (Fig. 2c) several breakpoints were associated with inter-chromosomal events, suggesting that the amplification may have taken place on other chromosomes. We also observed patterns that are very unlikely to have occurred by a BFB process, such as the focal amplification depicted in Fig. 2d, suggesting that other mechanisms may also be involved in ERBB2 gene amplification such as the formation of double-minutes chromosomes 44 . Finally, some intricate fold patterns may also suggest that several mechanisms may sometimes combine at the same locus.

Discussion
We used gene expression as an operational basis to classify HER2 þ tumours into four groups that were further characterized in terms of interdependent genomic variables. A synoptic outline of the dependencies between these variables, provided by multiple correspondence analysis (MCA) (Methods) is shown on Fig. 3. Groups A and B were ER þ tumours, harboured genomic alterations observed in luminal BCs and were close to the luminal B intrinsic subtype. Main features of group A (left part of Fig. 3) were the absence of TP53 mutations, low FGA, higher number of SVs on chromosome 17 and specific amplifications in CCND1 (11q13) and RPS6KB1 (17q23) regions. In contrast, tumours in groups C and D were mostly ER À and close to the HER2enriched intrinsic subtype. TP53 mutations were frequent in these groups and analysis of genome breakages showed higher FGA and some pattern of BRCAness. Moreover, analysis of the mLum and pLum signatures suggested that A and B tumours derive from differentiated mature luminal cells, while C and D derive from luminal progenitors, like basal tumours. Altogether, these results suggest that HER2 þ BCs do not per se represent an actual intrinsic subtype but, instead, are distributed along the whole BC spectrum, from ER þ luminal to ER À basal phenotype, with genome alterations in accordance to these phenotypes and are incidentally characterized by a specific gene amplification.
Although patients with HER2 þ BC benefit from HER2targeted therapies, the response is highly variable and a significant number of patients display primary or secondary resistance. Heterogeneity of these cancers may explain the extent of this variability. Thus, how important the molecular and cell-of-origin definition of HER2 tumours may be to understand BC oncogenesis, the prime interest of the medical community lies  Sample extraction. Samples had full face sectioning performed in with Tissue-Tek optimal cutting temperature (O.C.T) compound to estimate the percentage of malignant epithelial nuclei in the sample relative to stromal nuclei. Macrodissection was performed if required to excise areas of non-malignant tissue. DNA and RNA were then extracted from the same sample. Total genomic DNA was extracted with phenol-chloroform after proteinase K digestion, followed by the precipitation of nucleic acids in ethanol. DNA was quantified using Nanodrop spectrophotometer ND-1000 (ThermoScientific, Wilmington, USA) and Qubit BR DNA assay (Invitrogen). RNA was also extracted using the miRNeasy miniKit (Qiagen) in accordance to the manufacturer's protocol. RNA was quantified using Nanodrop spectrophotometer ND-1000 and the purity and integrity were assessed by the Agilent 2100 Bioanalyzer and RNA 6000 Nano Labchip Kit (Agilent Biotechnologies, Palo Alto, CA, USA). All matched peripheral bloods have been centralized and then extracted using the salting-out method with a Qiagen Autopure LS (Courtaboeuf, France) in the Fondation Jean Dausset CEPH laboratory. To confirm the matching between tumour and blood DNA issued from a same patient, the AmpFLSTR Identifiler PCR Amplification Kit (Life Technologies) has been used.
Selection of high-quality HER2-positive samples. A total of 289 tumour samples (and matched blood samples) were processed as described in the previous section. A special attention was then paid to the selection of a high-quality subset of these samples for further analysis. Proportion of tumour cells were estimated on frozen tumour sections by pathologists and only those estimated with at least 50% tumour cells were kept. All DNA and RNA samples were subjected to quality controls (RNA integrity number Z7; DNA integrity checked on agarose gel) leaving a subset of 131 samples. The tumour DNAs of these 131 patients were hybridized on Illumina OmniExpress arrays to establish the genomic profile of each tumour. These genomic profiles were used to control the presence of the ERBB2 gene amplification and to obtain another estimation of the tumour purity (see SNP array processing section below for details). A missing ERBB2 amplification/gain or a very low estimated purity caused the sample to be discarded. At the end of the process, 99 samples (hereafter called the INCa-HER2 þ data set) met the required quality criteria, out of which 64 were subjected to WGS.
Gene expression array processing and quality control. Tumour RNA samples were analysed for expression profiling on Affymetrix U133 Plus 2.0 GeneChip. Quality control was asserted by using R package affyPLM 45 . Raw feature data were normalized using robust multi-array average 46 method in R package affy 47 . Probe sets corresponding to control genes were filtered out.
PAM50 subtypes classification. Each tumour gene expression profile from the INCa-HER2 þ data set was assigned to a PAM50 breast molecular subtype 24 . The PAM50 classifier was built using a training cohort, which aims at capturing the major breast cancer types in the general population. As the INCa-HER2 þ data set was exclusively composed of HER2 þ samples, the underlying distribution of expression profiles was likely to be not representative of the whole spectrum of breast cancers expression profiles 48 . To overcome this difficulty, we collected 537 Affymetrix expression profiling of all types of breast cancers from the Carte d'Identité des Tumeurs (CIT) project 25 from the French Ligue Nationale Contre le Cancer. This cohort was hybridized on the same microarray and on the same experimental platform (at IGBMC Strasbourg) than the INCa-HER2 þ data set, thus minimizing technical biases. The CIT data set is available on ArrayExpress under accession number E-MTAB-365. First, all expression profiles from CIT and INCa-HER2 þ data sets were normalized together using the robust multi-array average method 46 as above. Then, to maintain the relative proportions of the PAM50 subtypes, each of the 99 INCa-HER2 þ samples was included one by one into the CIT data set and assigned to a PAM50 subtype using the R package genefu 49 v1.12.0 with robust scaling of the gene expressions centroids.
Unsupervised clustering of transcriptomic array profiles. The clustering of transcriptomic array profiles was performed in two steps. The first step aims at selecting probesets carrying the most differential expression across the data set. This was done by using two criteria: (a) for each probeset, we tested whether its variance across samples was different from median of the variances of all the probesets. The statistics and criterion used were the same as in the filtering tool of BRB ArrayTools software 50 , where the variance to median ratio is compared with a percentile of the chi-square distribution. Only probesets satisfying a P value of this variance test o10 À 3 were kept. (b) In addition, probesets were ordered by relative s.d. and the top 5% percentile was retained. After this step, we were left with 274 probesets corresponding to 196 known genes. The second step was an agglomerative hierarchical clustering using Pearson correlation as a similarity measure and the Ward's minimum variance linkage method.
Validation of RNA groups on TCGA and Metabric HER2 þ data sets. Two publicly available expression profiling data sets were collected for validation purpose. The first one was the Breast Invasive Carcinoma (BRCA) collection of 1,098 tumours from TCGA 22 . We selected the subset of 114 samples defined as HER2 þ in the original paper, out of which 75 had associated Agilent G4502A microarray expression data. The second validation set came from the Metabric breast cancer collection 23 . We selected a subset of 122 HER2-amplified tumours with IHC status equal to 2 þ or 3 þ and copy number of ERBB2 locus gained according to SNP6 array (no IHC-FISH status was available in this data set). Raw Illumina HT-12-v3 expression profiling data were obtained after data access authorization and normalized using R package beadarray 51 . To allow the comparison between expression data from different array technologies, we reduced the RNA group signature defined with probesets to genes by using the best genes according to JetSet 52 . This resulted in 196 genes out of which 162 and 180 were defined in TCGA and Metabric data respectively. We used two different and complementary methods for validation on both sets. The first method is a single sample predictor (SSP) method. First the centroid of each RNA group (labelled A, B, C, D) was computed on the centred-reduced INCa data set. Then, for each single external sample (that is, from TCGA or Metabric), we computed the Spearman rho correlation coefficient with each of these centroids and the sample was assigned to the RNA group with the largest correlation coefficient or to no group (O) if the correlation coefficient was o0.1. Independently, the whole external data set was clustered into four clusters (labelled 1, 2, 3, 4) using the same agglomerative hierarchical clustering method as before. The stability of the RNA groups in the external data set was then evaluated by examination of the (A, B, C, D, O) Â (1, 2, 3, 4) contingency table and practically measured by the fraction of the most abundant RNA label in each cluster (that is, 100% if all clusters are composed of a single RNA group and, about 25% if RNA groups are spread randomly across clusters). The second method (joined) consisted in merging the independently centred-reduced INCa and external data sets. This joined set was then clustered into four clusters (labelled 1, 2, 3, 4) using the same agglomerative hierarchical clustering method as before and the stability of the RNA groups was then evaluated by examination of the (A, B, C, D) Â (1, 2, 3, 4) contingency table of the data reduced to the INCa-HER2 subset (that is, for which the actual RNA group labels are known). A RNA group label can further be assigned to each of the four clusters (by considering the majority label in the INCa-HER2 subset) and, eventually, to each external sample. Finally, for both methods, clinical and biological metadata (namely ER status, PAM50 subtypes and P53 mutation status) per RNA group were also compared.
Results are shown in Supplementary Figs 11 À 15. Supplementary Fig. 11 displays the RNA groups obtained on the original INCa-HER2 data set using all of the 196 probesets as the reference. Supplementary Figs 12 and 13 (resp. 14 and 15) displays the results obtained on the TCGA (resp. Metabric) data set for both SSP and joined methods. The two methods gave similar results, with 76% (TCGA) and 81% (Metabric) of samples classified in the same RNA group by both methods. The overall stability of RNA groups is good (80% (SSP), 75% (joined) for TCGA and 66% (SSP), 73% (joined) for Metabric). Groups A and D appeared more stable than B and C. Finally, in terms of metadata, the same general features were observed for the three data sets: (a) over-representation of ER þ (resp. ER À ) in groups A and B (resp. C and D); over-representation of luminal B (resp. HER2-enriched and basal) in groups A and B (resp. C and D). As for TP53 mutations, the situation is less clear-cut: the under-representation of TP53 mutations in group A is observed both for INCa and TCGA data sets but is not significant for the Metabric data set. However it should be pointed out that (a) this data set exhibits a lot of TP53 undetermined cases (51%) and (b) for determined cases the fraction of TP53 mutated cases is unusually low (28% versus 44% and 47%, respectively in INCa and TCGA data sets).
Finally, all cases in the present study were included in the PHARE/SIGNAL cohorts (NCT00381901-RECF1098, www.e-cancer.fr) with a median follow-up at 58 months (interquartile range 46.5-62.5). In this smaller subset and on this period, recurrence events were very limited (4 out of all 99 cases) and the relationship between RNA groups and outcome could not be analysed.
Gene expression signature projection. Gene signatures for mammary stem cell, progenitor and mature luminal 26 were projected by using the single sample extension of GSEA 53 where gene expression values for each single sample were rank-normalized, and an enrichment score was produced using the empirical cumulative distribution functions of the genes in the signature and in the remaining genes.
SNP array processing. Illumina Omni1 Quad and OmniExpress SNP arrays quality control and normalization was performed using GenomeStudio Genotyping Module. A supplementary normalization step was applied using the tQN algorithm 54 . Allelic CNVs were analysed using GAP 55 and ASCAT 2.0 (ref. 56). Tumour purity was estimated using both approaches, with a good correlation, although GAP estimates were found to be always above ASCAT estimates ( Supplementary Fig. 16a). This allowed correcting these estimates by using their geometric mean (Supplementary Fig. 16b). Estimates provided by pathologists were systematically above all others and were not used for the final estimation. Ploidy was estimated by using DNA indexes provided by the two methods and few discrepancies were resolved by nearest neighbours clustering ( Supplementary  Fig. 17). It should be noted that although DNA index of diploid tumours was centred around 1.1, DNA index of aneuploid tumours was centred on 1.7, therefore these tumours were probably resulting from a tetraploidisation event followed by chromosomal losses.
Sequencing and genome alignment. WGS was performed on 64 tumours and matched normal DNA from the same individuals. Tumour and normal DNAs were sequenced to 445-fold and 430-fold coverage, respectively. Illumina HiSeq2000/ HiSeq2500 genome analysers and Illumina paired-end sequencing protocols were used for all samples, read lengths ranging from 100 to 126 base pairs. Paired-end reads were aligned to the human genome (GRCh37) using the BWA aligner 57 . Alignments were refined using GATK 58 and Picard (http://broadinstitute.github.io/ picard/) software suites. Duplicates were removed from the sample BAM files for further analysis. Raw and mapped sequences from all produced HiSeq lanes were checked using in-house pipelines that collect a set of important metrics reflecting the overall quality of the sequencing data. Lanes showing poor quality were manually discarded.
SNV variant calling. Somatic SNVs were called using MuTect 59 v1.1.5 part of the GATK suite. To improve performance, data from dbSNP Build 132 and COSMIC v65 (http://www.sanger.ac.uk/genetics/CGP/cosmic/) were supplied as parameters to MuTect. On top of this, we used in-house post-processing filters to improve the specificity of mutation calls. These filters include adjustments on strand bias, local coverage, position of alternate allele within the read, mapping quality, repeated regions. Moreover, a panel of normal genomes, generated on the same sequencing technology, was used to dismiss systematic sequencing errors and/or low frequency polymorphisms. SNV that passed all these filters were then annotated using the variant effect predictor 60 tool v75.
Mutations in cis-regulatory regions. Annotation of somatic variants that are located in cis-regulatory regions was performed using OncoCis 32 , based on human mammary epithelial cell epigenomic data sets. Annotated somatic variants were further processed to classify those in potential promoter or enhancer regions. Somatic SNVs in potential promoter regions were extracted if they localize around H3K4me3 histone marks. Somatic variants in potential enhancer regions were extracted if they localize around H3K4me1 or H3K27ac histone marks. In addition, mutations were further annotated using FANTOM5 predictions 61 , which uses cap analysis of gene expression to detect potentially active transcription from promotor or enhancer regions. In both cases (promotors and enhancers regions) we computed, per patient, the fraction of somatic SNVs located in these regions (that is, number of SNVs in regions divided by the total number of SNVs in patient).
Copy-number analysis. The analysis of CNVs from whole-genome sequence data was performed in three steps. First, after reads alignment using the Burrows-Wheeler alignment tool (BWA), raw read counts at each genomic position were corrected by GC content using the approach described by Benjamini and Speed 62 . We slightly improved the construction of the empirical dependency model by using only raw counts in mappable 63 regions, by sampling positions (10 7 ) over the whole genome and ignoring positions with extreme read counts. In a second step, the GC-corrected relative read counts (rRC, defined as raw counts divided by predicted counts) were computed within 1 kb windows in mappable regions and smoothed along the chromosomes (Kalman filter) to get a well-resolved distribution of the observed levels. This distribution was used to evaluate the tumour contamination by normal DNA by fitting a sum of evenly spaced Gaussian peaks (the separation between two consecutive peaks being equal to 1/(Q þ 2a(1 À a)); with aA[0,1] the contamination and Q the tumour mean ploidy). Finally, a univariate Gaussian Hidden Markov Model (HMM, R package RHmm) was build, with parameters (levels and variance) inferred from the previous step. This HMM was then used to segment the GC-corrected relative read counts signal along the chromosomes. The resulting segment levels are expressed as a relative tumoral CN (rCN) corresponding to the states of the HMM, that is, rCN ¼ 1, the reference CN state, corresponds to the tumour mean ploidy (for example, 2 for a diploid tumour), 0.5 corresponds to the loss of half of the copies of the reference CN state (that is, hemizygous state for a diploid tumour). These relative tumoral CN segments were further used to define the gain/loss status of regions, using the following scale: rCN43: amplification; 3ZrCN41: gain; rCNo1: loss; rCN ¼ 0: homozygous deletion.
Firestorms and large-scale transitions. A chromosome arm of high genomic complexity and which harbours multiple closely spaced amplicons is said to be in firestorm 35 . Using the CNV profiles computed from WGS data (see above) on each tumour, a chromosome arm was claimed in state of firestorm if it had at least 20 genomic segments reaching at least 10 different levels of copy number and if, among those segments, at least 5 were amplifications (rCN43).
Large-scale state transitions (LST) were defined 37 , as chromosomal break between adjacent regions of at least 10 Mb each. As suggested 37 , the number of LSTs in the tumour genome was estimated for each chromosome arm independently (not accounting for centromeric or unmappable regions breaks) and after filtering and smoothing of all variations o3 Mb.
Structural variants calling. Somatic SVs were identified by using two complementary signals from read alignments: (a) discordant pair mapping (wrong read orientations or insert-size larger than expected) and (b) soft-clipping (first or last bases of read unmapped) that allows to resolve SV breakpoints at the base pair. Each SV candidate was defined by a cluster of discordant pairs and one or two clusters of soft-clipped reads. The discordant pairs cluster defined two associated regions (possibly on different chromosomes) and the soft-clipped reads cluster(s), located in these regions, specified the potential SV breakpoint positions. We further checked that the soft-clipped bases at each SV breakpoint were correctly aligned in the neighbourhood of the associated region. Events were then classified as germline or somatic depending on their presence in the matched normal set of events. Somatic SVs were further filtered according to several criteria: at least 2 discordant read pairs per cluster; at least 2 soft-clipped reads per cluster; at least 15 aligned clipped bases and at least 1 breakpoint should be located in a mappable region 63 . Structural variants were then classified as intra-chromosomal event (deletion, duplication, inversion) or inter-chromosomal (breakpoints on different chromosomes) according to discordant read pairs type.
BFB amplification analysis. We used two independent sources of information for testing the BFB amplification mechanism for the ERBB2 amplicon: the CNV patterns and discordant read pairs and clipped reads orientation at breakpoints. The main hallmark of BFB is that the CNV pattern follows some specific sequences 43,64 . A technical difficulty is that this requires the determination of tumoral absolute integer copy number (aCN), whereas NGS data primarily provides relative read counts (rRC, see section 'Copy-number analysis'); aCN linearly relates to rRC as a function of the contamination by normal DNA and the mean ploidy of the tumour. So we first determined precisely these two quantities by plotting and fitting rRC versus allelic frequencies of SNPs over all the genome. We then segmented the aCN profile by using the same HMM procedure as before (section 'Copy-number analysis'). Finally, we used the algorithm developed by Zakov and Bafna 42 to check if the observed aCN sequences were compatible with a BFB sequence and/or to look for the longest compatible sequence. This is basically the same approach used by Greenman et al. 65 . Another possible approach is to estimate the rearrangement process and copy numbers simultaneously 43 . However, the later approach has a risk of overfitting the data, especially for short sequences. Therefore, except for one case (PI034), we did not readjust the predicted aCN levels to fit with a BFB sequence. In the case of PI034, we had to shift the read count signal by a constant value (2) to get a valid BFB sequence. This may be explained by the presence of two copies of the homologous chromosome 17 in this tumour. Beyond checking for the validity of a putative BFB sequence, the Zakov and Bafna algorithm 42 also provides the corresponding folding pattern (visually represented by the purple curved line on illustrations provided in Fig. 2a À c). This allows for an additional and independent confirmation of the BFB fold. To this purpose we superimposed the location of somatic SV breakpoints (see section 'Structural variant calling') to the CN plot. Of course the curve should ideally fold at identified breakpoints, but, more importantly, the folding pattern should fit with the clipped reads orientation. More precisely, when the fold occurs to the left side, the reads should be clipped on their right (3') side (that is, they align on their left part) and, NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12222 ARTICLE conversely, when the fold occurs to the right side, the reads should be clipped on their left side. Therefore by simply colouring the breakpoint locations, we could visually check if the clipped reads orientation was compatible with the proposed BFB folding pattern. In all of the cases we examined, this test was always successful therefore providing a strong support in favour of the hypothesis that the folding pattern was indeed generated by a BFB mechanism.
Multiple correspondence analysis. MCA is a multivariate statistical approach suited to the exploratory analysis of nominal categorical data 66 . It is an extension of correspondence analysis when more than two variables are involved. It can also be considered as an adaptation of principal component analysis (PCA) to nominal (instead of quantitative) data, using the chi-square (instead of euclidean) metric. Briefly, categorical variables are first encoded into boolean complete disjonctive form. For instance the RNA group variable has four categories (A, B, C, D), a patient in category A is therefore encoded as RNA.A ¼ 1, RNA.B ¼ 0, RNA.C ¼ 0, RNA.D ¼ 0. Then this complete disjunctive table is submitted to standard correspondence analysis using the ADE4 R package 67 with Benzécri correction 68 of the eigenvalues. Quantitative variables (for example, FGA) should be first encoded to categorical. To this purpose, we use the upper, lower and interquartile categories. The purpose of MCA is to construct a joint map of patients and variable categories in such a way that a patient is close to a category it is in, and far from the categories it is not in (conversely a category is close to the patients that have it and far from patients that do not have it). This map has a simple geometrical interpretation 69 , thanks to the centroid principle: MCA plots a category point at the centre of gravity of the patient points for those patients that choose that category (conversely, at a scaling factor, patient points are located at the centre of gravity of categories they choose). Finally, it should be pointed out that beside variables categories used for the analysis, it is also possible to project additional categories (or patients) onto the map. The projected category is simply located at the centre of gravity of those patients that choose this category.
Data availability. Raw data have been uploaded to the European Genome-phenome Archive (EGA; http://www.ebi.ac.uk/ega) under the overarching study accession number EGAS00001001431. This study includes all data from wholegenome sequencing, genotyping arrays and gene expression arrays used in this work. Access to whole-genome sequences and genotyping arrays is subjected to the ICGC data access authorization (DAC: EGAC00001000010 and Policy: EGAP00001000037). The CIT data set is available on ArrayExpress (http:// www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-365. METABRIC expression data sets are available at EGA under study accession number EGAS00000000083. TCGA expression data set is available on the TCGA data portal (https://tcga-data.nci.nih.gov/docs/publications/brca_2012). The remaining data are contained within the Article or Supplementary Information files, or available from the authors upon request.