DNA copy number motifs are strong and independent predictors of survival in breast cancer

Somatic copy number alterations are a frequent sign of genome instability in cancer. A precise characterization of the genome architecture would reveal underlying instability mechanisms and provide an instrument for outcome prediction and treatment guidance. Here we show that the local spatial behavior of copy number profiles conveys important information about this architecture. Six filters were defined to characterize regional traits in copy number profiles, and the resulting Copy Aberration Regional Mapping Analysis (CARMA) algorithm was applied to tumors in four breast cancer cohorts (n = 2919). The derived motifs represent a layer of information that complements established molecular classifications of breast cancer. A score reflecting presence or absence of motifs provided a highly significant independent prognostic predictor. Results were consistent between cohorts. The nonsite-specific occurrence of the detected patterns suggests that CARMA captures underlying replication and repair defects and could have a future potential in treatment stratification.

The allele-specific DNA copy number profile of a tumor is a window into its past history and its future evolutionary potential [1][2][3]. In general, we may consider a copy number profile as the accumulated result of a series of genomic events [4][5][6][7]. Specific DNA replication and repair errors may leave particular traces throughout the genome in the form of recurring local patterns, or motifs [8][9][10][11]. We hypothesized that such motifs represent a substantial proportion of the copy number variation in a tumor, and that they partly explain the high intertumor copy number heterogeneity frequently observed in cancer. We further hypothesized that the presence or absence of specific motifs is informative of a tumor's past and future evolutionary trajectory. Detailed characterization of such features would thus allow prediction of disease behavior and could potentially direct choice of treatment.
Here, we present an analysis of regional nonsite-specific motifs from allele-specific DNA copy number profiles in breast cancer. The core of this framework is the Copy Aberration Regional Mapping Analysis (CARMA) algorithm, which creates a compact representation of the aberration architecture. Conceptually, the algorithm represents copy number profiles as real-valued functions over the genomic domain and derives a small set of scores representing distinct regional features. The proposed method takes into account copy number amplitude, spatial distribution of copy number break points and allelic imbalance, and captures regional fluctuations in copy number, a signature feature of chromothripsis and chromoplexy. By generating a low-dimensional representation of the copy number data, the proposed algorithm also avoids the curse of dimensionality.
CARMA is related to multiple algorithms designed to detect specific copy number aberration patterns in tumors. The chromosomal instability index (CINdex) [12] and the genomic instability index (GII) [13] both quantify the total amount of genomic aberrations. Other algorithms have been proposed for detection of simplex and complex copy number events [9] and structural rearrangement patterns [14], for example the complex arm-wise aberration index (CAAI). An algorithm identifying the presence of multiple aberration patterns with application to ovarian cancer was recently proposed [11]. In addition, several methods have been proposed to identify copy number features recurring across tumors, such as GISTIC [15,16].
We applied CARMA to four breast cancer patient cohorts (METABRIC, Oslo2, OsloVal, and ICGC; see "Methods" for details). An integrated score was derived and shown to have superior prediction performance for breast cancer specific survival compared with other available clinical and molecular stratifications. The relation between copy number motifs and established driver gene based classifications of breast cancer was investigated. The analysis described in the paper is applicable to allele-specific copy number data from all types of cancer and any type of platform, including SNP arrays and high-throughput sequencing.

Results
Brief outline of the analysis approach. CARMA is applicable to allele-specific copy number profiles from one or several tumors, obtained from SNP array analysis or DNA high-throughput sequencing. The algorithm extracts multiple local features which are accumulated across genomic regions by numerical integration to form six regional scores. These scores reflect the degree of amplification (AMP), deletion (DEL), complexity (STP and CRV), such as chromothripsis and chromoplexy, loss of heterozygosity (LOH) and allelic imbalance or asymmetry (ASM). More details and precise mathematical definitions are deferred to "Methods." The analysis pipeline is depicted in Fig. 1a-d. An application of the algorithm to three breast tumor samples in the Oslo2 cohort and with chromosome arms as regions is shown in Fig. 1e. Specific regional features are discernible, illustrating how CARMA can be used to perform between-sample comparison of copy number features that are not locus specific.
Relation to other methods. CARMA was compared with two methods for detection of nonsite-specific copy number aberrations in single samples: CAAI [9] and CINdex [12]. The CAAI algorithm identifies chromosome arms with complex rearrangements, while CINdex detects regional gains and losses. We also compared CARMA with GISTIC, a well-established method for detection of regions with significant copy number change across multiple samples [15,16]. Figure 2a shows circos plots of CARMA profiles for two selected samples in the METABRIC cohort, together with the results from GISTIC, CINdex, and CAAI.
As expected, CAAI correlates with the two CARMA complexity scores STP and CRV, but the relative sizes of STP and CRV provide additional detail (e.g., on chromosome 16 in the sample MB-0010). CINdex captures both gains and losses, but in the two selected samples it correlates more strongly with DEL than with AMP. This is not unexpected, since the CINdex algorithm includes a relative weighting of gains and losses, while CARMA does not. The use of six distinct measures of copy number distortion in CARMA generally provides more detail than CINdex. For example, in a region with loss of one allele and gain of the other (i.e., a uniparental disomy), such as chromosome 22 in MB-0010, CARMA reports LOH and ASM, while CINdex reports no alteration (Fig. 2a). Observe also that the complex aberration on chromosome 11p in MB-0028, which is reported by CINdex, is positive for all six CARMA scores, including STP and CRV.
For GISTIC, regions of significant gain or loss were identified based on all METABRIC samples; a binary score is subsequently assigned to each sample in each such region based on the presence or absence of a loss or gain. Regions with significant loss or gain according to GISTIC partially overlap with DEL and AMP, respectively. Next, we investigated the distribution of CARMA scores within each region identified by GISTIC (Fig. 2b). A strong overlap is observed between GISTIC gain and high AMP score, and between GISTIC loss and high DEL score. In addition, there is considerable diversity in the CARMA spectrum within regions called as gains or losses according to GISTIC. For example, the relative contribution of LOH is highly variable across GISTIC loss regions. Similarly, the relative contribution of complex aberrations captured by STP and CRV varies across GISTIC gain regions.
Molecular subgroups have distinct CARMA signatures. We next considered the distribution of CARMA scores within established molecular stratifications of breast carcinomas (PAM50 and IntClust). PAM50 [17,18] is an expression-based classification system defining five distinct subgroups of breast tumors based on the correlation to a set of 50 genes. IntClust [1,19] identifies ten different subtypes based on the pattern of copy number aberrations exerting an effect on gene expression in cis. The distribution of CARMA scores within these classification systems was explored in four different breast cancer data sets of varying sample size (n = 1943, n = 276, n = 165, and n = 553). The percentage of tumors with scores exceeding a median threshold was plotted for all arm scores and for each PAM50 and IntClust subtype separately (Fig. 3a and Supplementary Figs. 1-4). The CARMA scores consistently reflect differences in the landscapes of genomic architecture in the different biological and clinical patient groups. This visual overview of aberration patterns highlights subtype specific features such as frequent allelic loss on 17p and frequent gain and high complexity on 17q in IntClust1; gain on 1q, frequent asymmetric gain and complex aberrations on 11q and allelic loss on 16q in IntClust2; etc. The signatures of regional CARMA scores within the PAM50 subtypes highlight known features, including whole arm 1q gain/16q loss in luminal A tumors, the more complex copy number aberrations in luminal B tumors, the 17q alterations dominating Her2-enriched tumors, and the global instability of basal-like tumors. Three-dimensional scatter plots of CARMA scores were plotted for all tumors in the Oslo2 cohort (n = 276) and METABRIC cohort (n = 1943) (see Fig. 3b). Trend curves and subtype centroids both demonstrate a high degree of consistency between the two cohorts.
Fig. 1 Outline of the CARMA algorithm. a Complete analysis pipeline. b Steps included in the CARMA analysis. The input is one or more allele-specific copy number profiles. The algorithm extracts local features and accumulates these across genomic regions to form six regional scores. c Calculation of CARMA scores within a specified region. d Prototype patterns captured by each of the six CARMA scores. e An application of the algorithm to three breast tumor samples in the Oslo2 cohort. Lower panel: total copy number and allele fraction as a function of genomic locus. Upper panel: circos plots of regional (arm-wise) CARMA scores.
Predicting survival from regional scores. To assess the association between disease-specific survival (DSS) and genome-wide CARMA scores, a univariate Cox proportional hazards regression model was fitted with each score as a covariate (see Supplementary Table 1). For this purpose, we used the largest cohort (METABRIC set). All scores were associated with survival (P < 10^−6; Score test) and the strongest associations were found for the scores STP and CRV (P < 10^−18; Score test). We next split the METABRIC cohort into a discovery cohort (n = 1295) and a test cohort (n = 648). We fitted a multivariate Cox regression model to DSS and progression-free survival (PFS) data in the discovery cohort based on the six predictors. The predictors were defined by taking an unweighted mean across all the regional (arm-wise) CARMA scores (Fig. 3c). The fitted model was next applied to the test set, producing a single unweighted prognostic value per patient. Thresholds corresponding to the 1/3 and 2/3 quantiles (tertiles) were applied to classify samples into groups of low, intermediate, and high risk, with numerical values ranging from 1 to 3. This final score was termed the CARMA Prognostic Index (CPI). An alternative prognostic index was defined using the 252 arm-wise CARMA scores directly as predictors and fitting a Cox regression model with Lasso penalty to the training set. Coefficients derived from the analysis (Supplementary Fig. 5) were used as weights to calculate a weighted prognostic index termed CPI_weighted.
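The construction of the unweighted index can be illustrated with a minimal sketch. The function names, the toy data, and the coefficient vector `beta` are hypothetical; in the study the coefficients come from the Cox model fitted to the discovery cohort, and the text does not state on which set of patients the tertile thresholds were computed, so computing them on the scored patients themselves is an assumption.

```python
def genome_wide_predictors(arm_scores):
    """Unweighted mean of each of the six scores across arms.

    arm_scores: list of per-arm 6-tuples (AMP, DEL, STP, CRV, LOH, ASM).
    """
    n_arms = len(arm_scores)
    return [sum(column) / n_arms for column in zip(*arm_scores)]


def prognostic_value(predictors, beta):
    """Linear predictor of a fitted Cox model: sum_k beta_k * predictor_k."""
    return sum(b * x for b, x in zip(beta, predictors))


def tertile_cuts(values):
    """1/3 and 2/3 quantiles, interpolating linearly between order statistics."""
    ordered = sorted(values)

    def quantile(p):
        pos = p * (len(ordered) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(ordered) - 1)
        return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

    return quantile(1 / 3), quantile(2 / 3)


def risk_group(value, cuts):
    """1 = low, 2 = intermediate, 3 = high risk."""
    return 1 + sum(value > cut for cut in cuts)


# Toy data: nine tumors with two arms each; beta is a made-up coefficient vector.
beta = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
tumors = [[(i, 0, 0, 0, 0, 0), (i, 0, 0, 0, 0, 0)] for i in range(9)]
values = [prognostic_value(genome_wide_predictors(t), beta) for t in tumors]
cuts = tertile_cuts(values)
groups = [risk_group(v, cuts) for v in values]
```

In this toy example the nine tumors fall into three equally sized risk groups, which is the intended effect of cutting at the tertiles.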
To compare the efficacy of CPI and CPI_weighted with established clinically and biologically relevant parameters, we fitted a univariate Cox regression model in the METABRIC test set using the prognostic indices and the clinical parameters as covariates (Table 1 and Supplementary Tables 2-3). The P value for CPI from the analysis was lower than for any of the other clinical parameters for both DSS (P = 1.9 × 10^−13; Score test) and PFS (P = 5.7 × 10^−13; Score test), and CPI also performed better than CPI_weighted. However, CPI_weighted remained strongly significant for both DSS (P = 5.2 × 10^−10; Score test) and PFS (P = 3.7 × 10^−7; Score test), with P values lower than those of many of the other established parameters. Hazard ratios for CPI and other clinical variables from univariate Cox regression analysis are shown in Fig. 3d.
Cox regression modeling was also performed to assess the effect of the prognostic indices with adjustments for other variables (see Table 1 and Supplementary Tables 2-3). CPI consistently showed smaller P values than all other clinical variables. CPI_weighted also remained significant when adjusting for other variables (Supplementary Tables 2-3). Hazard ratios from multivariate Cox regression models where the effect of CPI is adjusted for the effect of clinical variables are shown in Fig. 3f.
CPI was next used to stratify patients into low, intermediate, and high-risk groups as described above in the three validation cohorts with survival data available (METABRIC test set, OsloVal, and ICGC). A logrank test was performed for the three groups in each data set (Fig. 3e). P values were significant when considering both DSS (P < 10^−12 in METABRIC test, P < 10^−4 in OsloVal, and P = 0.003 in ICGC) and PFS (P < 10^−12 in METABRIC test; PFS data were not available for OsloVal or ICGC).
Finally, the unweighted continuous prognostic score that was used to obtain the CPI was used to calculate Harrell's C score in the METABRIC test set. The C scores obtained from the analysis were 0.65 (95% CI: 0.62-0.69) and 0.64 (95% CI: 0.61-0.68) based on DSS and PFS, respectively.

Fig. 3 (caption fragment; surviving panel labels: IntClust 10, AMP, DEL) a The height of each bar represents the proportion of samples in the subgroup with arm score above the median, calculated across all arms within each CARMA score and ignoring zeros. b Three-dimensional scatter plots of tumors using three of the CARMA scores designed to detect three major categories of copy number aberration patterns in tumors (amplifications AMP, allelic loss LOH, complex rearrangements CRV …

Discussion
… alterations in tumors is the identification of recurrently deleted and amplified genes which may define key driver events in carcinogenesis or potential targets for treatment. We and others have previously shown that in addition to this gene centered or locus centered approach, the structural changes provide important information for classification and survival prediction [8,9,20]. The methodology presented in this study complements gene specific analyses by providing a systematic framework to characterize the information embedded in the copy number profile of a tumor. CARMA determines the presence and relative contributions of six distinct copy number features in genomic regions and in the genome as a whole. By focusing on pervasive patterns or motifs in the genome rather than locus specific events, the algorithm captures footprints of past and ongoing segmental DNA alterations.
Known drivers of such alterations are DNA replication and repair errors [8][9][10][11] . In this study, we used CARMA to assign scores to individual chromosome arms and to the whole genome. The CARMA algorithm is not bound to any particular genomic resolution though, and the tool supports assignment of individual scores to whole genomes, chromosomes, chromosome arms, or genomic bins of any desired width. For a given genomic resolution, scores for individual genes can also be obtained by inheritance of the respective regional score. Irrespective of the selection of regions on which to assign scores, the fact that regions are identical across tumors allows CARMA scores to be used directly as features in clustering, regression, and classification. Normally, the number of features will also be quite small, thus substantially reducing statistical problems related to high dimensionality.
CARMA reveals a rich spectrum of different copy number motifs across samples and also between regions within an individual sample. By combining six different measures of copy number aberration, it provides a more detailed picture of genomic architecture than GII, CAAI, and CINdex. CARMA and GISTIC represent complementary tools with different aims. Combining CARMA with GISTIC offers the possibility of providing a detailed picture of the aberration spectrum restricted to regions that are significantly altered across many samples.
Molecular taxonomy of breast cancer based on gene expression has proved important for the biological understanding of the disease [17]. IntClust [1] is a more recent driver-based classification of breast cancer and has been shown to also reflect degree of chemosensitivity [21]. The CARMA scores revealed distinct aberration signatures for the ten IntClust groups, suggesting that the copy number motifs reflect a driver-based classification of tumors. As seen from the Manhattan plots, the expression signatures defining the IntClust subtypes are to a large degree correlated with focal copy number aberrations, representing driver alterations in these subtypes. The copy number aberrations in these driver regions also exhibit differences in their pattern. This is for instance illustrated by the different types of copy number gains found on the 1q arm in the IntClust8 subtype, as compared with the gains found on the 11q arm in the IntClust2 group. The first type of gain represents noncomplex low-amplicon whole arm translocations captured by the AMP and ASM scores, while the latter represents more complex rearrangements with high-amplicon gains [22] captured by all of the CARMA scores. Even though both of the observed patterns represent copy number gains, the underlying mechanisms causing these patterns are fundamentally different. The CARMA scores manage to capture these nuances, illustrating the potential of the method to discriminate between a richer set of aberrational patterns. The plot also gives an indication of the global background variation from copy number aberrations, perhaps most apparent in the IntClust10 subtype. Interestingly, the degree to which the different subtypes are affected by this background variation seems to correlate well with the fraction of TP53 mutations observed within each subtype [23]. This again supports the notion that copy number motifs reflect underlying biological traits.
In order to assess the ability of the method to predict breast cancer specific survival, a univariate Cox regression model was fitted to genome-wide CARMA scores in the METABRIC cohort. All genome-wide scores showed a strong and significant association with survival. As a first step, this supports the assumption that each of the selected scores is informative and thus qualifies for use in further survival analyses. The scores were combined to produce the unweighted and weighted prognostic indices CPI and CPI_weighted. When CPI and CPI_weighted were compared with established clinical parameters through Cox regression analyses, CPI consistently outperformed all other variables in terms of the level of significance. The multivariate Cox analyses established that CPI is a strong independent predictor of survival in breast cancer. The results might point towards a role of specific aberration motifs, proceeding from specific types of genomic instability, as determinants of malignancy potential in a tumor. The fact that CPI outperformed GII in the above analyses supports the idea that additional information is added through multifaceted measurements of copy number aberrations.
The observation that CPI produced better prognostic predictions than CPI_weighted might stem from the somewhat strict variable selection exerted by the Lasso regression model. The Lasso model excludes arm-specific scores that individually do not contribute strongly to the survival prediction. Aggregated, however, these arm-specific scores might confer additional prognostic information. CPI, which is based on combining all arm scores in an unweighted manner, is not subject to the same kind of selection bias. The fact that this more inclusive approach performed better in our analyses suggests that all parts of the genome copy number aberration profile contribute to the real signal when assessing survival. This supports the notion that our method captures omnipresent background variation caused by underlying DNA disruptions.
In the future it would be of high interest to apply the methodology to different cancer types to compare aberration patterns across tumors at different sites, for example using The Cancer Genome Atlas Pan-Cancer data set [24]. Translocation of genomic material is not captured by any array-based DNA analysis, and data from high-throughput sequencing would be required to fully characterize genomic architecture. The complex patterns described in this manuscript are likely to reflect specific mutational processes that could be further elucidated in future studies, linking CARMA with sequencing data. Finally, ASCAT has recently been implemented for whole genome sequencing data [25], and it would be interesting to apply our methodology directly to the allele-specific copy number profiles extracted from such data.
Several extensions of the current analyses are possible. One could for example increase the genomic resolution by partitioning the genome into a fairly large number of equal-sized regions (say 1000), and then assign separate scores to each of these. At some point, however, the regions may become too small to meaningfully assign scores, most notably for the indices reflecting complex rearrangements (STP and CRV). Another possible extension would be to consider regions harboring genes involved in specific processes or pathways, thus directly linking CARMA scores to biological function.

Methods
Deriving allele-specific copy number profiles. Affymetrix CEL files were preprocessed using the PennCNV libraries for Affymetrix data [26], which include quantile normalization, signal extraction, and summarization. All samples were normalized to a collection of around 5000 normal samples from the HapMap project [27], the 1000 Genomes Project [28], and the Wellcome Trust Case Control Consortium [29]. The resulting LogR and BAF (B allele frequency) values were segmented with the piecewise constant fitting algorithm [30] and processed with the ASCAT algorithm (version 2.3) [31] after adjusting LogR for GC binding artifacts [32]. ASCAT infers an allele-specific copy number profile of a tumor after correction for tumor ploidy and tumor cell fraction, and is based on allele-specific segmentation of normalized raw data [30] with penalty parameter (γ) set to 50. The profile reflects the copy number state at m genomic loci for which two alleles are present in the germline in the general population, and can be represented as a sequence of pairs $(n_{Ai}, n_{Bi})$ ($i = 1, \dots, m$), where $n_{Ai}$ and $n_{Bi}$ denote the number of copies of each of two alleles (here called A and B) being present in the tumor genome at the ith locus. Pairs are ordered according to location, and since the labels A and B are arbitrary, we may assume that $n_{Ai} \ge n_{Bi}$.
Calculating regional instability scores. We characterize the allele-specific copy number in a small genomic neighborhood on a chromosome arm by six features: degree of alteration in negative direction, degree of alteration in positive direction, degree of change, degree of oscillation, extent of LOH, and extent of allelic imbalance (see Fig. 1c). Sliding the genomic region along the chromosome arm from one end to the other, we may regard each feature as a function of genomic position. Specifically, suppose we have measured allele-specific copy numbers $(n_{Ai}, n_{Bi})$ at genomic loci $L_i$, $i = 1, \dots, m$. We can represent this as a pair of piecewise constant functions $(f_A, f_B)$ defined on the unit interval $R = [0, 1]$. The interpretation of this is that each position $L_i$ is mapped to a value $t_i$ in $R$, such that $L_1 < \cdots < L_m$ are represented by points $t_1 < \cdots < t_m$ in $R$. We thus have a one-to-one correspondence between $t \in [0, 1]$ and genomic loci $L(t)$, and if $L_k$ is the measurement locus closest to $L(t)$, then $f_A(t) = n_{Ak}$ and $f_B(t) = n_{Bk}$.
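The representation described above can be sketched in a few lines. The linear rescaling of genomic position onto the unit interval is an assumption (the text only requires an order-preserving map), and the names `to_unit_interval` and `evaluate` as well as the toy loci are hypothetical.

```python
def to_unit_interval(loci, region_start, region_end):
    """Map genomic positions in [region_start, region_end] onto [0, 1].

    A linear rescaling is assumed here; any order-preserving map would
    satisfy the description in the text.
    """
    width = region_end - region_start
    return [(pos - region_start) / width for pos in loci]


def evaluate(t, t_loci, copies):
    """Value of the piecewise constant function at t: the copy number
    measured at the locus closest to t."""
    nearest = min(range(len(t_loci)), key=lambda k: abs(t_loci[k] - t))
    return copies[nearest]


# Toy region of length 1000 with three measured loci.
t_loci = to_unit_interval([100, 200, 400], 0, 1000)
n_A = [2, 3, 1]  # major-allele copy numbers at the three loci
n_B = [1, 0, 1]  # minor-allele copy numbers, with n_B <= n_A at every locus

f_A_values = [evaluate(t, t_loci, n_A) for t in (0.0, 0.16, 0.9)]
f_B_values = [evaluate(t, t_loci, n_B) for t in (0.0, 0.16, 0.9)]
```

Because the function takes the value at the nearest locus, its breakpoints lie midway between neighboring loci, which gives the piecewise constant shape used in the remainder of the section.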
We assume that $f_B(t) \le f_A(t)$ for all $t \in R$, i.e., B is the minor allele when allelic imbalance is present. The median centered total copy number in locus t is $f(t) = f_A(t) + f_B(t) - m$, where m is the least number in the range of $f_A + f_B$ that satisfies $\mu\bigl((f_A + f_B)^{-1}((-\infty, m])\bigr) \ge 1/2$, where $\mu$ is the Lebesgue measure. Informally, this means that m is chosen as the observed copy number with the property that half the genome has a total copy number less than or equal to m. We define the change in total copy number as the derivative $Df(t)$ of the first order spline interpolation to the center points of segments in f, i.e., $Df(t)$ is the slope of the line segment connecting the pair of segment centers immediately to the left and right of position t. Note that $Df$ is also a piecewise constant function. We define the oscillation in total copy number as $D^2 f(t) = D(Df(t))$, which is also a piecewise constant function. This process can in principle be repeated to define higher order properties of f such as $D^3 f(t) = D(D^2 f(t))$; however, in practice further levels add little additional information.
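A minimal sketch of the median centering and of the operator D follows, assuming the total copy number is stored as a list of segment breakpoints on [0, 1] together with one value per segment. The function names are hypothetical, and leaving Df undefined outside the first and last segment centers is a simplification of this sketch, not something the text specifies.

```python
def lebesgue_median(breaks, values):
    """Least observed value m such that the segments with value <= m
    cover at least half of the total length."""
    lengths = [b - a for a, b in zip(breaks, breaks[1:])]
    half = 0.5 * sum(lengths)
    covered = 0.0
    for value, length in sorted(zip(values, lengths)):
        covered += length
        if covered >= half - 1e-12:  # small tolerance for rounding
            return value


def derivative(breaks, values):
    """Slope of the line joining the centers of adjacent segments.

    Returns (centers, slopes): slopes[k] applies between centers[k] and
    centers[k + 1], so the result is again piecewise constant.
    """
    centers = [0.5 * (a + b) for a, b in zip(breaks, breaks[1:])]
    slopes = [(v1 - v0) / (c1 - c0)
              for v0, v1, c0, c1 in zip(values, values[1:], centers, centers[1:])]
    return centers, slopes


# Toy profile: four segments of equal length with total copy numbers 2, 3, 2, 5.
breaks = [0.0, 0.25, 0.5, 0.75, 1.0]
total = [2, 3, 2, 5]

m = lebesgue_median(breaks, total)         # median total copy number
centered = [v - m for v in total]          # median centered profile f
d_breaks, df = derivative(breaks, centered)    # change, Df
d2_breaks, d2f = derivative(d_breaks, df)      # oscillation, D^2 f
```

Applying `derivative` to its own output mirrors the definition of the oscillation as D applied to Df, and could be repeated for higher orders.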
Regional instability scores are next defined by integrating the above local scores over the desired region (e.g., over a chromosome arm). To assess the degree of positive or negative deviation within a region, we define two scores, $J_1$ and $J_2$, in terms of the positive and negative parts of f, where $z^+ = z$ if $z > 0$ and $z^+ = 0$ otherwise, and $z^- = z$ if $z < 0$ and $z^- = 0$ otherwise. For example, in a region with total copy number equal to the median, we have $J_1 = J_2 = 0$, while in a region with some gains and no losses relative to the median, we have $J_1 > 0$ and $J_2 = 0$. The regional degree of change and oscillation in copy number are captured by two further scores, $J_3$ and $J_4$. In a region with constant total copy number, we have $J_3 = J_4 = 0$. In a region with gradually increasing (or decreasing) copy number, $J_3 > 0$ while $J_4$ is close to zero, and in a region with fluctuations between smaller and larger copy numbers we have $J_3 > 0$ and $J_4 > 0$. LOH and allelic asymmetry are captured by the last two scores, $J_5$ and $J_6$, which use the indicator $1_0$, where $1_0(z) = 1$ if $z = 0$ and $1_0(z) = 0$ otherwise. In a region with only one allele present we have $J_5 > 0$ and the magnitude of the score reflects the proportion of the region with LOH. In a region with allelic imbalance, we have $J_6 > 0$. Further computational details can be found in Supplementary Materials.
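The displayed definitions of the six scores are not reproduced in the text above. One set of integrals consistent with the verbal description is sketched here, with f the median centered total copy number and R the region of interest. The absolute values in $J_2$ to $J_4$ and the integrand of $J_6$ are assumptions made for illustration and may differ from the authors' exact formulas.

```latex
\begin{align*}
J_1 &= \int_R f(t)^{+}\,dt, &
J_2 &= \int_R \bigl\lvert f(t)^{-}\bigr\rvert\,dt,\\
J_3 &= \int_R \bigl\lvert Df(t)\bigr\rvert\,dt, &
J_4 &= \int_R \bigl\lvert D^{2}f(t)\bigr\rvert\,dt,\\
J_5 &= \int_R 1_0\bigl(f_B(t)\bigr)\,dt, &
J_6 &= \int_R \bigl(f_A(t) - f_B(t)\bigr)\,dt.
\end{align*}
```

These forms reproduce the stated properties: $J_1 = J_2 = 0$ when the total copy number equals the median, $J_3 = J_4 = 0$ when it is constant, $J_5$ equals the length of the part of R where the minor allele is absent, and $J_6 > 0$ whenever the two alleles differ in copy number. An equally plausible alternative for $J_6$ is the indicator form $\int_R \bigl(1 - 1_0(f_A(t) - f_B(t))\bigr)\,dt$. The six scores presumably correspond to AMP, DEL, STP, CRV, LOH, and ASM, in that order.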
Calculating CARMA scores in sex chromosomes. The top level function in the accompanying software does not currently support calculation of CARMA scores for the Y chromosome. It is still possible to obtain such scores by use of the included low level function for calculating scores on a single chromosome. Calculation of CARMA scores for the X chromosome is supported, but it requires information about the gender for correct calculation of AMP and DEL.
Statistics and reproducibility. Three-dimensional scatter plots: Subtype centroids were calculated by averaging over all the three-dimensional vectors representing samples from a particular PAM50 subtype. Trend curves are principal curves [33] and were calculated with the R package princurve using default parameter values.
Survival analysis: To assess the association between survival (DSS or PFS) and CPI risk groups, a logrank test was applied, and survival estimates were found using the Kaplan-Meier estimator. The functions survdiff and survfit in the R package survival were used for this purpose. All other associations between survival and covariates were assessed using univariate or multivariate Cox regression, as appropriate. A score test was applied to test the significance of individual covariates in the Cox models. Models were fitted by maximization of the Cox partial likelihood, with the exception of the model containing all the 252 arm-wise CARMA scores as covariates. In the latter case a Cox partial likelihood with an $L_1$ (lasso) penalty [34] was applied. The lasso is a regularization method that shrinks regression coefficients towards zero by enforcing an upper bound on the $L_1$-norm of the coefficients (i.e., $\sum_{j=1}^{p} |\beta_j| \le \lambda$) in the maximization of the partial log likelihood. The amount of shrinkage is determined by a tuning parameter λ. Leave-one-out cross-validation was used to determine the value of λ. Cox regression with a Lasso penalty was performed using the functions cv.glmnet and glmnet in the R package glmnet [35,36]. All other Cox regressions were performed using the function coxph in the R package survival.
Assessment of risk-score model: The goodness of fit of the continuous CPI risk score was determined using Harrell's C score. For every pair of observations, it is determined whether the pair is concordant (the lower-risk observation has the longer survival), discordant (the lower-risk observation has the shorter survival), or undetermined due to censoring. Harrell's C score is then the ratio of the number of concordant pairs to the total number of concordant and discordant pairs.
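The pair counting behind Harrell's C can be sketched as follows. This is an illustrative pure-Python version, not the routine used in the study; the function name and toy data are hypothetical. It assumes that a pair is usable only when the shorter follow-up time ends in an observed event, and it leaves pairs with tied risk scores out of both counts, in line with the ratio described above (other conventions count such ties as one half).

```python
def harrell_c(time, event, risk):
    """Concordance between predicted risk and observed survival.

    time:  follow-up time per patient
    event: 1 if the event was observed, 0 if censored
    risk:  predicted risk score (higher means worse prognosis)
    """
    concordant = discordant = 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # Usable pair: patient i had an observed event before time[j].
            if event[i] and time[i] < time[j]:
                if risk[i] > risk[j]:
                    concordant += 1   # earlier event, higher risk
                elif risk[i] < risk[j]:
                    discordant += 1   # earlier event, lower risk
    usable = concordant + discordant
    return concordant / usable if usable else float("nan")


# Toy data: four patients, the second one censored at time 8.
time = [5, 8, 3, 10]
event = [1, 0, 1, 1]
risk = [2.0, 1.0, 0.2, 0.5]
c_score = harrell_c(time, event, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect agreement between risk score and survival.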
The weighted prognostic index (CPI_weighted) was calculated as $\mathrm{CPI}_{\mathrm{weighted}} = x_i^{T}\hat{\beta}$, where $x_i$ represents the CARMA arm scores for patient i in the validation data set and $\hat{\beta}$ are the estimated coefficients in the survival prediction models found for the discovery set.
Materials. The data material in this study was obtained from four patient cohorts: METABRIC (n = 1943), Oslo2 (n = 276), OsloVal (n = 165), and ICGC (n = 553). Only female patients were included. The distribution of clinical parameters within each of the data sets can be found in Supplementary Tables 4-5. The METABRIC cohort was randomly split in a 2:1 ratio into a discovery set (n = 1295) and a test set (n = 648) for the purpose of model validation. For detailed information regarding which samples belong to the discovery and test sets, please contact the authors. For more details about the four cohorts, see Supplementary Material and Methods. Survival data were not available for the Oslo2 cohort.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Genomic copy number and gene expression information as well as clinical data for the OsloVal cohort have been described previously [37] and are available at the Synapse platform, https://doi.org/10.7303/syn1688370. Gene expression information for the Oslo2 cohort has been described previously [38,39] and is available at Gene Expression Omnibus, accession GSE81002. The SNP 6.0 copy number data from the Oslo2 cohort are available upon request. Molecular-subtype information and segmented copy number profiles for the OsloVal and Oslo2 cohorts are available from the corresponding author on reasonable request. Genomic copy number, gene expression and molecular-subtype information for the METABRIC cohort have been described previously [1] and are available at the European Genome Phenome Archive, accession EGAS00000000083, while clinical data are available from [19]. Gene expression data, segmented copy number profiles and clinical information for the ICGC breast cancer cohort have been described previously [40]