Evaluating cancer cell lines as models for metastatic breast cancer

Metastasis is the most common cause of cancer-related death and, as such, there is an urgent need to discover new therapies to treat metastasized cancers. Cancer cell lines are widely used models to study cancer biology and evaluate drug candidates. However, it is still unknown whether they adequately recapitulate the disease in patients. The recent accumulation of large-scale genomic data in cell lines, mouse models, and patient tissue samples provides an unprecedented opportunity to evaluate the suitability of cancer cell lines as models for metastatic cancer research. By comprehensively comparing the genomic profiles of 57 breast cancer cell lines with those of metastatic breast cancer samples, we found substantial genomic differences between them. We also identified cell lines that more closely resemble different subtypes of metastatic breast cancer. However, we found that none of the currently established Basal-like cell lines sufficiently resembles the samples of Basal-like metastatic breast cancer, a subtype of high interest in therapeutic discovery. Further analysis of mutation, copy number variation, and gene expression data suggested that MDAMB231, the most commonly used triple negative cell line, had little genomic similarity with Basal-like metastatic breast cancer samples. Our work demonstrates an urgent need for cell lines that more closely resemble Basal-like metastatic breast cancer samples, and could guide cell line selection in other metastasis-related translational research.


Introduction
Cancer cell lines were initially derived from tumors and cultured in a 2D environment. Because of the merits of cell culture, they have been widely used as models to study cancer biology and test drug candidates [1]. However, the fact that many drugs have great preclinical profiles but fail in the clinic calls for a reinvestigation of cell lines as tumor models [2]. The differences between cell lines and tumors have raised the critical question of the extent to which cell lines recapitulate the biology of tumor samples [3,4].
The emergence of large-scale genomic data provides an unprecedented opportunity to quantify their biological differences. The Cancer Genome Atlas (TCGA) project characterized both genetic and transcriptomic profiles of more than 10,000 human tissue samples across 32 tumor types [5]. The Cancer Cell Line Encyclopedia (CCLE) characterized both genetic and transcriptomic profiles for more than 1,000 cell lines [6]. Silvia et al. performed a comprehensive comparison of molecular profiles between 47 ovarian cancer cell lines and ovarian tumor samples, and they showed that popular cell line models did not closely resemble high-grade serous ovarian cancers [7]. In addition, they identified several rarely used cell lines that closely resembled the profile of ovarian cancer. We examined the transcriptome similarity between hepatocellular carcinoma (HCC) cell lines and HCC tumor samples and demonstrated that nearly half of the HCC cell lines did not resemble HCC tumors [8]. Jian et al. conducted a comprehensive comparison of molecular portraits between breast cancer cell lines and primary breast cancer samples and found both similar and dissimilar molecular features [10].
Cancer metastasis is the most common cause of cancer-related death, thus there is an urgent need for new drugs for treating cancer metastasis [11,12]. Previous cell line evaluation analyses were mainly performed in reference to primary tumors. It remains unknown whether cell lines closely resemble metastatic cancer and thus are appropriately used in translational research. Robinson et al. performed whole-exome and transcriptome sequencing on about 500 metastatic cancer samples and recently released their dataset (referred to as MET500) [13]. This large-scale genomic profiling combined with existing genomic data allows the evaluation of the suitability of cell lines as models for metastatic cancer. As a case study, in this work we focus on breast cancer, where cell lines are frequently used to study metastasis.
We found that breast cancer cell lines poorly recapitulated the somatic mutation spectrum, while the CNV (copy-number-variation) profiles were highly consistent. In addition, we showed that cell lines could resemble the transcriptome of metastatic breast cancer and identified suitable cell lines for LuminalA/LuminalB/Her2-enriched subtypes. Our analysis indicated that none of the cell lines closely resembled Basal-like metastatic breast cancer samples. Specifically, the heavily used triple negative cell line MDAMB231 showed low genomic similarity with metastatic breast cancer samples, which could be confirmed by using independent in vivo and in vitro data. Our work reveals the similarities and differences between metastatic breast cancer samples and cell lines and provides guidance for choosing cell lines in metastasis-related translational research.

Identification of differentially mutated genes between MET500 and TCGA
Given a gene, to test whether it has significantly higher mutation frequency in metastatic breast cancer samples, we computed the right-tailed p-value as follows:
$$p_{\mathrm{right}} = \sum_{i=n}^{N} \Pr(i; N, q)$$
where Pr(i; N, q) is the probability mass function of the binomial distribution, N is the number of genotyped MET500 breast cancer samples, n is the number of MET500 breast cancer samples in which the gene is mutated, and q is the mutation frequency of the gene in the TCGA dataset. Similarly, we computed the left-tailed p-value to test whether a gene has significantly lower mutation frequency in metastatic breast cancer samples:
$$p_{\mathrm{left}} = \sum_{i=0}^{n} \Pr(i; N, q)$$
To control the false discovery rate, we applied the Benjamini-Hochberg procedure to the left-tailed and right-tailed p-values separately [23].
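The binomial tail tests and the Benjamini-Hochberg adjustment described above can be illustrated with a minimal Python sketch. It is not the authors' R code; the function names and the example counts are hypothetical and chosen only for illustration, and the tails are taken to include the observed count n.

```python
from math import comb


def binom_pmf(i, N, q):
    """Pr(i; N, q): probability that exactly i of N samples carry a mutation."""
    return comb(N, i) * q**i * (1 - q) ** (N - i)


def right_tail_pvalue(n, N, q):
    """P(X >= n): evidence for a higher mutation frequency than q."""
    return sum(binom_pmf(i, N, q) for i in range(n, N + 1))


def left_tail_pvalue(n, N, q):
    """P(X <= n): evidence for a lower mutation frequency than q."""
    return sum(binom_pmf(i, N, q) for i in range(0, n + 1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts, for illustration only: gene -> (n, N, q).
example = {
    "GENE_A": (8, 50, 0.02),
    "GENE_B": (3, 50, 0.05),
    "GENE_C": (1, 50, 0.04),
}
genes = list(example)
right_p = [right_tail_pvalue(*example[g]) for g in genes]
right_fdr = dict(zip(genes, benjamini_hochberg(right_p)))
```

The left-tailed p-values would be adjusted in a separate call to `benjamini_hochberg`, mirroring the separate treatment of the two tails in the text.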

Gene expression correlation analysis
To perform gene expression correlation analysis, we first rank-transformed the gene RPKM values (or probeset intensity values) of each CCLE cell line and then ranked all the genes (or probesets) according to their rank variation across CCLE cell lines. The 1000 most-varied genes (or probesets) were then selected and used to compute the Spearman rank correlation between cell lines and metastatic breast cancer samples.
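As a rough illustration of this procedure, the following Python sketch rank-transforms each cell line, selects the genes whose ranks vary most across cell lines, and computes a median Spearman correlation against tumor samples. It makes two assumptions not stated in the text: "rank variation" is measured as the variance of ranks, and ties receive average ranks. All function names and data layouts are hypothetical.

```python
from statistics import median, pvariance


def rank_transform(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return pearson(rank_transform(x), rank_transform(y))


def most_varied_genes(cell_line_expr, top_k):
    """cell_line_expr maps cell line name -> expression values in a shared
    gene order. Returns the indices of the top_k genes whose within-cell-line
    rank has the largest variance across cell lines (assumed definition)."""
    ranked = [rank_transform(v) for v in cell_line_expr.values()]
    n_genes = len(ranked[0])
    variation = [pvariance([r[g] for r in ranked]) for g in range(n_genes)]
    by_variation = sorted(range(n_genes), key=lambda g: variation[g], reverse=True)
    return by_variation[:top_k]


def median_correlation(cell_line, tumors, genes):
    """Median Spearman correlation of one cell line with a list of tumors,
    restricted to the selected gene indices."""
    sub = [cell_line[g] for g in genes]
    return median(spearman(sub, [t[g] for g in genes]) for t in tumors)
```

With real data, `top_k` would be 1000 as in the text, and cell lines would then be ranked by the value returned from `median_correlation`.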

PAM50 sub-typing and t-SNE visualization
The genefu package was used to determine breast cancer subtype [24,25]. To visualize tumor samples with t-SNE, we first computed the pairwise distance between samples as 1 minus the Spearman rank correlation (across PAM50 genes) and then applied the function Rtsne to perform two-dimensional reduction [26].

PubMed search
The number of PubMed abstracts or full texts mentioning a CCLE breast cancer cell line was determined using the PubMed Search feature on May 10, 2018 (https://www.ncbi.nlm.nih.gov/pubmed/). For each cell line, we searched with the keyword "[cell line name] metastasis". We repeated this step for the terms "metastatic", "breast cancer", and "metastatic breast". These searches returned highly correlated results, so we used the search term that returned the most results: "[cell line name] metastasis".

Identification of differentially expressed genes between cell lines and metastatic breast cancer samples
DESeq2 was used to identify differentially expressed genes and the DAVID bioinformatics server was used to perform Gene Ontology (GO) enrichment analysis [27,28]. The 50 hallmark gene sets were downloaded from MSigDB (http://software.broadinstitute.org/gsea/msigdb/) and the R package GSVA was used to perform ssGSEA analysis [29][30][31][32]. To identify gene sets with different activity between cell lines and metastatic breast cancer samples, we used the Wilcoxon rank test to compute a p-value for each of the 50 gene sets and then applied the Benjamini-Hochberg procedure to select significant gene sets (FDR ≤ 0.01).
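The per-gene-set comparison can be sketched as follows. This assumes that "Wilcoxon rank test" refers to the unpaired Wilcoxon rank-sum (Mann-Whitney U) test, and it uses a two-sided normal approximation with tie correction and no continuity correction, so its p-values may differ from exact p-values reported by R for small samples. The function name is hypothetical.

```python
from statistics import NormalDist


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation.

    x and y are the ssGSEA scores of one gene set in the two groups
    (for example, cell lines and tumor samples).
    """
    combined = [(v, 0) for v in x] + [(v, 1) for v in y]
    combined.sort(key=lambda t: t[0])
    n1, n2 = len(x), len(y)
    n = n1 + n2

    rank_sum_x = 0.0  # sum of average ranks of the x group
    tie_term = 0.0    # sum of t^3 - t over groups of tied values
    i = 0
    while i < n:
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        t = j - i + 1
        mean_rank = (i + j) / 2 + 1
        n_from_x = sum(1 for k in range(i, j + 1) if combined[k][1] == 0)
        rank_sum_x += mean_rank * n_from_x
        tie_term += t**3 - t
        i = j + 1

    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / var_u**0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Applying this function to each of the 50 gene sets gives 50 p-values, which would then be adjusted with the Benjamini-Hochberg procedure before thresholding at FDR ≤ 0.01.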

Software tools and statistical methods
All of the analysis was conducted with R and the code is freely available at https://github.com/Bin-Chen-Lab/MetaBreaCellLine. The ggplot2 and ComplexHeatmap packages were used for data visualization [33,34]. The tumor purity was estimated using ESTIMATE [35]. CNTools was used to map the segmented CNV data to genes [36]. Unless otherwise specified, the Wilcoxon rank test was used to compute p-values in hypothesis testing.

Comparison of genetic profiles between metastatic breast cancer and cell lines
We first compared the gene somatic mutation profile between MET500 breast cancer samples and breast cancer cell lines. Whole-exome sequencing was performed for MET500 samples, while hybrid capture sequencing was performed for cell lines. We thus focused only on the 1630 genes genotyped in both studies. We were particularly interested in two types of genes that may play important roles in breast cancer metastasis: genes that are highly mutated in metastatic breast cancer, and genes that are differentially mutated between metastatic and primary breast cancers.
Consistent with previous research, we identified a long-tailed mutation spectrum of the 1630 genes in MET500 breast cancer samples (Fig S1a). There were 69 highly mutated genes whose mutation frequency was higher than 0.05, and the five most-altered genes were TP53 (0.67), PIK3CA (0.35), TTN (0.29), OBSCN (0.19), and ESR1 (0.14). We applied a statistical method to identify genes with significantly different mutation frequency between MET500 and TCGA, and 19 genes passed the criterion FDR ≤ 0.001. The top five most significant genes were ESR1, TNK2, OBSCN, CAMKK2, and CLK1 (Fig S1b and Table 1). Interestingly, all of these 19 differentially mutated genes had higher mutation frequency in MET500 breast cancer samples, which is consistent with a previous study showing that metastatic cancer has an increased mutation burden compared to primary cancer [13]. Of these, 68% were also among the 69 highly mutated genes mentioned above. After merging the two gene lists, 75 unique genes remained (Fig 1a and Table S1). The median mutation frequency of the 75 genes across breast cancer cell lines is 0.07 and only 9% of them (PRKDC, MAP3K1, TTN, ADGRG4, TP53, FN1, and AKAP9) are mutated in at least 50% of cell lines, suggesting that the majority of these gene mutations could be recapitulated by only a few cell lines. In accordance with this finding, the median number of mutated genes of the 57 cell lines is 10, with CAL51, MDAMB453, UACC812, CAL148, and HCC1569 being the five most-mutated cell lines. Additionally, nine out of the 75 genes (ESR1, GNAS, PIKFYVE, FFAR2, RNF213, MYBL2, KAT6A, MAP4K4, and FMO4) are not mutated in any cell line. Notably, ESR1 has been identified as a driver gene of cancer metastasis and associated mutations could cause endocrine resistance of metastatic cancer cells [37,38], but none of the cell lines could be used to accurately model it.
We next asked whether there were any genes that were specifically hypermutated in breast cancer cell lines. To address this question, we examined the mutation spectrum of the 32 genes that are mutated in at least 50% of the breast cancer cell lines. Surprisingly, 25 of them (78.1%) have low mutation frequency (< 0.05) in MET500 breast cancer samples. Further analysis of somatic mutation profiles of the 25 genes in TCGA breast cancer samples confirmed that their hypermutation was specific to breast cancer cell lines (Fig S1c).
Besides the somatic mutation spectrum, we also compared CNV profiles between MET500 breast cancer samples and breast cancer cell lines. We observed a very strong correlation of median CNV values across the 1630 commonly genotyped genes (Spearman rank correlation = 0.81, Fig 1b). Surprisingly, we noticed that the gain-of-copy-number events in cell lines appeared to resemble metastatic breast cancer while loss-of-copy-number events did not. As shown in Fig S1d, for genes that show copy-number-loss in breast cancer cell lines, their median CNV values across breast cancer cell lines are significantly lower than those from MET500; however, no significant difference was detected in genes with copy-number-gain.
Out of the 57 breast cancer cell lines, 24 were derived from metastatic sites (Table S2). We further divided the cell lines into two groups according to whether or not they were derived from metastatic sites, and compared the CNV profiles of each group with MET500 breast cancer samples. We found that cell lines derived from metastatic sites more closely resembled the CNV status of genes with high copy-number-gain (CNV ≥ 0.4) in MET500 breast cancer samples (Fig 1c, 1d, and Fig S1e), which is expected and consistent with the results of additional expression analysis (see Section 3 for more details).

Fig 1. Comparison of the genetic profile between metastatic breast cancer samples and breast cancer cell lines. (a) Somatic mutation profile of the 75 genes across MET500 breast cancer samples and breast cancer cell lines. The top-side color-bar indicates data source (MET500 or CCLE) and the right-side color-bar indicates mutation frequency of genes. (b) Comparison of CNV profiles between MET500 breast cancer samples and breast cancer cell lines with 1630 commonly genotyped genes. The x-axis represents the median CNV value of one gene across 57 breast cancer cell lines and the y-axis represents the median CNV value of one gene across 53 MET500 breast cancer samples. (c) Comparison of CNV profile between MET500 breast cancer samples and CCLE breast cancer cell lines derived from primary site. The x-axis represents the median CNV value across 33 breast cancer cell lines derived from primary site and the y-axis represents the median CNV value across 53 MET500 breast cancer samples. Genes with high CNV value in MET500 breast cancer samples are red. (d) Comparison of CNV profile between MET500 breast cancer samples and breast cancer cell lines derived from metastatic sites. The x-axis represents the median CNV value across 24 breast cancer cell lines derived from metastatic sites and the y-axis represents the median CNV value across 53 MET500 breast cancer samples. Genes with high CNV value in MET500 breast cancer samples are red.

Correlating breast cancer cell lines with metastatic breast cancer samples using transcriptomic data
Gene expression correlation analysis has proven to be an effective approach to evaluate the suitability of cell lines for research purposes [7][8][9]. Therefore, we ranked all 1019 CCLE cell lines according to their median Spearman rank correlation with MET500 breast cancer samples. The 20 most-correlated cell lines were all breast cancer cell lines (Fig 2a), illustrating the potential of breast cancer cell lines to resemble the transcriptomic profile of metastatic breast cancer. MDAMB415 and HMC18 are the two cell lines that have the highest and lowest correlation, respectively.
Since MET500 breast cancer samples were derived from multiple biopsy sites, we asked whether the cell lines resembling the transcriptome of metastatic breast cancer from different biopsy sites were identical. We were only able to consider liver and lymph node due to the paucity of samples from other biopsy sites in the MET500 dataset. We performed biopsy-site-specific gene expression correlation analysis (i.e., correlating breast cancer cell lines with samples derived from a specific biopsy site) and found that the cell line rankings obtained from liver and lymph node were highly correlated (Fig 2b, Spearman rank correlation = 0.97), with MDAMB415 being the most correlated cell line for both biopsy sites. In addition, we detected no significant difference in the correlations with the MDAMB415 cell line between different biopsy sites (Fig S2a).
Given the genomic heterogeneity of breast cancer, we further asked whether the cell lines resembling the transcriptome of metastatic cancer of different subtypes were identical. To address this question, we first determined the PAM50 subtype of MET500 breast cancer samples with the R package genefu. Since genefu was initially developed with primary breast cancer data, we further applied the machine learning method t-SNE on expression data of PAM50 genes and confirmed that the PAM50 genes could be used to classify metastatic breast cancer samples. As shown in Fig 2c, Basal-like samples were clustered together and separated from other subtypes, which is in accordance with previous research [39,42]. Additionally, the majority of LuminalA/LuminalB/Her2-enriched/Normal-like samples were mixed together except two skin-derived samples (Her2-enriched samples seemed to be separated from LuminalA/LuminalB samples but the boundary was not clear). We confirmed the finding by performing the same analysis on a combined dataset which contains both MET500 and TCGA breast cancer samples (Fig S2b). We next performed subtype-specific gene expression correlation analysis (i.e., correlating breast cancer cell lines with samples of a specific subtype) and found that the rankings of breast cancer cell lines obtained from LuminalA/LuminalB/Her2-enriched subtypes were highly correlated with each other (Spearman rank correlation values were 0.96, 0.97, and 0.96 respectively), but they all showed relatively lower correlation with the Basal-like subtype (Fig 2d).
To confirm the robustness of the results, we searched the GEO database and assembled a microarray dataset containing the expression values of another 117 metastatic breast cancer samples, and repeated the above analysis. As expected, the results obtained from two different platforms were highly consistent with each other. First, there was a large overlap of the top-ranked cell lines. Out of the 10 cell lines that were most correlated with the 117 metastatic breast cancer samples, six were within the top 10 cell lines that were most correlated with MET500 breast cancer samples. Second, cell line ranking results between liver and lymph node were highly correlated (Fig S3, Spearman rank correlation = 0.95). Third, cell line ranking results obtained from Basal-like samples still showed relatively lower correlations with other subtypes (Fig S4).
We also noticed that the expression correlation analysis results derived from bone showed lower correlation with other tissues. To exclude the possibility that this was caused by tumor purity issues, we applied ESTIMATE on the microarray data and found that the tumor purity of bone-derived metastatic breast cancer samples was not significantly lower than that of liver, lymph node, and lung (Fig S5). Our results may not be too surprising given that bone provides a unique microenvironment including enriched expression of osteolytic genes [40]; however, this result needs to be confirmed in the future as more data becomes available.

Suitable cell lines for metastatic breast cancer research
We attempted to identify suitable cell line models for metastatic breast cancer based on the results of subtype-specific gene expression correlation analysis. Given a subtype, we noticed that for a random cell line, the median expression correlation (with MET500 breast cancer samples of that subtype) was normally distributed (Fig S6). Based on that, we fit a normal distribution using the median expression correlation values of all non-breast-cancer cell lines and then assigned each of the 57 breast cancer cell lines a right-tailed p-value. We identified 20, 28, and 19 significant cell lines as suitable models for LuminalA, LuminalB, and Her2-enriched subtypes, respectively (FDR ≤ 0.01, see Table S3). Notably, most of these suitable cell lines were derived from metastatic sites and 18 of them were shared by the three subtypes. Surprisingly, no cell line passed the criterion FDR ≤ 0.01 for the Basal-like subtype. We further examined whether this was due to the limited number of Basal-like samples. However, the number of LuminalA samples was even smaller than that of Basal-like samples.
We next examined the popularity of the 57 breast cancer cell lines. MCF7 is most commonly used in metastatic breast cancer research (n=2299 PubMed citations). Although we found it was a suitable cell line for the LuminalB subtype, its correlation with MET500 LuminalB samples was lower than that of BT483, the most significant cell line for the LuminalB subtype (Fig S7a). Following MCF7 in mentions is MDAMB231 (n=2118 PubMed citations); however, based on our results, it was not a suitable cell line for any subtype. The third most popular cell line, T47D (n=204 PubMed citations), was a suitable cell line for both LuminalA and Her2-enriched subtypes. T47D did not show significantly lower correlation with LuminalA samples than EFM192A, the most correlated cell line for the LuminalA subtype (Fig S7b); however, it was significantly less correlated with the Her2-enriched subtype than EFM192A, the most correlated cell line for the Her2-enriched subtype (Fig S7c). Additional subtype-specific gene expression correlation analysis in the microarray dataset further confirmed our results (Fig S8).
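The null-model step described above can be sketched in Python. This is an illustrative reconstruction, not the authors' code: it assumes the normal distribution is fitted with the sample mean and sample standard deviation of the non-breast-cancer cell line correlations, and the function name and input layout are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def null_model_pvalues(null_correlations, candidate_correlations):
    """Right-tailed p-values for candidate cell lines under a normal null.

    null_correlations: median expression correlations of the
        non-breast-cancer cell lines with samples of one subtype.
    candidate_correlations: dict mapping each breast cancer cell line
        to its median expression correlation with the same samples.
    """
    # Fit the null distribution (sample mean and sample standard deviation).
    null = NormalDist(mean(null_correlations), stdev(null_correlations))
    # Right tail: probability of a correlation at least this high by chance.
    return {
        name: 1 - null.cdf(r) for name, r in candidate_correlations.items()
    }
```

The resulting p-values for the 57 breast cancer cell lines would then be adjusted with the Benjamini-Hochberg procedure, and cell lines with FDR ≤ 0.01 retained for each subtype.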
While the triple negative cell line MDAMB231 is one of the most frequently used cell lines in metastatic breast cancer research, it might not be the most suitable cell line to model metastasis biology. We ranked all of the 1019 CCLE cell lines according to their median expression correlation with MET500 Basal-like breast cancer samples and the rank of MDAMB231 was 583 (Fig 3a). It showed significantly lower correlation with MET500 Basal-like breast cancer samples than HCC70, the most correlated cell line. Similar patterns were observed with CNV data (Fig 3b). We also examined how MDAMB231 recapitulated the somatic mutation spectrum of Basal-like breast cancer samples and found only three of the 25 highly mutated genes (mutation frequency ≥0.1 in Basal-like MET500 breast cancer samples) were mutated in MDAMB231. Since CCLE data for MDAMB231 was generated in vitro, we obtained another independent dataset which profiled the gene expression of MDAMB231 cell lines derived from lung metastasis in vivo [41] in order to confirm our finding. We found, however, that even these in vivo MDAMB231 cell lines did not most closely resemble the transcriptome of lung metastasis breast cancer samples. The cell line which showed highest correlation (with lung metastasis breast cancer samples) is the CCLE cell line EFM192A (Fig 3d). Our analysis indicates that although MDAMB231 presents many favorable properties for metastatic breast cancer research, its genomic profile is substantially different from metastatic tissue samples.

Differential gene expression analysis between metastatic breast cancer and cell lines
The gene expression correlation analysis has shown that many cell lines could resemble metastatic breast cancer; however, they are still different in many aspects [3,4]. To characterize such differences, we compared the gene expression profile of MET500 breast cancer samples with breast cancer cell lines and identified 3044 differentially expressed genes (FDR ≤ 0.001, abs(log2FC) ≥ 1). We further performed GO enrichment analysis for the up-regulated and down-regulated genes respectively and listed the results in Table S4. The top five most significant enriched GO terms for up-regulated genes are extracellular matrix organization, cell adhesion, type I interferon signaling pathway, interferon-gamma-mediated signaling pathway and immune response; the top five most significant GO terms for down-regulated genes are all related to cell cycle: cell division, mitotic nuclear division, sister chromatid cohesion, DNA replication, and chromosome segregation.
We also compared the ssGSEA score of the 50 MSigDB hallmark gene sets between MET500 breast cancer samples and breast cancer cell lines (Fig 4). In total, 37 gene sets were identified as showing differential activity (FDR ≤ 0.01, Table S6). Of these, 27 showed significantly higher activity in MET500 breast cancer samples and the remaining 10 showed significantly lower activity. We noticed that some MET500 breast cancer samples derived from liver (in the dashed box of Fig 4) had enriched metabolism-related gene sets (such as XENOBIOTIC METABOLISM and BILE ACID METABOLISM). This suggests that liver-metastasis cancer cells may have a unique metabolic mechanism compared with primary tumors.
It is worth noting that the ssGSEA results are highly consistent with the gene differential expression analysis. The up-regulated genes (and over-activated gene sets in MET500) reflect the large difference of microenvironment between metastatic breast cancer and cell lines; also, the down-regulated genes (and less-activated gene sets in MET500) suggest that cell lines have more active cell cycles. All of these differences should be kept in mind when using cell lines in translational research.

Discussion
In cancer research, cell lines have been traditionally used to test drug candidates and study disease mechanism. Our comprehensive analysis has both raised doubt and shed light on the suitability of breast cancer cell lines as models for metastatic breast cancer research.
Somatic mutation profile analysis indicated that breast cancer cell lines poorly recaptured the mutation patterns of metastatic breast cancer samples. Most of the highly mutated genes (or differentially mutated genes between metastatic and primary lesions) were only mutated in a limited number of cell lines. In addition, there were 25 genes showing cell-line-specific hypermutation, which may be due to culture effects. Remarkably, the CNV profiles between breast cancer cell lines and metastatic breast cancer samples were much more consistent.
We also performed a gene expression correlation analysis to explore whether breast cancer cell lines could resemble the transcriptome of metastatic breast cancer samples. The results of biopsy-site-specific analysis suggested that for liver and lymph node derived metastatic breast cancer samples, the biopsy site did not play a role in determining the cell lines which closely resembled their transcriptome, and this conclusion was validated in the analysis of two independent datasets. It has been shown that breast cancer is a heterogeneous disease with multiple subtypes. We found that the PAM50 subtypes were maintained in metastatic breast cancer samples regardless of the tissues they metastasize to, which corroborates the results from a recent study [42]. Through a subtype-specific analysis, we found that the cell lines that most closely resembled the transcriptome of LuminalA/LuminalB/Her2-enriched subtypes overlapped substantially. Surprisingly, none of the currently established cell lines adequately resembles Basal-like metastatic breast cancer samples. Moreover, we found that the two most commonly used cell lines, MCF7 and MDAMB231 (together accounting for more than 80% of total PubMed publications mentioning metastatic breast cancer), were not the best choice for metastatic breast cancer research in terms of transcriptomic similarity. Specifically, there is a dramatic difference between Basal-like metastatic breast cancer samples and MDAMB231 (the most commonly used triple negative cell line), which was demonstrated by both in vitro and in vivo data. Note that although some cell lines closely resemble tissue samples of LuminalA/LuminalB/Her2-enriched subtypes, it does not mean they could be directly employed to study cancer metastasis, as many other criteria are needed for the assessment. Nevertheless, this analysis does suggest that we are in urgent need of new Basal-like cell lines which more closely resemble the biology of Basal-like metastatic breast cancer samples.
The results of our gene expression correlation analysis also raise a new question: when picking out cell lines to test drugs targeting breast cancer metastasis, which factors should be taken into consideration? According to our analysis, it appears that for lymph node and liver metastasis, the subtype information is sufficient since the biopsy-site-specific gene expression correlation analysis results were highly concordant with each other. However, we found that the results computed with bone metastases showed low correlation with other tissues. This implies that even for the same subtype, a cell line that is appropriate to model metastasis of other sites may not be appropriate for bone metastasis study.
Even though many breast cancer cell lines resemble the transcriptome of metastatic breast cancer samples, a large number of genes were identified as differentially expressed between them. Some of these genes relate to immune response, possibly reflecting the large difference between the tumor microenvironment and cell culture. In addition, our ssGSEA analysis on the 50 hallmark gene sets suggested that there are systematic differences in important pathway activities.
In summary, by leveraging publicly available genomic data and machine learning algorithms, we comprehensively evaluated the suitability of breast cancer cell lines as models for metastatic breast cancer. Our study also describes a blueprint which can be easily extended to other cancer types and more advanced model systems, such as organoids [43]. Although there are concerns about data quality and discrepancies between different studies/platforms, our large-scale analysis and cross-platform validation hopefully address these concerns and demonstrate the power of leveraging open data and machine learning algorithms to gain biological insights into cancer metastasis. As more data becomes available, we can start building an ad hoc mapping algorithm linking metastasis samples, cell lines, and other models. Inputs into this algorithm would be the characteristics of metastatic cancer samples (subtype, biopsy site, or even age, race, etc.) as well as the specific scientific question of interest, and the output would be a list of appropriate models. We hope that the recommendations in this study may facilitate improved precision in selecting relevant and suitable cell lines for modeling in metastatic breast cancer research, which may accelerate translational research.

Supporting information
S1 Fig. (a) Long-tailed gene mutation spectrum in MET500 breast cancer samples. (b) Volcano plot of gene differential mutation frequency analysis. (c) Visualization of log10-transformed mutation frequency of the 25 genes that are specifically hypermutated in CCLE breast cancer cell lines. (d) Boxplot of median CNV of grouped genes (according to whether showing gain or loss of copy number in CCLE breast cancer cell lines) in MET500 breast cancer samples and CCLE breast cancer cell lines. (e) CCLE breast cancer cell lines derived from metastatic sites more closely resemble the CNV status of genes with high copy-number-gain in MET500 dataset. Left: absolute value of median CNV difference between MET500 breast cancer samples and CCLE breast cancer cell lines derived from primary sites; right: absolute value of median CNV difference between MET500 breast cancer samples and CCLE breast cancer cell lines derived from metastatic sites.
S1 Table. Mutation frequency of the 75 highly (or differentially) mutated genes in CCLE, TCGA, and MET500 dataset.
S2 Table. Characteristics of the 57 CCLE breast cancer cell lines.
S4 Table. GO enrichment results of differentially expressed genes between CCLE breast cancer cell lines and MET500 breast cancer samples.
S5 Table. Results of differential activity analysis between MET500 breast cancer samples and CCLE breast cancer cell lines (for the 50 MSigDB hallmark gene sets).