Abstract

Human cancer cell lines are the workhorse of cancer research. Although cell lines are known to evolve in culture, the extent of the resultant genetic and transcriptional heterogeneity and its functional consequences remain understudied. Here we use genomic analyses of 106 human cell lines grown in two laboratories to show extensive clonal diversity. Further comprehensive genomic characterization of 27 strains of the common breast cancer cell line MCF7 uncovered rapid genetic diversification. Similar results were obtained with multiple strains of 13 additional cell lines. Notably, genetic changes were associated with differential activation of gene expression programs and marked differences in cell morphology and proliferation. Barcoding experiments showed that cell line evolution occurs as a result of positive clonal selection that is highly sensitive to culture conditions. Analyses of single-cell-derived clones demonstrated that continuous instability quickly translates into heterogeneity of the cell line. When the 27 MCF7 strains were tested against 321 anti-cancer compounds, we uncovered considerably different drug responses: at least 75% of compounds that strongly inhibited some strains were completely inactive in others. This study documents the extent, origins and consequences of genetic variation within cell lines, and provides a framework for researchers to measure such variation in efforts to support maximally reproducible cancer research.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Sharma, S. V., Haber, D. A. & Settleman, J. Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat. Rev. Cancer 10, 241–253 (2010).
  2. Freedman, L. P., Cockburn, I. M. & Simcoe, T. S. The economics of reproducibility in preclinical research. PLoS Biol. 13, e1002165 (2015).
  3. Prinz, F., Schlange, T. & Asadullah, K. Believe it or not: how much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov. 10, 712 (2011).
  4. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
  5. Garnett, M. J. et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483, 570–575 (2012).
  6. Basu, A. et al. An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules. Cell 154, 1151–1161 (2013).
  7. Yang, W. et al. Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 41, D955–D961 (2013).
  8. Seashore-Ludlow, B. et al. Harnessing connectivity in a large-scale small-molecule sensitivity dataset. Cancer Discov. 5, 1210–1223 (2015).
  9. Haibe-Kains, B. et al. Inconsistency in large pharmacogenomic studies. Nature 504, 389–393 (2013).
  10. The Cancer Cell Line Encyclopedia & Genomics of Drug Sensitivity in Cancer Investigators. Pharmacogenomic agreement between two cancer cell line data sets. Nature 528, 84–87 (2015).
  11. Haverty, P. M. et al. Reproducible pharmacogenomic profiling of cancer cell line panels. Nature 533, 333–337 (2016).
  12. Soule, H. D., Vazguez, J., Long, A., Albert, S. & Brennan, M. A human cell line from a pleural effusion derived from a breast carcinoma. J. Natl Cancer Inst. 51, 1409–1416 (1973).
  13. Brooks, S. C., Locke, E. R. & Soule, H. D. Estrogen receptor in a human cell line (MCF-7) from breast carcinoma. J. Biol. Chem. 248, 6251–6253 (1973).
  14. Lee, A. V., Oesterreich, S. & Davidson, N. E. MCF-7 cells–changing the course of breast cancer research and care for 45 years. J. Natl Cancer Inst. 107, djv073 (2015).
  15. Bamford, S. et al. The COSMIC (catalogue of somatic mutations in cancer) database and website. Br. J. Cancer 91, 355–358 (2004).
  16. Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452 (2017).
  17. Roth, A. et al. PyClone: statistical inference of clonal population structure in cancer. Nat. Methods 11, 396–398 (2014).
  18. Eirew, P. et al. Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature 518, 422–426 (2015).
  19. Bhang, H. E. et al. Studying clonal dynamics in response to cancer therapy using high-complexity barcoding. Nat. Med. 21, 440–448 (2015).
  20. Berger, A. H. et al. High-throughput phenotyping of lung cancer somatic mutations. Cancer Cell 30, 214–228 (2016).
  21. Peck, D. et al. A method for high-throughput gene expression signature analysis. Genome Biol. 7, R61 (2006).
  22. Gupta, P. B. et al. Stochastic state transitions give rise to phenotypic equilibrium in populations of cancer cells. Cell 146, 633–644 (2011).
  23. Lieber, M., Smith, B., Szakal, A., Nelson-Rees, W. & Todaro, G. A continuous tumor-cell line from a human lung carcinoma with properties of type II alveolar epithelial cells. Int. J. Cancer 17, 62–70 (1976).
  24. The Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550 (2014).
  25. Soule, H. D. et al. Isolation and characterization of a spontaneously immortalized human breast epithelial cell line, MCF-10. Cancer Res. 50, 6075–6086 (1990).
  26. Bray, M. A. et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat. Protoc. 11, 1757–1774 (2016).
  27. Liberzon, A. et al. The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
  28. Janiszewska, M. et al. In situ single-cell analysis identifies heterogeneity for PIK3CA mutation and HER2 amplification in HER2-positive breast cancer. Nat. Genet. 47, 1212–1219 (2015).
  29. Venteicher, A. S. et al. Decoupling genetics, lineages, and microenvironment in IDH-mutant gliomas by single-cell RNA-seq. Science 355, eaai8478 (2017).
  30. Bray, M. A., Fraser, A. N., Hasaka, T. P. & Carpenter, A. E. Workflow and metrics for image quality control in large-scale high-content screens. J. Biomol. Screen. 17, 266–274 (2012).
  31. Dao, D. et al. CellProfiler Analyst: interactive data exploration, analysis and classification of large biological image sets. Bioinformatics 32, 3210–3212 (2016).
  32. Adalsteinsson, V. A. et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nat. Commun. 8, 1324 (2017).
  33. Ha, G. et al. Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Res. 22, 1995–2007 (2012).
  34. Sholl, L. M. et al. Institutional implementation of clinical tumor profiling on an unselected cancer population. JCI Insight 1, e87062 (2016).
  35. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
  36. McKenna, A. et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
  37. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
  38. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).
  39. McLaren, W. et al. Deriving the consequences of genomic variants with the Ensembl API and SNP effect predictor. Bioinformatics 26, 2069–2070 (2010).
  40. Olshen, A. B., Venkatraman, E. S., Lucito, R. & Wigler, M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572 (2004).
  41. Abo, R. P. et al. BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Res. 43, e19 (2015).
  42. Sanjana, N. E., Shalem, O. & Zhang, F. Improved vectors and genome-wide libraries for CRISPR screening. Nat. Methods 11, 783–784 (2014).
  43. Joung, J. et al. Genome-scale CRISPR–Cas9 knockout and transcriptional activation screening. Nat. Protoc. 12, 828–863 (2017).
  44. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
  45. Rees, M. G. et al. Correlating chemical sensitivity and basal gene expression reveals mechanism of action. Nat. Chem. Biol. 12, 109–116 (2016).
  46. Golub, T. R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999).
  47. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
  48. Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).
  49. Meyers, R. M. et al. Computational correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells. Nat. Genet. 49, 1779–1784 (2017).
  50. Hu, Y. Efficient, high-quality force-directed graph drawing. Math. J. 10, 37–71 (2006).
  51. Barbie, D. A. et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature 462, 108–112 (2009).
  52. Fowlkes, E. B. & Mallows, C. L. A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 78, 553–569 (1983).
  53. Ben-David, U. et al. Patient-derived xenografts undergo mouse-specific tumor evolution. Nat. Genet. 49, 1567–1575 (2017).
  54. Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012).
  55. Zhang, S., Yuan, Y. & Hao, D. A genomic instability score in discriminating nonequivalent outcomes of BRCA1/2 mutations and in predicting outcomes of ovarian cancer treated with platinum-based chemotherapy. PLoS ONE 9, e113169 (2014).
  56. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
  57. Carter, S. L., Eklund, A. C., Kohane, I. S., Harris, L. N. & Szallasi, Z. A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers. Nat. Genet. 38, 1043–1048 (2006).
  58. Pujar, S. et al. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation. Nucleic Acids Res. 46, D221–D228 (2018).
  59. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
  60. Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A. & Dewey, C. N. RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010).
  61. Ben-David, U., Mayshar, Y. & Benvenisty, N. Virtual karyotyping of pluripotent stem cells on the basis of their global gene expression profiles. Nat. Protoc. 8, 989–997 (2013).
  62. Krijthe, J. H. Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation https://github.com/jkrijthe/Rtsne (2015).

Acknowledgements

We thank G. Wei, W. Zhang, C. Mader, J. Roth, D. Thomas, A. Kung, D. Davison, C. Chouinard, K. Walsh, A. Navarro, A. Berger, D. Wheeler, X. Jin, A. Hong, M. Trakala, P. Cho, J. Kuiken, R. Boidot, X. Lu, F. Huang, C. Johannessen, X. Wu, S. Santaguida, N. Sethi, A. Amon, K. Polyak, J. Brugge, D. Yu, J. Klefstrom, W. Hahn, I. Dunn and Y. Mei for contributing cell lines for this study; M. Ducar and S. Drinan for assistance with the OncoPanel assay; Z. Herbert for assistance with the whole-genome sequencing; J. Davis, S. Johnson, D. Lahr, J. Gould, M. Macaluso, X. Lu and T. Natoli for assistance with the L1000 assay; D. Feldman for assistance with the barcoding experiment; J. McFarland, J. Shih, C. Oh, A. Cherniack and P. Clemons for computational assistance; V. Jones and J. Gale for assistance with drug screening; K. Hartland for assistance with Cell Painting staining and imaging; A. Regev, O. Rozenblatt-Rosen, A. Neumann, D. Dionne and L. Nguyen for assistance with 10X single-cell RNA sequencing; E. Gonzalez for assistance with cytogenetic analyses; L. Lichtenstein, D. Benjamin, S. K. Lee, V. Ruano-Rubio and A. Chevalier for the GATK CNA pipeline; Z. Tothova, J. Boehm, O. Cohen, C. Johannessen, A. Subramanian, A. Carpenter, I. Tirosh, Y. Brody, Y. Zeira and R. Pistofidis for discussions. U.B.-D. is supported by a HFSP postdoctoral fellowship. This work was supported by HHMI (T.R.G.), NIH (R01 CA188228; R.Be.), Gray Matters Brain Cancer Foundation (R.Be.), Bridge Project (P.B. and R.Be.), Broad Institute SPARC award (P.B. and R.Be.) and BroadNext10 trainee grant (U.B.-D.).

Reviewer information

Nature thanks J. Yang and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Author information

Author notes

  1. These authors jointly supervised this work: Rameen Beroukhim, Todd R. Golub

Affiliations

  1. Broad Institute of Harvard and MIT, Cambridge, MA, USA

    • Uri Ben-David
    • Benjamin Siranosian
    • Gavin Ha
    • Helen Tang
    • Yaara Oren
    • Kunihiko Hinohara
    • Craig A. Strathdee
    • Joshua Dempster
    • Nicholas J. Lyons
    • Guillaume Kugener
    • Beth Cimini
    • Peter Tsvetkov
    • Yosef E. Maruvka
    • Ryan O’Rourke
    • Anthony Garrity
    • Andrew A. Tubelli
    • Pratiti Bandopadhayay
    • Aviad Tsherniak
    • Francisca Vazquez
    • Bang Wong
    • Chet Birger
    • Mahmoud Ghandi
    • Joshua A. Bittker
    • Matthew Meyerson
    • Gad Getz
    • Rameen Beroukhim
    • Todd R. Golub
  2. Dana-Farber Cancer Institute, Boston, MA, USA

    • Gavin Ha
    • Kunihiko Hinohara
    • Robert Burns
    • Anwesha Nag
    • Ryan O’Rourke
    • Pratiti Bandopadhayay
    • Aaron R. Thorner
    • Matthew Meyerson
    • Rameen Beroukhim
    • Todd R. Golub
  3. Harvard Medical School, Boston, MA, USA

    • Yaara Oren
    • Pratiti Bandopadhayay
    • Matthew Meyerson
    • Rameen Beroukhim
    • Todd R. Golub
  4. Massachusetts General Hospital, Boston, MA, USA

    • Gad Getz
  5. Brigham and Women’s Hospital, Boston, MA, USA

    • Rameen Beroukhim
  6. Howard Hughes Medical Institute, Chevy Chase, MD, USA

    • Todd R. Golub

Authors

  1. Uri Ben-David
  2. Benjamin Siranosian
  3. Gavin Ha
  4. Helen Tang
  5. Yaara Oren
  6. Kunihiko Hinohara
  7. Craig A. Strathdee
  8. Joshua Dempster
  9. Nicholas J. Lyons
  10. Robert Burns
  11. Anwesha Nag
  12. Guillaume Kugener
  13. Beth Cimini
  14. Peter Tsvetkov
  15. Yosef E. Maruvka
  16. Ryan O’Rourke
  17. Anthony Garrity
  18. Andrew A. Tubelli
  19. Pratiti Bandopadhayay
  20. Aviad Tsherniak
  21. Francisca Vazquez
  22. Bang Wong
  23. Chet Birger
  24. Mahmoud Ghandi
  25. Aaron R. Thorner
  26. Joshua A. Bittker
  27. Matthew Meyerson
  28. Gad Getz
  29. Rameen Beroukhim
  30. Todd R. Golub

Contributions

U.B.-D. conceived the project, collected the data, performed the experiments and carried out the analyses; B.S. assisted with computational analyses and figure preparation; G.H. assisted with ichorCNA, PyClone analyses and building the Cell STRAINER portal; H.T. assisted with cell culture and experiments; N.J.L. assisted with the L1000 assay; R.Bu., A.N. and A.R.T. assisted with the OncoPanel assay and analysis; B.C. assisted with the Cell Painting analysis; J.A.B. assisted with the chemical screens and their analysis; P.T. assisted with the proteasome activity assay and with the western blots; P.B., R.O. and A.G. assisted with the DNA barcoding experiment; C.A.S. and K.H. derived MCF7 single-cell clones; Y.O. assisted with single-cell RNA-sequencing data analysis; J.D. and F.V. assisted with the analysis of CRISPR screens; M.G., G.K. and A.T. assisted with the comparison of the Broad and Sanger whole-exome sequencing data; Y.E.M. assisted with compiling the RPE1 RNA-sequencing dataset; C.B. and G.G. assisted with building a web portal; B.W. and A.A.T. assisted with figure design and preparation; R.Be. and T.R.G. directed the project. U.B.-D., G.H., M.M., R.Be. and T.R.G. wrote the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Rameen Beroukhim or Todd R. Golub.

Extended data figures and tables

  1. Extended Data Fig. 1 Comparison of Broad and Sanger genomic features across 106 cell lines.

    a, Comparison of the Pearson correlations of germline versus somatic SNVs across 106 paired cell lines. b, A histogram of the distribution of mutation discordance fractions across cell lines. Black, the distribution for all non-silent SNVs; grey, the distribution for the 447 genes included in the OncoPanel. c, Comparison of the fraction of discordant gene-level CNAs between the Broad and the Sanger (n = 106 cell lines) datasets, using three different thresholds for CNA calling. Bar, median; box, 25th and 75th percentiles; whiskers, data within 1.5× IQR of lower or upper quartile; circles, data points. d, A histogram of the distribution of CNA discordance fractions across cell lines. Bars are coloured as in b. e, CNA landscapes of 11 paired cell lines. For each cell line, the CNA landscape of the Broad strain (top) and the Sanger strain (bottom) are shown. Red, copy number gains; blue, copy number losses. CNAs < 10 Mb in size are not presented. f, A histogram of the fraction of the genome affected by subclonal events across 916 cell lines from the CCLE. MCF7 and A549 are denoted by arrows. g, All CCLE cell lines ranked by their aneuploidy scores. h, All CCLE cell lines ranked by the number of their gene-level CNAs. i, All CCLE cell lines ranked by the number of their gene-level SNVs. j, All CCLE cell lines ranked by their chromosomal instability (CIN70) signature scores (ref. 57). k, All CCLE cell lines ranked by their DNA-repair signature scores (ref. 56). l, All CCLE cell lines ranked by their genomic instability scores (ref. 55). m, All CCLE cell lines ranked by their subclonal genome fraction (ref. 54). The vertical black line shows the rank of MCF7 in each comparison. n, Comparison of gene expression variation across multiple strains of nine cell lines, including MCF7. Box plots show the standard deviations of the expression levels for the 978 landmark genes directly measured in L1000. Bar, median; box, 25th and 75th percentiles; whiskers, data within 1.5× IQR of lower or upper quartile; circles, all data points.

  2. Extended Data Fig. 2 Schematic representation of the MCF7 and A549 strains included in the current study.

    a, MCF7 strains included in this study; their origins (columns), years of acquisition (rows), manipulations (colours) and progeny relationships (lines) are shown. b, A table corresponding to a. c, A549 strains included in this study, their origins (columns), years of acquisition (rows), manipulations (colours) and progeny relationships (lines) are shown. d, A table corresponding to c.

  3. Extended Data Fig. 3 Genetic variation across 27 MCF7 strains.

    a, Variation in the copy number status of nine selected genes across 27 MCF7 strains. Red, copy number gains; blue, copy number losses. Thresholds for relative gains and losses were set at 0.1 and −0.1, respectively. b, Western blots of the relative protein expression levels of ERα across strains. The expression of β-actin was used for normalization. For gel source data, see Supplementary Fig. 1. The experiment was repeated twice with similar results. c, Quantification of the relative expression of ERα. Strains Q and M were excluded from the comparison. Bar, median; box, 25th and 75th percentiles; whiskers, data within 1.5× IQR of lower or upper quartile; circles, all data points. One-tailed t-test. d, The allelic fractions of non-silent mutations in seven selected genes across 27 MCF7 strains. e, The number of non-silent point mutations (SNVs) across the 27 MCF7 strains. f, The number of COSMIC non-silent point mutations shared by each number of MCF7 strains. g, Top, unsupervised hierarchical clustering of 27 MCF7 strains, based on all of their SNVs. Groups of strains expected to cluster together based on their evolutionary history are highlighted, as in Fig. 1. Bottom, a corresponding heat map, showing the mutation status of all mutations across the 27 MCF7 strains. Only mutations that were identified in a subset of the strains and detected in more than 5% of the reads (AF > 0.05) are shown. Yellow, presence of a mutation; grey, absence of a mutation. h, The number of large (>15-bp) indels and rearrangements across the 27 MCF7 strains. Grey, indels; black, rearrangements. i, The recurrence of structural variants in each of the 42 (out of 60) genes for which at least one event was detected. j, The number of structural variants shared by each number of MCF7 strains. Note that this analysis is limited to the 60 genes listed in Supplementary Table 2.

  4. Extended Data Fig. 4 Comparison of CNA landscapes between MCF7 strains.

    a, CNA landscapes of a pair of MCF7 strains separated from each other by extensive passaging. b, CNA landscapes of a pair of MCF7 strains separated from each other by a genetic manipulation (introduction of a GFP reporter). c, CNA landscapes of 10 MCF7 strains separated by multiple freeze–thaw cycles, with little passaging in between. d, CNA landscapes of a pair of MCF7 strains that were either cultured in vitro (top) or cultured in vivo and treated with tamoxifen (bottom). e, CNA landscapes of a pair of MCF7 strains separated from each other by seven passages. f, CNA landscapes of a pair of MCF7 strains before (top) and after (bottom) the introduction of Cas9. g, CNA landscapes of a pair of MCF7 strains obtained from four different sources. h, CNA landscapes of a pair of MCF7 strains separated from each other by extensive passaging. Data points represent 1-Mb bins throughout the genome. Red, gains; blue, losses; black, normal copy numbers; yellow, differential CNAs between the compared strains.

  5. Extended Data Fig. 5 Characterization of the variation in SNV allelic fraction and cellular prevalence across 27 MCF7 strains and their single-cell-derived clones.

    a, Top, unsupervised hierarchical clustering of 27 MCF7 strains, based on the allelic fractions of all their SNVs. Groups of strains expected to cluster together based on their evolutionary history are highlighted, as in Fig. 1. Bottom, a corresponding heat map, showing the allelic fractions of all mutations across the 27 MCF7 strains. Mutations that were identified in only a subset of the strains are shown. The presence of a mutation is shown in colour according to its allelic fraction. b, The allelic fractions of an activating PIK3CA mutation (top) and an inactivating TP53 mutation (bottom) across strains. c, Top, unsupervised hierarchical clustering of 27 MCF7 strains based on their SNV cellular prevalence. Groups of strains expected to cluster together based on their evolutionary history are highlighted, as in Fig. 1. Bottom, a corresponding heat map, showing the cellular prevalence of all mutations across the 27 MCF7 strains. Mutations that were identified in only a subset of the strains are shown. The presence of a mutation is shown in colour according to its cellular prevalence. d, The distribution of the maximal differences in cellular prevalence (CP) of non-silent mutations, across 27 MCF7 strains. The peak at maximum ΔCP = 1 represents SNVs that are clonal in at least one strain but are nearly or completely absent in at least one other strain; the peak at maximum ΔCP = 0 represents SNVs that are detected at similar prevalence across all 27 strains; and the peak at maximum ΔCP ≈ 0.1 represents a group of SNVs present at CP ≈ 0.1 only in strain M. e, Description of the MCF7 single-cell-derived clones included in this study, including their parental cell line, genetic manipulations and relationship to one another. f, A heat map showing the allelic fractions of non-silent mutations in three wild-type single-cell-derived MCF7 clones (scWT3–scWT5) and the parental population. The presence of a mutation is shown in colour according to its allelic fraction. g, A heat map showing the allelic fractions of non-silent mutations in five genetically manipulated single-cell-derived MCF7 clones. For two of the clones, samples were passaged for a prolonged time and sequenced at multiple time points. The presence of a mutation is shown in colour according to its allelic fraction. h, Comparison of the karyotypic variation between parental and single-cell-derived cell populations. Histograms show the distribution of chromosome numbers from the parental (light grey) and single-cell-derived (dark grey) populations. P values indicate the significance of the differences between the variations (rather than the means) of the populations using a one-tailed Levene’s test (n = 50 metaphases per group). i, Two representative karyotypes of each sample. Note that all single-cell-derived clones are karyotypically heterogeneous. Marker chromosomes are not shown. Arrows point to partially aberrant chromosomes. Images are representative of 50 metaphases counted per sample. j, Two representative karyotypes from two cell populations of the same single-cell-derived clone, separated by six months of culture propagation. Marker chromosomes are not shown. Arrows point to partially aberrant chromosomes. Images are representative of 50 metaphases counted per sample. k, Comparison of the karyotypic variation between two cell populations of the same single-cell-derived clone, separated by six months of culture propagation. Histograms show the distribution of chromosome numbers from the early (light grey) and late (dark grey) populations. Per sample, 50 metaphases were counted. The P value indicates the significance of the difference between the means of the populations using a two-tailed Wilcoxon rank-sum test.

  6. Extended Data Fig. 6 Transcriptomic variation across 27 MCF7 strains and their single-cell-derived clones.

    a, Comparison of the L1000-based MCF7 expression profiles to microarray-based expression profiles from CCLE. Histograms show the distributions of the Spearman correlations between the 27 MCF7 strains and either MCF7 (light purple), two MCF7 derivatives (dark purple and blue), other breast cancer cell lines (green) or non-breast cancer cell lines (grey). The comparison is based on the 978 landmark genes directly measured in L1000. b, The number of differentially expressed genes identified in all possible pairwise comparisons of MCF7 strains, using a twofold change cutoff. LFC, log fold change; DEGs, differentially expressed genes. c, The 10 top hallmark gene sets identified by GSEA to be significantly enriched among the 100 genes that are most differentially expressed across the MCF7 strains. The two gene sets related to oestrogen response are highlighted in red. d, Comparison of gene expression variation within and between strains. Histograms show the distributions of gene expression variation within replicates of the same strain (grey), between closely related strains (purple) and between all strains (green). The comparison is based on the 978 landmark genes directly measured in L1000. e, Heat map showing the arm-level CNA profiles of 27 MCF7 strains. Red, gains; blue, losses. f, GSEA reveals downregulation of the genes on chromosomes 10q, 17q and 21q in strains that have lost copies of these arms, and upregulation of the genes on chromosomes 5q, 6p, 14q and 16p in strains that have gained copies of these arms. 
g, GSEA of the upregulation of mTOR signalling (gene set: hallmark_MTORC1_signalling) and of genes that are upregulated when PTEN is knocked down (gene set: PTEN_DN.v2_UP) in strains that have gained PIK3CA; downregulation of the oestrogen response signature (gene set: hallmark_oestrogen_response_late) in strains that have lost ESR1; cell cycle signature (gene set: KEGG_cell_cycle) in strains that have lost CDKN2A; and downregulation of KRAS signalling (gene set: hallmark_KRAS_signalling_DN) in strains that have lost MAP2K4. h, GSEA of the upregulation of mTOR signalling (gene set: hallmark_MTORC1_signalling) in strains with high prevalence of an activating PIK3CA mutation; upregulation of genes that are upregulated when PTEN is knocked down (gene set: PTEN_DN.v1_UP) in strains that have an inactivating PTEN mutation; and downregulation of genes that are downregulated when TP53 is knocked down (gene set: P53_DN.v1_DOWN) in strains with high cellular prevalence of an inactivating TP53 mutation. i, GSEA reveals upregulation of mTOR signalling (gene sets: MTOR_UP.N4.V1_UP and hallmark_MTORC1_signalling) in strains that have both PTEN copy number loss and an inactivating PTEN mutation. j, A t-SNE plot of single-cell RNA-seq data from MCF7-AA cells treated with bortezomib (500 nM) at different time points. Each dot represents a single cell, and cells are coloured by time point. k, Comparison of the proteasome gene expression signature across time points. l, Comparison of the unfolded protein response gene expression signature across time points. m, Comparison of two proliferation gene expression signatures, S (left) and G2M (right), across time points. n, Comparison of the early (left) and late (right) response to oestrogen gene expression signatures across time points. Red lines denote mean values. P values indicate significance from a one-way ANOVA followed by a Games–Howell post hoc test. n = 1,726, 2,743, 1,851 and 1,235 cells for t0, t12, t48 and t96, respectively. 
o, A t-SNE plot of single-cell RNA-seq data from a parental population and its single-cell-derived clone at two time points. Each dot represents a single cell, and cells are coloured by sample. p, Comparison of the transcriptional heterogeneity between a parental MCF7 population and its single-cell-derived clones. n = 2,904, 2,990, 3,896 and 4,583 cells for parental, scWT3, scWT4 and scWT5, respectively. q, Comparison of the transcriptional heterogeneity between two cultures of the same single-cell clone, separated by six months of continuous passaging. n = 4,295 and 4,116 cells, for clone9-May17 and clone9-Nov17, respectively. Box plots show the Euclidean distance between the cells in each cell population. Bar, median; box, 25th and 75th percentiles; whiskers, data within 1.5× IQR of lower or upper quartile. P values indicate significance from a one-way ANOVA followed by a Games–Howell post hoc test. r, The 10 top hallmark gene sets identified by GSEA to be significantly enriched among the top differentially expressed genes between the two cultures of clone MCF7_GREB1_9 (May 2017 versus November 2017). The gene sets related to oestrogen response are highlighted in red, and those related to proliferation are highlighted in green.

  7. Extended Data Fig. 7 Extensive genetic and transcriptional variation across 23 strains of A549.

    a, Top, unsupervised hierarchical clustering of 23 A549 strains, based on their non-silent SNV profiles derived from deep targeted sequencing. Strains expected to cluster together based on their evolutionary history are highlighted in blue. Bottom, a corresponding heat map, showing the mutation status of non-silent mutations across the 23 A549 strains. Mutations that were identified in only a subset of the strains and detected in more than 5% of the reads (AF > 0.05) are shown. The presence of a mutation is shown in yellow, and its absence in grey. b, The number of non-silent point mutations shared by each number of A549 strains. c, Top, unsupervised hierarchical clustering of 23 A549 strains, based on the allelic fractions of their non-silent SNVs. Bottom, a corresponding heat map, showing the allelic fractions of non-silent mutations across the 23 A549 strains. Mutations that were identified in only a subset of the strains are shown. The presence of a mutation is shown in colour according to its allelic fraction. d, The allelic fractions of non-silent mutations in six selected genes across 23 A549 strains. Note the inactivating frameshift mutation in SMARCA4, one of the most frequently mutated genes in lung adenocarcinoma (ref. 24), which was detected at an allelic fraction of ≈1 in 9 of the strains, but was not detected at all in the other 14 strains. e, The number of gene-level CNAs shared by each number of A549 strains. Red, copy number gains; blue, copy number losses. f, CNA variation in the copy number of CDKN2A. Red, copy number gains; blue, copy number losses. Thresholds for relative gains and losses were set at 0.1 and −0.1, respectively. g, Unsupervised hierarchical clustering of 23 A549 strains, based on their global gene expression profiles. Strains expected to cluster together based on their evolutionary history are highlighted in blue. h, A t-SNE plot of L1000-based gene expression profiles from multiple samples of nine cancer cell lines. 
The asterisk denotes the 23 A549 strains profiled in the current study. i, Comparison between the L1000-based A549 expression profiles and the microarray-based expression profiles from CCLE. Histograms show the distributions of the Spearman correlations between the 23 A549 strains and A549 (light blue), other non-small-cell lung cancer cell lines (purple), other lung cancer cell lines (green) or non-lung cancer cell lines (grey). The comparison is based on the 978 landmark genes directly measured in L1000. j, The number of differentially expressed genes identified in all possible pairwise comparisons of A549 strains, using a twofold change cutoff. k, Arm-level gains are associated with significant upregulation and arm-level losses are associated with significant downregulation of genes transcribed from the aberrant arms. GSEA showing upregulation of the genes on chromosome 2q in strains that have gained a copy of that arm (left), and downregulation of the genes on chromosome 9q in strains that have lost a copy of that arm (right). l, Gene-level CNAs are associated with significant dysregulation of the perturbed pathways. GSEA reveals upregulation of the genes that are upregulated, and downregulation of the genes that are downregulated, when TP53 is knocked down in strains with MDM2 high-level copy number gain; and upregulation or downregulation of the G2/M cell cycle checkpoint signature in strains with CDKN2A copy number loss or CCND1 copy number gain. m, Point mutations are associated with significant dysregulation of the perturbed pathways. For example, GSEA reveals downregulation of two PRC2-related expression signatures in strains with an inactivating SMARCA4 mutation. n, The 10 top gene sets identified by GSEA to be significantly enriched among the 100 genes that are most differentially expressed across the A549 strains. The six gene sets related to KRAS signalling are highlighted in red.
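
The pairwise comparison in panel j can be illustrated with a minimal sketch. It assumes linear-scale expression values and treats a gene as differentially expressed when the ratio between two strains is at least twofold in either direction. The function names and toy values are hypothetical and are not taken from the study.

```python
from itertools import combinations
from math import log2


def count_de_genes(expr_a, expr_b, fold=2.0):
    """Count genes whose expression differs by at least `fold` between two strains.

    expr_a, expr_b: dicts mapping gene name -> linear-scale expression (> 0).
    Only genes measured in both strains are compared (an assumption).
    """
    cutoff = log2(fold)
    shared = expr_a.keys() & expr_b.keys()
    return sum(1 for g in shared if abs(log2(expr_a[g] / expr_b[g])) >= cutoff)


def pairwise_de_counts(strains, fold=2.0):
    """Return {(strain_a, strain_b): number of DE genes} for every strain pair."""
    return {
        (a, b): count_de_genes(strains[a], strains[b], fold)
        for a, b in combinations(sorted(strains), 2)
    }


# Toy data: three hypothetical strains, three hypothetical genes.
strains = {
    "A": {"g1": 10.0, "g2": 8.0, "g3": 4.0},
    "B": {"g1": 20.0, "g2": 9.0, "g3": 1.0},
    "C": {"g1": 11.0, "g2": 8.0, "g3": 4.5},
}
counts = pairwise_de_counts(strains)
n_pairs_23 = len(list(combinations(range(23), 2)))  # pairs among 23 strains
```

For 23 strains, "all possible pairwise comparisons" corresponds to 253 strain pairs.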

  8. Extended Data Fig. 8 Genetic variation across multiple strains of additional cancer and non-cancer cell lines.

    a, The fraction of non-silent SNVs that are discordant between pairs of strains of the same cell line. Data are mean ± s.e.m. n, number of strain pairs compared. b, Arm-level CNAs arise in RPE1 samples. Plots show CNAs detected by an e-karyotyping analysis of 26 RPE1 samples. Red, gains; blue, losses. c, Comparison of variability in non-silent SNVs between non-transformed, partially transformed and fully transformed MCF10A samples. Box plots show the fraction of discordant non-silent SNVs between pairs of samples within each category. Bar, median; box, 25th and 75th percentiles; whiskers, data within 1.5× IQR of lower or upper quartile; circles, all data points. One-tailed Wilcoxon rank-sum test, n = 28, 112 and 14 strain pairs, for the non-transformed, partially transformed and the fully transformed groups, respectively. d, Comparison of the Broad–Sanger allelic fraction correlations of cell lines derived from primary tumours and those derived from metastases. Bar, median; coloured rectangle, 25th and 75th percentiles; width of the violin indicates frequency at that value. One-tailed Wilcoxon rank-sum test. e, Top, comparison of the chromosomal instability (CIN70) gene expression signature score between CCLE lines derived from primary tumours and those derived from metastases. Bottom, comparison of the weighted-genomic integrity index (wGII) between CCLE lines derived from primary tumours and those derived from metastases. Bar, median; coloured rectangle, 25th and 75th percentiles; width of the violin indicates frequency at that value. One-tailed Wilcoxon rank-sum test. f, Comparison of the Broad–Sanger allelic fraction correlations of microsatellite-stable cell lines (MSS) and microsatellite-unstable cell lines (MSI). Bar, median; box, 25th and 75th percentiles; whiskers, data within 1.5× IQR of lower or upper quartile; circles, all data points. One-tailed Wilcoxon rank-sum test. 
g, Heat maps show the allelic fractions of non-silent mutations in multiple strains of cancer cell lines. The presence of a mutation is shown in colour according to its allelic fraction. h, Heat maps show the allelic fractions of non-silent mutations in multiple strains of the non-cancer cell lines HA1E and MCF10A. The presence of a mutation is shown in colour according to its allelic fraction. Also shown is an unsupervised hierarchical clustering of the 15 MCF10A strains, which represent different degrees of cellular transformation, based on their non-silent mutation profiles.

  9. Extended Data Fig. 9 Characterization of cell proliferation and morphology across 27 MCF7 strains.

    a, Growth response curves of 27 MCF7 strains, based on microscopy imaging. b, Doubling time of the 27 MCF7 strains, as measured by automatic microscopy imaging. c, Variation in cellular radius across the 27 MCF7 strains. d, Variation in form factor, a measure of circularity, across the 27 MCF7 strains. e, Variation in nuclear radius across the 27 MCF7 strains. a–e, Data are mean ± s.d., circles show individual values; n = 3 replicate wells per data point. f, Microscopy imaging of the 27 MCF7 strains, showing the morphological differences between them. Scale bar, 300 μm. Images are representative of five replicate wells per strain. g, Unsupervised hierarchical clustering of 27 MCF7 strains, based on 1,784 morphological features. h, The correlation between proliferation rate (shown as doubling time) and the number of non-silent protein-coding mutations, across 18 naturally occurring MCF7 strains (that is, strains that have not undergone drug selection or genetic manipulation). Spearman’s ρ and P values indicate the strength and significance of the correlation, respectively. i, The correlation between proliferation rate (shown as doubling time) and the fraction of subclonal mutations, across 18 naturally occurring MCF7 strains. Spearman’s ρ and P values indicate the strength and significance of the correlation, respectively.
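
Two of the quantities in this legend can be sketched in a few lines. The sketch assumes the common image-analysis definition of form factor, 4πA/P² (equal to 1 for a perfect circle), and assumes exponential growth between two cell counts for the doubling time. Whether the imaging pipeline used in the study applies exactly these definitions is not stated in the legend, and the function names are hypothetical.

```python
import math


def form_factor(area, perimeter):
    """Circularity as 4*pi*area / perimeter**2 (assumed definition; 1.0 for a circle)."""
    return 4.0 * math.pi * area / perimeter ** 2


def doubling_time(n0, n1, elapsed_hours):
    """Doubling time in hours, assuming exponential growth from n0 to n1 cells."""
    return elapsed_hours * math.log(2) / math.log(n1 / n0)


# A circle of radius 5 and a square of side 2 as toy shapes.
circle_ff = form_factor(math.pi * 5 ** 2, 2 * math.pi * 5)
square_ff = form_factor(4.0, 8.0)

# A hypothetical well that grows from 1,000 to 4,000 cells in 48 hours.
dt_hours = doubling_time(1000, 4000, 48.0)
```

Less circular shapes give lower form factor values, and a faster-proliferating strain gives a shorter doubling time.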

  10. Extended Data Fig. 10 Characterization of drug-response variation across 27 MCF7 strains.

    a, Unsupervised hierarchical clustering of 27 MCF7 strains, based on their response to all 321 compounds in the primary screen. Groups of strains expected to cluster together based on their evolutionary history are highlighted, as in Fig. 1. b, Pie chart of the classification of the screened compounds based on their differential activity. The response to each active compound was defined as ‘consistent’ if viability change was <−50% for all strains, ‘variable’ if viability change was <−50% for some strains and >−20% for other strains, or ‘intermediate’ if viability change was in between these values. Classification was performed using a two-strain threshold. c, Pie charts as in b excluding strains Q and M that were generally more drug resistant. Classification was performed using a one-strain or a two-strain threshold (left and right charts, respectively). d, Pie charts as in b using an activity threshold of viability change <−80%. Classification was performed using a one-strain threshold, either including all strains (left) or excluding strains Q and M (right). e, The number of gene-level CNAs shared by each number of MCF7 strains. Red, copy number gains; blue, copy number losses. f, The number of non-silent point mutations shared by each number of MCF7 strains. The 10 naturally occurring Connectivity Map (CMap) strains were averaged and considered as a single sample. g, The correlation between proliferation rate (shown as doubling time) and the number of non-silent protein-coding mutations, across naturally occurring MCF7 strains (n = 10). Spearman’s ρ and P values indicate the strength and significance of the correlation, respectively. The 10 naturally occurring CMap strains were averaged and considered as a single sample. h, The correlation between proliferation rate (shown as doubling time) and the fraction of subclonal mutations, across naturally occurring MCF7 strains (n = 10). 
Spearman’s ρ and P values indicate the strength and significance of the correlation, respectively. The 10 naturally occurring CMap strains were averaged and considered as a single sample. i, The number of differentially expressed genes identified in all possible pairwise comparisons of MCF7 strains, using a twofold change cutoff. The 10 naturally occurring CMap strains were averaged and considered as a single sample. j, Pie charts of the classification of the screened compounds based on their differential activity. The response to each active compound was defined as consistent if viability change was <−50% for all strains, variable if viability change was <−50% for some strains and >−20% for other strains, or intermediate if viability change was in between these values. Classification was performed using a one-strain or a two-strain resistance threshold (top and bottom charts, respectively). The 10 naturally occurring CMap strains were averaged and considered as a single sample. k, The dose–response curves for ten compounds are shown. For each compound, eight concentrations were tested in each strain. Two sensitive strains and two insensitive strains are plotted. Each data point represents the mean of two replicates. Nutlin-3, a compound that had no toxicity against any of the strains in the primary screen, was included as a negative control. Romidepsin, a compound that killed all strains very efficiently in the primary screen, was included as a positive control and turned out to be differentially active at lower concentrations. l, The Pearson’s correlation of the two compound screen replicates across the MCF7 strains. m, Strains more sensitive to proteasome inhibitors exhibit higher proteasome activity. The chymotrypsin-like activity of the proteasome was measured in three sensitive and three insensitive strains. Data are mean ± s.d., one-tailed t-test, n = 4 replicate wells. 
n, Western blots of the relative protein expression levels of the proteasome 19S complex members PSMC2 and PSMD1 in three sensitive and three insensitive strains. The expression of α-tubulin was used for normalization. The experiment was repeated once, with n = 3 strains per group. For gel source data, see Supplementary Fig. 1. o, Quantification of the relative expression of PSMC2 and PSMD1. Data are mean ± s.d., one-tailed t-test, n = 3 strains per group. p, Upregulation of the KEGG cell cycle signature in strains sensitive to the cell cycle inhibitor CDK/CRK inhibitor (n = 3) compared to insensitive strains (n = 12). q, Upregulation of mTOR signalling in strains sensitive to the PI3K inhibitor PP-121 (n = 11) compared to insensitive strains (n = 5). r, Downregulation of the genes that are downregulated when ALK is knocked down in strains sensitive to the ALK inhibitor TAE-684 (n = 4) compared to insensitive strains (n = 15). s, Upregulation of IL-6–JAK–STAT3 signalling in strains sensitive to the STAT inhibitor nifuroxazide (n = 9) compared to insensitive strains (n = 6). t, Upregulation of the genes that are upregulated when AKT is overexpressed in strains sensitive to the AKT inhibitor triciribine (n = 2) compared to insensitive strains (n = 8). u, Upregulation of hypoxia signalling in strains sensitive to the HSP inhibitor 17-AAG (n = 3) compared to insensitive strains (n = 15). v, Downregulation of xenobiotic metabolism signatures in strains M and Q (n = 2), which exhibited an increased resistance to most compounds compared to the other strains (n = 25). w, Upregulation of the early and late oestrogen response signatures, in strains most sensitive to the ER inhibitor tamoxifen (n = 5) compared to the least sensitive strains (n = 5). x, Sensitivity to oestrogen depletion and to tamoxifen is associated with the copy number status of ESR1. 
Heat maps represent the relative viability in oestrogen-depleted medium (top) and in response to tamoxifen (at 16.6 μM; bottom).
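
The compound classification described in panels b–d and j can be sketched as follows. The −50% and −20% cutoffs come from the legend. The interpretation of the one-strain or two-strain threshold (here `min_strains`, the minimum number of strains that must be sensitive for a compound to count as active, and resistant for it to count as variable) is an assumption, as are the function name and the extra "inactive" label.

```python
def classify_compound(viability_change, min_strains=2, active=-50.0, inactive=-20.0):
    """Classify one compound from its per-strain viability change (in percent).

    Assumed reading of the legend:
      - sensitive strain: viability change < `active` (-50%)
      - resistant strain: viability change > `inactive` (-20%)
      - the compound is considered active only if at least `min_strains`
        strains are sensitive
    """
    sensitive = sum(1 for v in viability_change if v < active)
    resistant = sum(1 for v in viability_change if v > inactive)
    if sensitive < min_strains:
        return "inactive"      # not part of the legend's three classes
    if sensitive == len(viability_change):
        return "consistent"    # strongly inhibits every strain
    if resistant >= min_strains:
        return "variable"      # kills some strains, spares others
    return "intermediate"


# Hypothetical viability changes for four strains per compound.
examples = {
    "cmpd_consistent": [-80, -70, -90, -60],
    "cmpd_variable": [-80, -70, -10, -5],
    "cmpd_intermediate": [-80, -70, -30, -40],
    "cmpd_borderline": [-80, -70, -60, -10],
}
labels = {name: classify_compound(v) for name, v in examples.items()}
```

Under this reading, `cmpd_borderline` is intermediate with the two-strain threshold but becomes variable with `min_strains=1`, which is why the legend reports both thresholds.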

  11. Extended Data Fig. 11 Comparison of genetic-, transcriptomic- and drug-response-based clustering trees, genomic distances and CRISPR dependencies.

    a, Comparison of clustering trees using the Fowlkes–Mallows approach. The dendrograms were based on SNVs, gene-level CNAs, arm-level CNAs, gene expression profiles and drug response patterns and were all compared to each other. The Fowlkes–Mallows index (Bk) was computed for all potential numbers of clusters (k values) ranging from 5 to 26. The red lines indicate the observed Bk values, whereas the grey lines represent the 95% upper quantile of the randomized distribution. The maximum Bk value represents the degree of similarity between the compared pair of dendrograms. The grey shading represents the difference between the observed Bk values and those of the 95% upper quantile of the randomized distribution. b, Force-directed layout of screened lines using a similarity matrix determined by the probability of cell lines clustering together in dependency space. Cell lines (nodes) are coloured by lineage. c, Left, the overlap of dependencies in KPL1 and MCF7 using corrected CERES scores, with genes showing depletion effects in all cell lines (that is, pan-essential genes) excluded. The threshold for dependency was set as a CERES score <−0.5. Right, overlap in dependency with genes of indeterminate dependency status (CERES scores between −0.4 and −0.6) in either cell line excluded. d, A two-sample GSEA of MCF7 and KPL1 against the oestrogen response gene sets (n = 1 sample per group). Expression of the oestrogen signalling pathway is strongly enriched in MCF7. e, The correlation between ESR1 dependency values and the single-sample GSEA enrichment scores of the oestrogen response hallmark gene sets (n = 27 cell lines). The difference in oestrogen response signalling between MCF7 and KPL1 predicts their differing levels of dependency on ESR1. f, The correlation between GATA3 dependency and GATA3 protein levels (z-scored values for reverse-phase protein arrays; n = 27 cell lines). 
The difference in GATA3 protein levels between MCF7 and KPL1 predicts their differing levels of dependency on GATA3. Spearman’s ρ and P values indicate the strength and significance of the correlations, respectively. g, Top, comparison of proliferation rates between a parental MCF7 population and its single-cell-derived clones. Bottom, comparison of proliferation rates between two cultures of the same single-cell clone, separated by six months of continuous passaging. Box plots show the population doubling time of each sample. Bar, median; box, 25th and 75th percentiles; whiskers, data within 1.5× IQR of lower or upper quartile; circles, all data points. Two-tailed t-test; n, replicate wells. h, Top, comparison of the sensitivity to oestrogen depletion between a parental MCF7 population and its single-cell-derived clones. Bottom, comparison of the sensitivity to oestrogen depletion between two cultures of the same single-cell clone, separated by six months of continuous passaging. Box plots show the relative growth rate in oestrogen-depleted medium. Bar, median; box, 25th and 75th percentiles; whiskers, data within 1.5× IQR of lower or upper quartile; circles, all data points. Two-tailed t-test; n, replicate wells. i, The correlation between sensitivity to tamoxifen (relative viability at 20 μM) and the sensitivity to oestrogen depletion (relative growth rate), across the parental MCF7 populations and their single-cell clones (n = 7). Spearman’s ρ and P values indicate the strength and significance of the correlation, respectively. j, Correlation plots between various measures used to estimate the distance between cell line strains (n = 351 strain pairs). CNA distances (based on ultra-low-pass whole-genome sequencing or targeted sequencing), SNV distances, gene expression distances and drug response distances were compared to each other. CNA distance based on ultra-low-pass whole-genome DNA-sequencing was determined by the fraction of the genome affected by discordant CNA calls. 
CNA and SNV distances based on targeted sequencing were determined by Jaccard indices. Gene expression and drug-response distances were determined by Euclidean distances. Spearman’s ρ and P values indicate the strength and significance of the correlation, respectively.
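
The distance measures named in panel j can be illustrated with a minimal sketch. It assumes that the Jaccard-based distance is one minus the Jaccard index of two strains' sets of called events, and that Euclidean distance is taken between equal-length expression or drug-response vectors. The function names and toy inputs are hypothetical.

```python
from math import comb, sqrt


def jaccard_distance(set_a, set_b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of called SNVs or CNAs (assumed form)."""
    union = set_a | set_b
    if not union:
        return 0.0  # two empty call sets are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def euclidean_distance(vec_a, vec_b):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))


# Toy mutation calls for two hypothetical strains.
snv_dist = jaccard_distance({"TP53", "PTEN", "PIK3CA"}, {"TP53", "PTEN", "ESR1"})

# Toy two-feature profiles for two hypothetical strains.
expr_dist = euclidean_distance([0.0, 3.0], [4.0, 0.0])

# Number of unordered pairs among 27 strains.
n_pairs = comb(27, 2)
```

With 27 strains there are `comb(27, 2)` = 351 unordered pairs, which matches the number of strain pairs given in panel j.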

  12. Extended Data Table 1 Implications of this study for the use of cell lines in cancer research

Supplementary information

  1. Supplementary Information

    This file contains Supplementary Figure 1, Supplementary Discussion and Supplementary References

  2. Reporting Summary

  3. Supplementary Tables

    This zipped file contains Supplementary Tables 1-32 and a Supplementary Table Guide

About this article

DOI

https://doi.org/10.1038/s41586-018-0409-3
