Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

Massively parallel phenotyping of coding variants in cancer with Perturb-seq

An Author Correction to this article was published on 18 October 2022

This article has been updated

Abstract

Genome sequencing studies have identified millions of somatic variants in cancer, but it remains challenging to predict the phenotypic impact of most. Experimental approaches to distinguish impactful variants often use phenotypic assays that report on predefined gene-specific functional effects in bulk cell populations. Here, we develop an approach to functionally assess variant impact in single cells by pooled Perturb-seq. We measured the impact of 200 TP53 and KRAS variants on RNA profiles in over 300,000 single lung cancer cells, and used the profiles to categorize variants into phenotypic subsets to distinguish gain-of-function, loss-of-function and dominant negative variants, which we validated by comparison with orthogonal assays. We discovered that KRAS variants did not merely fit into discrete functional categories, but spanned a continuum of gain-of-function phenotypes, and that their functional impact could not have been predicted solely by their frequency in patient cohorts. Our work provides a scalable, gene-agnostic method for coding variant impact phenotyping, with potential applications in multiple disease settings.

This is a preview of subscription content, access via your institution

Access options

Rent or buy this article

Prices vary by article type

from$1.95

to$39.95

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Coding variants Perturb-seq for scalable high-content profiling of coding variant impact.
Fig. 2: Coding variants Perturb-seq recapitulates known biology of loss-of-function variants in the tumor suppressor gene TP53.
Fig. 3: sc-eVIP annotates known gain-of-function variants in the KRAS oncogene.
Fig. 4: KRAS variants form a gradual functional gradient within and across variant groups.
Fig. 5: Relationship between sc-eVIP scores and the frequency of variants in patient cohorts.
Fig. 6: Power analysis and outlook for the variant-to-function efforts to phenotype cancer coding variation.

Similar content being viewed by others

Data availability

Raw and processed data associated with this work are publicly accessible on Gene Expression Omnibus with accession numbers GSE161824 (single-cell RNA-seq data, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE161824) and GSE183670 (bulk RNA-seq data, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE183670). In addition, previously published datasets/databases used in this study are: cBioPortal http://www.cbioportal.org/, GenBank74, https://www.ncbi.nlm.nih.gov/genbank/, Foundation Medicine Panel39, https://assets.ctfassets.net/w98cd481qyp0/YqqKHaqQmFeqc5ueQk48w/c35460768c3a76ef738dcf88f8219524/F1CDx_Tech_Specs_072021.pdf, the APPRIS database81, https://apprisws.bioinfo.cnio.es/landing_page/ and GILA data from http://www.targetkras.com/.

Code availability

All analyses can be recapitulated with Jupyter notebooks at http://github.com/klarman-cell-observatory/sc_eVIP, and using the Perturb-seq library at http://github.com/klarman-cell-observatory/perturbseq. In addition, we have used the following software packages: Cellranger 2.1.1 (ref. 73), Bowtie 1.2.2 (ref. 78), R packages DNAbarcodes 0.1 (ref. 42) and PRROC 1.3.1 (ref. 82), and Python packages Scanpy 1.7.2 (ref. 75), Sklearn 0.24.1 (ref. 83) and Mimosca https://github.com/asncd/MIMOSCA.

Change history

References

  1. Rehm, H. L. & Fowler, D. M. Keeping up with the genomes: scaling genomic variant interpretation. Genome Med. 12, 5 (2019).

    PubMed  PubMed Central  Google Scholar 

  2. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501 (2014).

    CAS  PubMed  PubMed Central  Google Scholar 

  3. Zehir, A. et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nat. Med. 23, 703–713 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  4. Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 174, 1034–1035 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  5. Tate, J. G. et al. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 47, D941–D947 (2019).

    CAS  PubMed  Google Scholar 

  6. Hess, J. M. et al. Passenger hotspot mutations in cancer. Cancer Cell 36, 288–301.e14 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  7. Muiños, F. et al. In silico saturation mutagenesis of cancer genes. Nature 596, 428–432 (2021).

    PubMed  Google Scholar 

  8. Chang, M. T. et al. Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. Nat. Biotechnol. 34, 155–163 (2016).

    CAS  PubMed  Google Scholar 

  9. Kamburov, A. et al. Comprehensive assessment of cancer missense mutation clustering in protein structures. Proc. Natl Acad. Sci. USA 112, E5486–E5495 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  10. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  11. Figliuzzi, M., Jacquier, H., Schug, A., Tenaillon, O. & Weigt, M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol. Biol. Evol. 33, 268–280 (2016).

    CAS  PubMed  Google Scholar 

  12. Giacomelli, A. O. et al. Mutational processes shape the landscape of TP53 mutations in human cancer. Nat. Genet. 50, 1381–1387 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  13. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  14. Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  15. Alexandrov, L. B. et al. Clock-like mutational processes in human somatic cells. Nat. Genet. 47, 1402–1407 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  16. Brenan, L. et al. Phenotypic characterization of a comprehensive set of MAPK1/ERK2 missense mutants. Cell Rep. 17, 1171–1183 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  17. Findlay, G. M. et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature 562, 217–222 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  18. Dogruluk, T. et al. Identification of variant-specific functions of PIK3CA by rapid phenotyping of rare mutations. Cancer Res. 75, 5341–5354 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  19. Yu, K. et al. PIK3CA variants selectively initiate brain hyperactivity during gliomagenesis. Nature 578, 166–171 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  20. Gao, Y. et al. Allele-specific mechanisms of activation of MEK1 mutants determine their properties. Cancer Discov. 8, 648–661 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  21. Boettcher, S. et al. A dominant-negative effect drives selection of TP53 missense mutations in myeloid malignancies. Science 365, 599–604 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  22. Kotler, E. et al. A systematic p53 mutation library links differential functional impact to cancer mutation pattern and evolutionary conservation. Mol. Cell 71, 873 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  23. Hamza, A. et al. Complementation of yeast genes with human genes as an experimental platform for functional testing of human genetic variants. Genetics 201, 1263–1274 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  24. Sun, S. et al. An extended set of yeast-based functional assays accurately identifies human disease mutations. Genome Res. 26, 670–680 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  25. Weile, J. et al. A framework for exhaustively mapping functional missense variants. Mol. Syst. Biol. 13, 957 (2017).

    PubMed  PubMed Central  Google Scholar 

  26. Lee, M. G. & Nurse, P. Complementation used to clone a human homologue of the fission yeast cell cycle control gene cdc2. Nature 327, 31–35 (1987).

    CAS  PubMed  Google Scholar 

  27. Osborn, M. J. & Miller, J. R. Rescuing yeast mutants with human genes. Brief. Funct. Genom. Proteomic. 6, 104–111 (2007).

    CAS  Google Scholar 

  28. Gerasimavicius, L., Liu, X. & Marsh, J. A. Identification of pathogenic missense mutations using protein stability predictors. Sci. Rep. 10, 15387 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  29. Sahni, N. et al. Widespread macromolecular interaction perturbations in human genetic disorders. Cell 161, 647–660 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  30. Moal, I. H. & Fernández-Recio, J. SKEMPI: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics 28, 2600–2607 (2012).

    CAS  PubMed  Google Scholar 

  31. Leung, I., Dekel, A., Shifman, J. M. & Sidhu, S. S. Saturation scanning of ubiquitin variants reveals a common hot spot for binding to USP2 and USP21. Proc. Natl Acad. Sci. USA 113, 8705–8710 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  32. Heyne, M., Papo, N. & Shifman, J. M. Generating quantitative binding landscapes through fractional binding selections combined with deep sequencing and data normalization. Nat. Commun. 11, 297 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  33. Yang, M., Wu, Z. & Fields, S. Protein-peptide interactions analyzed with the yeast two-hybrid system. Nucleic Acids Res. 23, 1152–1156 (1995).

    CAS  PubMed  PubMed Central  Google Scholar 

  34. Kim, E. et al. Systematic functional interrogation of rare cancer variants identifies oncogenic alleles. Cancer Discov. 6, 714–726 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  35. Berger, A. H. et al. High-throughput phenotyping of lung cancer somatic mutations. Cancer Cell 30, 214–228 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  36. Rohban, M. H. et al. Systematic morphological profiling of human gene and allele function via Cell Painting. eLife 6, e24060 (2017).

    PubMed  PubMed Central  Google Scholar 

  37. Dixit, A. et al. Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA Profiling of pooled genetic screens. Cell 167, 1853–1866.e17 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  38. Adamson, B. et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell 167, 1867–1882.e21 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  39. FoundationOne CDx. https://www.foundationmedicine.com/test/foundationone-cdx

  40. AACR Project GENIE Consortium. AACR Project GENIE: powering precision medicine through an international consortium. Cancer Discov. 7, 818–831 (2017).

    Google Scholar 

  41. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  42. Hotelling, H. The generalization of Student’s ratio. Ann. Math. Stat. 2, 360–378 (1931).

    Google Scholar 

  43. Fischer, M. Census and evaluation of p53 target genes. Oncogene 36, 3943–3956 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  44. Jeay, S. et al. A distinct p53 target gene set predicts for response to the selective p53–HDM2 inhibitor NVP-CGM097. eLife 4, e06498 (2015).

    PubMed  PubMed Central  Google Scholar 

  45. Hong, D. S. et al. KRASG12C inhibition with sotorasib in advanced solid tumors. N. Engl. J. Med. 383, 1207–1217 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  46. Singh, A. et al. A gene expression signature associated with ‘K-Ras addiction’ reveals regulators of EMT and tumor cell survival. Cancer Cell 15, 489–500 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  47. Meyers, R. M. et al. Computational correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells. Nat. Genet. 49, 1779–1784 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  48. Rotem, A. et al. Alternative to the soft-agar assay that permits high-throughput drug and genetic screens for cellular transformation. Proc. Natl Acad. Sci. USA 112, 5708–5713 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  49. Ly, S. H. Investigation of KRAS Dependency Bypass and Functional Characterization of All Possible KRAS Missense Variants. PhD thesis, Harvard Univ. http://nrs.harvard.edu/urn-3:HUL.InstRepos:40050098 (2018).

  50. UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).

    Google Scholar 

  51. Lu, J., Bera, A. K., Gondi, S. & Westover, K. D. KRAS switch mutants D33E and A59G crystallize in the state 1 conformation. Biochemistry 57, 324–333 (2018).

    CAS  PubMed  Google Scholar 

  52. Akagi, K. et al. Characterization of a novel oncogenic K-ras mutation in colon cancer. Biochem. Biophys. Res. Commun. 352, 728–732 (2007).

    CAS  PubMed  Google Scholar 

  53. Serrano, M., Lin, A. W., McCurrach, M. E., Beach, D. & Lowe, S. W. Oncogenic ras provokes premature cell senescence associated with accumulation of p53 and p16INK4a. Cell 88, 593–602 (1997).

    CAS  PubMed  Google Scholar 

  54. Bouaoun, L. et al. TP53 variations in human cancers: new lessons from the IARC TP53 database and genomics data. Hum. Mutat. 37, 865–876 (2016).

    CAS  PubMed  Google Scholar 

  55. Datlinger, P. et al. Ultra-high-throughput single-cell RNA sequencing and perturbation screening with combinatorial fluidic indexing. Nat. Methods 18, 635–642 (2021).

    CAS  PubMed  PubMed Central  Google Scholar 

  56. Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell https://doi.org/10.1016/j.cell.2020.09.056 (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  57. Sidore, A. M. et al. DropSynth 2.0: high-fidelity multiplexed gene synthesis in emulsions. Nucleic Acids Res. 48, e95 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  58. Kinker, G. S. et al. Pan-cancer single cell RNA-seq uncovers recurring programs of cellular heterogeneity. Preprint at bioRxiv https://doi.org/10.1101/807552 (2019).

  59. McFarland, J. M. et al. Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action. Nat. Commun. 11, 4296 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  60. Gaidukov, L. et al. A multi-landing pad DNA integration platform for mammalian cell engineering. Nucleic Acids Res. 46, 4072–4086 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  61. Gaudelli, N. M. et al. Programmable base editing of A•T to G•C in genomic DNA without DNA cleavage. Nature 551, 464–471 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  62. Komor, A. C., Kim, Y. B., Packer, M. S., Zuris, J. A. & Liu, D. R. Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature 533, 420–424 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  63. Anzalone, A. V. et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 576, 149–157 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  64. Lebrigand, K., Magnone, V., Barbry, P. & Waldmann, R. High throughput error corrected Nanopore single cell transcriptome sequencing. Nat. Commun. 11, 4025 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  65. Volden, R. & Vollmers, C. Highly multiplexed single-cell full-length cDNA sequencing of human immune cells with 10X Genomics and R2C2. Preprint at bioRxiv https://doi.org/10.1101/2020.01.10.902361 (2020).

  66. Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  67. Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  68. Cleary, B., Cong, L., Cheung, A., Lander, E. S. & Regev, A. Efficient generation of transcriptomic profiles by random composite measurements. Cell 171, 1424–1436.e18 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  69. Cleary, B. & Regev, A. The necessity and power of random, under-sampled experiments in biology. Preprint at https://arxiv.org/abs/2012.12961 (2020).

  70. Frangieh, C. J. et al. Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion. Nat. Genet. 53, 332–341 (2021).

    CAS  PubMed  PubMed Central  Google Scholar 

  71. Buschmann, T. & Bystrykh, L. V. Levenshtein error-correcting barcodes for multiplexed DNA sequencing. BMC Bioinf. 14, 272 (2013).

    Google Scholar 

  72. Jaitin, D. A. et al. Dissecting immune circuits by linking CRISPR-pooled screens with single-cell RNA-seq. Cell 167, 1883–1896.e15 (2016).

    CAS  PubMed  Google Scholar 

  73. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  74. Benson, D. A. et al. GenBank. Nucleic Acids Res. 41, D36–D42 (2013).

    CAS  PubMed  Google Scholar 

  75. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

    PubMed  PubMed Central  Google Scholar 

  76. Blondel, V. D. et al. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).

    Google Scholar 

  77. Levine, J. H. et al. Data-Driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  78. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).

    PubMed  PubMed Central  Google Scholar 

  79. Dixit, A. Correcting chimeric crosstalk in single cell RNA-seq experiments. Preprint at bioRxiv https://doi.org/10.1101/093237 (2016).

  80. Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA 100, 9440–9445 (2003).

    CAS  PubMed  PubMed Central  Google Scholar 

  81. Rodriguez, J. M. et al. APPRIS 2017: principal isoforms for multiple gene sets. Nucleic Acids Res. 46, D213–D217 (2018).

    CAS  PubMed  Google Scholar 

  82. Grau, J., Grosse, I. & Keilwagen, J. PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics 31, 2595–2597 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  83. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

    Google Scholar 

Download references

Acknowledgements

We thank L. Gaffney for help with figure graphics preparation. We thank A. Wu for sharing code for cell cycle analyses and A. Rubin, G. Eraslan, B. Cleary, E. Torlai-Triglia, D. Silverbush, K. Geiger-Schuller, H. Chung and K. Gosik for helpful discussions. We acknowledge the American Association for Cancer Research (AACR) and its financial and material support in the development of the AACR Project GENIE registry, as well as members of the consortium for their commitment to data sharing. Interpretations are the responsibility of study authors. L.J.-A. is a Chan Zuckerberg Biohub Investigator and holds a Career Award at the Scientific Interface from Burroughs Wellcome Fund. L.J.-A. was a Cancer Research Institute (CRI) Irvington Fellow supported by the CRI, and a fellow of the Eric and Wendy Schmidt Postdoctoral program. P.I.T. was supported by a National Institutes of Health (NIH) F32 fellowship F32AI138458. Work was supported by NIH grant no. U01 CA176058 (J.S.B., W.C.H.), a Broadnext10 internal award (J.S.B), a Broad Variant-to-Function (V2F) award (J.T.N.), a Mark Foundation ASPIRE award (W.C.H., J.T.N.) and the Klarman Cell Observatory, HHMI and NHGRI CEGS (all to A.R.). A.H.B. was supported in part by the NIH/NCI (grant nos. R00 CA197762 and R37 CA252050).

Author information

Authors and Affiliations

Authors

Contributions

J.S.B., A.H.B., J.T.N. and A.R. worked on the conceptualization. O.U. wrote the software. O.U. and L.J-A. conducted the formal analysis. E.S., P.I.T., L.N., D.D., C.D., J.B., M.M.M., C.F., A.L., M.M., S.H.L., O.R.-R. and J.T.N. carried out the investigation. O.U., J.T.N., A.R. and J.S.B. wrote the original draft. O.U., J.T.N., J.S.B., A.R., W.C.H., A.O.G., A.J.A., J.B., E.S., P.I.T. and A.H.B. reviewed and edited the paper. J.T.N., O.R-R., A.R. and J.S.B. supervised. J.T.N., W.C.H., O.R-R., A.R. and J.S.B. acquired the funding.

Corresponding authors

Correspondence to Aviv Regev or Jesse S. Boehm.

Ethics declarations

Competing interests

A.R. is a founder and equity holder of Celsius Therapeutics, an equity holder in Immunitas Therapeutics and until 31 July 2020 was an SAB member of Syros Pharmaceuticals, Neogene Therapeutics, Asimov and ThermoFisher Scientific. From 1 August 2020, A.R. is an employee of Genentech. W.C.H. is a consultant for ThermoFisher, Solvasta Ventures, MPM Capital, KSQ Therapeutics, iTeos, Tyra Biosciences, Frontier Medicine, Jubilant Therapeutics and Paraxel. A.O.G. is a shareholder of 10X Genomics. From 19 October 2020, O.R.-R. is an employee of Genentech. From 22 March 2021, P.I.T. is an employee of Genentech. From 24 May 2021, O.U. is an employee of Genentech. A.R., P.I.T., L.J.-A., J.T.N, J.B. and O.U. are named coinventors on a patent application related to sc-eVIP (US20200283843A1). The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Quality control, scoring and clustering for TP53 Perturb-Seq experiment.

a. Distribution of lengths (in kb) of coding regions for 309 actionable genes from the Foundation Medicine Panel. The principal transcript as defined by the APPRIS database was used for each gene. b. Variant representation in the library. Number of barcode reads (y axis) for each tested variant (x axis), either after transduction and 2-day puromycin selection (‘no recovery’), or 42.5 hours after puromycin selection (‘42.5h recovery’). We refer to variants that are present in the pooled library as passing quality control in the main text. c-g. Quality control metrics. c. Cumulative distribution function (CDF) of number of cells (x axis) profiled for each variant, considering either all cells (gray) or only cells with a single variant (black). d. Distribution of the number of variants detected per cell. e. Distribution of the number of variant barcode (vbc) UMIs per cell per variant. f. The number of cells detected per variant (y axis) and the variant’s barcode expression (x axis, transcripts per 10,000 UMIs/cell (TP10K)) for cells with a single variant, colored by class (black: WT-like, light blue: Impactful I, dark blue: Impactful II). g. Distribution of mean variant barcode expression (TP10K, x axis). Variants with a fold change higher than 1.5 compared to the WT barcode are colored by variant class. h. sc-eVIP scores are independent of variant expression. Variant expression (y axis, TP10K) for variant (dots) with different sc-eVIP scores (x axis). i. Sensitivity for identifying impactful variants at an FDR of 5% (y axis), as a function of the number of principal components used (colors) and as a function of the number of cells per variant (x axis). Boxplots represent the results of 10 subsampling iterations, and show the median, with their ends representing the 25% and 75% quartiles, and with whiskers extending between (25% quartile - 1.5 interquartile range) and (75% quartile + 1.5 interquartile range) or the most extreme values in the data, if they fall within this range. j. Impact of number of cells on sc-eVIP scores. sc-eVIP scores (y axis) for each variant (x axis) computed with varying numbers of subsampled cells (rows), colored by variant class. Mean scores (center) and 1 standard deviation (error bars) are based on 10 subsampling iterations, and are colored by variant classs. k. Low-dimensional embedding of mean expression profiles of variants (dots), colored by variant class (left), or by Louvain clustering (right).

Extended Data Fig. 2 Gene programs impacted by TP53 variants.

a-d. Gene programs impacted by variant classes. a. UMAP embedding of single-cell profiles (dots), colored by program scores (color bar) and labeled by selected Gene Ontology biological processes enriched in genes from each program (top). b. CDF for program scores (x axis) for each variant, colored by class. c. Average expression (z-score, color bar) in cells of each variant (columns) of genes (rows) most correlated with the mean of the expression program. d. Difference (dot color) in mean expression of each gene program (rows) between the cells in each cluster and all other cells, and the significance of this difference (dot size, -log10(adj. p-value, Benjamini-Hochberg procedure), Kolmogorov-Smirnov two-sided test, Methods). Colored border: BH FDR<10%. e,f. Principal component analysis. e. UMAP embedding of single-cell profiles, colored by principal component (PC) scores (color bar), for each of the first 10 PCs. f. CDFs for the PC scores (x axis) for the cells of each variant, colored by class. g. ROC curve of the true positive (y axis) and false positive (x axis) rate when using each PC (color) to distinguish between single cells with synonymous variants and those with variants in hotspot positions 175, 248, and 273. Color legend: Area Under the ROC curve (AUROC) for each variant.

Extended Data Fig. 3 Analysis with stringent thresholding of variant barcodes per cell for TP53 variants.

a.Distribution of normalized variant expression (x axis). The vertical red line represents a stringent threshold for detecting variant barcodes per cell, whose results are investigated in this figure. b. Cumulative distribution function (CDF) of number of cells (x axis) profiled for each variant after using the threshold in a, considering either all cells (gray) or only cells with a single variant (black). c. Distribution of the number of variants detected per cell. d. Low-dimensional embedding of mean expression profiles of variants (dots), colored by variant class. e. sc-eVIP scores from the full dataset (x axis) versus computed on the thresholded dataset (y axis), colored by variant class. f. Sensitivity of sc-eVIP scores (y axis) for identifying impactful variants, at an FDR of 5%, as a function of the number of subsampled cells per variant (x axis) (black: all variants, light blue: Impactful I variants, dark blue: Impactful II variants). Mean sensitivity (lines) and 95% confidence intervals (error shade) are based on 10 different subsampling iterations. g. Top: Hierarchical clustering of variants by their correlation profiles, for the thresholded dataset. Bottom: average expression profile of all variable genes (rows) in each variant (columns), grouped into 8 gene programs (row colors). Program 1, higher in assigned vs. unassigned cells was enriched for translation, nonsense-mediated decay, and viral transcription, and may reflect the response to lentiviral transduction.

Extended Data Fig. 4 Comparison of sc-eVIP with cellular phenotyping assays and cell cycle effects for TP53 variants.

a-c. sc-eVIP impact scores and gene programs agree with functional growth assays under Nutlin-3 treatment in a TP53 wildtype background (top) or a TP53 null background (middle) and under etoposide in a TP53 null background (bottom). a. Growth assay score (y axis) and sc-eVIP score (x axis), for each variant (dots), colored by variant class. b. Growth assay score (y axis) and normalized variant expression (x axis, transcripts per 10,000 UMIs/cell (TP10K)) for each variant (dots), colored by variant class. c. The Spearman correlation (x axis) between the growth assay score for each variant and mean gene expression across the variants for the genes (y axis) whose expression is most strongly correlated with the growth assays score. d. The Spearman correlation (x axis) between mean gene program scores (y axis) and growth assay scores across variants. e. The Spearman correlation (x axis) between the proportion of cells from each variant in each cluster and the growth assays under Nutlin-3 treatment in a TP53 wildtype background (top) or a TP53 null background (middle) and under etoposide in a TP53 null background (bottom). f. Heatmap of the Spearman correlation between TP53 growth assays and sc-eVIP scores. g. Proportion of cells in each cell cycle phase (y axis) among cells carrying each of 99 variants (dots, n=median 926 cells/variant) across variant classes (x axis). P-values for a two-sided t-test have been adjusted using the Benjamini-Hochberg procedure.

Extended Data Fig. 5 Quality control, scoring and clustering for KRAS Perturb-Seq experiment.

a. Variant representation in the library. Number of barcode reads (y axis) for each tested variant (x axis), either after transduction and 2-day puromycin selection (‘no recovery’), or 42.5 hours after puromycin selection (‘42.5h recovery’). b-f. Quality control metrics. b. Cumulative distribution function (CDF) of number of cells (x axis) profiled for each variant, considering either all cells (gray) or only cells with a single variant (black). c. Distribution of the number of variants detected per cell. d. Distribution of the number of variant barcode (vbc) UMIs per cell per variant. e. The number of cells detected per variant (y axis) and the variant’s barcode expression (x axis, transcripts per 10,000 UMIs/cell (TP10K)) for cells with a single variant, colored by class. f. Distribution of mean variant barcode expression (TP10K, x axis). Variants with a fold change higher than 1.5 compared to the WT barcode are colored by variant class. g. sc-eVIP scores are independent of variant expression. Variant expression (y axis, TP10K) for variants (dots) with different sc-eVIP scores (x axis). h. Sensitivity for identifying impactful variants at an FDR of 5%, as a function of the number of principal components used (colors) and as a function of the number of cells per variant (x axis). Boxplots represent the results of 10 subsampling iterations, and show the median, with their ends representing the 25% and 75% quartiles, with whiskers extending between (25% quartile - 1.5 interquartile range) and (75% quartile + 1.5 interquartile range) or the most extreme values in the data, if they fall within this range. i. Impact of number of cells on sc-eVIP scores. sc-eVIP scores (y axis) for each variant (x axis) computed with varying numbers of subsampled cells (rows), colored by variant class. Mean scores (center) and 1 standard deviation (error bars) are based on 10 subsampling iterations, and are colored by variant classs. j. Low-dimensional embedding of mean expression profiles of variants (dots), colored by variant class (left), or by Louvain clustering (right). k. Sensitivity of sc-eVIP scores (y axis) for identifying impactful variants, at an FDR of 5%, as a function of the number of subsampled cells per variant (x axis) (magenta: all variants, green: Impactful I, purple: Impactful II, gold: Impactful III, red: Impactful IV (gain-of-function)). Mean sensitivity (lines) and a 95% confidence interval (shade) are based on 10 different subsampling iterations.

Extended Data Fig. 6 Analysis with stringent thresholding of variant barcodes per cell for KRAS variants.

a.Distribution of normalized variant expression (x axis, transcripts per 10,000 UMIs/cell (TP10K)). The vertical red line represents a stringent threshold for detecting variant barcodes per cell, whose results are investigated in this figure. b. Cumulative distribution function (CDF) of number of cells (x axis) profiled for each variant after using the threshold in a, considering either all cells (gray) or only cells with a single variant (black). c. Distribution of the number of variants detected per cell. d. Low-dimensional embedding of mean expression profiles of variants (dots), colored by variant class, as determined in Fig. 3a. e. sc-eVIP scores from the full dataset (x axis) versus computed on the thresholded dataset (y axis), colored by variant class. f. Sensitivity of sc-eVIP scores (y axis) for identifying impactful variants, at an FDR of 5%, as a function of the number of subsampled cells per variant (x axis) (black: all variants, green: Impactful I, purple: Impactful II, gold: Impactful III, red: Impactful IV (gain-of-function)). Mean sensitivity (lines) and 95% confidence intervals (error shade) are based on 10 different subsampling iterations. g. Top: Hierarchical clustering of variants by their correlation profiles, for the thresholded dataset. Bottom: average expression profile of all variable genes (rows) in each variant (columns), grouped into 12 gene programs (row colors). Program 8, higher in assigned vs. unassigned cells was enriched for translation, nonsense-mediated decay, and viral transcription, and may reflect the response to lentiviral transduction.

Extended Data Fig. 7 Gene programs, and cell cycle changes impacted by KRAS variants.

a-d. Gene programs impacted by variant classes. a-b. UMAP embedding of single-cell profiles (b), colored by program scores (color bar) and labeled by selected Gene Ontology biological processes enriched in genes from each program (a). c. CDF for program scores (x axis) for each variant, colored by class. d. Average expression (z-score, color bar) in cells of each variant (columns) of genes (rows) most correlated with the mean of the expression program. e. Difference (dot color) in mean expression of each gene program (rows) between the cells in each cluster (columns, as in Fig. 4e) and all other cells, and the significance of this difference (dot size, -log10(adj. p-value, Benjamini-Hochberg procedure), Kolmogorov-Smirnov test, Methods). Colored border: BH FDR<10%. f. ROC curve of the true positive (y axis) and false positive (x axis) rate when using each PC (color) to distinguish between single cells with synonymous variants and those with variants in hotspot positions 12, 13 and 61. Color legend: Area Under the ROC curve (AUROC) for each variant. g,h. Principal component analysis. g. UMAP embedding of single-cell profiles, colored by principal component (PC) scores (color bar), for each of the first 10 PCs. h. CDFs for the PC scores (x axis) for the cells of each variant, colored by class. i. Variant expression (y axis) as a function of PC 3 scores (x axis), for each variant (dots). Variants are colored by variant class. j. Proportion of cells in each cell cycle phase (y axis) among cells carrying each of 98 variants (dots, n=median 1058 cells/variant) across variant classes (x axis). P-values for a two-sided t-test have been adjusted using the Benjamini-Hochberg procedure.

Extended Data Fig. 8 Additional quality controls.

a,b. Spearman correlation between sc-eVIP scores and genes most highly correlated with sc-eVIP scores (x axis) for TP53 (a) and KRAS (b). c. Significance (y axis) of two-sided t-tests comparing the expression of each gene (dots) in bulk RNA-seq samples with WT KRAS and either G12C (left) or G12V (right). The genes are grouped by their gene programs (x axis) as defined in our current single-cell study. Values represent signed -log10 (p-values), adjusted for multiple testing using the Benjamini-Hochberg procedure.

Extended Data Fig. 9 Variant-by-variant detailed representation of all analyses for TP53 variants.

a. Variant features. Number of cells (y axis, top), distribution of normalized variant barcode expression (y axis, middle; red: variants with a fold-change greater than 1.5) and sc-eVIP scores (y axis, bottom; black: significant scores) for each variant (x axis), ordered as in Fig. 2d. Gray font: controls (synonymous and ExAC), blue font: hotspot variants (positions 175, 248, 273). b. Agreement with other data features. Top: difference (dot color) in mean expression or signature score between a variant (columns, ordered as in Fig. 2d) and unassigned cells and the significance of this difference (dot size, -log10(adj. p-value, Benjamini-Hochberg procedure), Kolmogorov-Smirnov two-sided test, Methods) for each of two genes canonically induced by TP53 and two TP53-associated signatures (rows). Colored border: BH FDR<10%. Middle: Growth (z-score, color bar) in three functional assays (rows) of each variant (columns). Bottom: Mutation prevalence (log2(counts+1) of variant occurrences) in two datasets (rows) of each variant, ordered as in Fig. 2d. c. Gene programs association with variants. Top: Difference (dot color) in mean program score (top) or mean PC score (bottom) between a variant (columns) and WT overexpressing cells and the significance of this difference (dot size, -log10(adj.p-value, Benjamini-Hochberg procedure), Kolmogorov-Smirnov two-sided test, Methods) for each gene program (top, rows, Methods), or each of the top 10 PCs (bottom, rows). Colored border: BH FDR<10%. d,e. Relation of variants to different clusters and cell cycle phases. Left: Proportion of cells (bar height) in each cell cluster (d) or cell cycle phase (e) (rows) derived for each variant (columns), annotated at the top with significance from a chi-square test comparing the cell state distribution of each variant with that of WT overexpressing cells (-log10(adj. p-value, Benjamini-Hochberg procedure)). Right: UMAP embedding of single-cell profiles, colored by cell clusters (d) or cell cycle phase (e). f. Relation of variants to the TP53 protein structure. sc-eVIP scores (y axis) of each variant (dot, colored by the variant class) and its position along the TP53 gene (x axis, annotated by domain). g. Variant induced shift in cell distributions. Density map of cell profiles in a UMAP embedding, comparing the density of cells overexpressing variants in each of 3 classes to either cells with variants in the WT-like class (grey, top) or unassigned cells (purple, bottom).

Extended Data Fig. 10 Variant-by-variant detailed representation of all analyses for KRAS variants.

a. Variant features. Number of cells (y axis, top), distribution of normalized variant barcode expression (y axis, middle; red: variants with a fold-change greater than 1.5) and sc-eVIP scores (y axis, bottom; black: significant scores) for each variant (x axis), ordered as in Fig. 3a. Grey font: controls (synonymous and ExAC), red font: hotspot variants (positions 12, 13 and 61). b. Agreement with other data features. Top: Dependence of cell line growth on KRAS (y axis), for cell lines (dots) categorized by their KRAS genotype status (x axis). Gray: wildtype KRAS, red: missense KRAS variants. For KRAS-WT cell lines, the boxplot is based on n=660 cell lines, and shows the median, the 25% and 75% quartiles, additional 1.5 interquartile ranges and the most extreme values in the data. Middle: Growth in low attachment of HA1E cells (z-score, color bar), or GILA score, for each variant (columns, ordered as in Fig. 3a) at 7 and 14 days. Bottom: Mutation prevalence (log2(counts+1) of variant occurrences) in the COSMIC database (top) and a pan-cancer curated set (bottom), for each variant. c. Gene programs association with variants. Top: Difference (dot color) in mean program score (top) or mean PC score (bottom) between a variant (columns) and WT overexpressing cells and the significance of this difference (dot size, -log10(adj. p-value, Benjamini-Hochberg procedure), two-sided Kolmogorov-Smirnov test, Methods) for each gene program (top, rows, by clustering genes, Methods), or each of the top 10 PCs (bottom, rows). Colored border: BH FDR<10%. d,e. Relation of variants to different clusters and cell cycle phases. Left: Proportion of cells (bar height) in each cell cluster (d) or cell cycle phase (e) (rows) derived for each variant (columns), annotated at the top with significance from a chi-square test comparing the cell state distribution of each variant with that of WT overexpressing cells (-log10(adj. p-value, Benjamini-Hochberg procedure)). Right: UMAP embedding of single-cell profiles, colored by cell clusters (d) or cell cycle phase (e). f. Relation of variants to KRAS protein structure. sc-eVIP scores (y axis) of each variant (dot, colored by the variant class) and its position along the KRAS gene (x axis, annotated by domain). g. Variant induced shift in cell distributions. Density map of cell profiles in a UMAP embedding, comparing the density of cells overexpressing variants in each of 3 classes to either cells overexpressing variants in the WT-like group (grey, top) or unassigned cells (purple, bottom).

Supplementary information

Supplementary Information

Supplementary Information and Fig. 1.

Reporting Summary.

Supplementary Table 1

Tab TP53. Properties of TP53 variants. The columns represent the name of the variant (“Variant”), the gene in which the variant was engineered (“Gene”), the position of the variant base in the ORF sequence (“Position: base(s)”, comma-delimited if multiple bases have been changed to produce a given variant), the WT base(s) in the ORF (“From ORF base(s)”), the variant base(s) in the ORF at the position(s) given in “Position: base(s)” (“To ORF base(s)”), the position of the amino acid(s) affected by the variant (“Position: amino acid(s)”, comma-delimited if multiple amino acids are changed), the WT amino acid at the provided position(s) (“From amino acid(s)”), the variant amino acid(s) at the provided position(s) (“To amino acid(s)”), the WT codon(s) at the provided position(s) (“From ORF codon(s)”), and the variant codon(s) at the same position(s) (“To ORF codon(s)”), whether the variant involves a single or multiple base change (“Mutation type”), whether the variant is a synonymous control, ExAC or unknown (“Control status”), the associated variant barcode (“Variant barcode”), whether the variant passed quality control and is in the library (“Library synthesis”), the number of cells per variant (“Cells/variant”), the average expression of the variant barcode in UMIs per 10,000 UMIs (“Normalized variant barcode counts (TP10K)”), Hotelling’s T2 statistic representing the sc-eVIP score (“HotellingT2”), the q value (“HotellingT2.q”), the functional class assigned to the variant (“Variant functional class”), the variant prevalence in the pan-cancer dataset (“Count(pancan)”) and the variant prevalence in ExAC (“Count (ExAC)”), the frequency of the variant in cancer cohorts (“Count (IARC)”), growth z-scores from functional assays (“Nutlin-3, TP53 WT growth (z score)”, “Nutlin-3, TP53 null growth (z score)”, and “Etoposide, TP53 null growth (z score)”) and the sequence used in the experiment (“Insert sequence”).

Tab KRAS. Properties of KRAS variants. The columns represent the name of the variant (“Variant”), the gene in which the variant was engineered (“Gene”), the position of the variant base in the ORF sequence (“Position: base(s)”, comma-delimited if multiple bases have been changed to produce a given variant), the WT base(s) in the ORF (“From ORF base(s)”), the variant base(s) in the ORF at the position(s) given in “Position: base(s)” (“To ORF base(s)”), the position of the amino acid(s) affected by the variant (“Position: amino acid(s)”, comma-delimited if multiple amino acids are changed), the WT amino acid at the provided position(s) (“From amino acid(s)”), the variant amino acid(s) at the provided position(s) (“To amino acid(s)”), the WT codon(s) at the provided position(s) (“From ORF codon(s)”), and the variant codon(s) at the same position(s) (“To ORF codon(s)”), whether the variant involves a single or multiple base change (“Mutation type”), whether the variant is a synonymous control, ExAC or unknown (“Control status”), whether the variant passed quality control and is in the library (“Library synthesis”), the associated variant barcode (“Variant barcode”), the number of cells per variant (“Cells/variant”), the average expression of the variant barcode in UMIs per 10,000 UMIs (“Normalized variant barcode counts (TP10K)”), Hotelling’s T2 statistic representing the sc-eVIP score (“HotellingT2”), the q value (“HotellingT2.q”), the functional class assigned to the variant (“Variant functional class”), the variant prevalence in the pan-cancer dataset (“Count(pancan)”), the variant prevalence in ExAC (“Count (ExAC)”), the variant frequency in cancer cohorts (“Count (COSMIC)”) and the sequence used in the experiment (“Insert sequence”).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ursu, O., Neal, J.T., Shea, E. et al. Massively parallel phenotyping of coding variants in cancer with Perturb-seq. Nat Biotechnol 40, 896–905 (2022). https://doi.org/10.1038/s41587-021-01160-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41587-021-01160-7

This article is cited by

Search

Quick links

Nature Briefing: Cancer

Sign up for the Nature Briefing: Cancer newsletter — what matters in cancer research, free to your inbox weekly.

Get what matters in cancer research, free to your inbox weekly. Sign up for Nature Briefing: Cancer