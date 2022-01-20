
Abstract
Genome sequencing studies have identified millions of somatic variants in cancer, but it remains challenging to predict the phenotypic impact of most. Experimental approaches to distinguish impactful variants often use phenotypic assays that report on predefined gene-specific functional effects in bulk cell populations. Here, we develop an approach to functionally assess variant impact in single cells by pooled Perturb-seq. We measured the impact of 200 TP53 and KRAS variants on RNA profiles in over 300,000 single lung cancer cells, and used the profiles to categorize variants into phenotypic subsets to distinguish gain-of-function, loss-of-function and dominant negative variants, which we validated by comparison with orthogonal assays. We discovered that KRAS variants did not merely fit into discrete functional categories, but spanned a continuum of gain-of-function phenotypes, and that their functional impact could not have been predicted solely by their frequency in patient cohorts. Our work provides a scalable, gene-agnostic method for coding variant impact phenotyping, with potential applications in multiple disease settings.
Data availability
Raw and processed data associated with this work are publicly accessible on Gene Expression Omnibus with accession numbers GSE161824 (single-cell RNA-seq data, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE161824) and GSE183670 (bulk RNA-seq data, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE183670). In addition, the previously published datasets and databases used in this study are: cBioPortal (http://www.cbioportal.org/), GenBank (ref. 74; https://www.ncbi.nlm.nih.gov/genbank/), the Foundation Medicine Panel (ref. 39; https://assets.ctfassets.net/w98cd481qyp0/YqqKHaqQmFeqc5ueQk48w/c35460768c3a76ef738dcf88f8219524/F1CDx_Tech_Specs_072021.pdf), the APPRIS database (ref. 81; https://apprisws.bioinfo.cnio.es/landing_page/) and GILA data from http://www.targetkras.com/.
Code availability
All analyses can be recapitulated with the Jupyter notebooks at http://github.com/klarman-cell-observatory/sc_eVIP, together with the Perturb-seq library at http://github.com/klarman-cell-observatory/perturbseq. In addition, we used the following software packages: Cellranger 2.1.1 (ref. 73), Bowtie 1.2.2 (ref. 78), R packages DNAbarcodes 0.1 (ref. 42) and PRROC 1.3.1 (ref. 82), and Python packages Scanpy 1.7.2 (ref. 75), Sklearn 0.24.1 (ref. 83) and Mimosca (https://github.com/asncd/MIMOSCA).
Acknowledgements
We thank L. Gaffney for help with figure graphics preparation. We thank A. Wu for sharing code for cell cycle analyses and A. Rubin, G. Eraslan, B. Cleary, E. Torlai-Triglia, D. Silverbush, K. Geiger-Schuller, H. Chung and K. Gosik for helpful discussions. We acknowledge the American Association for Cancer Research (AACR) and its financial and material support in the development of the AACR Project GENIE registry, as well as members of the consortium for their commitment to data sharing. Interpretations are the responsibility of study authors. L.J.-A. is a Chan Zuckerberg Biohub Investigator and holds a Career Award at the Scientific Interface from Burroughs Wellcome Fund. L.J.-A. was a Cancer Research Institute (CRI) Irvington Fellow supported by the CRI, and a fellow of the Eric and Wendy Schmidt Postdoctoral program. P.I.T. was supported by a National Institutes of Health (NIH) F32 fellowship F32AI138458. Work was supported by NIH grant no. U01 CA176058 (J.S.B., W.C.H.), a Broadnext10 internal award (J.S.B.), a Broad Variant-to-Function (V2F) award (J.T.N.), a Mark Foundation ASPIRE award (W.C.H., J.T.N.) and the Klarman Cell Observatory, HHMI and NHGRI CEGS (all to A.R.). A.H.B. was supported in part by the NIH/NCI (grant nos. R00 CA197762 and R37 CA252050).
Ethics declarations
Competing interests
A.R. is a founder and equity holder of Celsius Therapeutics, an equity holder in Immunitas Therapeutics and until 31 July 2020 was an SAB member of Syros Pharmaceuticals, Neogene Therapeutics, Asimov and ThermoFisher Scientific. From 1 August 2020, A.R. is an employee of Genentech. W.C.H. is a consultant for ThermoFisher, Solvasta Ventures, MPM Capital, KSQ Therapeutics, iTeos, Tyra Biosciences, Frontier Medicine, Jubilant Therapeutics and Paraxel. A.O.G. is a shareholder of 10X Genomics. From 19 October 2020, O.R.-R. is an employee of Genentech. From 22 March 2021, P.I.T. is an employee of Genentech. From 24 May 2021, O.U. is an employee of Genentech. A.R., P.I.T., L.J.-A., J.T.N., J.B. and O.U. are named coinventors on a patent application related to sc-eVIP (US20200283843A1). The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Quality control, scoring and clustering for TP53 Perturb-Seq experiment.
a. Distribution of lengths (in kb) of coding regions for 309 actionable genes from the Foundation Medicine Panel. The principal transcript, as defined by the APPRIS database, was used for each gene. b. Variant representation in the library. Number of barcode reads (y axis) for each tested variant (x axis), either after transduction and 2-day puromycin selection (‘no recovery’), or 42.5 hours after puromycin selection (‘42.5h recovery’). We refer to variants that are present in the pooled library as passing quality control in the main text. c-g. Quality control metrics. c. Cumulative distribution function (CDF) of number of cells (x axis) profiled for each variant, considering either all cells (gray) or only cells with a single variant (black). d. Distribution of the number of variants detected per cell. e. Distribution of the number of variant barcode (vbc) UMIs per cell per variant. f. The number of cells detected per variant (y axis) and the variant’s barcode expression (x axis, transcripts per 10,000 UMIs/cell (TP10K)) for cells with a single variant, colored by class (black: WT-like, light blue: Impactful I, dark blue: Impactful II). g. Distribution of mean variant barcode expression (TP10K, x axis). Variants with a fold change higher than 1.5 compared to the WT barcode are colored by variant class. h. sc-eVIP scores are independent of variant expression. Variant expression (y axis, TP10K) for variants (dots) with different sc-eVIP scores (x axis). i. Sensitivity for identifying impactful variants at an FDR of 5% (y axis), as a function of the number of principal components used (colors) and as a function of the number of cells per variant (x axis). Boxplots represent the results of 10 subsampling iterations, and show the median, with their ends representing the 25% and 75% quartiles, and with whiskers extending between (25% quartile - 1.5 interquartile range) and (75% quartile + 1.5 interquartile range) or the most extreme values in the data, if they fall within this range. j. Impact of number of cells on sc-eVIP scores. sc-eVIP scores (y axis) for each variant (x axis) computed with varying numbers of subsampled cells (rows), colored by variant class. Mean scores (center) and 1 standard deviation (error bars) are based on 10 subsampling iterations, and are colored by variant class. k. Low-dimensional embedding of mean expression profiles of variants (dots), colored by variant class (left), or by Louvain clustering (right).
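The boxplot convention described in panel i can be sketched in a few lines of Python. This is an illustration only, not the authors' plotting code: the function name `boxplot_stats` is hypothetical, and the quartile interpolation used here (the standard library's "inclusive" method) is an assumption, since the legend does not state which quartile definition was used.

```python
import statistics


def boxplot_stats(values):
    """Summarize values with the box-and-whisker rule described in the legend.

    Assumption: quartiles use the 'inclusive' interpolation method; other
    definitions give slightly different box edges on small samples.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }
```

With ten subsampling iterations per box, as in this figure, a single deviating iteration can fall outside the fences and would then be reported as an outlier rather than stretching the whisker.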
Extended Data Fig. 2 Gene programs impacted by TP53 variants.
a-d. Gene programs impacted by variant classes. a. UMAP embedding of single-cell profiles (dots), colored by program scores (color bar) and labeled by selected Gene Ontology biological processes enriched in genes from each program (top). b. CDF for program scores (x axis) for each variant, colored by class. c. Average expression (z-score, color bar) in cells of each variant (columns) of genes (rows) most correlated with the mean of the expression program. d. Difference (dot color) in mean expression of each gene program (rows) between the cells in each cluster and all other cells, and the significance of this difference (dot size, -log10(adj. p-value, Benjamini-Hochberg procedure), Kolmogorov-Smirnov two-sided test, Methods). Colored border: BH FDR<10%. e,f. Principal component analysis. e. UMAP embedding of single-cell profiles, colored by principal component (PC) scores (color bar), for each of the first 10 PCs. f. CDFs for the PC scores (x axis) for the cells of each variant, colored by class. g. ROC curve of the true positive (y axis) and false positive (x axis) rate when using each PC (color) to distinguish between single cells with synonymous variants and those with variants in hotspot positions 175, 248, and 273. Color legend: Area Under the ROC curve (AUROC) for each variant.
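The dot sizes and colored borders in panel d rely on Benjamini-Hochberg-adjusted p-values. A minimal sketch of that adjustment is given below, assuming a plain list of raw p-values (for example, one per gene program and cluster); the function names are hypothetical and this is not taken from the authors' notebooks.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def dot_size(adjusted_p):
    """-log10 of an adjusted p-value, the quantity encoded as dot size."""
    return -math.log10(adjusted_p)


def passes_fdr(adjusted_p, threshold=0.10):
    """True when a test would receive a colored border (BH FDR < 10%)."""
    return adjusted_p < threshold
```

For example, `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` adjusts the four raw values jointly, so each adjusted value depends on how many tests were run, not only on its own raw p-value.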
Extended Data Fig. 3 Analysis with stringent thresholding of variant barcodes per cell for TP53 variants.
a. Distribution of normalized variant expression (x axis). The vertical red line represents a stringent threshold for detecting variant barcodes per cell, whose results are investigated in this figure. b. Cumulative distribution function (CDF) of number of cells (x axis) profiled for each variant after using the threshold in a, considering either all cells (gray) or only cells with a single variant (black). c. Distribution of the number of variants detected per cell. d. Low-dimensional embedding of mean expression profiles of variants (dots), colored by variant class. e. sc-eVIP scores from the full dataset (x axis) versus computed on the thresholded dataset (y axis), colored by variant class. f. Sensitivity of sc-eVIP scores (y axis) for identifying impactful variants, at an FDR of 5%, as a function of the number of subsampled cells per variant (x axis) (black: all variants, light blue: Impactful I variants, dark blue: Impactful II variants). Mean sensitivity (lines) and 95% confidence intervals (error shade) are based on 10 different subsampling iterations. g. Top: Hierarchical clustering of variants by their correlation profiles, for the thresholded dataset. Bottom: average expression profile of all variable genes (rows) in each variant (columns), grouped into 8 gene programs (row colors). Program 1, higher in assigned vs. unassigned cells, was enriched for translation, nonsense-mediated decay, and viral transcription, and may reflect the response to lentiviral transduction.
Extended Data Fig. 4 Comparison of sc-eVIP with cellular phenotyping assays and cell cycle effects for TP53 variants.
a-c. sc-eVIP impact scores and gene programs agree with functional growth assays under Nutlin-3 treatment in a TP53 wildtype background (top) or a TP53 null background (middle) and under etoposide in a TP53 null background (bottom). a. Growth assay score (y axis) and sc-eVIP score (x axis), for each variant (dots), colored by variant class. b. Growth assay score (y axis) and normalized variant expression (x axis, transcripts per 10,000 UMIs/cell (TP10K)) for each variant (dots), colored by variant class. c. The Spearman correlation (x axis) between the growth assay score for each variant and mean gene expression across the variants for the genes (y axis) whose expression is most strongly correlated with the growth assay score. d. The Spearman correlation (x axis) between mean gene program scores (y axis) and growth assay scores across variants. e. The Spearman correlation (x axis) between the proportion of cells from each variant in each cluster and the growth assays under Nutlin-3 treatment in a TP53 wildtype background (top) or a TP53 null background (middle) and under etoposide in a TP53 null background (bottom). f. Heatmap of the Spearman correlation between TP53 growth assays and sc-eVIP scores. g. Proportion of cells in each cell cycle phase (y axis) among cells carrying each of 99 variants (dots, n=median 926 cells/variant) across variant classes (x axis). P-values for a two-sided t-test have been adjusted using the Benjamini-Hochberg procedure.
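Panels c-f report Spearman correlations between per-variant quantities, such as growth assay scores and sc-eVIP scores. The sketch below shows the computation in dependency-free Python, assuming two equal-length lists with one value per variant. The function names are hypothetical, and treating ties by average ranks is an assumption about the convention used.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, the result is unchanged by any monotone rescaling of either score, which is convenient when comparing assays measured on different scales.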
Extended Data Fig. 5 Quality control, scoring and clustering for KRAS Perturb-Seq experiment.
a. Variant representation in the library. Number of barcode reads (y axis) for each tested variant (x axis), either after transduction and 2-day puromycin selection (‘no recovery’), or 42.5 hours after puromycin selection (‘42.5h recovery’). b-f. Quality control metrics. b. Cumulative distribution function (CDF) of number of cells (x axis) profiled for each variant, considering either all cells (gray) or only cells with a single variant (black). c. Distribution of the number of variants detected per cell. d. Distribution of the number of variant barcode (vbc) UMIs per cell per variant. e. The number of cells detected per variant (y axis) and the variant’s barcode expression (x axis, transcripts per 10,000 UMIs/cell (TP10K)) for cells with a single variant, colored by class. f. Distribution of mean variant barcode expression (TP10K, x axis). Variants with a fold change higher than 1.5 compared to the WT barcode are colored by variant class. g. sc-eVIP scores are independent of variant expression. Variant expression (y axis, TP10K) for variants (dots) with different sc-eVIP scores (x axis). h. Sensitivity for identifying impactful variants at an FDR of 5%, as a function of the number of principal components used (colors) and as a function of the number of cells per variant (x axis). Boxplots represent the results of 10 subsampling iterations, and show the median, with their ends representing the 25% and 75% quartiles, with whiskers extending between (25% quartile - 1.5 interquartile range) and (75% quartile + 1.5 interquartile range) or the most extreme values in the data, if they fall within this range. i. Impact of number of cells on sc-eVIP scores. sc-eVIP scores (y axis) for each variant (x axis) computed with varying numbers of subsampled cells (rows), colored by variant class. Mean scores (center) and 1 standard deviation (error bars) are based on 10 subsampling iterations, and are colored by variant class. j. Low-dimensional embedding of mean expression profiles of variants (dots), colored by variant class (left), or by Louvain clustering (right). k. Sensitivity of sc-eVIP scores (y axis) for identifying impactful variants, at an FDR of 5%, as a function of the number of subsampled cells per variant (x axis) (magenta: all variants, green: Impactful I, purple: Impactful II, gold: Impactful III, red: Impactful IV (gain-of-function)). Mean sensitivity (lines) and a 95% confidence interval (shade) are based on 10 different subsampling iterations.
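The TP10K unit used in panels e-g (transcripts per 10,000 UMIs per cell) and the fold change relative to the WT barcode can be illustrated as follows. This is a sketch under stated assumptions: each cell is represented as a hypothetical (barcode UMIs, total UMIs) pair, and the fold change is taken as a ratio of per-variant mean TP10K values, which the legend does not spell out.

```python
def tp10k(barcode_umis, total_umis):
    """Scale a UMI count to transcripts per 10,000 UMIs in the same cell."""
    if total_umis <= 0:
        raise ValueError("cell has no UMIs")
    return 1e4 * barcode_umis / total_umis


def mean_tp10k(cells):
    """Mean TP10K over cells given as (barcode_umis, total_umis) pairs."""
    values = [tp10k(barcode, total) for barcode, total in cells]
    return sum(values) / len(values)


def fold_change_vs_wt(variant_cells, wt_cells):
    """Ratio of mean variant barcode expression to mean WT barcode expression.

    Assumption: the comparison is made on means of per-cell TP10K values.
    """
    return mean_tp10k(variant_cells) / mean_tp10k(wt_cells)
```

As an example, a cell with 3 barcode UMIs out of 20,000 total UMIs has a barcode expression of 1.5 TP10K; normalizing per cell in this way keeps cells with different sequencing depths comparable.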
Extended Data Fig. 6 Analysis with stringent thresholding of variant barcodes per cell for KRAS variants.
a. Distribution of normalized variant expression (x axis, transcripts per 10,000 UMIs/cell (TP10K)). The vertical red line represents a stringent threshold for detecting variant barcodes per cell, whose results are investigated in this figure. b. Cumulative distribution function (CDF) of number of cells (x axis) profiled for each variant after using the threshold in a, considering either all cells (gray) or only cells with a single variant (black). c. Distribution of the number of variants detected per cell. d. Low-dimensional embedding of mean expression profiles of variants (dots), colored by variant class, as determined in Fig. 3a. e. sc-eVIP scores from the full dataset (x axis) versus computed on the thresholded dataset (y axis), colored by variant class. f. Sensitivity of sc-eVIP scores (y axis) for identifying impactful variants, at an FDR of 5%, as a function of the number of subsampled cells per variant (x axis) (black: all variants, green: Impactful I, purple: Impactful II, gold: Impactful III, red: Impactful IV (gain-of-function)). Mean sensitivity (lines) and 95% confidence intervals (error shade) are based on 10 different subsampling iterations. g. Top: Hierarchical clustering of variants by their correlation profiles, for the thresholded dataset. Bottom: average expression profile of all variable genes (rows) in each variant (columns), grouped into 12 gene programs (row colors). Program 8, higher in assigned vs. unassigned cells, was enriched for translation, nonsense-mediated decay, and viral transcription, and may reflect the response to lentiviral transduction.
Extended Data Fig. 7 Gene programs, and cell cycle changes impacted by KRAS variants.
a-d. Gene programs impacted by variant classes. a-b. UMAP embedding of single-cell profiles (b), colored by program scores (color bar) and labeled by selected Gene Ontology biological processes enriched in genes from each program (a). c. CDF for program scores (x axis) for each variant, colored by class. d. Average expression (z-score, color bar) in cells of each variant (columns) of genes (rows) most correlated with the mean of the expression program. e. Difference (dot color) in mean expression of each gene program (rows) between the cells in each cluster (columns, as in Fig. 4e) and all other cells, and the significance of this difference (dot size, -log10(adj. p-value, Benjamini-Hochberg procedure), Kolmogorov-Smirnov test, Methods). Colored border: BH FDR<10%. f. ROC curve of the true positive (y axis) and false positive (x axis) rate when using each PC (color) to distinguish between single cells with synonymous variants and those with variants in hotspot positions 12, 13 and 61. Color legend: Area Under the ROC curve (AUROC) for each variant. g,h. Principal component analysis. g. UMAP embedding of single-cell profiles, colored by principal component (PC) scores (color bar), for each of the first 10 PCs. h. CDFs for the PC scores (x axis) for the cells of each variant, colored by class. i. Variant expression (y axis) as a function of PC 3 scores (x axis), for each variant (dots). Variants are colored by variant class. j. Proportion of cells in each cell cycle phase (y axis) among cells carrying each of 98 variants (dots, n=median 1058 cells/variant) across variant classes (x axis). P-values for a two-sided t-test have been adjusted using the Benjamini-Hochberg procedure.
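The AUROC values in panel f summarize how well a single PC score separates cells carrying hotspot variants from cells carrying synonymous variants. A minimal sketch of that quantity is shown below, using its pairwise interpretation (the probability that a randomly chosen positive cell scores higher than a randomly chosen negative cell). The function name and the two-list input format are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparisons.

    Equals the fraction of (positive, negative) pairs in which the positive
    cell has the higher score, with ties counted as one half.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

A value of 0.5 corresponds to a PC that carries no information about the hotspot versus synonymous distinction, and values far below 0.5 indicate that the PC separates the groups with the opposite sign.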
Extended Data Fig. 8 Additional quality controls.
a,b. Spearman correlation between sc-eVIP scores and genes most highly correlated with sc-eVIP scores (x axis) for TP53 (a) and KRAS (b). c. Significance (y axis) of two-sided t-tests comparing the expression of each gene (dots) in bulk RNA-seq samples with WT KRAS and either G12C (left) or G12V (right). The genes are grouped by their gene programs (x axis) as defined in our current single-cell study. Values represent signed -log10 (p-values), adjusted for multiple testing using the Benjamini-Hochberg procedure.
Extended Data Fig. 9 Variant-by-variant detailed representation of all analyses for TP53 variants.
a. Variant features. Number of cells (y axis, top), distribution of normalized variant barcode expression (y axis, middle; red: variants with a fold-change greater than 1.5) and sc-eVIP scores (y axis, bottom; black: significant scores) for each variant (x axis), ordered as in Fig. 2d. Gray font: controls (synonymous and ExAC), blue font: hotspot variants (positions 175, 248, 273). b. Agreement with other data features. Top: difference (dot color) in mean expression or signature score between a variant (columns, ordered as in Fig. 2d) and unassigned cells and the significance of this difference (dot size, -log10(adj. p-value, Benjamini-Hochberg procedure), Kolmogorov-Smirnov two-sided test, Methods) for each of two genes canonically induced by TP53 and two TP53-associated signatures (rows). Colored border: BH FDR<10%. Middle: Growth (z-score, color bar) in three functional assays (rows) of each variant (columns). Bottom: Mutation prevalence (log2(counts+1) of variant occurrences) in two datasets (rows) of each variant, ordered as in Fig. 2d. c. Gene programs association with variants. Top: Difference (dot color) in mean program score (top) or mean PC score (bottom) between a variant (columns) and WT overexpressing cells and the significance of this difference (dot size, -log10(adj.p-value, Benjamini-Hochberg procedure), Kolmogorov-Smirnov two-sided test, Methods) for each gene program (top, rows, Methods), or each of the top 10 PCs (bottom, rows). Colored border: BH FDR<10%. d,e. Relation of variants to different clusters and cell cycle phases. Left: Proportion of cells (bar height) in each cell cluster (d) or cell cycle phase (e) (rows) derived for each variant (columns), annotated at the top with significance from a chi-square test comparing the cell state distribution of each variant with that of WT overexpressing cells (-log10(adj. p-value, Benjamini-Hochberg procedure)). 
Right: UMAP embedding of single-cell profiles, colored by cell clusters (d) or cell cycle phase (e). f. Relation of variants to the TP53 protein structure. sc-eVIP scores (y axis) of each variant (dot, colored by the variant class) and its position along the TP53 gene (x axis, annotated by domain). g. Variant-induced shift in cell distributions. Density map of cell profiles in a UMAP embedding, comparing the density of cells overexpressing variants in each of 3 classes to either cells with variants in the WT-like class (grey, top) or unassigned cells (purple, bottom).
Extended Data Fig. 10 Variant-by-variant detailed representation of all analyses for KRAS variants.
a. Variant features. Number of cells (y axis, top), distribution of normalized variant barcode expression (y axis, middle; red: variants with a fold-change greater than 1.5) and sc-eVIP scores (y axis, bottom; black: significant scores) for each variant (x axis), ordered as in Fig. 3a. Grey font: controls (synonymous and ExAC), red font: hotspot variants (positions 12, 13 and 61). b. Agreement with other data features. Top: Dependence of cell line growth on KRAS (y axis), for cell lines (dots) categorized by their KRAS genotype status (x axis). Gray: wildtype KRAS, red: missense KRAS variants. For KRAS-WT cell lines, the boxplot is based on n=660 cell lines, and shows the median, the 25% and 75% quartiles, additional 1.5 interquartile ranges and the most extreme values in the data. Middle: Growth in low attachment of HA1E cells (z-score, color bar), or GILA score, for each variant (columns, ordered as in Fig. 3a) at 7 and 14 days. Bottom: Mutation prevalence (log2(counts+1) of variant occurrences) in the COSMIC database (top) and a pan-cancer curated set (bottom), for each variant. c. Gene programs association with variants. Top: Difference (dot color) in mean program score (top) or mean PC score (bottom) between a variant (columns) and WT overexpressing cells and the significance of this difference (dot size, -log10(adj. p-value, Benjamini-Hochberg procedure), two-sided Kolmogorov-Smirnov test, Methods) for each gene program (top, rows, by clustering genes, Methods), or each of the top 10 PCs (bottom, rows). Colored border: BH FDR<10%. d,e. Relation of variants to different clusters and cell cycle phases. Left: Proportion of cells (bar height) in each cell cluster (d) or cell cycle phase (e) (rows) derived for each variant (columns), annotated at the top with significance from a chi-square test comparing the cell state distribution of each variant with that of WT overexpressing cells (-log10(adj. p-value, Benjamini-Hochberg procedure)). 
Right: UMAP embedding of single-cell profiles, colored by cell clusters (d) or cell cycle phase (e). f. Relation of variants to KRAS protein structure. sc-eVIP scores (y axis) of each variant (dot, colored by the variant class) and its position along the KRAS gene (x axis, annotated by domain). g. Variant-induced shift in cell distributions. Density map of cell profiles in a UMAP embedding, comparing the density of cells overexpressing variants in each of 3 classes to either cells overexpressing variants in the WT-like group (grey, top) or unassigned cells (purple, bottom).
Supplementary information
Supplementary Information
Supplementary Information and Fig. 1.
Supplementary Table 1
Tab TP53. Properties of TP53 variants. The columns represent the name of the variant (Variant), its position in the amino-acid sequence (Position), the original base(s) in the ORF (From), the base(s) the variant produces (To), whether the variant involves a single or multiple base change (Mutation type), whether the variant is a synonymous control, ExAC or unknown (Control status), the associated variant barcode (Variant barcode), whether the variant passed quality control and is in the library (Library synthesis), the number of cells per variant (Cells/variant), the average expression of the variant barcode in UMIs per 10,000 UMIs (Normalized variant barcode counts (TP10K)), Hotelling’s T2 statistic representing the sc-eVIP score (HotellingT2), the q value (HotellingT2.q), the functional class assigned to the variant (Variant functional class), the variant prevalence in the pan-cancer dataset (Count(pancan)) and the variant prevalence in ExAC (Count (ExAC)), the frequency of the variant in cancer cohorts (Count (IARC)), and growth z scores from functional assays (Nutlin-3, TP53 WT growth (z score), Nutlin-3, TP53 null growth (z score), and Etoposide, TP53 null growth (z score)). Tab KRAS. Properties of KRAS variants. 
The columns represent the name of the variant (Variant), its position in the amino-acid sequence (Position), the original base(s) in the ORF (From), the base(s) the variant produces (To), whether the variant involves a single or multiple base change (Mutation type), whether the variant is a synonymous control, ExAC or unknown (Control status), whether the variant passed quality control and is in the library (Library synthesis), the associated variant barcode (Variant barcode), the number of cells per variant (Cells/variant), the average expression of the variant barcode in UMIs per 10,000 UMIs (Normalized variant barcode counts (TP10K)), Hotelling’s T2 statistic representing the sc-eVIP score (HotellingT2), the q value (HotellingT2.q), the functional class assigned to the variant (variant functional class), the variant prevalence in the pan-cancer dataset (Count(pancan)), the variant prevalence in ExAC (Count (ExAC)) and the variant frequency in cancer cohorts (Count (COSMIC)).
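The HotellingT2 column holds a Hotelling's T2 statistic. As a reminder of what that statistic measures, the sketch below computes the standard two-sample form with a pooled covariance matrix. It is written without dependencies for illustration (NumPy would normally be used), all function names are hypothetical, and the choice of inputs is an assumption: each row stands for one cell's coordinates in a low-dimensional space such as principal components, with one group of variant cells and one reference group. It is not the authors' implementation.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the mean."""
    p = len(mean)
    scatter = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                scatter[i][j] += dev[i] * dev[j]
    return scatter


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def hotelling_t2(group_a, group_b):
    """Two-sample Hotelling's T2 with a pooled covariance estimate."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    s_a = scatter_matrix(group_a, mean_a)
    s_b = scatter_matrix(group_b, mean_b)
    p = len(mean_a)
    pooled = [[(s_a[i][j] + s_b[i][j]) / (n_a + n_b - 2) for j in range(p)]
              for i in range(p)]
    diff = [a - b for a, b in zip(mean_a, mean_b)]
    weighted = solve(pooled, diff)  # pooled^-1 @ diff without an explicit inverse
    quad_form = sum(d * w for d, w in zip(diff, weighted))
    return n_a * n_b / (n_a + n_b) * quad_form
```

The statistic grows with the distance between the two group means, scaled by the within-group spread, so a variant whose cells sit far from the reference cells relative to cell-to-cell variability receives a large value.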
About this article
Cite this article
Ursu, O., Neal, J.T., Shea, E. et al. Massively parallel phenotyping of coding variants in cancer with Perturb-seq. Nat Biotechnol (2022). https://doi.org/10.1038/s41587-021-01160-7