A new Bayesian factor analysis method improves detection of genes and biological processes affected by perturbations in single-cell CRISPR screening

Clustered regularly interspaced short palindromic repeats (CRISPR) screening coupled with single-cell RNA sequencing has emerged as a powerful tool to characterize the effects of genetic perturbations on the whole transcriptome at a single-cell level. However, due to its sparsity and complex structure, analysis of single-cell CRISPR screening data is challenging. In particular, standard differential expression analysis methods are often underpowered to detect genes affected by CRISPR perturbations. We developed a statistical method for such data, called guided sparse factor analysis (GSFA). GSFA infers latent factors that represent coregulated genes or gene modules; by borrowing information from these factors, it infers the effects of genetic perturbations on individual genes. We demonstrated through extensive simulation studies that GSFA detects perturbation effects with much higher power than state-of-the-art methods. Using single-cell CRISPR data from human CD8+ T cells and neural progenitor cells, we showed that GSFA identified biologically relevant gene modules and specific genes affected by CRISPR perturbations, many of which were missed by existing methods, providing new insights into the functions of genes involved in T cell activation and neurodevelopment.

Supplementary Fig. S6: Permutation results of DEG detection methods on CD8 + T cell CROP-seq data.Results from 10 randomly permuted datasets are presented together.a) GSFA effect sizes of perturbations on factors, estimated within stimulated cells and unstimulated cells, respectively.b) GSFA PIPs of associations between factors and perturbations, estimated within stimulated cells and unstimulated cells, respectively.c) GSFA LFSRs of genes under all perturbations, estimated within stimulated cells and unstimulated cells.d) Quantile-quantile plot of empirical p-values of differential expression estimated by scMAGeCK-LR within stimulated cells, assuming a uniform(0,1) null distribution; empirical p-values of exact zeros were replaced with 5e-4 for visualization in the Q-Q plot (specific to (d)); the shaded band is formed by upper-and lower-bounds of the 95% confidence interval at each of the 1000-quantiles of the uniform(0,1) distribution.e), f), g), h) Same as in (d) except that the observed quantile values are of differential expression p-values estimated within stimulated cells by DESeq2 (e), MAST (f), SCEPTRE (g), and edgeR-QLF test (h).Supplementary Fig. S7: MUSIC results in real data.The heat-map shows the strength of association between a perturbed gene (row) and a topic (column).To obtain the perturbation effects on inferred topics, we adapted the MUSIC's Diff_topic_distri() function to obtain the t-test statistics.To evaluate the significance of the observed effects, we performed 10,000 permutations of the perturbation conditions, and computed two-sided empirical p-values based on how extreme the t statistics calculated in actual data are relative to the empirical t statistics distributions.Bonferroni

Additional prior specification in GSFA
In line with the Bayesian framework, we specify the following conjugate prior distributions for parameters , ⇡, 2 , c 2 , p, and d 2 in the GSFA model: In practice, these hyperparameters are set to the following values: In the simulation study, s b = 5; in real data applications, s b = 20.
By choosing these hyperparameters, we set the mean values of the prior distributions of our parameters ⇡k = 0.2, pm = 0.2, ¯ j = ¯ 2 k = d2 m = 1, and c2 k = 1/6.

Gibbs sampling steps in GSFA
Here we describe the details of the Gibbs sampling steps in GSFA.
To obtain posterior samples for mk and mk , we first sample mk based on the product of two ratios: the ratio of two marginal likelihoods, and the prior ratio. where With mk sampled, we can obtain posterior samples of mk with For the remaining parameters, we can obtain their posterior samples as follows: where , ..., where Below is the pseudocode of a basic Gibbs sampling procedure in GSFA.
, W (t) , . . . ) 7: end for 8: Estimate posterior means of parameters { , , Z, W, . . ., c} 9: Compute LFSR based on posterior samples of , W, F 10: return posterior samples and means of all parameters, LFSR In practice, Z and W are initialized from a truncated singular value decomposition (SVD) of the normalized gene expression matrix Y, with the number of left (and right) singular vectors being K, the number of factors specified in the model.The last 20% quantile of elements (in terms of absolute value) in W are set to 0, and F is initialized as the binarized version of W. m is initialized as the coe cients of linear regression Z ⇠ G m (1  m  M ).The last 50% quantile of elements (in terms of absolute value) in are set to 0, and is initialized as the binarized version of .
The initialization of additional parameters is as follows: 2 Input pre-processing

Deviance residual transformation and feature selection for count data
To accommodate the application of GSFA on count data, we follow the transformation proposed in [1], where the count data are transformed into continuous quantities in the form of deviance residuals.In a standard data normalization pipeline, raw counts are normalized by sample-specific size factors, and then log-transformed.However, due to the large number of zeros in scRNAseq UMI counts, normalization schemes commonly used for bulk RNA-seq data may result in unstable normalization [2], and the arbitrary pseudocount added during the log transformation of exact zeros may introduce systematic errors and cause spurious di↵erences in expression [3].The deviance residual transformation circumvents these di culties by directly modeling the raw count data under a multinomial null model of constant gene expression across all cells, and quantifying the fit of data in the form of deviance residuals, a quantity analogous to z-scores and approximately follow a normal distribution.Specifically, the deviance residual for gene j in cell i is Here c ij is the raw gene count, n i is the library size of cell i, and μij = n i P i c ij P i n i is the expression of gene j under the null model of constant expression.Following [1], we use an approximate multinomial deviance statistic to evaluate the deviance of a gene from the null model: Genes with constant expression across cells are not informative and will have a deviance of 0, while genes that vary across cells in expression will have a larger deviance.Therefore, one can pick the genes with high deviance during feature selection as an alternative to selecting highly variable genes, with the advantage that the selection is not sensitive to normalization.

GSFA model with alternative prior on gene weights
GSFA also allows one to use the standard spike-and-slab prior for the gene weights, although we find that it does not work as well as the default mixture-of-normal prior (Equation ( 4) in Methods).
The alternative spike-and-slab prior is given by: Similarly, we introduced a latent binary matrix F P ⇥K to indicate whether W jk 's are nonzero.We can obtain the posterior samples of these parameters as follows: where With F jk sampled, we can obtain posterior samples of W jk with

GSFA model with multiple cell groups
In the cases when one is interested in learning about the e↵ects of perturbations under di↵erent cell types or experimental conditions, we extend the current model (Equations ( 1) -( 4) in Methods) so that the factors are inferred using all cells but the associations between factors and perturbations are estimated separately for each cell group.For example, assuming 2 groups of cells, group 0 and group 1, we have the conditional probability of factor matrix Z where 0 and 1 are M ⇥ K matrices holding the e↵ect sizes of perturbations on factors within group 0 cells and group 1 cells, respectively.Each e↵ect size matrix is still subjected to the same "spike-and-slab" prior: The distributions of other model parameters remain the same.
Once we have the posterior samples of parameters and latent variables, we can similarly obtain the posterior samples of the total e↵ects of perturbations on individual genes, ✓ mj 's, within each cell group using Equation ( 5) in Methods and the corresponding mk for that cell group.

Selecting the number of factors in GSFA
We provide some guidance for the selection of total factor number, K, in the GSFA model.We follow a strategy that is widely used in principal component analysis (PCA).Specifically, given a model with a set number of factors, we assess the percentage of variation explained (PVE) by all the inferred factors out of total gene expression variation as: where Var(Y •j ) denotes the sample variance of gene j.Y is the input cell by gene matrix for the model, which contains continuous gene expression levels in normal scenarios, or transformed deviance residuals in count-based and real data scenarios.Since we do not know the true values of Z and W , in practice, we use their posterior estimates as an approximation.
We varied the number of factors in the model, K, from 6 to 14 in simulations, and computed the PVE under each model.In all cases, we observed that PVE initially increases with K, but saturates around the true number of underlying factors, 10.We also varied K from 2 to 30 in real data applications, and observed saturation of PVE around K = 20, which justified our choice of 20 factors in the model.Overall, we have provided a PVE-based procedure that can help select a reasonable number of factors in GSFA application.

GSFA implementation and running time
The computational complexity of the GSFA inference is O(NK + P K) per iteration, with N being the number of cells, K being the number of factors, and P being the number of genes.The average run time of GSFA on a simulated dataset of 4000 cells, 6000 genes, and 10 factors is 1.32 seconds per iteration on a modern Linux workstation with Intel Xeon E5-2680 v4 (2.40 GHz) processors.The running time and memory requirement of GSFA in one real dataset (LUHMES) are shown in Table S5.We also included in the We created an additional simulation scenario to compare GSFA and other methods for detecting target genes of perturbations.Our motivations are two fold.First, we would like to use a real scRNA-seq dataset to test GSFA.Secondly, we would like to test how robust GSFA is when its data generative model is not followed.So we will avoid using latent factors in our simulation procedure.We downloaded a real scRNA-seq dataset of Peripheral Blood Mononuclear Cells (PBMCs) from 10X Genomics: https://github.com/10XGenomics/single-cell-3prime-paper/tree/master/pbmc68k_analysis.This dataset is widely used for benchmarking of scRNA-seq analysis methods.We focused on "CD19+ B cells", which were labeled by the original study.After performing quality control, our expression matrix contained 10,830 genes and 5,866 cells.
We introduced three true gRNAs and one negative control gRNA into the expression data.To achieve this, we randomly perturbed 10% of cells, and chose 300 target genes for each gRNA.All target genes were selected from the set of highly expressed genes, defined as genes expressed in over 10% of cells.We placed the gRNAs in these cells in a way to ensure that each cell contains at most one gRNA.Furthermore, the target gene sets of each gRNA were disjoint.
To simulate the e↵ect of a true gRNA, we perturbed the expression of its target genes in each cell containing that gRNA.First, we log-normalized the read count matrix using the median library size of 1294 as the scale factor.We then sampled the e↵ect sizes of the three gRNAs from Normal distributions with mean zero.The standard deviations of the e↵ect sizes were set at 0.3, 0.4, and 0.5 for the three gRNAs, respectively.After introducing the gRNA perturbations, we transformed the normalized expression matrix back to read counts and rounded all values to the closest integer.This resulting read count matrix was used as input for three methods we compared, namely GSFA, MAST, and Wilcoxon rank sum test.
For the di↵erential expression analysis, we compared cells perturbed by a true gRNA with those perturbed by the negative control gRNA.To select di↵erentially expressed genes (DEGs), we used a local false sign rate (LFSR) cuto↵ of below 0.05 for GSFA, and a false discovery rate (FDR) cuto↵ of below 0.05, using the Benjamini-Hochberg method for multiple testing correction, for MAST and Wilcoxon.

Evaluation of other methods on simulated data
For comparison, we have run other di↵erential gene expression (DGE) methods.We applied Welch's t-test [4] to both the normal data and count-normalized deviance residual data.For count data scenarios, we also applied edgeR-QLF [5] and MAST [6].In all these di↵erential expression tests, cells with each perturbation were compared with all other cells without this perturbation for all genes, FDR was computed following the Benjamini-Hochberg procedure for genes under each perturbation, and significant DEGs were obtained by thresholding FDR < 0.05.
We also compared GSFA with a two-step clustering procedure.In the first step, we clustered the cells based on expression of all genes, and in the next step, we detected clusters significantly associated with each of the perturbations.We assessed the results in two ways.First, we assessed the power of discovering the e↵ects of perturbations.Note that it is not straightforward to map clusters with factors in simulation data.For simplicity, we consider the association of a perturbation with any cluster as a true positive.This metric favors clustering-based analysis.Second, to define the target DEGs of a perturbation, we first obtained genes di↵erentially expressed between each cluster and the rest, and then took the union of all DEGs of the clusters associated with the perturbation of interest.We then compared the resulting DEGs with the true target genes of corrections of the empirical p-values were made to adjust for multiple comparisons.a) T cell results.b) LUHMES results.Supplementary Fig. S8: Permutation results of DEG detection methods on LUHMES CROP-seq dataset.Results from 10 randomly permuted datasets are presented together.a) GSFA effect sizes of perturbations on factors.b) GSFA estimated PIPs of associations between factors and perturbations.c) GSFA estimated LFSRs of genes under all perturbations.d) Quantile-quantile plot of empirical p-values of differential expression estimated by scMAGeCK-LR, assuming a uniform(0,1) null distribution; empirical p-values of exact zeros were replaced with 5e-4 for visualization in the Q-Q plot (specific to (d)); the shaded band is formed by upper-and lower-bounds of the 95% confidence interval at each of the 1000-quantiles of the uniform(0,1) distribution.e), f), g) Differential expression p-values estimated by DESeq2 (e), MAST (f), and SCEPTRE (g).

Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 FactorFactor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor
size matrix (β) in simulation scenarios where each of the six perturbations affects three out of ten factors. .

Table S2 . Full gene ontology enrichment results in T cell GSFA factors and Table S4. Full gene ontology enrichment results in LUHMES GSFA factors can
be found in separate extended files.

Table S6 . Effect size matrix (β) in a simulation scenario that include negative control cells as an additional perturbation group.
table two related single-cell CRISPR analysis methods, SCEPTRE and MUSIC, for comparison.GSFA was implemented in R and Rcpp (using the R packages Rcpp 1.0 and RcppArmadillo 0.10, and is available at Github: https://github.com/xinhe-lab/GSFA.R packages ggplot2 3.3.3and ComplexHeatmap 2.6.2 were used for visualization.All analyses in this study were performed in R version 4.2.0.