Introduction

Pooled CRISPR knockout screening is a powerful technique for evaluating the biological function of genes. This technique, however, only recognizes genes with very distinct phenotypes, such as those that substantially affect cellular growth or that can be detected directly with antibodies or fluorescent protein reporters, which limits its ability to detect genes with subtle phenotypes1,2,3. Recently described methods, i.e., single-cell-based CRISPR knockout or knockdown screening (independently termed Perturb-Seq4,5, CRISP-seq6, and CROP-seq7,8), combine pooled CRISPR screening with single-cell RNA-seq to perform functional CRISPR screening at the single-cell level. These screening methods make it possible to carry out large-scale gene perturbation studies in a more detailed way.

The key technical innovation of single-cell CRISPR screening, including Perturb-Seq4,5, CRISP-seq6, and CROP-seq7,8, lies in modifying the lentiviral vector so that the sgRNA in a single cell can be identified from deep sequencing of mRNAs (the polyadenylated RNA fraction)3. By taking advantage of advances in performing mRNA-seq on individual cells, large numbers of cells with distinct perturbations within a heterogeneous cell population can be investigated3,9.

Several computational challenges exist in the analysis of such single-cell CRISPR screening data: (1) Data sparsity and noise. Single-cell RNA-seq data are sparse10,11. In addition, both single-cell RNA-seq data and pooled CRISPR screening data are inherently noisy12,13, and their combination exacerbates this noise. Efficient data filtering and normalization are needed to meet these challenges. (2) The sgRNA perturbation and off-target effects should be carefully investigated when linking such perturbations with the gene expression readout14,15, particularly for heterogeneous cell-to-cell comparisons. (3) The effect of each perturbation, and the relationships between perturbations, must be quantitatively estimated and prioritized in parallel across cells in the presence of cellular heterogeneity and technical complexity. (4) The perturbation results need to be visualized intuitively across a large, heterogeneous cell population. To this end, we developed MUSIC, an integrated tool for model-based understanding of single-cell CRISPR screening. MUSIC is an easy-to-use, model-based analytical tool designed specifically for single-cell CRISPR screening data analysis.

Results

General pipeline of MUSIC

MUSIC comprises three steps for single-cell CRISPR screening data analysis (Fig. 1): data preprocessing, model building, and perturbation effect prioritizing.

Fig. 1

General workflow of MUSIC. MUSIC comprises three steps for single-cell CRISPR screening data analysis: data preprocessing, model building, and perturbation effect prioritizing. In the first step, besides the conventional assessment of cell quality, several factors specific to single-cell CRISPR screening are also considered: the ratio of nonzero expression values of the perturbed gene across all cells, sgRNA efficiency, and the minimal number of perturbed cells per perturbation. In the second step, MUSIC applies a topic-model-based computational framework to derive the functional topics of each cell (including controls) with a specific perturbation (PE, perturbation). In the third step, MUSIC quantitatively estimates and prioritizes the individual gene perturbation effect on cell phenotypes from three different perspectives, i.e., prioritizing the gene perturbation effect as an overall perturbation effect, or in a functional topic-specific way, and quantifying the relationships between different perturbations.

In the first step (Fig. 1 and see Methods), in addition to the routine quality control and data normalization applied in single-cell RNA-seq analysis, MUSIC applies a data imputation step (using SAVER16) to improve data quality. MUSIC also addresses two issues specific to this novel data type: (1) filtering perturbed cells with invalid edits; and (2) filtering perturbations according to a minimal number of cells per perturbation.

Second, MUSIC builds a computational framework based on topic models to handle single-cell CRISPR screening data (Fig. 1 and see Methods). The concept of topic models was initially presented in the machine-learning community17 for the discovery of hidden semantic structures in a text body and has been successfully applied to gene expression data analysis18,19,20. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently. The topics generated by topic modeling are represented by classes of words with similar semantic meanings. A topic model is a probabilistic framework that examines the given documents and discovers their topic profiles based on such word frequency representations. By analogy, in single-cell CRISPR screening data, a single perturbed cell can be treated as a document, and gene expression is analogous to word frequency in the document. A topic here represents a specific biological function associated with a group of highly differentially expressed genes. Therefore, a topic model allows us to examine a set of perturbed cells and discover, based on the gene expression in each, what the perturbation-induced biological functions might be. Two key advantages of the topic model applied here are as follows. (1) It allows each perturbed sample to possess a proportion of membership in each functional topic rather than categorizing the sample into a discrete cluster. Such a topic profile, derived from a large number of differently perturbed cells, makes the subsequent ranking of perturbation impact straightforward and quantitative. As illustrated in Fig. 2, whereas traditional clustering makes a hard assignment of cells to subclasses, topic modeling calculates a topic probability profile for each sample. (2) Topic modeling is sensitive to subtle phenotype changes because it compares the topic probability profile with and without perturbation, whereas traditional clustering generally fails to detect such subtle phenotype changes, which are widespread in single-cell CRISPR screening data (Fig. 2).

Fig. 2

Comparison between traditional clustering-based analysis and topic-model-based analysis for single-cell CRISPR screening data. a Difference between traditional clustering-based analysis and topic-model-based analysis when a perturbation has a significant phenotype on the cells. Both analyses can detect such a phenotype change (see the cell sample with the red dotted line). b Difference between traditional clustering-based analysis and topic-model-based analysis when a perturbation has a subtle phenotype on the cells. Topic modeling calculates a topic probability profile for each sample, whereas traditional clustering makes a hard assignment of the sample to a cluster. Therefore, topic-model-based analysis can detect such a phenotype change from the change in the topic probability profile with and without perturbation, whereas traditional clustering-based analysis fails to detect it (see the cell sample with the red dotted line).

In addition, MUSIC addresses several specific issues when applying the topic model to this specific data type: (1) The distribution of topics between cases and controls is affected by the ratio of their sample numbers, and such a sample imbalance issue is addressed by the bootstrapping strategy when prioritizing the perturbation effect (see Methods). (2) The optimal topic number is automatically selected by MUSIC in a data-driven manner (see Methods).

Finally, with the topic-model-based perturbation analysis, MUSIC can quantitatively estimate and prioritize the individual gene perturbation effect on cell phenotypes from three different perspectives (Fig. 1 and see Methods), i.e., prioritizing the gene perturbation effect as an overall perturbation effect, or in a functional topic-specific way, and quantifying the relationships between different perturbations.

Evaluating the performance of MUSIC

To evaluate the performance of MUSIC, we carried out two analyses. First, we applied MUSIC to all 14 publicly available single-cell CRISPR screening datasets, including Perturb-Seq4,5, CRISP-seq6, and CROP-seq7,8 data (Supplementary Table 1). For illustration, we use the doxorubicin-treated MCF10A cells8 with 29 tumor suppressors perturbed as an example (Fig. 3a, b). Detailed analysis results for all the other datasets are available in the supplementary materials (Supplementary Data 1–14 and Supplementary Fig. 1–14).

Fig. 3

Illustrative results of MUSIC for single-cell CRISPR screening data analysis. We take the dataset of MCF10A cells treated with doxorubicin (GSM2911346), generated by the updated version of CROP-seq8, as an example, as illustrated in (a, b). The overall perturbation effect ranking lists identified by MUSIC were also compared between cells with different treatments, as illustrated in (c). a The functional annotations of each topic derived from topic modeling for dataset GSM2911346. b The overall perturbation effect ranking list and the topic-specific perturbation effect ranking list for dataset GSM2911346. c The differences in perturbation impact between experimental conditions, shown for Perturb-Seq5 and CROP-seq7,8 data, respectively.

Then, we compared MUSIC with two other tools, MIMOSCA5 and LRICA4 (Tables 1 and 2). MIMOSCA is a computational framework for multiple-input, multiple-output single-cell data analysis. LRICA was proposed to decipher the driver signals/components of the data by low-rank matrix factorization. Although the MIMOSCA and LRICA models have been described in the literature, they were developed only as prototypes without executable, user-friendly implementations. In addition, the output of MUSIC differs from that of these tools, so the results are not straightforward to compare. Therefore, we provide preliminary comparison results in Tables 1–3 for several datasets to indicate the effectiveness of MUSIC.

Table 1 Comparison of detailed analysis results between MUSIC and MIMOSCA
Table 2 Comparison of detailed analysis results between MUSIC and LRICA
Table 3 Other representative analysis results of MUSIC

First, the comparison between the analysis results of MUSIC and MIMOSCA is presented in Table 1. MUSIC recapitulated the findings of MIMOSCA, such as the perturbation impact of Cebpb on immune cell activation21. MUSIC also identified a novel knockout effect on cell migration22, which is consistent with previous knowledge. MUSIC further identified gene–gene perturbation relationships, such as the known associations between Cebpb knockout and other gene perturbations, through quantitative correlation calculations (Table 1).

Second, a similar comparison between MUSIC and LRICA is presented in Table 2. Again, MUSIC recapitulated the findings of LRICA. For example, ATF6, PERK, and IRE1α are all important proteins in the unfolded protein response (UPR). The original study indicated that perturbation of PERK has a greater impact than perturbation of ATF6 or IRE1α. MUSIC recapitulated this finding quantitatively. In addition, a novel perturbation effect on apoptosis from knocking out the three genes simultaneously23 was identified, which indicates that in the absence of the three branches of the UPR, K562 cells significantly enhance positive regulation of the apoptosis signaling pathway (Supplementary Data 8 and Supplementary Fig. 8).

Finally, analysis of the remaining datasets also recapitulated original findings or identified novel results. Representative results of MUSIC on the remaining datasets are shown in Table 3. MUSIC recapitulated the original findings, such as the important influence of Cebpb perturbation on immune cell differentiation24. MUSIC further identified several novel findings, such as the high correlation between Cebpb and Rela25 perturbations (Supplementary Data 10). MUSIC also identified the distinctive response to TP53 knockout when cells are treated with doxorubicin, which is consistent with previous knowledge26,27,28 (Fig. 3c).

Evaluating the impact of the data preprocessing strategies adopted in MUSIC

Because single-cell CRISPR screening data contain substantial noise, MUSIC adopts several data preprocessing strategies (see Methods), which can effectively improve its performance. In this part, we further explore their impact on the outputs of MUSIC from the following three aspects.

First, we provide an overview of how many cells are filtered from the datasets during data preprocessing. A statistical summary of the proportion of cells filtered by quality control is shown in Fig. 4a, indicating that an average of 6% of cells are filtered. A statistical summary of the proportion of cells removed by filtering low-efficiency sgRNAs is shown in Fig. 4b (Supplementary Data 15). This step filtered an average of 41% of cells, and the ratio differs between datasets and techniques. It should be noted that prior studies have already indicated that the single-cell CRISPR screening technique is very noisy: 20–30% of the cells with a detected sgRNA show a wild-type phenotype29,30, and these cells should be filtered.

Fig. 4

Evaluating the impact of the data preprocessing strategies adopted in MUSIC. a The proportion of cells filtered by quality control for all datasets. The red dashed line represents the mean of the data. b The proportion of cells removed by filtering low-efficiency sgRNAs for all datasets. The red dashed line represents the mean of the data. c zero_rate plot of all knockouts/knockdowns in all datasets. The red dashed line represents the mean zero_rate of all the knockouts/knockdowns. d Comparison of the overall perturbation effect ranking with or without imputation/filtering for all the available datasets.

Second, since single-cell CRISPR screening data are noisy and zero-inflated, we provide a statistic showing how frequently genes have a zero expression value across all cells, and we demonstrate that our filtering strategy does not remove lowly expressed but functional genes such as transcription factors. To this end, for all 326 knockouts/knockdowns in all 14 datasets, we calculated the proportion of zero expression values across all cells, denoted as the zero_rate of these genes (Fig. 4c and Supplementary Data 16). Our filtering strategy successfully filters CDKN2A in doxorubicin-treated and untreated MCF10A cells8, which is expected since MCF10A breast epithelial cells carry a deletion of the CDKN2A locus. Only two other genes were filtered: PTPRD in doxorubicin-treated MCF10A cells8 and IER3IP1 in K562 cells4, probably due to noise in these datasets. These genes are not transcription factors, and all functional transcription factors are retained. To further evaluate the impact of this filtering on the results of MUSIC, we tested what happens when the filtering step is removed. We reran MUSIC and compared the overall perturbation effect ranking with and without the zero-expression filter for the three affected datasets (doxorubicin-treated MCF10A, untreated MCF10A, and K562 cells). More specifically, we normalized the overall ranking score (see the section Obtaining the overall perturbation effect ranking list in Methods) in the ranking lists calculated with and without the zero-expression filter. Then we calculated the Pearson correlation coefficients of the normalized overall ranking score profiles with and without the filter. The Pearson correlation coefficients were 0.99 for doxorubicin-treated MCF10A, 0.93 for untreated MCF10A, and 0.98 for K562 cells. Taken together, these results show that filtering zero expression does not induce substantial changes in the overall rankings, which means that filtering the corresponding knockouts generally leaves other knockouts or knockdowns unaffected.

Third, we evaluated the impact of the imputation and filtering strategies in the data preprocessing step on the final perturbation ranking results. To this end, we took a group of genes tested by Perturb-Seq5 as a benchmark, in which Cebpb has a strong reinforcing effect on Rela, Hif1a, Stat3 and Junb, and a strong opposing effect on Nfkb1, Runx1, Irf4 and Spi1. The relationships among these genes are so evident that they are well suited to serve as a gold standard. As shown in Supplementary Table 2, a comparison with and without imputation/filtering was performed on this dataset. Imputation and filtering together uncover these strong positive and negative correlations correctly. We further made a global evaluation to assess the overall impact of the data preprocessing on all the datasets (Fig. 4d). Here, the overall impact is calculated as the correlation of the overall perturbation effect ranking with and without imputation/filtering for all 14 datasets (Supplementary Data 17). More specifically, we first normalized the overall ranking score (see the section Obtaining the overall perturbation effect ranking list in Methods) in the ranking lists calculated with and without imputation/filtering. Then we calculated the Pearson correlation coefficients of the normalized overall ranking score profiles with and without imputation/filtering. The bar plots of these similarity comparisons are shown in Fig. 4d, indicating how imputation_only, filtering_only, or their combination affects the final perturbation effect ranking as a whole. All three strategies changed the ranking list, with a similarity of ~0.6 on average. The combination strategy changed the ranking list the most, which is expected.

Discussion

In this study, we developed MUSIC, an integrated model-based pipeline designed specifically for single-cell CRISPR screening. MUSIC takes the raw counts data with the corresponding perturbation information as inputs and it can quantitatively estimate and prioritize the perturbation effect for each knockout or knockdown from three different perspectives, i.e., prioritizing the gene perturbation effect as an overall perturbation effect, in a functional topic-specific way, and quantifying the relationships between different perturbations. Extensive tests on MUSIC demonstrated that it is an effective and applicable pipeline for analyzing single-cell CRISPR screening data.

Single-cell CRISPR screening is a powerful technique that makes it feasible to perform large-scale perturbations at single-cell granularity. However, the data are inherently noisy, which makes their analysis challenging. The current version of MUSIC contains a series of carefully designed filtering steps to reduce data noise, and future improvements are expected to refine and update these filtering steps to make them more effective.

Methods

Cell quality control

MUSIC evaluates cell quality based on three factors29, i.e., the number of genes detected (default 500), the number of unique molecular identifiers (UMIs) detected (default 1000), and the percentage of mitochondrial genes detected (default 10% of all detected genes). Only cells with the first two factors above their thresholds and the third factor below its threshold are retained.
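The three-threshold rule can be illustrated with a minimal Python sketch. It is not MUSIC's implementation: the `CellStats` container and the function name are hypothetical, and treating the thresholds as strict inequalities is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CellStats:
    """Per-cell summary statistics (hypothetical container, not part of MUSIC)."""
    n_genes: int         # number of genes detected in the cell
    n_umis: int          # number of UMIs detected in the cell
    mito_percent: float  # percentage of detected genes that are mitochondrial


def passes_quality_control(cell, min_genes=500, min_umis=1000, max_mito_percent=10.0):
    """Return True when the cell clears all three thresholds (strict comparisons assumed)."""
    return (
        cell.n_genes > min_genes
        and cell.n_umis > min_umis
        and cell.mito_percent < max_mito_percent
    )


cells = {
    "cell_a": CellStats(n_genes=2100, n_umis=8000, mito_percent=3.5),
    "cell_b": CellStats(n_genes=350, n_umis=900, mito_percent=2.0),    # too few genes and UMIs
    "cell_c": CellStats(n_genes=1800, n_umis=5000, mito_percent=22.0),  # too many mitochondrial genes
}
retained = [name for name, stats in cells.items() if passes_quality_control(stats)]
print(retained)  # ['cell_a']
```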

Data imputation

Single-cell RNA-seq data are sparse10,11: only a small fraction of the transcripts present in each cell are sequenced. To improve data quality, MUSIC adopts SAVER16, an R package for single-cell RNA-seq data imputation, which proved necessary for MUSIC to recover the correct regulatory relationships (Supplementary Table 2). It should be noted that SAVER has been shown to recover the true expression level of each gene in each individual cell while avoiding the introduction of spurious correlations or false-positive gene pairs that have no biological correlation.

Evaluation of sgRNA knockout efficiency

The sgRNA knockout efficiency in CRISPR screening should also be carefully evaluated. The sgRNA targets Cas9 to a specific gene locus, but only 70–80% of them generate a true loss of function of the targeted gene30,31. This implies that in 20–30% of the cells with a detected sgRNA, the gene can be active or partially active and the cell shows a wild-type phenotype (a false positive), which influences the estimated impact of the perturbation. Thus, a step to filter such cells is needed. Intuitively, our filtering algorithm is based on the following assumption: if the differentially expressed gene profile of a perturbed cell is more similar to those of the control cells than to those of other cells with the same perturbation, the cell is filtered. Specifically, for each type of perturbation, we performed the following steps:

  • If the corresponding gene expression values of the perturbation are all zero among all the cells, this perturbation will be filtered directly. If not, perform the following steps.

  • Identifying genes that are differentially expressed between control and perturbed cells by the Kolmogorov–Smirnov test at p < 0.05.

  • For each perturbed cell i, the median of cosine similarity of differentially expressed gene profile between i and all the other perturbed cells with the same perturbation is calculated, denoted as M(Pi).

  • For each perturbed cell i, the median of cosine similarity of differentially expressed gene profile between i and all the control cells is calculated, denoted as M(Ci).

  • For each cell i, if M(Ci) is bigger than M(Pi), this cell will be filtered.

  • For a specific perturbation, if the filtered cells amount to a high proportion (default 90%) of all cells carrying that perturbation, the perturbation itself is filtered.
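The similarity-based steps above can be sketched in Python as follows. This is an illustrative sketch rather than MUSIC's code: the function names are hypothetical, the profiles are assumed to be already restricted to the differentially expressed genes (the Kolmogorov–Smirnov step is not reproduced), at least two perturbed cells are assumed, and the handling of ties and of the exact 90% boundary is an assumption.

```python
import math
from statistics import median


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_perturbed_cells(perturbed, control, max_filtered_fraction=0.9):
    """Apply the M(Ci) > M(Pi) rule to the cells sharing one perturbation.

    `perturbed` and `control` map cell names to expression profiles over the
    differentially expressed genes. Returns (kept_cell_names, keep_perturbation).
    """
    kept = []
    for name, profile in perturbed.items():
        others = [p for other, p in perturbed.items() if other != name]
        m_p = median(cosine_similarity(profile, p) for p in others)
        m_c = median(cosine_similarity(profile, c) for c in control.values())
        if m_c <= m_p:  # the cell resembles its perturbed peers at least as much as controls
            kept.append(name)
    filtered_fraction = 1 - len(kept) / len(perturbed)
    return kept, filtered_fraction < max_filtered_fraction


control = {"c1": [1.0, 0.1, 0.0], "c2": [0.9, 0.2, 0.1]}
perturbed = {
    "p1": [0.1, 1.0, 0.9],
    "p2": [0.0, 0.9, 1.0],
    "p3": [0.2, 1.1, 0.8],
    "p4": [1.0, 0.1, 0.1],  # looks like a control cell, as an ineffective edit would
}
kept, keep_perturbation = filter_perturbed_cells(perturbed, control)
print(kept, keep_perturbation)  # ['p1', 'p2', 'p3'] True
```

Using the median rather than the mean of the similarities keeps a single outlying cell from dominating the comparison.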

The minimal perturbed cell number per perturbation

Datlinger et al.7 concluded that at least 30 cells are required to capture each perturbation phenotype. Therefore, perturbations with fewer than 30 perturbed cells (default) are not considered in MUSIC.

Selecting highly dispersion differentially expressed (DDE) genes

MUSIC identified differentially expressed genes in single-cell sequencing data as dispersion differentially expressed (DDE) genes, i.e., genes with a maximum dispersion difference (DD) between the case and control. MUSIC selects DDE genes based on the subsequent statistical test:

$${\mathrm{DD}}_i = \left| {{\mathrm{ZD}}_{{\mathrm{case}}}\left( i \right) - {\mathrm{ZD}}_{{\mathrm{control}}}\left( i \right)} \right|$$
(1)

where DDi is the i-th gene’s dispersion difference, and ZDcase(i) and ZDcontrol(i) are the z-scores of the i-th gene’s dispersion in the case and control cells, respectively. Before calculating the z-score, the genes were binned based on their average expression, and the z-score of the dispersion was calculated within their corresponding bins. The z-score of the i-th gene’s dispersion (ZDi) is calculated as

$${\mathrm{ZD}}_i = \frac{{D_i - \mu _i}}{{\sigma _i}}$$
(2)

where μi and σi are the mean and variance of the i-th gene expression, respectively, within its corresponding bin and Di is the dispersion of the i-th gene expression, which is calculated as

$$D_i = {\mathrm{log}}\frac{{\sigma _i}}{{\mu _i}}$$
(3)

where σi and μi are the variance and mean, respectively, of the i-th gene expression.
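Equations (1)–(3) can be illustrated with the following Python sketch. Several details are assumptions of the sketch, not statements about MUSIC: population variance is used, genes are grouped into equal-size bins after ranking by average expression, and the z-score in Eq. (2) is computed from the mean and standard deviation of the dispersions within each bin. Every gene is assumed to have a positive mean and variance so that the logarithm is defined.

```python
import math
from statistics import mean, pstdev, pvariance


def dispersions(expr):
    """Eq. (3): D_i = log(variance_i / mean_i) for each gene (one row per gene)."""
    return [math.log(pvariance(row) / mean(row)) for row in expr]


def binned_zscores(values, bin_keys, n_bins):
    """Z-score `values` within equal-size bins of genes ranked by `bin_keys`."""
    order = sorted(range(len(values)), key=lambda i: bin_keys[i])
    size = math.ceil(len(order) / n_bins)
    z = [0.0] * len(values)
    for start in range(0, len(order), size):
        members = order[start:start + size]
        bin_values = [values[i] for i in members]
        mu, sd = mean(bin_values), pstdev(bin_values)
        for i in members:
            z[i] = (values[i] - mu) / sd if sd > 0 else 0.0
    return z


def dispersion_difference(case_expr, control_expr, n_bins=2):
    """Eq. (1): DD_i = |ZD_case(i) - ZD_control(i)| for each gene."""
    z_scores = []
    for expr in (case_expr, control_expr):
        average_expression = [mean(row) for row in expr]
        z_scores.append(binned_zscores(dispersions(expr), average_expression, n_bins))
    return [abs(zc - zk) for zc, zk in zip(*z_scores)]


# Toy data: six genes (rows) measured in four cells (columns) per condition.
case_expr = [[1, 2, 3, 4], [5, 5, 6, 4], [0, 8, 0, 8], [2, 3, 2, 3], [7, 9, 8, 8], [1, 1, 2, 4]]
control_expr = [[2, 2, 3, 3], [5, 6, 5, 4], [3, 4, 5, 4], [2, 3, 2, 3], [8, 8, 7, 9], [1, 2, 2, 3]]
dd = dispersion_difference(case_expr, control_expr)
ranked_genes = sorted(range(len(dd)), key=lambda i: dd[i], reverse=True)
print(ranked_genes)  # gene indices ordered from largest to smallest dispersion difference
```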

Normalizing and rounding the expression value

The expression level of different genes is normalized and rounded to fit the topic model:

$$X_{{\mathrm{normalized}}} = \left[ {\frac{{X - \mu _{{\mathrm{control}}}}}{{\mu _{{\mathrm{control}}}}} \times 10} \right]$$
(4)

We round the final expression value as the ×10 magnification of the original normalized expression values.
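As a worked example of Eq. (4), the short Python sketch below assumes that μcontrol is the mean expression of the gene in control cells, which is not stated explicitly above. Python's built-in `round` resolves exact halves to the nearest even integer, which may differ from the rounding convention used by MUSIC.

```python
def normalize_and_round(x, mu_control, scale=10):
    """Eq. (4): relative change from the control mean, magnified by `scale` and rounded."""
    return round((x - mu_control) / mu_control * scale)


print(normalize_and_round(3.0, mu_control=2.0))  # relative change of +0.5 gives 5
print(normalize_and_round(2.0, mu_control=2.0))  # no change gives 0
print(normalize_and_round(1.0, mu_control=2.0))  # relative change of -0.5 gives -5
```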

Topic models

The topic model was originally presented in the machine-learning and natural language processing community for discovering latent topics in a set of documents17. This generative hierarchical model assumes that a word in a document is generated in two steps: a topic in the document is selected with a certain probability, and then a word in the topic is selected with a certain probability. The generative process of the topic model is formulated as follows, where θd and \(\phi _t\) are, respectively, the distribution over topics of document d and the distribution over words of topic t.

$$\theta _d{\mathrm{\sim Dirichlet}}\left( \alpha \right)$$
(5)
$$\phi _t{\mathrm{\sim Dirichlet}}\left( \beta \right)$$
(6)

Here, α and β are the hyperparameters of the Dirichlet distributions. For generating word i in document d, topic Zd,i is first sampled from the document’s distribution over topics, and then word Wd,i is sampled from the topic’s distribution over words based on the following distributions,

$$Z_{d,i} \mid \theta _d{\mathrm{\sim Multinomial}}\left( {\theta _d} \right)$$
(7)
$$W_{d,i} \mid Z_{d,i},\phi _{Z_{d,i}}{\mathrm{\sim Multinomial}}\left( {\phi _{Z_{d,i}}} \right)$$
(8)

In our study, the topic model is used to process single-cell CRISPR screening data. We draw an analogy between text mining and perturbation effect evaluation: documents correspond to the cells profiled by single-cell CRISPR screening, and the word frequency in a document corresponds to the expression value of genes in a given cell. We determined the joint probability of gene expression for each cell by integrating out the parameters θ and \(\phi\), and applied collapsed Gibbs sampling to assign the genes of each cell to topics. Further details can be found in ref. 17.
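To make the generative process of Eqs. (5)–(8) concrete, the following Python sketch samples a small synthetic corpus in which cells play the role of documents and genes the role of words. It only illustrates the forward model; it does not perform inference (the collapsed Gibbs sampling step is not shown), and the function names, symmetric Dirichlet priors, and parameter values are assumptions chosen for the example.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet distribution via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_cells(n_cells, n_topics, genes, counts_per_cell, alpha=0.5, beta=0.5, seed=0):
    """Sample cells (documents) as lists of genes (words) from the topic model."""
    rng = random.Random(seed)
    # Eq. (6): one distribution over genes per topic
    phi = [sample_dirichlet(beta, len(genes), rng) for _ in range(n_topics)]
    cells, thetas = [], []
    for _ in range(n_cells):
        theta = sample_dirichlet(alpha, n_topics, rng)  # Eq. (5): topic profile of the cell
        observed = []
        for _ in range(counts_per_cell):
            z = rng.choices(range(n_topics), weights=theta)[0]   # Eq. (7): pick a topic
            observed.append(rng.choices(genes, weights=phi[z])[0])  # Eq. (8): pick a gene
        cells.append(observed)
        thetas.append(theta)
    return cells, thetas, phi


genes = ["g1", "g2", "g3", "g4"]
cells, thetas, phi = generate_cells(n_cells=3, n_topics=2, genes=genes, counts_per_cell=10)
for theta, observed in zip(thetas, cells):
    print([round(p, 2) for p in theta], observed)
```

Each row of `thetas` is the kind of per-cell topic probability profile that the later ranking steps compare between perturbed and control cells.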

In summary, topic modeling was performed on the entire screen dataset to compare the impact of different perturbations against the same background. Topic modeling produces two outputs: (1) the probability distribution of each topic, represented as a topic profile, which is used to characterize each perturbation (including the control); and (2) the enriched functional profile of each topic, which is calculated by enrichment analysis of the top 10% of differentially expressed genes in each topic. With these two profiles in hand, we can quantitatively calculate the overall perturbation effect ranking, the topic-specific perturbation ranking, and the relationships between perturbations.

Annotating each topic’s function

MUSIC obtains the occurrence probabilities of genes available in each topic. For each topic, MUSIC took full advantage of the power of topic profile modeling to perform a weighted biological function annotation. Intuitively, genes with large occurrence probabilities are more representative of the function and they should be selected to annotate the topic function. Specifically, for each topic, MUSIC performed the following steps:

  • MUSIC first selects the top 10% genes of each topic based on their occurrence probabilities.

  • Genes selected by step 1 are used to perform the functional enrichment annotation with clusterProfiler32.

  • Finally, the top-ranked n (default 5) GO terms (ranked by q value) are selected to represent the topic functions.
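The gene-selection step of this annotation procedure can be sketched in Python as below. The function name is hypothetical, rounding the 10% cutoff up to a whole number of genes is an assumption, and the enrichment step itself (performed with clusterProfiler in R) is not reproduced.

```python
import math


def top_genes(gene_probabilities, fraction=0.10):
    """Return the genes with the largest occurrence probabilities in one topic."""
    n_top = max(1, math.ceil(len(gene_probabilities) * fraction))
    ranked = sorted(gene_probabilities, key=gene_probabilities.get, reverse=True)
    return ranked[:n_top]


# Toy topic over 20 genes whose occurrence probabilities sum to 1.
topic = {f"gene{i}": i / 210 for i in range(1, 21)}
print(top_genes(topic))  # ['gene20', 'gene19']
```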

Automatically selecting the optimal topic number

Topic distribution is influenced by the topic number. MUSIC applies an automatic strategy to select the optimal topic number. Intuitively, an optimal topic number should distinguish the cells with different perturbation effects from each other as much as possible. In our study, we defined a matrix Gm×n representing the n topics’ occurrence probability in m cells derived from the topic model with a certain topic number n. Then, an optimal topic number should make Gm×n match the following two criteria: (I) For each topic, its occurrence probability in different perturbation cells should differ as much as possible. Such a measurement is defined as a specificity score (SSn) for all the topics under a certain topic number n, as calculated in Eq. (9). The larger the specificity score, the better the selected topic number. (II) The fewer topic functions dominating each cell, the better. Such a measurement is defined as a purity score (PSn) for all the topics under a certain topic number n, as calculated in Eq. (10). The larger the score, the better the selected topic number. Finally, MUSIC defined the combination score (CSn), which is a weighted average of the specificity score and purity score, as shown in Eq. (11). Again, the larger the score, the better the selected topic number.

The specificity score (SSn) is calculated as

$${\mathrm{SS}}_n = {\mathrm{log}}\left( {\frac{1}{n}\mathop {\sum }\limits_{j = 1}^n \frac{{\sigma _j}}{{\mu _j^2}}} \right)$$
(9)

where n is the selected topic number, and σj and μj are the variance and mean, respectively, of the j-th column of Gm×n.

The purity score (PSn) is calculated as

$${\mathrm{PS}}_n = {\mathrm{log}}\frac{1}{m}\mathop {\sum }\limits_{i = 1}^m \sigma _i$$
(10)

where n is the selected topic number, m is the number of rows in matrix Gm×n, and σi is the variance of the i-th row of Gm×n.

The combination score (CSn) is calculated as

$${\mathrm{CS}}_n = \alpha {\mathrm{TSS}}_n + \left( {1 - \alpha } \right){\mathrm{TPS}}_n$$
(11)

where n is the selected topic number and α (default 0.5) is a weight taking values in [0, 1]. Considering the time cost and the biological interpretability of the results, we recommend trying a reasonable range of topic numbers (currently 4 to 6), taking into account prior information on biological functional categories.
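Equations (9)–(11) can be illustrated with the Python sketch below, which scores two toy cell-by-topic matrices. Population variance is an assumption of the sketch. Because the transformation from SSn and PSn to the TSSn and TPSn terms of Eq. (11) is not spelled out above, `combination_score` simply takes the already-transformed scores as inputs.

```python
import math
from statistics import mean, pvariance


def specificity_score(G):
    """Eq. (9): log of the average, over topics (columns), of variance / mean^2."""
    columns = list(zip(*G))
    return math.log(mean(pvariance(col) / mean(col) ** 2 for col in columns))


def purity_score(G):
    """Eq. (10): log of the average, over cells (rows), of the row variance."""
    return math.log(mean(pvariance(row) for row in G))


def combination_score(tss, tps, alpha=0.5):
    """Eq. (11): weighted average of the (already transformed) TSS_n and TPS_n."""
    return alpha * tss + (1 - alpha) * tps


# Rows are cells, columns are topics; each row is a topic probability profile.
G_sharp = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]  # cells dominated by one topic
G_flat = [[0.5, 0.5], [0.6, 0.4], [0.5, 0.5]]   # cells with nearly uniform profiles
print(specificity_score(G_sharp) > specificity_score(G_flat))  # True
print(purity_score(G_sharp) > purity_score(G_flat))            # True
```

Both scores take a logarithm, so a matrix in which every row (or column) has zero variance would need special handling that this sketch omits.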

Considering off-target effects

An sgRNA off-target effect may exist in this novel type of data owing to the use of the CRISPR knockout/knockdown screening technique. For the CRISPRi technique, MUSIC skips this step, since CRISPRi knockdown is highly specific with minimal off-target effects33. In the current version, MUSIC only provides off-target information for knockouts. Basically, MUSIC integrates the sgRNA sequence information with the expression of the corresponding knockout gene to determine whether the sgRNA has induced an off-target effect, as follows:

  • CRISPRseek34 is performed to predict possible off-targets based on the sgRNA sequence information.

  • Correlations of the transcriptional expression values between the corresponding knockout gene and the possible off-targets are calculated for the case and control, respectively.

  • If a significant increase in the correlations between the case and control is detected, the possible off-target effect for this knockout is reported in MUSIC.

Obtaining the topic-specific ranking list

To analyze the functions of the perturbations impact, MUSIC prioritizes the perturbation effect in a topic-specific way. For a specific topic, MUSIC prioritizes the perturbation effect by calculating the specific topic probability difference (TPD) between the case and control. Intuitively, the ranking list is obtained by evaluating the perturbation effect on this specific topic, where the perturbation should influence this topic as much as possible while keeping other topics as unaffected as possible. Specifically, MUSIC performed the following steps:

  1. MUSIC calculates the topic probability difference (TPD) based on Student’s t-test. To meet the conditions of the t-test, the topic probabilities of cells with different perturbations were normalized to the standard normal distribution. Specifically, for the i-th perturbation on the j-th topic, each topic probability was z-normalized with respect to the mean and standard deviation of the corresponding control population as:

    $$P_{{\mathrm{normalized}}}\left( {i,j} \right) = \frac{{P\left( {i,j} \right) - \mu _{{\mathrm{control}}}}}{{\sigma _{{\mathrm{control}}}}}$$
    (12)
  2. The number of cells generally varies greatly between perturbations, and this sample imbalance can strongly affect the analysis of the perturbation effects. To address this issue, MUSIC first identifies the minimal cell number (M) among all perturbations. Then, for each perturbation, MUSIC adopts a bootstrapping strategy: it randomly samples M cells and performs the Student t-test, repeats this 1000 times, and takes the median of the resulting test statistics. The test statistic of the i-th perturbation on the j-th topic is calculated as

    $${\mathrm{TPD}}_{ij} = \frac{{\bar X_{ij} - \bar X_{{\mathrm{control}},j}}}{{\sqrt {\left( {\frac{{\left( {n_i - 1} \right)S_{ij}^2 + \left( {n_{{\mathrm{control}}} - 1} \right)S_{{\mathrm{control}},j}^2}}{{n_i + n_{{\mathrm{control}}} - 2}}\left( {\frac{1}{{n_i}} + \frac{1}{{n_{{\mathrm{control}}}}}} \right)} \right)} }}$$
    (13)

    where \(\bar X_{ij}\) is the mean of normalized topic probabilities calculated in Eq. (12) for the i-th perturbation on the j-th topic, \(\bar X_{{\mathrm{control}},j}\) is the mean of normalized topic probabilities of control cells for the j-th topic, \(S_{ij}\) is the standard deviation of normalized topic probabilities of cells for the i-th perturbation on the j-th topic, \(S_{{\mathrm{control}},j}\) is the standard deviation of normalized topic probabilities of control cells for the j-th topic, and \(n_i\) and \(n_{{\mathrm{control}}}\) are the numbers of cells entering the test for the i-th perturbation and the control, respectively.

    In our study, the test statistic TPD is used for two reasons: (a) TPD is a valid metric for estimating the difference in means between two populations; (b) TPD can be positive or negative, so it also indicates the direction of a perturbation's impact.

  3. MUSIC then prioritizes a perturbation by considering both its effect on the specific topic and its influence on other topics.

    MUSIC uses the topic probability difference ratio (TPDR) to evaluate the influence on other topics. The larger the ratio, the smaller the influence of the perturbation on other topics.

    The TPDR of the i-th perturbation on the j-th topic is calculated as

    $${\mathrm{TPDR}}_{ij} = \frac{{\left| {{\mathrm{TPD}}_{ij}} \right|}}{{\mathop {\sum }\nolimits_{i = 1}^n \left| {{\mathrm{TPD}}_{ij}} \right|}}$$
    (14)

    where \({\mathrm{TPD}}_{ij}\) is calculated by Eq. (13).

  4. Finally, MUSIC defines a score (\({\mathrm{CS}}_{ij}\)) to evaluate the effect of the i-th perturbation on the j-th topic, considering both TPD and TPDR. The larger the score, the higher the rank.

    $${\mathrm{CS}}_{ij} = 0.5 \ast \left( \frac{\left| {\mathrm{TPD}}_{ij} \right| - \min \left( \left| {\mathrm{TPD}}_{i.} \right| \right)}{\max \left( \left| {\mathrm{TPD}}_{i.} \right| \right) - \min \left( \left| {\mathrm{TPD}}_{i.} \right| \right)} + \frac{{\mathrm{TPDR}}_{ij} - \min \left( {\mathrm{TPDR}}_{ij} \right)}{\max \left( {\mathrm{TPDR}}_{ij} \right) - \min \left( {\mathrm{TPDR}}_{ij} \right)} \right)$$
    (15)
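
The four steps above can be sketched in a few lines of NumPy. This is an illustrative reading of Eqs. (12) to (15), not MUSIC's own code, and several choices are assumptions: the function names (`pooled_t`, `minmax`, `topic_scores`) and the `"CTRL"` label are hypothetical, the M cells are drawn without replacement, and all control cells enter every t-test. The index conventions in Eqs. (14) and (15) can be read in more than one way; the sketch follows the indices as printed, so the TPDR denominator sums over perturbations within a topic, the first term of CS is scaled across the topics of one perturbation, and the second term is assumed to be scaled across perturbations within a topic.

```python
import numpy as np

def pooled_t(x, y):
    """Pooled-variance Student t statistic per topic (column), Eq. (13)."""
    nx, ny = x.shape[0], y.shape[0]
    pooled = ((nx - 1) * x.var(axis=0, ddof=1) +
              (ny - 1) * y.var(axis=0, ddof=1)) / (nx + ny - 2)
    return (x.mean(axis=0) - y.mean(axis=0)) / np.sqrt(pooled * (1 / nx + 1 / ny))

def minmax(a, axis):
    """Min-max scale along one axis; constant slices map to 0."""
    lo = a.min(axis=axis, keepdims=True)
    hi = a.max(axis=axis, keepdims=True)
    return (a - lo) / np.where(hi > lo, hi - lo, 1.0)

def topic_scores(topic_prob, labels, control="CTRL", n_boot=1000, seed=0):
    """topic_prob: cells x topics array; labels: perturbation label per cell."""
    rng = np.random.default_rng(seed)
    is_ctrl = labels == control
    ctrl = topic_prob[is_ctrl]
    # Eq. (12): z-normalize every topic against the control population.
    z = (topic_prob - ctrl.mean(axis=0)) / ctrl.std(axis=0, ddof=1)
    perts = [p for p in np.unique(labels) if p != control]
    m = min(int(np.sum(labels == p)) for p in perts)  # minimal cell number M
    tpd = np.empty((len(perts), z.shape[1]))
    for i, p in enumerate(perts):
        cells = z[labels == p]
        # Sample M cells, run the t-test, repeat, keep the median statistic.
        stats = [pooled_t(cells[rng.choice(len(cells), size=m, replace=False)],
                          z[is_ctrl]) for _ in range(n_boot)]
        tpd[i] = np.median(stats, axis=0)
    abs_tpd = np.abs(tpd)
    tpdr = abs_tpd / abs_tpd.sum(axis=0, keepdims=True)          # Eq. (14) as printed
    cs = 0.5 * (minmax(abs_tpd, axis=1) + minmax(tpdr, axis=0))  # Eq. (15), axes assumed
    return perts, tpd, tpdr, cs

# Synthetic data: perturbation "A" shifts topic 0, "B" behaves like control.
rng = np.random.default_rng(1)
topic_prob = np.vstack([
    rng.dirichlet([5, 5, 5], size=300),   # control cells
    rng.dirichlet([15, 5, 5], size=80),   # perturbation A
    rng.dirichlet([5, 5, 5], size=50),    # perturbation B
])
labels = np.array(["CTRL"] * 300 + ["A"] * 80 + ["B"] * 50)
perts, tpd, tpdr, cs = topic_scores(topic_prob, labels, n_boot=200)
```

Sorting the rows of `cs` within one column then gives a topic-specific ranking of the perturbations under these assumptions.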

MUSIC also calculates a threshold to determine whether a perturbation has a statistically significant impact on a specific topic. Intuitively, the impact of a perturbation on a functional topic is significant if it is greater than that generated randomly. MUSIC first randomly selects M control cells and computes \({\mathrm{TPD}}_{{\mathrm{random}},j}\) by Eq. (16), then applies the same procedure as above to obtain the score (CS) between the selected cells and all control cells. This process is repeated 1000 times, and the median is taken as the threshold. The impact of the i-th perturbation on a specific topic j is considered significant when \({\mathrm{CS}}_{ij}\) is larger than the threshold.

$${\mathrm{TPD}}_{{\mathrm{random}},j} = \frac{{\bar X_{{\mathrm{random}},j} - \bar X_{{\mathrm{control}},j}}}{{\sqrt {\left( {\frac{{\left( {n_{{\mathrm{random}}} - 1} \right)S_{{\mathrm{random}},j}^2 + \left( {n_{{\mathrm{control}}} - 1} \right)S_{{\mathrm{control}},j}^2}}{{n_{{\mathrm{random}}} + n_{{\mathrm{control}}} - 2}}\left( {\frac{1}{{n_{{\mathrm{random}}}}} + \frac{1}{{n_{{\mathrm{control}}}}}} \right)} \right)} }}$$
(16)

where \(\bar X_{{\mathrm{random}},j}\) is the mean of normalized topic probabilities calculated in Eq. (12) for the M selected control cells on the j-th topic.
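
A minimal sketch of the random baseline in Eq. (16) is given below. It assumes that the control topic probabilities have already been z-normalized as in Eq. (12) and that the M cells are drawn without replacement; the function names are hypothetical. The conversion of these statistics into CS values is omitted, because it reuses the scoring step described above.

```python
import numpy as np

def pooled_t(x, y):
    """Pooled-variance Student t statistic per topic (column)."""
    nx, ny = x.shape[0], y.shape[0]
    pooled = ((nx - 1) * x.var(axis=0, ddof=1) +
              (ny - 1) * y.var(axis=0, ddof=1)) / (nx + ny - 2)
    return (x.mean(axis=0) - y.mean(axis=0)) / np.sqrt(pooled * (1 / nx + 1 / ny))

def tpd_random(z_ctrl, m, n_rep=1000, seed=0):
    """Eq. (16): t statistic of M randomly selected control cells
    against all control cells, repeated n_rep times."""
    rng = np.random.default_rng(seed)
    draws = np.empty((n_rep, z_ctrl.shape[1]))
    for r in range(n_rep):
        idx = rng.choice(z_ctrl.shape[0], size=m, replace=False)
        draws[r] = pooled_t(z_ctrl[idx], z_ctrl)
    return draws

# Synthetic, already-normalized control cells (300 cells x 4 topics).
z_ctrl = np.random.default_rng(2).normal(size=(300, 4))
draws = tpd_random(z_ctrl, m=50, n_rep=500)
null_tpd = np.median(draws, axis=0)  # per-topic median over repetitions
```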

Obtaining the overall perturbation effect ranking list

To obtain the overall perturbation effect ranking list, MUSIC calculates, for each perturbation, the sum of the absolute TPD values over all topics (TPDS):

$${\mathrm{TPDS}}_i = \mathop {\sum }\limits_{j = 1}^n \left| {{\mathrm{TPD}}_{ij}} \right|$$
(17)

Note that, in practice, the TPD used here is adjusted by performing the same bootstrapping on control cells. Specifically, the adjusted TPD (TPDA) is calculated as

$${\mathrm{TPDA}}_{ij} = {\mathrm{TPD}}_{ij} - {\mathrm{TPD}}_{{\mathrm{random}},j}$$
(18)
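
As a small worked example, the following sketch applies Eq. (18) and then Eq. (17) to made-up numbers. Summing the absolute adjusted values (TPDA rather than the raw TPD) is our reading of the note above; all values and variable names are illustrative.

```python
import numpy as np

# Rows: perturbations, columns: topics (made-up TPD values).
tpd = np.array([[4.0, -1.0, 0.5],
                [0.3,  0.2, -0.1]])
# One random-control baseline per topic, as in Eq. (16).
tpd_random = np.array([0.1, -0.2, 0.0])

tpda = tpd - tpd_random            # Eq. (18), broadcast over perturbations
tpds = np.abs(tpda).sum(axis=1)    # Eq. (17) applied to the adjusted values
ranking = np.argsort(-tpds)        # perturbation indices, largest effect first
```

Here `tpds` evaluates to 5.2 and 0.7, so the first perturbation is ranked above the second.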

Obtaining the relationships between different perturbations

MUSIC quantifies the relationship between two perturbations by calculating the Pearson correlation coefficient between their TPDA profiles. Furthermore, MUSIC can automatically visualize the perturbation correlation network for each dataset.
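
The pairwise correlations can be computed in a single call, as in the sketch below. The TPDA values are made up, and the 0.8 cutoff used to keep edges for a network is an arbitrary illustration rather than a MUSIC parameter.

```python
import numpy as np

names = ["A", "B", "C"]
# Rows: perturbations, columns: topics (made-up TPDA profiles).
tpda = np.array([[ 3.0, -1.0,  0.5, 0.0],
                 [ 2.5, -0.8,  0.7, 0.1],
                 [-2.0,  1.5, -0.3, 0.2]])

# np.corrcoef treats each row as one variable, so corr[a, b] is the
# Pearson correlation between the profiles of perturbations a and b.
corr = np.corrcoef(tpda)

# Keep strongly correlated or anti-correlated pairs as network edges.
edges = [(names[a], names[b], float(corr[a, b]))
         for a in range(len(names))
         for b in range(a + 1, len(names))
         if abs(corr[a, b]) >= 0.8]
```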

Prioritizing perturbation effect difference under different treatment conditions

When cells are treated under different experimental conditions, MUSIC can prioritize the difference in perturbation effect between two conditions and identify the perturbations whose effect changes substantially. Intuitively, MUSIC compares the TPDS of each perturbation under the two conditions and identifies those perturbations whose impact changes markedly. Specifically, MUSIC first selects the perturbations common to both conditions and then defines the perturbation impact difference (PID) score to quantify the difference in perturbation impact between the two experimental conditions. For a perturbation i, \({\mathrm{PID}}_i\) is calculated as

$${\mathrm{PID}}_i = \frac{{{\mathrm{TPDS}}({\mathrm{condition}}\_2)_i}}{{\mathop {\sum }\nolimits_{k = 1}^n {\mathrm{TPDS}}({\mathrm{condition}}\_2)_k}}/\frac{{{\mathrm{TPDS}}({\mathrm{condition}}\_1)_i}}{{\mathop {\sum }\nolimits_{k = 1}^n {\mathrm{TPDS}}({\mathrm{condition}}\_1)_k}}$$
(19)

where n is the number of common perturbations under two conditions and TPDS is calculated by Eq. (17).
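
Eq. (19) can be illustrated with a short sketch. The function name `perturbation_impact_difference` and the TPDS values are hypothetical; the only steps taken from the text are restricting to the common perturbations and dividing the two normalized TPDS shares.

```python
import numpy as np

def perturbation_impact_difference(tpds_cond1, tpds_cond2):
    """Eq. (19): ratio of each perturbation's share of the total TPDS
    under condition 2 to its share under condition 1."""
    common = sorted(set(tpds_cond1) & set(tpds_cond2))
    a = np.array([tpds_cond1[p] for p in common], dtype=float)
    b = np.array([tpds_cond2[p] for p in common], dtype=float)
    pid = (b / b.sum()) / (a / a.sum())
    return dict(zip(common, pid))

# Made-up TPDS values; "D" is missing under condition 2 and is dropped.
cond1 = {"A": 2.0, "B": 6.0, "C": 2.0, "D": 1.0}
cond2 = {"A": 6.0, "B": 3.0, "C": 3.0}
pid = perturbation_impact_difference(cond1, cond2)
```

A PID above 1 means that the perturbation accounts for a larger share of the total effect under condition 2 than under condition 1 (here A, with a PID of 2.5), and a PID below 1 means the opposite.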

Comparisons between negative control and blank control

Because the preceding steps rely on comparisons between perturbed and negative control cells, we performed a statistical test comparing the negative control with the blank control to assess the suitability of using the negative control in these experiments.

First, the negative control (cells induced with non-targeting gRNAs) and the blank control (cells with no gRNAs induced) are expected to differ slightly in single-cell CRISPR screening experiments. In previous studies4,5,6,7,8, researchers in this community have tended to choose the negative control rather than the blank control to keep the comparison relatively fair, since it is necessary to eliminate the effects of the induction itself on the cells.

Second, the differences between the negative control and the blank control should be less significant than those between knockouts/knockdowns and the blank control. To verify this, we performed the following test with the stimulated Jurkat cell dataset7, which includes cells without any induction of gRNAs (blank control). Routine imputation and filtering were performed on these cells. A bootstrap sampling strategy was then applied to the blank control cells: 10% of them were randomly selected and compared with the negative control cells and with the cells of each knockout. We calculated the similarity for each such comparison over 100 samplings. The statistical comparison is shown in Supplementary Fig. 15. The negative control cells are significantly more similar to the blank control than any of the knockouts (t-test p < 2.2e−16).

Robust test

For each dataset, we randomly relabeled 20% of the control cells as a control test subset, processed it alongside the knockouts or knockdowns, and calculated its rank in the overall perturbation effect ranking. We then calculated the fraction of knockouts or knockdowns ranked below the control test subset among all knockouts or knockdowns. This process was repeated 10 times for each dataset to reduce randomness. The average fraction is about 0.06 across all available datasets, indicating that the control test subsets generally disturb the final ranking list only slightly. In addition, for each dataset, the Pearson correlation coefficient was calculated, as described above, between the overall perturbation effect ranking obtained from this random test and that from the original studies. The average Pearson correlation coefficient is 0.82, further indicating that the data preprocessing steps in MUSIC are reliable and robust to random noise.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.