A novel approach to remove the batch effect of single-cell data

Analyzing


Methods
Here, we provide more details about the workflow of BEER. The workflow includes two main parts (Fig. 1a). In the first part, for each batch, BEER preprocesses [M1] the data and conducts t-distributed Stochastic Neighbor Embedding (tSNE) to reduce the data to one-dimensional values. BEER then groups cells (by default, 10 cells per group) based on the order of the one-dimensional values, and aggregates the expression profiles of the cells in each group to obtain a representative expression profile for that group. Next
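The grouping step described above can be sketched in base R. This is only an illustration under stated assumptions: the matrix `expr`, the vector `one_d`, and the use of a sum as the aggregation are hypothetical choices, not taken from the BEER source code.

```r
# Hypothetical inputs: `expr` is a genes x cells matrix and `one_d` holds each
# cell's one-dimensional tSNE value (both simulated here for illustration).
set.seed(1)
n_genes <- 5
n_cells <- 25
expr <- matrix(rpois(n_genes * n_cells, lambda = 3), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes),
                               paste0("cell", 1:n_cells)))
one_d <- rnorm(n_cells)

group_size <- 10  # default number of cells per group, as stated in the text

# Order cells along the one-dimensional axis, then cut the ordering into
# consecutive chunks of `group_size` cells (the last chunk may be smaller).
cell_order <- order(one_d)
group_id <- ceiling(seq_along(cell_order) / group_size)

# Aggregate the profiles of the cells in each group.
# Assumption: the aggregation is a sum; a mean would work the same way.
group_expr <- sapply(split(cell_order, group_id), function(idx) {
  rowSums(expr[, idx, drop = FALSE])
})
colnames(group_expr) <- paste0("group", colnames(group_expr))
```

Each column of `group_expr` is then the representative profile of one group of neighbouring cells.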

M1 (Preprocessing):
For each input expression matrix, we use the "Seurat" package in R to conduct normalization. First, we use the "NormalizeData" function to normalize the data ("LogNormalize", scale.factor=10000). Then, we use the "ScaleData" function to standardize the data [vars.to.regress = c("nUMI")]. Finally, we use "RunPCA" to calculate a number (default is 50) of PCA subspaces and use those subspaces to conduct the subsequent t-distributed Stochastic Neighbor Embedding (tSNE) (Haghverdi, et al., 2018).
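To make the arithmetic behind these steps concrete, here is a minimal base-R sketch of log-normalization with a scale factor of 10000, followed by per-gene standardization and PCA. It does not call Seurat; the simulated matrix `counts` and all variable names are hypothetical, and the regression of nUMI is deliberately left out.

```r
# Hypothetical raw count matrix (genes x cells), simulated for illustration.
set.seed(1)
counts <- matrix(rpois(50 * 20, lambda = 5), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), paste0("cell", 1:20)))

scale_factor <- 10000  # same value as in the text

# Log-normalization: divide each cell (column) by its total count,
# multiply by the scale factor, then apply log(1 + x).
cell_totals <- colSums(counts)
log_norm <- log1p(sweep(counts, 2, cell_totals, "/") * scale_factor)

# Standardize each gene across cells (plain centring and scaling only;
# regressing out nUMI, as done in the text, is not reproduced here).
scaled <- t(scale(t(log_norm)))

# PCA on cells; rows of `cell_pcs` are cells, columns are PCA subspaces.
pca <- prcomp(t(scaled), center = FALSE, scale. = FALSE)
cell_pcs <- pca$x
```

In an actual run, the Seurat functions named above would take the place of these manual steps.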

M3 (Combine Data):
We simply combine the two expression matrices. Only the genes that overlap between them are used to generate the combined matrix.
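As an illustration, combining two matrices on their overlapping genes might look like the following in base R. The toy matrices `batch1` and `batch2` and the `batch_label` vector are hypothetical names introduced for this sketch.

```r
# Two hypothetical batches (genes x cells) with partly different gene sets.
batch1 <- matrix(1:12, nrow = 4,
                 dimnames = list(c("A", "B", "C", "D"), paste0("b1_cell", 1:3)))
batch2 <- matrix(101:108, nrow = 4,
                 dimnames = list(c("B", "D", "E", "A"), paste0("b2_cell", 1:2)))

# Keep only genes present in both batches, in the same row order.
shared_genes <- intersect(rownames(batch1), rownames(batch2))
combined <- cbind(batch1[shared_genes, , drop = FALSE],
                  batch2[shared_genes, , drop = FALSE])

# Remember which batch each column (cell) came from.
batch_label <- rep(c("Batch1", "Batch2"), c(ncol(batch1), ncol(batch2)))
```

Indexing both matrices by `shared_genes` guarantees that rows line up even when the two inputs list genes in different orders.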

M4 (Normalization of Combined Data):
We use the "FindVariableGenes" function in "Seurat" to identify variable genes (default parameters). For the other steps, please refer to M1.

M5 (Correlation Test):
For each subspace, we generate two subspace-value lists: the first for Batch1 and the second for Batch2. For each MN-paired group, we use the "quantile" function in R to get five quantile values for each batch, and then append the quantile values of Batch1 and Batch2 to the end of the first and the second subspace-value list, respectively. After going through all MN-paired groups, we use "cor.test(method='kendall')" in R to test the correlation between the two subspace-value lists.
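The procedure above can be sketched for a single subspace as follows. The simulated `mn_pairs` list is a hypothetical stand-in for the MN-paired groups; only "quantile" and "cor.test" are taken from the text.

```r
set.seed(1)

# Hypothetical MN-paired groups: each element holds the values of one
# subspace (e.g. one PC) for the cells of the paired Batch1 and Batch2 groups.
mn_pairs <- lapply(1:8, function(i) {
  shift <- rnorm(1, sd = 3)
  list(batch1 = rnorm(10, mean = shift), batch2 = rnorm(10, mean = shift))
})

list1 <- c()  # subspace-value list for Batch1
list2 <- c()  # subspace-value list for Batch2
for (pair in mn_pairs) {
  # quantile() with default arguments returns five values
  # (0%, 25%, 50%, 75%, 100%).
  list1 <- c(list1, quantile(pair$batch1))
  list2 <- c(list2, quantile(pair$batch2))
}

# Kendall rank correlation between the two lists.
res <- cor.test(list1, list2, method = "kendall")
tau <- unname(res$estimate)
```

Repeating this for every subspace gives one correlation estimate per PC.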

Silhouette Plot
We use the "silhouette" function in R to calculate the Silhouette Coefficient, which can be used to evaluate how well cells of different types are separated. Higher Silhouette Coefficients indicate that cells of different types are better separated. In the above figure, "Oligodend Merged" means that we merge the oligodendrocytes of Batch1 and Batch2 into one cluster (oligodendrocyte is the only cell type shared by the two batches) and use all cell types of the two batches to draw this plot. "Astro & OPC & Microglia" means that we only use astrocytes, OPC, and microglia to draw this plot. "Oligidend & Interneuron & Pyramida_SS" means that we use oligodendrocytes, interneurons, and pyramidal SS cells to draw this plot. In all three benchmarks, BEER achieves high Silhouette Coefficients.
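For readers unfamiliar with the calculation, a small sketch using "silhouette" from the "cluster" package is shown here. The simulated two-dimensional `embedding` and the integer `cell_type` labels are hypothetical and are not the data behind the figure.

```r
library(cluster)
set.seed(1)

# Hypothetical 2-D embedding with three well separated cell types.
cell_type <- rep(1:3, each = 20)
centres <- rbind(c(0, 0), c(10, 0), c(0, 10))
embedding <- centres[cell_type, ] + matrix(rnorm(120), ncol = 2)

# silhouette() takes integer cluster labels and a distance object.
sil <- silhouette(cell_type, dist(embedding))

# Average silhouette width over all cells; values near 1 mean that
# cells sit much closer to their own type than to any other type.
mean_width <- mean(sil[, "sil_width"])
```

Calling `plot(sil)` on the result draws the per-cell silhouette widths grouped by cluster.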

Requirement
Please install R (>=3.5) and two R packages: "Seurat" and "pcaPP".
Please install Python and one Python package: "umap-learn".
Then, all the other batches will be compared with the "MAXBATCH".
Here is a demo (two oligodendroglioma samples) of inspecting whether a PC removed by BEER has biological meaning.

Detect PCs with both batch effects and biological variances
After obtaining the BEER object (default name is "mybeer") by following the instructions on our website (https://github.com/jumphone/BEER), users can use the following command to visualize the batch effect of each PC:

    plot(mybeer$cor, xlab='PCs', ylab="COR", pch=16)

The above result shows that PC10 has the strongest "batch effect". Then, users can use the

    write.table(TOP, file='TOP100.txt', quote=F, row.names=F, col.names=F, sep='\t')

Finally, users can use an enrichment method to test the biological meaning of those genes.
Here is the enrichment result (KEGG) of PC10's signature genes, obtained using ClueGO:
The pathways above are related to PC10 and may reflect biological variance that co-occurs with the batch effect.