Abstract
Effective dimension reduction is essential for single cell RNAseq (scRNAseq) analysis. Principal component analysis (PCA) is widely used, but requires continuous, normally-distributed data; therefore, it is often coupled with log-transformation in scRNAseq applications, which can distort the data and obscure meaningful variation. We describe correspondence analysis (CA), a count-based alternative to PCA. CA is based on decomposition of a chi-squared residual matrix, avoiding distortive log-transformation. To address overdispersion and high sparsity in scRNAseq data, we propose five adaptations of CA. These adaptations are fast and scalable, and they outperform standard CA and glmPCA, yielding cell embeddings with better or comparable clustering accuracy in 8 out of 9 datasets. In particular, we find that CA with Freeman–Tukey residuals performs especially well across diverse datasets. Other advantages of the CA framework include visualization of associations between genes and cell populations in a “CA biplot,” and extension to multi-table analysis; we introduce corralm for integrative multi-table dimension reduction of scRNAseq data. We implement CA for scRNAseq data in corral, an R/Bioconductor package which interfaces directly with single cell classes in Bioconductor. Switching from PCA to CA is achieved through a simple pipeline substitution and improves dimension reduction of scRNAseq datasets.
Introduction
Single cell mRNA sequencing (scRNAseq) simultaneously measures the transcript levels of genes in thousands of individual cells, providing a window into the transcriptional and functional diversity of cells in a tissue or experiment. These complex datasets are orders of magnitude larger than those encountered when analyzing “bulk” RNAseq data from tissue samples. While such fine resolution data have the potential to reveal new biological findings, scRNAseq data exhibit sparsity, noisiness, and technical artifacts beyond those seen for bulk RNA samples^{1,2}, necessitating scRNAseq specific preprocessing and normalization^{3,4}. Typically scRNAseq analysis includes the use of dimension reduction to attenuate noise and ensure computational tractability, but the choice of method considerably influences downstream analyses, results, and conclusions^{3,5}.
Selecting an appropriate dimension reduction method is important; an effective method finds a representation of the data that minimizes noise and redundancy, while uncovering meaningful signals that reveal latent structures and patterns within the data^{6,7}. When defined from scRNAseq data, reduced dimension embedding representations are most useful when they preserve meaningful, biologically relevant variation; are robust, meaning that the decomposition of new but similar observations consistently yields a similar embedding space; and generalize and transfer to new data, enabling new observations arising from similar biological processes to be projected into the same latent space.
ScRNAseq counts are generally modeled as multinomially distributed, and are often approximated as negative binomial or Poisson^{2}, reflecting the fact that the data are neither continuous nor approximately Gaussian. As such, use of principal component analysis (PCA) requires that discrete and sparse scRNAseq count data be transformed prior to dimension reduction with this method^{6}. PCA is a linear dimension reduction method that obtains a low-dimensional data representation along orthogonal linear axes such that the proportion of variance accounted for on each axis is maximized in Euclidean space^{4,8,9,10,11}. Because PCA is most suitable for continuous data that are approximately normally distributed, it may exhibit artifacts when applied to data with gradients or to non-continuous data (such as counts); one such artifact, called the “arch” or “horseshoe” effect, occurs when PCA is applied to scRNAseq data without log-transformation^{4,6,12}. So, in practice, and despite known issues with applying log-transformation to scRNAseq count data^{2,13,14}, most single cell workflows begin with a log(x + 1) transformation of the counts matrix, and then use PCA to decompose the resulting “logcounts” data^{3}. The use of logcounts has poor theoretical justification, and in some cases may obscure meaningful variation^{2,14}, but the resulting reduced dimension embeddings of the data from PCA are nonetheless used in scRNAseq clustering, trajectory analysis, and cell type classification^{3}. Several dimension reduction approaches tailored for scRNAseq counts have been proposed, including ZINB-WaVE, the first method appropriate for use with counts, which is based on a zero-inflated negative binomial model for decomposition of counts, and zero-inflated factor analysis (ZIFA)^{2,15,16,17}. Still, PCA remains the most widely used method, largely due to its simplicity, speed, and computational efficiency.
In a comparison of 18 dimension reduction methods, PCA ranked highly when accuracy and performance in downstream analysis were considered with computational scalability^{18}.
Classical matrix factorization methods, including PCA, are instances of the general duality diagram approach proposed by Benzécri and the French school of multivariate statistics in the 1970s^{8,19,20,21,22,23}, which pivots focus from the matrix as columns of fixed variables to the matrix as an operator between inner product spaces, unifying classical multivariate methods like PCA with modern kernel methods into the same framework^{8,21}. Another matrix factorization method that emerges in the duality diagram framework is correspondence analysis (CA), a fast dimension reduction method that is appropriate for non-negative, count-based data, can identify relationships between categorical data types, and is popular among ecologists for analyzing species-by-site abundance count matrices^{8,24}. In practice, PCA is often computed by singular value decomposition (SVD) of column-centered or Z-score normalized data (Fig. 1A)^{4,25}, and CA is computed by SVD of the Pearson residuals to reveal the row-column associations that deviate from expectation^{26}. The principal components in CA partition the codependence between the rows and columns such that a higher weight indicates a stronger dependence or association between row and column; for scRNAseq data, CA principal components can identify codependence between gene expression counts and particular cells. From this perspective, the main difference between the two methods is the space into which the data are transformed and then decomposed. Whereas PCA partitions the variance in Euclidean space, CA partitions the total chi-square of the contingency table along linear additive components^{27}. CA has a long tradition in diverse settings and disciplines, including linguistics, business and marketing research, and archaeology^{26,28}, where it is applied to and further optimized for large, sparse count data.
CA has also been applied in bioinformatics to perform codon usage analysis^{29,30}; to analyze microarray transcriptomics data^{31}; to integrate GO labels with microarray data^{32}; and to analyze metagenomic and microbiome data^{33}. In made4, Culhane et al. implemented CA for microarray and bulk RNAseq data^{34,35,36}. We now propose its application to scRNAseq analysis.
Focusing on the issues of log-transforming scRNAseq counts when applying PCA, Townes et al.^{2}, Hafemeister and Satija^{13}, and Lause et al.^{14} presented approaches to scRNAseq analysis based on Pearson residual normalization as an alternative to distortive log-transformation. Townes et al.^{2} proposed glmPCA, a generalization of PCA that minimizes deviance rather than mean squared error (MSE), accommodates non-canonical link functions, and can be approximated with PCA of Pearson or deviance residuals^{2}. Lause et al. proposed analytic Pearson residual normalization^{14}, extending work from Hafemeister and Satija, who used a regression-based approach to computing Pearson residuals^{13}. Lause et al. cited our open-source Bioconductor workshops, which describe CA; the relationships among CA, PCA, and SVD; and their application in scRNAseq data, as support that glmPCA from Townes et al.^{2}, SCTransform from Hafemeister and Satija^{13}, and their own approach are CA or closely approximate CA^{14,37}. However, CA, which can be computed by SVD on the standardized Pearson residuals, may not be the most appropriate approach when there is overdispersion in the contingency table^{38}.
We propose and evaluate five adaptations of CA to address overdispersion in scRNAseq counts. We benchmark the performance of each of these compared to standard CA and with glmPCA^{2}, a popular method in the field. In particular, we find that CA with Freeman–Tukey residuals, an alternative chisquared statistic, is especially performant across a variety of test cases. Because cell clustering and characterization is a key part of most scRNAseq workflows, we set as the goal of the benchmarking task to find embedding representations that facilitate identifying and annotating complex populations of cells. We show that the CA biplot provides a geometric interpretation of features and objects in the same space, which in turn facilitates efficient exploratory data analysis and cluster interpretation. We implemented standard and adapted CA for scRNAseq in corral, an R/Bioconductor package that interfaces directly with Bioconductor classes (including SingleCellExperiment). Designed for computational scalability, corral is fast and performant compared to PCA and other dimension reduction methods, including glmPCA. Switching from PCA to CA with corral is achieved through a simple pipeline substitution and improves dimension reduction of scRNAseq datasets.
Results
Correspondence analysis: count-based dimension reduction
Standard correspondence analysis (CA) casts scRNAseq read counts in a contingency table analysis framework and in its canonical form can be conceptualized as a two-step procedure (graphically outlined in Fig. 1A; detailed in “Methods”). The count matrix is first transformed to Pearson chi-squared residuals, and the resulting residual matrix is then factored with singular value decomposition (SVD).
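The first of these two steps can be illustrated with a minimal sketch, assuming a small dense genes-by-cells table with no all-zero rows or columns. The function name `pearson_residuals` and the toy matrix are illustrative only and are not part of corral; the SVD step is omitted.

```python
import math

def pearson_residuals(counts):
    """Pearson chi-squared residuals of a count table (rows: genes, columns: cells).

    Assumes every row and every column has a non-zero sum.
    """
    total = sum(sum(row) for row in counts)
    # Row and column masses: marginal proportions of the table.
    row_mass = [sum(row) / total for row in counts]
    col_mass = [sum(col) / total for col in zip(*counts)]
    residuals = []
    for i, row in enumerate(counts):
        out_row = []
        for j, x in enumerate(row):
            expected = row_mass[i] * col_mass[j]  # proportion expected under independence
            out_row.append((x / total - expected) / math.sqrt(expected))
        residuals.append(out_row)
    return residuals

# Toy 3 x 3 count table, invented for illustration.
toy_counts = [[10, 0, 2],
              [1, 8, 1],
              [0, 3, 9]]
S = pearson_residuals(toy_counts)

# Total inertia: the sum of squared residuals equals the chi-squared statistic divided by N.
inertia = sum(v * v for row in S for v in row)
```

The residual matrix `S` is what would then be passed to an SVD routine; the singular vectors give the gene and cell embeddings.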
CA of scRNAseq data does not require, but is compatible with, log-transformed read counts (logcounts). PCA, which has been widely used, requires data transformation, and is therefore generally applied to logcounts data, even though log-transformation of scRNAseq counts distorts latent space representation such that the first dimension is driven by individual cell sparsity, or the number of features with zero observed counts (“zero fraction”)^{2}. Since we propose CA as a more suitable alternative to PCA for finding cell embeddings, we compared CA to the widely used correlation-based PCA^{4}.
We applied both CA and PCA to a ground-truth scRNAseq benchmarking dataset (on both counts and logcounts) obtained by CEL-seq2 sequencing of pseudocell mixtures comprising mRNA from eight distinct groups^{39}. Figure 1B shows the first two principal components for both PCA and CA. The first PCA component clearly separated cells from three of eight clusters, but PC2 only captured a gradient within the groups. In contrast, CA clearly clustered and separated all groups within two components. Similarly, results in purified PBMCs (Zhengmix4eq benchmarking dataset) demonstrated that CA can be applied directly to counts or to logcounts and still achieve good clustering and separation, whereas PCA on counts produced an “arch” or “horseshoe” effect, arising from the presence of a latent sequential ordering or gradient^{12,25}. PCA on logcounts performed similarly to CA on either counts or logcounts.
CA is robust when applied to either counts or logcounts data, obviating the need for logtransformation and avoiding its associated issues. We compared the performance of the four pipeline configurations presented in Fig. 1C (CA and PCA on counts and logcounts) on six reference benchmark datasets—three scRNAseq datasets from SCMixology (known cell mixture of three cancer lines sequenced with three technologies)^{39} and three Zhengmix PBMC datasets^{40,41}. (Datasets listed in the Benchmarking section of “Methods”). Cluster recovery based on the annotated cell types in the study was assessed using Adjusted Rand Index (ARI), which assesses similarity between two sets of data partitions (Fig. 1D). In all comparisons, CA outperforms or matches PCA’s performance (orange circle indicates highest ARI per dataset).
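For reference, ARI can be computed from the contingency table of two label vectors using the standard pair-counting formula. The sketch below is a plain Python illustration, not the benchmarking code used in this study; the function name is hypothetical and at least two labelled cells are assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same cells."""
    n = len(labels_a)
    # Pairs of cells placed together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions put all cells in one cluster
    return (together_both - expected) / (maximum - expected)

# ARI depends only on the grouping, not on the label names.
annotated = ["B", "B", "NK", "NK", "mono"]
clustered = [2, 2, 0, 0, 1]
score = adjusted_rand_index(annotated, clustered)
```

A score of 1 indicates identical partitions, values near 0 indicate chance-level agreement, and negative values indicate less agreement than expected by chance.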
Comparison of CA approaches that address overdispersion
CA can be influenced by “rare objects” or outliers^{38}. Due to high underlying heterogeneity of gene expression within and between various cell types, scRNAseq data often include biologically “real” outliers as opposed to artifacts due to noisy data. For example, professional secretory cells have a distinct biological profile often driven by extraordinarily high production of one or two proteins, such as insulin in pancreatic islet cells or immunoglobulin in immune cells. Similarly, senescent or quiescent cells differ in gene expression profile compared to rapidly dividing cells or high-grade tumor cells.
We propose and evaluate five unique adaptations of CA to address overdispersion in scRNAseq counts. In total, six CA methods (standard CA and the five adaptations) were applied to nine datasets, including the three Zhengmix human PBMC benchmarking datasets, as well as cells from human pancreas, human brain, and Xenopus tail (Table 1). Cluster recovery performance on cell embedding representations generated from each specific method was compared and benchmarked in reference to glmPCA^{2}, based on the partition similarity of the new clusters with the original annotated cell populations from each dataset (measured with ARI; detailed in “Methods” – Benchmarking).
The five adaptations of CA fall into three general approaches (Fig. 2A). The first class of approaches was to explicitly apply a variance-stabilizing transformation to the count matrix prior to computing Pearson residuals. Lause et al.^{14} discussed variance-stabilizing transformation as compared to Pearson residual normalization, though in their study did not combine variance stabilization and Pearson residual normalization prior to matrix decomposition. They reported that the degree of correction from variance-stabilizing transformation alone was insufficient for scRNAseq data in their pipeline configuration and found that only normalizing with analytic Pearson residuals was more effective than only applying variance stabilization^{14}. Given that scRNAseq counts are often approximated as Poisson-distributed, we considered three variance-stabilizing transformations that are typically applied to count data. These three square-root based transformations all originate from R.A. Fisher’s observation that performing an arccosine transformation on the square root of multinomial probabilities yields approximately normally distributed angles on a hypersphere^{42}. The first was square root transformation of count data (Row 3 of Fig. 2A), which has been used to correct overdispersion in Poisson counts^{43}. The second was Anscombe’s variance-stabilizing count transformation (Row 4 of Fig. 2A), originally proposed in 1948 for use with Poisson, binomial, and negative binomial data^{44}. The third was the Freeman–Tukey variance-stabilizing count transformation (Row 5 of Fig. 2A), originally proposed in 1950, also for Poisson and other count data^{45}.
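The three transformations can be sketched as follows, using their textbook Poisson forms. This is an illustration under that assumption; the exact forms and any scaling used in corral are given in “Methods”, and the function names here are hypothetical.

```python
import math

def sqrt_transform(x):
    # Square root transformation of a count.
    return math.sqrt(x)

def anscombe_transform(x):
    # Anscombe (1948) transformation for Poisson counts: 2 * sqrt(x + 3/8).
    return 2.0 * math.sqrt(x + 3.0 / 8.0)

def freeman_tukey_transform(x):
    # Freeman-Tukey (1950) transformation: sqrt(x) + sqrt(x + 1).
    return math.sqrt(x) + math.sqrt(x + 1.0)

def stabilize(counts, transform):
    """Apply a variance-stabilizing transformation element-wise to a count matrix."""
    return [[transform(x) for x in row] for row in counts]

# Toy counts with one large value, invented for illustration.
toy_counts = [[0, 4, 100],
              [1, 9, 0]]
stabilized = stabilize(toy_counts, freeman_tukey_transform)
```

In this first class of approaches, the stabilized matrix replaces the raw counts as input to the Pearson residual computation, so large counts contribute less extreme residuals.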
Our results indicate that variance stabilization improves performance of standard (classical) CA. Variance stabilization of counts prior to computing Pearson residuals provided substantial gains in downstream clustering, with ARI increases of 0.4 in two studies (Zhengmix4uneq, Aztekin Xenopus tail); square-root transformation prior to CA increased ARI in 7 datasets, while transformation to Anscombe counts or Freeman–Tukey counts increased ARI in every dataset when compared to standard CA (with no variance stabilization of counts prior to computing Pearson residuals). Indeed, Anscombe’s variance-stabilizing count transformation achieved the highest observed ARI in 1 of 9 test datasets (pancreas: Lawlor) and the Freeman–Tukey variance-stabilizing count transformation had the best overall performance in 3 of 9 datasets (Zhengmix4uneq; pancreas: Muraro, Lawlor). Although the square root count transformation did not outperform the other two transformations in any of the comparisons, its ARI was within 0.05 of the other two transformations in 7 of 9 datasets. Furthermore, in the pancreas datasets, variance-stabilizing count transformation coupled with standard CA yielded the highest ARI overall, outperforming glmPCA.
The second variation we considered is “power deflation” as a data smoothing method. Power deflation handles extreme outliers in the chi-squared residual matrix by raising all transformed residual values to a power, \(\alpha\), prior to performing SVD, while preserving sign (Bottom row of Fig. 2A). Conceptually, this procedure is similar to the Tukey ladder transformation^{46}, and has a smoothing effect on the matrix of chi-squared distances, reducing the impact of outlying values while preserving the ordering of values. To achieve a “soft” smoothing effect, we considered \(\alpha \in [0.9, 0.98]\) (data not shown) and present results for \(\alpha = 0.9\) in Fig. 2. This approach is also similar to the classic square root variance-stabilizing transformation for Poisson counts, which corresponds to the special case \(\alpha = 0.5\), but it differs in that the transformation is applied to the chi-squared residual matrix rather than to the count matrix. In all nine datasets, this power deflation smoothing approach performed comparably to, or better than, standard CA, although its impact on CA performance was less than that of variance-stabilizing count transformation.
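Read literally, power deflation maps each residual \(s\) to \(\mathrm{sign}(s)\,|s|^{\alpha}\). A minimal sketch of this sign-preserving power transform is given below; the function name is hypothetical and the example residuals are invented.

```python
import math

def power_deflate(residuals, alpha=0.9):
    """Raise the magnitude of each residual to the power alpha, keeping its sign."""
    return [[math.copysign(abs(v) ** alpha, v) for v in row] for row in residuals]

# Invented residual matrix with one extreme positive value.
toy_residuals = [[-0.5, 0.0, 25.0],
                 [0.2, -1.0, 3.0]]
deflated = power_deflate(toy_residuals, alpha=0.9)
```

Because the map is monotone, the ordering of residuals is unchanged; magnitudes above 1 are shrunk, with the largest values reduced the most in absolute terms.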
Third, we considered an alternative chi-squared statistic that is better suited to count data with high levels of sparsity and overdispersion. CA with Freeman–Tukey residuals (CAFT) has been applied to archaeological site data, where it exhibited a variance-stabilizing effect and outperformed standard CA (SVD of the Pearson residuals) in the analysis of sparse, overdispersed artifact data (counts of archaeological artifacts by site)^{45,47,48}. Both Pearson residuals and Freeman–Tukey residuals are members of the Cressie–Read family of power divergence statistics for testing goodness-of-fit in multinomially distributed count data, and when squared, both residuals are chi-square distributed random variables^{47,49}. We found that CAFT is well suited for scRNAseq counts (Row 2 of Fig. 2A): it outperformed standard CA in all nine datasets, and its performance was comparable to (ARI within 0.02) or superior to glmPCA in 8 out of 9 benchmarking datasets. In most datasets CAFT also had higher or comparable clustering accuracy (ARI) relative to standard CA with variance-stabilizing transformation. CAFT achieved the highest ARI overall in 6 out of 9 datasets. Unlike with standard CA, we observed little benefit from combining CAFT with variance-stabilizing transformation (square root, Anscombe, or Freeman–Tukey) (Fig. S1); while the performance of standard CA improves dramatically with variance-stabilizing transformation, CAFT itself adjusts for overdispersion and is appropriate for use with overdispersed data.
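To make the contrast with Pearson residuals concrete, the sketch below computes the classical Freeman–Tukey deviate, \(\sqrt{n_{ij}} + \sqrt{n_{ij}+1} - \sqrt{4e_{ij}+1}\), where \(n_{ij}\) is the observed count and \(e_{ij}\) the count expected under independence. This count-scale form is an assumption made for illustration; the scaling used in corral is specified in “Methods” and may differ by a constant factor. The function name is hypothetical.

```python
import math

def freeman_tukey_residuals(counts):
    """Freeman-Tukey deviates of a count table (rows: genes, columns: cells)."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    residuals = []
    for i, row in enumerate(counts):
        out_row = []
        for j, x in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total  # expected count under independence
            out_row.append(math.sqrt(x) + math.sqrt(x + 1) - math.sqrt(4 * expected + 1))
        residuals.append(out_row)
    return residuals

# Toy table with perfect row-column association, invented for illustration.
toy_counts = [[4, 0],
              [0, 4]]
ft = freeman_tukey_residuals(toy_counts)
```

Unlike the Pearson residual, this statistic does not divide by the expected value, so entries with very small expected counts, which are common in sparse scRNAseq data, do not produce inflated residuals.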
Component selection can greatly influence downstream cell clustering analysis, so we considered clustering performance as a function of the number of components selected (Figs. 2C, S2). The ability to recover “known” clusters (measured with ARI between clustering output and the published cell types) was higher for the simpler mixtures of known, purified cell types (Zhengmix datasets). For the complex tissues examined (Brain; Pancreas; Xenopus tail), the “true” number of cell types is experimentally estimated from the scRNAseq data. There was heterogeneity in the number of cell types described in the same tissue between different studies, possibly because cell annotations can be assigned at low resolution (e.g., T cells), or at high resolution (e.g., CD4 T cells, exhausted CD8 T cells, etc.), depending on the particular study question. For instance, the pancreas datasets Lawlor, Muraro, and Baron described eight, eleven, and fourteen cell types in their respective analyses (Table 1). We observed an association between the number of components and the complexity of the clustering task. More components may capture more total variation in data and thus might increase performance when performing higher resolution annotation. Figure 2C shows that more components generally increased ARI in more complex tissue. However, for datasets where the reference cell type annotations are lower resolution (fewer cell types), including more components could reduce the ARI, since the resulting clusters will be higher resolution (more cell types) and therefore technically less concordant with the original reference. This reveals a limitation of current benchmarking approaches: a new method could find biologically meaningful groups, but perform poorly if scored using ARI on low resolution benchmarking datasets. We observed in our results that the Lawlor and Darmanis datasets, both annotated at lower resolution, showed the steepest decline in ARI clustering performance when more PCs were included.
In contrast, there was little gain and, for some, a reduction in ARI with more components in the Zhengmix datasets, which comprise combinations of distinct PBMC cell types sorted and purified prior to sequencing. In simple datasets, including additional components beyond those that sufficiently capture the biological variance may add stochastic, technical, or systematic noise in the system. Benchmarking each of the methods with ranking by maximum ARI was robust to the number of components; CAFT was consistently most performant, whether the first thirty or fifty (Figs. S2, 2B) components were included in downstream clustering.
CA, CAFT, and other variations generate a nearly deterministic result that is stably reproduced. In contrast, glmPCA is not deterministic, and therefore results may vary substantially when the method is rerun on the same dataset (Figs. 2C and S3). For reproducibility, we tested ten random seed initializations of glmPCA (Fig. 2C), which revealed that glmPCA results are consistent for simpler datasets but in other datasets, such as the Xenopus tail dataset, performance varies dramatically between iterations. In the Lawlor pancreas dataset, one iteration failed, suggesting that results were somewhat dependent on finding a “lucky seed.” In simpler datasets, such as Zhengmix, all methods generated high ARI scores and glmPCA results were consistent between individual runs (Fig. 2C). However, there was greater variation in glmPCA performance with increasing data complexity. For each dataset, we present the average of the maximum ARI achieved in each of 10 runs of glmPCA.
CA variations adapted for overdispersion outperform standard CA or glmPCA in downstream clustering (Fig. 2B). Of the approaches we considered, CAFT was most performant, outperforming standard CA with variance-stabilizing transformation and the power deflation approach.
Geometric interpretation of cell and feature embeddings
The CA biplot provides a natural framework for cluster interpretation, highlighting biologically meaningful relationships among gene expression patterns and cell populations, and may be extended to guide feature selection. Every transformed count (residual) in a CA matrix has an intuitive interpretation, as it is the chisquared test statistic for strength of association between a particular row (expression of a gene) and column (cell). The CA matrix captures the strongest associations between gene expression and cells, highlighting functional contrasts by individual cells and by subpopulations of cells. Biplots visualize associations between features and objects, or in this case, genes and cells. Rather than examining the feature and object embeddings individually, the biplot places both sets of embeddings on the same axes, revealing both the associations that may exist among either rows or columns separately, and also between particular rows and columns^{6,50}. Distance from the origin indicates the magnitude of association; the angular rotation distance (cosine similarity) reflects similarity of the cells (or genes) to each other, or association between cells and genes.
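The two geometric quantities described above, distance from the origin and cosine similarity, can be computed directly from biplot coordinates. The sketch below uses invented two-dimensional coordinates and hypothetical names (`gene_coords`, `cell_centroid`); summarizing a cell population by the mean of its cells' coordinates is an assumption made here for illustration.

```python
import math

def l2_norm(v):
    # Distance from the origin: magnitude of association.
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(u, v):
    # Angular closeness of two points as seen from the origin.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v))

def top_genes_by_norm(coords, k):
    """Return the k gene names farthest from the origin."""
    ranked = sorted(coords, key=lambda name: l2_norm(coords[name]), reverse=True)
    return ranked[:k]

# Hypothetical biplot coordinates in the first two dimensions.
gene_coords = {"geneA": [0.1, 2.0], "geneB": [1.5, -0.2], "geneC": [0.05, 0.1]}
cell_centroid = [0.2, 1.0]  # assumed summary of one cell population

association = cosine_similarity(gene_coords["geneA"], cell_centroid)
strongest = top_genes_by_norm(gene_coords, 2)
```

In this toy example, geneA lies far from the origin and at a small angle to the cell population, the pattern that would suggest a strong gene-population association in a biplot.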
We performed standard CA on the Zhengmix8 PBMC benchmarking dataset, plotting the first two dimensions of the resulting cell and gene embeddings (Fig. 3). The 20 genes with highest weight by L2 norm in the first two dimensions are colored blue, with a corresponding gene label. Cell populations are colored by cell type. The biplot highlights genes that have strong associations with and may discriminate between particular cell populations. For example, natural killer (NK) cells constitutively express granulysin, encoded by the gene GNLY, and although they are not exclusive producers of granulysin, GNLY expression in other cells, like cytotoxic Tcell populations, is driven by immune activation^{51}. The CA biplot shows that GNLY has a high weight in PC2 (far from origin) and has a similar angular rotation as the NK cell population (high cosine similarity). Correspondingly, the inset ridge plots in Fig. 3 showing histograms of log expression in cell populations confirm it is highly expressed specifically in the NK cell population.
Calciumbinding proteins S100A8 and S100A9 (MRP8 and MRP14 respectively) are constitutively expressed in monocytes and neutrophils^{52,53}. Correspondingly, in the CA biplot in Fig. 3, the expression of both genes is strongly associated with the monocyte population (same direction, large magnitude), consistent with the relative logexpression of S100A8 among cell populations (inset plot). Similarly, LYZ encodes for lysozyme, a molecule highly secreted by monocytes^{54}. Reflecting the elevated differential expression of the gene among the monocyte population shown in the inset, the gene is far from the origin while also close in angle to the cell population.
Biplots also inform about genes highly and differentially expressed in multiple cell populations: TYROBP encodes for a signaling adaptor protein (KARAP/DAP12), which was initially identified as a wiring component in NK antiviral and antitumoral function^{55}. TREM1, a KARAP/DAP12associated surface protein, amplifies monocyte, macrophage, and granulocyte activation by cytokines and chemokines following LPS stimulation^{55}. While other lymphoid and myeloid cells may express TYROBP, it has predominantly been observed in NK, monocytes/macrophages, and dendritic cells, consistent with the enriched expression levels in the expected cell types: NK and monocytes. The gene is projected between these cell populations; expression ridge plots confirm that it exhibits elevated expression specifically in NK and monocyte cell populations.
CD74 is part of the MHC class II complex, consistent with both its biplot positioning and expression plot: angularly, it lies closest to the B cell population, but is also rotated slightly towards the monocyte population^{56}. Correspondingly, expression of CD74 is seen in cells of all types but is most elevated in B cells and in some monocytes. Similarly, HLADRA encodes the alpha chain of the HLADR protein, which is a cellsurface receptor in the MHC class II complex^{57}. Both B cells and monocytes are professional antigen presenting cells that require all the machinery of the MHC class II complex, so these genes are important for function of both cell types, and both genes in the biplot are angled between the most relevant cell types, providing a biologically meaningful summary of associations between genes and cell subpopulations.
The CA biplot facilitates unified analysis of cell and gene embeddings, which can inform cluster interpretation and serve as a basis for integrating with (and extending) other methods, such as gene set enrichment analysis and projection of supplementary data into a shared latent space.
Integrative multidataset dimension reduction with corralm
The need to integrate cells from multiple batches motivates continued refinement and development of CA^{10,35,58}. Our multi-table adaptation of CA, implemented as corralm in the corral R/Bioconductor package, operates using indexed or Freeman–Tukey residuals, and finds a joint multi-table embedding. It is suited for light to moderate integration tasks (e.g., different sequencing runs of an experiment). For complex integration tasks with substantial batch effects, corralm may not fully integrate the data because it is a multi-table extension of CA dimension reduction; it is not optimized for batch integration and contains no explicit integration step. Since CA embeddings can be easily substituted for PCA in a pipeline, we investigated whether including corralm in batch integration improved the performance of popular integration methods that include a PCA step, such as FastMNN and Harmony. We compared corralm’s performance with widely used batch integration methods (Fig. 4), including LIGER^{59}, MNNCorrect, Harmony, and Seurat (suggested pipeline including SCTransform normalization and CCA integration), all of which performed well in recent benchmarking studies^{59,60,61,62,63}. To assess corralm as a PCA pipeline substitute, we included in the comparisons corralm coupled with Harmony and MNN.
First, to compare performance in a clear and simple ground-truth scenario, each method was applied to batch integration of the SCMixology benchmarking dataset comprising scRNAseq profiles from a mixture of three cell lines (H2228; H1975; HCC827), obtained in three batches using different library preparation platforms (Drop-seq; CEL-seq2; 10X)^{39}. Second, to compare performance in a more complex, biologically realistic example, the methods were applied to integration of three human pancreas datasets, obtained on different platforms in separate studies: Baron, Lawlor, and Muraro (detailed in “Methods”: Benchmarking).
In the SCMixology dataset, the “ground truth” is unambiguous, and we expect the low-dimensional representation to align data across batches and identify distinct cell line clusters. Figure 4A shows the first two components of the reduced dimension representation from each method. Results from corralm, Harmony with corralm embeddings, SCTransform with CCA, and MNNCorrect with PCA successfully integrate batches while preserving cell line clusters (Fig. 4A, rows 1, 3, 4, 6). In contrast, Harmony (using PCA embeddings, as published) fails at both data integration and cluster detection on these same data (Fig. 4A, row 2). LIGER succeeds in cluster separation but fails in integration, as visualized in the UMAP (Fig. 4A, row 5). Qualitatively, SCTransform with CCA exhibits the best alignment by batch and tightest clusters by cell type, but it is roughly six times slower than corralm and Harmony with corralm: SCTransform with CCA runs in 45 s, while corralm and Harmony with corralm run in 7 s for the equivalent task, allocated one core of a laptop (“Methods”: Benchmarking). LIGER and MNNCorrect are slower still, running in approximately 6 min and 1.25 min, respectively. Although the SCMixology dataset is relatively small (1401 cells), at scale, this difference in runtime would substantially affect the overall speed of a pipeline, demonstrating an advantage of corralm and Harmony with corralm.
Cluster evaluation measures like ARI assess whether clusters can be reidentified, but do not directly quantify how well datasets are integrated in their low dimensional embedding representations. We propose a new metric, scaled variance (SV), for assessing batch integration of datasets comprising similar cell populations across batches (Fig. 4B; detailed in Methods). For each dimension of each embedding, we compute the variance of the subset of observations from each batch and scale by the overall variance in that dimension as a measure of under- or overdispersion of the subset’s embeddings in that dimension. For example, in the SCMixology benchmarking dataset, biologically identical samples were assayed using three library preparation methods (Drop-seq; CEL-seq2; 10X), with each batch expected to have the same distribution of cells. SV values closer to one indicate better integration (more similarity in dispersion) in a given dimension by batch. Consistent with Fig. 4A, the SV plots (Fig. 4B) showed that SCTransform had the best integration, with all SV points very close to one. Similarly, corralm and Harmony with corralm also showed good batch integration, and both outperform Harmony with PCA, which had SV values far from one.
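A minimal sketch of the SV computation as described above follows. It assumes population variance (dividing by the number of cells); the estimator actually used is detailed in Methods, and the function name and toy embedding are hypothetical.

```python
from statistics import pvariance

def scaled_variance(embedding, batches):
    """Per-batch, per-dimension variance divided by the overall variance.

    embedding: list of per-cell coordinate lists; batches: one batch label per cell.
    Returns a dict mapping each batch to a list of SV values, one per dimension.
    """
    n_dims = len(embedding[0])
    overall = [pvariance([cell[d] for cell in embedding]) for d in range(n_dims)]
    sv = {}
    for batch in sorted(set(batches)):
        subset = [cell for cell, label in zip(embedding, batches) if label == batch]
        sv[batch] = [pvariance([cell[d] for cell in subset]) / overall[d]
                     for d in range(n_dims)]
    return sv

# Toy embedding: batches agree in dimension 0 but differ in spread in dimension 1.
embedding = [[0.0, 0.0], [1.0, 4.0], [2.0, 0.0], [3.0, 4.0],
             [0.0, 2.0], [1.0, 2.0], [2.0, 2.0], [3.0, 2.0]]
batches = ["A"] * 4 + ["B"] * 4
sv = scaled_variance(embedding, batches)
```

Here both batches have SV equal to one in the first dimension, while in the second dimension batch A is overdispersed (SV above one) and batch B is underdispersed (SV below one).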
In the more complex and realistic pancreas scRNAseq integration task, the performance of data integration methods was assessed qualitatively by comparing UMAPs (Fig. 4C and S5) and quantitatively with ASW cluster metrics^{64} (Fig. 4D), as in a previous benchmarking study^{62}. Assuming that the given cell type labels from each dataset are ground truth, in an embedding where cell types form compact and perfectly separated clusters, ASW_{cell type} should be close to 1. Batch integration was measured by 1 – ASW_{batch}, where values near 1 (ASW_{batch} near 0) indicate integration and less clustering by batch. Corralm is a simple joint dimension reduction that includes neither optimization for batch nor explicit batch integration steps, and therefore is not expected to outperform methods optimized for batch correction. However, we see that corralm outperforms multibatch PCA (Fig. 4D). Moreover, corralm combines well with integration pipelines: pairing Harmony or MNN correction with corralm embeddings improves the embedding as compared both to corralm alone and to the original pipelines with PCA. In Fig. 4D, we report that corralm (with Freeman–Tukey residuals) coupled with Harmony exhibits comparable performance to the Seurat routine in terms of integration and biological cluster separation. Qualitatively, these UMAPs are similar (Fig. 4C). In contrast, other methods shown in Fig. 4C were less successful in integrating the batches, though they did appear to preserve at least some of the biological structure.
Computational performance of corral’s CA implementation
The corral implementation of CA leverages fast, approximate, partial SVD from the irlba R package^{65}; even when allocated one core on a laptop (“Methods”—Benchmarking), corral runs in under a minute for a dataset of 1,500 features and over 20,000 cells (50 components). Figure 5A shows that for the analogous task, glmPCA takes over an hour, and that across a range of dataset sizes (1500 features), glmPCA’s runtime increases rapidly with the number of cells, while CA (corral) scales much more favorably. As SVD implementations improve, runtime and/or memory use may be further reduced by modularly incorporating these into the corral pipeline. Standard CA and the variations we considered are not sparse implementations; computational performance may be further enhanced with adaptations for sparsity. Since CA has similar computational requirements to PCA, replacing PCA with CA is a simple pipeline substitution.
Discussion
Correspondence analysis (CA) is a statistical technique with a rich theoretical foundation that was first proposed and mathematically characterized nearly a century ago^{66} and which has continued to be developed and extended. CA has been periodically “rediscovered” and adapted in a variety of disciplines^{20,28,67,68,69}, most recently in the field of scRNAseq analysis: several groups have suggested Pearson residual-based normalization prior to matrix decomposition with PCA^{2,13,14}, a routine that is conceptually similar to standard CA. Apart from differences in how the residuals are computed, the main distinction is that PCA applies an additional Z-score normalization step^{4} after the Pearson residuals are computed, rather than directly decomposing the residual matrix with SVD.
Correspondence analysis with Freeman–Tukey chisquared residuals (CAFT) is a simple and effective adaptation of CA for dimension reduction of scRNAseq counts. We compared the performance of CA and five CA variations that address scRNAseq overdispersion, benchmarking these against glmPCA^{2}, a popular method in the field. CAFT was most performant overall in a scRNAseq cluster recovery task. Our analyses also showed that, combined with standard CA (Pearson residuals), incorporating variancestabilizing transformations and “power deflation” smoothing both improve performance in downstream clustering tasks, as compared to standard CA alone. Therefore, for dimension reduction of scRNAseq data, we recommend using CAFT or, when using standard CA, incorporating variancestabilization and/or smoothing.
Data normalization and dimension reduction significantly impact downstream scRNAseq analyses. Performance of dimension reduction approaches depends on variance structure, noise, and other characteristics of a dataset; we find, as has been reported elsewhere^{18}, that performance of methods varies depending on the characteristics of individual datasets. Benchmarking studies are limited by the lack of robust reference datasets reflecting the depth of complexity and nuance in actual biological research; most high-quality, “ground-truth” benchmarking datasets are derived from simple “pseudo”-cell mixtures, or from pools of distinct cell types. Neither reflects the true diversity of cell types in tissues, nor the properties of real-world research data. Typically, parameters like the number of “true” clusters are unknown a priori and depend on the specific research question and context. A complementary approach is to consider benchmarking datasets obtained by sequencing complex tissue samples, although these datasets have their own disadvantages: cells in such studies are assigned identities based on one analytical method (and for one particular set of study objectives) without a way of independently validating the assignments. Therefore, these single-context annotations set an overly narrow standard for future benchmarking studies of other methods, which can never outperform the method used for initial assignment. With advances in systematic benchmarking frameworks for complex datasets in different contexts, we will be better equipped to test the merits of each approach and identify optimal approaches based on data characteristics.
As such, the analyses we present here are somewhat limited by the context-specific annotations of our benchmarking datasets, since we use as the ground truth labels the original annotations published with these datasets. Except for SCMixology and Zhengmix (both comprising well-defined cell clusters and by design simpler than data from complex tissues), the datasets we analyzed did not have independently validated cell type annotations, so performance is limited by the original cell type assignments. Even if a given method better distinguishes important subpopulations or rare cell types in clustering, these advantages may not be reflected in the ARI, and the method would actually receive a small penalty for differences from the “reference.” Given the complexity of and subjectivity inherent in cell cluster annotation, researchers may call different cell populations or clusters from the same dataset, depending upon the research objectives. The diversity of research questions and data challenges in single cell biology necessitates a breadth of statistical and computational approaches. The robust conceptual framework for CA and its empirical performance advantages over PCA argue for its application in scRNAseq analyses.
We implemented CA, CAFT, and other variations that adjust for overdispersion of scRNAseq data in the R/Bioconductor package corral (including documentation, tutorials, vignettes), enabling its integration into commonly used analytical pipelines^{3,37}. We conclude with ideas for future development: CA, especially when situated within the broader duality diagram framework, can serve as both a platform for and a rich source of further methods development. By simultaneously visualizing both cell and gene embeddings, the CA biplot emphasizes the row-column duality inherent in these data, facilitating joint analysis of genes and cells. The unified approach to analysis of gene and cell embeddings provides a natural framework to extend and/or integrate with other approaches, including gene set enrichment analysis, supervised decomposition, and projection of supplementary data into shared latent space (for example, with an approach similar to that used previously in mogsa and omicade4^{10,34,36}). Embeddings can be used as matrix operators to project supplementary data into shared latent space, enabling multimodal and multibatch integration, as well as fast approximation methods. Matrix projection via multiplication is fast and scalable, even for very large datasets, and in future extensions can serve as the basis for fast, approximate dimension reduction approaches based on decomposing a representative subset of the data and then projecting the full matrix into the resulting space. As advances in library preparation methods enable sequencing of ever-larger numbers of individual cells, computational considerations are critical in selecting analytical methods and designing scRNAseq pipelines.
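The projection idea described above can be illustrated with a minimal sketch. It is written in Python with NumPy rather than R, is not part of corral, and assumes that both the reference and the supplementary data have already been transformed to residual matrices over the same features; how supplementary counts should be transformed is left open. The function names `fit_reference` and `project_columns` are hypothetical. Given a reference decomposition with left singular vectors U and singular values D, new columns are projected as M_new^T U D^{-1}.

```python
import numpy as np

def fit_reference(M_ref, n_components):
    """Truncated SVD of a reference (features-by-cells) residual matrix."""
    U, d, Vt = np.linalg.svd(M_ref, full_matrices=False)
    k = n_components
    return U[:, :k], d[:k], Vt[:k].T

def project_columns(M_new, U, d):
    """Project new columns (same row features) into the reference space."""
    return (M_new.T @ U) / d          # V_new = M_new^T U D^{-1}

rng = np.random.default_rng(0)
M_ref = rng.normal(size=(20, 50))     # toy stand-in for a residual matrix
U, d, V = fit_reference(M_ref, n_components=5)

# Projecting the reference columns themselves recovers their embedding V.
V_proj = project_columns(M_ref, U, d)

# Supplementary columns are embedded without recomputing the SVD.
M_sup = rng.normal(size=(20, 7))
V_sup = project_columns(M_sup, U, d)
```

Because the projection is a single matrix product, its cost grows linearly with the number of supplementary cells.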
Methods
Standard correspondence analysis on a single table
Similar to many other matrix factorization methods, correspondence analysis comprises two main steps: a data transformation routine (see also Fig. 1A), and a matrix decomposition operation (such as SVD or eigen analysis). In applying “standard” CA to scRNAseq count data, we use SVD to decompose the Pearson residuals of the gene-by-cell expression count matrix, where each residual quantifies the difference between the observed and the expected data. In this case, the expected value is the product of the row and column weights from the original count matrix. A positive residual, indicating that the observed value (count) for that feature/gene and cell pair is higher than expected, suggests an association or codependency; correspondingly, a negative residual indicates a lower value than expected, suggesting a negative association between the expression of a gene and a cell subpopulation. When squared, the residuals are chi-squared distributed random variables, and their sum of squares comprises a chi-squared goodness-of-fit test statistic with \((n-1)(m-1)\) degrees of freedom^{47,70}.
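As a small numerical check of this relationship, the sketch below (Python with NumPy, not the corral implementation) computes Pearson residuals on proportions for a toy count matrix and compares their sum of squares with the chi-squared statistic computed on counts. The two agree up to the factor N, the grand total, which arises because the residuals are computed on proportions. The function name `pearson_residuals` and the toy values are illustrative assumptions.

```python
import numpy as np

def pearson_residuals(X):
    """Pearson residuals on proportions: (p_ij - p_i. p_.j) / sqrt(p_i. p_.j)."""
    X = np.asarray(X, dtype=float)
    P = X / X.sum()                        # abundances p_ij
    row_w = P.sum(axis=1, keepdims=True)   # row weights p_i.
    col_w = P.sum(axis=0, keepdims=True)   # column weights p_.j
    expected = row_w @ col_w               # expected abundance under independence
    return (P - expected) / np.sqrt(expected)

# Toy genes-by-cells count matrix (hypothetical values).
X = np.array([[10, 0, 3],
              [2, 8, 1],
              [0, 5, 9]])
R = pearson_residuals(X)

# Chi-squared statistic computed the textbook way, on counts.
N = X.sum()
expected_counts = np.outer(X.sum(axis=1), X.sum(axis=0)) / N
chi2_counts = ((X - expected_counts) ** 2 / expected_counts).sum()

# N times the sum of squared residuals gives the same statistic.
chi2_from_residuals = N * (R ** 2).sum()
```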
Correspondence analysis is a dual scaling along the rows and the columns of each count matrix.
CA applied to scRNAseq count data proceeds through the following two discrete steps:

1.
Transformation from counts to standardized residuals. Suppose \(\mathbf{X}\) is an \(m\times n\) matrix with \(n\) cells (indexed on \(j\)) in the columns and \(m\) features (indexed on \(i\)) in the rows, comprising observations \({x}_{ij}\). The abundance \({p}_{ij}\), the weight of the \(i\) th row \({p}_{i.}\), and the weight of the \(j\) th column \({p}_{.j}\) for a given observation \({x}_{ij}\) are:
$${p}_{ij}=\frac{{x}_{ij}}{N}\,;\quad N=\sum_{i=1}^{m}\sum_{j=1}^{n}{x}_{ij}$$
$${p}_{i.}=\mathrm{row}\ \mathrm{weights}=\sum_{j=1}^{n}{p}_{ij}\,;\quad {p}_{.j}=\mathrm{col}\ \mathrm{weights}=\sum_{i=1}^{m}{p}_{ij}$$
The expected abundance for observation \({x}_{ij}\) is \({p}_{i.}\,{p}_{.j}\) and is what we would expect to see in a cell assuming there is no relationship between a row and column. The standardized (Pearson) residuals \({r}_{p;ij}\) are the difference between the observed and expected, and can be computed:
$${r}_{p;ij}=\frac{\mathrm{observed}-\mathrm{expected}}{\sqrt{\mathrm{expected}}}=\frac{{p}_{ij}-{p}_{i.}\,{p}_{.j}}{\sqrt{{p}_{i.}\,{p}_{.j}}}$$
This transformation is equivalent to the computation applied in contingency table analysis of categorical data measuring the strength of association between elements in a row and a column. It yields a matrix \({\mathbf{M}}_{\mathbf{S}}\) where the sum of the distances of the points to their centroid (“total inertia”) is the chi-squared statistic of the matrix^{26,28}. As a result of this transformation \({\mathbf{M}}_{\mathbf{S}}\) is centered and should appear more Gaussian, and therefore is appropriate input for SVD.

2.
Matrix decomposition. \({\mathbf{M}}_{\mathbf{S}}\) is decomposed using singular value decomposition (SVD) to find left singular matrix \(\mathbf{U}\), diagonal matrix of singular values \(\mathbf{D}\), and right singular matrix \(\mathbf{V}\) such that:
$${\mathbf{M}}_{\mathbf{S}}=\mathbf{U}\mathbf{D}{\mathbf{V}}^{\mathbf{T}}$$
and
The resulting \(\mathbf{U}\) matrix can be either used as an embedding directly, with each column representing a dimension in the new latent space, or coordinate scores can be computed. Standard coordinate scores are given by dividing the \(\mathbf{U}\) and \(\mathbf{V}\) matrices by the vectors of row weights and column weights, respectively. Principal coordinate scores are given by multiplying the standard coordinate scores by the vector of diagonal values of the matrix \(\mathbf{D}\). The principal coordinate scores differ from the standard coordinate scores by a scalar on each dimension, and both reflect the ordination scores of the features and cells^{38}. Unlike in PCA, where differences in embeddings approximate Euclidean distances, correspondence analysis decomposes the overall chisquared statistic. The value of the underlying chisquared statistic is high when there is an association between a rowcolumn pair of the table.
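The two steps above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, assuming a dense features-by-cells count matrix with no empty rows or columns; it is not the corral implementation, which uses a partial SVD, and the function name `standard_ca` is hypothetical. With features in rows, the rows of `U` index features and the rows of `V` index cells.

```python
import numpy as np

def standard_ca(X, n_components=2):
    """Standard CA sketch: Pearson residuals followed by a truncated SVD."""
    X = np.asarray(X, dtype=float)
    P = X / X.sum()
    row_w = P.sum(axis=1)                  # p_i. (features)
    col_w = P.sum(axis=0)                  # p_.j (cells)
    expected = np.outer(row_w, col_w)
    M_S = (P - expected) / np.sqrt(expected)
    U, d, Vt = np.linalg.svd(M_S, full_matrices=False)
    k = n_components
    return U[:, :k], d[:k], Vt[:k].T, M_S

rng = np.random.default_rng(0)
# Toy counts; adding 1 avoids empty rows or columns (zero weights).
X = rng.poisson(lam=3.0, size=(30, 12)) + 1
U, d, V, M_S = standard_ca(X, n_components=3)

# Keeping all components reproduces the residual matrix exactly.
U_f, d_f, V_f, M_f = standard_ca(X, n_components=12)
```

Because the residual matrix is centered, its smallest singular value is numerically zero, so at most min(m, n) − 1 components carry information.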
Variations of CA
We considered five variations of CA to address overdispersion in scRNAseq counts (also summarized graphically in Fig. 2A).

1.
CA with Freeman–Tukey chi-squared residuals Instead of computing the Pearson residuals described above, the residuals are computed:
$${r}_{f;ij}=\sqrt{{p}_{ij}}+\sqrt{{p}_{ij}+\frac{1}{N}}-\sqrt{4{p}_{i.}\,{p}_{.j}+\frac{1}{N}}$$
The matrix of these residual values is then decomposed with SVD as described in Step 2 above.
2.
CA with variancestabilizing transform: Square root The square root of the matrix of counts \(\mathbf{X}\) is computed before performing the residual transformation.

3.
CA with variancestabilizing transform: Anscombe Each element \({x}_{ij}\) of the matrix of counts \(\mathbf{X}\) is transformed to \({x}_{ij}^{*}=2\sqrt{{x}_{ij}+\frac{3}{8}}\). The residual transformation is computed on the variancestabilized counts matrix \({\mathbf{X}}^{*}\).

4.
CA with variancestabilizing transform: Freeman–Tukey Each element \({x}_{ij}\) of the matrix of counts \(\mathbf{X}\) is transformed to \({x}_{ij}^{*}=\sqrt{{x}_{ij}}+\sqrt{{x}_{ij}+1}\). The residual transformation is computed on the variancestabilized counts matrix \({\mathbf{X}}^{*}\).

5.
CA with power deflation After performing the Pearson residual transformation, the magnitude of each value in the matrix of residuals is raised to a power \(\alpha \in \left(0,1\right)\), while preserving the sign. Each element \({r}_{ij}\) in the residual matrix is transformed to \({r}_{ij}^{*}={\text{sgn}}\left({r}_{ij}\right){\left|{r}_{ij}\right|}^{\alpha }\). We recommend selecting \(\alpha \in [0.9,0.99]\) for a “soft” smoothing effect, and present results for \(\alpha =0.9\).
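The five variations listed above can be sketched as follows, in Python with NumPy. This is an illustration under stated assumptions rather than the corral code: the function names are hypothetical, and the “residual transformation” applied after each variance-stabilizing transform is assumed to be the Pearson residual. Each resulting matrix would then be decomposed with SVD as in Step 2.

```python
import numpy as np

def _proportions(X):
    X = np.asarray(X, dtype=float)
    N = X.sum()
    P = X / N
    expected = np.outer(P.sum(axis=1), P.sum(axis=0))
    return P, expected, N

def pearson_residuals(X):
    P, expected, _ = _proportions(X)
    return (P - expected) / np.sqrt(expected)

def freeman_tukey_residuals(X):
    # Variation 1: sqrt(p) + sqrt(p + 1/N) - sqrt(4 * expected + 1/N)
    P, expected, N = _proportions(X)
    return np.sqrt(P) + np.sqrt(P + 1.0 / N) - np.sqrt(4.0 * expected + 1.0 / N)

def vst_sqrt(X):
    # Variation 2: square root of the counts
    return np.sqrt(np.asarray(X, dtype=float))

def vst_anscombe(X):
    # Variation 3: 2 * sqrt(x + 3/8)
    return 2.0 * np.sqrt(np.asarray(X, dtype=float) + 3.0 / 8.0)

def vst_freeman_tukey(X):
    # Variation 4: sqrt(x) + sqrt(x + 1)
    X = np.asarray(X, dtype=float)
    return np.sqrt(X) + np.sqrt(X + 1.0)

def power_deflate(R, alpha=0.9):
    # Variation 5: sgn(r) * |r|^alpha applied to a residual matrix
    R = np.asarray(R, dtype=float)
    return np.sign(R) * np.abs(R) ** alpha

# Toy genes-by-cells count matrix (hypothetical values).
X = np.array([[10, 0, 3],
              [2, 8, 1],
              [0, 5, 9]])
variants = {
    "ft_residuals": freeman_tukey_residuals(X),
    "vst_sqrt": pearson_residuals(vst_sqrt(X)),
    "vst_anscombe": pearson_residuals(vst_anscombe(X)),
    "vst_freeman_tukey": pearson_residuals(vst_freeman_tukey(X)),
    "power_deflation": power_deflate(pearson_residuals(X), alpha=0.9),
}
```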
corralm: multitable adaptation of correspondence analysis
The adaptation of correspondence analysis for the integration of multiple tables is similar to the method for single tables, with additional matrix concatenation operations. When integrating datasets, we employ indexed residuals, obtained by dividing the standardized residuals by the square root of the expected proportion, to reduce the influence of columns with larger masses (library depth), which is a known source of batch effects in scRNAseq studies. Indexed residuals have a straightforward interpretation: for example, a value of 0.5 indicates that the observed value is 50% higher than the expected value, and a value of −0.5 indicates that the observed value is 50% lower than expected.
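As a worked illustration with hypothetical numbers (not taken from any dataset in this study), suppose a gene has row weight 0.1 and a cell has column weight 0.2, so that the expected abundance is 0.02. An observed abundance of 0.03 or 0.01 then gives:

$${p}_{i.}\,{p}_{.j}=0.1\times 0.2=0.02\,;\quad {p}_{ij}=0.03\Rightarrow {r}_{ij}=\frac{0.03-0.02}{0.02}=0.5\,;\quad {p}_{ij}=0.01\Rightarrow {r}_{ij}=\frac{0.01-0.02}{0.02}=-0.5$$

Equivalently, for the first case, the Pearson residual is \(0.01/\sqrt{0.02}\approx 0.0707\), and dividing it by \(\sqrt{0.02}\approx 0.1414\) again yields 0.5.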

Match tables and select features. Identify the intersection of features across the \(k\) matrices to be integrated, and subset the tables for only those \({m}^{*}\) features. While in these analyses we focus on batch integration and therefore match on features, the tables can either be matched by features, for integration across batches, or by cells, for multimodal integration across ‘omic types.

Transformation from counts to indexed residuals. Given each table with \(n\) cells and \({m}^{*}\) features, the row weight \({p}_{i.}\), column weight \({p}_{.j}\), and abundance \({p}_{ij}\) for each observation are computed as described above for standard CA. The indexed residuals \({r}_{ij}\) can be computed:
$${r}_{ij}=\frac{\mathrm{observed}-\mathrm{expected}}{\mathrm{expected}}=\frac{{p}_{ij}-{p}_{i.}\,{p}_{.j}}{{p}_{i.}\,{p}_{.j}}$$
Each table is scaled separately, so as to preserve the internal structure of each dataset.

Concatenate matrices. The transformed matrices of indexed residuals are then concatenated along the matching features to form a new matrix \({\mathbf{M}}_{\mathbf{C}}\) which has \({m}^{*}\) features and the total number of cells in the \(k\) matrices (i.e., sum of \(n\) across \(k\)).

Matrix decomposition. Singular value decomposition (SVD) is applied to the concatenated matrix of indexed residuals \({\mathbf{M}}_{\mathbf{C}}\) to find left singular matrix \(\mathbf{U}\), diagonal matrix of singular values \(\mathbf{D}\), and right singular matrix \(\mathbf{V}\) such that:
$${\mathbf{M}}_{\mathbf{C}}=\mathbf{U}\mathbf{D}{\mathbf{V}}^{\mathbf{T}}$$
and
The columns of the \(\mathbf{U}\) matrix then serve as the embeddings generated by this procedure, and the cells correspond to their indices in the concatenated matrix \({\mathbf{M}}_{\mathbf{C}}\).
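The multitable procedure above can be sketched as follows, in Python with NumPy. This is a simplified illustration and not the corralm implementation: it assumes dense features-by-cells count matrices with named features and no empty rows or columns after feature matching, and the names `indexed_residuals` and `corralm_sketch` are hypothetical. In this sketch the rows of `V` follow the column (cell) order of the concatenated matrix.

```python
import numpy as np

def indexed_residuals(X):
    """Indexed residuals: (p_ij - p_i. p_.j) / (p_i. p_.j)."""
    X = np.asarray(X, dtype=float)
    P = X / X.sum()
    expected = np.outer(P.sum(axis=1), P.sum(axis=0))
    return (P - expected) / expected

def corralm_sketch(tables, feature_names, n_components=2):
    """tables: list of features-by-cells arrays; feature_names: list of name lists."""
    # Match tables: keep only features present in every table, in a stable order.
    common = set(feature_names[0]).intersection(*feature_names[1:])
    shared = [f for f in feature_names[0] if f in common]
    blocks = []
    for X, names in zip(tables, feature_names):
        index = {name: i for i, name in enumerate(names)}
        rows = [index[f] for f in shared]
        # Each table is transformed separately, using its own weights.
        blocks.append(indexed_residuals(np.asarray(X)[rows, :]))
    M_C = np.hstack(blocks)                # m* features by total number of cells
    U, d, Vt = np.linalg.svd(M_C, full_matrices=False)
    k = n_components
    return shared, U[:, :k], d[:k], Vt[:k].T

rng = np.random.default_rng(1)
names_a = ["g1", "g2", "g3", "g4", "g5"]
names_b = ["g5", "g3", "g2", "g1", "g9"]
A = rng.poisson(4.0, size=(5, 6)) + 1      # toy batch 1: 5 genes, 6 cells
B = rng.poisson(4.0, size=(5, 8)) + 1      # toy batch 2: 5 genes, 8 cells
shared, U, d, V = corralm_sketch([A, B], [names_a, names_b], n_components=2)
```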
Depending upon downstream analysis, it may be important to select an appropriate number of PCs. Similar to PCA, the number of components can be selected using the elbow method with the scree plot, e.g., as implemented in the findPC R package (as in Fig. 4C for corralm with Harmony)^{71}.
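One common form of the elbow heuristic picks the point on the scree plot that lies farthest from the straight line joining its first and last points. The sketch below, in Python with NumPy, illustrates that generic idea only; it is not the findPC implementation, and the function name and the toy singular values are hypothetical.

```python
import numpy as np

def elbow_perpendicular(values):
    """1-based index of the point farthest from the line joining the endpoints."""
    y = np.asarray(values, dtype=float)
    x = np.arange(1, len(y) + 1, dtype=float)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # Perpendicular distance of each (x, y) to the line through the endpoints.
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist)) + 1

# Toy scree values (hypothetical): a steep drop followed by a flat tail.
singular_values = [10.0, 6.0, 3.0, 1.0, 0.9, 0.8, 0.7, 0.6]
n_pcs = elbow_perpendicular(singular_values)
```

The result depends on how many components are included in the scree plot and on the relative scaling of the two axes, so it is best treated as a starting point.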
Scaled variance plot
When integrating embedding representations across batches, measures for cluster evaluation are effective for assessing group compactness and recovery of cell populations via clustering. However, they do not directly assess how well dataset embeddings are integrated across batches. To focus specifically on batch integration, we developed and applied a heuristic scaled variance metric, which captures the relative dispersion of each batch with respect to the entire dataset. The scaled variance of component dimension \({d}^{*}\) for the subset of observations in batch \({b}^{*}\), \(S{V}_{{b}^{*},{d}^{*}}\), is the variance of that batch's embeddings in dimension \({d}^{*}\) divided by the variance of all embeddings in that dimension:
$$S{V}_{{b}^{*},{d}^{*}}=\frac{\mathrm{Var}\left(\mathbf{E}\left[b={b}^{*};\,d={d}^{*}\right]\right)}{\mathrm{Var}\left(\mathbf{E}\left[d={d}^{*}\right]\right)}$$
where \(\mathbf{E}\) is the matrix of embeddings, and \(b\) indexes the rows (observations by batch) while \(d\) indexes the columns to indicate which component dimension to evaluate. When the datasets are well integrated, SV values for each batch are close to 1, indicating that each batch has similar dispersion as compared to the entire embedding. This metric is appropriate when the types of cells represented in different datasets are expected to be similar, but it cannot account for situations where the expected distribution of cell types (and therefore, embeddings) is fundamentally different between batches.
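A minimal sketch of the scaled variance computation, in Python with NumPy, is given below. It assumes the sample variance (denominator n − 1) and an embedding matrix with observations in rows; the function name `scaled_variance` and the simulated data are illustrative assumptions, not the corral implementation.

```python
import numpy as np

def scaled_variance(E, batch, ddof=1):
    """Return {batch label: SV values, one per embedding dimension}."""
    E = np.asarray(E, dtype=float)
    batch = np.asarray(batch)
    overall = E.var(axis=0, ddof=ddof)     # variance of all observations, per dimension
    return {
        b.item(): E[batch == b].var(axis=0, ddof=ddof) / overall
        for b in np.unique(batch)
    }

# Simulated embeddings: both batches share dimension 0, but batch "b"
# is three times more dispersed than batch "a" along dimension 1.
rng = np.random.default_rng(0)
E_a = rng.normal(size=(500, 2))
E_b = rng.normal(size=(500, 2)) * np.array([1.0, 3.0])
E = np.vstack([E_a, E_b])
batch = np.array(["a"] * 500 + ["b"] * 500)
sv = scaled_variance(E, batch)
```

In this simulation the SV values for dimension 0 fall near 1 for both batches, while those for dimension 1 fall well below 1 for batch "a" and well above 1 for batch "b".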
Benchmarking
We considered the ten scRNAseq benchmarking datasets shown in Table 1. The reduced dimension embeddings from each method were clustered using walktrap nearest neighbor graph clustering, as implemented in the bluster package’s default NNGraph parameter set^{72,73}. Performance on the clustering task was assessed with the Adjusted Rand Index (ARI)^{74}, using as “ground truth” the cell type labels from the original datasets. Walktrap was selected as the main method for clustering based on performance; we observed, similar to others, that the walktrap algorithm better preserves hierarchical structure than Louvain clustering and overall achieves higher ARI^{75}. Results comparing Louvain clustering with walktrap clustering are included in Fig. S4. We note that whilst some variability in clusters and ARI was observed between runs, CAFT consistently ranked as the most performant method across the range of datasets. Results shown in Fig. 2C are from clustering using different numbers of PCs. Results shown in Fig. 2B are computed by taking the maximum across all the tested PCs from Fig. 2C, and for glmPCA, the value shown is the average of the maxima achieved by each seed (ten seeds tested in total). Datasets (detailed below) were acquired from three R/Bioconductor data packages: CellBench, DuoClustering2018, and scRNAseq. Links to each of these are included below in the Data Availability section.
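For readers unfamiliar with the ARI, the following sketch computes it from the contingency table of two labelings, following the Hubert and Arabie formulation^{74}. It uses only the Python standard library and is an illustration, not the code used for the benchmarks; the function name is hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from the contingency table of two labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0                          # fewer than two observations
    # Pairs of observations co-clustered in both, in truth only, in prediction only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:               # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Cluster labels are arbitrary, so a pure relabeling scores 1.
ari_relabel = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
# Splitting one true cluster in two is penalized.
ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```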
In the SCMixology integration (Fig. 4A, B), each of the benchmarked methods is run with the default settings as suggested in their respective documentation/vignettes. mnnCorrect from the batchelor R/Bioconductor package is run on the logcounts matrices, and the result is then decomposed with PCA^{60}. The LIGER result is shown as a UMAP visualization because LIGER is an NMF-based method: we found that visualizing its embeddings directly was challenging, since the dimensions of the embedding are not ranked by performance and are also constrained to only positive values^{59}. For the same reason, LIGER is not shown in the scaled variance plot, and we would not recommend using the scaled variance plot approach with other methods that do not generate ranked components.
In the pancreas integration (Fig. 4C, S5), all UMAP plots were generated using n_neighbors = 40 or n_neighbors = 50. Methods were implemented as in the SCMixology integration results. PCA (scaled by table) was implemented as described in our minireview^{4}. Multibatch PCA was performed with the batchelor implementation (multiBatchPCA), as was the “ + MNN” method (reducedMNN). In the result for corralm + Harmony, the elbow method (implemented in findPC; perpendicular option^{71}) was used for PC selection prior to running Harmony^{61}. Average silhouette width (ASW) was implemented with the cluster R package, using Euclidean distance^{64,76}. To enable joint evaluation, labels were harmonized, such that matching cell types are assigned the same label across datasets. In particular, activated stellate and quiescent stellate were merged to stellate; gamma/pp and pp were merged with gamma; duct and ductal were merged.
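The two ASW-based scores can be illustrated with a small sketch in Python with NumPy. It follows the standard silhouette definition^{64} with Euclidean distance on a toy embedding; it is not the cluster R package, it computes the full pairwise distance matrix and so suits only small examples, and the function name is hypothetical.

```python
import numpy as np

def average_silhouette_width(E, labels):
    """Mean silhouette width over observations, using Euclidean distance."""
    E = np.asarray(E, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    if len(groups) < 2:
        raise ValueError("silhouette width needs at least two groups")
    diff = E[:, None, :] - E[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))       # pairwise distances
    widths = np.zeros(len(E))
    for i in range(len(E)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue                            # singleton groups get width 0
        a = D[i, own].sum() / (n_own - 1)       # mean distance within own group
        b = min(D[i, labels == g].mean() for g in groups if g != labels[i])
        widths[i] = (b - a) / max(a, b)
    return float(widths.mean())

# Toy embedding: two compact cell types, each containing cells from both batches.
E = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
cell_type = np.array(["A"] * 4 + ["B"] * 4)
batch = np.array([0, 1, 0, 1, 0, 1, 0, 1])

asw_celltype = average_silhouette_width(E, cell_type)
asw_batch = average_silhouette_width(E, batch)
batch_integration = 1.0 - asw_batch
```

In this toy case the cell types are well separated, so ASW by cell type is high, while the batches are interleaved, so ASW by batch is at or below zero and the integration score is at least about 1.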
Data availability
Code and documentation are available in the corral R/Bioconductor package: https://www.bioconductor.org/packages/corral. R code to reproduce the figures and analysis in this manuscript is available on Github at: https://github.com/laurenhsu1/corral_manuscript. A tutorial describing different implementations of PCA and CA, including corral, is available at: https://aedin.github.io/PCAworkshop. The datasets used in these analyses are detailed in Table 1 and in the Benchmarking section of Methods, including citations and where the data can be accessed directly through R data packages. For ease of access, links for each Bioconductor data package used in this paper are included below. CellBench: https://bioconductor.org/packages/release/bioc/html/CellBench.html; DuoClustering2018: https://bioconductor.org/packages/release/data/experiment/html/DuoClustering2018.html; scRNAseq: https://www.bioconductor.org/packages/release/data/experiment/html/scRNAseq.html.
References
Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in singlecell RNAsequencing experiments. Biostatistics 19, 562–578 (2018).
Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for singlecell RNASeq based on a multinomial model. Genome Biol. 20, 295 (2019).
Amezquita, R. A. et al. Orchestrating singlecell analysis with bioconductor. Nat. Methods 17, 137–145 (2020).
Hsu, L. L. & Culhane, A. C. Impact of data preprocessing on integrative matrix factorization of single cell data. Front. Oncol. 10, 973 (2020).
Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of singlecell RNAseq data. Nat. Rev. Genet. 20, 273–282 (2019).
Nguyen, L. H. & Holmes, S. Ten quick tips for effective dimensionality reduction. PLOS Comput. Biol. 15, e1006907 (2019).
SteinO’Brien, G. L. et al. Enter the Matrix: Factorization Uncovers Knowledge from Omics. Trends Genet. 34, 790–805 (2018).
Holmes, S. Multivariate data analysis: The French way. In Institute of Mathematical Statistics Collections 219–233 (Institute of Mathematical Statistics, 2008). doi:https://doi.org/10.1214/193940307000000455.
Hotelling, H. Relations between two sets of variates. Biometrika 28, 321 (1936).
Meng, C. et al. Dimension reduction techniques for the integrative analysis of multiomics data. Brief. Bioinform. 17, 628–641 (2016).
Pearson, K. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2, 559–572 (1901).
Diaconis, P., Goel, S. & Holmes, S. Horseshoes in multidimensional scaling and local kernel methods. Ann. Appl. Stat. 2, 777–807 (2008).
Hafemeister, C. & Satija, R. Normalization and variance stabilization of singlecell RNAseq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).
Lause, J., Berens, P. & Kobak, D. Analytic pearson residuals for normalization of singlecell RNAseq UMI data. Genome Biol. 22, 258 (2021).
Durif, G., Modolo, L., Mold, J. E., LambertLacroix, S. & Picard, F. Probabilistic count matrix factorization for single cell expression data analysis. Bioinformatics 35, 4011–4019 (2019).
Pierson, E. & Yau, C. ZIFA: Dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biol. 16, 241 (2015).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.P. A general and flexible method for signal extraction from singlecell RNAseq data. Nat. Commun. 9, 284 (2018).
Sun, S. Accuracy, robustness and scalability of dimensionality reduction methods for singlecell RNAseq analysis. Genome Biol. 20, 269 (2019).
Benzécri, J.P. Problèmes statistiques et méthodes géométriques. Cah. Anal. Données 3, 131–146 (1978).
Benzécri, J.P. & others. L’analyse des données. vol. 2 (Dunod Paris, 1973).
De la Cruz, O. & Holmes, S. The duality diagram in data analysis: Examples of modern applications. Ann. Appl. Stat. 5, 2266–2277 (2011).
Escoufier, Y. The duality diagram: A means of better practical applications. In Developments in Numerical Ecology (eds. Legendre, P. & Legendre, L.) (Springer, 1987).
Escoufier, Y. Operator related to a data matrix: a survey. In Compstat 2006  Proceedings in Computational Statistics (eds. Rizzi, A. & Vichi, M.) 285–297 (Physica HD, 2006). doi:https://doi.org/10.1007/9783790817096_22.
Legendre, P. & Legendre, L. Numerical Ecology. (Elsevier, 2012).
Holmes, S. & Huber, W. Modern Statistics for Modern Biology. (Cambridge University Press, 2019).
Greenacre, M. J. Correspondence analysis. Wiley Interdiscip. Rev. Comput. Stat. 2, 613–619 (2010).
Digby, P. G. N. & Kempton, R. A. Multivariate Analysis of Ecological Communities (Springer, 1987).
Greenacre, M. J. Theory and applications of correspondence analysis. (Academic Press, 1984).
Grantham, R., Gautier, C., Gouy, M., Mercier, R. & Pavé, A. Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 8, 197–197 (1980).
Perriere, G. Use and misuse of correspondence analysis in codon usage studies. Nucleic Acids Res. 30, 4548–4555 (2002).
Fellenberg, K. et al. Correspondence analysis applied to microarray data. Proc. Natl. Acad. Sci. 98, 10781–10786 (2001).
Busold, C. H. et al. Integration of GO annotations in correspondence analysis: Facilitating the interpretation of microarray data. Bioinformatics 21, 2424–2429 (2005).
McMurdie, P. J. & Holmes, S. phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8, 11 (2013).
Culhane, A. C., Perriere, G., Considine, E. C., Cotter, T. G. & Higgins, D. G. Betweengroup analysis of microarray data. Bioinformatics 18, 1600–1608 (2002).
Culhane, A. C., Perrière, G. & Higgins, D. G. Crossplatform comparison and visualisation of gene expression data using coinertia analysis. BMC Bioinformatics 15 (2003).
Meng, C. et al. MOGSA: Integrative single sample geneset analysis of multiple omics data. Mol. Cell. Proteomics 18, S153–S168 (2019).
Culhane, A. C. & Hsu, L. L. Dimension reduction for beginners: Hitchhiker’s guide to matrix factorization and PCA. (2019) https://github.com/aedin/PCAworkshop.
Greenacre, M. The contributions of rare objects in correspondence analysis. Ecology. 94(1), 241–249 (2013).
Tian, L. et al. Benchmarking single cell RNAsequencing analysis pipelines using mixture control experiments. Nat. Methods 16, 479–487 (2019).
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for singlecell RNAseq data. F1000Research 7, 1141 (2020).
Mosteller, F. & Tukey, J. W. The uses and usefulness of binomial probability paper. J. Am. Stat. Assoc. 44, 174–212 (1949).
Bartlett, M. S. The use of transformations. Biometrics 3, 39 (1947).
Anscombe, F. J. The transformation of poisson, binomial and negativebinomial data. Biometrika 35, 246–254 (1948).
Freeman, M. F. & Tukey, J. W. Transformations related to the angular and the square root. Ann. Math. Stat. 21, 607–611 (1950).
Tukey, J. W. Exploratory data analysis. (AddisonWesley, 1977).
Beh, E. J., Lombardo, R. & Alberti, G. Correspondence analysis and the Freeman–Tukey statistic: A study of archaeological data. Comput. Stat. Data Anal. 128, 73–86 (2018).
Plackett, R. L., Bishop, Y. M. M., Fienberg, S. E. & Holland, P. W. Discrete multivariate analysis: Theory and practice. J. R. Stat. Soc. Ser. Gen. 139, 402 (1976).
Cressie, N. & Read, T. R. C. Multinomial GoodnessOfFit Tests. J. R. Stat. Soc. Ser. B Methodol. 46, 440–464 (1984).
Greenacre, M. Contribution biplots. J. Comput. Graph. Stat. 22, 107–122 (2013).
Krensky, A. M. & Clayberger, C. Biology and clinical relevance of granulysin. Tissue Antigens 73, 193–198 (2009).
Gonzalez, L. L., Garrie, K. & Turner, M. D. Role of S100 proteins in health and disease. Biochim. Biophys. Acta BBA Mol. Cell Res. 1867, 118677 (2020).
Wang, S. et al. S100A8/A9 in Inflammation. Front. Immunol. 9, 1298 (2018).
Gordon, S., Plüddemann, A. & Martinez Estrada, F. Macrophage heterogeneity in tissues: Phenotypic diversity and functions. Immunol. Rev. 262, 36–55 (2014).
Tomasello, E. & Vivier, E. KARAP/DAP12/TYROBP: Three names and a multiplicity of biological functions. Eur. J. Immunol. 35, 1670–1677 (2005).
Su, H., Na, N., Zhang, X. & Zhao, Y. The biological function and significance of CD74 in immune diseases. Inflamm. Res. 66, 209–216 (2017).
Matern, B. M., Olieslagers, T. I., Voorter, C. E. M., Groeneweg, M. & Tilanus, M. G. J. Insights into the polymorphism in HLADRA and its evolutionary relationship with HLA haplotypes. HLA 95, 117–127 (2020).
Doledec, S. & Chessel, D. Coinertia analysis: An alternative method for studying speciesenvironment relationships. Freshw. Biol. 31, 277–294 (1994).
Welch, J. D. et al. Singlecell multiomic integration compares and contrasts features of brain cell identity. Cell 177, 18731887.e17 (2019).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in singlecell RNAsequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Korsunsky, I. Fast, sensitive and accurate integration of singlecell data with Harmony. Nat. Methods 16, 16 (2019).
Tran, H. T. N. et al. A benchmark of batcheffect correction methods for singlecell RNA sequencing data. Genome Biol. 21, 12 (2020).
Luecken, M. D. et al. Benchmarking atlaslevel data integration in singlecell genomics. Nat. Methods 19, 41–50 (2022).
Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Baglama, J. & Reichel, L. Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J. Sci. Comput. 27, 19–42 (2005).
Hirschfeld, H. O. A connection between correlation and contingency. Math. Proc. Camb. Philos. Soc. 31, 520–524 (1935).
Abdi, H. & Valentin, D. Multiple Correspondence Analysis. Encycl. Meas. Stat. (2007).
Beh, E. J. & Lombardo, R. A genealogy of correspondence analysis. Aust. N. Z. J. Stat. 54, 137–168 (2012).
Hill, M. O. Correspondence analysis: A neglected multivariate method. Appl. Stat. 23, 340 (1974).
Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond. Edinb. Dublin Philos. Mag. J. Sci. 50, 157–175 (1900).
Zhuang, H., Wang, H. & Ji, Z. findPC: An R package to automatically select the number of principal components in single-cell analysis. Bioinformatics 38, 2949–2951 (2022).
Lun, A. bluster: Clustering Algorithms for Bioconductor. R package version 1.8.0. (2022). https://bioconductor.org/packages/bluster.
Pons, P. & Latapy, M. Computing Communities in Large Networks Using Random Walks. in Computer and Information Sciences – ISCIS 2005 (eds. Yolum, P., Güngör, T., Gürgen, F. & Özturan, C.) vol. 3733 284–293 (Springer Berlin Heidelberg, 2005).
Hubert, L. & Arabie, P. Comparing partitions. J. Classif. 2, 193–218 (1985).
Barkas, N. et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods 16, 695–698 (2019).
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M. & Hornik, K. cluster: Cluster analysis basics and extensions. R package version 2.1.4 (2022). https://cran.r-project.org/web/packages/cluster
Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360.e4 (2016).
Muraro, M. J. et al. A single-cell transcriptome Atlas of the human pancreas. Cell Syst. 3, 385–394.e3 (2016).
Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).
Chen, R., Wu, X., Jiang, L. & Zhang, Y. Single-cell RNA-seq reveals hypothalamic cell diversity. Cell Rep. 18, 3227–3241 (2017).
Darmanis, S. et al. A survey of human brain transcriptome diversity at the single cell level. Proc. Natl. Acad. Sci. 112, 7285–7290 (2015).
Aztekin, C. et al. Identification of a regeneration-organizing cell in the Xenopus tail. Science 364, 653–658 (2019).
Acknowledgements
We are grateful for helpful discussions with Prof. John Quackenbush and his lab at Harvard TH Chan School of Public Health, Prof. Aedín Culhane’s lab at University of Limerick, and with Bioconductor colleagues funded by the Chan Zuckerberg Initiative seed network program. We are also grateful for support from Prof. Judith Agudo and her lab at Dana-Farber Cancer Institute.
Funding
This project has been made possible in part by grant number CZF2019-002443 (Lead PI: Martin Morgan) from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation, of which ACC is a grantee. LH is funded in part by the NIH NIGMS Biostatistics Training Grant Program in Statistical Genetics/Genomics & Computational Biology (Predoctoral training grant T32GM135117).
Author information
Contributions
L.H. and A.C.C. wrote the manuscript and conceptualized the methods presented. A.C.C. wrote the Bioconductor workshop vignette on CA. L.H. developed the R/Bioconductor package corral, wrote code to perform analyses, and created figures.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hsu, L.L., Culhane, A.C. Correspondence analysis for dimension reduction, batch integration, and visualization of single-cell RNA-seq data. Sci Rep 13, 1197 (2023). https://doi.org/10.1038/s41598-022-26434-1
DOI: https://doi.org/10.1038/s41598-022-26434-1