Causal identification of single-cell experimental perturbation effects with CINEMA-OT

Recent advancements in single-cell technologies allow characterization of experimental perturbations at single-cell resolution. While methods have been developed to analyze such experiments, the application of a strict causal framework has not yet been explored for the inference of treatment effects at the single-cell level. Here we present a causal-inference-based approach to single-cell perturbation analysis, termed CINEMA-OT (causal independent effect module attribution + optimal transport). CINEMA-OT separates confounding sources of variation from perturbation effects to obtain an optimal transport matching that reflects counterfactual cell pairs. These cell pairs represent causal perturbation responses permitting a number of novel analyses, such as individual treatment-effect analysis, response clustering, attribution analysis, and synergy analysis. We benchmark CINEMA-OT on an array of treatment-effect estimation tasks for several simulated and real datasets and show that it outperforms other single-cell perturbation analysis methods. Finally, we perform CINEMA-OT analysis of two newly generated datasets: (1) rhinovirus and cigarette-smoke-exposed airway organoids, and (2) combinatorial cytokine stimulation of immune cells. In these experiments, CINEMA-OT reveals potential mechanisms by which cigarette-smoke exposure dulls the airway antiviral response, as well as the logic that governs chemokine secretion and peripheral immune cell recruitment.

1 Supplementary notes 1 Here we give an rigorous treatment of the causal framework and underlying assumptions in CINEMA-OT.
Assumption 1 (Formal): Independent sources and noise.Confounding factors and treatment events (s 1 , ..., s l , z) are independent random variables.The treatment event z ∈ {−1, 1} with P (z = 1) = p.Assumption 2 (Formal): Linearity of source signal combinations.The expression of each gene can be linearly decomposed as the mixing of noisy confounding signals and treatment-associated signals.Without loss of generality, we assume the concatenated random vector (s 1 + e 1 , ..., s l + e l , z + e z ) is whitened and at most only one of the (s 1 + e 1 , ..., s l + e l , z + e z ) is Gaussian.The observed data matrix X ∈ R n×d is generated by mixing of i.i.d sampled noisy signals plus i.i.d noise terms ϵ.
In our formulation, to be more realistic, we consider both the biological variation terms e and the measurement noise ϵ.For simplicity, we may understand the signal s as cell types and e as biological variations contributing to the heterogeneity within each cell type.Given assumption 1 and 2, denote the gene count matrix as X ∈ R n×d , and it is generated by a random vector x ∈ R m with the mixing matrix A. Then the data generation mechanism is given by As n → ∞, by the subspace consistency established in [1], we have the first l +1 identified principal components of form: Here B represents a linear transform.Note after PCA preprocessing, different components s i + e i or z + e z are still independent and at most only one of the factors is Gaussian.As a result, the ICA identifiability theorem [2] can be directly applied on x to unmix the independent components, which means the confounding factors are identifiable, up to a permutation: Finally, a statistical test on the difference between untreated and treated distributions for each independent component can be performed to distinguish the confounding factors from the treatment event signal.
Our theoretical justification here reveals that: 1.The (noisy version of) confounder terms are identifiable with the ICA transform up to a permutation; 2. The contributions of noise terms are not distorted as the relative ratio between s, z and e are preserved in the output.These two points supports the validity of using CINEMA-OT for matching according to the identified (noisy) confounders.
Moreover, we are able to show that in the same setting, a non-linear neural network based architecture named conditional (variational) autoencoder (used as the key component in scGen [3], compositional autoencoder [4], and contrastiveVI [5] along with other tools) can fail to identify the confounder signals.
In the model of conditional (variational) autoencoders, the latent space can be represented as two parts s = [s 0 ; z].First is the basal embedding s 0 , where the embedding of cell distributions from different conditions overlap; The other is the treatment signal z, which can be transformed into a treatment associated embedding.The generative model can be written as the following form: Here the setting of Gaussian s corresponds to a variational autoencoder design and a deterministic s conditioning on x corresponds to the vanilla autoencoder design, and θ denotes the decoder network parameters.In our context, we suppose the conditional (variational) autoencoder is optimized by the reconstruction loss with respect to x in eq (2) with / without Gaussian constraint: Even we assume the function above is perfectly optimized, and the learned s 0 is indeed independent of z, s 0 can be still an arbitrary transform of the noisy confounder signals.This is based on the following fundamental result from [6,7]: Theorem.[6,7] Let x be a d-dimensional random vector of any distribution.Then there exists a transformation F : R d → R d such that the components of x ′ := F (x) are independent, and each component has a standardized Gaussian distribution.In particular, x ′ 1 equals a monotonic transformation of x 1 .
The above theorem means, for any arbitrary injective transformation F ([s 1 + e 1 , ..., s l + e l ] T ), its first component can be used as the first independent component in s 0 .This immediately leads to two observations: 1.The confounder terms are no longer identifiable up to a permutation; 2.Even though the information of confounder can be preserved up to a injective transformation, the distance measure on the latent space can be dramatically distorted.To see why the second point holds, suppose s 1 is a binary r.v. with P (s 1 = 1) = P (s 1 = −1) = 0.5, and V ar(s1) V ar(e1) is a sufficiently large constant.We can see in this case the distance based on s 1 + e 1 is almost determined by s 1 ; however, the distance measure after an injective transform can be almost determined by e 1 (Supplementary Note Figure 1).In summary, our theoretical analysis suggests that in the model setting, the classical ICA can give consistent distance measures based on confounder space, while the non-linear conditional autoencoder approaches suffer from non-identifiability and distance distortion, leading to less meaningful latent spaces.

Upon arbitrary injective transformation
Supplementary Note Figure 1: An example of distance distortion with noisy confounder signals.
Empirically, we also observe our method can successfully reveal confounding variations even with non-linear interactions between confounding factors and treatment events.In this case, the causal matching may be performed according to a non-one-to-one transform of the full confounder signal, which is not consistent with the full confounder distribution but still meaningful in preserving a part of confounder information.More specifically, it interpolates between a single-cell level causal matching and a cluster/population-level causal matching.
Finally, given the identified confounding factors, the problem of treatment effect estimation can be solved by standard potential outcome framework.The framework has mainly four assumptions, which are discussed in detail in causal inference textbooks [8]: 1. Stable Unit Treatment Assumption (SUTVA): Samples are independent without interference; 2. Ignorability: there are no unmeasured confounders; 3. Consistency; 4. Positivity: for a given confounder, the probability of perturbation is neither 0 or 1.

Supplementary notes 2
There are a number of existing methods that perform perturbation effect analysis in single-cell omics data.Several pioneering works in the field propose the use of factor models to identify perturbation effects, including LRICA [9], MIMOSCA [10], scMAGeCK [11], MUSIC [12], and WGCNA/STM [13].More recently, Mixscape [14] estimates single-cell level perturbation effects by matching between neighboring cells in the shared k-NN graph across conditions.Additionally, several deep learning-based frameworks learn perturbation responses from scRNA-seq data via autoencoders as deep factor models, including scGen [3], compositional autoencoder (CPA) [4], and ContrastiveVI [5].Finally, CellOT [15] proposes a neural network that learns a non-linear transport map aligning cells from different treatment conditions.
Despite these efforts, none of the existing methods achieve guaranteed confounder identification, which leads to interpretable causal effect estimation by aligning cells with the same confounder states across conditions.Among the alternative single-cell treatment effect analysis methods, Mixscape and CellOT do not model the confounding variation.Moreover, Mixscape considers the nearest neighbor relationship on the entire gene expression space instead of distributional matching, which may lead to vulnerability to cell outliers and unbalanced mixing.While the auto-encoder based methods model confounder variation in general, they can suffer from the fundamental limitation of un-identifiability [7] (See Supplementary notes for a detailed explanation), which can reduce their power in identifying ground truth confounding variations.CINEMA-OT is the first method that achieves confounder identification as well as distributional matching for the task of single-cell treatment effect analysis.