Robust single-cell matching and multimodal analysis using shared and distinct features

The ability to align individual cellular information from multiple experimental sources is fundamental for a systems-level understanding of biological processes. However, currently available tools are mainly designed for single-cell transcriptomics matching and integration, and generally rely on a large number of shared features across datasets for cell matching. This approach underperforms when applied to single-cell proteomic datasets due to the limited number of parameters simultaneously accessed and lack of shared markers across these experiments. Here, we introduce a cell-matching algorithm, matching with partial overlap (MARIO) that accounts for both shared and distinct features, while consisting of vital filtering steps to avoid suboptimal matching. MARIO accurately matches and integrates data from different single-cell proteomic and multimodal methods, including spatial techniques and has cross-species capabilities. MARIO robustly matched tissue macrophages identified from COVID-19 lung autopsies via codetection by indexing imaging to macrophages recovered from COVID-19 bronchoalveolar lavage fluid by cellular indexing of transcriptomes and epitopes by sequencing, revealing unique immune responses within the lung microenvironment of patients with COVID.


Supplementary Fig 15: Performance of matching and integration on cross-species whole blood cells CyTOF data (H1N1/IL-4 group). (A)
Testing algorithm stringency between different methods. Increasing amounts of random spike-in noise was added to the data, and the matching accuracy and proportion of cells matched to X were quantified. MARIO matchability test automatically suspended forced matching of inappropriate data due to poor quality here. (B) Testing algorithm stringency among different methods. Single-cell types in Y were deleted before matching to X. The proportion of cells belonging to the deleted cell type in matched X cells were used to calculate the erroneous avoidance score. (C) t-SNE plots visualizing pre-integation and post-integration results with different methods.

Supplementary Fig 22: Computational complexity (A)
Run time for full MARIO pipeline (Initial and refined matching; Finding the best interpolation; Joint regularized filtering; CCA calculation) across different datasets. (B) Run time for MARIO matching steps (total time for initial and refined matchings) across different datasets. The ratio of X and Y was set as 1:4 (eg. at a total of 20,000 cells, X has 4000 cells and Y has 16,000 cells). Three sparsity levels were shown in the figures, which are 1: 'Minimal' sparsity calculated by MARIO. 2: 'Maximum' sparsity, same as using dense data. 3: 'Medium' sparsity which is the level in the middle between minimal and maximum. (C) Peak memory usage when running the full MARIO pipeline across different datasets. The ratio of X and Y was set as 1:4. (D) Matching accuracy with different levels of sparsity for MARIO. Total of 50,000 cells were used, where the ratio of X and Y was set as 1:4.

Extensions of MARIO Pipeline.
Matching more than two datasets. Suppose we have L datasets X 1 ∈ R n 1 ×(p share +p 1 ) , . . . , X L ∈ R n L ×(p share +p L ) . For 2 ≤ ℓ ≤ L, we run the usual two-dataset procedure to estimate the matching between cells in X 1 and cells in X ℓ byΠ 1↔ℓ . We then run jointly regularized filtering on eachΠ 1↔ℓ separately and keep the cells in X 1 that survive all L − 1 rounds of filtering. This gives us a cell-to-cell matching among the L datasets, from which we can construct row-wise aligned datasets X ⋆ 1 ∈ R n×(p share +p 1 ) , . . . , X ⋆ L ∈ R n×(p share +p L ) , where n is the number cells in X 1 that survived all L − 1 rounds of filtering. To jointly embed all the aligned datasets, we use generalized canonical correlation analysis (gCCA) (1). It is well known that gCCA does not admit a unique formulation (2). We take the following formulation which best suits our goal of obtaining joint embeddings: where ∥ · ∥ F is the Frobenius norm, 1 ≤ r ≤ p share + min ℓ p ℓ is the number of components to keep, and X ℓŴℓ is the embedding for the ℓ-th dataset. To solve the above optimization problem, we take a block coordinate descent approach. This approach again needs preliminary estimators {Ŵ To obtain those preliminary estimators, we first run the classical CCA on the first two datasets and obtain the projection matricesŴ 2 are the sample canonical scores for X ⋆ 1 and X ⋆ 2 , respectively. Then, for each ℓ ≥ 3, we run least squares regression using (X ⋆ 2 )/2 as the response and X ⋆ ℓ as the feature matrix. The resulting regression coefficient is then taken to beŴ Given the preliminary estimators, we are ready to enter the block coordinate descent iteration. We first demonstrate how to solve for the first columns of {Ŵ ℓ }. Suppose at iteration t, we are given preliminary estimators {ŵ We then proceed as follows. For every 1 ≤ ℓ ≤ m, we run a least squares regression with the response being the current average scores (not counting ℓ itself), i.e., , and with the feature matrix being X ⋆ ℓ . Denote the resulting regression coefficient asw We run the above procedure for 500 iterations and let {ŵ (1,T ) ℓ } be the first columns of {Ŵ ℓ }. We now discuss how to solve for the j-th columns of {Ŵ ℓ }, where j ≥ 2. We start by running a least squares regression with X ⋆ ℓ being the response and the first j − 1 scores of X ⋆ ℓ (i.e., X ⋆ ℓ (Ŵ ℓ )• ,1:j−1 , where (Ŵ ℓ )• ,1:j−1 is the first j − 1 columns of W ℓ ) being the feature matrix. The residual of this regression is denoted asX ⋆ ℓ . Now suppose at iteration t, we are given preliminary estimators {ŵ (j,t) ℓ }, whereŵ (j,t) ℓ ∈ R p share +p ℓ . We proceed as follows. For every 1 ≤ ℓ ≤ L, we run a least squares regression with the response being ( ℓ ′ <ℓX ⋆ ℓŵ , and with the feature matrix being X ⋆ ℓ . Denote the resulting regression coefficient asw . We then run a least squares regression with the response being X ⋆ ℓw (j,t+1) ℓ /∥w (j,t+1) ℓ ∥ 2 and the feature matrix being X ⋆ ℓ . The resulting regression coefficient is taken to beŵ j,t+1 ℓ . We run the above procedure for 500 iterations and let {ŵ Speeding up MARIO cell matching via distance sparsification. Standard implementations of the one-to-one matching run in O((n x + n y ) 3 ) time. However, if the distance matrix D is sparse (i.e., a lot of entries are infinity, meaning that such a pair is a priori infeasible), then the time complexity can further be reduced. For example, if one regards the distance matrix as a bipartite graph and let (i, j) denote an edge if D ij < ∞, then it is possible to solve the problem inÕ((n x + n y )|E|) time, where |E| is the number of edges andÕ hides poly-log factors (3). A natural attempt is to manually sparsify D so that for each row, only k ≪ n y smallest entries are finite. Let D (k) be the sparsified matrix. In theory, there exists a critical value of k ⋆ such that: (1) the distance matrix D (k ⋆ ) can give a valid matching; and (2) if one sparsifies it further (i.e., use D (k) for k < k ⋆ ), then there is no valid matching. We give an algorithm for computing this critical value. For any fixed k, we can test if D (k) can give a valid matching by computing the maximumcardinality matching, which can be done in O(kn x √ n x + n y ) time using the Hopcroft-Karp algorithm (4). We can then use binary search to search for the critical value k ⋆ . In the worst case (i.e., when k ⋆ = n y ), the whole procedure runs in O(log(n y )n x n y √ n x + n y ) time, which is already much faster than the O((n x + n y ) 3 ) time needed to compute the matching using the original distance matrix. In practice, since k ⋆ is usually very small compared to n y , the running time of the whole procedure can be even faster. This procedure generalizes the strategy taken by (5), which only works when the distance matrix is computed using a single feature.
Given the knowledge of k ⋆ , we sparsify the distance matrix with some user-specified k ≥ k ⋆ (denoted as sparsity in the MARIO package) and apply the LAPJVsp algorithm (an algorithm specifically designed to tackle sparse inputs) (6) to compute the matching. In practice, we can further speed up the matching process by randomly splitting the data into n (in MARIO package denoted as n_batch) evenly-sized batches, computing the matching for each batch, and stitching the batch-wise matchings together.
Details on data pre-processing and analysis.
Preprocessing and analysis of human bone marrow datasets. CyTOF data measuring 32 proteins in healthy human bone marrow cells from levine et al (7)) was downloaded from GitHub https://github.com/lmweber/ benchmark-data-Levine-32-dim. Cells gated as HSPCs, CD4 T cell, CD8 T cell, B cell, monocyte, NK cell and pDC from the paper were selected and a total of 102,977 cells were used. CITE-seq dataset measuring 25 proteins and RNA expression of healthy human bone marrow cells was acquired using bmcite in the R package SeuratData. Cells annotated as HSPCs, CD4 T cell, CD8 T cell, B cell, monocyte, NK cell, and pDC from the paper, comprising a total of 29,007 cells, were used. For evaluation purpose, the annotations were unified between datasets (eg. names of populations). Marker CD11c, CD123, CD14, CD16, CD19, CD3, CD34, CD38, CD4, CD45RA, CD8, and HLA-DR were shared across datasets. During matching, CITE-seq cells were used to match against CyTOF cells, where the input of CITE-seq cells were pre-normalized counts from bmcite and the input of CyTOF cells were values with arcsine transformation (cofactor = 5). The MARIO parameters used are n_components_ovlp = 10, n_components_all = 20, sparsity = 1000, bad_prop = 0.2, and n_batch = 4. The t-SNE plots were generated using the scaled shared protein features across datasets (pre-integration) or the first 10 components for the CCA scores (MARIO integration), using the Rtsne() function with default settings in R package Rtsne.
The heatmap was produced using heatmap.2() in the R package gplots, with z-scaled CITE-seq and CyTOF protein expression levels. The matched or original values of protein/RNA overlaid with t-SNE plots were generated with the function Featureplot() in R package Seurat. The detailed process of benchmarking MARIO against other methods is further described in the Benchmarking section in the Supplementary Methods section.
Preprocessing and analysis of cross species H1N1/IFN gamma challenged datasets. CyTOF data measuring 42 proteins in blood cells from humans challenged with H1N1 (8) virus was acquired from flow repository FR-FCM-Z2NZ 39. Three donors were used (id = "101", "107", "108"). The dataset was randomly downsampled to 120,000 cells, arcsine transformed with cofactor = 5, and subsequently clustered via the default Seurat clustering pipeline with all available antibody markers. Cell types were then manually annotated based on their expression profile. A total of 102,147 annotated cells were used. CyTOF data measuring 39 proteins of whole blood cells from human, rhesus macaque and cynomolgus monkey challenged with Interferon gamma (9) were acquired from flow repository FRFCM-Z2ZY 35. Three donors of each species (human: "7826", "7718", "2810"; rhesus macaque: "D00522", "D06022", "D06122"; cynomolgus monkey: "D07282", "D07292", "D07322") were used. Cells gated as Erythrocytes, Platelets and CD4+CD8+ cells in the paper were excluded from downstream analysis. Each individual dataset was randomly downsampled to 120,000 cells, arcsine transformed with cofactor = 5, then clustered with Seurat using all the markers, followed by manually annotation and then removal of cells with ambiguous annotations. Preprocessing and analysis of murine spleen datasets. Tiff files of CODEX multiplexed imaging data for BALBc mouse spleen, with 28 antibodies, were acquired (10) (sample ID: 'balbc-1'). Segmentation was performed with a local implementation of Mesmer (11) , with weights downloaded from: https://deepcell-data.s3-us-west-1.amazonaws.com/ model-weights/Multiplex_Segmentation_20200908_2_head.h5. Inputs of segmentation were DRAQ5 (nuclear) and CD45 (membrane). Signals from the images were capped at 99.7th percentile, with prediction parameter model_mpp = 0.8. Lateral spillover signals were cleaned using REDSEA (12) with the whole cell compensation flag as previously described. To clean out aggregated B220 signals in the dataset, B220 signal inside the cytoplasm (defined by 7 pixels towards the inside of the cell boundary), was removed. Afterwards, cells with DRAQ5 signal value less than 80 were removed and signals were scaled to 0-1, with percentile cutoffs of 0.5% (floor) and 99.5% (ceiling). Cells were subsequently clustered via Seurat, using CODEX markers: CD45 , Ly6C, TCR, Ly6G, CD19, CD169, CD3, CD8a, F480, CD11c, CD27, CD31, CD4,  IgM, B220, ERTR7, MHCII, CD35, CD2135, NKp46, CD1632, CD90, CD5 COVID-19 human tissue specimen collection. Lung tissues from patients who succumbed to COVID-19 were obtained during autopsy at the University Hospital Basel, Switzerland. Tissues were processed as previously described (13) and collection was approved by the ethics commission of Northern Switzerland (EKNZ; study ID #2020-00969). All patients or their relatives consented to the use of tissue for research purposes. Tissue microarrays were generated from these tissue samples in-house at the University Hospital Basel, Switzerland.
Preprocessing and analysis of COVID patient macrophage datasets. CODEX on COVID-19 samples from University Hospital Basel: CODEX acquisition of the COVID-19 tissue microarrays were performed, and post-processing and cell type annotation executed as previously described (14,15). Data from 23 COVID-19 patients (76 tissue cores; manuscript in preparation) were acquired, and a total of 62,852 macrophages that were annotated were used for MARIO matching. Processed counts of CITE-seq data acquired with a panel of 250 antibodies from bronchoalveolar lavage fluid washes from COVID-19 patients (VIB/Ghent University Hospital) was acquired from COVID-19 Cell Atlas (16) . Cells from 7 COVID-19 patients (COV002; COV013; COV015; COV024; COV034; COV036; COV037) were selected, clustered, and manually annotated on a per patient level based on their protein features, using Seurat as previously described. A total of 16,090 macrophages were annotated and used for subsequent MARIO matching. During MARIO matching, CODEX macrophages were matched against CITE-seq macrophages, with the MARIO running parameters: n_components_ovlp = 25, n_components_all = 25, sparsity = 1000, bad_prop = 0.1, and n_batch = 20. CODEX macrophages were clustered based on their matched C1Q mRNA expression levels (C1QA, C1QB and C1QC) using the function hcut() with k = 2 and stand = TRUE in the R package factoextra. Heatmaps were produced with the scaled values from CITE-seq or CODEX, via function heatmap.2() in R package gplots. Cell-cell interaction and binned anchor analysis were performed as previously described (17). In brief, for each individual C1Q High or Low macrophage, the Delaunay triangulation for neighboring cells (within 100µm) was calculated based on the XY position with the deldir R package. To establish a baseline distribution of the distances, cells were randomly assigned to existing XY positions, for 1000 permutations. The baseline distribution of the distance was then compared to the observed distances using a Wilcoxon test (two-sided). The log2 fold enrichment of observed mean over expected mean for each interaction type was plotted for interactions with a p-value < 0.05. The test results also includes the interactions in both directions (eg. Myeloid => T and T => Myeloid). For the binned anchor analysis of C1Q High or Low macrophages, all cells within a 100µm range were extracted and the average percentage of specific cell types in each radius bin (in 16.66um increments) were calculated and plotted. Differential expression gene analysis was performed using the function FindMarkers() in the R package Seurat. The violin plot of DE genes were created with ggplot2, where mRNA expression values were normalized between 0-1 for visualization purposes. GO term analysis was conducted via the Gene Ontology tool (18,19) (with the biological process option activated), with the input as lists of genes that were either significantly upregulated in C1Q High or Low macrophages. Heatmaps of the expression pattern of differentially expressed ISG genes (identified via FindMarkers()), filtered using a list of 628 ISGs (20,21) with functional annotations 67 in macrophages, was plotted with the function heatmap.2() from the R package gplots. Correlations between C1QA macrophage percentages and neutrophil percentages were calculated with the R function cor() with method spearman.
PANINI Validation with COVID-19 Lung Tissue Samples. Protease-free combined ISH + antibody validation experiments using PANINI as previously described (17). In brief, TMA cores cut onto glass coverslips were baked at 70°C for 1hr and then transferred to 2 × 5 min xylene washes, followed by deparaffinization steps 2 × 100% EtOH, 2 × 95% EtOH, 1 × 80% EtOH, 1 × 70% EtOH, 3× ddH2O; 3 min each. Heat induced epitope retrieval was then performed at 97°C for 10 min using the pH-9 Dako Target Retrieval Solution (Agilent, S236784-2) in a Lab Vision PT Module (Thermo Fisher Scientific). Slides were cooled to 65°C in the PT Module and then removed for equilibration to room temperature. A hydrophobic barrier was drawn around the tissue using the ImmEdge Hydrophobic Barrier pen (Vector Labs, 310018  Images were collected using a Keyence BZ-X710 inverted fluorescent microscope (Keyence, Inc) configured with 4 fluorescent filters (Hoechst, Cy3, Cy5 and Cy7), and a CFI Plan Apo l 20x/0.75 objective (Nikon). The Imaging setting was: 3 × 5 tile per tissue core, 5 Z-stacks acquired each FOV (best focused plane used), with High Resolution setting. The exposures were: 1/50s (Hoechst), 1/250s (Cy3), 1/8s (Cy5), and 6s (Cy7). Segmentation was performed with a local implementation of Mesmer (11), with weights downloaded from: https://deepcell-data.s3-us-west-1.amazonaws.com/ model-weights/Multiplex_Segmentation_20200908_2_head.h5. Inputs of segmentation were Hoechst (nuclear) and C1QA + CD68 + CD15 (membrane). Signals from the images were capped at the 99.7th percentile, with prediction parameter model_mpp = 0.8. Features from single cells in segmented Keyence images were extracted based on the segmentation generated above, scaled by cell size, and written out as FCS files. Cells were filtered out if too large (CellSize > 500 pixels), too small (CellSize < 45 pixels) or limited in nuclear signal (Hoechst < 3500). The signal threshold of CD15, CD68 and C1QA positive cells were selected for each individual tissue core, and visually assessed to minimize false negative and false positive cells. Cells positive for CD68 and C1QA were annotated as C1Q High macrophages. The correlation of C1Q High macrophages between PANINI and CODEX experiments were calculated with the R function cor() with method spearman.
For spatial correlation analysis of C1QA expression in macrophages, the tissue core was divided into 100 sub-regions (a 10×10 grid), and the number of cells or C1QA signal level were summed in each individual region and plotted. Correlation was calculated with function cor() with method spearman.
Simulated data benchmarking. To evaluate MARIO performance on matching high granularity ground truth cells, we simulated proteomic data with SymSim (22). We generated a dataset consisting of twenty cell types with a total of 60 features simulated, using the function SimulateTrueCounts. We selected parameters that were previously used to simulate epitome type datasets (23): (ncells_total = 40000, min_popsize = 1000, i_minpop = 2, ngenes = 60, nevf = 10, evf_type= "discrete", n_de_evf = 5, vary = "s", Sigma = 0.2, prop_hge = 1, mean_hge = 2) In order to create simulated data coming from two different modalities, we then subsequently added batch effect to the simulated values via function DivideBatches: (nbatch = 2, batch_effect_size = 5) Then cells from each batch were separated, then randomly sampled 5000 cells from batch 1 as then modality 1 dataset, 20,000 cells from batch 2 as modality 2 dataset. To mimic the antibody panel design difference, simulated features 1-20 were shared across two batches, features 21-40 only available in modality 1, and features 41-60 only available in modality 2. Cells were then matched by different methods as described in previous sections, detailed related code can be found on our GitHub repository.
Preprocessing and analysis of human PBMC datasets. CyTOF data measuring 33 proteins of PBMC from healthy human donors in Hartmann et al (24) was downloaded from flow-repository ('FR-FCM-Z249, HD06_run1'). Cells were downsampled to 50,000, clustered using Seurat and manually annotated, and then a total of 38,866 annotated cells were used. CITE-seq data measuring 29 proteins of health human PBMC was retrieved from 10x genomics https://support.10xgenomics. com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3?. Counts were normalized via CLR normalization with Seurat function Normalizedata(), then cells were clustered based on their protein features in Seurat. A total of 5,241 cells were annotated and used for matching. Markers CD11b, CD127, CD14, CD16, CD19, CD25, CD27, CD3, CD4, CD45RA, CD45RO, CD56, CD8a, HLA-DR and PD-1 were shared across datasets. During matching, CITE-seq cells were used to match against CyTOF cells, where the input of CITE-seq cells were raw counts and the input of CyTOF cells were arcsine transformed with cofactor = 5. The MARIO parameters used were: n_components_ovlp = 10, n_components_all = 15, sparsity = 1000, bad_prop = 0.2, and n_batch = 1. Analysis was performed the same as previously described.
Preprocessing and analysis of cross species H1N1/IL-4 challenged datasets. Human H1N1 virus challenged data is the same as described in the previous section and the same set of cells were used as input to MARIO matching. IL-4 stimulation cross-species CyTOF data is the same cross-species dataset as described in the previous section, using the same human or animal donors as described above (human: "7826", "7718", "2810"; Rhesus macaque: "D00522", "D06022", "D06122"; Cynomolgus monkey: "D07282", "D07292", "D07322"), and the whole blood cells stimulated with IL-4. Cells gated as Erythrocytes, Platelets and CD4+CD8+ cells from the paper (9) were excluded from downstream matching and analysis. Each individual dataset was randomly downsampled to 120,000 cells, arcsine transformed with cofactor = 5, and subsequently clustered with Seurat using all the markers, followed by manual annotation and removal of cells with ambiguous annotations. Benchmarking on time and memory usage. Time and memory usage of MARIO on the datasets presented in Figure 2, 3, 4 were evaluated, on a linux server using Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (1 CPU during the MARIO steps). The full pipeline MARIO time usage (including initial and refined matching; best interpolation finding; joint regularized filtering; CCA calculation) was measured with the default parameters, with increasing amount of cells (50,000 cell max), and ratio of X and Y set to 1:4 (e.g. at total of 20,000 cells , X has 4000 cells and Y has 16,000 cells). The MARIO matching time usage (only including intial and refined matching) was measured with the same settings, but with three different sparsity levels: (1) minimal sparsity calculated by MARIO; (2) maximal sparsity (i.e., fully dense matching without sparsification); (3) "medium" sparsity which is in the middle point between minimal and maximum. The MARIO memory usage was measured with the same settings as the time evaluation, but the maximum number was set to 100,000 cells. The peak memory usage was measured by the function profile in the python package memory_profiler. The influence of sparsity level used on MARIO matching accuracy was evaluated by inputting different levels (between minimal and maximal sparsity detected by MARIO). A total of 50,000 cells were used for each dataset with a ratio between X and Y being 1:4.
Benchmark distance matrix construction methods for MARIO. To evaluate the performance of MARIO with different distance matrices, we benchmarked the initial matching accuracy of MARIO, with distance matrices constructed with linear or non-linear kernels: Pearson correlation, as described in the previous section; non-linear kernels were calculated using the python package sklearn with default setting. For gaussian kernel, function rbf_kernel was used, with default setting; for Polynomial kernel, function polynomial_kernel was used, with default setting; for Laplacian kernel, function laplacian_kernel was used, with default setting; for Sigmoid kernel, function sigmoid_kernel was used, with default setting. Then these distance matrices were used in MARIO initial matching, and matching accuracy was evaluated.
Benchmark against optimal transport. To evaluate optimal transport matching performance on cross modality single cell proteomic datasets, we utilized SpaOTsc (25), which is initially designed to infer spatial and signaling relationships between cells from single cell transcriptomic data, using location information captured from spatial transcriptomic datasets. To adapt the method to our scenario, we switched the originally required spatial location distance matrix to a distance matrix constructed by Pearson correlations of protein features between cells within the same modality. The four input matrices used as input for function spatial_sc and transport_plan in python package spaotsc were: df_sc: data from modality 1, with n row of cells, and features from modality 1. is_dmat: originally required spatial distance matrix, switched to a dissimilarity matrix of cells (m x m), calculated from 1 -Pearson correlation of protein features, where cells are from modality 2. cost_matrix: dissimilarity (n x m) between cells from modality 1 and 2, calculated with shared features between modality 1 and 2, from 1 -Pearson correlation. sc_dmat: dissimilarity matrix within modality 1 (n x n), similarly calculated as is_dmat. The produced gamma mapping matrix (n x m) was used to find the matching information, where for each cell in modality 1, the cell from modality 2 that has the highest gamma value was considered the match for that cell.