Abstract
Part of the flow/mass cytometry data analysis process is aligning (matching) cell subsets between relevant samples. Current methods address this clustermatching problem in ways that are either computationally expensive, affected by the curse of dimensionality, or fail when population patterns significantly vary between samples. Here, we introduce a quadratic form (QF)based cluster matching algorithm (QFMatch) that is computationally efficient and accommodates cases where population locations differ significantly (or even disappear or appear) from sample to sample. We demonstrate the effectiveness of QFMatch by evaluating sample datasets from immunology studies. The algorithm is based on a novel multivariate extension of the quadratic form distance for the comparison of flow cytometry data sets. We show that this QF distance has attractive computational and statistical properties that make it well suited for analysis tasks that involve the comparison of flow/mass cytometry samples.
Similar content being viewed by others
Introduction
Most flow and mass cytometry applications in biomedical studies are based on comparisons between/among control and test samples. Dissimilarities between/among samples may be due to drug treatment regime, progression of disease, response to therapies, etc. To define these dissimilarities across samples, the populations of cells in each sample are usually clustered to reveal phenotypically distinct cell subsets that can then be matched, quantified and compared between samples.
Traditionally, this type of cluster analysis has been done by manually gating the data into arbitrary clusters. These methods have proven effective in a gross sense but are both subjective and extremely laborious, making them difficult to apply with current highdimensional (HiD) data sets. The need to facilitate these analyses, and make them more accurate, has motivated development of automated clustering and cluster matching methods for HiD flow and mass cytometry data.
Both of these tasks (cluster identification and cluster matching) are highly challenging because they are subject to the “curse of dimensionality”, a wellknown statistical problem for HiD data that compromises both statistical validity and computational performance^{1,2}. Here, we discuss the limitations of currently available methods for cluster matching applications, and demonstrate that employing a multivariate extension of the quadratic form distance^{3} overcomes key limitations.
Existing methods address the cluster matching problem in two different ways, both of which have limitations. The first way is clustering one sample at a time and aligning/matching the cell subsets (clusters) present in multiple samples postclustering (e.g., as is done in the FLAME analysis^{4} and flowMatch package^{5}). This conventional approach allows fast computational implementations in low dimensions. However, this approach can fail if the locations of the populations (clusters) significantly vary from sample to sample, or if populations disappear or appear between samples. Further, when clustering is performed in HiD settings, this approach can be compromised by the curse of dimensionality.
The second approach (e.g., Joint Clustering and Matching^{6}, ASPIRE^{7}) alleviates some of these problems by creating a HiD template of metaclusters (distinct biologicallyrelevant cell types) in which all sample data are pooled, simultaneously clustered and then matched.
With these methods, multiple samples are treated as different realizations of a single underlying model reflecting the biological reality. However, apart from being computationally expensive, the majority of methods in this category identify clusters by fitting mathematical models to datasets. The feasibility of fitting in these cases, however, is dramatically affected by the curse of dimensionality, since the number of combinations of possible parameters to be considered increases dramatically as the number of dimensions increases above three or four.
Thus, although the existing methods offer solutions to some aspects of the clustermatching problem, they still do not fully accommodate realworld flow/mass cytometry data. To pave the way toward a more robust solution of this problem, we developed QFMatch  a cluster matching method based on the quadratic form (QF) distance measure. QFMatch matches cell subsets (clusters) present in multiple samples postclustering. However, it accommodates cases when the location of a population varies significantly from sample to sample in twodimensional display, or when populations disappear or appear between samples.
The key to our methodology is a new multivariate version of the quadratic form distance for the comparison of flow cytometry samples. Since such comparisons are a fundamental part of the analysis these data, there has been an active interest in developing suitable distance measures^{3,4,8,9,10}. The quadratic form distance has several properties that make it an attractive candidate for these tasks: it is easy to implement, it can be computed quickly, and, as will be shown below, it possesses certain properties that are relevant for a meaningful comparison of flow cytometry distributions.
Results
The QF Distance
As pointed out in Orlova et al.^{10}, a dissimilarity measure needs to possess certain properties in order to provide a biologically meaningful comparison between flow cytometry samples. In particular, it needs to satisfy the properties of a metric as well as a continuity property: small changes in subset location (e.g., due to instrument drift) or subset frequency should be reflected as only small changes in the dissimilarity measure. This continuity property makes it possible to distinguish biologically relevant differences from small differences due to instrument drift and other irrelevant factors. Orlova et al.^{10} point out that this requirement rules out the use of dissimilarity measures based on pvalues from many standard statistical tests, and they demonstrate that the Earth Mover’s Distance (EMD) is a suitable dissimilarity measure for comparing biomarker expression levels in cell populations. Unfortunately, the EMD is computationally intensive. Bernas et al.^{3} propose to use the quadratic form (QF) distance (Hafner et al.^{11}) to quantify the dissimilarity between two univariate histograms of flow cytometry data:
Here h_{ i } and f_{ i } are the relative frequencies of the two histograms pertaining to the histogram bin indexed by i. That is, the two histograms employ the same bins and
The matrix A = [a_{ ij }] is chosen to reflect the spatial dissimilarity between bins i and j. Note that in order to be a metric, D^{2}(h, f) needs to be nonnegative, which restricts the choices for the similarity matrix A. (Positive definiteness of A is sufficient but not necessary as
see Hafner et al.^{11}). We point out that not all of the choices of A proposed in Bernas et al.^{3} satisfy this condition. Here we show that an appropriate choice of A not only guarantees that the QF is nonnegative, but that it futhermore results in a monotonic behavior that mirrors the continuity condition given above. We also show how the QF distance can be extended in a computationally simple way to a multivariate, even highdimensional, situation. Therefore this QF distance shares the advantagous properties of the EMD, but it is conceptually much simpler to implement and, importantly, it can be computed quickly.
In more detail, we propose to use
where \({{\rm{d}}}_{{{\rm{M}}}_{{\rm{ij}}}}\) is the Euclidean distance between centers of mass (calculated on combined samples) of the ith and jth bins, and d_{max} is the maximum value of all the \({{\rm{d}}}_{{{\rm{M}}}_{{\rm{ij}}}}\). Note that this is a generic definition that applies to the univariate as well as the multivariate setting; we will discuss an appropriate binning scheme below. It then follows from a result in Hafner et al.^{11} that for this choice of A the QF distance D^{2}(h,f) is always nonnegative. Furthermore, employing a matrix A with offdiagonal elements that depend on the spatial distance between the bins in a suitable way, such as a_{ij} given above, has the effect that D^{2}(h,f) not only increases with the magnitude of hf, but also with the spatial distance of the nonzero elements of hf. While this is not exactly mathematically equivalent to the continuity condition stated above (one can mathematically construct counterexamples where the continuity condition fails, but these counterexamples appear not to be practically relevant), it results in the desired behavior of D^{2}(h,f) that allows a biologically meaningful quantification of the difference between the two samples as is demonstrated with experimental results below. This property is not shared by some other common distance measures such as the chisquare distance
which may fail to increase even as the spatial distance between the two populations increases, see Tables 1 and 2 (corresponding code is available at https://github.com/zq00/QFMatchsimulation).
The relative frequencies h_{ i } and f_{ i } for bin i are computed once the kdimensional measurement space is partitioned into bins. We propose to use adaptive binning^{8} on the combined sample, i.e. we merge the two samples for the construction of the bins only. Adaptive binning is a method for dividing kdimensional data into bins such that all bins contain the same number of events. This strategy results in bins of variable size that “adapt” to the structure of the data. The algorithm begins by calculating the median and variance of the data for each of the kdimensions included in the comparison. Next, we select the dimension j with the maximum variance and divide the data in half along the median value of that parameter, such that each bin contains an equal number of data points. The algorithm proceeds recursively until a predefined threshold is met (e.g., minimum number of data points per bin). This results in a collection of kdimensional hyperrectangular bins, with each bin containing an equal number of data points. This recursive binning scheme is quite straightforward to implement and can be computed very fast, with the dimension k of the measurement space affecting the computational complexity only linearly.
Experimental performance of the QF distance and comparison with EMD and the chisquare distance
Using synthetic datasets (Fig. 1a and b), we have verified that the QF score increases smoothly and monotonically with the growing separation between two subsets (see Fig. 1c and d). This property of the QF is very important, since it ensures that small differences between clusters (subsets) either in subset location (e.g., due to instrument drift) or subset frequency will be reflected as small changes in the QF score. In general, this is a critical property for cluster matching approaches designed to analyze flow cytometry and similar datasets, where small changes due to instrument noise, calibration, etc. are very common. This property insures that biologically similar samples with minor data aberrations, e.g., caused by shifts in flow instrument configuration during data collection, will still be well aligned.
We conducted a simulation study to compare the performance of QF with the EMD^{10} and the chisquare distance (equation 5) which is a popular dissimilarity score. We evaluated these three distance measures for comparing simulated data from a multivariate standard normal distribution N(0,I) with simulated data from
which represents the situation where a subset consisting of a fraction p of the data was moved by an amount u. Tables 1 and 2 give the values for these dissimilarity scores for various choices of p and u, and for sample sizes 10,000 and 100,000 in dimensions 2 and 20. It is seen that QF behaves quite similarly to EMD, both in its monotonic behavior as u increases as well as in the threshold that u needs to exceed before one can confidently declare that the two distributions are different. In contrast, the chisquare distance is less sensitive to detect this difference for small u, and it is also less able to discriminate large values of u. For example, the chisquare distance is very close to 2 for both u = 5 and u = 10. This is a drawback that is inherent in the definition of the chisquare distance: it will not reflect the size of the spatial separation if the two populations do not overlap. This makes the chisquare distance illsuited for the comparison of flow and mass cytometry data.
The simulation study supports the conclusion that the QF shares the favorable properties of the EMD for the comparison of flow and mass cytometry data while having the advantage that it is conceptually much simpler to implement and that it can be computed much faster.
The QF Match algorithm for multidimensional cluster alignment
We now apply the multivariate QF distance to align subsets (clusters, subpopulations) between two samples. That is, the algorithm will take as input two flow cytometry samples, each of which has been subset beforehand either by a manual or an automated gating algorithm (Fig. 2a). Our QFMatch algorithm for cluster alignment consists of six steps:
Step 1: We do adaptive binning^{8} on the combined samples as decribed in the previous section (Fig. 2b).
Step 2: We apply the binning pattern derived in Step 1 to each of the two samples (Fig. 2c). For each cluster in each sample, we then construct a histogram using the bins from Step 1. (Thus each histogram pertaining to a cluster has total relative frequency equal to 1.)
Step 3: For each combination of two clusters, where one cluster is from sample 1 and the other is from sample 2, we calculate a dissimilarity score based on the quadratic form distance, see Fig. 3.
Step 4: We treat the cluster pairs with the smallest dissimilarity score (marked in green on Fig. 3a) as matched. The remaining clusters in each sample are automatically treated as merging candidates (cluster id is marked in pink on Fig. 3a) for the clusters in the same sample. During this process, each merging candidate is combined with its nearest cluster in the same sample (i.e., both clusters have the smallest dissimilarity score to one of the clusters from the other sample, marked in blue on Fig. 3a). The dissimilarity score is then recalculated again.
Step 5: A decrease in the initial dissimilarity score as a result of the merging process on Step 4 indicates that the cluster was split (marked in violet on Fig. 3b). The increase of dissimilarity score values that occurs as a result of the merging process indicate missing cluster(s).
Step 6: For each pair of matched clusters, we add information regarding relative frequency and the distance between geometric means of these clusters in each dimension (Fig. 3b). This helps to accommodate cases when for example each sample has only one subset and these subsets belong to different categories (i.e., different cell populations).
In our current implementation, each pair of matched clusters can be further compared to reveal global (dis)similarity by sequentially matching these clusters in each possible combination of the most informative dimension pairs (see Results section).
The algorithm sensitivity and performance
To assess the sensitivity of QF to binning parameters, we randomly picked and analyzed three samples (“0 stdevs”, “4 stdevs”, and “8 stdevs”) from the synthetic dataset (Fig. 1b) for a range of bin sizes. The absolute event count for each sample is 3200 (which becomes 6400 when samples are merged for the binning step, see Fig. 2b). We binned the data using n = 16; 32; 64; 128; and 256 bins and plotted the results for QF score (Fig. 4a) and the corresponding running time (Fig. 4b). Running time increased linearly with the number of bins while QF values remained relatively constant.
Overall, these data indicate that QFMatch is robust in the choice of the number of bins. However, choosing the appropriate number of bins is a tradeoff between the algorithm’s speed and the binning resolution of the sample. Matching samples that contain small populations of cells may require finer binning than matching samples with larger populations only. For samples that contain small cell populations (e.g., cluster #3 in Sample 2, Fig. 2a), we chose a number of bins such that there are 2log_{2}N events per bin, where N is the number of events in the smallest cell subset.
In this study, all running time calculations for the QFMatch algorithm implemented in Python (https://github.com/dyorlova/QFMatch; MATLAB implementation is available at http://cgworkspace.cytogenie.org/GetDown2/domains/FACS/QFMatchStandAlone.pdf) were performed on a 3.1 GHz Intel Core i7 with 16 GB of RAM running Mac OS X 10.11.6. With this implementation, it took a few seconds to cluster data with a twodimensional densitybased merging (DBM) algorithm^{12} and 14 seconds to align BALB/c with RAG^{−/−} (~268 000 cells total, see Results section below for more detail about these samples) with QFMatch using 256 bins. This example provides a sense of QFbased cluster matching algorithm speed in our current implementation.
We also showed that the QF dissimilarity score is invariable with sample size (Fig. 4c). Thus, we increased the size of “0 stdevs”, “4 stdevs”, and “8 stdevs” from the original sample size (3200 events including 200 events corresponding to the small green population, see Fig. 1b) 2–6 times and aligned samples “4 stdevs” and “8 stdevs” with “0 stdevs” sample using 128 bins.
Appying the QFMatch cluster alignment algorithm utility to flow cytometry data
We used three real datasets to demonstrate how the analysis pipeline, which includes the QFbased cluster matching aglorithm, can be used to do automated clustering and alignment of cell populations identified in flow cytometry data. The same data analysis workflow was used in the three examples discussed below (see Materials and Methods).
Matching of cell subsets between patient samples, even when relative cell frequencies differ by one order of magnitude and marker expression levels vary between patients
Here we present two examples demonstrating that QFbased algorithm successfully matches cell subsets that vary significantly between samples.
In the first example, we used fluorescence flow cytometry dataset collected in the frames of basophils activation study^{13}. Basophil marker expression levels commonly differ from sample to sample. In the original study^{13}, the authors use the surface level of CD123 as a phenotypic marker to identify peripheral blood basophils. The expression of this marker commonly varies from patient to patient (e.g., in Fig. 5a, MFI varies from 1033 to 6672). Importantly, QFMatch is capable of aligning such clusters, even when MFI values and the size of the basophil vary from one sample to another. Basophil populations in these patient samples have relative frequencies of the same order of magnitude (Fig. 5a) while the distance between their geometric means is significant ([0.6; 0.7]σ when comparing Patient 1 with Patient 2 and [5.1; 1.6]σ between Patient 1 and Patient 3).
In the second example, we used a mass cytometry dataset collected in an acute myeloid leukemia (AML) pathophysiology study^{14}. In the original study, the authors quantitated CD34+Cd45mid cells in five healthy controls (H) and sixteen AML patients (SJ). Both the relative frequiency of CD34+ Cd45mid cells and the expression of these markers significantly vary between healthy controls and patients, and from patient to patient (in the representative example shown on Fig. 5b, CD34 MFI varies from 36 to 144; CD45 MFI varies from 74 to 500; relative frequency of CD34+Cd45mid cells varies from 4.7% to 58%). QFMatch is capable of aligning such clusters, even when MFI values and the frequencies of the CD34+Cd45mid cells vary from one sample to another.
Detection of missing lymphocyte populations in the peritoneal cavity of RAG knockout (RAG^{−/−}) mice
We aligned samples of wildtype (BALB/c) and knockout (RAG^{−/−}) mouse peritoneal cavity cells (PerC) based on cell surface expression of CD5 and CD19, which respectively indentify T and B lymphocytes, i.e., CD5^{hi}CD19^{−} and CD19^{hi}CD5^{lo}/^{−} (Fig. 6a). We computed QF scores (Fig. 6b) that compare data for the wildtype reference sample (BALB/c) and for the sample that completely lacks T and B lymphocytes (RAG^{−/−}). The unmatched clusters were automatically considered as merging candidates by the algorithm. However, this process didn’t decrease any of the initial dissimilarity scores, thus confirm that RAG^{−/−} mice completely lack T and B lymphocytes.
Cluster matching of the murine lymphoid, myeloid and granuloid lineages between PerC and spleen
Using a dataset for side scatter (SScA, which correlates with cell granularity) and CD11b surface marker measures, we matched mouse PerC and spleen samples to explore differences in the representation of the lymphoid, myeloid and granuloid subsets. This is a good case to test the cluster matching algorithm because the type of immune cells present in PerC and spleen are quite different from one another, i.e., naïve spleen lacks virtually all mast cells and small and large peritoneal macrophages (SPM and LPM, respectively) whereas naïve PerC have very few monocytes and neutrophils but still share some types of immune cells, including dendritic cells (DC), natural killer cells (NK), and eosinophils. Note that the QFMatch algorithm successfully aligned the immune cell subsets that are shared between spleen and PerC (marked in the same colors in Fig. 7) and detected the missing cell subsets (marked in different colors in Fig. 7).
Discussion
Population matching is one of the most important analytical tools used in the flow/mass cytometry data analysis pipeline in a variety of research/clinical settings. Recently developed cluster matching methods intended for this purpose can be informally divided into two types:

1.
Separate clustering and matching. This type of approach, used for example in FLAME^{4} and PhenoGraph^{14}, identifies cluster locations in each individual sample (e.g., by using mixture models or by constructing a graph and using modularity optimization to cluster it). It then pools these cluster locations for all of the samples in a given class, for example “healthy controls”, and clusters again (e.g., by using partitioning around medoids or by constructing a graph from pooled preclustered data) to construct a HiD template of metaclusters.
As discussed in detail in^{15} and^{7}, this type of approach is likely to fail when the population pattern varies significantly between samples (i.e., population locations differ significantly or even (dis)appear from sample to sample). Thus, a metacluster corresponding to distinct cell subsets can split into multiple subclusters if extraneous clusters appear in one (or more) of the samples. Further, graphbased methods, such as PhenoGraph^{14} can fail to assign phenotypically distinct cell subsets to distinct metaclusters (Fig. 8b). Finally, because the partitioning of individual samples into clusters is performed independently from sample to sample, even quite similar samples may be poorly aligned.

2.
Joint clustering and matching. This type of approach, including Joint Clustering and Matching^{6,15}, FlowSOM^{16} and ASPIRE^{7}, aligns cell populations based on direct modelling of contributions from individual and grouped samples. It can be thus considered as “hierarchical extension of statistical mixture models”, since it applies a hierarchical (multilevel) model that incorporates information from both the individual and group levels when fitted to flow cytometry data. In these settings, an individual sample is considered a noisy realization of a more general biological population mixture. For example, individual samples could be modeled using a Dirichlet process Gaussian mixture model approach^{17} and linked through hierarchical prior. As we show on Fig. 8c, joint clustering and matching approaches, such as FlowSOM^{16}, can fail to assign phenotypically distinct cell subsets to distinct metaclusters.
Both of these types of cluster matching methods rely heavily on fitting mathematical models to identify and match clusters. Thus, they are dramatically hindered by the curse of dimensionality because the number of combinations of parameters increases dramatically as the number of dimensions increases above three or four. Additionally, these methods are quite computationally demanding and often rely on a heuristic to tune a set of input parameters (see Fig. 8).
To address the key problems mentioned above, we improve on a principle shared by most existing cluster matching methods, i.e., the use of (dis)similarity measures between cell populations. Most current methods rely on different types of (dis)similarity measures, including (1) Joint Clustering and Matching^{6}, which is based on a symmetric form of the KullbackLeibler (KL) divergence; (2) a flowMatch package^{5}, which employs Euclidean distance, Mahalanobis distance and KL divergence for computing the dissimilarities between clusters; and, (3) another commonly used package, FLAME^{4}, which relies on a solution of minimum cost bipartite matching (essentially minimum Euclidean distance and corresponding weights difference to solve).
In our previous paper^{10}, we argued that in order to be biologically/biomedically informative, the (dis)similarity measure should satisfy the following criteria: (1) it must possess the properties of a metric (nonnegative symmetric functions that satisfy the triangle inequality and the axiom of coincidence); (2) it should distinguish biologically significant differences from small differences due to instrument drift or other irrelevant factors; (3) it should be nonparametric, to account for the complex structure of the cell populations commonly found in flow cytometry data; and, (4) it should be computationally efficient, so that modern high throughput analyses can be performed quickly. However, constraint #2 (the need to distinguish biologically significant differences) is the most critical for flow cytometry and similar datasets. This constraint basically rules out most of the current approaches.
In^{10}, we demonstrated that distance metrics (e.g., Earth Mover’s Distance (EMD)), which take into account changes in both location and frequency rather than just changes in one or the other, are the most suitable and accurate methods for comparing multivariate nonparametric flow cytometry data distributions. However, EMD is computationally complex. Further, the algorithm can be slow for practical applications of cluster matching tasks in HiD flow cytometry.
To overcome this speed limitation, we developed the computationally efficient QFbased method (QFMatch) defined here, which takes changes in location and frequency into account and is insensitive to small changes caused by instrument noise. QFMatch also satisfies criteria (1)(4) discussed above.
Applying the QF distance measure to flow cytometry data was originally suggested by Rajwa’s group^{3}. However, this group developed the method only for a onedimensional case and didn’t apply it directly to cluster matching tasks. Here, we have further developed the method and made it applicable to HiD flow and mass cytometry cluster matching tasks.
The QF approach described here can be used with any number of dimensions since it is based on the adaptive binning that avoids the curse of dimensionality by recursively splitting sample along the axis with the highest variation. Thus, the dimensionality enters in the computation at most linearly. Furthermore, computing the QF does not depend on the dimensionality at all, as in the QFbased algorithm we just sum over bins. Therefore, computing the QF does not suffer from the computational curse of dimensionality.
The QF approach can be used with any method that enables valid identification and isolation of cellular (or other) subset in which markers are expressed. Typically, this clustering task is subject to the curse of dimensionality. We avoid this curse here by coupling the QFbased cluster matching method with a twodimensional densitybased merging (DBM) clustering algorithm^{12}.
We have now implemented QFMatch in a provisional flow cytometry data analysis package that we make freely available (no charge) at CytoGenie.org to users at nonprofit organizations (e.g.,.edu and.gov). In its current implementation, QFMatch aligns cell clusters across a pair of samples. However, this method can be further extended to work with large collections of samples, e.g., the QF dissimilarity measure can be used to construct templates of metaclusters for samples that belong to one class and can further be used to align these templates of metaclusters between classes.
Materials and Methods
Experiment overview
We use QF to match subsets between relevant samples (same staining panels) within the biological/biomedical datasets described below.
Flow sample description
Human and mouse datasets shown in Figs 5 and 6 were generated in previously published studies (see refs^{13,14,18} for complete materials and methods). Access to the data was provided by the investigators responsible for the studies. Human subject guidelines are described in Gernez et al.^{13} and Levine et al.^{14}. Patient records/information was anonymized and deidentified prior to acquisition for these studies.
Mouse PerC and spleen datasets shown in Fig. 7 were explicitly generated for this study using adult (>8 wks) naïve wildtype C57BL/6 strain. Mouse studies were approved by the Stanford Animal Care and Use Committee and are in compliance with the Administrative Panel on Laboratory Animal Care guidelines. Peritoneal cells were harvested by injecting 6 ml of custom RPMI1640 media into the peritoneal cavity. Spleens were mechanically disrupted to obtain single cell suspension. Cells were filtered over a 70 µm nylon filter and erythrocytes were lysed using ACK buffer. Cells were resuspended at 100 × 10^{6} cells/ml and stained on ice for 30 min with a reagent panel that detects a total of 12 parameters. Stained cells were resuspended in 10 ug/ml propidium iodide (PI) to enable exclusion of dead cells.
Datasets presented on Figs 5a, 6 and 7 were stored immediately after collection into a stable longterm archive maintained by the Stanford Shared FACS Facility. See figure legends for gating strategy.
Instrument details
Information about instruments used to collect human and mouse samples can be found in^{13,14,18}. PerC and Spleen cells were analyzed on Stanford Shared FACS Facility instruments (BD LSRII) equipped with 4 lasers (405 nm, 488 nm, 532 nm, and 640 nm) and 19 PMTs. Data were collected for 0.2 × 10^{6} to 2 × 10^{6} cells.
Data analysis details
The proposed workflow for analyzing all four datasets used in this manuscript consists of two steps:

(1)
Preprocess the data by sequentially using utilities available in AutoGate^{19} (http://CytoGenie.org/) to compensate the data (fluorescence flow cytometry data only), transform it with the Logicle transformation^{20}, and cluster the transformed data with DBM^{12}. See figure legends for gating sequences. The flow cytometry data prepocessing methods used here do not require user input for parameters such as number of clusters, number of grid bins, manual gating for compensation purposes, etc.

(2)
Use QF to match cell populations of interest, for example, populations of eosinophils (see Fig. 7). The QFbased cluster matching algorithm is integrated into AutoGate (http://CytoGenie.org/).
Combining Logicle transformation, DBM for cell population identification, probability binning, and QF provides a complete pipeline for cluster matching of flow cytometry samples. However, we would emphasize that the QF approach for cluster matching of flow cytometric subsets works independently of how the population was defined here. For example, the clusters could be defined by using domain knowledgedriven manual gating, a sequential automated clustering approach, or a simultaneous clustering approach.
Data Availability
The datasets generated during and/or analysed during the current study are available in the FlowRepository and Cytobank:
https://flowrepository.org/id/FRFCMZY3B, https://www.cytobank.org/nolanlab/reports/Levine2015.html (CyTOF AML PhenoGraph manually gated CD34 x CD45 AML blast populations, Data S2E).
References
Orlova, D. Y., Herzenberg, L. A. & Walther, G. Science not art: statistically sound methods for identifying subsets in multidimensional flow and mass cytometry datasets. Nature Reviews Immunology. https://doi.org/10.1038/nri.2017.150.
Hastie, T., Tibshirani, R. & Friedman, J. Local methods in high dimensions in The elements of statistical learning. 22–27 (SpringerVerlag, 2009).
Bernas, T., Asem, E. K., Robinson, J. P. & Rajwa, B. Quadratic form: a robust metric for quantitative comparison of flow cytometric histograms. Cytometry A. 73, 715–726 (2008).
Pyne, S. et al. Automated highdimensional flow cytometric data analysis. Proc. Natl Acad Sci USA 106, 8519–8524 (2009).
Azad, A. & Pothen, A. flowMatch: Cell population matching and metaclustering in flow. https://www.bioconductor.org/packages/devel/bioc/vignettes/flowMatch/inst/doc/flowMatch.pdf (2017).
Lee, S. X., McLachlan, G. J. & Pyne, S. Modeling of intersample variation in flow cytometric data with the joint clustering and matching procedure. Cytometry A. 89(1), 30–43 (2016).
Dundar, M., Akova, F., Yerebakan, H. Z. & Rajwa, B. A nonparametric Bayesian model for joint cell clustering and cluster matching: identification of anomalous sample phenotypes with random effects. BMC Bioinformatics. 15, 314 (2014).
Roederer, M., Moore, W., Treister, A., Hardy, R. R. & Herzenberg, L. A. Probability binning comparison: a metric for quantitating multivariate distribution differences. Cytometry. 45, 37–46 (2001).
Cox, C., Reeder, J. E., Robinson, R. D., Suppes, S. B. & Wheeless, L. L. Comparison of frequency distributions in flow cytometry. Cytometry. 9, 291–298 (1988).
Orlova, D. et al. Earth Mover’s Distance (EMD): a true metric for comparing biomarker expression levels in cell populations. PLoS One. 11(3), e0151859 (2016).
Hafner, J., Sawhney, H. S., Equitz, W., Flickner, M. & Niblack, W. Efficient color histogram indexing for quadratic form distance functions. IEEE Trans. Pattern Anal. Machine Intell. 17, 729–736 (1995).
Walther, G. et al. Automatic clustering of flow cytometry data with densitybased merging. Adv Bioinformatics. 686759; https://doi.org/10.1155/2009/686759 (2009).
Gernez, Y. et al. Blood basophils from cystic fibrosis patients with allergic bronchopulmonary aspergillosis are primed and hyperresponsive to stimulation by Aspergillus allergens. J Cyst Fibros. 11, 502–510 (2012).
Levine, J. H. et al. DataDriven Phenotypic Dissection of AML Reveals Progenitorlike Cells that Correlate with Prognosis. Cell. 162(1), 184–197 (2015).
Cron, A. et al. Hierarchical modeling for rare event detection and cell subset alignment across flow cytometry samples. PLoS Comput Biol. 9(7), e1003130 (2013).
Van Gassen, S. et al. FlowSOM: Using selforganizing maps for visualization and interpretation of cytometry data. Cytometry A. 87(7), 636–645 (2015).
Escobar, M. D. & West, M. Bayesian density estimation and inference using mixtures. JASA. 90(430), 577–588 (1995).
Ghosn, E. E. et al. Two physically, functionally, and developmentally distinct peritoneal macrophage subsets. Proc Natl Acad Sci USA 107(6), 2568–2573 (2010).
Meehan, S. et al. AutoGate: automating analysis of flow cytometry data. Immunol Res. 58(23), 218–223 (2014).
Moore, W. A. & Parks, D. R. Update for the logicle data scale including operational code implementations. Cytometry A. 81(4), 273–277 (2012).
Pe’er, D. PG. Available at https://www.c2b2.columbia.edu/danapeerlab/html/phenograph.html. Date accessed: September 2017.
cytofkit: Quick Start. Available at https://www.bioconductor.org/packages/devel/bioc/vignettes/cytofkit/inst/doc/cytofkit_example.html. Latest Release: Septemer 28, 2017. Date accessed: September 2017.
Acknowledgements
We thank the members of the Herzenberg laboratory for advice and help in the interpretation of flow cytometry data. Further, we thank John Mantovani for excellent administrative help. This work was supported by NIH Training Grant [5 T32 AI00729032].
Author information
Authors and Affiliations
Contributions
D.Y.O., S.M., D.P., W.M., L.A.H., G.W.: Conception and design; Analysis and interpretation of the data; Drafting of the article; Critical revision of the article for important intellectual content. C.M.: Conception and design; Drafting of the article; Critical revision of the article for important intellectual content. E.E.B.G.: Conception and design of experiments; Analysis and interpretation of the data; Drafting of the article; Critical revision of the article for important intellectual content. Q.Z.: Conception and design; Simulation study.
Corresponding author
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Orlova, D.Y., Meehan, S., Parks, D. et al. QFMatch: multidimensional flow and mass cytometry samples alignment. Sci Rep 8, 3291 (2018). https://doi.org/10.1038/s41598018214444
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598018214444
This article is cited by

optimalFlow: optimal transport approach to flow cytometry gating and population matching
BMC Bioinformatics (2020)

SwiftReg cluster registration automatically reduces flow cytometry data variability including batch effects
Communications Biology (2020)

Automated subset identification and characterization pipeline for multidimensional flow and mass cytometry data clustering and visualization
Communications Biology (2019)

diffcyt: Differential discovery in highdimensional cytometry via highresolution clustering
Communications Biology (2019)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.