Automated flow cytometric identification of disease-specific cells by the ECLIPSE algorithm

Multicolor Flow Cytometry (MFC)-based gating allows the selection of cellular (pheno)types based on their unique marker expression. Current manual gating practice is highly subjective and may remove relevant information to preclude discovery of cell populations with specific co-expression of multiple markers. Only multivariate approaches can extract such aspects of cell variability from multi-dimensional MFC data. We describe the novel method ECLIPSE (Elimination of Cells Lying in Patterns Similar to Endogeneity) to identify and characterize aberrant cells present in individuals out of homeostasis. ECLIPSE combines dimensionality reduction by Simultaneous Component Analysis with Kernel Density Estimates. A Difference between Densities (DbD) is used to eliminate cells in responder samples that overlap in marker expression with cells of controls. Thereby, subsequent data analyses focus on the immune response-specific cells, leading to more informative and focused models. To prove the power of ECLIPSE, we applied the method to study two distinct datasets: the in vivo neutrophil response induced by systemic endotoxin challenge and in studying the heterogeneous immune-response of asthmatics. ECLIPSE described the well-characterized common response in the LPS challenge insightfully, while identifying slight differences between responders. Also, ECLIPSE enabled characterization of the immune response associated to asthma, where the co-expressions between all markers were used to stratify patients according to disease-specific cell profiles.

The index = 0 for the control and ≥ 1 for the responder groups that might correspond to different diseases or subtypes of the same disease. If all responder individuals are drawn from a population with the same disease, then will assume values 0 and 1, for the control group the same person is followed and analysed before and after an immune response. Figure S1: (1) Single data matrices representing measurement per individual, (2) Multiset data arrangement obtained by linking the matrices column-wise, (3a) Control/Responder differentiation of the multiset structure, with paired data, (3b) Control/Responder differentiation of the multiset structure, with paired data In this case the index , representing the individuals, is not unique (3a). However, the two Flow Cytometry case studies we consider in this paper both consist of unpaired individuals, indicated by different (as in 3b).
Step 0: Data pre-processing Data pre-processing is a very important aspect of chemometric data analysis. 1 It aims to remove variability in the data that is unrelated to the problem under study, while retaining the experimentally relevant information. In Flow Cytometry, such irrelevant variability might derive from instrumental artefacts due to misalignment of the laser source, baseline drift, laser power variability, interference of fluorescent labels, or uninformative noise coming from low intensity signals. Removing such irrelevant variability in a quantitative and reproducible way facilitates the quest for relevant biomedical information and guarantees that the fluorescents intensities are measures of level of protein expression on the cells.
The first step of the pre-processing consists of transforming matrix using log (or arcsinh) function. In MFC, this type of transformation is essential to cope with the broad dynamic range of emissions between fluorophores. The data matrix may contain negative values, due to background subtraction or to the compensation of overlap between emission spectra of different fluorophores 2 . These require a more dedicated transformation, such as the arcsinh scaling described by Finak et al. 3 . Multivariate analysis of the cell variability then requires centering and scaling of the log or otherwise-transformed matrix log = log 10 ( ) Eq. S1. Mean (or median) centering subtracts the column mean (or median) from every element in the column, which removes surface marker expression consistently present across all cells, resulting in the variability in surface marker expressions across the cells. Scaling equalizes the variability of each surface marker across the cells, to allow them to contribute equally to a multivariate model of the data, regardless of the intensity of the used fluorophore or the absolute variability in abundance of every surface marker.
Both centering and scaling need to take into account the multi-set structure of the MFC data ( Fig. 1). Several strategies for centering and scaling may accommodate this structure and the quantitative comparison between case and responder individuals 4 .
One strategy is centering and scaling the variables per set, i.e. per individual.
(a) = where 0 is the mean of the log-transformed surface marker intensities of the 0 -th control individual, calculated according to Eq. S2a, with = 0; 0 is the total number of control individuals. m 0 , of size of size x , represents the multiset matrix centered using the class mean of the log-transformed surface marker intensities of the control class.
Also the cumulative control standard deviation may be calculated by Eq. S5: with m, 0 the mean-centered individual control matrix, estimated according Eq. S2b, with = 0; 0 − the diagonal matrix holding the standard deviation 0 weighted for the number of control individuals 0 ; sc 0 the multiset matrix resulting from the auto-scaling performed with the weighted mean and standard deviation across the control individuals. From sc 0 we can extrapolate the pre-processed matrix of the control and responder sets, expressed in Eq. S6 and Eq. S7, respectively: Eq. S6      Citrus was applied to the LPS dataset using the R GUI. The regression classification model was trained on the data using the default pre-processing of the GUI, which implies arcsinh transform with cofactor 5. The accuracy of the classification models constructed is shown in Figure S5. The model, identified as optimal by the cross-validation procedure (cv.min in Figure S5), incorrectly classifies around 7% of the samples. Two cell clusters were selected as the most discriminating ones between LPS responders and controls by the cross-validated model. The histograms of these clusters, both more abundant in the LPS responder, are shown against the background cluster, which contains all the rest of the cells not included in the specified cluster ( Figure S6). Cluster 14989 has a distinctive phenotype, existing of CD16+CD11b+CD69+CD11c+CD62L− cells. Two CD32 populations are present in this cluster: CD32− and CD32+. In contrast, cells from cluster 14985 express high levels of CD64 and CD62L, while they have a dim expression of CD16. The cells present in the two cluster might be identified as activated mature and partly immature (CD16 dim ) neutrophils, respectively. Using the ECLISPE method we also identified two LPS-specific cell clusters in the ECLIPSE space generated when plotting PC1 and PC2, Figure 4.
To be able to compare these results, we applied Citrus on the pre-processed data as used in the ECLIPSE analysis. Subsequently, we applied Citrus on the LPS response-specific cells after ECLIPSE cell elimination.
First, Citrus was applied on the data pre-processed with the ECLISPE algorithm, as described in Pre-processing section. In this analysis, all the cells were included. The cross-validated error rates of the obtained models are displayed in Figure S7. A perfect classification is achieved with a model which uses three distinctive features. This model is chosen as optimal by the Citrus Analysis. The most discriminant three clusters identified by the model are shown in Figure S8. Cluster 14956 and 14993 represent phenotypes that might be assigned to mature neutrophils, with CD16+CD62L+ and CD16+CD62L+CD32−, respectively. These clusters are found more abundant in the control group, compared to the responder group for which cluster 14990 is more present. This last cluster, with CD16−CD69− and CD62L+ can be assigned as immature neutrophils. These results can give further insight in the mechanism behind the immune response studied.
Due to the different purposes of the models, a direct comparison of the prediction accuracy of the two methods is not possible. However, the last analysis, performed on the data after removal of normal cells, brings a further improvement of the classification model in terms of complexity. This shows the power of the subsampling provided by ECLIPSE, since in principle the ECLIPSE subsampling may be beneficial for discriminant classification methods as Citrus.     Secondly, we performed viSNE on the LPS data pre-processed by the ECLIPSE algorithm; the results can be observed in Figure S13. Finally we applied viSNE to the LPS data pre-processed by ECLIPSE, but after removal of normal cells from the responder individuals, performed by the ECLIPSE algorithm. The results are shown in Figure S14. It is clear how the cells from control individuals (blue, Figure S14b) are placed in the middle of the viSNE map, whereas the cells from the responder individuals (red, Figure S14b) are distributed in the upper and lower region of the map. Very little overlap is present between cells of the two groups and clearly, a lower amount of responding cells is present. The marker expression of these remaining cells is shown in Figure S14c. ECLIPSE removes mostly the CD16dimCD62L+ and CD16−CD62L− cells, leaving two quite distinctive populations. No particular difference in marker expression is noted in the viSNE analysis, performed on the ECLISPE subsampled data.

Figure S14: viSNE analysis performed on the LPS data mean-centered and scaled, after removal of normal cells in the responder done by ECLIPSE. 14a: cells in the viSNE map are coloured per different individual; 14b: cells in the viSNE map are coloured based on control (blue) and responder (red) group. 14c: cells from the responder individuals, in the viSNE map, are coloured based on expression of the single 7 markers.
The viSNE algorithm is used mostly for data dimensionality reduction and visualization purposes, while the main goal of ECLIPSE is the selection of interesting cellular information and/or populations based on multiple marker co-expression. In principle, no quantitative parameter can be used for the comparison with the ECLIPSE results on the LPS data. However, we have shown how the viSNE analysis is susceptible to sample-to-sample variation, which has influence on the resulting map. The multiset pre-processing operated by ECLIPSE systematically reduces this variation.
Both ECLIPSE and viSNE performed on the ECLIPSE subsampled data reveal two distinctive populations, which are mostly differentiated by CD16 and CD62L marker expressions. However, an advantage of ECLIPSE is that the co-expression among the markers are displayed in one single biplot. Secondly, ECLIPSE defines response-specific cell populations, also taking into account normal cells that are overrepresented upon an immune response. These cells are not specifically defined when analyzing data with the viSNE algorithm. Next to this, the hallmark of ECLIPSE is the removal of normal cells, leading to a less crowed representation, which enables to distinguish the LPS responding cell populations better, also in the viSNE map ( Figure S14).

Results of Citrus on the asthma data
We trained Citrus on the asthma data using the R GUI. The model error rates plot of the built models is shown in Figure S15. The highest accuracy achievable by the analysis correspond to 25% of misclassified samples.
Four features were necessary for this minimal error ( Figure S15, cv.min). A model with a crossvalidation error 1 std higher than the minimum, corresponding to 30% of misclassification, requires 1 feature ( Figure S15, cv.1se). The clusters detected by both models are shown below.
The feature identified by the model cv.1se corresponds to the abundance of cluster 41977, found to be more abundant in the control group (group 1) compared to the asthmatic individual (group 2) ( Figure S16). The phenotype indicated that the cluster is mostly characterized by CD4+ T cells, having CD3+CD4+ and CD8−CD16−CD123−CD14−CRTH2− expression levels.    Figure S19 shows the viSNE map with individuals #63 and #67, which were also grouped and analysed in a partial model by ECLIPSE (Figure 11).       Figure   S25 (the small cell population in yellow in the top right corner belongs to Responder 10), and in the histograms shown in Figure S26; contour plots showing the differing relative abundance of cell populations between controls and responders are shown in Figure S27.   Citrus was applied to the dataset with a Minimum Cluster Size Threshold (MCST) corresponding to 10 cells, so that the method could identify the rare subset. The accuracy of the constructed models is shown in Figure S28, which reports the model cross-validation error rate versus the regularization threshold associated to the number of features identified. Seven features were detected by the model with the minimum error (cv.min), while the model with a cross validation error 1 std higher than the minimum (cv.1se) identified two most discriminant features.    Figure S30) clearly shows the difference between the two groups.
Advantageous is the possibility to estimate the difference between densities of a single responder against the control group estimate ( Figure S30B). This will enhance the resolution on a specific individual and it is helpful when one sample is available for the response class/classes. The results of the analyses on the synthetic data showed how ECLIPSE outperforms Citrus in presence of a heterogeneity within the response class. In fact, Citrus is a two-class classifier method which needs numerous samples with similar phenotypically properties within both groups to find their signature features. Contrarily, ECLIPSE is a one class classifier method that requires only a group of control individuals. This is necessary to define a reference, against which a single patient or a group of patients, even highly heterogeneous in their response, can be compared. The removal of normal cells will put more focus on the response specific subpopulations of each individual, including rare cell subsets.
In this example, the dimensional reduction step of ECLIPSE was not needed because the data had only two dimensions. In order to show that the SCA-based transformation will not affect the discovery of cells subpopulations we performed the analyses on a 3D datasets, obtained from the first one after specific transformations.
A third random aspect was added to the first dataset; the matrix obtained was rotated by using orthogonal Procrustes rotation 8 , which allowed to introduce a more realistic correlation across all the three variables. A 3D scatter plot of the new dataset and histograms of the three marker expression levels are shown below ( Figure S32 and S33).  We applied the Citrus method on this new dataset. The cross-validated error rates of the model obtained, in Figure S34, indicate the model with only one feature as optimal. In this case, different relative abundance of cell populations is not identified as relevant feature ( Figure S35). As for the previous Citrus analysis, the rare cell subset is not detected.  ECLIPSE analysis was performed on the three-dimensional dataset, whose dimensionality was reduced with Simultaneous Component Analysis to two components. Figure S36 Figure S38, which does show the rare cell population in down right corner. Moreover, the loadings show the correct co-expression between Marker 2 and 3 for this rare cell population which was masked in Figure S37, due to the high variability in the control cells of all markers.