Machine learning-based approaches for identifying human blood cells harboring CRISPR-mediated fetal chromatin domain ablations

Two common hemoglobinopathies, sickle cell disease (SCD) and β-thalassemia, arise from genetic mutations within the β-globin gene. In this work, we identified a 500-bp motif (Fetal Chromatin Domain, FCD) upstream of human ϒ-globin locus and showed that the removal of this motif using CRISPR technology reactivates the expression of ϒ-globin. Next, we present two different cell morphology-based machine learning approaches that can be used identify human blood cells (KU-812) that harbor CRISPR-mediated FCD genetic modifications. Three candidate models from the first approach, which uses multilayer perceptron algorithm (MLP 20-26, MLP26-18, and MLP 30-26) and flow cytometry-derived cellular data, yielded 0.83 precision, 0.80 recall, 0.82 accuracy, and 0.90 area under the ROC (receiver operating characteristic) curve when predicting the edited cells. In comparison, the candidate model from the second approach, which uses deep learning (T2D5) and DIC microscopy-derived imaging data, performed with less accuracy (0.80) and ROC AUC (0.87). We envision that equivalent machine learning-based models can complement currently available genotyping protocols for specific genetic modifications which result in morphological changes in human cells.

The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) genome editing technology, which is adapted from an immune system analog found in archaea and prokaryotes, has been applied to exceedingly broad scientific, industrial, and medical domains at an exceptional pace since the first demonstration in cells [1][2][3][4][5][6][7][8] .One particularly exciting domain is CRISPR-based therapeutics, which as of today, has found applications in several areas: blood disorders, diagnostics and therapeutics in cancer, eye diseases, chronic infections, neurodegenerative disorders, and protein-folding disorders 9 .
Two most common hemoglobinopathies, sickle cell disease (SCD) and β-thalassemia, arise from genetic mutations within the β-globin gene.These mutations result in deficient or absent β-globin synthesis, which in turn lead to oxygen being disassociated from the hemoglobin and eventually result to conformational changes in red blood cells 10,11 .There is no cure available for these disorders except bone marrow transplantation (BMT) when a suitable donor is available, and most treatments are mainly aimed at relieving symptoms and preventing complications.Recently, the CRISPR technology has been used to reactivate the expression of fetal hemoglobin, which can take the place of defective adult hemoglobin, and has shown remarkable results in improving the quality of life in patients 12 .
Machine learning, which can yield models for pattern recognition, classification, and prediction from acquired data, has been widely used in biological studies ranging from protein folding prediction 13 to cancer prognosis 14 .There are two main types of machine learning methods: (1) supervised learning (e.g.random forest, support vector machine), which derive the relationship between a set of input variables (features) and a designated

Generation of FCD-HT monoclonal stable cell line
To generate the FCD-HT monoclonal stable cell line, approximately 10 million human KU-812 cells were seeded onto a 10 cm petri dish.16 h later, the cells were transiently transfected with 4.5 μg of PCMV-SpCas9-U6-sgRNA-L, 4.5 μg of PCMV-SpCas9-U6-sgRNA-R, and 1 μg of the donor plasmid using the JetPEI reagent (Polyplus Transfection).48 h later, puromycin was added at the final concentration of 2 μg/mL.The selection lasted 2 weeks, after which the surviving clones were pooled to generate the polyclonal stable cells.Next, to remove the puromycin resistance gene-T2A-mKate cassette, approximately 10 million of the polyclonal stable cells were seeded onto a 10 cm petri dish, and after 16 h were transfected with 10 μg of EF1-Flpase (unpublished data) using the JetPEI reagent.48 h later, single cells were isolated using flow cytometry.The established monoclonal stable cell line was confirmed to be heterozygous by genotyping and further expanded and maintained in the complete growth medium.

Genotyping of FCD-HT monoclonal stable cell line
The genomic DNAs were isolated from FCD-HT monoclonal stable cells using DNeasy Blood&Tissue Kit (Qiagen).The transcripts containing the CRISPR-targeting region was amplified with primers P13 and P14.The PCR products were then subjected to gel electrophoresis and Sanger sequencing using primers P13 and P14.

Flow cytometry
FCD-WT, FCD-HT, HCT116 and PUF3.1 HCT116 cells from a 10-cm petri dish were washed with 5 mL PBS buffer, and subsequently trypsinized with 2 mL 0.25% Trypsin-EDTA at 37 °C for 5 min.Trypsin-EDTA was then neutralized by adding 10 mL of complete medium.The cell suspension was centrifuged at 1000 rpm for 5 min and after removal of supernatants, the cell pellets were re-suspended in 5 mL PBS buffer.The cells were analyzed on a BD Reforest flow analyzer.The voltages (V) for each channel were: 270 for FSC-A, 270 for FSC-H, 270 for FSC-W, 280 for SSC-A, 280 for SSC-H, and 280 for SSC-W.

Machine learning model training and testing
For flow cytometry-derived dataset, a Dell desktop computer (Intel Core i7-10700 CPU @ 2.90 GHz, Windows 10 enterprise 64-bit OS and 32 GB RAM) was used to conduct the machine learning modeling.Scikit-Learn, a free Python machine learning library, was used to conduct all model training and testing procedures.For DIC microscopy-derived dataset, a Lenovo Laptop (Intel Core i7-10510 CPU @ 1.80 GHz, Ubuntu 20.04 OS and 16 GB RAM) was used to conduct the deep learning modeling.The Keras library in TensorFlow was used to conduct all model training and testing procedures.Other Python libraries, including NumPy, Pandas, and Matplotlib, were also included for data analysis and presentation.

Performance metrics
Performance of different models was evaluated using threshold dependent and independent metrics, which include: 1. precision: this parameter measures how accurate a model is when predicting cells being at live state.Precision = TP/(TP + FP), where TP refers to correctly predicted live cells and FP refers to falsely predicted live cells.2. recall: this parameter measures the model's ability to correctly predict live cells from actual live cells.
Recall = TP/(TP + FN), where TP refers to correctly predicted live cells and FN refers to falsely predicted apoptotic cells.3. true positive rate (TPR): this parameter measures the model's ability to correctly predict live cells from actual live cells.TPR = TP/(TP + FN), where TP refers to correctly predicted live cells and FN refers to falsely predicted apoptotic cells.4. false-positive rate (FPR): this parameter measures the model's level of falsely predicting live cells from actual apoptotic cells.FPR = FP/(FP + TN), where FP refers to falsely predicted live cells and TN refers to correctly predicted apoptotic cells.5. accuracy: this parameter determines the success of correctly predict live and apoptotic cells from overall data.Accuracy = (TP + TN)/(TP + FP + TN + FN), where TP refers to correctly predicted live cells, FP refers to falsely predicted live cells, FN refers to falsely predicted apoptotic cells, and TN refers to correctly predicted apoptotic cells.

Generation and characterization of heterozygous FCD-deficient KU-812 cell model (FCD-HT)
The CRISPR/SpCas9 technology was used to generate the FCD-deficient model in KU-812 cells, which were established from the peripheral blood of a patient with chronic myelogenous leukemia 30 .Briefly, the parental cells were transiently transfected with CRISPR/SpCas9 complex which targets both left and right genomic regions flanking the FCD motif, along with a homologous recombination repair template containing a puromycin resistance gene transcript (Fig. 1A).The polyclonal stable cell line was then established after 2 weeks of puromycin selection (2 µg/mL).Subsequently, to avoid potential interference with the transcriptional activities of the globin genes, the puromycin resistance gene transcript (~ 2.2 kb), which was flanked by flippase recognition target (FRT) sites, was removed using flippase.Finally, monoclonal stable cells were established using FACS single cell sorting.
To characterize the stable monoclonal cell line, genomic DNA was isolated using DNeasy Blood&Tissue Kit (Qiagen), and subsequently the transcript containing the FCD motif or FRT site was amplified using primers P13 and P14 (Supplementary Table 1).The PCR products were then subjected to gel electrophoresis and as shown in Fig. 1B and Supplementary Fig. 1A, the monoclonal cell line yielded two distinct bands corresponding to both the wild type (806 bp) and FCD-knockout (341 bp) alleles, confirming its heterozygous status (named as FCD-HT).Both bands were further extracted and subjected to Sanger sequencing, which confirmed that the FCD sequence was successfully removed in the FCD-Knockout allele (Supplementary Fig. 1B and Supplementary Materials/ Sanger_FCD_Knockout_Allele.seq).To determine how the FCD-removal affects the ϒ-globin expression, total RNAs were extracted using the RNeasy Mini Kit from KU-812 and FCD-HT cells and the relative expression of ϒ-globin transcript was determined using quantitative reverse transcription-PCR (qRT-PCR).As shown in Fig. 1C, the mRNA level of ϒ-globin significantly increased in FCD-HT cells (2.87-fold compared to its parental KU-812, named as FCD-WT), which is consistent with our hypothesis that the removal of the FCD motif may reactivate the expression of fetal hemoglobin.

Flow cytometry-based data collection and visualization for FCD-WT and FCD-HT cells
To prepare cell morphology-based predictive models differentiating FCD-WT, in which all ϒ-globin alleles contain the FCD motif, and FCD-HT cells, which contain both wild-type and FCD-knockout ϒ-globin alleles,  2).Next, this initial dataset was randomly split into training and testing datasets at a ratio of 80:20 (size of training dataset: size of testing dataset).Specifically, the training dataset contains 302,652 cells (label 0: 154,180 cells, label 1: 148,472 cells, Supplementary Fig. 2 and Supplementary Table 3), and the testing dataset contains 75,664 cells (label 0: 38,592 cells, label 1: 37,072 cells, Supplementary Fig. 3 and Supplementary Table 4).
We first compared the absolute readings among the 6 features using box plotting.As shown in Fig. 2A, the means of these features varied significantly with the maximal ratio larger than 2.0-fold (mean FSC-A / mean SSC-H = 2.47), indicating that standardization of the original training and testing datasets are required (standardized training and testing datasets in Supplementary Tables 5 and 6, respectively).Subsequently, the standardized training dataset was subjected to two dimensionality reduction algorithms, principal component  www.nature.com/scientificreports/analysis (PCA) and t-distributed stochastic neighbor embedding t-SNE (t-SNE).As shown in Fig. 2B (PCA) and Fig. 2C (t-SNE), the two cell subpopulations (FCD-WT: green, FCD-HT: yellow) overlapped significantly and were not linearly separable, which implies that non-linear modeling approaches such as multilayer perceptron may be required for classification purposes.

Cell morphology-based machine learning models using flow cytometry-derived data
A general workflow as described in our previous study was adopted to build and test various cell morphology-based machine learning models using flow cytometry-derived data 24 .In total, five (5) supervised learning algorithms (logistical regression, random forest, k-nearest neighbor, support-vector machine, and multilayer perceptron) were included (model hyperparameters in Supplementary Table 7).Briefly, logistical regression is a statistical model which uses a logistic function to model a binary dependent variable.The model itself is non-linear but can be transformed into a linear regression via link functions.Random forest algorithm is an ensemble learning method which constructs "an ensemble" of decision trees at the training step.K-nearest neighbor algorithm for classification assumes that data points with same classification labels tend to be in proximity.Support-vector machine (SVM) relies on constructing one or a set of hyperplanes in high-dimensional space and remains one of the most robust supervised classification methods.Multilayer perceptron (MLP) belongs to the family of feedforward artificial neural network (ANN), and typically consist of three layers of nodes: an input layer, a hidden layer, and an output layer.First, using tenfold cross-validation, we screened all models with the standardized training dataset, and adopted the filtering conditions as (1) the mean accuracy > 0.80, and (2) the standard deviation of accuracy < 0.10, which were derived from our previous study using the same 6 flow cytometry-based features to predict mammalian cell states 24 .In total, one (1) logistic regression model (Supplementary Table 8), 94 random forest models (Supplementary Table 9), 96 k-nearest neighbor models (Supplementary Table 10), two (2) SVM models with linear kernel (Supplementary Table 11), 25 SVM models with Gaussian kernel (Supplementary Table 12), and 893 MLP models (Supplementary Table 13) were selected.
Next, all selected models (1111) were trained using the training dataset, and subsequently applied to the standardized testing dataset and subjected to secondary filtering conditions as (1) precision when predicting FCD-HT cells > 0.80, and (2) recall when predicting FCD-HT cells > 0.80.As shown in Supplementary Table 14, only 533 MLP models survived this additional filter.
Finally, we chose three MLP models with largest AUC values (Table 1, MLP 20-26: number of nodes in the first layer: 20/number of nodes in the second layer: 26, MLP 26-18: number of nodes in the first layer: 26/number of nodes in the second layer: 18, and MLP 30-26: number of nodes in the first layer: 30/number of nodes in the second layer: 26), and plotted both the receiver operating characteristics (ROC, Fig. 3A) and precision-recall curves (Fig. 3B).The three models displayed essentially identical performance when predicting FCD-HT cells (precision: 0.83, recall: 0.80, accuracy: 0.82, and AUC: 0.90).

Cell morphology-based machine learning models using microscopy-derived data
In addition to flow cytometry, cell morphology information can also be directly assessed using imaging 28 .Using a Differential Interference Contrast (DIC) microscopy, we prepared 1594 images of individual FCD-WT cells and 1695 images of FCD-HT cells (Supplementary Fig. 4), the ratio of FCD-WT and FCD-HT = 0.94).Next, this starting dataset was randomly split into the training and testing datasets at a ratio of 90:10 (size of training dataset: size of testing dataset).The final training dataset contains 2956 images (FCD-WT: 1433 images, FCD-HT: 1523 images), and the testing dataset contains 333 images (FCD-WT: 161 images, FCD-HT: 172 images).
Next, deep learning-based convolutional neural networks (CNNs) were used to construct genotype-predictive models.Two general CNN architectures were explored: (1) Type 1 (T1): (Conv-Conv-Pool) n , which was based on the VGG design 31 , and (2) Type 2 (T2): (Conv-Pool) n , which contained a single convolution layer for each repeat.For each type, different number of convolution numbers were tested (two, four and six layers for T1, and two, three, four, five layers for T2) until the final feature map reaches a dimension of zero.Since our image inputs have a relatively small size (100 pixels by 100 pixels), we fixed the filter size at 3 and when applicable, the Maxpooling pool size at 2. The detailed architectural designs were included in Supplementary Table 15.
As an example, for Type 2/5 layers (T2D5, Fig. 4), the numbers of layers at the feature extraction step were 32, 64, 92, 100 and 128 for each successive layer, and rectified linear unit (ReLU) was used as the activation function.Additionally, a Maxpooling layer was included after each convolution layer.Next, the outputs from convolutional layers were subjected to global average pooling and converted into a 1-dimensional vector (Flatten) for a fully connected layer (dense, 1028 nodes).Finally, a Softmax classifier, which applies a categorical cross-entropy loss function, was used, together with the adaptive moment estimation (ADAM) optimization algorithm.
Table 1.Predictive performances of the three candidate learning models using flow cytometry-derived data.First, all 7 candidate architectures were subjected to tenfold cross-validation using the training dataset.As shown in Supplementary Table 16, models from Type 2 showed better performance compared to those from Type 1. Specifically, the best-performing model of Type 2 (T2D5) showed a mean of accuracies from 10 crossvalidation tests at 67.3% (Supplementary Fig. 5), while the best-performing model of Type 1 (T1D4) yielded a mean of accuracies at 58.3%.

Number
We further trained models using all candidate architectures and the training dataset, and subsequently applied them to the testing dataset.As shown in Table 2, the architectures T2D5 displayed the best predictive outcome.Specifically, for FCD-HT cells, precision was 0.84, recall was 0.76, accuracy was 0.80 and AUC was 0.87 (Fig. 5).

Discussion
In this study, we have investigated two alternative approaches in predicting cell genotypes: (1) numeric data based on flow cytometry assay, and (2) imaging data based on DIC microscopy.Our analysis showed that the best performing models from approach 1 (MLP 20-26, MLP 26-18, and MLP 30-26) yielded better prediction results compared to the best model from approach 2 (T2D5) (Tables 1 and 2, ROC AUC values for MLP 20-26, MLP 26-18, and MLP 30-26: 0.90, for T2D5: 0.87).Multiple factors maybe impact the classification performance of the two approaches.For example, the resolution of our images was relatively low (100 pixels by 100 pixels).Additionally, compared to the training dataset from flow cytometry (302,652 cells), the size of imaging dataset was vastly smaller (3289 images).To overcome this challenge, we resorted to data augmentation techniques by applying zooming (range: 0.5-1.5x),rotation (range: 90°), width shifting (range: − 10 to 10 pixels), and height shifting (range: − 10 to 10 pixels) to the original training dataset (Supplementary Fig. 8) 32 .Together with original samples, the augmented training dataset contained 26,604 images (FCD-WT: 12,897 images, FCD-HT: 13,707 images), which was subsequently subjected to deep learning modeling using the T2D5 architecture.As shown in Supplementary Table 17, the newly acquired model did not yield better predictive performance.As an example, for FCD-HT cells, precision was 0.77, recall was 0.74, accuracy was 0.75, and ROC AUC was 0.75, which were lower than those of the original model (precision: 0.84, recall: 0.76, accuracy: 0.80, ROC AUC: 0.87).These results showed that synthetic samples do not always enhance the model performance in deep learning.
It should be noted that data collected from our flow cytometry assay were based on fixed voltage values for each channel ("Materials and methods"/"Flow cytometry", 270 for FSC-A, 270 for FSC-H, 270 for FSC-W, 280 for SSC-A, 280 for SSC-H, and 280 for SSC-W).In future studies, we will explore a wide range of voltage settings for all channels and systematically study how these modifications affect the modeling results.To compare the predictive performances between supervised and unsupervised learning algorithms, we additionally subjected our flow cytometry-derived dataset (the standardized training dataset) to two unsupervised clustering algorithms (k-means clustering and Gaussian mixture clustering) in parallel 34 .As shown in Supplementary Fig. 6 (SSC-A vs. FSC-A), the predicted distributive pattern for two subpopulations from k-means algorithm differed drastically  It should be emphasized that since our training dataset contained only a specific cell type (i.e., KU-812) subjected to a specific CRISPR editing (CRISPR-mediated knockout of the FCD motif), the resulting models should not be used to differentiate any two cell types, or any cells with or without any CRISPR treatments.To demonstrate this scenario, we applied the candidate MLP 20-26 model to (1) a different human cancer cell type (HCT116, Supplementary Table 20), and (2) a human cancer cell line which was subjected to CRISPR-mediated knock-in (HCT116mut, Supplementary Table 21) 35 , both of which do not contain any mutations within the FCD motif (i.e., a correct model would classify both as FCD-WT genotype).Our results showed that the model yielded poor predictive performance on both cell lines (52.7% of HCT116 cells predicted as FCD-WT genotype, and 50.2% of HCT116mut as FCD-WT genotype), which confirmed that for each specific cell line and CRISPR editing event, the modeling process needs to be separately executed.
Typically, to establish and confirm a human stable cell line with desirable CRISPR-mediated genetic modifications, polyclonal cells need to be sorted into single cells, commonly on 96-well plates.Next, the single cells are allowed to propagate until sufficient genomic material can be extracted and subjected to PCR-based genotyping assays (Fig. 1).While the protocol is well established, the full pipeline can become time-consuming (up to several weeks for cell propagation step dependent on the cell proliferation rates) and labor-intensive (hundreds of monoclones may be needed for acquiring desirable genotypes).On the other hand, sophisticated imaging flow cytometry techniques, which record extensive physically measurable quantities (features) of the cells, have been used to identify cell subpopulations with specific traits (e.g.cell types, apoptotic states) 26,29,36,37 .
As an example, using next generation cell sorting and time-stretch imaging technologies, Jalali et al. achieved high accuracy (95%) when classifying OT-II white blood cells and SW-480 epithelial cells 38 .However, the required instrumentation (time-stretch imaging and customized microfluidic chip devices) may not be available to many research labs.In comparison, our flow cytometry and DIC microscopy-based machine learning approaches provide a unique balance between efficacy and availability, and theoretically can be applied to any engineered cells with genetic modifications known to introduce cell morphological changes.
In summary, we demonstrated the feasibility of using flow cytometry-based cellular information (FSC-A, FSC-H, FSC-W, SSC-A, SSC-H, and SSC-W) to predict specific cell genotypes using multilayer perceptron algorithms.Specifically, the three candidate MLP models, MLP 20-26, MLP 26-18, and MLP 30-26, achieved good predictive performance for predicting FCD-HT cells (AUC: 0.90).Additionally, we showed that deep learning framework (T2D5), when applied to DIC microscopy images, can also be indicative for certain genotyping purposes.We envision that both assays can be valuable and complementary to currently available genotyping protocols for engineered cell lines.
12:1481 | https://doi.org/10.1038/s41598-022-05575-3www.nature.com/scientificreports/we first used flow cytometry assay to record 6 features (FSC-A, FSC-H, FSC-W, SSC-A, SSC-H, and SSC-W).Specifically, FSC (forward scatter) measures the light scatter along the path of the laser (A: area, H: height, and W: width), and is dependent on the diameter/size of the cell.In contrast, SSC (side scatter) measures the light scatter perpendicular to the path of the laser (A: area, H: height, and W: width), and provides information about the internal complexity (granularity) of a cell.In total, 192,772 FCD-WT cells (labeled as 0) and 185,544 FCD-HT cells (labeled as 1) were included (the ratio of labels 0 and 1 = 1.04,Supplementary Table

Figure 1 .
Figure 1.Generation and characterization of heterozygous FCD-deficient KU-812 cell model (FCD-HT).(A)Schematic illustration of the CRISPR/SpCas9 homologous recombination process to remove the FCD motif within human globin locus.The polyclonal stable cell line was established using 2 weeks of puromycin selection (2 µg/mL).Subsequently, the puromycin resistance gene transcript, which was flanked by flippase recognition target (FRT) sites, was removed using flippase.Finally, monoclonal stable cells were established using FACS single cell sorting.(B) Genotyping of FCD-HT monoclonal stable cell.Genomic DNAs were harvested from FCD-WT and FCD-HT cells and the transcript containing the FCD motif or FRT site was PCR amplified and subsequently subjected to gel electrophoresis.The stable cell line yielded two bands corresponding to both the wild type (806 bp) and FCD-knockout (341 bp) alleles, confirming its heterozygous status.(C) Quantitative reverse transcription-PCR (qRT-PCR) assay showed that compared to FCD-WT cells, the mRNA level of ϒ-globin significantly increased in FCD-HT cells (2.87-fold).*** denotes p-value < 0.001.

Figure 2 .
Figure 2. General statistics and visualization of the training dataset.(A) Box plot of the training dataset.The means of the six features (FSC-A, SSC-A, FSC-H, SSC-H, FSC-W, and SSC-W) varied significantly with the maximal ratio larger than 2.0-fold (mean FSC-A /mean SSC-H = 2.47), indicating that standardization of the original training and testing datasets are required.(B) Visualization of the standardized training dataset using PCA (green: FCD-WT, yellow: FCD-HT).(C) Visualization of the standardized training dataset using t-SNE (green: FCD-WT, yellow: FCD-HT).

Figure 4 .
Figure 4.The T2D5 deep learning architecture.The model contained five convolutional layers (the numbers of each layer: 32, 64, 92, 100 and 128).Additionally, a Maxpooling layer was included after each convolution layer.Next, the outputs from convolutional layers were subjected to global average pooling and flattened for a fully connected layer (1028 nodes).Finally, a Softmax classifier, which applies a categorical cross-entropy loss function, was used.