Realistic in silico generation and augmentation of single cell RNA-seq data using Generative Adversarial Neural Networks

A fundamental problem in biomedical research is the low number of observations available, mostly due to a lack of available biosamples, prohibitive costs, or ethical reasons. Augmenting few real observations with generated in silico samples could lead to more robust analysis results and a higher reproducibility rate. Here we propose the use of conditional single cell Generative Adversarial Neural Networks (cscGANs) for the realistic generation of single cell RNA-seq data. cscGANs learn non-linear gene-gene dependencies from complex, multi-cell-type samples and use this information to generate realistic cells of defined types. Augmenting sparse cell populations with cscGAN-generated cells improves downstream analyses such as the detection of marker genes, the robustness and reliability of classifiers, and the assessment of novel analysis algorithms, and might consequently reduce the number of animal experiments and associated costs. cscGANs outperform existing methods for single cell RNA-seq data generation in quality and hold great promise for the realistic generation and augmentation of other biomedical data types.

To show that the scGAN generates realistic cells, we trained a Random Forest (RF) classifier 18 to distinguish between real and generated data. The hypothesis is that a classifier should perform at (close to) chance level when the generated and real data are highly similar. Indeed, the RF classifier only reaches an Area Under the Curve (AUC) of 0.65.

Conditional generation of specific cell types
scRNA-seq in silico data generation reaches its full potential when specific cells of interest can be generated on demand, which is not possible with the scGAN model. This conditional generation of cell types could be used to increase the number of cells of a sparse, specific population that might represent only a small fraction of the total cells sequenced.
To generate specific cell types of interest while learning multi-cell type complex data, we developed and evaluated various conditional scGAN architectures. Common to all these models is that the conditional scGAN learns to generate cells of specific types while being trained on the complete multi-cell type dataset. The cell type information is associated with the gene expression values of each cell during the training. These tags can then be used to generate scRNA-seq data of a specific type, in our case of a specific cluster of PBMCs. The best performing conditional scGAN model (cscGAN) utilized a projection discriminator 19 , along with Conditional Batch Normalization 20 and an LSN function in the generator. Again, model selection and optimization details can be found in the methods.
Model performance was assessed on the PBMC dataset using t-SNE, marker gene correlation, and classification. The cscGAN learns the complete distribution of clusters of the PBMC data and can conditionally generate cells of each of the ten Louvain clusters on demand. The t-SNE results for the conditional generation of cluster 2 and cluster 6 cells are shown in Fig. 2A and Fig. 2B, respectively (see also Table S2).
The results of this section demonstrate that the cscGAN can generate high-quality scRNA-seq data for specific clusters or cell types of interest, while rivaling the overall representational power of the scGAN.

Improved classification of sparse cells using augmented data
We now investigate how we can use the conditional generation of cells to improve the quality and significance of downstream analyses for rare cell populations. The two underlying hypotheses are (i) that a few cells of a specific cluster might not represent the cell population of that cluster well (sampling bias), potentially degrading the quality and robustness of downstream analyses, and (ii) that this degradation might be mitigated by augmenting the rare population with cells generated by the cscGAN. The base assumption is that the cscGAN might be able to learn good representations for small clusters by using gene expression and correlation information from the whole dataset.
To test the two parts of our hypothesis, we first artificially reduce the number of cells of the PBMC cluster 2 (downsampling) and observe how it affects the ability of a RF model to accurately distinguish cells from cluster 2 from cells of other clusters. In addition, we train the cscGANs on the same downsampled datasets, generate cells from cluster 2 to augment the downsampled population, retrain an RF with this augmented dataset, and measure the gain in their ability to correctly classify the different populations.
More specifically, cluster 2 comprises 15,008 cells and constitutes the second largest population in the PBMC dataset. Such a large number of cells makes it possible to obtain statistically sound results in various downstream analysis tasks. By deliberately holding out large portions of this population, we can quantify how the results would be affected if that population were arbitrarily small. We produce 8 alternate versions of the PBMC dataset, obtained by downsampling the cluster 2 population (keeping 50%, 25%, 10%, 5%, 3%, 2%, 1% and 0.5% of the initial population) (Fig. S8, Table S3). We then train RF classifiers for each of those 8 downsampled datasets (Fig. S9A) on 70% of the total amount of cells and keep aside 30% to test the performance of the classifier (Fig. S9B). The red line in Fig. 3A and Fig. S10 clearly illustrates how the performance of the RF classifier, measured through the F1 score, gradually but significantly decreases from 0.95 to 0.45 as the downsampling rate goes from 50% to 0.5%.

Figure 3: Effects of data downsampling and augmentation on classification and clustering. A: F1 score of a RF classifier trained to discriminate cluster 2 test cells from other test cells when trained on training (red), upsampled (orange), or augmented (blue) cells for eight levels of downsampling (50% to 0.5%). B: t-SNE representation of cluster 2 real test (red) and cscGAN-generated (blue) cells for three levels of downsampling (10%, 2%, and 0.5%). Other test cells are shown in gray.
To see if we could mitigate this deterioration, we experimented with two ways of augmenting our alternate datasets. First, we used a naive method, which we call upsampling, where we simply enlarged the cluster 2 population by duplicating the cells that were left after the downsampling procedure (Fig. S9A). The orange line in Fig. 3 shows that this naive strategy mitigates the effect of the downsampling, albeit only to a minor extent (F1 score of 0.6 obtained for a downsampling rate of 0.5%). It is important to note that adding noise (e.g. standard Gaussian) to the upsampled cluster 2 cells usually deteriorated the classification performance (data not shown).
In order to understand if in silico generated cluster 2 cells could improve the RF performance, we next trained the cscGANs on the 8 downsampled datasets (Fig. S9C).
We then proceeded to augment the cluster 2 population with the cells generated by the cscGAN (Fig. S9A). Fig. 3B shows that using as little as 2% (301 cells) of the real cluster 2 data for training the cscGAN suffices to generate cells that overlap with real test cells.
When fewer cells are used, the t-SNE overlap of cluster 2 training cells and generated cells slightly decreases (Fig. 3B right panel, Fig. S11). These results strongly suggest that the cluster-specific expression and gene dependencies are learnt by the cscGAN, even when very few cells are available. In line with this assumption, the blue curves in Fig. 3A and Fig. S10 show that augmenting the downsampled datasets with cscGAN-generated cells markedly improves the classification performance.

It may be surprising or even suspicious that our cscGAN is able to learn to generate cells coming from very small sub-populations (a few dozen cells) so well. We speculate that although cells of a specific type may have very specific functions, or exhibit highly singular patterns in the expression of several marker genes, they also share many similarities with cells of other types, especially those with common precursors. In other words, the cscGAN is not only learning the expression patterns of a specific sub-population from the (potentially very few) cells of that population, but also from the (potentially very numerous) cells of other populations. This hypothesis aligns with the architecture of the cscGAN: in the generator, the only cluster-specific parameters are those learnt in the Conditional Batch Normalization layers, whereas all the parameters of the Fully Connected layers are shared across the different cell types.

While this manuscript focuses on the task of cell type classification, many other applications will most probably gain from data augmentation, including - but not limited to - clustering itself, cell type detection, and data denoising. Indeed, a recent manuscript used Wasserstein GANs (WGANs) to denoise scRNA-seq data 21 . For this purpose, the (low-dimensional) representation obtained at the output of the single hidden layer of a critic network was used. These lower-dimensional representations keep cell type-determining factors while discarding 'noisy' information such as batch effects.
In general, GAN models allow for the simulation of cell development or differentiation through simple arithmetic operations applied in the latent space representation of the GAN, operations for which our conditional cscGANs are especially suited.
It is tempting to speculate how well the scRNA-seq data generation using cscGANs can be applied to other biomedical domains and data types. It is easy to envision, for example, how cscGAN variants could generate realistic (small) RNA-seq or proteomic data. Moreover, cscGAN variants might successfully generate whole genomes with predefined features such as disease state, ethnicity, and sex, building virtual patient cohorts for rare diseases, for example. In biomedical imaging, in silico image generation could improve object detection, disease classification, and prognosis, leading to increased robustness and better generalization of the experimental results, extending clinical application.
Data generation and augmentation might be most valuable when working with human data, which is notoriously heterogeneous due to genetic and environmental variation, when working with rare diseases, or when samples of a specified ethnicity or sex, for example, are simply lacking.
Lastly we would like to emphasize that the generation of realistic in silico data has far reaching implications beyond enhancing downstream applications. In silico data generation can decrease human and animal experimentation with a concomitant reduction in experimental costs, addressing important ethical and financial questions.

Methods
Datasets and pre-processing

PBMC
We trained and evaluated all models using a published human dataset of 68,579 peripheral blood mononuclear cells (healthy donor A) 11 . The dataset was chosen as it contains several clearly defined cell populations and is of reasonable size. In other words, it is a relatively large and complex scRNA-seq dataset with very good annotation, ideal for the learning and evaluation of generative models.
The cells were sequenced on Illumina NextSeq 500 High Output with ~20,000 reads per cell. The cell barcodes were filtered as in 11 and the filtered genes matrix is publicly available on the 10x Genomics website.
In all our experiments, we removed genes that are expressed in less than 3 cells in the gene matrix, yielding 17,789 genes. We also discarded cells that have less than 10 genes expressed. This, however, did not change the total number of cells. Finally, the cells were normalized for the library size by first dividing UMI counts by the total UMI counts in each cell and then multiplying by 20,000. See Table S1 for an overview of this dataset.
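The filtering and library-size normalization described above can be sketched in a few lines. This is a minimal NumPy stand-in for the actual pipeline; the thresholds (3 cells, 10 genes, target library size of 20,000) are taken from the text, while `filter_and_normalize` and the toy count matrix are illustrative.

```python
import numpy as np

def filter_and_normalize(counts, min_cells=3, min_genes=10, target_sum=20000.0):
    """Filter and library-size normalize a cells x genes UMI count matrix,
    following the pre-processing described in the text."""
    # Remove genes expressed in fewer than `min_cells` cells.
    counts = counts[:, (counts > 0).sum(axis=0) >= min_cells]
    # Remove cells with fewer than `min_genes` expressed genes.
    counts = counts[(counts > 0).sum(axis=1) >= min_genes, :]
    # Divide each cell by its total UMI count, then scale to `target_sum`.
    lib = counts.sum(axis=1, keepdims=True)
    return counts / lib * target_sum

# Hypothetical toy count matrix of 50 cells x 200 genes.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(1.0, size=(50, 200)).astype(float)
norm = filter_and_normalize(toy_counts)  # each row now sums to 20,000
```

After this step, every remaining cell has the same library size, which is the property the LSN layer later enforces on generated cells as well.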

Brain Large
In addition to the PBMC dataset, we trained and evaluated our best performing scGAN model on the currently largest available scRNA-seq dataset of ~1.3 million mouse brain cells (10x Genomics). The dataset was chosen to prove that the model performance scales to millions of scRNA-seq cells, even when the organism, tissue, and sample complexity vary. The sequenced cells are from the cortex, hippocampus, and subventricular zone of two E18 mice.
The barcode-filtered matrix of gene-by-cell expression values is available on the 10x Genomics website. After removing genes that are expressed in less than 3 cells, we obtained a gene matrix of 22,788 genes. We also discarded cells that have less than 10 genes expressed, which did not affect the overall number of cells. The cells were normalized for the library size by first dividing UMI counts by the total UMI counts in each cell and then multiplying by 20,000. See Table S1 for an overview of this dataset.

Brain Small
We also examined the performance of the generative models proposed in this manuscript on a subset of the Brain Large dataset provided by 10x Genomics, which consists of 20,000 cells. The pre-processing of the Brain Small dataset was identical to that of the Brain Large dataset, yielding a matrix of 17,970 genes by 20,000 cells (Table S1).

Louvain clustering
Throughout this manuscript we use the Cell Ranger workflow for the scRNA-seq secondary analysis 11 . First, the cells were normalized by UMI counts. Then, we took the natural logarithm of the UMI counts. Afterwards, each gene was normalized such that the mean expression value for each gene is 0, and the standard deviation is 1. The top 1000 highly variable genes were selected based on their ranked normalized dispersion. PCA was applied on the selected 1000 genes. In order to identify cell clusters, we used Louvain clustering 22 on the first 50 principal components of the PCA. This replaced the k-means clustering used in the Cell Ranger R analysis workflow. We used the Scanpy package 23 to run the Louvain clustering on the first 50 PCs. The number of clusters was controlled by the resolution parameter of scanpy.api.tl.louvain. Higher resolutions make it possible to find more and smaller clusters.
For the PBMC dataset we used a resolution of 0.3, which produced 10 clusters. The Brain Large dataset was clustered using a resolution of 0.15, which produced 13 clusters. Finally, the Brain Small dataset was clustered using a resolution of 0.1, which produced 8 clusters.
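The embedding part of the workflow above (log transform, highly variable gene selection, standardization, PCA) can be sketched as follows. This is a simplified stand-in, not the Scanpy implementation: `cellranger_embedding` is a hypothetical helper, and plain dispersion (variance over mean) approximates Scanpy's ranked normalized dispersion. In the paper, Louvain clustering is then run on the neighborhood graph of the resulting 50 PCs via Scanpy.

```python
import numpy as np
from sklearn.decomposition import PCA

def cellranger_embedding(norm_counts, n_hvg=1000, n_pcs=50, seed=0):
    """Cell Ranger-style embedding sketch: log transform, top variable
    gene selection, per-gene standardization, then PCA."""
    logged = np.log1p(norm_counts)
    mean = logged.mean(axis=0)
    # Simple dispersion (variance / mean) as a proxy for normalized dispersion.
    disp = logged.var(axis=0) / np.maximum(mean, 1e-12)
    n_hvg = min(n_hvg, logged.shape[1])
    hvg = np.argsort(disp)[::-1][:n_hvg]        # indices of most variable genes
    z = logged[:, hvg]
    # Standardize each selected gene to mean 0, standard deviation 1.
    z = (z - z.mean(axis=0)) / np.maximum(z.std(axis=0), 1e-12)
    n_pcs = min(n_pcs, min(z.shape) - 1)
    return PCA(n_components=n_pcs, random_state=seed).fit_transform(z)

# Toy normalized count matrix of 100 cells x 300 genes.
rng = np.random.default_rng(0)
pcs = cellranger_embedding(rng.poisson(1.0, size=(100, 300)).astype(float))
```

The 50-dimensional `pcs` matrix is the representation used downstream for clustering, t-SNE, and the Random Forest classifiers.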

Definition of marker genes
In several experiments, we investigated the expression levels and correlation of genes.
For these purposes, a group of 10 'marker genes' was defined by taking the five most highly upregulated genes for the two largest clusters in the dataset (clusters 1 & 2 for the PBMC dataset). Significant upregulation was estimated using the log-transformed expression values of the Louvain-clustered dataset and the scanpy.api.tl.rank_genes_groups function with its default parameters (Scanpy 1.2.2) 23 .

Model description

scGAN
In this section, we outline the model used for scGAN by defining the loss function it optimizes, the optimization process, and key elements of the model architecture.

GANs typically involve two Artificial Neural Networks: a generator, which, given some input random noise, trains to output realistic samples, and a critic that trains to spot the differences between real cells and the ones that the generator produces (Fig. S1). An 'adversarial' training procedure allows those entities to compete against each other in a mutually beneficial way. Formally, GANs minimize a divergence between the distribution of the real samples and that of the generated ones. Different divergences are used, giving rise to different GAN variants. While original GANs 4 minimize the so-called Jensen-Shannon divergence, they suffer from known pitfalls making their optimization notoriously difficult 24 . "Wasserstein GANs" (WGANs) 13,25 , on the other hand, use the Wasserstein distance, with compelling theoretical and empirical arguments.
Let us denote by P_r and P_g the distributions of the real and of the generated cells, respectively. The Wasserstein distance between them, also known as the "Earth Mover" distance, is defined as:

W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma} \left[ \lVert x - y \rVert \right],

where x and y are random variables and \Pi(P_r, P_g) is the set of all joint distributions \gamma(x, y) whose marginals are P_r and P_g, respectively. Those distributions represent all the ways (called transport plans) one can move "masses" from x to y in order to transform P_r into P_g. The Wasserstein distance is then the cost of the optimal transport plan.
However, in this formulation, finding a generator that generates cells from a distribution P_g minimizing the Wasserstein distance to the distribution P_r of the real cells is intractable.
Fortunately, we can use a more amenable, equivalent formulation of the Wasserstein distance, given by the Kantorovich-Rubinstein duality:

W(P_r, P_g) = \sup_{\lVert f \rVert_L \le 1} \mathbb{E}_{x \sim P_r} [ f(x) ] - \mathbb{E}_{x \sim P_g} [ f(x) ],

where \lVert f \rVert_L \le 1 restricts f to the set of 1-Lipschitz functions with values in ℝ. The solution to this problem is approximated by training a Neural Network, previously referred to as the "critic" network, whose function is denoted by f.
The inputs of the generator are realizations of a multivariate noise whose distribution is denoted by P_z. As is common in the literature, we use a centered Gaussian distribution with unit diagonal covariance (i.e. a multivariate white noise). The dimension of the used Gaussian distribution defines the size of the "latent space" of the GAN. The dimension of that latent space should reflect the "intrinsic dimension" of the scRNA-seq expression data we are learning from, and is expected to be significantly smaller than their "apparent dimension" (i.e. the total number of genes).
If we denote by G the function learnt by our generator network, the optimization problem solved by the scGAN is identical to that of a Wasserstein GAN:

\min_G \max_{\lVert f \rVert_L \le 1} \mathbb{E}_{x \sim P_r} [ f(x) ] - \mathbb{E}_{z \sim P_z} [ f(G(z)) ].

The enforcement of the Lipschitz constraint is implemented using the gradient penalty term proposed by 25 .
Hence, training a scGAN model involves solving a so-called "minmax" problem. As no analytical solution to this problem can be found, we resort to numerical optimization schemes. We essentially follow the same recipe as most of the GAN literature 4,24 , with an alternating scheme between maximizing the critic loss (for 5 iterations) and minimizing the generator loss (for 1 iteration). For both the minimization and the maximization, we use a recent algorithm called AMSGrad 26 , which addresses some shortcomings of the widely used Adam algorithm 27 , leading to a more stable training and convergence to more suitable saddle points.
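The alternating schedule described above can be sketched schematically. The callables `critic_step` and `generator_step` are placeholders for the actual AMSGrad updates of the Wasserstein losses; only the 5-to-1 update structure is taken from the text.

```python
def wgan_training_schedule(n_generator_steps, critic_iters=5,
                           critic_step=lambda: None,
                           generator_step=lambda: None):
    """Alternating minmax schedule: `critic_iters` critic updates
    (maximizing the critic loss) for every single generator update
    (minimizing the generator loss)."""
    schedule = []
    for _ in range(n_generator_steps):
        for _ in range(critic_iters):
            critic_step()              # placeholder for one AMSGrad critic step
            schedule.append("critic")
        generator_step()               # placeholder for one AMSGrad generator step
        schedule.append("generator")
    return schedule

schedule = wgan_training_schedule(4)   # 4 generator updates, 20 critic updates
```

Keeping the critic several steps ahead of the generator is what makes the critic's output a usable estimate of the Wasserstein distance at each generator update.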
The architecture of our critic and generator networks is summarized in Fig. S1.

Finally, as mentioned in the datasets and pre-processing section, each real sample used for training has been normalized for library size. We therefore introduce a custom Library Size Normalization (LSN) layer that enforces the scGAN to explicitly generate cells with a fixed library size (Fig. S1B).

LSN layer
A prominent property of scRNA-seq is the variable range of the gene expression levels across all cells. Most importantly, scRNA-seq data is highly heterogeneous even for cells within the same cell subpopulation. In the field of Machine Learning, training on such data is made easier by input normalization, which yields similarly ranged feature values that stabilize the gradients. Commonly used scRNA-seq normalization methods include library-size normalization, where the total number of reads per cell is set to a fixed value, here 20,000 (see also Datasets and pre-processing).
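The LSN layer itself is a simple rescaling; a minimal sketch, assuming the generator's output is non-negative (e.g. after a ReLU activation), is shown below. `lsn_layer` is an illustrative name, not the implementation used in the paper.

```python
import numpy as np

def lsn_layer(x, target_size=20000.0, eps=1e-8):
    """Library Size Normalization sketch: rescales each generated
    expression vector so its entries sum to a fixed library size,
    matching the normalization applied to the real training data."""
    lib = x.sum(axis=1, keepdims=True)
    return x / (lib + eps) * target_size

# Hypothetical non-negative generator output for 8 cells x 100 genes.
rng = np.random.default_rng(0)
generated = np.abs(rng.normal(size=(8, 100)))
normalized = lsn_layer(generated)
```

Because the operation is differentiable, it can sit as the last layer of the generator and gradients flow through it during training.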
We found that training the scGAN on library-size normalized scRNA-seq data helps the training and enhances the quality of the generated cells in terms of our evaluation criteria.

cscGAN
Our cscGAN leverages conditional information about each cell type, or sub-population, to enable the targeted generation of type-specific cells. The integration of such side information in a generative process is known as "conditioning". Over the last few years, several extensions to GANs have been proposed to allow for such conditioning 19,29,30 . It is worth mentioning that each of those extensions is available regardless of the type of GAN at hand.
We explore two conditioning techniques, auxiliary classifiers (ACGAN) 30 and projection-based conditioning (PCGAN) 19 . The former adds a classification loss term to the objective.
The latter implements an inner product of class labels at the critic's output. While we also report results obtained with the ACGAN (see Table S2), the best results were obtained while conditioning through projection.
In practice, the PCGAN deviates from the scGAN previously described by: (i) multiple critic output layers, one per cell type, and (ii) the use of Conditional Batch Normalization layers (BNL) 20 , whereby the learnt scaling and shifting factors of the BNL are replaced with one set per cell type.
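Conditional Batch Normalization as described in point (ii) can be sketched as follows. This is a minimal NumPy illustration of the mechanism, not the trained layer: `conditional_batch_norm` and the toy shapes are assumptions, and in the real model `gamma`/`beta` are learnt parameters.

```python
import numpy as np

def conditional_batch_norm(h, labels, gamma, beta, eps=1e-5):
    """Standardize activations across the batch, then scale and shift
    with parameters selected by each cell's type label.
    `gamma` and `beta` have shape (n_cell_types, n_features), i.e. one
    set of scaling/shifting factors per cell type."""
    h_hat = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
    return gamma[labels] * h_hat + beta[labels]

# Toy batch: 32 activation vectors of width 16, three hypothetical cell types.
rng = np.random.default_rng(0)
h = rng.normal(size=(32, 16))
labels = rng.integers(0, 3, size=32)
gamma = np.ones((3, 16))
beta = np.zeros((3, 16))
out = conditional_batch_norm(h, labels, gamma, beta)
```

With unit `gamma` and zero `beta` the layer reduces to plain batch normalization; the per-type parameters are what inject the cell type label into the generator.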
As described in Sections 2 and 3 of 19 , the success of the projection strategy relies on the hypothesis that the conditional distributions (with respect to the label) of the data at hand are "simpler", which helps stabilize the training of the GAN. For scRNA-seq data, this hypothesis likely holds, as the distribution of the gene expression levels should be "simpler" within specific cell types or sub-populations.

Model selection & evaluation
Evaluating the performance of generative models is no trivial task 12 . We designed several metrics to assess the quality of our generated cells at different levels of granularity. We will now describe in detail how those metrics were obtained. They can be grouped into two categories: the metrics we used for model selection (in order to tune the hyper-parameters of our GANs) and the metrics we introduced in the Results section.

Metrics used for model selection
For each of our models, before starting the training, we randomly pick 3,000 cells from our training data and use them as a reference to measure how the model performs. We therefore refer to those 3,000 cells as "real test cells". To optimize the hyper-parameters, we trained various models and evaluated their performance through a few measures, computed during the training procedure: (a) the distance between the mean expression levels of generated cells and real test cells, (b) the mean sparsity of the generated cells, and (c) the intersection between the most highly variable genes of the generated cells and the real test cells.
First, we compute the mean expression value for each gene in the real test cells. During the training procedure, we also compute the mean expression value for each gene in a set of 3,000 generated cells. The discrepancy is then obtained by computing the Euclidean distance between the mean expression values of the real and the generated cells.
scRNA-seq data typically contains many genes with 0 read counts per cell, which we also use to estimate the similarity of generated and real cells. Naturally, similar sparsity values for real and generated cells indicate good model performance, whereas large differences indicate poor performance.
Finally, using the Scanpy 23 package, we estimate the 1,000 most highly variable genes from the real data. During the training, we also estimate what are the 1,000 most highly variable genes from a sample of 3,000 generated cells. We use the size of the intersection between those two sets of 1,000 highly variable genes as a measurement of the quality of the generation.
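The three model-selection measures above can be computed together as sketched below. `selection_metrics` is an illustrative helper; plain per-gene variance stands in for Scanpy's normalized-dispersion ranking of highly variable genes.

```python
import numpy as np

def selection_metrics(real, gen, n_hvg=1000):
    """Compute (a) the Euclidean distance between per-gene mean expression
    of real and generated cells, (b) the gap in mean sparsity (fraction of
    zero entries), and (c) the overlap of the `n_hvg` most variable genes
    (variance as a proxy for normalized dispersion)."""
    n_hvg = min(n_hvg, real.shape[1])
    mean_dist = np.linalg.norm(real.mean(axis=0) - gen.mean(axis=0))
    sparsity_gap = abs((real == 0).mean() - (gen == 0).mean())
    hvg_real = set(np.argsort(real.var(axis=0))[-n_hvg:])
    hvg_gen = set(np.argsort(gen.var(axis=0))[-n_hvg:])
    return mean_dist, sparsity_gap, len(hvg_real & hvg_gen)

# Sanity check on toy data: a perfect generator reproduces the real cells.
rng = np.random.default_rng(0)
real = rng.poisson(1.0, size=(200, 500)).astype(float)
m_dist, s_gap, hvg_overlap = selection_metrics(real, real.copy(), n_hvg=100)
```

For a well-trained model, `m_dist` and `s_gap` should be small and `hvg_overlap` close to `n_hvg`.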

Gene expression and correlation
To highlight the performance of our models, we used violin plots of the expression of several marker genes along with heatmaps displaying the correlation between those same marker genes as expressed among all clusters, or among specific clusters.
To produce those plots, we used the expression levels of cells (either test real, or generated by scGAN, cscGAN), in a logarithmic scale. For the heatmaps, we compute the Pearson product-moment correlation coefficients.
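The correlation heatmaps can be reproduced with NumPy as sketched below; the toy `expr` matrix is a hypothetical stand-in for the marker gene expression values described above.

```python
import numpy as np

# Hypothetical log-scale expression of 10 marker genes across 300 cells.
rng = np.random.default_rng(0)
expr = rng.poisson(2.0, size=(300, 10)).astype(float)
log_expr = np.log1p(expr)  # logarithmic scale, as used for the plots

# Pearson product-moment correlation between the 10 marker genes,
# yielding the 10 x 10 matrix rendered as a heatmap.
corr = np.corrcoef(log_expr, rowvar=False)
```

The same computation, restricted to the cells of a single cluster, yields the cluster-specific heatmaps.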

t-SNE plots
To visualize our generated and real cells within the same t-SNE plot, they are embedded simultaneously. In brief, we are trying to assess how "realistic" the generated cells are.
Thus our reference point is the real data itself. The delineation of what constitutes noise 28 and what constitutes biologically relevant signal should be driven by the real data only.
Hence we project the generated cells onto the first 50 principal components computed from the real cells in the Cell Ranger pipeline 11 (see also Datasets and pre-processing). From this low-dimensional representation, we compute the t-SNE embedding.
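The projection-then-embed procedure can be sketched with scikit-learn. The `real` and `generated` matrices here are hypothetical stand-ins; the key point, taken from the text, is that the PCA basis is fitted on the real cells only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical expression matrices for real and generated cells.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 100))
generated = rng.normal(size=(200, 100))

# Fit the 50-PC basis on the real cells only, so the notion of signal
# vs. noise is driven by the real data, then project both sets and
# embed them jointly with t-SNE.
pca = PCA(n_components=50, random_state=0).fit(real)
joint = np.vstack([pca.transform(real), pca.transform(generated)])
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(joint)
```

The first 200 rows of `embedding` correspond to real cells and the last 200 to generated cells, so both can be colored separately in one plot.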
Classification of real vs. generated cells
Building on the 50-dimensional representation of the cells (see t-SNE plots section), we trained classifiers to distinguish between real test cells and generated cells. As mentioned in the Results section, we trained Random Forest classifiers with 1,000 trees and a Gini impurity criterion for the quality of the decision splits, using the scikit-learn package 31 . The maximum depth of the classifier is set such that the nodes are expanded until all leaves are pure or until all leaves contain fewer than 2 samples. The maximum number of features used is the square root of the number of genes.
In order to produce Fig. 1C, which highlights the ability to separate real from generated cells, irrespective of which cluster they are coming from, we used the whole real test set along with generated cells. On the other hand Fig. 2C and 2D are cluster specific (cluster 2 and cluster 5 respectively). We trained the Random Forests using only the cells from those specific clusters. To prevent bias due to class imbalance, each model was trained using an equal number of real test cells and generated cells.
We used a 5-fold cross-validation procedure to estimate how each classifier generalizes.
To assess this generalization performance, we plotted the Receiver Operating Characteristic (ROC) curves obtained for each fold, along with the average of all the ROC curves, and display the AUC in each of those cases. In the corresponding figure, each row represents the performance obtained for cluster-specific cells (rows denoted 1 to 6), while the last row ("global") shows the results obtained for all clusters together.
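The real-vs-generated classification with cross-validated AUC can be sketched as follows. The toy 50-dimensional representations are hypothetical; 200 trees are used here for brevity where the paper uses 1,000, and the other RF settings mirror those stated above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the 50-dimensional PCA representations of real and
# generated cells; an AUC near 0.5 would mean they are indistinguishable.
rng = np.random.default_rng(0)
real_pcs = rng.normal(0.0, 1.0, size=(300, 50))
generated_pcs = rng.normal(0.2, 1.0, size=(300, 50))

# Equal numbers of real and generated cells to avoid class imbalance.
X = np.vstack([real_pcs, generated_pcs])
y = np.array([1] * 300 + [0] * 300)  # 1 = real, 0 = generated

rf = RandomForestClassifier(
    n_estimators=200,        # the paper uses 1,000 trees
    criterion="gini",
    max_features="sqrt",
    random_state=0,
)
# 5-fold cross-validated AUC, one value per fold.
auc_per_fold = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
mean_auc = auc_per_fold.mean()
```

Plotting the per-fold ROC curves of the fitted classifiers then yields figures of the kind described above.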

Downsampling
To assess the impact of cluster size on the ability of the cscGAN to model the cluster we artificially reduced the number of cells of the relatively large PBMC cluster 2. We call this approach 'downsampling' throughout the manuscript.
Eight different percentages {50%, 25%, 10%, 5%, 3%, 2%, 1%, 0.5%} of cluster 2 cells were sampled using a random seed and a uniform sampling distribution over all the cluster 2 cells (Table S3). We sampled nested subsets (for each seed, the smaller percentage samples are a complete subset of the larger ones). In order to accurately estimate the generalization error across the different experiments and to avoid potential downsampling (sampling bias) artifacts, we conducted all our experiments using five different random seeds. For the classification of cell subpopulations (see next paragraph) we report the average as well as the 5 individual values for the different seeds.
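The nested-subset property described above can be implemented as sketched below; `nested_downsamples` is an illustrative helper, with the fractions and cluster 2 size taken from the text.

```python
import numpy as np

def nested_downsamples(n_cells, fractions, seed):
    """Draw nested downsampled subsets of cell indices: for a fixed seed,
    every smaller fraction is fully contained in the larger ones, and
    sampling is uniform over all cells."""
    order = np.random.default_rng(seed).permutation(n_cells)
    # Taking prefixes of one shuffled order guarantees the nesting.
    return {f: order[: max(1, int(round(f * n_cells)))] for f in fractions}

fractions = [0.5, 0.25, 0.1, 0.05, 0.03, 0.02, 0.01, 0.005]
subsets = nested_downsamples(15008, fractions, seed=0)  # cluster 2 size
```

Repeating this with five different seeds gives the five replicates used to estimate the generalization error.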

Classification of cell sub-populations
To investigate the use of the proposed cscGAN model for data augmentation, we examined the performance of cell subpopulation classification before and after augmenting the real cells with generated cells. For this purpose, and as described in the previous paragraph and the Results section, we produced alternate datasets with subsampled cluster 2 populations (Fig. S8).
For simplicity, we focus in this section on the experiment where cluster 2 cells were downsampled to 10% using five different random seeds (Fig. S9). We advise to use Figure S9 as an accompanying visual guide to this text description.
Using the previously introduced 50-dimensional PCA representation of the cells, three RF models were trained to distinguish cluster 2 cells from all other cell populations (RF downsampled, RF upsampled, and RF augmented) (Fig. S9A). In the training data for all the three classifiers, 70% of the cells from all the clusters except cluster 2 (i.e. 37,500 cells) were used (light blue boxes in Fig. S9A).
RF downsampled: For the first RF classifier, we used 10% of cluster 2 cells (1,502 cells) and 70% other cells (37,500 cells) to train the RF model. We refer to this dataset as the "RF downsampled" set in Fig. S9A. This dataset was also used to train the cscGAN model, which is used later to generate in silico cluster 2 cells (Fig. S9C). It is important to note that RF classifiers for 'RF downsampled' datasets always use weights that account for the cluster-size imbalance. The reason for this is that RFs are sensitive to unbalanced classes, leading to classifiers that always predict the much larger class, thereby 'optimizing' the classification error 32 .
RF upsampled: For the second RF classifier, we uniformly sampled with replacement 5,000 cells from the 1,502 cluster 2 cells (10%). We added those 5,000 (copied) cells to the original 1,502 cells. We refer to this dataset as "RF upsampled" in Fig. S9A. Again, the rationale for this 'upsampling' is that RF classifiers are sensitive to class imbalances, as outlined in the previous paragraph (RF downsampled). Of note, we also conducted experiments where we added standard Gaussian noise to the upsampled cells, which always reduced the performance of the RF classifier; those results are therefore not shown.
RF augmented: Finally, the third classifier training data "RF augmented" consists of 10% cluster 2 cells as well as 5,000 cluster 2 cells generated using the 10% cscGAN model as shown in Fig. S9C. The 10% cscGAN model was trained on 10% cluster 2 cells as well as all other cells (53,571 cells, Fig. S9C).
The RF classifiers were trained using the same parameters as described in the 'Classification of real vs. generated cells' methods section, using 1,000 trees and Gini impurity.
The only difference is that here the class weights during the training are adjusted inversely proportionally to the class frequencies. This adjustment of class weights is of paramount importance when imbalanced training data is used 32 . The scikit-learn package 31 was used to conduct all experiments to classify cell sub-populations.
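The class-weighted classification of a sparse population can be sketched with scikit-learn's built-in `class_weight="balanced"` option, which implements exactly the inverse-frequency weighting described above. The toy data standing in for "downsampled cluster 2 vs. rest" is hypothetical, and 200 trees are used here for brevity where the paper uses 1,000.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy imbalanced problem in a 50-dimensional representation:
# a large "other clusters" class and a small "cluster 2" class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(2000, 50)),
               rng.normal(0.8, 1.0, size=(150, 50))])
y = np.array([0] * 2000 + [1] * 150)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# class_weight="balanced" reweights classes inversely proportionally
# to their frequencies, as described in the text.
rf = RandomForestClassifier(n_estimators=200, criterion="gini",
                            class_weight="balanced", random_state=0)
rf.fit(X_tr, y_tr)
f1 = f1_score(y_te, rf.predict(X_te))
```

The F1 score of the minority class, computed as above, is the metric reported in Fig. 3A for the downsampled, upsampled, and augmented training sets.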
Test cells consisted of 30% of the data from all the clusters. Since we are testing the cscGAN's ability to augment different percentages of real cluster 2 cells, we made sure that the 30% of cluster 2 cells used in the test set were selected from the cells which were not seen by any trained cscGAN model (Fig S8 & S9B).
To prove that the downsampling limits the ability to classify and that augmenting the dataset mitigates this effect, all three RF classifiers were trained to classify cluster 2 cells vs. all other sub-populations. The F1 score of each classifier is calculated and presented in different colors (Fig. 3A).
Furthermore, in order to understand how augmentation helps to separate close clusters, we trained the same three RF classifiers after removing all clusters except clusters 2 and 1 from the corresponding training data. We repeated this procedure for clusters 2 and 5, and clusters 2 and 3. We chose those clusters in particular because their highly differentially expressed genes are also highly expressed in cluster 2, meaning that separating them from cluster 2 is more difficult (Fig. 1A). In a similar way, F1 scores for the classification of cluster 2 vs. 1 (Fig. S10A), cluster 2 vs. 3 (Fig. S10B), and cluster 2 vs. 5 (Fig. S10C) are calculated and reported.
As mentioned above, we repeated this procedure for different downsampling levels of cluster 2 cells and for 5 different sampling seeds for each level (Table S3).
When training the RF classifier with the augmented dataset, the number of cells used in the augmentation was set to 5,000 cells. This, however, doesn't necessarily mean that 5,000 cells is the optimal number of cells to be added. The increase in the F1 score due to augmenting the data with generated cells depends on two factors: (i) the number of real cells in the original sub-population and (ii) the number of cells used for augmentation.
To highlight the impact of the number of generated cells used for data augmentation, we trained the previously mentioned RF classifiers using different numbers of generated cells (from 100 to 12,000) while keeping the number of other cells constant (see Fig. S11).
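The sweep over the number of generated cells can be sketched as a simple loop: stack a varying number of generated cluster 2 cells onto a fixed real training set, retrain, and record the F1 score. All matrices below are mock stand-ins for the real data; only the loop structure mirrors the procedure described in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_genes = 50

# Mock stand-ins for the real training/test matrices and the cscGAN output.
real_train_X = rng.random((500, n_genes))
real_train_y = np.array([0] * 450 + [1] * 50)      # sparse cluster 2
generated_cluster2 = rng.random((12000, n_genes))  # mocked generated cells
test_X = rng.random((200, n_genes))
test_y = rng.integers(0, 2, 200)

f1_per_size = {}
for n_gen in (100, 1000, 12000):  # subset of the 100-12,000 range in the text
    # Augment the fixed real training set with n_gen generated cluster 2 cells.
    X_aug = np.vstack([real_train_X, generated_cluster2[:n_gen]])
    y_aug = np.concatenate([real_train_y, np.ones(n_gen, dtype=int)])
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_aug, y_aug)
    f1_per_size[n_gen] = f1_score(test_y, clf.predict(test_X))
```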

Splatter comparison
In addition to what has been presented in the Results section, we also compared the performance of the scGAN to Splatter 16 , using the metrics described in their manuscript. Briefly, the Splatter simulation is based on a gamma-Poisson hierarchical model, where the mean expression of each gene is drawn from a gamma distribution and the cell counts from a Poisson distribution. We noticed that Splatter uses the Shapiro-Wilk test to evaluate the library size distribution, which limits the number of input cells to 5,000. We therefore slightly modified the code so that Splatter can take more than 5,000 cells as input.
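The gamma-Poisson hierarchy underlying Splatter can be sketched in a few lines of NumPy. The shape and scale parameters below are illustrative placeholders, not the values Splatter estimates from real data:

```python
import numpy as np

rng = np.random.default_rng(42)
n_genes, n_cells = 1000, 100

# Gamma step: draw one mean expression level per gene.
gene_means = rng.gamma(shape=0.5, scale=20.0, size=n_genes)

# Poisson step: draw each cell's counts around the per-gene means
# (lam broadcasts across the cell axis).
counts = rng.poisson(lam=gene_means[:, None], size=(n_genes, n_cells))
```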
While the scGAN learns from and generates library-size normalized cells, Splatter is not suited for that task. For the sake of fairness, we used the Splatter package on the non-normalized PBMC training dataset. We then generated (non-normalized) cells, which we normalized so that they could be compared to the cells generated by the scGAN. Following 16 , we used the following evaluation metrics: the distributions of the mean expression, of the variance, and of the library sizes, as well as the ratio of zero read counts in the gene matrix. The results were computed using the Splatter package and are reported in Fig. S5.
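Library-size normalization, as applied here to the generated counts before comparison, amounts to scaling each cell to a common total count. A minimal sketch with a median-based target size (Scanpy's `sc.pp.normalize_total` implements the same idea; the count matrix below is mock data):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(200, 1000)).astype(float)  # cells x genes, mock

lib_sizes = counts.sum(axis=1, keepdims=True)  # total reads per cell
target = np.median(lib_sizes)                  # common library size
normalized = counts / lib_sizes * target       # every cell now sums to `target`
```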
We observe that the results obtained by Splatter are marginally better than or identical to those of the scGAN (Fig. S3). These measures suggest that both Splatter and the scGAN produce almost perfect simulations. However, Splatter simulates "virtual" genes. While those genes share some characteristics with the real genes from which Splatter infers its parameters, there is no one-to-one correspondence between any "virtual" gene simulated by Splatter and a real gene. It is therefore nonsensical to compare Splatter-simulated cells with real cells, as we did to evaluate the quality of (c)scGAN-generated cells. This also prohibits the use of Splatter for data augmentation purposes.
That being said, we would also like to point out that while the (c)scGAN is able to capture the gene-gene dependencies present in the real data (Fig. 1B and S7), this does not hold for Splatter, whose "virtual" genes are mostly independent of each other. To prove this point, we extracted the 100 most highly variable genes from the real cells, the cells generated by Splatter, and the cells generated by the scGAN. We then computed the Pearson correlation coefficients between each pair within those 100 genes (Fig. S4). This reveals that while the most highly variable genes in the real cells, and in those generated by the scGAN, exhibit some strong correlations, the highly variable genes are mostly independent of each other in the cells generated by Splatter. As already mentioned in the Results section, the co-regulation of genes is an important aspect of biological gene-regulatory networks and should be taken into account when generating in silico data for algorithmic testing.
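The dependency check above can be sketched as: select the 100 genes with the highest variance, then compute all pairwise Pearson correlations. The expression matrix below is mock data; on real data the gene selection would typically use Scanpy's highly-variable-genes routine:

```python
import numpy as np

rng = np.random.default_rng(1)
expr = rng.random((500, 2000))  # cells x genes, mock expression matrix

# Pick the 100 genes with the highest variance across cells.
top = np.argsort(expr.var(axis=0))[-100:]

# 100 x 100 matrix of pairwise Pearson correlation coefficients
# (np.corrcoef treats rows as variables, so transpose to genes x cells).
corr = np.corrcoef(expr[:, top].T)
```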
Software, packages, and hardware used

Our (c)scGAN Tensorflow 33 implementation can be found on https://github.com/imsbuke/scGAN, including documentation for the training of the (c)scGAN models. As mentioned several times, we used Scanpy 23 to conduct most of the data analysis. We also compared our results to those of Splatter 16 , and adapted the code they provided on Github (https://github.com/Oshlack/splatter).
For the sake of reproducibility, here is a list of the versions of all the packages we used:

… and scGAN generated (right) cells for the Brain Large dataset (1.3 million mouse brain).

A: RF training was conducted on three different datasets for each percentage of downsampling of cluster 2 cells (10% is used as an example in the figure). The "RF downsampled" dataset consists of the 10% cluster 2 set (yellow). The "RF upsampled" dataset contains 6,502 cluster 2 cells sampled with replacement from the 10% cluster 2 set. The "RF augmented" dataset contains the 10% cluster 2 set and 5,000 cells generated using the generated 10% cluster 2 set (grey, see also panel C). In addition, all three datasets contain the other training set (light blue). B: RF testing was conducted on the 30% cluster 2 test set (green) and the other test set (dark blue). C: For data augmentation (see "RF augmented"), we generate cluster 2 cells using a cscGAN. The cscGAN is trained on the 10% cluster 2 set (yellow) and the other clusters set (blue), yielding a generated 10% cluster 2 set (grey).