Abstract
Advances in single-cell technologies enable the routine interrogation of chromatin accessibility for tens of thousands of single cells, elucidating gene regulatory processes at an unprecedented resolution. Meanwhile, the size, sparsity and high dimensionality of the resulting data continue to pose challenges for its computational analysis, and specifically for the integration of data from different sources. We have developed a dedicated computational approach: a variational autoencoder using a noise model specifically designed for single-cell ATAC-seq (assay for transposase-accessible chromatin with high-throughput sequencing) data, which facilitates simultaneous dimensionality reduction and batch correction via an adversarial learning strategy. We showcase its benefits for detailed cell-type characterization on individual real and simulated datasets as well as for integrating multiple complex datasets.
Main
Rapid advances in single-cell epigenomics technologies, including single-cell assay for transposase-accessible chromatin with high-throughput sequencing (scATAC-seq), have enabled the interrogation of gene regulation at an unprecedented resolution. scATAC-seq profiles the accessibility of chromatin across the whole genome and is currently the most widely adopted protocol for identifying candidate regulatory regions of importance to the system under investigation. The resulting datasets require specialized computational tools to cope with their characteristic high dimensionality and sparsity, and these tools will need to scale to future datasets.
A key step in every scATAC-seq processing pipeline is dimensionality reduction, which aims to represent the most salient trends in the data at lower dimensionality, such as groups of similar cells. The quality of this step is critical, as it precedes other analysis tasks, including cell-type characterization, identifying cell-type-specific regulatory regions, motif analysis and so on. Several methods have been introduced for dimensionality reduction using scATAC-seq data, including scABC^{1}, chromVAR^{2}, Scasat^{3}, latent Dirichlet allocation (cisTopic)^{4}, latent semantic indexing (LSI)^{5}, iterative LSI^{6}, SnapATAC^{7}, SCALE^{8}, scDEC^{9} and PeakVI^{10}. A recent benchmark analysis showed that current computational tools work well for cell-type characterization for small or moderate dataset sizes, but may not scale to large dataset sizes and/or vary in performance across different datasets^{11}. Apart from performing dimensionality reduction, the growing number of published datasets opens up new avenues for data integration of replicates or data obtained with different protocols, such as combinatorial indexing or droplet-based approaches^{12}.
To address the lack of dedicated scATAC-seq tools that enable simultaneous data integration (for example, batch correction) and dimensionality reduction, we have developed BAVARIA, a batch-adversarial variational autoencoder (VAE)^{13}, which facilitates dimensionality reduction and integration for scATAC-seq data. To this end, we extended the standard VAE framework in several ways. First, inspired by combining deep learning with specialized and suitable count noise models for processing single-cell RNA-seq data (for example, by using a zero-inflated negative binomial distribution^{14,15}), we set out to find a suitable count model for scATAC-seq. We identified the negative multinomial-based reconstruction loss to outperform alternative reconstruction losses, including the binary cross-entropy and the multinomial-derived loss, with respect to extracting useful information about the underlying cell types (Extended Data Fig. 1 and Methods). The multinomial part of the reconstruction loss describes the accessibility profile across all regions as a whole, rather than considering the regions independently (for example, as is assumed for the binary cross-entropy loss). This reduces the risk of obtaining a poor local minimum (for example, due to overfitting) and achieves invariance with respect to the read depth. The dispersion parameter, on the other hand, offers robustness against outliers during training. Second, fitting neural networks is commonly based on stochastic optimization, which may lead to variable latent feature quality across multiple training runs. Owing to the optimizer getting stuck in a poor local optimum, cell types may occasionally be poorly characterized. Here, an ensemble approach, whereby latent features of several independently trained models are concatenated, not only stabilizes the latent feature quality, but also appears to improve it compared to latent features from individual models (Extended Data Fig. 2 and Methods).
Third, we adopted a domain-adversarial training strategy^{16} that encourages the VAE to extract latent features uninformative of batch effects. Specifically, we use batch-discriminator networks not only at the final layer of the encoder, as suggested in Ganin et al.^{16}, but also at intermediate layers of the encoder (Fig. 1). This puts more emphasis on removing irrelevant batch-associated information in the initial layers of the network, which we find to enhance the batch correction capabilities (see comparison below and Methods). We refer to our approach as batch-adversarial VAE or BAVARIA (Fig. 1).
BAVARIA improves cell-type characterization
We systematically assessed the suitability of BAVARIA-derived low-dimensional feature representations for cell-type characterization on a range of real and synthetic datasets. To this end, we expanded on a recently published benchmarking framework^{11} for comparison with current state-of-the-art solutions (cisTopic^{4}, LSI^{5}, SnapATAC^{7}, ArchR^{6}, scDEC^{9}, SCALE^{8} and PeakVI^{10}). Briefly, for each method, a low-dimensional feature representation is extracted and subjected to cell clustering. For BAVARIA, the low-dimensional representation is derived from the latent features of the encoder. Subsequently, a range of clustering evaluation scores are used to determine how well the cell clusters reflect cell identities based on known ground-truth cell labels (if available) or by assessing the separation of known marker genes (Methods and ref. ^{11}).
Across three publicly available datasets, we observe variable performance of LSI, cisTopic, SnapATAC, SCALE, ArchR and PeakVI. Compared to these methods, scDEC seems to perform less well on the datasets. No single method consistently outperformed the other methods across the datasets (Fig. 2c,d). On the other hand, we find that BAVARIA robustly achieves the best or competitive performance for uncovering cell identities across all real datasets (Fig. 2c,d). While PeakVI achieves a slightly better performance compared to BAVARIA on the Cusanovich 2018 subset when evaluated with the adjusted Rand index (ARI) score, we find BAVARIA to perform similarly well or slightly better on that dataset for the adjusted mutual information (AMI) and homogeneity score (Hom). The performance of BAVARIA is also exemplified by the separation of known cell types in the uniform manifold approximation and projection (UMAP) embedding (Fig. 2b) and the high similarity of network-imputed and original signal tracks within distinct known cell types (Fig. 2a).
Next, we assessed the performance of the methods at different read depths and noise levels using two synthetic datasets (a bone marrow and an erythropoiesis dataset). We generated larger simulated datasets relative to ref. ^{11}, as we found the previously used sizes too small to reflect realistic current experiments and to fit large neural networks (Methods). For the synthetic bone marrow datasets, LSI and BAVARIA both achieve similarly high performance across all sparsity levels, while the performance of cisTopic, SnapATAC, SCALE, PeakVI and scDEC decreases earlier with decreasing read depth (see 250 and 500 fragments per cell; Extended Data Fig. 3a). All methods perform similarly well at higher noise levels for the bone marrow dataset, except for scDEC, which exhibited a systematically lower performance than the other tools, and PeakVI, which exhibited a moderate performance decrease for 40% noise (Extended Data Fig. 3b). On a synthetic erythropoiesis dataset, all methods achieve similar results in the high read coverage regime (see ‘5,000 fragments per cell’; Extended Data Fig. 3c). Yet, with decreasing read coverage, BAVARIA outperforms the other methods (see 1,000, 500 and 250 fragments per cell; Extended Data Fig. 3c). In addition, LSI and BAVARIA perform best for 0% and 20% additional noise, and BAVARIA outperforms all other methods for 40% additional noise using synthetic erythropoiesis data (Extended Data Fig. 3d).
Overall, in comparison to other methods we find that BAVARIA is robust and capable of extracting meaningful latent feature representations across a range of datasets.
Batch-adversarial training facilitates data integration
Having established the superior performance on benchmark tasks, we turned to BAVARIA’s signature feature of data integration. We analysed two adult mouse brain cell samples from different sources: 10X Genomics^{17} and Cusanovich et al.^{5} (Methods). To quantify the contribution of our new data integration strategy, we first disabled adversarial training with BAVARIA. Here, cells from the two datasets occupy non-overlapping territories in the cell embedding space, which underlines the severity of the batch effect (Fig. 3a). That is, cells largely cluster by batches. By contrast, with batch-adversarial training, BAVARIA achieves a markedly better alignment between cells from different batches, while also largely maintaining a separation between previously annotated cell types (Fig. 3b). Compared to the originally proposed adversarial strategy by Ganin et al.^{16} of using a single batch-discriminator network at the final layer of the encoder, we observe an improved cell-mixing performance when batch discriminators are placed not only on the final encoder layer, but also on the hidden encoder layers (Extended Data Fig. 4). In addition, we observe considerable batch effects when using a conditional VAE variant of BAVARIA in which one-hot encoded batch labels are used as additional inputs for the encoder (Extended Data Fig. 4)—similar to that proposed for scVI^{15}.
We compared BAVARIA against several batch integration methods (scVI, trVAE, SAUCIE, Harmony, Seurat v3 CCA, Liger and PeakVI). With the exception of PeakVI, these tools were not specifically designed for processing scATAC-seq data. This enabled us to assess whether an approach dedicated to the characteristics of single-cell open chromatin would surpass the naive strategy of using a tool tailored to a different modality. Indeed, we find that BAVARIA and PeakVI appear to achieve a reasonable separation between cell clusters, compared to tools that were not specifically designed for processing scATAC-seq data (Fig. 3b). For instance, with SAUCIE and trVAE, different cell types largely remain connected in the embedding. Harmony appears to have merged some cell types (see the centre-top in the Harmony panel of Fig. 3b). The integration with Liger has led to a substantial number of misalignments of 10X-derived cells with cerebellar granule cells, and considerable batch effects are still visible after the integration with Seurat (Fig. 3a,b). Not only does BAVARIA separate cell types reasonably well, but our model also yields a markedly better mixing of cells between batches compared to all other tools, as measured by the kBET score and as is evident from the UMAP embeddings (Fig. 3a). In particular, we find that batch-conditional VAE models (for example, scVI, PeakVI and the conditional variant of BAVARIA) are prone to leaving residual batch effects after the integration (Fig. 3a and Extended Data Fig. 4).
Next, we clustered the cells based on the joint latent features and inspected pseudobulk accessibility profiles stratified by the batches around several marker genes. For clusters with relatively even representation of cells from both batches, highly concordant accessibility profiles across clusters can be observed (for example, Slc17a7, Caln1, Gad2, Tmem119, Aldh1l1, Mbp and Abca4; Extended Data Fig. 5a,b,d,e). This suggests that the latent features derived from BAVARIA are suitable for cell label transfer, as cells appear to cluster together based on their underlying cell type. On the other hand, as a sign of successful integration of data taken from similar but not identical sources, we also find cell clusters that are primarily present in one of the samples. For instance, clusters 0 and 18 consist mostly of Cusanovich 2018 cells, which exhibit accessibility at Mmp24 and Itga11, whereas cluster 16 consists mostly of 10X Genomics cells, which exhibit cluster-specific accessibility around Sh3bp4. While label transfer from an annotated reference onto a dataset without cell-type annotation is possible in a supervised manner (for example, by classifying cell types based on their accessibility profile), unsupervised data integration (for example, via BAVARIA) offers the possibility to correct and account for imperfections of the original cell annotations (for example, in the reference dataset). For instance, we observe several subpopulations of cells that have previously been annotated as inhibitory neurons (for example, clusters 4 and 14; Extended Data Fig. 5a). We do not find a separation of these cell clusters when using the other integration methods. Similarly, a cell population previously annotated as unknown (cluster 22; Extended Data Fig. 5c) likely represents doublet events between cerebellar granule cells and oligodendrocytes, as these cells exhibit specific accessibility for both cell types (compare clusters 0, 2 and 22; Extended Data Fig. 5c).
Computational requirements
We compared VAE-based models (SCALE, scVI, PeakVI and BAVARIA) and cisTopic in terms of runtime and memory requirements on a synthetic bone marrow dataset consisting of 12,000 cells and 80,000 peaks and a coverage of 5,000 reads per cell (based on the synthetic data from the benchmarking framework). The comparison was performed on a Linux server using an Intel Xeon Platinum 8168 CPU @ 2.70 GHz processor with 3 TB RAM and an NVIDIA Tesla P40 GPU. The VAE-based models utilized a GPU for model fitting. All models were fitted for 100 epochs with otherwise default settings. For BAVARIA, a single VAE was fitted. A cisTopic model was fitted with 10 topics using CPUs.
The runtime for cisTopic is markedly higher than for the deep learning-based methods, as the latter benefit from GPU-accelerated processing (Supplementary Table 1). Fitting a single VAE within BAVARIA requires similar or slightly lower runtime compared to the other VAE-based methods (Supplementary Table 1).
All methods handle data in the form of sparse matrices, which is critical for scATAC-seq processing and in general leads to similar memory footprints. For BAVARIA, we make use of TensorFlow data pipelines^{18} to optimize mini-batch processing during training and evaluation. This, however, introduces a higher memory overhead compared with the other methods (Supplementary Table 1).
In general, the memory and runtime requirements of BAVARIA depend linearly on the number of cells and the number of peaks. Additionally, the runtime depends linearly on the number of training epochs and the ensemble size. For large datasets, a trade-off between speed and accuracy can be achieved by (1) adjusting the ensemble size, (2) subsetting the cells for model fitting and/or (3) subsetting the available features (for example, peaks).
Conclusion
In summary, we have developed a VAE for integrating sparse and high-dimensional scATAC-seq data (BAVARIA). We demonstrated that several unique aspects, regarding robust training and noise modelling for scATAC count data, allow the model to match and exceed the performance of current state-of-the-art solutions for cell-type characterization across (1) different dataset sizes, (2) different read depths and (3) different noise levels. Importantly, its batch-adversarial training strategy makes BAVARIA the first tool to facilitate data integration and accurate batch correction across different scATAC-seq protocols.
Methods
Benchmark analysis
Data for the benchmark analysis was obtained by following and adapting a recently published scATAC-seq benchmarking framework^{11}. We obtained (1) a haematopoietic differentiation dataset (n = 2,034 cells; Buenrostro 2018^{19}), (2) a single-nucleus combinatorial-indexing ATAC-seq mouse tissue dataset (n = 81,173 cells; Cusanovich 2018^{5}) and (3) a peripheral blood mononuclear cell dataset (n = 5,335 cells; 10X peripheral blood mononuclear cells (PBMC) 5k). The former two datasets also include ground-truth cell labels, obtained from fluorescence-activated cell sorting (FACS)-sorted cells or by using the tissue of origin as label. The benchmark also includes a 15% subset of cells from the mouse tissue dataset (n = 12,178 cells; Cusanovich 2018 subset^{5}).
The haematopoietic dataset was independently processed twice with two different sets of peaks: the peaks determined from the scATAC-seq data by Chen et al.^{11} (Buenrostro 2018) and the original peak set reported by Buenrostro et al.^{19} (Buenrostro 2018 bulk peaks).
Preprocessing includes feature counting in the predefined peak regions, binarization of the count matrix, filtering for a minimum read coverage per peak (a region must be covered in at least 1% of cells) and removal of sex chromosomes. This led to 98,738, 132,110, 378,894, 141,388 and 67,427 peaks for the Buenrostro 2018, Buenrostro 2018 bulk peaks, Cusanovich 2018, Cusanovich 2018 subset and 10X datasets, respectively.
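The binarization and filtering steps described above can be sketched in a few lines of Python using a sparse cells-by-peaks matrix. The function name, argument layout and chromosome labels here are illustrative, not part of any published pipeline:

```python
import numpy as np
from scipy import sparse

def preprocess(counts, chroms, min_cell_frac=0.01):
    """Binarize a cells-by-peaks count matrix, then keep only peaks that
    are covered in at least `min_cell_frac` of cells and that do not lie
    on a sex chromosome. `chroms` gives the chromosome of each peak."""
    X = (counts > 0).astype(np.int8)  # binarize the counts
    # coverage filter: number of cells covering each peak
    keep = np.asarray(X.sum(axis=0) >= min_cell_frac * X.shape[0]).ravel()
    # remove peaks on sex chromosomes
    keep &= ~np.isin(np.asarray(chroms), ["chrX", "chrY"])
    return X[:, keep], keep
```

Usage: `X_filtered, mask = preprocess(count_matrix, peak_chromosomes)` returns the filtered binary matrix and the boolean peak mask.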
We followed the procedure of ref. ^{11} to generate several synthetic datasets based on FACS-sorted bulk ATAC-seq samples from bone marrow^{20} and erythropoiesis^{21}, as described previously^{11}. As the originally published benchmarking assessment consists of too few cells for fitting large neural networks, we increased the numbers of cells. Specifically, for bone marrow, 2,000 cells per population were generated with different fragment sizes per cell (5,000, 2,500, 1,000, 500 and 250 fragments) as well as for different noise levels (0%, 20% and 40% additional noisy reads). Likewise, for erythropoiesis, 1,000 cells per population were generated with different fragment sizes per cell (5,000, 2,500, 1,000, 500 and 250 fragments) as well as for different noise levels (0%, 20% and 40% noise). Note also that the downsampling experiment based on the erythropoiesis data was not part of the original benchmarking assessment^{11}. Finally, for all synthetic datasets, the 80,000 most covered regions were retained for the benchmark analysis.
Mouse brain cell integration
We downloaded fresh adult mouse brain cell data from the 10X Genomics website (https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_v1_adult_brain_fresh_5k) as well as scATAC-seq data from several mouse tissues^{5} from https://atlas.gs.washington.edu/mouse-atac/data/.
We used the peaks provided by 10X Genomics as reference peaks^{17}. The master peaks from Cusanovich et al.^{5} were lifted over from mm9 to mm10 and mapped onto the reference peaks using bedtools intersect^{22}. Likewise, the original count matrix was mapped onto the new reference peak set and only cells from the WholeBrain and PreFrontalCortex tissues were retained for the analysis^{5}. The count matrices from the 10X Genomics and Cusanovich 2018 datasets were concatenated, binarized and filtered to ensure that each region was covered in at least 1% of the cells. Regions on the sex chromosomes were removed. The final count matrix contained 18,605 cells (3,880 from 10X and 14,725 from Cusanovich et al.^{5}) and a peak set of size 136,528.
Negative multinomial variational autoencoder
We define the accessibility profile of a given single cell across a set of regions as x = [x_{1}, ..., x_{N}], where x_{i} > 0 reflects accessibility of region i and x_{i} = 0 indicates inaccessibility. In the context of this work, we use binarized accessibility profiles (that is, x_{i} = 0 or x_{i} = 1), although the model is in general also applicable to non-negative integer vectors (for example, sites with multiple reads per cell).
In this section, we first introduce the adaptation of the standard VAE model^{13} without integrated batch correction and turn to the batch-adversarial training strategy in the next section. Briefly, following Kingma and Welling^{13}, the encoder determines the L-dimensional mean μ and variance σ parameters of the approximate Gaussian posterior distribution of the latent feature representation given an accessibility profile x. From the approximate posterior distribution, samples z are drawn, which are in turn used as input for the decoder to reconstruct the N-dimensional accessibility profile^{13}. We use the mean vector μ as the low-dimensional feature representation for the downstream analysis.
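The sampling step above is the standard reparameterization trick of VAEs; a minimal NumPy sketch (function name and signature are ours, not BAVARIA's code):

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: draw z ~ N(mu, diag(sigma^2)) as
    z = mu + sigma * eps with eps ~ N(0, I), so the sample remains a
    differentiable function of the encoder outputs mu and log_var."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps
```

For downstream analysis the text uses the mean vector μ itself rather than a sample.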
We assume that the accessibility profile, x, follows a negative multinomial distribution defined by

$$P(x\mid p,r)=\frac{\Gamma \left(r+{\sum }_{i=1}^{N}{x}_{i}\right)}{\Gamma (r){\prod }_{i=1}^{N}{x}_{i}!}\,{p}_{0}^{r}{\prod }_{i=1}^{N}{p}_{i}^{{x}_{i}}$$
where p = [p_{0}, …, p_{N}] represents the non-negative parameters, which sum to one, and r denotes the positive real-valued dispersion parameter. p_{i} for i > 0 reflects the accessibility profile, while p_{0} is associated with the dispersion.
We construct a decoder that determines the respective parameters p and r. In particular, for i > 0 the decoder computes

$${p}_{i}=\frac{\exp ({a}_{i})}{1+{\sum }_{j=1}^{N}\exp ({a}_{j})}$$

where a_{i} represents the decoder’s output activity at region i for a given latent feature sample z. The remaining probability mass is reserved for the dispersion and is given by

$${p}_{0}=1-{\sum }_{i=1}^{N}{p}_{i}=\frac{1}{1+{\sum }_{j=1}^{N}\exp ({a}_{j})}$$
Here we assume that the dispersion parameter is scalar and does not depend on the accessibility profile x, that is, it is adjusted like a bias term in the network.
Consequently, the reconstruction loss is given by the negative log-likelihood

$${{\mathrm{loss}}}_{{\mathrm{recon}}}=-{{\mathrm{log}}}\,\Gamma \left(r+{\sum }_{i}{x}_{i}\right)+{{\mathrm{log}}}\,\Gamma (r)-r\,{{\mathrm{log}}}\,{p}_{0}-{\sum }_{i=1}^{N}{x}_{i}\,{{\mathrm{log}}}\,{p}_{i}$$

dropping the parameter-independent term \({\sum }_{i}{{\mathrm{log}}}\,{x}_{i}!\), which vanishes for binarized profiles.
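As a minimal NumPy sketch of this reconstruction loss (illustrative, not BAVARIA's implementation; the argument layout is an assumption), with p holding one probability per region and p_{0} = 1 − Σp:

```python
import numpy as np
from scipy.special import gammaln

def negative_multinomial_nll(x, p, r):
    """Negative log-likelihood of the negative multinomial model.
    x: counts per region (binarized in practice), p: region
    probabilities with p0 = 1 - p.sum() reserved for the dispersion,
    r: positive scalar dispersion. The factorial term is included for
    generality; it vanishes for binarized profiles."""
    p0 = 1.0 - p.sum()
    n = x.sum()
    log_lik = (gammaln(r + n) - gammaln(r) - gammaln(x + 1.0).sum()
               + r * np.log(p0) + (x * np.log(p)).sum())
    return -log_lik
```

With a single region, this reduces to the negative binomial log-likelihood, which offers a quick sanity check.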
As described previously^{13}, the Kullback–Leibler divergence is utilized as a regularization loss, denoted as loss_{KL}, which encourages the latent feature representation to be distributed according to \({{{\mathcal{N}}}}(0,I)\). The total loss is given by summing the reconstruction loss and the regularization loss, which is subject to minimization by adapting the model parameters during the model fitting.
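For a diagonal Gaussian posterior, this regularizer has the standard closed form; a NumPy sketch (standard VAE algebra, not code from BAVARIA):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence between the approximate posterior
    N(mu, diag(exp(log_var))) and the N(0, I) prior, summed over the
    latent dimensions (the loss_KL regularizer described above)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
```

The total loss is then simply `loss_recon + kl_to_standard_normal(mu, log_var)`.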
Batch-adversarial training
To facilitate data integration and batch correction, we took inspiration from ref. ^{16} and adapted the variational autoencoder framework described above to enable batch-adversarial training. The goal of this approach is to establish a latent feature representation that captures biologically relevant information about the accessibility profiles (for example, to describe cell types), while at the same time conveying as little information as possible about experimental batch labels.
To facilitate batch-adversarial training, the standard VAE architecture is augmented by batch-discriminator network modules. These subnetworks are stacked on top of the final layer of the encoder for the purpose of predicting the batch label from the latent feature representation. We use batch-discriminator networks with softmax output activation to predict categorical batch labels. Multiple independent batch labels can be used simultaneously with separate softmax output units for each batch. In addition to the batch-discriminator network at the final layer of the encoder (similar to ref. ^{16}), we also add batch discriminators on top of the hidden layers of the encoder (after the first hidden layer and after each residual network block). The batch-discriminator network modules at the initial and intermediate layers allow more emphasis to be put on removing batch-related information early in the network, and are intended to simplify disentangling batch-related information from the biologically relevant signal throughout the entire encoder network.
Apart from the additional batch-discriminator networks, we modified the decoder to take as input the latent feature encoding z as well as the one-hot encoded batch labels. This enables the decoder to recombine the batch information with the (ideally) batch-corrected latent feature representation to compute the reconstruction of the accessibility profiles. This is important, as batch-related information is still used in this way to compute the reconstruction loss.
Finally, we adjust the training objective of the standard VAE as follows: we measure how well batch labels can be predicted from the latent features of the encoder (for example, derived from the hidden or final layer) using the categorical cross-entropy loss

$${{\mathrm{loss}}}_{{\mathrm{batch}}}=-{\sum }_{c}{y}_{c}\,{{\mathrm{log}}}\,{\hat{y}}_{c}$$
where y and \(\hat{y}\) denote the true and predicted batch label. While the parameters associated with the batch discriminators are adapted to minimize loss_{batch}, the parameters associated with the encoder module are adapted according to loss_{BAVARIA} = loss_{recon} + loss_{KL} − loss_{batch}. That is, the encoder seeks to find a latent feature representation that is uninformative for the batch-label classification.
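The sign conventions of the two opposing objectives can be summarized in a small Python sketch (function names are illustrative; in the actual model the two parameter groups are updated against these respective losses):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred):
    """loss_batch: cross-entropy between one-hot batch labels and the
    softmax outputs of a batch discriminator, averaged over cells."""
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def encoder_objective(loss_recon, loss_kl, loss_batch):
    """The encoder/decoder minimize loss_recon + loss_KL - loss_batch,
    while the discriminators separately minimize loss_batch; the
    opposite signs on loss_batch create the adversarial game."""
    return loss_recon + loss_kl - loss_batch
```

Minimizing `encoder_objective` therefore rewards the encoder for latent features that the discriminators cannot classify.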
Model and training hyperparameters
We use the following model architecture for all experiments: for the encoder, we use a feed-forward layer with 512 nodes and rectified linear unit (ReLU) activation, followed by 20 consecutive residual neural network blocks with 512 nodes, which feed into two layers representing the means and variances of the latent features (dimensions listed in Supplementary Table 2). Each residual block is composed of a feed-forward layer with ReLU activation and a feed-forward layer whose output is added to the block’s input before applying the ReLU activation. For the decoder, we use a single feed-forward layer consisting of 16 and 25 neurons in the benchmarking analysis and the data integration use case, respectively.
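A NumPy forward-pass sketch of this encoder architecture (biases, initialization and training are omitted for brevity; the names and the weight layout are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(h, w1, w2):
    """One residual block as described above: feed-forward + ReLU,
    a second feed-forward layer, addition of the block input, then
    a final ReLU."""
    return relu(h + relu(h @ w1) @ w2)

def encoder_forward(x, w_in, blocks, w_mu, w_logvar):
    """First feed-forward layer with ReLU, a stack of residual blocks,
    then two linear heads for the latent means and log-variances."""
    h = relu(x @ w_in)
    for w1, w2 in blocks:
        h = residual_block(h, w1, w2)
    return h @ w_mu, h @ w_logvar  # latent means, log-variances
```

In the paper's configuration the hidden width would be 512 with 20 blocks; the sketch works for any consistent shapes.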
For the batch-adversarial training, batch-discriminator network modules are stacked on top of the intermediate layers of the encoder (after the first layer and after each residual block) as well as on top of the final layer of the encoder. Each discriminator network consists of two layers with 128 neurons and ReLU activation and an output layer with softmax activation.
The models were fitted on 85% of the cells using AMSgrad^{23}. The remaining 15% of cells were used for validation. Additional dataset-specific hyperparameters are listed in Supplementary Table 2.
Ensemble of models and feature extraction
We fitted a BAVARIA model M times, each time starting from random initial weights. Afterwards, we concatenated the mean vectors of the approximate posterior distribution of the latent features μ either across all M individual models or by using a subset of these models, depending on the use case. In the latter case, we sought to remove potentially poor-quality models whose average loss across the dataset exceeded an outlier criterion. Specifically, we removed models if their average loss after training exceeded Q_{75%}(loss) + 1.5 × IQR(loss), where Q_{75%} denotes the 75% quantile of the loss distribution across the M individual models and IQR represents its interquartile range. The latent features of the remaining models were concatenated and considered for the downstream cell clustering analysis.
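The outlier criterion and the concatenation can be sketched as follows (an illustrative re-implementation of the described rule, not BAVARIA's code):

```python
import numpy as np

def select_and_concat(latents, losses):
    """Drop models whose average loss exceeds Q75 + 1.5 * IQR across
    the ensemble, then concatenate the latent means of the remaining
    models along the feature axis.
    latents: list of (n_cells, latent_dim) arrays, one per model.
    losses: average validation loss of each model."""
    losses = np.asarray(losses)
    q25, q75 = np.percentile(losses, [25, 75])
    keep = losses <= q75 + 1.5 * (q75 - q25)  # outlier criterion
    return np.concatenate([z for z, k in zip(latents, keep) if k], axis=1)
```

With M models of latent dimension L, the result has up to M × L columns, which are then used for clustering.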
Benchmark analysis
We adapted a recently published scATAC-seq benchmarking framework^{11}. For all real datasets, we obtained the results of three top-performing tools: cisTopic^{4}, LSI^{5} and SnapATAC^{7}, as previously reported^{11}. The results of the remaining tools from the original benchmarking assessment are omitted here. In addition, we ran LSI on the full Cusanovich 2018 dataset in the same way it was previously run for the Cusanovich 2018 subset^{11}. We fitted BAVARIA models on the filtered count matrices using the hyperparameters described in Supplementary Table 2. After training each individual model, an ensemble model was created by concatenating the latent features of the individual models after excluding outlier models. Latent features of the ensemble were used for the downstream clustering using the benchmarking framework. We ran SCALE^{8} with default parameters using the input count matrices described above (for example, the same matrices that were used for BAVARIA) and extracted the latent features for the downstream clustering analysis. We created arrow files using ArchR^{6} for each dataset based on the BAM files (merged across cells) that are part of the benchmarking framework without additional filtering. For each dataset, iterative LSI was performed as demonstrated in the online tutorial https://www.archrproject.com/articles/Articles/tutorial.html. Subsequently, the reduced dimensionality was evaluated using the benchmarking framework^{11}. We applied PeakVI^{10} on each dataset (with the same filtering criteria as were used for applying BAVARIA) by following the online tutorial https://docs.scvitools.org/en/stable/user_guide/notebooks/PeakVI.html. Similarly, we applied scDEC^{9} by following the online documentation https://github.com/kimmo1019/scDEC. For the parameter --K we used 8, 10, 10, 13 and 13 clusters for the 10X PBMC, Buenrostro 2018, Buenrostro 2018 (bulk peaks), Cusanovich 2018 (subset) and Cusanovich 2018 (full) datasets, respectively.
Moreover, we used 15 latent features for all cases, except for Cusanovich 2018 full where we used 30 latent features.
For the synthetic datasets, we ran cisTopic, LSI and SnapATAC with the same parameters as in the published benchmark analysis, but using larger numbers of cells. We applied SCALE with the parameters '--min_cells 0 --min_peaks 0.0' to ensure that the simulated count matrix is not further subjected to filtering. We applied PeakVI as described above. We used scDEC with 15 latent features and 8 clusters as described above. BAVARIA ensembles were fitted for each synthetic dataset using the hyperparameters listed in Supplementary Table 2.
Following ref. ^{11}, the latent features extracted for each method were subjected to k-means clustering, hierarchical clustering (with Ward linkage) and Louvain clustering. For all datasets with available ground-truth labels, the ARI, AMI and Hom were used to score the agreement of the clustering results with the known cell labels^{11}. For the 10X dataset (without ground truth)^{24}, the residual average Gini index (RAGI) was computed based on a previously reported set of marker and housekeeping genes^{11}.
Subsequently, the maximum clustering score across the clustering algorithms (k-means, hierarchical or Louvain clustering) was reported for each dataset and score (ARI, AMI, Hom).
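A minimal scikit-learn sketch of this evaluation (k-means only; the benchmarking framework additionally runs hierarchical and Louvain clustering and reports the maximum score per metric):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             adjusted_mutual_info_score,
                             homogeneity_score)

def score_latents(z, labels, n_clusters):
    """Cluster the latent features and score the agreement of the
    clustering with ground-truth cell labels via ARI, AMI and Hom."""
    pred = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(z)
    return {"ARI": adjusted_rand_score(labels, pred),
            "AMI": adjusted_mutual_info_score(labels, pred),
            "Hom": homogeneity_score(labels, pred)}
```

All three scores reach 1.0 for a clustering that perfectly recovers the cell labels.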
Comparison of alternative reconstruction losses
We compared several reconstruction loss implementations on Buenrostro 2018 data using the benchmarking framework described above^{11}. Specifically, we used the same variational autoencoder architecture parameters, with the exception of the final layer of the decoder. Assuming the accessibility profiles are derived from a Bernoulli distribution, we used a sigmoid output activation for the decoder in conjunction with a binary cross-entropy loss, \(\mathrm{loss}_\mathrm{Bin}=-{\sum }_{i}\left({x}_{i}\,{{\mathrm{log}}}\,({p}_{i}(x))+(1-{x}_{i})\,{{\mathrm{log}}}\,(1-{p}_{i}(x))\right)\). Alternatively, assuming multinomially distributed accessibility profiles, we used a softmax output activation in conjunction with the negative log-likelihood of the multinomial distribution, \(\mathrm{loss}_\mathrm{Mul}=-{\sum }_{i}{x}_{i}\,{{\mathrm{log}}}\,{p}_{i}(x)\).
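For reference, the two alternative losses can be written directly in NumPy (a sketch; here `p` stands for the decoder output p_i(x) evaluated for a given profile):

```python
import numpy as np

def loss_bin(x, p):
    """Binary cross-entropy: regions modelled as independent Bernoulli
    variables with accessibility probabilities p."""
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def loss_mul(x, p):
    """Multinomial negative log-likelihood (up to the count-dependent
    normalizing constant): p sums to one across regions."""
    return -np.sum(x * np.log(p))
```

Note that `loss_bin` penalizes each region separately, whereas `loss_mul` treats the profile as a whole, as discussed in the main text.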
Data integration of mouse brain cells
We used BAVARIA with a 15-dimensional latent feature representation and 25 neurons for the hidden layer of the decoder. We trained an ensemble of 10 individual models for 200 epochs using a batch size of 64 (Supplementary Table 2). In addition, each individual model was fitted on a random 50% subset of the original peak set, which sped up training while maintaining similar clustering quality (data not shown).
We compared BAVARIA to several tools that facilitate batch correction. We fitted scVI^{15} with default parameters. We trained trVAE^{25} by following the tutorial code (https://nbviewer.jupyter.org/github/theislab/trVAE/blob/master/examples/trVAE_Haber.ipynb) using the unnormalized count matrix (136,528 regions by 18,605 cells). We fitted SAUCIE^{26} with default parameters. We fitted PeakVI^{10} with default parameters. We used read-depth-normalized (with a target of 10,000 reads per cell) and log(x + 1)-transformed signals, determined 50 principal components and integrated the datasets with Harmony^{27} using the scanpy interface^{28}. Integration with Liger^{29} was performed by following the tutorial code https://github.com/welch-lab/liger/blob/master/vignettes/Integrating_multi_scRNA_data.html. Integration with Seurat v3 based on canonical correlation analysis (CCA)^{30} was performed by following the tutorial code https://satijalab.org/seurat/articles/integration_introduction.html using all available features.
As baselines, we ran simplified versions of the BAVARIA architecture: (1) we adjusted BAVARIA to a conditional VAE model, which receives the batch labels as input to the encoder (along with the raw read profile) rather than predicting the batch labels from the latent features and hidden layers; and (2) we adjusted BAVARIA to use only a single batch-discriminator network module at the final layer of the encoder (along the lines of Ganin et al. 2016^{16}). These networks were trained with the same network parameters and hyperparameters as BAVARIA.
The UMAP embedding was computed from the latent features of each method using scanpy^{28}. For the UMAP visualization, cells previously annotated as astrocytes, cerebellar granule cells, endothelial II cells, Ex. neurons CPN, Ex. neurons CThPN, Ex. neurons SCPN, inhibitory neurons, microglia, oligodendrocytes, Purkinje cells, SOM+ interneurons and unknown cells in Cusanovich et al. 2018, as well as all 10X Genomics cells^{17}, were illustrated.
We determined kBET scores with the kBET R package^{31} (parameters k0 = 10 and n_repeat = 500) to measure the mixing of cells across batches.
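kBET compares, for each cell's k nearest neighbours, the local batch composition against the global batch frequencies. A simplified sketch of the underlying Pearson chi-squared statistic follows; the package itself adds significance testing and repeated subsampling, and the function name here is ours:

```python
from collections import Counter

def chi2_local_mixing(neighbor_batches, global_freqs, k):
    """Pearson chi-squared statistic comparing the batch labels of one
    cell's k-nearest-neighbour set against the global batch frequencies.
    Small values indicate good local mixing of batches."""
    observed = Counter(neighbor_batches)
    stat = 0.0
    for batch, freq in global_freqs.items():
        expected = k * freq
        stat += (observed.get(batch, 0) - expected) ** 2 / expected
    return stat
```

A well-integrated embedding yields neighbourhoods whose batch composition mirrors the global proportions, so the statistic stays near zero.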
Clustering and differential accessibility analysis of mouse brain data
Using the BAVARIA-derived latent features, we performed Louvain clustering using scanpy^{28}. Cluster-specific accessibility was determined by using a generalized linear model and a binomial distribution with logit link,

logit(p_{rcb}) = α_{r} + β_{cb} + δ_{rc} + f_{rb},

where α_{r} denotes the region-specific offset for region r, β_{cb} denotes the cluster- and batch-specific offset for cluster c and batch b, δ_{rc} denotes the cluster-specific accessibility for region r and cluster c, and f_{rb} denotes the batch-specific accessibility for region r and batch b.
After fitting the linear models, we identified the 100 most accessible regions per cluster by ranking δ_{rc} and visualized the associated raw accessibility profiles in a heatmap using scanpy^{28}.
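The ranking step can be sketched as follows, assuming the fitted δ_{rc} values are available per cluster. `top_regions_per_cluster` is a hypothetical helper; the GLM fitting itself would be done with a statistics library:

```python
def top_regions_per_cluster(delta, n_top=100):
    """Given a dict mapping cluster -> {region: delta_rc}, return for each
    cluster the n_top regions with the largest cluster-specific
    accessibility coefficient delta_rc."""
    return {c: sorted(scores, key=scores.get, reverse=True)[:n_top]
            for c, scores in delta.items()}
```

Ranking on δ_{rc} rather than raw counts means the selected regions are already adjusted for the region-, cluster- and batch-specific offsets in the model.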
For the pseudobulk visualization, we remapped reads from the Cusanovich et al.^{5} WholeBrain and PreFrontalCortex tissues to mm10 using bowtie2^{32} with the parameters '--very-sensitive -X 2000 -3 1'. Cluster-specific pseudobulk BAM files were constructed by splitting the reads according to the barcodes associated with each cluster. These BAM files were converted to bigWig tracks using 'bamCoverage --normalizeUsing CPM' from deepTools^{33}. Finally, pyGenomeTracks^{34} was used to visualize the cluster-specific accessibility tracks.
Data availability
All datasets used for the benchmark analysis were obtained from the computational scATAC-seq benchmarking framework at https://github.com/pinellolab/scATAC-benchmarking. This includes the publicly available scATAC-seq 10X PBMC 5k dataset^{24}, the haematopoiesis dataset^{19} and the adult mouse dataset^{5}, as well as the bone marrow^{20} and erythropoiesis^{21} datasets from which the simulated scATAC-seq datasets were derived. For the brain data integration, we additionally obtained mouse brain cells from Cusanovich et al.^{5} (https://atlas.gs.washington.edu/mouse-atac/data/) and from the 10X Genomics website (https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_v1_adult_brain_fresh_5k).
Code availability
BAVARIA is available via GitHub under a GPLv3 licence at https://github.com/BIMSBbioinfo/bavaria^{35}.
References
Zamanighomi, M. et al. Unsupervised clustering and epigenetic classification of single cells. Nat. Commun. 9, 2410 (2018).
Schep, A. N., Wu, B., Buenrostro, J. D. & Greenleaf, W. J. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 975–978 (2017).
Baker, S. M., Rogerson, C., Hayes, A., Sharrocks, A. D. & Rattray, M. Classifying cells with Scasat, a single-cell ATAC-seq analysis tool. Nucleic Acids Res. 47, 10 (2019).
González-Blas, C. B. et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods 16, 397–400 (2019).
Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324 (2018).
Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).
Fang, R. et al. Comprehensive analysis of single-cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).
Xiong, L. et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun. 10, 4576 (2019).
Liu, Q., Chen, S., Jiang, R. & Wong, W. H. Simultaneous deep generative modelling and clustering of single-cell genomic data. Nat. Mach. Intell. 3, 536–544 (2021).
Ashuach, T., Reidenbach, D. A., Gayoso, A. & Yosef, N. PeakVI: a deep generative model for single-cell chromatin accessibility analysis. Preprint at bioRxiv https://doi.org/10.1101/2021.04.29.442020 (2021).
Chen, H. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 20, 241 (2019).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations (2014).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Ganin, Y. et al. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, 2096–2030 (2016).
Fresh cortex from adult mouse brain (P50). 10X Genomics https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_v1_adult_brain_fresh_5k (2019).
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems (TensorFlow, 2015); https://www.tensorflow.org/
Buenrostro, J. D. et al. Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell 173, 1535–1548 (2018).
Ulirsch, J. C. et al. Interrogation of human hematopoiesis at single-cell and single-variant resolution. Nat. Genet. 51, 683–693 (2019).
Ludwig, L. S. et al. Transcriptional states and chromatin accessibility underlying human erythropoiesis. Cell Rep. 27, 3228–3240 (2019).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Reddi, S. J., Kale, S. & Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations (2018).
5k peripheral blood mononuclear cells (PBMCs) from a healthy donor. 10X Genomics https://support.10xgenomics.com/single-cell-atac/datasets/1.0.1/atac_v1_pbmc_5k (2018).
Lotfollahi, M., Naghipourfar, M., Theis, F. J. & Wolf, F. A. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics 36, 610–617 (2020).
Amodio, M. et al. Exploring single-cell data with deep multitasking neural networks. Nat. Methods 16, 1139–1145 (2019).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A. & Manke, T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 42, 187–191 (2014).
LopezDelisle, L. et al. pyGenomeTracks: reproducible plots for multivariate genomic datasets. Bioinformatics 37, 422–423 (2020).
Kopp, W., Akalin, A. & Ohler, U. BAVARIA source code v0.1. Zenodo https://doi.org/10.5281/zenodo.5791250 (2021).
Acknowledgements
We thank P. Rautenstrauch, J. Ronen and R. Monti for valuable comments on the manuscript. This work was supported by the German Federal Ministry of Education and Research (de.NBI; FKZ 031L0101B) and by the Helmholtz Association (sparse2big; ZT-I-0007).
Funding
Open access funding provided by MaxDelbrückCentrum für Molekulare Medizin in der HelmholtzGemeinschaft (MDC).
Author information
Authors and Affiliations
Contributions
W.K. designed and implemented BAVARIA and performed the analysis with input from U.O. A.A. provided resources dedicated to the project. W.K. and U.O. wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Bo Li and Wing Wong for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Comparison of reconstruction loss measures.
The suitability of different reconstruction loss measures was assessed by fitting thirty individual models on the Buenrostro et al. 2018 dataset. The total loss across the dataset was determined for each model, and models with poor outlier losses were excluded (for example, due to poor local minima; see Methods), leaving 28, 29 and 29 models for the binary, multinomial and negative multinomial loss, respectively. The x-axis represents the different reconstruction losses: binary cross-entropy loss (Binary), negative log-likelihood of the multinomial distribution (Multinomial) and negative log-likelihood of the negative multinomial distribution (Neg. multinomial). Otherwise, the model architecture remained the same. Latent features were subjected to clustering using k-means, hierarchical clustering and Louvain clustering, and clustering performance was computed based on adjusted mutual information (AMI), adjusted Rand index (ARI) and homogeneity (Hom) against ground-truth cell labels. The best score across the clustering algorithms was considered. Boxes represent quartiles Q1 (25% quantile), Q2 (median) and Q3 (75% quantile); whiskers comprise data points within 1.5 × IQR (interquartile range) of the boxes.
Extended Data Fig. 2 Combining latent features of separately trained models.
Three ensembles consisting of ten VAE models each were fitted on the Buenrostro et al. 2018 dataset. The total loss across the dataset was determined for each model, and models with poor outlier losses were excluded from the ensemble (for example, due to poor local minima; see Methods). Individual models (BAVARIA - individual) were combined into ensembles by concatenating the latent features (BAVARIA - ensemble). Latent features were subjected to clustering using several algorithms, and clustering performances were computed based on adjusted mutual information (AMI), adjusted Rand index (ARI) and homogeneity (Hom). The best score across the clustering algorithms is considered.
Extended Data Fig. 3 Cell type characterization assessment using synthetic data.
A) Bone marrow data using 5000, 2500, 1000, 500 and 250 fragments per cell. B) Bone marrow data using 0%, 20% and 40% additional noise. C) Erythropoiesis data using 5000, 2500, 1000, 500 and 250 fragments per cell. D) Erythropoiesis data using 0%, 20% and 40% additional noise. Low-dimensional feature representations were obtained using cisTopic, LSI, SnapATAC, SCALE and BAVARIA and subjected to clustering using different algorithms (k-means, hierarchical clustering, Louvain clustering). Clustering performances were evaluated using adjusted mutual information (AMI), adjusted Rand index (ARI) and homogeneity (Hom) against ground-truth cell labels (see Methods). The best score across clustering algorithms is shown. cisTopic, LSI, SnapATAC and SCALE were run once per case, while N = 3 ensembles of NMVAE were trained from scratch to assess the variability of the performance. The dot represents the mean performance and the error bars indicate +/- s.e.m. across the repetitions.
Extended Data Fig. 4 Batch correction: comparison of architectures.
A) UMAP embedding illustrating cells from 10X Genomics and Cusanovich et al. 2018 after applying a BAVARIA variant with a single batch-discriminator network module at the final encoder layer (BAVARIA - final layer only) and a conditional variational autoencoder variant of BAVARIA that receives the batch labels as input to the encoder's initial layer (batch conditional). The kBET scores indicate the cell mixing across batches; lower kBET scores correspond to better mixing. B) UMAP embedding illustrating previously characterized cell types (astrocytes, cerebellar granule cells, endothelial II cells, Ex. neurons CPN, Ex. neurons CThPN, Ex. neurons SCPN, inhibitory neurons, microglia, oligodendrocytes, Purkinje cells, SOM+ interneurons and unknown cells; Cusanovich et al. 2018). 10X Genomics cells are labelled 'Unknown'.
Extended Data Fig. 5 Clustering and clusterassociated regions.
A) Clustering of the integrated 10x and Cusanovich et al. 2018 datasets. B) Number of cells per cluster and batch. C) Illustration of clusterassociated accessibility using the 100 top accessible regions per cluster. D) Depth normalized accessibility tracks per cluster for the 10X dataset for several marker regions. E) Depth normalized accessibility tracks per cluster for the Cusanovich et al. 2018 dataset for several marker regions.
Supplementary information
Supplementary Information
Supplementary Tables 1 and 2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kopp, W., Akalin, A. & Ohler, U. Simultaneous dimensionality reduction and integration for single-cell ATAC-seq data using deep learning. Nat. Mach. Intell. 4, 162–168 (2022). https://doi.org/10.1038/s42256-022-00443-1
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s42256-022-00443-1