EpiScanpy: integrated single-cell epigenomic analysis

Epigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics, however single-cell-omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell brain mouse atlases of DNA methylation, ATAC-seq and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels by orthogonal epigenetic information.

individual cells using single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-35 seq) 3 . Thanks to dropping sequencing costs, well described protocols and advances in microfluidics 36 techniques, current experimental designs afford to interrogate the epigenome of thousands of cells at the 37 time [4][5][6][7] . These data represent a rich layer of regulatory information that stands between the genome and 38 the transcriptome, and new analysis methods are needed to leverage it 8 . 39 While many methods for analyzing single-cell transcriptomics data have been developed recently 8 , this is 40 much more limited for scATAC-seq data 9,10 and single-cell DNA methylation data 11 , or for the joint analysis 41 of multiple -omics data types 8 . With the current speed at which single-cell methylome and open chromatin 42 datasets are being generated, an analysis tool that goes beyond custom-made scripts and that permits 43 dealing with different -omics data types in the same framework is needed. Here we present epiScanpy, a 44 method for the analysis of scATAC-seq and single-cell DNA methylation data, which integrates into the 45 scanpy platform for single-cell transcriptomics data analysis 12 . EpiScanpy enables preprocessing of 46 epigenomics data as well as downstream analyses such as clustering, manifold learning, visualization and 47 lineage estimation. EpiScanpy allows for comparative analyses between -omics layers, and can serve as a 48 framework for future single-cell multi-omics data integration. Since its downstream analyses extend the 49 popular scanpy framework, it inherits properties such as fast and scalable runtime behavior and modular 50 extensibility. 51

RESULTS: 53
EpiScanpy workflow: 54 Workflows based on epiScanpy consist of four stages: Feature space engineering, data pre-processing, 55 assessment of global heterogeneity via embeddings and clusterings, and feature-level analysis to attribute 56 drivers of heterogeneity (Fig. 1a). The input of epiScanpy consists of .bam files for scATAC-seq or 57 methylation count files for single-cell DNA methylation. 58 In the feature space engineering step, epiScanpy generates count matrices based on open chromatin 59 levels or individual cytosine methylation levels, summarized over different sets of genomic regions (Fig.  60 1b). These count matrices serve as feature space that retains as much variation of the data as possible 61 without being too high-dimensional -a feature space at single base-pair resolution can in principle be 62 assembled but would impede downstream analysis through memory and run time issues as well as though 63 data sparsity. These genomic regions can cover the entire genome (i.e. windows) or can be based on 64 genomic features such as known open chromatin peaks, gene promoters, gene bodies or enhancers 65 (suppl. methods). Any other feature space of interest, such as for example cis-regulatory topics 10 , can also 66 be used. For scATAC-seq data, the count matrix is binarized to account for presence or absence of reads 67 at every peak or feature, library size is regressed out and low quality single cells are filtered out (suppl. 68 methods, Fig. SI1-2). For DNA methylation data, the CG methylation level per genomic region is computed, 69 and features with too few covered cytosines are labelled as missing data (suppl. methods, Fig. SI3). 70 Optionally, CH methylation can also be used for computing count matrices, but methylation in this context 71 is only present in a limited number of mammalian tissues. 72 In bisulfite sequencing, it is necessary to differentiate non-methylated cytosines (zero signal) and non-73 observed cytosines (missing signal). Accordingly, we propose the usage of imputation methods for non-74 observed cytosines. Note that this is different to imputing zeros in single-cell RNA-seq, which are not 75 inherently non-observed data points, but may also be zero count observations. EpiScanpy imputes 76 shared peaks are considered (usually ~20,000 peaks) (Fig. SI1-3). The constructed epigenetic data matrix 91 is stored as an instance of the anndata class, a flexible data structure to store large annotated count 92 matrices introduced in scanpy 12 . EpiScanpy allows for joint storage of multiple -omic modalities, allowing 93 easy comparison between conditions and offering integrated and easy to use workflows for different 94 types of single-cell data. 95 Given a processed data matrix, epiScanpy's unsupervised learning algorithms can be used to uncover 96 heterogeneity in the data, such as clusters, trajectories or lineage trees. We implemented a cell-cell 97 distance metric based on epigenetic features to enable common algorithms that rely on a k-nearest 98 neighbor (kNN) graph, such as Louvain clustering 13 , diffusion pseudotime 14 and UMAP 15 . These algorithms 99 and other unsupervised algorithms, such as tSNE 16 and graph abstraction 17 , can directly be called via the 100 interface to scanpy (Fig. 1a and suppl. methods). Note that at this point, epiScanpy has created an abstract 101 representation of the data in the form of a transformed feature space or a kNN graph which can be treated in a similar fashion to single-cell RNA-seq data sets. This representation is independent of the original data 103 form (methylation or chromatin accessibility) so that the workflows presented here truly generalize across 104 data modalities. 105 Lastly, feature-level analysis dissects the drivers of heterogeneity in a data set: epiScanpy includes a 106 differential methylation and differential open chromatin calling strategy (suppl. methods), which enables 107 the ranking of genomic features (such as genes, promoters or other regulatory elements) based on their 108 relevance in the discovered cellular identities (Fig. 1a). This allows for the identification of marker loci that 109 can be used for a fast semi-automated cell-type identification (Fig. 1a). This feature-level analysis allows 110 the user to correlate variation along trajectories or across clusters with marker loci to support cell type 111 annotation and to generate hypotheses on the mechanism that underlie the identified population 112 structure. 113

Applications: 114
To illustrate the potential of epiScanpy and to show how it can effectively deal with different data 115 modalities, we applied it to brain mouse atlases from three different -omic data types: single-cell DNA 116 methylation (snmC-seq, 3,377 prefrontal cortex neurons, 4.7% average genomic coverage 4

), single-cell 117
open chromatin (scATAC-seq, ~13,000 prefrontal cortex and whole brain cells, median coverage range 118 ~8,000 -24,000 reads per cell 6 ) and single-cell gene expression (Drop-seq, ~690,000 cells, 9 regions of the 119 adult mouse brain 18 ). 120 Firstly, we explored the impact of the choice of genomic feature on the global topology ("structure") that 121 can be learned from the data, using clustering as an example method for unsupervised learning. Count 122 matrices were constructed for different types of genomic features for single-cell DNA methylation and 123 scATAC-seq data (respectively: 100kb non-overlapping windows, gene promoters, gene bodies and 124 enhancers (from 19 ); and open chromatin peaks (from 6 ) and enhancers). We performed iterative Louvain 125 clustering (suppl. methods) on each feature space and found that cells are grouped similarly across all feature spaces used, illustrating the fact that different genomic features contain partially redundant 127 information and can be used interchangeably (Fig. 2a-c, Fig. SI4-5). For single-cell DNA methylation data, 128 the enhancer feature space provided the clearest cell-type separation in clustering and low dimensional 129 visualization (average silhouette score of 0.44 (enhancers) versus 0.36 (promoters), Fig.2c and Fig.2a,  130 suppl. methods), highlighting the relevance of DNA methylation at non-genic regulatory elements at 131 determining cell identity. 132 To evaluate epiScanpy's ability to map the discovered population structure in the form of clusterings to 133 known cell types, we ranked differentially methylated and differentially open loci between the identified 134 clusters to map cluster identity to cell types (suppl. methods). Neurod2 was identified as one of the top 135 differentially methylated promoters between inhibitory and excitatory neurons (Fig. 2d), which correlates 136 with its expression levels in the adult brain 20 . 4930567H17Rik and Satb2 could be used to distinguish 137 between the different neuronal layers 20 and between SCPN and CPN neurons 21 , respectively (Fig. SI6). 138 These observations based in CG promoter methylation are consistent with CH gene body methylation at 139 known marker genes (Fig. SI7). Interestingly, we identify several differentially methylated promoters of 140 genes which are not differentially expressed in the adult mouse brain 20 but whose differential expression 141 during embryonic development is necessary for cell fate determination, such as Rab4a, a marker of SST 142 neurons expressed during E12.5 -E14.5 22 (Fig. SI6). These findings reflect the unique ability of DNA 143 methylation data to record past cellular states 1 and therefore add valuable information about 144 differentiation and lineage trees to models based on transcriptomics. This integration of complementary 145 layers of information highlight the potential of multi-omics approaches to build a more complete picture 146 of developmental systems. 147 For scATAC-seq, we identified top differentially open peaks which were used to label cell clusters (Fig. 2b). 148 For example, openness of the Ndrg2 promoter can be used to distinguish astrocytes 23 (Fig. SI8) and 149 microglia and oligodendrocytes are identified by open peaks in the promoters of Runx1, and Efnb3 150 respectively 24,25 (Fig. SI8). As a whole group, neurons show openness of peaks in promoters for neuronal 151 genes like Ptprd, Pik3r1, and Syt1 26-28 (Fig. SI8), while differential openness at the Foxp2 promoter can be 152 used to identify Layer 6 cortical neurons (Fig. 2e), for example. A comprehensive list of differential markers 153 used for single-cell DNA methylation and scATAC-seq cluster identification can be found in SI Table 1. regulatory site accessibility) 6 . Respectively ~89% and ~71% of cells are assigned to the same cell type as 166 in the original publications (Fig. SI9-10). For the scATAC-seq dataset the biggest discrepancy is found 167 between SCPN/CPN assignments, where we identify clusters with SCPN signatures that were labelled as 168 CPN neurons in the original publication, and vice versa (Fig. SI11). 169 We performed a global comparison of multi-omic cellular atlases based on mouse brain tissue from single-170 cell DNA methylation, scATAC-seq and scRNA-seq data (processed using scanpy, suppl. methods). While 171 some markers are differentially expressed, differentially open and differentially methylated between 172 clusters (Fig. SI12), there is also a large number of non-redundant markers, such as that of Fabp7. Fabp7 173 is a brain fatty acid binding protein that has been reported to be important for forebrain physiology and 174 is associated with Schizophrenia 29 , which displays signs of differential regulation in CPN neurons 175 (differentially open and methylated) but is not expressed in neurons (Fig. SI12). These markers provide 176 complementary information between data modalities, underpinning the fact that every -omic layer 177 contributes its individual non-redundant layer of information, and emphasizing the need for a tool that 178 deals with many -omic data types and facilitates integration across modalities. 179 Finally, we also considered open chromatin profiles of hematopoietic cells (bone marrow cell types from 6 ) 180 to evaluate whether epiScanpy can learn developmental trajectories with pseudotime and more complex 181 lineage trees with graph abstraction directly based on epigenomic profiles (Fig. 2f). Such continuous 182 descriptions of developmental systems have been very useful in studies based on single-cell 183 transcriptomics. EpiScanpy discovers 7 cell types (Fig. SI13) and recovers the known hematopoietic 184 differentiation tree (Fig. 2f). 185

DISCUSSION: 188
In summary, epiScanpy is a fast and versatile tool for the analysis of single-cell epigenomic data and its 189 integration with single-cell transcriptomic data. It offers the first unified framework for the analysis of 190 both single-cell DNA methylation, scATAC-seq and single-cell transcriptomic data, and its flexible data 191 structure is ready to handle other new types of single-cell -omic data, such as Hi-C or NOME-seq, as well 192 as multi-omics single-cell data. EpiScanpy addresses the open question of feature space construction on 193 epigenetic data and we show evidence that similar manifolds can be learned based on different feature 194 spaces. EpiScanpy also scales well to the large scATAC-seq data sets generated with the 10x Chromium 195 platform (Fig. SI14) 30 . EpiScanpy performs single-cell graph construction from potentially any type of 196 single-cell -omics data and performs downstream analysis like low-dimensional data visualization, 197 clustering, single-cell graph abstraction or trajectory inference, and differential calling. EpiScanpy is 198 available as a python package through Github (https://github.com/colomemaria/epiScanpy, 199 documentation available on episcanpy.readthedocs.io) and builds upon the scanpy analysis toolbox 12 , 200 opening the toolchain to the commonly measured single-cell epigenomic data. 201 202