A rapid and robust method for single cell chromatin accessibility profiling

The assay for transposase-accessible chromatin using sequencing (ATAC-seq) is widely used to identify regulatory regions throughout the genome. However, very few studies have been performed at the single cell level (scATAC-seq) due to technical challenges. Here we developed a simple and robust plate-based scATAC-seq method, combining upfront bulk Tn5 tagging with single-nuclei sorting. We demonstrate that our method works robustly across various systems, including fresh and cryopreserved cells from primary tissues. By profiling over 3000 splenocytes, we identify distinct immune cell types and reveal cell type-specific regulatory regions and related transcription factors.

Due to its simplicity and sensitivity, ATAC-seq 1 has been widely used to map open chromatin regions across different cell types in bulk. Recent technical developments have allowed chromatin accessibility profiling at the single cell level (scATAC-seq) and revealed distinct regulatory modules across different cell types within heterogeneous samples [2][3][4][5][6] . In these approaches, single cells are first captured within a microfluidic device, followed by independent tagmentation of each cell 3 . Alternatively, a combinatorial indexing strategy has been reported to perform the assay without single cell isolation 2,4 . However, these approaches either require a specially engineered and expensive microfluidic device, such as a Fluidigm C1 3 , or a large quantity of customly modified Tn5 transposase 2,4,5 .
Here, we overcome these limitations by performing upfront tagmentation in the bulk cell population, prior to single nuclei isolation. It has been previously demonstrated that Tn5 transposase-mediated tagmentation does not fragment the DNA until its release using heat or denaturing agents, such as sodium dodecyl sulfate (SDS) [7][8][9] . Therefore, we reasoned that tagmentation itself would not disrupt nuclei in an ATAC-seq experiment. Based on this idea, we developed a simple, robust and flexible plate-based scATAC-seq protocol, performing a tagmentation reaction 6,10 on a pool of cells (5,000 -50,000) followed by sorting individual nuclei into plates containing lysis buffer.
Library indexing and amplification are done by PCR, followed by sample pooling, purification and sequencing. The whole procedure takes place in one single plate, without any intermediate purification or plate transfer steps ( Fig. 1a ). With this easy and quick workflow, it only takes a few hours to prepare sequencing-ready libraries, and the method can be implemented by any laboratory using standard equipment.
We validated our method by generating the chromatin accessibility profiles of 3,648 splenocytes (after red blood cell removal) from two C57BL/6Jax mice. The aggregated scATAC-seq profiles exhibited good coverage and signal and resembled the bulk data generated from 10,000 cells by the Immunological Genome Project (ImmGen) 11 ( Fig. 1b ). The library fragment size distribution before and after sequencing both displayed clear nucleosome banding patterns ( Fig. 1c and Supplementary Fig. 1a ), which is a feature of high quality ATAC-seq libraries 1 . In addition, sequencing reads showed strong enrichment around transcriptional start sites (TSS) ( Fig. 1d ), further demonstrating the success of the method.
Importantly, for the majority of the cells, less than 10% (median 2.1%) of the reads were mapped to the mitochondrial genome ( Supplementary Fig. 2a ). Overall, we obtained a median of 643,734 reads per cell, while negative controls (empty wells) generated only~100-1,000 reads ( Supplementary Fig. 2b ). In most cells, more than 98% of the reads were mapped to the mouse genome ( Supplementary Fig. 2b ), indicating low level of contamination. After deduplication, we obtained a median of 30,727 unique fragments per cell that mapped to the nuclear genome ( Supplementary Fig. 2c ). At the sequencing depth of this experiment, the duplication rate of each single cell library is~95% ( Supplementary Fig. 2d ), indicating that the libraries were sequenced to near saturation. Downsampling analysis suggests that at 20 -30% of our current sequencing depth, the majority of the fragments would have already been captured ( Supplementary Fig. 3a and b ). Therefore, in a typical scATAC-seq experiment,~120,000 reads per cell are sufficient to capture most of the unique fragments, with higher sequencing depth still increasing the number of detected unique fragments ( Supplementary Fig. 2e ).
To compare our data to the published scATAC-seq studies done by Fluidigm C1 3,6 , we downsampled the reads from our study to a similar depth (Online Methods) as the published experiments and quantified the median unique fragments, mapping rate and fraction of reads in peaks (FRiP). Our method performs at a similar or higher level with respect to these measures compared to the published studies ( Fig. 1f and Supplementary Fig. 4a and b ). This validates the technical robustness of our rapid protocol.
Next, we examined the data to analyse signatures of different cell types in the mouse spleen. Reads from all cells were merged, and more than 78,000 open chromatin regions were identified by peak calling 12 (Online Methods). We binarised peaks as "open" or "closed" (Online Methods) and applied a Latent Semantic Indexing (LSI) analysis to the cell-peak matrix for dimensionality reduction 2 (Online Methods). Consistent with previous findings 2 , the first dimension is primarily influenced by sequencing depth ( Supplementary Fig. 2f ). Therefore, we only focused on the second dimension and upwards and visualised the data by t-distributed stochastic neighbour embedding (t-SNE) 13 . We did not observe batch effects from the two profiled spleens, and several distinct populations of cells were clearly identified in the t-SNE plot ( Fig. 1e ). Read counts in peaks near key marker genes ( e.g. Bcl11a and Bcl11b ) suggested that the major populations are B and T lymphocytes, as expected in this tissue ( Fig. 2a ). In addition, we found a small number of antigen-presenting cell populations ( Supplementary Fig. 5 ), consistent with previous analyses of mouse spleen cell composition 14 .
To systematically interrogate various cell populations captured in our experiments, we applied a spectral clustering technique 15 which revealed 12 different cell clusters ( Fig. 2b ). Reads from cells within the same cluster were merged together to form 'pseudo-bulk' samples and compared to the bulk ATAC-seq data sets generated by ImmGen ( Supplementary Fig. 6 and 7 ). Cell clusters were assigned to the most similar ImmGen cell type ( Fig. 2b and Supplementary Fig. 7 ). In this way, we identified most clusters as different subtypes of B, T and Natural Killer (NK) cells, as well as a small population of granulocytes (GN), dendritic cells (DC) and macrophages (MF) ( Fig. 2a and Supplementary Table 1 ). An aggregate of all single cells within the same predicted cell type agrees well with the ImmGen bulk ATAC-seq profiles ( Supplementary Fig. 8). Remarkably, the aggregate of as few as 55 cells ( e.g. the predicted MF cell cluster) already exhibited typical bulk ATAC-seq profiles ( Supplementary Fig. 8 ). This finding opens the door for a novel ATAC-seq experimental design, where tagmentation can be performed upfront on large populations of cells (5,000 -50,000). Subsequently, cells of interest (for example, marked by surface protein antibodies or fluorescent RNA/DNA probes) can be isolated by FACS, and libraries generated for subsets of cells only. This will be a simple and fast way of obtaining scATAC-seq profiles for rare cell populations.
Importantly, the spectral clustering was able to distinguish different cell subtypes, such as naïve and memory CD8 T cells, naïve and regulatory CD4 T cells and CD27+ and CD27-NK cells ( Fig. 2a ). Previous studies have identified many enhancers that are only accessible in certain cell subtypes, and these are robustly identified in our data. Examples are the Ilr2b and Cd44 loci in memory CD8 T cells 16 and Ikzf2 and Foxp3 in regulatory T cells 17 ( Supplementary Fig. 9a and b ). When examining the TCR alpha/delta gene loci, only T cell clusters exhibited many open chromatin peaks around the TCR alpha variable loci ( Fig. 2c ). In addition, all T cell clusters, and not others, share a unique signature of broad ATAC-seq peaks, spreading across the entire TCR alpha constant locus ( Fig. 2c ). In sharp contrast to the T cell clusters, the two NK cell clusters displayed the opposite patterns, where ATAC-seq peaks only appeared around TCR delta variable and constant loci ( Fig. 2c ). This is consistent with the notion that Tcrd gene is expressed in NK cells 18 . The unique pattern around the TCR loci suggests that scATAC-seq might be used for TCR studies at the single cell level.
Interestingly, our clustering approach successfully identified two subtle subtypes of NK cells (CD27-and CD27+ NK cells), as determined by their open chromatin profiles ( Fig. 2b and d ). It has been shown that, upon activation, NK cells can express CD83 19 , a well-known marker for mature dendritic cells 20 . In mouse spleen, Cd83 expression was barely detectable in the two NK subpopulations profiled by the ImmGen consortium ( Supplementary Fig. 9c). However, in our data, the Cd83 locus exhibited different open chromatin states in the two NK clusters ( Fig. 2d ). Multiple ATAC-seq peaks were observed around the Cd83 locus in the CD27+ NK cell cluster but not in the CD27-NK cluster ( Fig. 2d ). This suggests that Cd83 is in a transcriptionally permissive state in the Cd27+ NK cells, and the CD27+ NK cells have a greater potential for rapidly producing CD83 upon activation. This may partly explain the functional differences between CD27+ and CD27-NK cell states 21 .
Finally, we investigated whether we could identify the regulatory regions that define each cell cluster. We trained a logistic regression classifier using the spectral clustering labels and the binarised scATAC-seq count data (Online Methods). From the classifier, we extracted the top 500 open chromatin peaks (marker peaks) that can distinguish each cell cluster from the others ( Fig. 2e and Online Methods). By looking at genes in the vicinity of the top 50 marker peaks, we recapitulated known markers, such as Cd4 for the helper T cell cluster (cluster 3), Cd8a and Cd8b1 for the cytotoxic T cell cluster (cluster 6) and Cd9 for marginal zone B cell cluster (cluster 4) ( Supplementary Figure 10 and Supplementary Table 2 ). These results are consistent with our correlation-based cell cluster annotation ( Fig. 2a ).
While the peaks at TSS are useful for cell type annotation, the majority of the cluster-specific marker peaks are in intronic and distal intergenic regions ( Supplementary Fig. 11 ). This indicates that many open chromatin regions defining cell clusters are putative enhancers, which is consistent with previous findings that enhancers are essential for cell-type specificity 6,22 . Therefore, we analysed these non-coding peaks in more detail by motif enrichment analysis using HOMER 23 . As expected, different ETS motifs and ETS-IRF composite motifs were significantly enriched in marker peaks of many clusters ( Fig. 2f ), consistent with the notion that ETS and IRF transcription factors are important for regulating immune activities 24 . Furthermore, we found motifs that were specifically enriched in certain cell clusters ( Fig. 2f ). Our motif discovery is consistent with previous findings, such as the importance of T-box ( e.g. Tbx21) motifs in NK 25 and CD8T memory cells 26 and POU domain ( e.g. Pou2f2) motifs in marginal zone B cell 27 . This suggests that our scATAC-seq data is able to identify known gene regulation principles in different cell types within a tissue.
In recent years, other methods, such as DNase-seq 28 and NOMe-seq 29,30 , have investigated chromatin status at the single cell level. However, due to its simplicity and reliability, ATAC-seq currently remains the most popular technique for chromatin profiling. Our study demonstrates that our simple plate-based scATAC-seq method can successfully detect different cell populations, including subtle and rare cell subtypes, from a complex tissue. More importantly, it is able to reveal key gene regulatory features, such as cell-type specific open chromatin regions and transcription factor motifs, in an unbiased manner. Future studies can utilise this method to unveil the regulatory characteristics of novel and rare cell populations and the mechanisms behind their transcriptional regulation.  Schematic view of the workflow of the scATAC-seq method. Tagmentation is performed upfront on bulk cell populations, followed by sorting single-nuclei into 96/384 well plates containing lysis buffer. The lysis buffer contains a low concentration of proteinase K and SDS to denature the Tn5 transposase and fragment the genome. Tween-20 is added to quench SDS 31 . Subsequently, library preparation by indexing PCR is performed, and the number of PCR cycles needed to amplify the library is determined by quantitative PCR (qPCR) (Supplementary Fig.  1b) .  Supplementary Figure 2. Scatter plots of different quality control metrics. Single cells from different batches are indicated by different colours, and empty well controls are also indicated. We removed cells that have less than 10,000 reads or less than 90% mapping rate, as indicated by dotted lines in (b). Supplementary Figure 4. Different comparisons of scATAC-seq from this study vers us previously published data sets using Fluidigm C1. The data from this study were downsampled to a similar depth (Online Methods) as the published ones. Comparisons were done by batches.

Supplementary
Supplementary Figure 5. Number of counts from all peaks that assigned to the indicated genes by HOMER. Figure 6. Hierarchical clustering of the Pearson's correlation between aggregated single cell clusters and the bulk ATAC-seq data sets from ImmGen. The full matrix is shown here, and the ImmGen sample labels were taken directly from the ImmGen ATAC-seq data deposited at the European Nucleotide Archive (ENA) ( https://www.ebi.ac.uk/ena/data/view/PRJNA392905 ).

Competing financial interests
None declared.

Cell isolation
The spleen from a C57BL/6Jax mouse was mashed by a 2-ml syringe plunger through a 70 μm cell strainer (Fisher Scientific 10788201) into 30 ml 1X DPBS (ThermoFisher 14190169) supplied with 2 mM EDTA and 0.5% (w/v) BSA (Sigma A9418). Cells were centrifuged down, supernatant was removed, and the cell pellet was briefly vortexed. 5 ml 1X RBC lysis buffer (ThermoFisher 00-4300-54) was used to resuspend the cell pellet, and the cell suspension was vortexed again, and left on bench for 5 minutes to lyse red blood cells. Then 45 ml 1X DPBS was added, and cells were centrifuged down. 30 ml 1X DPBS were used to resuspend the cell pellet. The cell suspension was passed through a Miltenyi 30 μm Pre-Separation Filter (Miltenyi 130-041-407), and the cell number was determined using C-chip counting chamber (VWR DHC-N01). All centrifugations were done at 500 g, 4 °C, 5 minutes.

Fluorescence-activated cell sorting (FACS)
FACS was performed on a BD-INFLUX sorter. DAPI+ single nuclei were sorted into a Armadillo 384-well PCR plate (ThermoFisher). We also sorted a few extra columns in an additional plate for the determination of the PCR cycle numbers required for library generation.

qPCR for library amplification
After assembly of the 20 μl PCR reaction (see Supplementary Protocol ), a pre-amplification step was performed on a PCR machine (Alpha Cycler 4, PCRmax) with 72°C 5 minutes, 98°C 5 minutes, 8 cycles of [98°C 10 seconds, 63°C 30 seconds, 72°C 20 seconds]. Of the product, 19 μl of pre-amplified library were transferred to a 96 well qPCR plate, 1 μl 20X EvaGreen (Biotium #31000) was added, and qPCR was performed on an ABI StepOnePlus system with the following cycle conditions: 98°C 1 minutes, 20 cycles of [98°C 10 seconds, 63°C 30 seconds, 72 °C 20 seconds]. Data was acquired at 72°C. We qualitatively chose the cycle number to where the fluorescence signals just about to start going up ( Supplementary Fig. 1b ). In this study, a total of 18 cycles were used to amplify the libraries.

Sequencing data processing
All sequencing data were processed using a pipeline written in snakemake 32 . The software/packages and the exact flags used in this study can be found in the 'Snakefile' provided in the GitHub repository https://github.com/dbrg77/plate_scATAC-seq . Briefly, reads were trimmed with cutadapt 33 to remove the Nextera sequence at the 3' end of short inserts. The trimmed reads were mapped to the reference mouse genome (UCSC mm10) using hisat2 34 . Reads with mapping quality less than 30 were removed by samtools 35 (-q 30 flag) and deduplicated using the MarkDuplicates function of the Picard tool ( http://broadinstitute.github.io/picard ). All reads from single cells were merged together using samtools, and the merged BAM file was deduplicated again. Peak calling was performed on the merged and deduplicated BAM file by MACS2 12 . For bulk ATAC-seq and single cell aggregate coverage visualisation, bedGraph files generated from MACS2 callpeak were converted to bigWig files and visualised via UCSC genome browser. For individual single cell ATAC-seq visualisation, aligned reads from individual cells were converted to bigBed files. A count matrix over the union of peaks was generated by counting the number of reads from individual cells that overlap the union peaks using coverageBed from the bedTools suite 36 .
Public ATAC-seq data processing FASTQ files were all downloaded from the European Nucleotide Archive (ENA). The ImmGen bulk ATAC-seq data (study accession PRJNA392905) and the scATAC-seq data using Fluidigm C1 (study accessions PRJNA274006 and PRJNA299657) were processed in the same way as described in this study. The 'Snakefile' used to process the data can be found at the the same GitHub repository.

Bioinformatics analysis
Codes used to carry out all the analyses were provided as Jupyter Notebook files, which can be found in the same GitHub repository. Briefly, downsampling was performed by randomly selecting a fraction of reads from the original FASTQ files using seqtk ( https://github.com/lh3/seqtk ), and the same pipeline was run on the sub-sampled FASTQ files. For comparison with Fluidigm C1 data, we downsampled our reads to the similar level (20% of current depth). For binarising the scATAC-seq data, peak calling was performed on reads merged from all cells, and we labelled the peak '1' (open) if there was at least one read overlapping the peak, and '0' (closed) otherwise. Latent semantic indexing analysis was performed by first normalising the binarized count matrix by term frequency inverse document frequency (TF-IDF) and then performing a Singular-Value Decomposition (SVD) on the normalised count matrix. Only the 2nd -50th dimensions after the SVD were passed to t-SNE for visualisation. To compare with ImmGen bulk ATAC-seq data, a reference peak set was created by taking the union of peaks from the peak calling results of aggregated scATAC-seq (this study) and different samples of ImmGen bulk ATAC-seq using mergeBed from the bedTools suite. All comparisons were done based on this reference peak set. The annotatePeaks.pl from HOMER 23 was used to assign genes to peaks. Latent semantic indexing, spectral clustering and logistic regression were carried out using Scikit-learn 37 .