An atlas of gene regulatory elements in adult mouse cerebrum

The mammalian cerebrum performs high-level sensory perception, motor control and cognitive functions through highly specialized cortical and subcortical structures1. Recent surveys of mouse and human brains with single-cell transcriptomics2–6 and high-throughput imaging technologies7,8 have uncovered hundreds of neural cell types distributed in different brain regions, but the transcriptional regulatory programs that are responsible for the unique identity and function of each cell type remain unknown. Here we probe the accessible chromatin in more than 800,000 individual nuclei from 45 regions that span the adult mouse isocortex, olfactory bulb, hippocampus and cerebral nuclei, and use the resulting data to map the state of 491,818 candidate cis-regulatory DNA elements in 160 distinct cell types. We find high specificity of spatial distribution for not only excitatory neurons, but also most classes of inhibitory neurons and a subset of glial cell types. We characterize the gene regulatory sequences associated with the regional specificity within these cell types. We further link a considerable fraction of the cis-regulatory elements to putative target genes expressed in diverse cerebral cell types and predict transcriptional regulators that are involved in a broad spectrum of molecular and cellular pathways in different neuronal and glial cell populations. Our results provide a foundation for comprehensive analysis of gene regulatory programs of the mammalian brain and assist in the interpretation of noncoding risk variants associated with various neurological diseases and traits in humans.

The cerebral cortex and cerebral nuclei or basal ganglia in adult mice are made up of tens of millions of neurons and glial cells 9 . The neurons are classified into many different types of excitatory projection neurons and inhibitory interneurons, defined by their distinct morphology, neurotransmitters, and synaptic connections 10-12 . The glial cells include astrocytes, oligodendrocytes, oligodendrocyte precursor cells, microglia and other less abundant non-neuronal cell types. Understanding how the identity and function of each neural cell type is established during development and modified by experience is one of the fundamental challenges in brain research. Despite recent advances in analysis of gene expression patterns using single-cell transcriptomics and spatial transcriptomics assays 2-8 , we still lack comprehensive maps of the transcriptional regulatory elements in each cell type. Transcriptional regulatory elements recruit DNA-binding transcription factors to exert control of target gene expression in cis in a cell-type-dependent manner 13,14 . Activation of these elements is accompanied by open chromatin, specific histone modifications and DNA hypomethylation 13,14 . To exploit these epigenetic features, candidate cis-regulatory elements (cCREs) have been mapped by techniques such as DNase I hypersensitive sites sequencing (DNase-seq), assay for transposase-accessible chromatin using sequencing (ATAC-seq), chromatin immunoprecipitation followed by sequencing (ChIP-seq) and whole-genome bisulfite sequencing 15,16 . Conventional assays performed using bulk tissue samples are unable to resolve the cCREs in individual neural cell types owing to the extreme cellular heterogeneity of the brain. To overcome this limitation, single-cell genomic technologies have been developed to enable detection of open chromatin in individual cells 17-21 and identify cell-type-specific transcriptional regulatory sequences in several mouse brain regions 19,22-27 .
As part of the BRAIN Initiative Cell Census Network (BICCN), we performed single-nucleus (sn)ATAC-seq assays using single-cell combinatorial indexing (sci)ATAC-seq 17,25 for more than 800,000 cells from Article 45 dissected regions in the adult mouse brain to produce comprehensive maps of cCREs in distinct cerebral cell types. We also integrated the chromatin accessibility data with brain single-cell RNA sequencing (scRNA-seq) data to characterize the gene regulatory programs of different brain cell types, and used the cCREs to interpret genetic risk variants that are associated with neurological diseases and traits.

snATAC-seq analysis of mouse brain cells
We dissected 45 brain regions from the isocortex, olfactory bulb (OLF), hippocampus (HIP) and cerebral nuclei (including the striatum and pallidum) in 8-week-old male mice (Fig. 1a, Extended Data Fig. 1, Supplementary Table 1). Each dissection was made from 600-μm-thick coronal brain slices starting at the frontal pole according to the Allen Brain Reference Atlas 28 (Extended Data Fig. 1). For each region, we performed snATAC-seq with two independent biological replicates (Fig. 1a, Extended Data Fig. 2a-f). A total of 813,799 nuclei with a median number of 4,929 fragments per nucleus passed rigorous quality control measures (Supplementary Table 2 (Fig. 1b-d). Next, the three cell classes were further divided into 43 subclasses (also referred to as major types in the accompanying paper 29 ) (Fig. 1b, d) and annotated these on the basis of chromatin accessibility at promoters and the gene bodies of at least three marker genes of known neural cell types 2,4 (Fig. 1e, Extended Data Fig. 6, Supplementary Table 4). Finally, we performed another round of clustering for each subclass and identified a total of 160 cell types at optimal resolutions (also referred to as subtypes in the accompanying paper 29 ) (Extended Data Figs. 3b-g, 7, Supplementary Table 3, Supplementary Note, Methods). For example, Lamp5 + neurons (LAMGA) and Sst + neurons (SSTGA) 4,6 were further divided into several cell types, including a chandelier-like cell type 4 and the Chodl cell type 6 (Fig. 1b, e). Notably, the detected clusters and cell type proportions from snATAC-seq were comparable between the combinatorial barcoding (sci) and the droplet-based 10x Genomics platform 20 (Extended Data Fig. 8, Supplementary Note).
We constructed a dendrogram to capture the hierarchical organization of chromatin landscapes among the 43 subclasses (Fig. 1d, Extended Data Fig. 9). This dendrogram shows known organizing principles of mammalian brain cells: the non-neuronal class is separated from the neuronal class, which is further separated on the basis of the neurotransmitter types (GABAergic versus glutamatergic) and developmental origins 4 (Fig. 1d). These chromatin-defined cell types matched well with the taxonomy based on transcriptomes 2 and DNA methylomes 29 (Extended Data Fig. 10, Supplementary Note, Supplementary Table 5).

Regional specificity of brain cell types
Taking advantage of our high-resolution brain dissections, we examined the regional specificity of each brain cell type (Extended Data Figs. 11, 12). We calculated a regional specificity score for each subclass and cell type based on the contribution from different brain regions and showed that this score is highly consistent between biological replicates ( Fig. 1f, g, Extended Data Fig. 12d, Methods). Overall, we found good agreement between the regional specificity of most neuronal cell types defined using snATAC-seq datasets and the normalized in situ hybridization (ISH) signals of marker genes in each cell type (Extended Data Fig. 13, Supplementary Table 6, Methods). As expected, most glial cell types were ubiquitously distributed throughout the different brain dissections and showed very low regional specificity (Fig. 1f), except for neuronal intermediate progenitor cells (NIPCs) and radial glia-like cells in the dentate gyrus or subventricular zone (labelled as subclass RGL in Fig. 1f). In contrast to the glial cell types, most GABAergic and glutamatergic neurons showed notable regional specificity (Fig. 1f, g). We found a marked separation on the basis of brain subregions for distinct neuron types such as granule cells in the dentate gyrus and matrix D1 neurons (MXD) in the pallidum. Glutamatergic neurons showed slightly higher regional specificity than GABAergic neurons, consistent with previous single-cell transcriptomic analysis 4 (Fig. 1g, bottom). We also observed distinct types of Pvalb + neuron (PVGA), intra-telencephalic-projection neuron, and hippocampal cornu ammonis (CA1) neuron (CA1GL) that were highly restricted to individual brain regions or dissections (Extended Data Fig. 14, Supplementary Note).

Mapping of cCREs in mouse brain cells
To delineate the gene regulatory programs that underlie the identity and function of each brain cell type, we next identified cCREs in each of the 160 brain cell types from the accessible chromatin landscapes. To account for different sequencing depth and/or the number of nuclei in individual clusters, we identified reproducible peaks based on a corrected integer score calculated by model-based analysis of ChIPseq data (MACS2) 30 Tables 7, 8). Of these cCREs, 96.3% were located at least 1 kb away from annotated promoter regions of protein-coding and long noncoding RNA genes (Fig. 2a, Extended Data Fig. 15e). Several lines of evidence support the authenticity of the identified cCREs. First, they strongly (70.1%) overlapped with the DNase hypersensitive sites (DHSs) that were previously mapped in a broad spectrum of bulk mouse tissues and developmental stages 14 (Fig. 2b, Extended Data Fig. 15f). Second, they generally showed higher levels of sequence conservation than random shuffled genomic regions with similar GC content (Fig. 2c, Extended Data Fig. 15g). Third, they were enriched for active chromatin states or potential insulator protein-binding sites that were previously mapped by bulk analysis of mouse brain tissues 31-33 (Fig. 2d, Extended Data Fig. 15h).
To define the cell type specificity of the cCREs, we first plotted the median levels of chromatin accessibility against the range of variation for each element (Fig. 2e). We found that most cCREs exhibited highly variable chromatin accessibility across the brain cell types identified, except for 8,188 regions that showed accessible chromatin in almost all cell clusters (Fig. 2e). The invariant cCREs were highly enriched for promoters (81%), with the remainder including CCCTC-binding factor (CTCF)-binding sites (9%) and strong enhancers (Fig. 2f). To characterize the cell type specificity of the cCREs more explicitly, we used non-negative matrix factorization to group them into 42 modules, with elements in each module sharing similar cell type specificity profiles. Aside from the first module (M1) that included mostly cell type invariant cCREs, the remaining 41 modules showed high cell-type-restricted accessibility (Fig. 2g, Supplementary Tables 9, 10). These cell-type-restricted modules were enriched for distinct sets of motifs recognized by known transcriptional regulators (Supplementary Table 11), laying a foundation for investigating the gene regulatory programs in different brain cell types and regions.    ASC  OPC  IOL  OGC  MGL  RGL  PER  VEC  VLMC  VPIA  LAMGA  VIPGA  PVGA  SSTGA  LSXGA  MSGA  STRGA  D1MSN  D2MSN  MXD  OBGA  OBDOP  OBNBL  DGNBL  ITL23GL  ITL4GL  ITL5GL  ITL6GL  ITHGL  NPGL  PTGL  CTGL  L6bGL  CLAGL  CA1GL  CA3GL  GRC  CRC  PIRGL  OBGL

Differential chromatin states at cCREs
Because most neuronal types showed highly restricted distribution in the mouse cerebral cortex and basal ganglia, we hypothesized that the regional specificity of different cell types is accompanied by differences in chromatin accessibility at the cCREs, which drive cell-specific gene expression patterns. We performed integrative analysis to delineate these differentially accessible cCREs among different neuronal and glial cell types. We compared the open chromatin landscapes among different cell types using a likelihood ratio test (Methods), and identified a median number of 11,683 cCREs that exhibited differential accessibility (range: 360-31,608) (Extended Data Fig. 16a Table 13). We found a strong motif enrichment of the zinc-finger transcription factor family KLF in cCREs in SSTGA10 cells (also known as Sst-Chodl cells) compared with other SST neurons (Fig. 3c). This observation, coupled with the finding that Klf5, a member of the KLF family, was expressed in Chodl cells, implicates KLF5 in the transcriptional control of Chodl cells (Fig. 3d). We also identified three astrocyte cell types and performed differential chromatin accessibility analysis for cCREs between these (Fig. 3e, Extended Data Fig. 18a). Two cell types were predominantly found in the cortex and hippocampus, whereas the third cell type (ASCN) was detected mostly in the pallidum and lateral septum complex 34 (Fig. 3f). The cortical or hippocampal astrocyte cell types resembled previously defined fibrous astrocytes in white matter (ASCW) and protoplasmic astrocytes in grey matter (ASCG) 5,35 (Extended Data Fig. 18b-d). Consistent with the previous findings that astrocytes were organized into distinct lineage-associated laminae 36 , we detected a spatial gradient in ASCG based on chromatin accessibility at several gene loci in ASCG (Extended Data Fig. 18e). We further performed motif analysis for differentially accessible regions in the ASCN cell type, finding enrichment of the binding motif for the GLI family of zinc-finger transcription factors (Fig. 3g, h, Extended Data Fig. 18f, Supplementary Table 14), which mediate the sonic hedgehog (Shh) signalling pathway that maintains neural stem-cell and astrocyte functions 36 . Notably, we found a cCRE that contained the GLI motif upstream of the Olig2 promoter. This is consistent with a potential role for Shh signalling in regulating Olig2 expression (Fig. 3i, Extended Data Fig. 18g, h) in OLIG2-lineage-derived mature astrocytes in the globus pallidus 34 . We found that a high fraction of genes specific to OLIG2-lineage astrocytes were predominantly expressed in the pallidum (Fig. 3j). For example, Itih3, Slc6a11 34 and Agt were predominantly expressed in the pallidum, and cCREs at the gene locus were specifically accessible in ASCNs (Fig. 3k, l, Extended Data Fig. 18h-j). We also found enrichment of distinct transcription factor motifs from regional-specific cCREs in ASCGs sampled from different brain regions (Extended Data Fig. 18k

Integrative analysis of gene regulation
To investigate the transcriptional regulatory programs that are responsible for cell-type-specific gene expression patterns in the mouse cerebrum, we carried out integrative analysis that combines the snATAC-seq data collected in the current study with previously published scRNA-seq data from matched brain regions 2 . We first connected 261,204 distal cCREs to 12,722 putative target genes by measuring the co-accessibility using Cicero 37,38 (Fig. 4a, Methods). This analysis identified a total of 813,638 gene-cCRE pairs within 500 kb of each other (Supplementary  Table 16). Next, we identified the subset of cCREs that might increase the expression of putative target genes and therefore function as putative enhancers in neuronal or non-neuronal types. To this end, we first identified distal cCREs for which chromatin accessibility was correlated with transcriptional variation of the linked genes in the RNA-ATAC joint cell clusters as defined above (Fig. 4a, b, Extended Data Fig. 10). This analysis revealed a total of 129,404 pairs of positively correlated cCRE (putative enhancers) and genes at an empirically defined significance threshold of FDR < 0.01 (Supplementary Table 16). These included 86,850 putative enhancers and 10,604 genes ( Fig. 4b, Extended Data Fig. 19, Supplementary Note). To investigate how the putative enhancers may direct cell-type-specific gene expression, we further classified them into 38 modules, by applying non-negative matrix factorization to the matrix of normalized chromatin accessibility across the RNA-ATAC joint cell clusters (Supplementary Table 17). The putative enhancers in each module had a similar pattern of chromatin accessibility across cell clusters to the expression of putative target genes (Fig. 4c, Supplementary Table   18). This analysis revealed a large group of 12,740 putative enhancers that were linked to 6,373 genes expressed at a higher level in all neuronal cell clusters than in all non-neuronal cell types (module M1) (Fig. 4c). It also uncovered modules of enhancer-gene pairs that were active in a more restricted manner (modules M2-M38) (Fig. 4c, Extended Data Fig. 19, Supplementary Tables 19, 20, Supplementary Note). Genes associated with module M1 are preferentially expressed in both glutamatergic and GABAergic neurons, but not in glial cell types (Fig. 4c). De novo motif enrichment analysis of the 12,740 cCREs or putative enhancers in this module showed strong enrichment of sequence motifs recognized by the transcription factors CTCF, RFX and MEF2 (Supplementary Table 21, Extended Data Fig. 19d), as well as many known motifs for other transcription factors (Fig. 4d, Supplementary  Table 20). CTCF is a ubiquitously expressed DNA-binding protein with a well-established role in transcriptional insulation and chromatin organization 39 . CTCF has also been shown to promote neurogenesis by binding to promoters and enhancers of the proto-cadherin alpha gene cluster and facilitating enhancer-promoter contacts 40,41 . We found putative enhancers with one or more CTCF-binding motifs linked to 2,601 genes that were broadly expressed in both inhibitory and excitatory neurons (

Neurogenesis in the adult mouse brain
Neurogenesis in the adult mouse brain is spatially restricted to the subgranular zone (SGZ) in the dentate gyrus of the hippocampus where excitatory neurons are generated, and the subventricular zone (SVZ) of the lateral ventricles that give rise to GABAergic neurons 42 . The NIPCs that are involved in adult neurogenesis 42,43 could be identified as the cells lined up in trajectories between radial glia-like cells and neuroblasts in both brain regions (Fig. 4e) and their presence in the respective dissections was supported by ISH data from the Allen Brain Atlas 44 for several marker genes (Extended Data Fig. 21a, b). We predicted potential transcription factors that contribute to NIPCs as well as other cell types by integrating RNA expression and motif enrichment analysis using the Taiji pipeline 45 (Fig. 4f, Supplementary Table 24, Methods). Consistent with previous reports, NR2E1 was predicted to be a master regulator in both NIPC populations 46 , and SOX2 was a regulator of the NIPCs from the SGZ 42 , whereas E2F1 contributed to NIPCs from the SVZ 47 . Although chromatin landscapes in the NIPCs from both regions were very similar (Fig. 4g), we identified 200 differentially accessible regions in the NIPC population between the SGZ and the SVZ (Fig. 4h). Several cCREs at Neurog2, which encodes a protein that is crucial for glutamatergic granule neuron specification in the SGZ, were found to be accessible selectively in the SGZ but not in the SVZ 43 (Fig. 4i). By contrast, several cCREs with chromatin accessibility in SVZ NIPCs were located at the Dlx2 locus-a gene that is important for the specification of GABAergic neurons 43,46 (Extended Data Fig. 21c). An active enhancer previously validated by mouse transgenics 48 was predicted to target the nearby Trappc9 gene, which encodes a protein that is involved in nerve growth factor-induced neuronal differentiation 49 (Fig. 4i, j). These observations suggest that NIPCs in the SGZ and SVZ give rise to distinct neuronal cell types by engaging different cCREs involved in controlling region-specific gene expression of key regulator genes.

Interpreting noncoding risk variants
Genome-wide association studies (GWASs) have identified genetic variants that are associated with many neurological diseases and traits (Supplementary Table 25), but interpreting the results has been challenging because most variants are located in noncoding parts of the genome that often lack functional annotations 50 . To test whether our maps of   Expression values of Sst-Chodl cells were extracted from the Allen Brain Atlas (atlas.brain-map.org) 44 . e, UMAP 58 embedding of astrocyte cell types. f, Contribution of brain regions to each astrocyte cell type. CNU, cerebral nuclei; OLF, olfactory area; HPF, hippocampus. g, Normalized accessibility of astrocyte cell-type-specific cCREs. h, Motif enrichment in astrocyte cell-type-specific cCREs. Hox, homeobox; HTH, helix-turn-helix; Zf, zinc-finger. i, Genome browser view 60 showing ASCN-specific cCREs at the Olig2 locus. j, Heat map showing the fraction of overlap between spatially mapped genes from ISH in different brain structures and genes specific to astrocyte cell types. ***P < 0.0001, Fisher's exact test. k, Bar plot showing log 2 -transformed expression value of Itih3 in each brain region from ISH experiments. CTXsp, cortical subplate. Images downloaded from © 2004 Allen Institute for Brain Science. Allen Mouse Brain Atlas. Available from: atlas. brain-map.org. l, Views of ISH experiment from the Allen Brain Atlas, showing spatial expression of Itih3 in the pallidum.   cCREs in different mouse brain cell types could assist the interpretation of noncoding risk variants of neurological diseases, we identified orthologues of the mouse cCREs in the human genome by performing reciprocal homology searches 51 (Methods). For this analysis, we found that for 69.2% of the mouse brain cCREs, human genome sequences with high similarity could be identified (more than 50% of bases lifted over to the human genome) (Extended Data Fig. 22a). Supporting the function of these human orthologues of the mouse brain cCREs, 83.0% of them overlapped with representative DNase hypersensitivity sites in the human genome 14 .
We performed linkage disequilibrium score regression 52 analysis and found significant associations between 20 neurological traits (Supplementary Table 25) and the open chromatin landscapes in one or more subclasses of the brain cells we identified (Fig. 5, Methods, Supplementary Note). In particular, we observed widespread and strong enrichment of genetic variants linked to psychiatric and cognitive traits such as major depressive disorder, bipolar disorder and schizophrenia (SCZ) within accessible cCREs across various neuronal cell types (Fig. 5). Other neurological traits-such as attention deficit hyperactivity disorder and autism spectrum disorder-were associated with specific neuronal cell types in cerebral nuclei and the hippocampus (Fig. 5). Risk variants for schizophrenia were not only enriched in cCREs in all excitatory neurons, but also enriched in certain inhibitory neuron cell types 53 (Fig. 5). We also found that more than 25% of homologous sequences of SCZ causal variants reside in the mouse cCREs defined in this study (Extended Data Fig. 22b, Supplementary Note). The strongest enrichment of heritability for bipolar disorder was found in cCREs that mapped in the excitatory neurons from the isocortex (Fig. 5). Risk variants of tobacco use disorder showed significant enrichment in the cell types from the striatum-a cerebral nucleus previously implicated in addiction 54 (Fig. 5).
Understanding the cellular and molecular basis of brain circuits is one of the grand challenges of the twenty-first century 55,56 . In-depth knowledge of the transcriptional regulatory program in brain cells would not only improve our understanding of the molecular inner workings of neurons and non-neuronal cells, but could also shed light on the pathogenesis of a spectrum of neuropsychiatric diseases 57 . Here, we report a comprehensive profiling of chromatin accessibility at single-cell resolution in the mouse cerebrum. The chromatin accessibility maps of 491,818 cCREs, probed in 813,799 nuclei and 160 cell types, span several cerebral cortical areas and subcortical structures. Taking advantage of our high-resolution brain dissections, we examined the regional specificity in chromatin accessibility of cell types in the mouse cerebrum and showed that most brain cell types exhibit strong regional specificity. The described cCRE atlas (http://catlas.org/mousebrain) represents a rich resource for the neuroscience community to understand the molecular patterns that underlie diversification of brain cell types in complementation to other molecular and anatomical data.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Tissue preparation and nuclei isolation
All experimental procedures using live animals were approved by the SALK Institute Animal Care and Use Committee under protocol number 18-00006. Adult C57BL/6J male mice were purchased from Jackson Laboratories. Brains were extracted from 56-63-day-old mice and sectioned into 600-μm coronal sections along the anterior-posterior axis in ice-cold dissection media 62,63 , by using a brain slicer from CNS muSlice LLC. Specific brain regions were dissected according to the Allen Brain Reference Atlas 28 (Extended Data Fig. 1) and nuclei isolated as previously described 63 . Regions were pooled from 3-31 of the same sex to obtain enough nuclei for snATAC-seq for each biological replica, and two biological replicas were processed for each dissection region.

Nuclei indexing scheme
To generate snATAC-seq libraries we used initially an indexing scheme as previously described (version 1) 22,25 . Here, 16 p5 and 24 p7 indexes were combined to generate an array of 384 indexes for tagmentation and 16 i5 as well as 48 i7 indexes were combined for an array of 768 PCR indexes. Owing to this library design, it is required to sequence all four indexes to assign a read to a specific nucleus with long reads and a constant base sequence for both indices reads between i and p barcodes. Therefore, the resulting libraries were sequenced with 25% spike-in library on a HiSeq2500 (RRID:SCR_016383) and these read lengths: 50 + 43 + 37 + 50 (ref. 25 ).
To generate libraries compatible with other sequencers and not requiring spike-in libraries or custom sequencing recipes, we modified the library scheme (Version 2). For this, we used 384 individual indices for T7 and combined with one T5 with a universal index sequence for tagmentation (for a total of 384 tagmentation indexes). For PCR, we used 768 different i5 indexes and combined with a universal i7 primer index sequence. Tagmentation indexes were 10 bp and PCR indexes 12-bp long. We made sure, that the hamming distance between every two barcodes was ≥ 4, the GC content between 37.5-62.5%, and the number of repeats ≤ 3. The resulting libraries were sequenced on a HiSeq4000 with custom primers and these read lengths: 50 + 10 + 12 + 50 (Supplementary Table 26).

Processing and alignment of sequencing reads
Paired-end sequencing reads were demultiplexed and the cell index transferred to the read name. Sequencing reads were aligned to mm10 reference genome using bwa 64 . After alignment, we used the R package ATACseqQC (1.10.2) 65 to check for fragment length contribution which is characteristic for ATAC-seq libraries. Next, we combined the sequencing reads to fragments, and for each fragment we performed the following quality control: (1) keep only fragments quality score MAPQ > 30; (2) keep only the properly paired fragments with length < 1,000 bp; (3) PCR duplicates were further removed with SnapTools (https://github.com/r3fang/SnapTools, RRID:SCR_018097) 26 . Reads were sorted on the basis of the cell barcode in the read name.

TSS enrichment calculation
Enrichment of ATAC-seq accessibility at TSSs was used to quantify data quality without the need for a defined peak set. The method for calculating enrichment at TSS was adapted from previously described. TSS positions were obtained from the GENCODE database v16 (RRID:SCR_014966) 38 . In brief, Tn5 corrected insertions (reads aligned to the positive strand were shifted +4 bp and reads aligned to the negative strand were shifted -5 bp) were aggregated ± 2,000 bp relative (TSS strand-corrected) to each unique TSS genome-wide. Then this profile was normalized to the mean accessibility ± 1,900-2,000 bp from the TSS and smoothed every 11 bp. The max of the smoothed profile was taken as the TSS enrichment.

Doublet removal
We used a modified version of Scrublet (RRID:SCR_018098) 66 to remove potential doublets for every dataset independently. Peaks were called using MACS2 scores for aggregate accessibility profiles on each sample. Next, cell-by-peak count matrices were calculated and used as input, with default parameters. Doublet scores were calculated for both observed nuclei {x i } and simulated doublets {y i } using Scrublet (RRID:SCR_018098) 66 . Next, a threshold θ is selected based on the distribution of {y i }, and observed nuclei with doublet score larger than θ are predicted as doublets. To determine θ, we fit a two-component mixture distribution by using function normalmixEM from R package mixtools 67 . The lower component contained most embedded doublet types, and the other component contained majority of neo-typic doublets (collision between nuclei from different clusters. We selected the threshold θ where the p x μ σ p x μ σ ⋅ pdf( , , )= ⋅pdf( , , ) 1 1 1 2 2 2 in which p denotes probability; pdf denotes probability density function. This value suggested that the nuclei have same chance of belonging to both classes.
Cell clustering We used an iterative clustering strategy using the snapATAC 26 package (RRID:SCR_018097) with slight modifications as detailed below. For round 1 clustering, we clustered and finally merged single nuclei to three main cell classes: non-neurons, GABAergic neurons, and glutamatergic neurons. For each main cell class, we performed another round of clustering to identify major cell subclasses. Last, for each subclass, we performed a third round of clustering to find cell types.
Detailed description for every step is as follows.
(1) Nuclei filtering. Nuclei with ≥1,000 uniquely mapped fragments and TSS enrichment > 10 were filtered for individual dataset. Second, potential barcode collisions were also removed for individual datasets. (2) Feature bin selection. First, we calculated a cell-by-bin matrix at 5-kb resolution for every dataset independently and subsequently merged the matrices. Second, we converted the cell-by-bin count matrix to a binary matrix. Third, we filtered out any bins overlapping with the ENCODE blacklist (mm10, http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz). Fourth, we focused on bins on chromosomes 1-19, X and Y. Last, we removed the top 5% bins with the highest read coverage from the count matrix. (3) Dimensionality reduction. SnapATAC applies a nonlinear dimensionality reduction method called diffusion maps, which is highly robust to noise and perturbation 26 . However, the computational time of the diffusion maps algorithm scales exponentially with the increase of number of cells. To overcome this limitation, we combined the Nyström method (a sampling technique) 68 and diffusion maps to present Nyström landmark diffusion map to generate the low-dimensional embedding for large-scale dataset.
A Nyström landmark diffusion maps algorithm includes three major steps: (1) Sampling. We sampled a subset of K (K ≪ N) cells from N total cells as 'landmarks'.
(2) Embedding. We computed a diffusion map embedding for K landmarks.
(3) Extension. We projected the remaining (N -K) cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells. Having more than 800,000 single nuclei at the beginning, we decided to apply this strategy on round 1 and 2 clustering. A total of 10,000 cells were sampled as landmarks and the remaining query cells were projected onto the diffusion maps embedding of landmarks. Later for the round 3 clustering, diffusion map embeddings were directly calculated from all nuclei.
(4) Eigenvector selection. To determine the number of eigenvectors of diffusion operator to include for downstream analysis, we generated an 'elbow plot', to rank all eigenvectors on the basis of the percentage of variance explained by each one. For each round of clustering, we selected the top 10-20 eigenvectors that captured most of the variance.
(5) Graph-based clustering. Using the selected significant eigenvectors, we next construct a k-nearest neighbour graph. Each cell is a node and the k-nearest neighbours of each cell were identified according to the Euclidian distance and edges were drawn between neighbours in the graph. Next, we applied the Leiden algorithm on the k-nearest neighbour graph using Python package leidenalg (https://github.com/ vtraag/leidenalg) 69 .
(6) Optimization on cluster resolution. We tested different 'resolu-tion_parameter' parameters (step between 0 and 1 by 0.1) to determine the optimal resolution for different cell populations. For each resolution value, we tested whether there was clear separation between nuclei. To do so, we generated a cell-by-cell consensus matrix in which each element represents the fraction of observations two nuclei are part of the same cluster. A perfectly stable matrix would consist entirely of zeros and ones, meaning that two nuclei either cluster together or not in every iteration. The relative stability of the consensus matrices can be used to infer the optimal resolution. To this end, we generated a consensus matrix based on 300 rounds of Leiden clustering with randomized starting seed s. Let M s denote the N × N connectivity matrix resulting from applying Leiden algorithm to the dataset D s with different seeds. The entries of M s are defined as follows: The entry (i,j) in the consensus matrix is the number of times single nucleus i and j were clustered together divided by the total number of times they were selected together. The matrix is symmetric, and each element is defined within the range [0, 1]. We examined the cumulative distribution function (CDF) curve and calculated proportion of ambiguous clustering (PAC) score to quantify stability at each resolution. The resolution with a local minimum of the PAC scores denotes the parameters for the optimal clusters. In the case these were multiple local minimal PACs, we picked the one with higher resolution. Another measurement is dispersion coefficient, which reflects the dispersion (ranges from 0 to 1) of the consensus matrix M from the value 0.5. The closer to 1 is the dispersion coefficient, the more perfect is consensus matrix, and thus the more stable is the clustering. In a perfect consensus matrix, all entries are 0 or 1, meaning that all connectivity matrices are identical. The dispersion coefficient is defined as: Finally, for every cluster, we tested whether we could identify differential features compared to all other nuclei (background) and the nearest nuclei (local background) using the function 'findDAR'.
(7) Visualization. For visualization, we applied UMAP 58 . Using the cell embedding, we applied both k-nearest neighbor batch effect test (kBET) and local inverse Simpson's index (LISI) analysis to test the robustness of the clutering results to variation of sequencing depth, signal-to-noise ratios, and batches.

Dendrogram construction for mouse brain cell types
First, we calculated for cCRE the median accessibility per cluster and used this value as cluster centroid. Next, we calculated the coefficient of variant for the cluster centroid of each element across major cell types. Finally, we only kept variable elements with coefficient of variants that were larger than 1.5 for dendrogram construction.
We used the set of variable features defined above to calculate a correlation-based distance matrix. Next, we performed linkage hierarchical clustering using the R package pvclust (v.2.0) 70 with parameters method.dist = "cor" and method.hclust = "ward.D2". The confidence for each branch of the tree was estimated by the bootstrap resampling approach with 1,000 rounds.

Regional specificity of cell types
The specificity score is defined as Jensen-Shannon divergence, which measures the similarity between two probability distributions. For each cell type, the contribution of different brain regions was first calculated. Then, we compared this distribution with the contribution of brain regions calculated from all sampled cells. We used the function 'JSD' from R package philentropy for this analysis 71 .

Identification of reproducible peak sets in each cell cluster
We performed peak calling according to the ENCODE ATAC-seq pipeline (https://www.encodeproject.org/atac-seq/). For every cell cluster, we combined all properly paired reads to generate a pseudo-bulk ATAC-seq dataset for individual biological replicates. In addition, we generated two pseudo-replicates that comprise half of the reads from each biological replicate. We called peak for each of the four datasets and a pool of both replicates independently. Peak calling was performed on the Tn5-corrected single-base insertions using the MACS2 score 30 with these parameters:-shift -75-extsize 150-nomodel-call-summits-SPMR -q 0.01. Finally, we extended peak summits by 250 bp on either side to a final width of 501 bp for merging and downstream analysis. To generate a list of reproducible peaks, we kept peaks that (1) were detected in the pooled dataset and overlapped ≥50% of peak length with a peak in both individual replicates; or (2) were detected in the pooled dataset and overlapped ≥50% of peak length with a peak in both pseudo-replicates.
We found that when cell population varied in read depth or the number of nuclei, the MACS2 score varied proportionally owing to the nature of the Poisson distribution test in MACS2 scores 30 . Ideally, we would perform a reads-in-peaks normalization, but in practice, this type of normalization was not possible because we did not know how many peaks we would get. To account for differences in performance of MACS2 scores 30 on the basis of read depth and/or number of nuclei in individual clusters, we converted MACS2 peak scores (−log 10 (q-value)) to 'score per million' 72 . We filtered reproducible peaks by choosing a score-per-million cut-off of 2 to filter reproducible peaks.
Lastly, because snATAC-seq data are very sparse, we selected only elements that were identified as open chromatin in a significant fraction of the cells in each cluster. To this end, we first randomly selected the same number of non-DHS regions (approximately 670,000 elements) from the genome as background and calculated the fraction of nuclei for each cell type that showed a signal at these sites. Next, we fitted a zero-inflated beta model, and empirically identified a significance threshold of FDR < 0.01 to filter potential false positive peaks. Peak regions with FDR < 0.01 in at least one of the clusters were included in downstream analysis.

Computing chromatin accessibility scores
Accessibility of cCREs in individual clusters was quantified by counting the fragments in individual clusters normalized by read depth (CPM). For each gene, we summed up the counts within the gene body +2 kb upstream to calculate the gene activity score. The gene activity score were used for integrative analysis with scRNA-seq. For better visualization, we smoothed the gene activity score to 50 nearest neighbour nuclei using Markov affinity-based graph imputation of cells (MAGIC) 74 .
To directly compare our single-nucleus chromatin accessibility-derived cell clusters with the single-cell transcriptomics defined taxonomy of the mouse brain 2 , we first used the snATAC-seq data to impute RNA expression levels (gene activity scores) according to the chromatin accessibility of gene promoter and gene body as previously described 75 . We performed integrative analysis with scRNA-seq using Seurat 3.0 (RRID:SCR_016341) to compare cell annotation between different modalities 75 . We randomly selected 200 nuclei (and used all nuclei for cell cluster with fewer than 200 nuclei) from each cell cluster for integrative analysis. We first generated a Seurat object in R by using previously calculated gene activity scores, diffusion map embeddings and cell cluster labels from snATAC-seq. Then, variable genes were identified from scRNA-seq and used for identifying anchors between these two modalities. Finally, to visualize all the cells together, we co-embedded the scRNA-seq and snATAC-seq profiles in the same low dimensional space.
To quantify the similarity between cell clusters from two modalities, we calculated an overlapping score as the sum of the minimum proportion of cells or nuclei in each cluster that overlapped within each co-embedding cluster 5 . Cluster overlaps varied from 0 to 1 and were visualized as a heat map with snATAC-seq clusters in rows and scRNA-seq clusters in columns. We found strong correspondence between the two modalities, which was evidenced by co-embedding of both transcriptomic (T-type) and chromatin accessibility (A-type) cells in the same RNA-ATAC joint clusters (Extended Data Fig. 10a, Supplementary Table 5). For this analysis, we examined GABAergic neurons, glutamatergic neurons and non-neuronal cell classes separately (Extended Data Fig. 10, Supplementary Table 5).

Identification of cis-regulatory modules
We used non-negative matrix factorization 76 to group cCREs into cis-regulatory modules on the basis of their relative accessibility across major clusters. We adapted non-negative matrix factorization (Python package: sklearn 77 ) to decompose the cell-by-cCRE matrix V (N × M, N rows: cCRE, M columns: cell clusters) into a coefficient matrix H (R × M, R rows: number of modules) and a basis matrix W (N × R), with a given rank R: The basis matrix defines module related accessible cCREs, and the coefficient matrix defines the cell cluster components and their weights in each module. The key issue to decompose the occupancy profile matrix was to find a reasonable value for the rank R (that is, the number of modules). Several criteria have been proposed to decide whether a given rank R decomposes the occupancy profile matrix into meaningful clusters. Here we applied two measurements 'sparseness' 78 and 'entropy' 79 to evaluate the clustering result. Average values were calculated from 100 times for non-negative matrix factorization runs at each given rank with random seed, which will ensure the measurements are stable.
Next, we used the coefficient matrix to associate modules with distinct cell clusters. In the coefficient matrix, each row represents a module and each column represents a cell cluster. The values in the matrix indicate the weights of clusters in their corresponding module. The coefficient matrix was then scaled by column (cluster) from 0 to 1. Subsequently, we used a coefficient > 0.1 (approximately the 95th percentile of the whole matrix) as threshold to associate a cluster with a module.
In addition, we associated each module with accessible elements using the basis matrix. For each element and each module, we derived a basis coefficient score, which represents the accessible signal contributed by all cluster in the defined module. In addition, we also implemented and calculated a basis-specificity score called 'feature score' for each accessible element using the 'kim' method 79 . The feature score ranges from 0 to 1. A high feature score means that a distinct element is specifically associated with a specific module. Only features that fulfil both of the following criteria were retained as module specific elements: (1) feature score greater than median + 3 standard deviations; (2) the maximum contribution to a basis component is greater than the median of all contributions (that is, of all elements of W).

Identification of differentially accessible regions and definition of specificity score
To identify cCREs that were differentially accessible either in subtypes or in brain regions, we constructed a logistic regression model predicting cluster/region membership based on each cCRE individually and compares this to a null model with a likelihood ratio test. We used two functions 'fit_models' and 'compare_models' in R package Monocle3 (v.0.2.2) 80 to perform the differential test. We designed the full model as in which P ij represents the probability of ith site is accessible in the jth cell, a is the log 10 -transformed total number of sites observed as accessible for the jth cell, m is membership of the jth cell in either cluster or region being tested, r is the replicate label for jth cell and ε is an error term.
For each set of testing, between subtypes or between regions in cell type, we kept only cCREs that overlapped with peaks identified in corresponding cell types. A likelihood ratio test was then used to determine whether the full model (including cell cluster or region membership) provided a significantly better fit of the data than the reduced model. After correcting P values using Benjamini-Hochberg method, we set an FDR cut-off as 0.001 to filter out significant differential cCREs.
The log 2 -transfomed fold change is used for two-group comparison, for multiple groups, we calculated a Jensen-Shannon divergence-based specificity score previously described 22 to better assign differential cCREs to cell type or brain region. The fraction of accessibility of each cluster f was first calculated for each ith site. We normalized these scores by multiplying by corresponding scaling factors, which are considering different overall complexity across groups. To do so, median number of sites accessible c in individual cells for each cluster was calculated and followed with log 10 -transforming. Then, we took the ratio of the average of c (across all clusters) over the median accessibility in each cluster as scaling factor. These corrected fraction of accessibility for each cCRE was then converted to probability by scaling by groups. Then, we calculated Jensen-Shannon divergence between two probability distributions. For example, the probability distribution for the first cCRE as d1, to test whether this cCRE is specific in group 1, we assumed another probability distribution: Function 'JSD' in R package 'philentropy' was used to calculate Jensen-Shannon divergence between these two probability distributions, and Jensen-Shannon-based specificity ( JSS) scores was defined as:

JSS = 1 − Jensen Shannon divergence
For each group, we calculated the JSS score for every cCRE. To find a reasonable cut-off to determine restricted or general cCREs, we consider JSS scores from all cCREs that are not identified as differential accessible (from likelihood ratio test) as a background distribution, and JSS scores from cCREs that passed our likelihood ratio test threshold and had positive values to be true positives. We set an empirical FDR cut-off where the type I error was no more than 5%. Finally, the differential cCREs could be aligned to several cell types or brain regions based on the JSS score, we named the one can be assigned to only one type or region as region-specific or cell-type-specific cCREs.
Comparison between the regional specificity of cell types defined by snATAC-seq data and the spatial ISH signals of cell-type-specific genes To validate the regional specificity of cell types, we took advantage of the spatially mapped quantified ISH expression from ABA 44 in five matched major brain structures, isocortex, olfactory areas, hippocampal formation (HPF), striatum (STR), pallidum. We used the 'differential search' function to identify 10,269 genes with increased expression in these five brain regions compared to all brain regions 'grey matter' (expression level >1 and fold change > 1). We also identified cell-type-specific genes using Seurat 75 with default parameters for each joint cluster from integrative analysis (fold change > 1 and FDR < 0.05) (Extended Data Fig. 10). 505 cell-type-specific genes (range from 1 to 53, 15 on average) overlapped with the list of genes with increased expression in the five brain regions. For each cell type, we calculated the regional specificity score (see previous section: regional specificity of cell types) on the basis of the relative contribution from five brain regions estimated from snATAC-seq datasets, and also a coefficient of variation based on averaged normalized ISH signals of cell-type-specific marker genes. For each cell-type-specific gene, we calculated the PCC between cell composition in five brain structures and spatial expression levels across the five brain structures derived from ISH.
Because the astrocyte subtypes identified in our study were not resolved in scRNA-seq studies, we identified subtype-specific genes for astrocyte subtypes using chromatin accessibility from snATAC-seq using a likelihood ratio test. The cell-type-specific genes were filtered by FDR less than 0.001 from the likelihood ratio test and empirical FDR cut-off of no more than 5% for JSS scores. Then, we calculated the fraction of overlap between spatially mapped ISH genes from different brain structures and genes with astrocyte subtype-specific accessibility.

Predicting enhancer-promoter interactions
First, co-accessible regions were identified for all open regions in each cell cluster (randomly selected 200 nuclei, and using all nuclei for cell cluster with <200 nuclei) separately, using Cicero 37 with following parameters: aggregation k = 10, window size = 500 kb, distance constraint = 250 kb. To find an optimal co-accessibility threshold for each cluster, we generated a random shuffled cCRE-by-cell matrix as background and identified co-accessible regions from this shuffled matrix. We fitted the distribution of co-accessibility scores from random shuffled background into a normal distribution model by using the R package fitdistrplus 81 . Next, we tested every co-accessibility pairs and set the cut-off at co-accessibility score with an empirically defined significance threshold of FDR < 0.01. CCRE outside of ±1 kb of the TSS in GENCODE mm10 (v.16, RRID:SCR_014966) 38 . were considered distal. Next, we assigned co-accessibility pairs to three groups: proximal-to-proximal, distal-to-distal, and distal-to-proximal. In this study, we focus only on distal-to-proximal pairs. We further used RNA expression from matched T-types to filter out pairs that were linked to non-expressed genes (normalized UMI < 5).
We calculated PCC between gene expression and cCRE accessibility across joint RNA-ATAC clusters to examine the relationship between co-accessibility pairs. To do so, we first aggregated all nuclei or cells from scRNA-seq and snATAC-seq for every joint cluster to calculate accessibility scores (log 2 (CPM)) and relative expression levels (log 2 (normalized UMI)). Then, PCC was calculated for every cCRE-gene pair within a 1-Mb window centred on the TSS for every gene. We also generated a set of background pairs by randomly selecting regions from different chromosomes and shuffling of cluster labels. Finally, we fit a normal distribution model and defined a cut-off at PCC score with empirically defined significance threshold of FDR < 0.01, to select significant positively correlated cCRE-gene pairs.

Identification of candidate driver transcription factors
We used the Taiji pipeline 45 to identify candidate driver transcription factors in cell clusters. In brief, for each cell type cluster, we constructed the transcription factor regulatory network by scanning transcription factor motifs at the accessible chromatin regions and linking them to the nearest genes. The network is directed with edges from transcription factors to target genes. The weights of the genes in the network were determined on the basis of the RNA expression level (gene activity score for SGZ NIPCs only, because there is no corresponding T-type) of corresponding T-types. The weights of the edges were calculated by the relative accessibility of the promoters of the source transcription factors. We then used the personalized PageRank algorithm to rank the transcription factors in the network. The output of Taiji pipeline is transcription-factor-by-cell type matrix with PageRank scores. From the output matrix, we calculated coefficient of variation across cell types. To identify driver transcription factors, we used following criteria: (1) transcription factors have FDR less than 0.001; (2) transcription factors have coefficient of variant larger than the mean of coefficient of variant; (3) PageRank score should be ranked in the top 25% of all transcription factors for each cell type; (4) RNA expression level (CPM) is larger than 20, which we consider as an expressed transcription factor in corresponding cell type.

Motif enrichment
We performed both de novo and known motif enrichment analysis using Homer (v.4.11, RRID:SCR_010881) 61 . For cCREs in the consensus list, we scanned a region of ±250 bp around the centre of the element. And for proximal or promoter regions, we scanned a region of ±1,000 bp around the TSS. Randomly selected background regions are used for motif discovery. To identify motif enriched in different cell types or brain regions, we use variable cCREs as input and invariable cCREs as background.

GREAT analysis
Gene Ontology annotation of cCREs was performed using GREAT (version 4.0.4, RRID:SCR_005807) 82 with default parameters. Gene Ontology biological process was used for annotations.

Gene Ontology enrichment
We perform Gene Ontology enrichment analysis using the R package Enrichr (RRID:SCR_001575) 83 . The gene set library 'GO_Biological_Pro-cess_2018' was used with default parameters. The combined score is defined as the P value computed using the Fisher's exact test multiplied with the z-score of the deviation from the expected rank.

GWAS enrichment
To enable comparison to GWASs of human phenotypes, we used lift-Over with settings '-minMatch=0.5' to convert accessible elements from mm10 to hg19 genomic coordinates 51 . Next, we reciprocal lifted the elements back to mm10 and only kept the regions that mapped to original loci. We further removed converted regions with lengths greater than 1 kb.
We obtained GWAS summary statistics for quantitative traits related to neurological disease and control traits (Supplementary  99 , intelligence 100 , amyotrophic lateral sclerosis 101 , anorexia nervosa 102 and height 103 .
We prepared summary statistics to the standard format for linkage disequilibrium score regression. We used homologous sequences for each major cell types as a binary annotation, and the superset of all candidate regulatory peaks as the background control.
For each trait, we used cell-type-specific linkage disequilibrium score regression (https://github.com/bulik/ldsc) to estimate the enrichment coefficient of each annotation jointly with the background control 52 .

Fine mapping
We obtained 99% credible sets for schizophrenia from the Psychiatric Genomics Consortium website (https://www.med.unc.edu/pgc/). Potential causal variants with a posterior probabilities of association score larger than 1% were used for overlapping with cCREs.

Statistics
No statistical methods were used to predetermine sample sizes. There was no randomization of the samples, and investigators were not blinded to the specimens being investigated. However, the clustering of single nuclei on the basis of chromatin accessibility was performed in an unbiased manner, and cell types were assigned after clustering. Low-quality nuclei and potential barcode collisions were excluded from downstream analysis as outlined above.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability
Demultiplexed data can be accessed via the NEMO archive (NEMO, RRID:SCR_016152) at: https://assets.nemoarchive.org/dat-wywv153. Processed data are available on our web portal and can be explored here: http://catlas.org/mousebrain. Additional data are available in the NCBI Gene Expression Omnibus (GEO) under accession number GEO173650 and upon request.

Code availability
Custom code and scripts used for analysis can be accessed at: https:// github.com/yal054/snATACutils and https://github.com/r3fang/Sna-pATAC. Fig. 10 | The chromatin-accessibility-based cell clustering matches transcriptomics-based cell taxonomy. a, Heat map showing the similarity between A-type (accessibility) and T-type (transcriptomics) based cell cluster annotations. Each row represents an A-type cluster (a total of 160) and each column represents a T-type cluster (a total of 100). Similarity between original clusters and the joint cluster was calculated as the overlap score, which defined as the sum of minimal proportion of cells/nuclei in each cluster that overlapped within each co-embedding cluster. The overlap score varied from 0 to 1 and was plotted on the heatmap. Joint clusters with an overlap score of >0.5 are highlighted using black dashed line and labelled with RNA-ATAC joint cluster ID. For a full list of cell type labels and descriptions see Supplementary  Table 4. b, c, Bar plots indicating the number of clusters that matched (dark grey) or did not match (light grey) clusters from the other modality. b, 155 out of 160 A-types had a matching T-type cluster. c, 84 out of 100 T-types had a matching A-type cluster. Fig. 11 | Cellular composition of brain regions, subregions and dissections. a, Cluster dendrogram based on chromatin accessibility as in EDF8. b-d, Normalized percentages (pct) of each subclass in the four major regions (b), the subregions (c), and the dissected regions (d) are shown as different sized dots. The sizes of dots correspond to the percentage, and the colours of the dots indicate the major brain regions, subregions or dissections. Bar plots to the right show the total number of nuclei sampled for each region, subregion, or dissection. Fig. 13 | The regional specificity of cerebral cell types defined on the basis of snATAC-seq datasets is consistent with ISH patterns of cell-type-specific genes in the mouse brain. a, Bar plots showing regional specificity of each cell type determined from relative contribution from five different brain regions, including isocortex, olfactory areas, hippocampal formation, striatum and pallidum. Dot plots showing percentages of each major brain region in the subclass of cerebral cells. The size of each dot reflects the relative contribution of the brain region as indicated by the legend to the right of the panel, and the colour of the dots indicates the brain region. b, Bar plots showing the coefficients of variation calculated with the average normalized ISH signals of cell-type-specific marker genes for the corresponding cell subclasses across five brain regions. The heat map shows the average normalized ISH signals of cell subclass-specific marker genes. The cell subclass-specific marker genes were identified for joint cell subclasses from RNA-ATAC integrative analysis (Extended Data Fig. 9). For a full list of ISH data of cell subclass-specific genes, see Supplementary Table 6. c, Scatter plot shows the correlation between the coefficients of variation of marker gene ISH signals and the regional specificity score calculated based on snATAC-seq data for 32 joint cell subclasses from RNA-ATAC integration analysis (Pearson correlation coefficients (PCC) = 0.55). d, Box plots show the Pearson correlation coefficients (PCC) calculated between cell composition across brain regions based on snATAC-seq data and the spatial distribution ISH signals of cell subclass-specific genes across the five main brain regions for each major brain cell subclass. Box plots are stylized as in Fig. 2c. Fig. 14 | Brain region specificity of different subclasses and cell-types. a, UMAP 58 embedding of the subclasses of GABAergic neurons. b, Sub-region composition of dopaminergic neurons, D1MSN and D2MSN, and MXD illustrates most MXD neurons were detected in the pallidum. c, Gene activity score for gene Isl1 in GABAergic neurons predicts expression in MXD. d, RNA ISH and bar plot illustration of expression levels show the highest abundance of Isl1 in the predicted region, pallidum (blue). Data and images were downloaded from © 2004 Allen Institute for Brain Science. Allen Mouse Brain Atlas. Available from: atlas.brain-map.org. e, UMAP 58 embedding of PVGA cell types. f, Sub-region composition for each PVGA cell type illustrates that majority of PVGA-7 were detected in nucleus accumbens and caudoputamen. g, Gene activity scores for Kit and Pde3a predict expression in one PVGA cell type (PVGA-7). h, RNA ISH and bar plot illustration of expression levels show expression of Kit and Pde3a in predicted regions, nucleus accumbens and caudoputamen. Predicted region in blue, other sampled regions in grey and non-sampled regions in white. i, UMAP 58 embedding of CA1 glutamatergic neurons (CA1GL) cell types. j, Dissection composition in each CA1GL cell type. k, Gene activity of gene Dcn in CA1GL cell types. l, Density of expression level of Dcn viewed in BrainExplorer (https://mouse.brain-map.org/static/ brainexplorer) shows expression cornu ammonis field 1 (CA1). m, RNA ISH and bar plot show highest expression of Dcn CA1 and hippocampal formation. The predicted region is coloured in blue, other sampled regions in grey and nonsampled regions in white. Fig. 15 | Statistics of peak calling from snATAC-seq data from each cell type. a, Schematic diagram of peak calling and filtering pipeline. b, Number of peaks retained after each peak calling and filtering step. c, Scatter plots showing the relationship between the number of nuclei in each cluster and the number of peaks before score-per-million (SPM) correction (left) and after SPM correction (right). d, Density distribution plot showing the fraction of cells per cell type in which a peak was accessible and a corresponding background for each cell types. For each cell type, the background is defined as same number of non-DHS and non-peak regions randomly picked from genome. e, Fraction of different peak sets that overlap with annotated transcriptional start sites, introns, exons, transcriptional termination sites (TTS), and intergenic regions in the mouse genome. f, Fraction of different peak sets that overlap with DHSs 14 . g, Box plot showing sequence conservation in different peak sets and the control set. h, Enrichment analysis of different peak sets with a 15-state ChromHMM model in the mouse brain chromatin. All box plots are stylized as in Fig. 2c. Fig. 16 | Differentially accessible cCREs across different cell types and brain regions. a, Distribution plot of the Jensen-Shannon specificity scores of the differentially accessible cCREs and invariable cCREs. The vertical line shows the cutoff at FDR of 0.05. b, Stacked bar charts showing the number of differential cCREs between cell types for subclasses. c, A graph shows the numbers of differential cCREs in each major brain region. d, A graph shows the numbers of differential cCREs across subregions. The sizes of dots correspond to the number of differential cCREs (log 10 -transformed) found in each cell type between brain regions, and the colours of the dots indicate the major brain regions in c, subregions in d. Fig. 17 | Variation of chromatin accessibility across different cell types within the subclasses of cerebral neurons. a, UMAP 58 embedding and gene activity scores of Chodl show specificity for one SSTGA cell type. b, Bar chart showing the number of differentially accessible cCREs shared by 1, 2, 3 or 4 cell types of SSTGA. c, Bar chart showing the number of cell-type-specific cCREs, categorized by proximal (black) and distal regions (light grey). d, UMAP 58 embedding of MSGA cell types. e, Bar chart showing the number of differentially accessible cCREs shared by 1, 2, 3, 4 and 5 subtypes of MSGA. f, Number of specific accessible proximal and distal cCREs for each MSGA cell type. g, Heat map showing the normalized accessibility of the celltype-specific cCREs across different MSGA cell types. h, Heat map showing the promoter accessibility at selected marker genes in each MSGA cell type. i, ISH data (https://portal.brain-map.org) 44 of the marker genes of each MSGA cell type. j, Motif enrichment of uniquely accessible cCREs and expression level of transcription factors in cholinergic neurons (MSGA10). The RNA expression of cholinergic neurons was downloaded from mouse brain atlas (http:// mousebrain.org). Fig. 18 | Astrocyte cell types exhibit regional specificity and differential chromatin accessibility. a, UMAP 58 embedding of astrocytes, coloured by cell type (left), fragment depth (middle), and replicate (right). b, Heat map illustrating the overlap between A-type and T-type cell cluster annotations. Each row represents a snATAC-seq subtype and each column represents scRNA-seq cluster 2 (http://mousebrain.org). The overlap between original clusters and the joint cluster was calculated (overlap score) and plotted on the heat map. c, Heat map showing overlap score with cell type defined by integration of multiple transcriptomic and epigenomic modalities 63 . d, Smoothed gene activity scores of marker genes for astrocytes from white and grey matter. e, Smoothed gene activity scores of representative cortical layer-specific marker genes for astrocyte. f, Upset plot showing intersections of differentially accessible cCREs between cell types. cCREs only presented in one cell type are defined as cell-type-specific cCREs. g, Gene activity score for GLI transcription factor family members in astrocyte cell types. h, Smoothed gene activity scores of Oligo2, Itih3, Agt and Slc6a11 in astrocytes show stronger activity in ASCN. i, Genome browser tracks 60 of aggregate chromatin accessibility profiles for each astrocyte cell type at gene Itih3, Slc6a11 and Agt locus. j, Views of ISH experiments from Allen Brain Atlas (atlas.brain-map.org) showing predominant expression of Slc6a11 and Agt in the pallidum. Bar plot showing expression values from ISH experiments in brain structures. Data and images were downloaded from © 2004 Allen Institute for Brain Science 44 . Allen Mouse Brain Atlas. Available from: atlas. brain-map.org. k, UMAP embedding of astrocyte coloured by brain subregions. l, Motif enrichment in cCREs in ASCG with regional specificity.

Corresponding author(s): Bing Ren
Last updated by author(s): Apr 17, 2021 Reporting Summary Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted Data Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability Demultiplexed data can be accessed via the NEMO archive at: https://assets.nemoarchive.org/dat-wywv153 Processed data are available on our web portal and can be explored here: http://catlas.org/mousebrain Additional data are available in the NCBI Gene Expression Omnibus (GEO) under accession number GEO173650 and upon request.