An ATAC-seq atlas of chromatin accessibility in mouse tissues

The Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) is a fundamental epigenomics approach and has been widely used in profiling the chromatin accessibility dynamics in multiple species. A comprehensive reference of ATAC-seq datasets for mammalian tissues is important for the understanding of regulatory specificity and developmental abnormality caused by genetic or environmental alterations. Here, we report an adult mouse ATAC-seq atlas by producing a total of 66 ATAC-seq profiles from 20 primary tissues of both male and female mice. The ATAC-seq read enrichment, fragment size distribution, and reproducibility between replicates demonstrated the high quality of the full dataset. We identified a total of 296,574 accessible elements, of which 26,916 showed tissue-specific accessibility. Further, we identified key transcription factors specific to distinct tissues and found that the enrichment of each motif reflects the developmental similarities across tissues. In summary, our study provides an important resource on the mouse epigenome and will be of great importance to various scientific disciplines such as development, cell reprogramming, and genetic disease.


Background & Summary
Although most of the protein-coding genes in human and model animals such as mouse have been extensively annotated, vast regions of the genome are noncoding sequences (e.g., roughly 98% of the human genome) and still poorly understood 1,2 . During the last decade, the development of next-generation sequencing (NGS) based epigenomics techniques (e.g., ChIP-seq and DNase-seq) have significantly facilitated the identification of functional genomic regions 3 . For example, by comparing the histone modifications and transcription factor (TF) binding patterns throughout the mouse genome in a wide spectrum of tissues and cell types, Yue et al. 4,5 have made significant progress towards a comprehensive catalog of potential functional elements in the mouse genome. So far, the international human epigenome consortium (IHEC), including ENCODE and the NIH Roadmap epigenomics projects, have profiled thousands of epigenomes including DNA methylation, genome-wide binding of TFs, histone modifications, and chromatin accessibility. This has resulted in the discovery of over 5 million cis-regulatory elements (CREs) in the human genome [6][7][8] . These data resources have created an important baseline for further study of diverse biological processes, such as development, cell reprogramming, and human disease [9][10][11][12][13] .
The accessibility of CREs, which is important for switching on and off gene expression 14 , is strongly associated with transcriptional activity. To date, detection of DNase I hypersensitive sites (DHSs) within chromatin by DNase-seq has been extensively used to map accessible genomic regions in diverse organisms including the laboratory mouse 5 . In 2013, Buenrostro et al. 15 reported an alternative approach, termed ATAC-seq, for fast and sensitive profiling of chromatin accessibility by direct transposition of native chromatin within the nucleus. This method, in comparison to DNase-seq, requires a significantly lower input of cells (only 500-50,000) and a shorter period to process samples 16 . Moreover, ATAC-seq has been applied to single cells through various methods 17-20 , enabling the investigation of regulatory heterogeneity within complex tissues. As such, ATAC-seq has demonstrated great potential to be a leading method in assaying accessible chromatin genome-wide.
The sequence preference of both DNase I and Tn5 enzymes produced distinct but inevitable biases in DNase-seq and ATAC-seq 21 , making it impractical to directly compare datasets generated from the two methods. Therefore, although the DNase-seq atlas of adult mouse tissues has been published 5 , a baseline of chromatin accessible regions generated from ATAC-seq is still important for ATAC-seq based studies. Here, we applied Omni-ATAC-seq 22 , an approach that enables profiling of accessible chromatin from frozen tissues, to the generation of 66 chromatin accessibility datasets from 20 different tissues derived from both adult male and female C57BL/6J mice (Fig. 1a). Systematic analysis of the dataset identified a total of 296,574 accessible elements, of which 26,916 showed highly tissue-specific patterns. We further predicted TFs specific to distinct tissues and importantly, many of these have been validated by previous studies [23][24][25][26][27] . In this study, we provide a valuable resource, which can be used to elucidate transcriptional regulation and may further help understand diseases caused by regulatory dysfunction.

Methods
Sample collection. All relevant procedures involving animals were approved by the Institutional Review Board on Ethics Committee of BGI (Permit No. BGI-R085-1). C57BL/6J male and female mice were purchased from Beijing Vital River Laboratory Animal Technology Co., Ltd (Beijing, China). 8-week old mice were used for this study. Mice were housed under standard conditions of a specific pathogen-free, temperature-controlled environment with a 12-h day/night cycle 28 . The mice were sacrificed by cervical dislocation. Whole organs were extracted and cut into 2-3 pieces, respectively (50-200 mg/piece). Each sample was then quickly frozen in liquid nitrogen and stored at −80 °C until nuclei extraction was performed. In this study, we used 20 different organs or tissues, including adrenal gland, bladder, brain (cerebrum and cerebellum), fat (abdominal, brown and mesenteric), heart, intestine (large and small), kidney, liver, lung, ovary, pancreas, skeletal muscle, spleen, stomach, thymus, and uterus (as listed in Table 1).
All libraries were adapted for sequencing on the BGISEQ-500 platform 30 . In brief, the DNA concentration was determined by Qubit 3.0 (ThermoFisher, Waltham, MA). Pooled samples were used to make single-strand DNA circles (ssDNA circles). DNA nanoballs (DNBs) were generated from the ssDNA circles by rolling circle replication as previously described 30 . The DNBs were loaded onto patterned nano-arrays and sequenced on the BGISEQ-500 sequencing platform with paired end 50 base reads.    www.nature.com/scientificdata www.nature.com/scientificdata/ Preprocessing of the ATAC-seq datasets. The ATAC-seq data were processed (trimmed, aligned, filtered, and quality controlled) using the ATAC-seq pipeline from the Kundaje lab 31,32 ( Table 2). The model-based analysis of ChIP-seq (MACS2) 33 version 2.1.2 was used to identify the peak regions with options -B, -q 0.01nomodel, -f BAM, -g mm. The Irreproducible Discovery Rate (IDR) method 34 was used to identify reproducible peaks between two technical replicates (Fig. 1b). Only peaks reproducible between the two technical replicates were retained for downstream analyses. Peaks for all tissues were then merged together into a standard peak list. The number of raw reads mapped to each standard peak were counted using the intersect function of BedTools 35    www.nature.com/scientificdata www.nature.com/scientificdata/ Identification of tissue-specific chromatin accessible regions. We used a strategy described previously based on the Shannon entropy to compute a tissue specificity index for each peak 4,36,37 . Specifically, for each peak, we defined its relative accessibility in a tissue type i as Ri = Ei/ΣE, where Ei is the RPM value for the peak in the tissue i, ΣE is the sum of RPM values in all tissues, and N is the total number of tissues.  www.nature.com/scientificdata www.nature.com/scientificdata/ score for each peak across tissues can be defined as H = −1 * sum(Ri * log2Ri) (1 < i < N), where the value of H ranges between 0 to log2(N). An entropy score close to zero indicates the accessibility of this peak is highly tissue-specific, while an entropy score close to log2(N) indicates that this peak is ubiquitously accessible 38 . Based on the distribution of entropy scores, peaks with score less than 3.5 were selected as tissue-restricted peaks.
We searched TF motifs in tissue-specific peaks using the findMotifsGenome.pl script of the HOMER 39 version 4.9.1 software with default settings. We then generated a motif enrichment matrix 32 , where each row represents the P-value of a motif and each column represents a tissue. The 50 motifs with the top CV values and mean values greater than 20 were displayed.

Data records
A complete list of the 66 tissue samples is given in Tables 1 and 2. All raw data have been submitted to the CNGB Nucleotide Sequence Archive 40 . The raw data have also been submitted to the NCBI Sequence Read Archive 41   www.nature.com/scientificdata www.nature.com/scientificdata/ Technical Validation Data QC from the pipeline with IDr quality control. We evaluated our ATAC-seq dataset by a series of commonly used statistics, including the number of total read, mapping rate, the proportion of duplicate read, the number of mitochondrial read, and the number of final usable read (Table 2). For each replicate, we obtained an average of 78 million reads, which was previously shown to be enough for the detection of accessible regions 15,31 . In agreement with published ATAC-seq profiles 15 , the chromatin accessibility fragments show size periodicity corresponding to integer multiples of nucleosomes 32 (Fig. 2b). The successful detection of accessible regions is also supported by the observation of strong enrichment of ATAC-seq reads around transcription start sites (TSSs) 32 (Fig. 2a,d).
To evaluate the reproducibility of accessible element discovery between replicates, we identified accessible regions in both replicates by using the MACS2 33 algorithm. We then applied the IDR method 34 to find peaks that were reproducible between replicates in each tissue type (Fig. 2c). We identified an average of 43,421 reproducible peaks (Table 2). For downstream analyses, we filtered out low-quality data where the TSS enrichment scores are less than 10.0 and the number of reproducible peaks are less than 10,000.
reproducibility of biological samples and comparison with published studies. The Pearson correlation analysis was used to further examine the reproducibility of biological and technical replicates. Heatmap clustering of Pearson correlation coefficients from the comparison of 66 datasets revealed a strong correlation between replicates of the same tissue (Fig. 3b), but a lower correlation between profiles from distinct tissues. This result is also supported by t-distributed stochastic neighbor embedding (t-SNE) analysis with tissue-restricted peaks of all profiles (Fig. 3a,c). Interestingly, correlations between replicates from mice of the same gender are generally higher than those from the opposite gender. This can be seen in the cerebrum where the correlation coefficient between replicates of female mice is 0.99 (Fig. 3d), while the coefficient between male and female is slightly reduced (0.96). We also compared our data to ATAC-seq profiles of postnatal mouse (day 0) tissues downloaded from ENCODE project 42,43 . Importantly, we found that both heart and lung were comparable with each other (Fig. 3e). Taken together, these analyses strongly suggest that our ATAC-seq profiles can reliably detect accessible chromatin regions in the mouse genome and can be used as a basic reference ATAC-seq dataset for future studies.
Inferring tissue-specific transcription factors. In an effort to validate the tissue-specific TF motifs identified in our dataset, we compared them to results from previous studies. Log2 RPM of the tissue-restricted peaks was shown in the heatmap (Fig. 4a). For example, we observed high enrichment of the NeuroG2 motif in cerebellum and cerebrum (Fig. 4b), in agreement with the role of NeuroG2 in controlling the temporal switch from neurogenesis to gliogenesis and regulating laminar fate transitions 23 . In brown fat, we found the CCAAT-enhancer-binding proteins (CEBP) motif to be highly specific (Fig. 4b). This is supported by a previous study demonstrating that CEBP can cooperate with PRDM16 to induce brown fat determination and differentiation 24 . In addition, other well-known tissues-specific motifs such as the liver-specific HNF family TF motifs (Hnf1, HNF1b, and HNF4a) 25 and heart or skeletal muscle specific Mef2 family motifs (Mef2a, Mef2b, Mef2c, Mef2d) 26,27 were validated in our study. To further validate whether the overall motif enrichment in each tissue can reflect tissue specificity we performed hierarchical clustering of tissues using Euclidean distance (Fig. 4c). This provided a result similar to hierarchical clustering of various mouse tissues based on RNA-seq data 44 . In addition, examination of tissues from the gastrointestinal (GI) tract (i.e., large intestine, small intestine, and stomach) showed tight clustering (Fig. 4c), which is likely due to their common functions such as lipid metabolism and energy hemostasis 45,46 . Skeletal muscle and heart tissue are found in the same branch, suggesting that patterns of chromatin accessibility in the two tissues are highly influenced by shared TF motifs such as those from the Mef2 family 45 .

Usage Notes
The ATAC-seq data processing pipeline, including read mapping, peak calling, IDR analysis, and read counting were run on the Linux operating system. The optimized parameters are provided in the main text. All R source codes used for the downstream data analyses and visualization are provided in Supplementary File 1.

Code Availability
The R codes used for correlation analysis, identification of tissue-specific chromatin accessible regions, and tissuespecific TFs are available in the supplementary materials (Supplementary File 1). A repository list containing the chromatin accessibility raw count matrix and the motif enrichment matrix is available online 32 .