Chromatin accessibility and transcriptome landscapes of Monomorium pharaonis brain

The emergence of social organization (eusociality) is a major event in insect evolution. Although previous studies have investigated the mechanisms underlying caste differentiation and social behavior of eusocial insects including ants and honeybees, the molecular circuits governing sociality in these insects remain obscure. In this study, we profiled the transcriptome and chromatin accessibility of brain tissues in three Monomorium pharaonis ant castes: queens (including mature and un-mated queens), males and workers. We provide a comprehensive dataset including 16 RNA-sequencing and 16 assay for transposase accessible chromatin (ATAC)-sequencing profiles. We also demonstrate strong reproducibility of the datasets and have identified specific genes and open chromatin regions in the genome that may be associated with the social function of these castes. Our data will be a valuable resource for further studies of insect behaviour, particularly the role of brain in the control of eusociality.


Background & Summary
Eusocial insects have their societies based on caste polyphenism, where one or more queens are exclusively responsible for reproduction 1 . In contrast, workers, the largest population in the colony, are almost sterile and responsible for supporting the entire community through their labor, including collecting food, maintaining the nest and feeding/protecting the newly hatched larvae 2 . Eusociality in the hymenopteran insects has evolved 10 times independently 3,4 . Understanding eusociality in insects is important not only from an evolutionary or environmental perspective but also because it may provide clues into the behavior traits of higher species including humans.
Genes differentially expressed across castes in the brains of insects contribute to social behavior development 5,6 . Several studies have focused on the overlapping genes or pathways associated with the division of labor across different eusocial insect lineages and constructed a set of conserved gene regulatory networks 7,8 . In this regard, one of the key hypotheses for evolution of eusociality emphasized the important role of a core toolkit of genes involved in highly conserved pathways, such as metabolism and reproduction 9,10 . In addition, it is also widely accepted that certain single genes can play pivotal roles. For instance, increasing insulin-like peptide 2 (ilp2) levels can break larval suppression and induce a stable division of labor in Ooceraea biroi 11 . Likewise, the neuropeptide corazonin inhibits the transition from worker to gamergate in Harpegnathos saltator 12 . Alternatively, many other studies have recognized the importance of taxonomically restricted genes in the evolution of eusocial www.nature.com/scientificdata www.nature.com/scientificdata/ behavior and performed a systematic comparison of the participation degree of shared genes and taxonomically restricted genes in eusocial division of labor [13][14][15][16] . Despite these relevant studies, the comprehensive lists of genes associated with eusocial behavior and their interrelationship are still unknown.
Besides gene expression, epigenetic regulation is also recognized as an important facet in the regulation of caste-specific behavior in insects. For example, histone modifications are critical regulators of caste determinations in Camponotus floridanus, as it was shown that distinct histone H3K27ac patterns exist between castes of C. floridanus 17 . Likewise, caste-specific behavior in C. floridanus can be reprogrammed by treatment with a small-molecular inhibitor of histone deacetylases, suggesting a regulatory role for histone acetylation in eusocial behavior plasticity 18 . The role of DNA methylation in caste determination has also been investigated in honeybees and, interestingly, some of the differentially methylated CpG sites correspond to regulatory regions of genes involved in metabolic pathways 19 . Additionally, distinct DNA methylation patterns in queen and worker larvae have been reported in another eusocial insect, the termite Zootermopsis nevadensis 19 .
Taken together, these reports suggested crucial roles of transcription and epigenetics in shaping caste differentiation and controlling social behavior in insects. However, a comprehensive dataset of both layers is still lacking, hampering further advances in the field of eusociality.
Here, we constructed the transcriptome and chromatin accessibility landscapes of brain tissue of Monomorium pharaonis (Fig. 1a), which is the most ubiquitous house ant in the world 20,21 . Monomorium pharaonis consists of three adult castes, workers, queens, and males, with the queen caste containing unmated queens (gynes) and mature queens. These four adult groups possess distinct morphologies, lifespans and behaviors, making it an ideal model to explore the molecular and neural regulatory mechanisms of eusociality 22 . We sequenced 32 samples from the four groups of ants (16 RNA-seq and 16 ATAC-seq with four biological replicates per group). After data quality assessment and filtering, we obtained a total of 240 Gb high-quality base pairs for the RNA-sequencing, with more than 95% Q20 bases and approximately 149 million reads per sample. For the ATAC-sequencing, we obtained a total of 170 Gb high-quality base pairs reads, with approximately 106 million reads per sample. Methods experimental design. Four adult groups (workers, gynes, males and queens) of Monomorium pharaonis were used for brain RNA-sequencing and ATAC-sequencing profiling. We collected eight brain samples per caste group to perform these assays. A total of 32 ant brains were used. Each brain was used as a biological sample for either the RNA-sequencing or ATAC-sequencing. The experimental design and analysis workflow are shown in Fig. 1.

Animals.
All procedures related to animals in this study were approved by the Institutional Review Board on Ethics Committee of BGI (Permit No. FT 19046). Two colonies of Monomorium pharaonis were created from a source colony (MP-MQ064) that was collected in June 2016 from Xishuangbanna in the Yunnan province in China. We pre-assigned about 200 workers, 10 queens and about 200 total larvae in each of the colonies. The age of the selected queens was unknown. The two colonies used in the study were created at the State Key Laboratory of Genetic Resource and Evolution, Kunming Institution of Zoology and then sent to the China National Gene Bank, BGI-Shenzhen. Ants were maintained for eight weeks before sampling, at a constant temperature of 25 °C and 50% humidity and fed with mealworm 12 . There were about 200 workers, 10 queens, 5 and 8 males, and 7 and 9 gynes in each colony, respectively, when sampling.  www.nature.com/scientificdata www.nature.com/scientificdata/ Brain collection and RNA extraction. Designated ants were picked out and anaesthetized in a dissection dish on ice. The ants were then washed with ethanol and PBS twice, and dissected in PBS on ice under the light microscope (OLYMPUS, SZX16). To perform the dissection, a pair of forceps held the ant body while another pair of forceps inserted into the mouth gripped the head cuticle of the ant. The head was gently pulled off and the body discarded. Head cuticle was then gently peeled off with the forceps and the brain was removed. After carefully removing the surrounding trachea and ocelli, the ant brain was placed into PBS with 1 U/mL RNase inhibitor. All ants were dissected using the same method except that ocelli removal was not required for the workers. Brain samples were then washed twice with 500 μl PBS. All samples were collected during daytime (9:00 to 16:00). Whole-brain RNA was extracted immediately after dissection using an RNeasy Mini Kit (Qiagen) and eluted with 10 μl of nuclease-free water (NF-water, Ambion). The total amounts of RNA were measured using an RNA HS Qubit (Invitrogen).

RNA-sequencing library construction.
We applied an optimized Smart-seq2 method for RNA-sequencing library construction 23 . For cDNA generation, the following premixed reagent was added to each tube of RNA sample: 5 μl of 10 μM oligo-dT primer (5ʹ-AAGCAGTGGTATCAACGCAGAGTACT-30VN-3ʹ, where "V" is either "A", "C", or "G", and "N" is any base), 4.86 μl of 10 mM dNTP (New England Biolabs), 0.5 μl of 40 U/μl RNase inhibitor (New England Biolabs). Based on the amount of RNA, ERCC Spike-In (Ambion) was added to each tube. Then, the mix was incubated at 72 °C for 3 minutes and quickly placed on ice. Afterward, 20 μl of first-strand synthesis mix containing 8 μl of 5X first-strand buffer (Invitrogen), 2 μl of 100 mM dithiothreitol (DTT, Invitrogen), 2 μl of 200U/μl SuperScript II Reverse Transcriptase (Invitrogen), 8 μl of 5 M Betaine (Sigma), 0.24 μl of 1 M MgCl 2 (Millipore), and 0.4 μl of 100 μM template switch oligo (5ʹ-AAGCAGTGGTATCAACGCAGAGTACATrGrG + G-3ʹ, where "r" indicates a ribonucleic acid base and "+" indicates a locked nucleic acid base, TSO, Exiqon) were added. RNA was reverse transcribed at 42 °C for 90 minutes, and 10 cycles of 50 °C for 2 minutes and 42 °C for 2 minutes and a final 70 °C for 5 minutes to inactivate the reverse transcriptase. cDNA amplification mix containing 50 μl of KAPA HiFi HotStart ReadyMix (KAPA Biosystems), 1 μl of 10 μM IS primer (5ʹ-AAGCAGTGGTATCAACGCAGAGT-3ʹ) and 9 μl of NF-water were then added. The amplification followed the following steps: 98 °C for 3 minutes, followed by 13 cycles of 98 °C for 20 seconds, 67 °C for 20 seconds, 72 °C for 6 minutes and finally 72 °C for 5 minutes. Afterwards, the PCR product was purified using 1X AMPure XP beads (Beckman Coulter). We measured cDNA concentration with the Qubit dsDNA HS Assay Kit 3.0 (Invitrogen) and analyzed size distribution on an HS DNA chip bioanalyzer (Agilent). Libraries were prepared using a fragmentation based method 24 . For each sample, 300 ng of cDNA was sheared with NEBNext dsDNA Fragmentase (New England Biolabs). Fragmented DNA was then purified, end-repaired, adapter-added, amplified and size-selected. Afterwards, the library size distribution was detected using an HS DNA chip bioanalyzer; the fragment length was in the range from 300 to 500 bp.

ATAC-seq library preparation.
We used a whole-brain transposition method for ATAC-sequencing library construction, as previously described 25 , with minor modifications. In brief, brains were dissected and washed twice with 500 μl ice-cold PBS. After centrifugation at 500 x g for 5 minutes, the samples were lysed with 50 μl lysis buffer (10 mM Tris-HCl, 10 mM NaCl, 3 mM MgCl 2 , 0.1% IGEPAL CA-630). We next mixed the samples harshly by pipetting and then centrifuged at 800 x g for 10 minutes. Supernatants were discarded and replaced with a 50 μl transposition reaction mix containing 10 mM TAPS-NaOH (pH 8.5), 5 mM MgCl 2 , 10% DMF, 2.5 μl of in-house Tn5 transposase (0.8 U/μl) and NF-water. This mixture was incubated at 37 °C for 30 minutes. Afterwards, transposed DNA was purified with MinElute Purification Kit (Qiagen) and amplified with primers containing barcodes.
Sequencing. All data were generated with the BGISEQ-500 platform (MGI) 26 . First, the DNA concentration of each library was measured by Qubit dsDNA HS Assay Kit 3.0. A total of 300 ng of library DNA with different sample indexes was pooled for circular single-strand DNA (ssDNA circles). Then, ssDNA circles were used as a template to make DNA nanoballs by rolling circle replication. DNA nanoballs were loaded onto the sequencer flowcells for 100 bp paired-end for RNA-seq and 50 bp paired-end for ATAC-seq.
RNA-sequencing dataset processing. Quality validation of raw reads was performed using FastQC (version 0.11.6) 27 . Reads of low quality were filtered using SOAPnuke (version 1.5.2) 28 . Adapter sequences, primers, poly-A tails were found and removed by cutadapt (version 1.16) 29 . Further quality control was performed by FastQC to ensure the cleaned data were suitable for downstream analyses. Quality control results 30 were visualized using multiQC (version: 1.7) 31 . Statistical results of raw data and clean data are displayed in Table 1. Cleaned reads were mapped to the reference Monomorium pharaonis genome (GCA_003260585.2) 32 using hisat2 (version 2.0.1-beta) 33 . The number of reads aligning to every gene of each sample were calculated with featureCounts (version 1.5.3) 34 to generate a raw count matrix 30 . Aligned BAM reads were inputted into featureCounts (version 1.5.3) with a list of genomic features in Gene Transfer Format (GTF, ref_ASM326058v2_top_level.gff3.gz). To normalize read counts for sequencing depth and RNA composition, we used the median of ratios method in the R (version 3.5) package DESeq2 (version 1.5.3) 35 . The plotPCA function of DESeq2 (version 1.5.3) was used to assess the similarity of genomic specific gene expression patterns among different groups (Fig. 2c). Pearson correlation coefficients between samples (Fig. 2e, f) were calculated based on DEseq2 normalized data matrix.
ATAC-sequencing dataset processing. Raw ATAC-seqquencing data were processed including trimming, aligning, filtering, and quality controlling using an ATAC-sequencing pipeline 36 . MACS2 (version 2.1.2) 37 based on python 2.7 was used to identify the peaks of accessible regions. We applied the IDR algorithm 38 to identify peaks reproducible between replicates of each caste. Overlapping peaks were subsequently merged by bedtools (version: 2.26.0) intersect 39 to produce the final consensus peak set. The full statistical results of data www.nature.com/scientificdata www.nature.com/scientificdata/ processing and the number of consensus peaks for each sample are listed in Table 2. A standard peak list was generated by merging peaks of all samples using bedtools merge 39 . The usable reads of each sample were then mapped to the regions of standard peaks using the intersect function of bedtools and the number of mapped reads was counted and listed in a matrix 30 . We normalized this raw count matrix using the median of ratios method of the R package DESeq2 (version 1.5.3). This normalized matrix was subjected to Pearson correlation coefficients calculation between replicates and principal component analysis (PCA) (Fig. 3c) by DEseq2.
Identification of widely expressed or specific genes across the four groups. The raw gene expression matrix was normalized by Reads Per Million mapped reads (RPM). We calculated the average of RPM value, and the coefficient of variation (CV) between the four groups. We selected genes with mean value of RPM greater than 300 and the CV value less than 10% as co-expressed genes. We used the Shannon entropy 40 to compute the specificity index for genes and we defined its relative gene expression level in a group type i as Ri = Ei/ΣE, where Ei is the RPM value for the gene in the group i, ΣE is the sum of RPM values in all groups and N is the total number of groups. The entropy score for each gene across groups was defined as H = −1 * sum (Ri * log2Ri) (1 < i < N), where the value of H ranges between 0 to log2(N). An entropy score close to zero indicates that the expression of the gene in question is highly specific based on the score distribution, whereas genes with entropy score less than 1.5 were selected as group-specific genes. This result was provided in Figshare 30 .
Comparative analysis across groups. Comparative analysis was performed using DESeq2 R package. The fold change value between groups and the corresponding P value was calculated. We selected the genes or peaks with fold change ≥ 1 and Padj value ≤ 0.05 as differentially expressed genes (DEGs) or differentially accessible regions (DARs).

Data Records
A complete list of the 32 ant brain samples is provided in Tables 1 and 2. All raw data in this study are available in the NCBI Gene Expression Omnibus (GEO) 41

RNA-sequencing metrics and reproducibility.
A total of 16 RNA libraries were prepared and sequenced, with the sequencing depth ranging from 104.63 to 171.60 million reads. Raw reads were filtered, resulting in percentages of clean reads ranging between 75% and 86% ( Table 1). The Q20 values for the clean reads were above 95% ( Table 1). The quality of sequencing was validated by FastQC, then multiple results were compared with MultiQC and a representative result (all gyne samples) of the visualized Phred quality score per base was shown in Fig. 2a. The CG content ranged from 40% to 45%, following a normal distribution (Fig. 2b). Clean reads were then mapped to Monomorium pharaonis genome. A full statistics of quality control for each sample was displayed in Table 1.
The reproducibility of replicates of RNA-sequencing datasets was examined using PCA, in which samples were clearly separated by caste categories, with PC1 and PC2 jointly explaining 76% of the total variance in gene expression (Fig. 2c). Heatmap clustering of Pearson correlation coefficients from the comparison of the 16 datasets revealed a strong correlation between replicates of the same caste ants (Fig. 2d). Interestingly, three female groups (queens, gynes, and workers) had a nearer distance between each other than their distance to the male group. Pearson correlation analysis showed a correlation coefficient above 0.99 between replicates, revealing high www.nature.com/scientificdata www.nature.com/scientificdata/ www.nature.com/scientificdata www.nature.com/scientificdata/ reliability of the RNA-sequencing data (Fig. 2e). The RNA-sequencing data in our study were comparable with previously published RNA-sequencing data of gynes and workers 7 (Fig. 2f). Taken together, these results suggest that our datasets are a reliable data resource for future studies.  Table 2. ATAC-seq metadata and mapping statistics. www.nature.com/scientificdata www.nature.com/scientificdata/ ATAC-sequencing quality control. We performed the quality assessment of ATAC-sequencing datasets by a variety of quality metrics (Table 2), including number of reads, mapping rate, and usable reads. Each sample obtained an average of 49 million usable reads after filtration, resulting in about 20, 000 reproducible peaks after IDR analysis (Table 2). We calculated the reads enrichment around transcription start sites (TSS) and observed a strong enrichment (Table 2 and Fig. 3a), suggesting the high quality of the datasets. This was also supported by the periodic pattern of fragment size, consistent with previous ATAC-sequencing profiles 43,44 (Fig. 3b). Reproducibility between replicates was measured by Pearson correlation coefficients and all the replicates from each caste own the correlation coefficient more than 0.95 (Fig. 3c). The reproducibility of ATAC-sequencing datasets was further studied using PCA, where samples from the same caste tended to cluster together (Fig. 3d). As expected, we noted that the ATAC-sequencing samples presented a similar clustering result as RNA-sequencing, with the three female groups being closer to each other. Overall, these analyses demonstrated that our ATAC-sequencing datasets can reliably detect accessible regions in the genome and can be used to further explore the molecular foundation between epigenomic regulation and social behavior.
Comparative analysis between castes. We identified a set of genes widely expressed in the brain of all castes and also caste brain-specific genes as well 30 . We found that genes co-expressed in the brains of four groups (600 genes) have a larger number than caste-specific genes (144 genes). These two sets of genes are provided in Figshare 30 , which can be used for further analysis and exploration. We counted the number of DEGs (Fig. 4a) or DARs (Fig. 4b) in gynes, workers and males compared with queens. We found that males show the biggest difference with queens in both gene expression and chromatin accessibility, suggesting that sex may be the most significant factor resulting in differential regulation of gene expression within the ant colony. On the contrary, gynes and queens presented the smallest difference, with only 229 DEGs and 1,350 DARs. The number of DEGs (583) and DARs (2,171) between queens and workers was almost twice as those between queens and gynes, suggesting higher similarity of the latter two.
We next investigated the relationship between expression and chromatin accessibility for DEGs across the four castes (Fig. 4c). Interestingly, we found that locusta insulin-related peptide (LIRP) and vitellogenin-2 (vg-2) show   www.nature.com/scientificdata www.nature.com/scientificdata/ high expression level in queens. LIRP is a type of 5 kDa peptide and first discovered from locust corpora cardiaca (CCs)-extracts 45,46 . LIRP contains 3 exons separated by 2 introns, resembling the vertebrate insulin genes 47,48 , whose function is to regulate eusocial division of labor and caste determination and was reported to show consistently higher expression in queens 9,11 . Vitellogenin (Vg) encodes for the major egg yolk protein precursor in insects and many other oviparous species 49 . Our finding is supported by a previous study demonstrating that Vg showed higher expression in reproductive groups of eusocial insects, as it functions as a lipid carrier that provisions developing oocytes with yolk and constitutes a reliable indicator of female reproductive activity 50 . Small lysine-rich protein 1 (SMKR1), ras-related and estrogen-regulated growth inhibitor (RERG), and pro-sesilin were also identified as caste-specific genes expressed in gyne, worker and male brains, respectively. SMKR1 is a lysine-rich protein and may play an important role in brain development in unmated female ants 51 . RERG is a member of the RAS superfamily of GTPases and a estrogen-regulated growth inhibitor. The higher expression of RERG in worker is consistent with previous study of worker-biased genes in eusocial insects 52 . Resilin is an elastomeric protein found in many insects 53 . The high expression of pro-resilin may enable males to jump or pivot wings efficiently.
Interestingly, the open regions near these genes showed similar patterns as gene expression across castes (Fig. 4c), suggesting that their transcriptional regulatory elements are crucial for the differential gene expression. Moreover, we found two genes involved in vision, retinal homeobox protein Rx1 (Rx1) and glycine receptor subunit alpha-3 (Glra3), showing lower levels of both expression and chromatin accessibility in workers, which suggests distinct visual systems across workers and the three other groups. Supporting this, it has been previously reported that ocelli is absent in workers of Monomorium pharaonis 54 . In summary, our study provides an important resource of the epigenome and transcriptome of ant brain, which will be of great importance to study the regulatory mechanisms behind caste differentiation in eusocial insects.

Usage Notes
The RNA-seq data processing pipeline, including data filtering, read mapping and gene expression quantification was run on the Linux operating system (centOS). The optimized parameters are provided in the main text. Differential gene expression (DGE) analysis R source codes used for the downstream data analysis and visualization are provided in Supplementary File 1.

Code availability
Data processing was performed using open source software. The approach of tools and parameters used were as below.
SOAPnuke The R code used for calculating the correlation and comparative analysis are available in the supplementary materials.