Extended intergenic DNA contributes to neuron-specific expression of neighboring genes in the mammalian nervous system

Mammalian genomes comprise largely intergenic noncoding DNA with numerous cis-regulatory elements. Whether and how the size of intergenic DNA affects gene expression in a tissue-specific manner remain unknown. Here we show that genes with extended intergenic regions are preferentially expressed in neural tissues but repressed in other tissues in mice and humans. Extended intergenic regions contain twice as many active enhancers in neural tissues compared to other tissues. Neural genes with extended intergenic regions are globally co-expressed with neighboring neural genes controlled by distinct enhancers in the shared intergenic regions. Moreover, generic neural genes expressed in multiple tissues have significantly longer intergenic regions than neural genes expressed in fewer tissues. The intergenic regions of the generic neural genes have many tissue-specific active enhancers containing distinct transcription factor binding sites specific to each neural tissue. We also show that genes with extended intergenic regions are enriched for neural genes only in vertebrates. The expansion of intergenic regions may reflect the regulatory complexity of tissue-type-specific gene expression in the nervous system.

Hi-C datasets of mouse cortical neurons, the human cortex, and the human hippocampus were used to examine long-range chromatin interactions (Supplementary Data 1). Only interactions between intergenic enhancers and transcription start sites (TSSs) of genes with extremely long intergenic DNA (>500 kb, EID genes) were considered to calculate the number (#) of intergenic enhancers in EID genes (Supplementary Data 15). a, b Gene ontology analysis of the genes (508) commonly induced in 3 CNS cell types (glutamatergic neurons, astrocytes, and oligodendrocytes), astrocyte-specific genes (575), and oligodendrocyte-specific genes (335) defined in Fig. 3c Box plots of the distance of the closest enhancers within an intergenic DNA region (kb) of the top 5% of all the mRNA genes (919 out of 18,383), highly induced in each tissue in the postnatal P0 mice. Active enhancers were defined as ChIP-seq H3K27ac-enriched regions described in Fig. 4. The box plots show the median (line in a box), first-to-third quartiles (boxes), and 1.5 ´ the interquartile range (whiskers). ***P = 3.16 ´ 10 -12 ; **P = 1.36 ´ 10 -2 ; *P = 7.24 ´ 10 -2 ; n.s., non-significant P > 0.15; two-sided, two-sample t-test (Supplementary Data 8).

Supplementary Fig 6
The majority of long intergenic regions of generic neural genes have distinct tissuespecific ATAC-seq peaks in multiple neural tissues.
a Pie chart of the number of intergenic ATAC-seq peaks of 625 generic neural genes with >100 kb intergenic regions, grouped by commonly accessible DNA sites in 2-4 neural tissues and tissue-specific accessible DNA sites. Fig. 5b shows 521 commonly accessible DNA sites and 11,004 tissue-specific accessible DNA sites.
The majority of the intergenic ATAC-seq peaks (11,  Gene ontology analysis for the neighboring genes in the 4 groups of gene pairs shown in Fig. 6. The neighboring genes that share extended intergenic regions (>500 kb) were used in each group. The genes having the shared intergenic regions at the 5' end of genes (H genes) and those having the shared intergenic regions at the 3' end of genes (T genes) were illustrated as blue and pink arrows, respectively. The most enriched seven non-redundant GO terms are shown. A one-sided Fisher's exact test was used to calculate P-values in gene set enrichment analysis. P-values were not adjusted for multiple comparisons. All GO target genes and background genes used for this GO analysis are listed in Supplementary Data 6.
Supplementary Fig. 11 Intergenic enhancers contribute expression of neighboring genes in motor neurons.
a The location of Isl1-bound intergenic enhancers, detected by ChIP-seq Isl1, ChIP-seq H3K27ac, and ATACseq, located between Mnx1 and Ube3c genes in motor neurons differentiated from mouse ES cells. 874 bp genomic DNA of motor neuron-specific enhancer (arrow, MU-E5) was deleted in an ES cell using CRISPR genome editing.
b Same as (a) except enhancers located in the intergenic region between Ntf3 and Kcna5 genes. 1,430 bp of the NK-E2 enhancer was deleted. b Same as (a) except showing genes highly induced by KCl, BDNF, and forskolin stimulation. 202 genes were highly induced by KCl stimulation (>2-fold compared to TTX treated neurons). Treated and untreated neurons were not incubated with TTX before stimulation by BDNF and forskolin for 1 hour. 118 genes were highly induced by BDNF stimulation (>1.5-fold compared to non-treated). 170 genes were highly induced by forskolin stimulation were shown (>1.5-fold compared to non-treated. Genes are sorted by a gene name (smallest to largest, Supplementary Data 16, 17).

Data Analyses
Datasets. Published mouse RNA-seq datasets were downloaded from the ENCODE project website (www.encodeproject.org), the Allen Brain Atlas data portal website (www.brain-map.org), and the previous studies [5][6][7] . Human RNA-seq datasets were obtained from the ENCODE project website and the previous studies 8, 9 . Relevant information about RNA-seq datasets including the Gene Expression Omnibus (GEO) accession numbers from the National Center for Biotechnology Information, used in each figure, are listed in Supplementary Data 1. Briefly, the spinal cord RNA-seq dataset was downloaded from the previous study 10 .
Glutamatergic, GABAergic, and cholinergic neuron RNA-seq datasets were downloaded from the previous study 11 . Dorsal root ganglion and trigeminal neuron RNA-seq datasets were downloaded from the previous study 12 . Colonic sensory neuron RNA-seq dataset was downloaded from the previous study 13 . Single-cell (sc) RNA-seq data were treated as bulk RNA-seq data. In scRNA-seq data, the fragments per kilobase of transcript per million reads (FPKM) per gene were added for all the cells, then the sum of FPKM was divided by the number of cells.
Published ATAC-seq and ChIP-seq datasets were downloaded from the ENCODE project website and previous study 1 (Figs. 1, 2) were calculated by log2 (tissue FPKM+0.1)/(ES cell FPKM+0.1). Absolute gene expression levels (Fig. 3) were calculated by log10 (FPKM+0.1) followed by median normalization. The heat map plots of mRNA expression levels were generated using Java Treeview 3.0 16 . The FPKM values in heat map plots were median-normalized (median FPKM: 2) and log2 transformed.
To plot intergenic length against gene induction level (Figs. 1, 2), we use mean intergenic length followed by sorting genes from the highest to the lowest log2 fold-change in expression (bin size: 200 genes; 91 bins; 18,383 genes). Prior to sorting genes according to gene induction level, the order of all genes should be randomized using a "rand" function in Excel to avoid a spike around 0 value of log2 fold-change in expression.
Otherwise, we observed a spike in mean gene induction plots around 0 value of log2 fold-change expression.
We found that these spikes correspond to the olfactory or pheromone receptor genes (1,144 19 . This study reported a total of 23,918 SNPs associated with PD genetic risk (PD SNPs). 14,854 out of these 23,918 SNP nucleotides were located within 1 kb to each other. After excluding the SNPs within 1 kb, we identified 9,064 SNP locations (1 kb width). We further filtered out 4,604 SNPs that were located within genic regions of 19,268 mRNA genes, having non-overlapped intergenic regions in humans (hg19). After removing these genic SNPs, we identified 4,460 SNP locations to calculate their intergenic lengths and performed gene ontology (GO) analysis of these SNP-associated genes. The GO analysis was performed as described in the below method section (GO analysis). The intergenic lengths containing these PD SNPs were determined as described above (Defining intergenic length). Fig. 2, we used all mRNA genes expressed (16, For Fig. 8 and Supplementary Fig 14, GO term enrichment analyses were performed using the PANTHER classification system v.14.0 (geneontology.org) 22 because the GOrilla does not support the GO analyses for some organisms examined in these figures, such as Gallus gallus. Default parameters were used for the GO overrepresentation test: GO terms that were enriched above the background with Binomial test. In Fig. 8 and Supplementary Fig. 14 Identifying ChIP-seq and ATAC-seq peaks. For the ChIP-seq and ATAC-seq datasets in motor neurons, peak calling was conducted as described previously 1 . Briefly, ChIP-seq peaks (binding events) enriched by H3K27ac in motor neurons over background were identified using the strand separate peak calling algorithms using the GeneTrack software ("-s 50 -e 1000") 23,24 . The midpoint of the start and end position of the H3K27ac-enriched peaks was used as a peak coordinate. Genomic regions enriched by accessible DNA (ATAC-seq peaks) were identified using the MACS2 software 25 . For the published ChIP-seq and ATAC-seq datasets from the ENCODE datasets, we used the processed peak calls as H3K27ac-enriched or DNAaccessible regions. Active enhancers were defined as H3K27ac-enriched peaks whose genomic regions are located more than 3 kb from the nearest TSS of known mRNA genes (19,144 of the annotated RefSeq mRNA genes). The number of peak calls from ATAC-seq and ChIP-seq datasets was listed in Supplementary Data 8.

Distance of closest intergenic enhancers.
To calculate the distance between the closest active enhancers per intergenic DNA in Supplementary Fig. 5, the top 5% of all mRNA genes, highly induced in each tissue and cell type (919 out of 18,383 genes), were used. Although we defined the intergenic length of a gene as the sum of the upstream and downstream DNA distance from the end of a gene to the nearest genes, here we considered intergenic DNA as each upstream and downstream DNA, separately, from the end of a gene to the nearest gene. Thus, we exclude extremely long distances between closest active enhancers due to a long genic DNA between two enhancers. Active enhancers were defined as H3K27ac peaks determined by ChIPseq. The distance between enhancers was considered when two or more enhancers exist in an intergenic DNA. Less than 1 kb of the distances between enhancers were filtered out because two enhancers within 1 kb were considered as a single enhancer.  29 . For the FIMO analysis, we used Minimal MEME Motif Format, which is a plain format containing the DNA motif letters, background letter frequencies, and letter-probability matrix. 100 bp from the midpoints of the identified ATAC-seq peaks in Fig. 5 were searched for the occurrence of the target motifs on both DNA strands with the matched P-value < 1.0e-05. To get background genomic regions, we randomly selected 1,436 genomic coordinates throughout the mouse genome (mm10), then extracted genomic DNA sequences within 100 bp from the random genomic coordinates. All Minimal MEME Motif Formats (zip file) for the target DNA motifs identified in Fig. 5 and Supplementary Fig. 7 are available in Supplementary Data 11.
Hi-C analysis. To examine long-range chromatin interactions between intergenic enhancers and TSS of the gene in neural cells and tissues, we used the published Hi-C dataset for mouse cortical neurons (Bonev et al., 2017) 30 . We used mm10 (GRCm38) mouse assembly and gene annotation for the Hi-C analysis. This study (Bonev et al., 2017) reported a total of 18,094 interactions between enhancers and TSSs in cortical neurons.
Among them, we selected 16,429 interactions involved with protein-coding genes, then identified 8,238 interactions that are associated with intergenic enhancers. 2,214 out of 8,238 interactions were involved with genes with long intergenic DNA regions (>500 kb, LID genes). We then examined interactions between intergenic enhancers and LID genes in cortical neurons. Most enhancers interact with multiple TSSs, and we found that 1,254 enhancers were involved in 2,214 interactions with TSSs of 818 genes. Neighboring genes of these 818 genes were identified as described above (Neighboring gene analysis). All Hi-C interactions, enhancers, and genes involved in this Hi-C analysis were reported in Supplementary Data 15.
Neighboring gene analysis. To calculate the shared intergenic DNA length of neighboring genes, we considered gene orientations and transcription directions of two neighboring genes. We classified all the mouse mRNA genes (mm10; 18,040 genes) into 4 groups based on the orientation of neighboring gene pairs: head-to-head (H-H) genes where the two genes were located on the opposite DNA strands and transcribed divergently; tail-to-head (T-H) genes where the two genes were located on the upper DNA strand and transcribed in the same direction; head-to-tail (H-T) genes where the two genes were located on the lower DNA strand and transcribed in the same direction; and tail-to-tail (T-T) genes where the two genes are located on the opposite DNA strands and transcribed convergently (Fig. 6). Then, we define a left gene and a right gene, which are located at lower and higher genomic coordinates of the shared intergenic DNA region, respectively, in each group of gene pairs. A left gene shares an intergenic region with a right gene on the right side of the left gene. All left and right neighboring genes and their shared intergenic lengths are listed in Supplementary Data 13. The heatmap plots for induction levels of neighboring genes (log2 FPKM expression in tissue/ES cells) were generated using Java TreeView 16 .
Significance test. The comparison of intergenic DNA lengths or the number of peaks of highly induced genes between tissues or cell types is reported as P-value heatmap plots. The P-values for the difference in intergenic DNA length or the number of peaks between tissues were calculated based on a Student's tdistribution under the null hypothesis. We used the null hypothesis because we expect no significant difference between two populations or between certain characteristics of a population. We used a two-sample t-test with the Excel function "t-test" with tails=2 (two-tailed distribution) and type=2 (homoscedastic equal