Over-representation of potential SP4 target genes within schizophrenia-risk genes

Reduction of Sp4 expression causes age-dependent hippocampal vacuolization and many other intermediate phenotypes of schizophrenia in Sp4 hypomorphic mice. Recent human genetic studies from both the Schizophrenia Exome Sequencing Meta-Analysis (SCHEMA) and the Genome-Wide Association Study (GWAS) validated SP4 as a schizophrenia-risk gene over the exome-wide or the genome-wide significance. Truncation of the human SP4 gene has an odds ratio of 9.37 (3.38–29.7) for schizophrenia. Despite successful identification of many schizophrenia-risk genes, it is unknown whether and how these risk genes may interact with each other in the development of schizophrenia. By taking advantage of the specific localization of the GC-boxes bound by SP4 transcription factors, I analyzed the relative abundance of these GC-boxes in the proximal promoter regions of schizophrenia-risk genes. I found that the GC-box containing genes are significantly over-represented within schizophrenia-risk genes, suggesting that SP4 is not only a high-risk gene for schizophrenia, but may also act as a hub of network in the regulation of many other schizophrenia-risk genes via these GC-boxes in the pathogenesis of schizophrenia.


INTRODUCTION
Sp4, a homolog of the Drosophila buttonhead (btd) gene, belongs to the Sp1 family of transcription factors [1]. Sp1, a ubiquitous transcription factor, recognizes the DNA sequence 5ʹ-GGGCGG-3ʹ termed GC-box that is often found~40-100 nucleotides upstream of the transcription start site (TSS) of a variety of genes and is critical for activation of their expression [2]. Both Sp1 and Sp4 bind the same GCbox with the same affinity [3]. In the brain, however, Sp4 and Sp1 are specifically expressed in neuronal and glial cells, respectively [4,5]. We found that reduction of Sp4 expression in Sp4 hypomorphic mice causes age-dependent hippocampal vacuolization, reduced hippocampal LTP, and many behavioral abnormalities related to schizophrenia such as deficits in prepulse inhibition, deficient learning and memory, hypersensitivity to NMDAR antagonists, etc. [4,[6][7][8]. A complete absence of Sp4 in Sp4 knockout mice impairs postnatal development of hippocampal dentate gyrus with a reduced number of dentate granule cells and a smaller dentate size [9]. We and Tam et al. first reported that the human SP4 gene is sporadically deleted in schizophrenia patients [6,10].
In 2020, the largest GWAS study of 69,369 individuals with schizophrenia and 236,642 controls identified common variant associations at 270 distinct loci [11]. Many genes were prioritized after fine-mapping of the loci. These genes are predominantly expressed in neurons, but not in glia, and regulate synaptic functions. On the other hand, The Schizophrenia Exome Sequencing Meta-Analysis (SCHEMA) Consortium, the largest exome sequencing project with 24,248 cases and 97,322 controls, identified ten truncated genes above the exome-wide significance that confer high risks for schizophrenia (odds ratios 3-50) [12]. Among these top ten risk genes, truncation of the SP4 gene has an odds ratio of 9.37 (3.38-29.7) for schizophrenia. Importantly, SP4 and GRIN2A (encoding NMDAR2A subunit) are the only two genes shared between the top ten risk genes from the SCHEMA and the GWAS prioritized genes. This not only enhances the confidence of the two risk genes but also implicates that more subtle common variants of these two genes with a partial loss-of-function are likely involved in the pathogenesis of schizophrenia in the general population of patients. A key question however that remains to be investigated is whether and how these risk genes may interact with each other in the development of schizophrenia.
Out of the top ten high-risk genes from the SCHEMA, Sp4 and RB1CC1 are the only two DNA-binding transcription factors. Interestingly, RB1CC1 binds the same GC-box sequence GGGCGG but more upstream of TSS than Sp4 [13], highlighting the importance of the GC-box in the regulation of genes involved in the pathogenesis of schizophrenia. In the brain, Sp4 and a majority of other schizophrenia-risk genes are specifically expressed in neurons, but not glia. Does Sp4 control the expression of many of these schizophrenia-risk genes via the GC-boxes in their proximal promoter regions in neuronal cells, given that the GC-box can be readily found in the 5ʹ-regulatory regions of a variety of genes? Here, I examined the relative abundance of the GC-boxes in the promoters of susceptibility genes for schizophrenia, autism spectrum disorders (ASD), and developmental disorder/intellectual disability (DD/ID).

MATERIALS AND METHODS Eukaryotic promoter database
Eukaryotic promoter database (EPD) (https://epd.epfl.ch//index.php) has a collection of 16,455 genes across the human genome (hg38). The GC-box GGGCGG, the GA-box GGGAGG, the MAZ-binding motif CCCCTCC, and the KLF13-binding motif ACGCCC were searched on both DNA strands in gene promoters using the function of either Motif Enrichment or Motif Discovery in EPD. The numbers of the GC-boxes were counted for each gene. To calculate the GC content of the promoter region, either G or C was counted on both DNA strands using Motif Discovery. The numbers of the GC-boxes and the GC content were analyzed using R programming.
Susceptibility genes for schizophrenia, ASD, and DD/ID High-risk genes for schizophrenia were obtained from the SCHEMA [12]. Common risk genes for schizophrenia came from the GWAS [11]. High-risk genes for ASD were obtained from large-scale exome sequencing studies [14]. Risk genes for DD/ID came from exome sequencing studies [12,15].

Statistical analysis
R programming was used for all statistical analyses and correction of multiple comparisons by FDR at the level of 0.05. Statistical analysis of over-representation of the GC-box containing genes was conducted with cumulative hypergeometric probability tests. Permutation tests (N = 9999) were conducted for statistical analyses of the numbers of the GC-boxes in the proximal promoter regions between different groups of genes. ANOVA or Welch's t test was used for statistical analysis of the GC content in the proximal promoter regions between different groups of genes.

RESULTS
Eukaryotic Promoter Database (EPD) (https://epd.epfl.ch//index.php) hosts 16,455 protein-coding genes across the human genome (hg38) with well-characterized TSS. To avoid distortion by different numbers of alternative promoters between genes, the most representative promoter was selected from each gene for analysis. Consistent with Kadonaga's findings [2], the GC-box is predominantly localized starting from 40 nucleotides upstream of TSS after scanning the promoter region (−500,100) of all 16,455 genes (Fig. 1). The sudden drop of the GC-box percentages at 40 nucleotides upstream of TSS separates the proximal promoter from the core promoter that is occupied by RNA polymerase II and associated basal transcription machinery. The 100 nucleotide (−140,−41, red line) window for the GC-box is similar to the (−130,−30) window in previous studies [16] and slightly larger than Kadonaga's (−100,−40) window in early studies, likely due to more accurate TSS mapping of many thousands of genes in the most recent EPD database. In this proximal promoter region, the GC-boxes function as critical Sp1 and Sp4 binding sites for activation of gene expression [17,18]. The proximal promoter region (red line) from −140 to −41 was therefore selected as a critical window for analysis of potentially functional GC-boxes in schizophrenia-risk genes, ASD-risk genes, and DD/ID-risk genes.

Over-representation of GC-box containing genes
After identifying the GC-boxes in each of the 16,455 genes in the EPD, I found that 7653 genes,~46.51% of all genes, have at least one GC-box in their proximal promoter (−140,−41) regions (Table 1). This percentage of the GC-box containing genes across human genome is the control baseline for determination of overrepresentation of the GC-box containing genes within susceptibility genes for schizophrenia, ASD, and DD/ID. Since the top ten high-risk genes with the exome-wide significance from the SCHEMA are too small for statistical analysis, I expanded the gene list to the top 32 high-risk genes with p value < 10 −4 [12], presumably most of the expanded genes will reach the exome-wide significance after sequencing more cases and controls. Not all human genes have been finely mapped with TSS and collected in the EPD database. After searching the 32 high-risk genes in the EPD, 28 genes were matched. Out of the 28 EPD-matched genes, 19 genes,~67.86% of the matched high-risk genes, have at least one GC-box in their proximal promoter regions. Cumulative hypergeometric probability test suggested that the 19 GC-box containing genes are significantly (p = 0.0186) over-represented within the 28 EPD-matched high-risk genes from the SCHEMA than the 7653 GC-box containing genes within all 16,455 human genes. If the gene list is further expanded to the top 61 high-risk genes with p value < 10 −3 from the SCHEMA, 47 out of the 61 genes are matched in the EPD. Thirty-one out of the 47 EPD-matched genes,~65.96% of all these genes, contain at least one GC-box in their proximal promoter regions. Again, the 31 GCbox containing genes are significantly (p = 0.0056) over-represented within the 47 EPD-matched high-risk genes from the SCHEMA than the 7653 GC-box containing genes within all 16,455 human genes. In contrast to the rare high-risk genes identified by the SCHEMA, the GWAS found hundreds of common low-risk genes for schizophrenia with genome-wide significance [11]. After fine-mapping of the loci, 69 protein-coding genes were prioritized. Sixty-three out of the 69 genes are matched in the EPD. Thirty-six out of the 63 EPD-matched genes have at least one GC-box in their proximal promoter regions. Cumulative hypergeometric probability test suggested a trend (p = 0.058) of over-representation of the 36 GC-box containing genes within the 63 EPD-matched genes prioritized by the GWAS. If all 643 GWAS prioritized genes were used for the same analysis, the number of GC-box containing genes is also significantly overrepresented (p = 0.0306). In conclusion, the GC-box containing genes are significantly over-represented within the schizophreniarisk genes identified from both the SCHEMA and the GWAS studies. Are the GC-box containing genes also over-represented within highrisk genes for other psychiatric disorders? I analyzed the proportion of the GC-box containing genes within the high-risk genes for autistic spectrum disorders (ASD) from large-scale exome sequencing [14]. There is no over-representation of the GC-box containing genes (p = 0.45), suggesting that the GC-boxes in the proximal promoter regions (−140,−41) are functionally less dominant in the regulation of the ASD-risk genes. Risk genes for developmental disorder/intellectual disability (DD/ID) [12,15] were also analyzed. One hundred forty-one GC-box containing genes are significantly (p = 0.005) over-represented within the 258 matched high-risk genes for DD/ID than the 7653 GC-box containing genes within all 16,455 human genes. After correcting multiple comparisons with FDR, significant over-representation of the GC-box containing genes remains in both schizophrenia and DD/ID.
The GC-box evolves to be the dominant motif bound by Sp1 and Sp4 transcription factors with the highest affinity in eutherian mammals including mouse and human, whereas the ancestral GA-box GGGAGG is more common for Sp1 and Sp4 transcription factors in other vertebrates such as fish and marsupials. The GCbox has three times higher affinity for Sp1/Sp4 transcription factors than the GA-box [19]. The proportion of the GA-box containing genes within the risk genes for schizophrenia, ASD, and DD/ID was analyzed with exactly the same method (Table 2). There is no over-representation of the GA-box containing genes within the schizophrenia-risk genes or the ASD-risk genes except for the DD/ID-risk genes. Since the GA-box is almost identical to the binding motif (CCCCTCC) of the transcription factor MAZ, the same pattern of the over-representation of MAZ motif-containing genes was observed in the DD/ID-risk genes (Supplemental Table S1). It is possible that the two over-represented similar motifs may be mainly selected by MAZ. As expected, the binding motif (ACGCCC) of transcription factor KLF13 does not show enrichment in any group of the risk genes (Supplemental Table S2), supporting that over-representation of the GC-box containing genes within the schizophrenia-risk genes or DD/ID-risk genes is not a random effect.

Numbers of the GC-boxes in the GC-box containing genes
Since the GC-box containing genes are over-represented within schizophrenia-risk genes and DD/ID-risk genes, I examined whether the GC-boxes may also be enriched in the GC-box containing risk genes. The GC-boxes were identified and counted in the proximal promoter regions (−140,−41) of the 7653 GC-box containing genes in the EPD (Supplemental Table S3). Schizophrenia-risk genes from the SCHEMA and the GWAS, ASDrisk genes, and DD/ID-risk genes were extracted (Supplemental Table S3). The distribution of the numbers of the GC-boxes in the GC-box containing genes was boxplotted ( Fig. 2A). Due to nonnormal distribution, a permutation test (N = 9999) was conducted for statistical analysis (Supplemental Table S4). No significant differences were found between the control EPD group and each group of disease-risk genes, suggesting that there is no enrichment of the GC-boxes in the proximal promoter regions of the GC-box containing risk-genes for schizophrenia, ASD, and DD/ID.

Association of the GC-box with the GC content
High GC content may generate the GC-box by chance in the promoter. Does the over-representation of the GC-box containing genes result from higher GC content in the proximal promoter regions of schizophrenia-risk genes and DD/ID-risk genes? The GC content (%) in the proximal promoter region (−140,−41) was calculated for all 16,455 genes including the 7653 GC-box containing genes in the EPD (Supplemental Table S3). Schizophrenia-risk genes from the SCHEMA and the GWAS, ASDrisk genes, and DD/ID-risk genes were extracted (Supplemental Table S3). The GC content in the proximal promoter region (−140, −41) was boxplotted (Fig. 2B). All groups of genes except SCZ_G1 have a significantly higher GC content than the control EPD genes   Table S4). A high GC content of the ASD-risk genes does not produce more GC-boxes in their proximal promoter regions ( Table 1), suggesting that over-representation of the GC-box containing genes in the schizophrenia-risk genes cannot be simply attributed to the increased GC content. If the GC content was separately boxplotted in the GC-box containing genes or non-GC-box genes, the GC-box containing genes have a significantly higher GC content than non-GC-box genes across all groups regardless of diseases (Supplemental Fig. S1). All together, the presence of the GC-boxes will increase the GC content, but a high GC content does not necessarily produce more GC-boxes. Over-representation of the GC-box containing genes likely contributes most to the higher GC content in the proximal promoter regions of the schizophreniarisk genes and the DD/ID-risk genes.
The GC-boxes more upstream of TSS In contrast to Sp1/Sp4 transcription factors that recognize the GCboxes in the proximal promoter region (−140,−41), the GC-boxes more upstream of this region can be bound by other transcription factors such as RB1CC1, another high-risk gene for schizophrenia from the SCHEMA. These transcription factors are however less well studied; and their binding GC-boxes scatter across a larger area of the 5ʹ-regulatory region. Such more flexible localizations in a large region make analysis more difficult and arbitrary. To get a glimpse of the potential involvement of these more upstream GCboxes in schizophrenia, I plotted the percentages of the GC-boxes in the promoters of susceptibility genes for schizophrenia, ASD, and DD/ID (Fig. 3). In addition to the major peak of the GC-boxes in the proximal promoter region (−140,−41) across all schizophrenia-risk genes, there appears a second peak of more upstream GC-boxes (−340,−240) in the high-risk genes from SCHEMA2. This second GC-box peak is however absent in schizophrenia-risk genes prioritized by the GWAS. In high-risk susceptibility genes for ASD and DD/ID, the frequency of these more upstream GC-boxes is higher than that of all 16,455 genes across a large 5ʹ-regulatory region. Significance of these upstream GC-boxes remains to be investigated in the regulation of expression of these genes.

DISCUSSION
The SCHEMA identified ten high-risk genes for schizophrenia above the exome-wide significance. Out of the top ten high-risk genes, Sp4 and RB1CC1, the only two DNA-binding transcription factors, bind the same GC-box in regulation of genes involved in the pathogenesis of schizophrenia. Genes containing the GCboxes in the proximal promoter region are significantly overrepresented in the susceptibility genes for schizophrenia from both the SCHEMA and the GWAS studies, suggesting that Sp4 is not only a high-risk gene for schizophrenia, but may also act as a hub of network in the regulation of many other schizophrenia-risk genes via these GC-boxes in neuronal cells. Such a Sp4-controlled network of different schizophrenia-risk genes may contribute to the over-representation of the GC-box containing genes. In the GC-box containing risk-genes for schizophrenia or DD/ID, the numbers of the GC-boxes are not significantly increased in their proximal promoter regions, indicating a lack of synergistic effects from multiple GC-boxes in the activation of gene expression. Interestingly, Sp4 distinguishes itself from Sp1 by lack of such synergistic effects from multiple adjacent GC-boxes [18]. During the evolution of eutherian mammals, the GC-box rises quickly with co-evolved Sp1 and Sp4 variants after split from marsupials. The ancestral GA-boxes for Sp1/Sp4 transcription factors decrease by genetic drift. In human, there are still a significant large number of the GA-boxes in the proximal promoter regions (−140,−41). Unlike the GC-boxes, however, there is no over-representation of the GA-box containing genes within the schizophrenia-risk genes likely due to lack of selective pressure for the GA-boxes by Sp4 and other possible transcription  Table S4). All of the risk-gene groups except SCZ_G1 (black) have a significantly higher GC content than the EPD group after FDR correction of multiple comparisons.
factors. Surprisingly, the over-representation of the GA-box containing genes was still observed within the DD/ID-risk genes. Given that the MAZ transcription factor also binds the GA-box that is similar to the MAZ-binding motif, it is plausible that MAZ may regulate genes involved in the development of DD/ID and therefore causes the over-representation of both the GA-boxes and the MAZ-binding motifs. It cannot be ruled out, however, that the GA-boxes, despite with a lower affinity than the GC-boxes, may still be used in a limited capacity for Sp1/Sp4/Sp3 transcription factors depending on gene and cellular context in the development and diseases of the central nervous system. Not all GC-boxes in the proximal promoter region will be bound by Sp4 transcription factors in neuronal cells. Whether a specific GC-box in a schizophrenia-risk gene is bound by Sp4 may also depend on neuronal cell types, and has to be determined by experiments such as chromatin immunoprecipitation and sequencing (ChIP-seq). It should be kept in mind that other Sp1 family of transcription factors such as Sp3 may also recognize the GC-boxes in the proximal promoter regions to compete with Sp4 to activate or repress gene expression. Sp4 is specifically expressed in neuronal cells regardless of neuronal cell subtypes, but not in glial cells. However, Sp4 functions vary in different subtypes of neurons. We conducted genetic dissection of Sp4 functions in different neuronal cell types via cell-specifically restoring Sp4 expression using Cre/LoxP in either forebrain excitatory or GABAergic inhibitory neurons [20]. Restoration of Sp4 expression in forebrain GABAergic inhibitory neurons, but not forebrain excitatory neurons, rescues ketamine hypersensitivity in Sp4 hypomorphic mice.
Over-representation of the GC-box containing genes appears more prominent within the high-risk genes from the SCHEMA than within the common low-risk genes from the GWAS. A possible explanation is that the schizophrenia-risk genes from the GWAS carry much smaller effects and are less certain, particularly for the 643 genes (GWAS2) prioritized by various methodologies and assumptions. Over-representation of the GC-box containing genes is also observed within the risk-genes for DD/ID, but not ASD. Given that mouse Sp4 mutation impairs hippocampal development and causes severe deficits in learning and memory, it is not surprising that the GC-boxes are over-represented in the proximal promoter regions of the risk genes involved in the development of DD/ID.
The frequency of the GC-boxes peaks at the proximal promoter region (−140,−41) where Sp4 binds to activate gene expression in schizophrenia-risk genes. A second peak of more upstream GCboxes is suggested in the high-risk genes from the SCHEMA. Given that RB1CC1, a high-risk gene for schizophrenia, binds the more upstream GC-boxes, it is possible that these upstream GC-boxes may also be involved in the pathogenesis of schizophrenia. Identification of more high-risk schizophrenia genes will help confirm whether these upstream GC-boxes are also significantly over-represented. In contrast to the schizophrenia-risk genes, susceptibility genes for ASD and DD/ID have more upstream GCboxes across a large promoter region. It is unclear whether they may have any functional significance or are just produced by chance due to a high GC content in the promoters of these risk genes. However, a high GC content itself could be a consequence of selective pressure by transcription factors recognizing GC-rich motifs in the regulation of these risk genes in the pathogenesis of ASD and DD/ID.