NF-Y subunits overexpression in gastric adenocarcinomas (STAD)

NF-Y is a pioneer transcription factor—TF—formed by the Histone-like NF-YB/NF-YC subunits and the regulatory NF-YA. It binds to the CCAAT box, an element enriched in promoters of genes overexpressed in many types of cancer. NF-YA is present in two major isoforms—NF-YAs and NF-YAl—due to alternative splicing, overexpressed in epithelial tumors. Here we analyzed NF-Y expression in stomach adenocarcinomas (STAD). We completed the partitioning of all TCGA tumor samples (450) according to molecular subtypes proposed by TCGA and ACRG, using the deep learning tool DeepCC. We analyzed differentially expressed genes—DEG—for enriched pathways and TFs binding sites in promoters. CCAAT is the predominant element only in the core group of genes upregulated in all subtypes, with cell-cycle gene signatures. NF-Y subunits are overexpressed, particularly NF-YA. NF-YAs is predominant in CIN, MSI and EBV TCGA subtypes, NF-YAl is higher in GS and in the ACRG EMT subtypes. Moreover, NF-YAlhigh tumors correlate with a discrete Claudinlow cohort. Elevated NF-YB levels are protective in MSS;TP53+ patients, whereas high NF-YAl/NF-YAs ratios correlate with worse prognosis. We conclude that NF-Y isoforms are associated to clinically relevant features of gastric cancer.


NF-Y subunits are overexpressed in STAD.
Inspection of NF-Y subunits expression of the TCGA datasets (http:// fireb rowse. org) suggested that expression of NF-YA is globally increased in epithelial tumors 25 . We downloaded the available STAD RNA-seq dataset 8 and analyzed NF-Y subunits: NF-YA is robustly increased in STAD (p value: 10 -14 ). NF-YB and NF-YC are also increased (p values: 10 -07/08 ) (Fig. 1a). We then analyzed the levels of NF-YA isoforms: Fig. 1b shows that the levels of the "short" NF-YAs increase in tumors (p value 10 -15 ), unlike NF-YAl. In conclusion, we confirm a generalized overexpression of NF-Y subunits, especially NF-YA, in STAD.
The predominance of NF-YAs prompted us to verify the relative expression in gastric cancer cell lines. For this, we interrogated two repositories: the Broad Institute CCLE-Cancer Cell Lines Encyclopedia (https:// porta ls. broad insti tute. org/ ccle/ about) and a recently described set of gastric cancer lines 34 ; overall, we analyzed 50 cell lines, with a partial overlap of lines common to the two datasets. We downloaded RNA-seq data, mapped reads and analyzed NF-Y subunits levels. The results are shown in Fig. S1: the overall levels of NF-YA mRNA expression are variable with the majority, but not all, cell lines expressing primarily NF-YAs (Fig. S1a). The levels of the two HFD subunits, particularly NF-YB, are comparably less variable among the cell lines (Fig. S1b,c). We conclude that NF-Y subunits are overexpressed in STAD, particularly NF-YA, whose predominant isoform is NF-YAs, in gastric tumors and cell lines.
Expression of NF-Y isoforms in STAD subtypes. According to several genetic, epigenetic and functional parameters, TCGA classified STAD in four subtypes 8 . Since overexpression of NF-Y subunits could be limited to one -or more-of the subtypes, we investigated the levels of the three subunits in the four cohorts. Currently, RNA-seq data on 415 tumors are available, of which 387 were categorized by TCGA. We first classified all tumors for which there are RNA-seq data, employing the DeepCC machine learning tool 35 , with a training set represented by those already classified by TCGA: the relative proportions are indeed essentially maintained (Fig. 2a). Figure 2b (Left Panels) shows that the relative increase of NF-YA is similar in CIN, EBV and MSI (p values of 10 -12/15 relative to normal samples), but in GS, the levels are lower. NF-YB and NF-YC are increased at comparable levels in all subtypes.
As for the isoforms, the data are shown in Fig. 2b (Right Panels): NF-YAs is increased in MSI, EBV and CIN (p values 10 -14/16 with respect to normal samples), less in GS. NF-YAl, instead, shows a significant increase in GS. As a consequence of these changes, the NF-YAl/NF-YAs ratio is substantially increased in GS with respect to the other subtypes. In summary, overexpression of NF-YAs is generally widespread, but there is a distinctly higher NF-YAl/NF-YAs ratio in GS tumors.  S2b).
With the same thresholds, we then performed analysis of RNA-seq of the individual TCGA subtypes. Venn diagrams of the overlaps are shown in Fig. 3a and the lists of genes are in Supplementary Table S2. As for subtypespecific TFBS, distinct matrices are enriched in the four subtypes ( Fig. S3a): SP1/2 in CIN, ETS-family in EBV, Zn fingers TFs in GS and MSI (EGR1/2/3, Sp2/4). We analyzed Gene Ontology terms of DEG: Fig. S3b shows specific gene signatures for individual subtypes: in CIN, cellular protein metabolism, spermatogenesis; in EBV, viral process, T cell signaling; in GS, extracellular matrix, cell adhesion; in MSI, nucleolus. Analysis of the common set of 898 genes upregulated in all subtypes have NF-Y at the top of the enriched matrices, and features described in global DEG, such as extracellular matrix, cell division, DNA replication, with the addition of extracellular matrix terms (Fig. 3b). Overall, we conclude that CCAAT is the primary site only in promoters of commonly upregulated genes, but it is absent in those specific to each TCGA subtype.
Clinical outcome of NF-Y overexpression in STAD according to the TCGA subtypes. We stratified the progression free interval-PFI-of STAD patients according to High, Intermediate, Low levels of NF-Y subunits expression. In addition, we considered the ratios of NF-YAl/NF-YAs, because this parameter was more informative than the overall levels of the two isoforms to predict patient outcomes in breast, lung and HNSCC cancers [25][26][27]29 . No correlation is scored according to the different levels of NF-YA and of the HFD subunits (Fig. S4), nor to the ones of NF-YAl and NF-YAs isoforms (Fig. 4a, Upper Panels). As for the NF-YAl/NF-YAs ratios, instead, we did find a robust correlation with worse prognosis (p value 0.0099) (Fig. 4a, Lower Panel). We then focused on PFIs of NF-YA ratios stratified according to the single subtypes: a correlation with poor prognosis was scored in CIN and EBV (Fig. 4b), but not in GS and MSI (Fig. S5). In summary, a higher NF-YAl/NF-YAs ratio does have relevant clinical implication in STAD, globally and in specific TCGA subtypes.
Expression of NF-Y according to the ACRG classification. A second STAD molecular classification was proposed by ACRG. This was originally based on profiling analysis, and thereafter applied to the TCGA  Figure 5b (Right Panels) shows higher levels of NF-YAl, and lower of NF-YAs, in EMT samples, leading to an increased ratio of these isoforms. The presence of CIN samples in all ACRG subtypes, particularly EMT, led us to analyze NF-Y expression of CIN within ACRG subclasses: globally, the levels are similar (Fig. 5c, Left Panels), with those within the EMT group having distinctly higher levels of NF-YAl, lower NF-YAs and, by consequence, higher ratios (Fig. 5c, Right Panels). Note that analysis of STAD cell lines shows that most EMT lines, classified as such by Lee et al. 34 , indeed express the lowest levels of NF-YAs and highest of NF-YAl (Fig. S1a). We conclude that the EMT subclass of ACRG includes GS, as well as a portion of tumors catalogued as CIN, having a high ratio between NF-YAl and NF-YAs.
Clinical outcome of NF-Y expression according to the ACRG subtypes. Next, we evaluated the clinical outcome of patients according to the ACRG classification. Stratification according to NF-YAl/NF-YAs ratios indicate no clinical relevance in MSI, MSS;TP53 − and MSS;TP53 + , but worst prognosis with high and intermediate levels in EMT (Fig. 6a). This is in agreement with the CIN data ( Fig. 4b) and with the notion of a cluster of CIN tumors with high NF-YAl/NF-YAs ratios being inserted in the EMT subtype of ACRG ( Fig. 5c): this could be responsible for the correlation seen in EMT, but not in GS. To substantiate this point, we calculated the distribution of the NF-YAl/NF-YAs ratios in GS and EMT: Fig. 6b shows that GS has a flatter distribution, with more samples with very high ratios (35% are ≥ 1), whereas EMT has fewer samples with high ratios (25% are ≥ 1), but a larger population with ratios between 0.2 and 0.5. Thus, EMT is in part fed by the CIN samples www.nature.com/scientificreports/ that show high ratios ( Fig S6b). Note that EBV and MSI have essentially no samples above a 0.35 ratio. Thereafter, we stratified EMT samples according to low and intermediate/high ratios: the curve of the latter significantly correlates to a worst outcome (p value 0.012) (Fig. 6c, Left Panel). In addition, we reasoned that the overall levels of NF-YAs might also be impactful: stratification according to NF-YAs levels indeed indicates a protective effect of this isoform (Fig. 6c, Right Panel). Finally, analysis on the levels of HFD subunits in ACRG subtypes yielded negative results (Fig. S7), except for NF-YB, whose high levels are protective in MSS;TP53 + (Fig. 6d). Altogether, these data reinforce the role of the relative levels of the two NF-YA isoforms in the outcome of EMT, as well as pointing at a novel role of NF-YB in the MSS;TP53 + subtype.     Fig. S8 shows below zero median Z scores of CIN, EBV and MSI (TCGA), MSI, MSS;TP53and MSS;TP53 + (ACRG); instead, good concordance is scored within the GS and EMT groups. Because of the presence of low levels of epithelial Claudins, we will refer to this group as Claudin low . Next, we positioned this group within the other TCGA and ACRG subtypes (Fig. 7b): most tumors of the Claudin low cluster are from the GS and CIN (TCGA) and EMT (ACRG) subtypes. In essence, the Claudin low group could be classified as new within TCGA, while being essentially a subclass of the EMT ACRG subtype. Overall, these data confirm the existence of the subgroup proposed by Nishijima et al., further expanding it to 79 TCGA samples, with robust statistical significance. Next, we evaluated the expression of NF-YA isoforms and their relative ratio including the Claudin low group. Figure 7c,d show the results according to the TCGA and ACRG subtypes, respectively: NF-YAl is mostly present in the Claudin low class, with far lower levels in the remaining samples of the ACRG EMT subtype. On the contrary, NF-YAs is lowest in Claudin low , and higher in all other ACRG and TCGA subtypes, with the exception of GS. As a consequence, the NF-YAl/NF-YAs ratio is significantly increased (lowest p values: 10 -16 ) mostly in the Claudin low group. These data indicate that NF-YAl is mostly associated to a discrete number of STAD samples with EMT and Claudin low features.

NF-YAl is predominant in Claudin
To verify the overlap between the Claudin low and NF-YAl high (and NF-YAs low ) subsets, we stratified the clinical outcome of Claudin low tumors according to NF-YA isoforms expression (High, Intermediate, Low): no further worsening of prognosis in PFI curves is scored according to the different levels of NF-YA isoforms (Fig. S9, Upper Panels), nor NF-YAl/NF-YAs ratio (Fig. S9, Lower Panel). We conclude that there is a large overlap between the subset classified as Claudin low and NF-YAl high tumors.

CCAAT box is enriched in upregulated pathways of Claudin low samples. To further investigate
the Claudin low cluster, we compared pathways in Claudin low and EMT versus normal samples. The analysis of DEG in EMT shows absence of CCAAT in promoters (Fig. S10a). Across EMT upregulated pathways, we did find mesenchymal terms such as extracellular matrix, heart development, mesenchyme development (Fig. S10b). Within the TF motifs enriched in the promoters of genes of each single category, we observed significant enrichment of the NF-Y motif in cell-cycle terms, as expected, and in mesenchyme development and pattern specification process. In downregulated pathways, we observed different metabolism terms, also expected (Fig. S10c). The same analysis performed on Claudin low samples did not yield NF-Y motifs as enriched in deregulated genes, but rather MAZ, E2F6 and KLFs motifs (Fig. 8a); these TFs were confirmed by analyzing ChIP-seq data from the ChIP-Atlas database 38 (Supplementary Table S3). Among upregulated pathways we found extracellular matrix and mesenchyme development terms (heart development, skeletal system, and pattern specification process). As above, the CCAAT box was enriched in terms related to mesenchyme (Fig. 8b). Various metabolic processes populated the downregulated pathways (fatty acid and lipid metabolic process), expectedly regulated by NF-Y and with CCAAT motifs (Fig. S11).

Discussion
Because of its histone-like structure 17 , positioning within promoters 16 , synergistic connections with many other TFs and interactions with coactivators, NF-Y is believed to play a pioneering role in "opening" promoter structures and correct positioning of RNA Pol II 39 . Specifically, NF-Y is important for genes required for cell proliferation 20 . We describe here an investigation on NF-Y subunits levels in gastric cancer. We report the presence of CCAAT in commonly overexpressed genes and overexpression of NF-YA isoforms, as well as a prognostic value of their relative levels. We also report on overexpression of the HFD subunits, and clinical significance of NF-YB.
CCAAT boxes have been routinely found in promoters of genes overexpressed in cancer, first in large microarrays profiling 15 and more recently in RNA-seq datasets. Our analysis of TCGA identified CCAAT in overexpressed genes, typically with E2Fs sites, in line with the pro-growth role of these TFs. Specifically, two schemes are starting to emerge. In the first, CCAAT is enriched globally, and indeed at the top of the TFBS list, when all upregulated genes are computed: it is the case of lung tumors 26,27 ; in the second, the enrichment is found either in specific subtypes-iCluster 3 in HCC 28 -or only in DEG shared by all subtypes, as in BRCA 25 and STAD, as shown here. In global STAD DEG, TFBS in promoters of upregulated genes contain the familiar E2Fs motifs, along with Zn Finger TFBS (SPs/KLFs), but CCAAT is absent. As in BRCA, however, it comes out first when considering the core group of upregulated genes shared in all STAD subtypes. We also find that CCAAT is absent in promoters of genes downregulated in STAD, as for all other types of cancer examined so far. This further reinstates that this element is not a "general" signal enriched in promoters per se, but rather a core logo driving expression of genes associated to growth, not necessarily related to transcriptional features that are cancer-or subtype-specific.
The HFD subunits are overexpressed in STAD, unlike in lung and breast tumors. We recently reported a similar scenario in HCC, in which high levels of these subunits correlate with worst prognosis in a specific subtype, iCluster1. In STAD, global or subtype-specific PFI curves are globally superimposable based on NF-YB or NF-YC expression, with one notable exception: the MSS;TP53 + ACRG subtype, in which high NF-YB levels correlate with a better prognosis. As for HCC, the fraction of p53 wt tumors in STAD is much higher-51%-than in other epithelial cancers (lung for example), in which the vast majority are p53 mutated, rendering comparisons with wt p53 samples essentially impossible. Note that the protective role of NF-YB in STAD is opposite to what we reported in HCC iCluster1 tumors, generally associated to wt p53 status: although direct NF-Y/p53 interactions have been reported in several studies 20 , the reasons for association of NF-YB levels to such genetic background is unclear. Nevertheless, a role of HFD subunits in cancer progression is starting to emerge; in this respect, www.nature.com/scientificreports/ measurement of protein levels in tumors deserve a close look in the future: in BRCA cell lines, for example, the NF-YB protein seems to be more variable than one could anticipate from mRNA levels 25 .
Overexpression of NF-YA mRNA is as obvious in STAD as in the tumors previously analyzed. Note that analysis of 22 cancer specimens confirms that higher expression is also found at the protein level 30 . In the same study, high levels of NF-YA and Cyclin E in TCGA STAD samples were associated to worsening of patients' prognosis: yet, we do not find here a prognostic value of global levels of NF-YA. In another study, NF-YA high expression correlated with prognosis in a separate set of tumor samples analyzed by microarray profilings 31 , but only in the Diffuse (DF), not in the Intestinal (IT) subtype (Lauren classification). We add a novel and relevant twist, in that isoform ratios-rather than global levels-are clinically important within subclasses of STAD.
The two major NF-YA splicing isoforms differ in the Gln-rich trans-activation domain (TAD): NF-YAl has 28/29 extra amino acids coded by exon 3, predicted to impart different activation potential, as reported in mESCs and myoblasts 40,41 . In addition, a shorter isoform-NF-YAx-lacking sequences of exon-3 and exon-5 was recently found overexpressed in Neuroblastomas 42 . As in the other epithelial cancers, we find that NF-YAs predominates, but higher expression of NF-YAl, alone or coupled to lower levels of NF-YAs, is clinically relevant. The TCGA GS subtype is enriched in DF samples 8 , which is indeed in line with the data reported by Cao et al. 30 . GS tumors are characterized by earlier onset and expression of "cell adhesion" signatures. The NF-YAl/NF-Ys ratio is shifted in GS and the same pattern is observed stratifying tumors according to the ACRG classification: higher NF-YAl/NF-YAs ratios are found in EMT tumors. The relatedness of these subtypes in the two classifications was commented before [11][12][13] : indeed analysis of GO terms and pathways of DEG in these subtypes are in agreement with a mesenchymal phenotype. The ACRG EMT has 48 samples catalogued as CIN by TCGA: interestingly, the PFI of CIN patients indicates a worst prognosis following the NF-YAl/NF-YAs ratios.
Our comparative analysis of the whole set of TCGA tumors suggest clinical relevance for NF-YB and NF-YA isoforms in subgroups of the ACRG classification. Specifically, NF-YA-wise, the ACRG EMT group is more revealing than the TCGA GS, most likely because of the inclusion of CIN tumors with EMT-like profilings. While in the EMT group the role of NF-YA ratios is clinically visible, in the TCGA GS it is not. One possible explanation is the lower dispersion of ratios and lower number of samples in this latter group, making comparison of quartiles difficult. Incidentally, this also allowed to score a protective role of NF-YAs, completely missed by adhering to the TCGA classification. Another feature emerging in the ACRG classification is the protective role of high NF-YB levels, as discussed above. These differences might reflect the fact that RNA profilings are the basis of ACRG, while TCGA factored in other genetic and epigenetic features of STAD.
The parallel of the present data with what we found in breast carcinoma is noteworthy. NF-YAs is also predominant in BRCA, except in the Claudin low subset of Basal-like tumors, that have higher levels of NF-YAl. This is associated to a shift in DEG in these tumors, from signatures dominated by proliferative terms in NF-YAs high tumors, toward activation of EMT signatures. In turn, this is clinically associated to an aggressive, metastatic, drug-resistant behavior. As in BRCA, the NF-YAl/NF-YAs ratio is clinically informative in STAD, but in this case the protective role of NF-YAs high in the EMT subtype is novel. Nishijima et al. showed that overall survival curves and Hazard ratios of the 46 Claudin low patients are indeed worse with respect to other subtypes, dramatically so within the ACRG-classified patients. This suggests that the Claudin low partitioning is particularly significant with ACRG. We extended this group to 79 TCGA tumors by using the signature described: our results confirm and extend the scenario proposed by these Authors, particularly within the ACRG classification, which better partitions the protective role of NF-YAs from the detrimental role of NF-YAl in the Claudin low group. Furthermore, it appears manifest the overlap of tumors with Claudin low and NF-YAl high features.
In general, these data invite further analysis in epithelial cancers to identify (i) Claudin low signatures in other types of epithelial cancers, and (ii) a threshold of NF-YA isoforms ratios, rather than overall levels, possibly responsible for shifting DEG away from proliferative, cell cycle genes toward mesenchymal ones.

Materials and methods
RNA-seq datasets. As of December 2020, there were RNA-seq data on 415 STAD primary tumors in TCGA and 35 non-tumor tissues. We downloaded the corresponding RSEM scaled count data from the http:// fireb rowse. org/ web page. The last published classification of STAD samples in the four molecular subtypes made by TCGA referred to 387 of the 415 tumors, and we retrieved it from the https:// www. cbiop ortal. org/ web page 43,44 ; a different classification was proposed by ACRG on 204 TCGA tumors 9 . All the experiments involving human data in these public datasets adhered to relevant ethical guidelines. The DeepCC tool 35 was used to classify RNA-seq dataset of all tumors in TCGA, according to the TCGA and ACRG classification, using as a training set the tumors already classified by TCGA and ACRG, respectively.
We retrieved the FASTQ files associated to the 37 CCLE stomach cell lines (accession code: PRJNA523380) 45  Gene expression analysis. Differential gene expression analysis of RNA-seq data was performed using R package DESeq2 46 . The Tumor versus Normal expression fold change (FC) denotes upregulation or downregulation according to the FC value. Log 2 FC, and the corresponding false discovery rate (FDR), were reported by the R package. FDR < 0.01 and |log 2 FC|> 0.5 were set as inclusion criteria for DEG selection in tumor/subtype versus normal samples.
Gene ontology, pathway enrichment and transcription factor binding site analysis. We used KOBAS 3.0 (http:// kobas. cbi. pku. edu. cn/ anno_ iden. php) for pathway enrichment analysis using the ENTREZ gene IDs. The TFBS and de novo motif analyses were performed using the Pscan software 36 38 . To obtain TFBS enrichment heatmaps, input genes collections of the top GO terms from KOBAS analysis, sorted by FDR, were analyzed individually with Pscan. Only GO terms with less than 500 background genes were included, and TFBS motif enriched (Pscan p value < 0.01) in less than 10 terms were filtered out.
Analysis of clinical data. We retrieved clinical data related to the TCGA STAD samples and progression free interval-PFI-time records of patients, respectively, from the https:// www. cbiop ortal. org/ and the http:// xena. ucsc. edu/ web pages 43,44,47 . We stratified all the tumors for which PFI records were available according to NF-Y subunits expression at gene level, NF-YA isoforms expression, and NF-YAl/NF-YAs ratio, into three groups (Low = first quartile, Intermediate = second and third quartiles, High = fourth quartile). Survival analysis was performed according to the Kaplan-Meier analysis and log-rank test 48 .
Hierarchical clustering and Z scores computation. TCGA samples RSEM scaled count data were converted into TPM, log 2 -transformed, and median centered; we then performed a hierarchical clustering of the samples with the R package SigClust2 (version 1.2.4) with "average" linkage and "euclidean" metric options, while the alpha parameter was set to 0.05. Daughter nodes were tested if significance was achieved at the corresponding parent node, according to the built-in FWER controlling procedure. We obtained Z scores from log 2 -transformed expression data for each gene of the Claudin low signature, and a median Z score for each sample was computed across the genes of the signature.
Statistical analysis. Analyses were performed in the R programming environment (version 4.0.3), with the ggplot2, ggpubr, survival, survminer, tidyverse packages. Single comparisons between two groups were performed with the Wilcoxon rank-sum test.