Discovering key transcriptomic regulators in pancreatic ductal adenocarcinoma using Dirichlet process Gaussian mixture model

Pancreatic Ductal Adenocarcinoma (PDAC) is the most lethal type of pancreatic cancer, late detection leading to its therapeutic failure. This study aims to determine the key regulatory genes and their impacts on the disease’s progression, helping the disease’s etiology, which is still mostly unknown. We leverage the landmark advantages of time-series gene expression data of this disease and thereby identified the key regulators that capture the characteristics of gene activity patterns in the cancer progression. We have identified the key gene modules and predicted the functions of top genes from a reconstructed gene association network (GAN). A variation of the partial correlation method is utilized to analyze the GAN, followed by a gene function prediction task. Moreover, we have identified regulators for each target gene by gene regulatory network inference using the dynamical GENIE3 (dynGENIE3) algorithm. The Dirichlet process Gaussian process mixture model and cubic spline regression model (splineTimeR) are employed to identify the key gene modules and differentially expressed genes, respectively. Our analysis demonstrates a panel of key regulators and gene modules that are crucial for PDAC disease progression.

In genetics, gene expression is one of the elementary constitutional blocks which gives rise to a phenotype from a genotype, i.e., a trait which is observable in all living cells, including prokaryotes and eukaryotes. Multiple techniques are available to quantify gene expression and regulation, like DNA microarray, RNASeq, etc. The area of gene expression analysis undergone several significant advancements in biomedical research. With increased efficiency and quality, these measurements led to improvements in disease sub-classification, gene identification problems, and studying progression characteristics of diseases [1][2][3][4][5][6][7] . Biological mechanisms are dynamic in nature; therefore, their activities must be supervised at multiple time points. Time-series gene expression experiments are widely used to monitor biological processes in a time-series paradigm 8 . Analyzing these time-series gene expression data helps identify transient transcriptional changes, temporal patterns, and causal effects of the genes. Time-series gene expression studies can be utilized to predict phenotypic outcomes over a period of time 9 .
DNA microarrays and RNASeq data have been accepted as gold standards for analyzing and measuring gene expressions across different biological circumstances 3,5,10 . A gene is considered differentially expressed (DE) if a statistically significant difference in gene expression levels is observed between a pair of experimental conditions. Various statistical distribution models like the Poisson and the Negative Binomial (NB) distribution estimate the differential gene expression patterns. Gene selection refers to detecting the most significant DE genes under various conditions 11 . Selection is made based on a combination of score cutoff, and expression change threshold, commonly generated by the statistical design itself 12 . Popular time-course DE analysis tools include edgeR, DESeq2, TimeSeq, and Next maSigPro based on the NB distribution model. Some DE tools, like ImpulseDE2 and splineTimeR, based on impulse and spline regression models between two groups, respectively, are used on short time-series data 13 .
Gene expression is a strongly regulated spatio-temporal process. Genes having identical expression patterns are associated with the same biological function. Clustering genes with similar expression pattern reduces the transcriptional response complexities by grouping genes responsible for a distinct cellular process 14,15 . Several

Results and discussion
This section provides insight into the detailed findings of our present work.
Evaluation of differential expression. We have processed normalized gene expression values of 42412 genes described in section "Data preparation". Differential gene expression analysis of the genes among the control and treated samples has been performed using the splineTimeR 28 described in section "Differential expression analysis". We obtained 1397 DE genes using the adjusted p-value ≤ 0.05 with the Benjamini-Hochberg (BH) 36  Additionally, to identify the DE genes at each time point, we have employed the limma package 37 . Limma utilizes empirical Bayes smoothing on the estimated fold-changes and standard errors from a linear model fitting. Table 4 shows the top 5 DE genes at each time point, including the top 5 DE genes across all the time points discovered through limma. Figure 1 shows the overlap among the top 100 DE genes between control and treated samples at each time point through a Venn diagram. It has been observed that the top 100 DE genes across all the time points and 168 hours have the highest overlap of 59%, followed by 41% at 24 hours. It was also noticed that the DE genes obtained from the splineTimeR and the top 1397 genes across all the time points through the limma were precisely the same.
Gene network reconstruction and gene function prediction. Adopting the methodologies discussed in section "Gene association network reconstruction and prediction of gene function", we have reconstructed the gene association network (GAN). We have obtained two GAN's, each with a different probability cutoff. We discovered 45550 significant edges with 1107 nodes for cutoff probability 0.8 and 34080 significant edges with 1048 nodes for cutoff probability 0.9, which corresponded to 4.67% and 3.5% of all possible edges, respectively. Supplementary Fig. 2 depicts the reconstructed GAN with a probability cutoff set to 0.9 with the top 150 selected edges based on a higher partial correlation score. Genes with higher degrees in the whole reconstructed GAN are displayed in dark colour. From this figure, it can be observed that 'CUEDC1' , ' ABCA6' , 'MRPL50' , 'LYPD3' , 'KRT19' , 'OLFML3' , 'LGALS3' , 'Hs.540914' , 'LOC285074' , 'TBPL1' are the top 10 genes having extremely high connection with others within the GAN. Reconstructed GANs highlighting the betweenness and closeness centralities for the top 150 selected edges based on higher partial correlation scores are shown www.nature.com/scientificreports/ in Supplementary Fig. 3-4. It was discovered that 'CUEDC1' , ' ABCA6' , 'MRPL50' , 'LYPD3' , 'KRT19' , 'OLFML3' , 'TBPL1' were the seven common genes that rank within the top ten using all of these three centrality measures.
We have also inferred a gene regulatory network (GRN) using the time-course gene expression data of control and treated samples in PDAC and identified the regulators for each target gene using the dynamical GENIE3 (dynGENIE3) algorithm 31 as described in section "Gene association network reconstruction and prediction of gene function". Table 2 shows the top 100 regulators for the DE genes (column E). Figure 2 shows the interaction between the regulators and the target genes in the discovered GRN. This figure highlights the top 50 most frequent regulators and 50 most frequent target genes. Supplementary Table 4 reports the interactions among the regulators and the target genes of the inferred GRN with top scores. We have also detected 37 regulators for the 21 targets DE genes with a highly significant interaction score within the whole GRN, reported in Supplementary Table 1. The connectivity patterns of the predicted GRN was also analyzed by enumerating the number of transcription factors (TFs) regulated by a target gene (in-degree) and the number of target genes regulated by a TF (out-degree) [ Supplementary Fig. 15]. It has been discovered that most of the TFs regulate a comparatively small number of target genes (low degree TFs), while few TFs regulates a large number of target genes (high degree TFs).
In Table 1, the coverage column defines the ratio of the number of annotated genes in the resultant network to the number of genes with that annotation in the genome, and FDR is the false discovery rate generated from the GeneMANIA algorithm. The top 50 genes have been chosen from the resultant ranked genes list provided by GeneMANIA with the assigned score for finding their gene-disease associations. It has been observed that 'CSF2' , 'HOXB9' , 'ITGA11' , 'NUMB' genes are associated with Pancreatic Ductal Adenocarcinoma according to DisGeNET 38 web server.
Transcriptional regulators and DNA-binding transcription factor identification. We have identified the key transcriptional regulators (TR) for the DE genes using REGGAE 30 , with the time-course genes expression profiles of the control and the treated samples. We obtained a ranked list of 66 key TR for the PDAC dataset from a Regulator Target Interactions (RTI) collection, which contains a list of regulators for each deregulated gene that influence gene expression. Regulatory genes were ranked according to the score provided by the REGGAE algorithm. The top 5 key TR's are 'GATA6' , 'NFYB' , 'IRF1' , 'TRIM22' , 'SREBF1' .
Additionally, we have detected the proteins playing central role in DNA transcription using a curated database of specific DbTFs: TFcheckpoint 33   www.nature.com/scientificreports/ To validate the identified transcriptional regulators, we have performed an external validation. We first obtain a set of transcription factors from two independent study and then compute the overlap of the predicted set with it. Particularly, we combine transcription factors specific to distinct PDAC subtypes from Giuseppe et al. 40 , with an open source manually curated database of eukaryotic transcription factors called TRANSFAC 41 . We observed that 44 transcription factors of our potential set discovered from REGGAE (containing 66 transcription factors) are common with the combined set (see Supplementary Table 2 for the results). Moreover we observed that our potential set contains three transcription factors 'SMAD6' , 'FOSB' and 'IRF1' which are also demonstrated to be important regulators in human pancreatic ductal adenocarcinoma (PDAC) 40 .

Gene modules identification and determination of key gene modules. After discovering DE
genes, we have applied the Dirichlet Process Gaussian Process mixture model to obtain robust and accurate gene modules from their time-course gene expression profiles. We have tuned the hyperparameter of DPGP and uses several kernel functions choices to pick the best model that results in the best clustering solutions. According to www.nature.com/scientificreports/ a cluster-specific GP, the probability distribution of each gene's trajectories is defined by a positive definite Gram matrix that quantifies the similarity between every time points. This Gram matrix can be modelled through various kernel functions that depict smoothness and periodicity for GP models. In this work, we have utilized three different kernel functions squared exponential 14 , Matérn52 42 , and standard periodic 43 . We have discovered that the squared exponential (sq. exp) kernel with MAP clustering and Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) hyperprior optimization technique, concentration parameter ( α = 1.0 ), shape ( α IG = 4 ) and rate ( β IG = 1 ) parameters for the inverse gamma prior on the cluster noise variance produces best clustering solution with 10 gene modules which yield the highest silhouette width [ Fig. 4 Figure 5 shows the heatmap denoting the posterior distribution of the probability that two genes are being co-clustered. Subsequently, we have selected the top 6 modules with the highest number of regulators and referred to them as the key modules. Figure 6 shows the cluster trajectories of the DE genes for each key module in (A-F), normalized log 2 fold change in expression for each transcript, the posterior cluster mean and ±2 standard deviations according to the cluster-specific GP.

and Supplementary
Biological significance analysis. We have performed several analysis on the key modules to gain more insights into the pathways and biological processes (BP) of the involved genes inside the key modules. Additionally, we have identified key TRs in each of the key modules. We have identified KEGG pathways, associated biological processes (BP), and disease-gene associations using the DisGeNET database via Enrichr 35 . We have reported our findings, with the name of the regulator genes and DbTF in Table 2, where genes in boldface have been used to represent DbTFs among the TRs. Figure 7 shows the top-ranked (based on p-value) KEGG path-  www.nature.com/scientificreports/ ways (Panel A), biological processes (Panel B) and disease-genes association (Panel C) enrichment of genes of key cluster 1. Outcomes of our biological significance analysis for the other 5 key modules have been attached in Supplementary Fig. 5-9. Moreover, we have identified the top 10 KEGG pathways and GO terms (BP) for each of the key modules to understand their relevance and significance in PDAC. We observed that key module 1 contains 'FGF7' , 'IL6' , 'FGF9' , 'GSTA4' , 'GADD45A' , 'CDK4' , 'STAT1' , 'TXNRD1' , 'WNT2' genes which contributes to pathways in cancer and 'STAT1' , 'GADD45A' , 'CDK4' are directly associated to pancreatic cancer pathway. 'IL6' , 'STAT1' , 'ST3GAL6' inside the key module 1 is responsible for cellular response to Interleukin 6 (IL -6) that governs the pancreatic cancer progression 44 . 'CXCL6' , 'CSF2' , 'CCL20' , 'TNFAIP3' , 'CXCL1' , 'PTGS2' inside key module 2 takes part in IL-17 signaling pathway that promotes the transition to pancreatic cancer from chronic pancreatitis 45 . 'TNF-α ' expression elevates in PDAC initiation process which binds to 'TNFR1' receptor resulting in Tumor Necrosis Factor(TNF) signaling 46 57 . In key module 6, 'PTPRB' , 'SMAD3' , 'RAC2' genes are associated with the Adherens junction pathway. We found that 'TGFB2' , 'SMAD3' , 'RAC2' inside key module 6 are directly associated with Pancreatic cancer pathway. Additionally, 'TGFB2' , 'SMAD3' , 'INHBE' are responsible for TGF-beta signaling pathway in the key module 6. Activation of the TGF-β signaling pathway leads to increased chemotherapeutic resistance of pancreatic cancer cells 58 . Figure 8 shows the top 15 gene ontology terms (biological processes) performed by the genes in the 'key module 1' with their proportion of counts in the module. These GO terms have been selected according to the lowest p-values. This analysis for the other 5 key modules has been attached in the Supplementary Fig. 10-14. Additionally, we have analyzed genes in the top 6 key modules for their direct association with PDAC through the DisGeNET web server. We have found that 11, 10, 13, 11, 11, 7 genes are directly associated with PDAC in the key modules 1, 2, 3, 4, 5, 6, respectively. The names of the genes have been tabulated in Table 3.

Methods
The present section provides an overview of our systematic approach to data collection, data preprocessing, and the overall framework for the different methodologies used in our present analysis [ Fig. 9].  www.nature.com/scientificreports/ we have used the gene expression profiles of the annotated 42412 official genes for our differential gene expression (DE) analysis.

Differential expression analysis.
Analysis and detection of differential expression of genes have been carried out using the R/Bioconductor splineTimeR package 28 to discover significant DE genes. SplineTimeR operates on the values obtained from the parameters of a fitted natural cubic spline regression model, which is utilized in our time-course gene expression profiles for control and treated samples 28 (for a detailed description see Supplementary text). The differential expression of a gene has been discovered by applying empirical Bayes moderated F-statistics on the spline regression model's coefficient differences. The detection of DE genes using splineTimeR 28 has been carried out by setting the Benjamini-Hochberg 36 adjusted p-value threshold to 0.05 and with a degree of freedom of 4 for all genes. Top regulated genes have been identified as differentially expressed and used for our subsequent analysis in module discovery. Supplementary Fig. 1 37 . Limma utilizes empirical Bayes smoothing on the estimated fold-changes and standard errors from a linear model fitting to assesses the differential gene expression across a pair conditions. Table 4 shows the top 5 DE genes at each time point, including the top 5 DE genes across all the time points discovered through limma.

Gene network reconstruction and prediction of gene function. GAN Construction.
In this work, we have performed a gene association network (GAN) reconstruction from the identified DE genes' time-course data using a regularized dynamic partial correlation method 62 . GeneNet 62 has been used to analyze covariance matrices with a dynamic shrinkage method. Analyses have been performed with a posterior probability cutoff of 0.8, and 0.9. GAN reconstruction has been implemented using splineNetRecon function from the splineTimeR package 28 to identify regulatory association between genes under a specific condition (treatment) 28 . Top significant edges have been identified based on their cutoff posterior probability. We have identified the top genes by analyzing the overall graph topology in the resultant GAN based on a higher partial correlation score and widely used centrality measures: degree, betweenness and closeness centrality. These centrality measures were aggregated for selecting the top 150 ranked genes by taking the average of the quantile normalized values of these measures to perform gene function prediction.
GeneMANIA 32 infers possible connections between the query genes by searching many large, publicly available biological databases to find related genes. These include gene and protein expression data, protein-protein, protein-DNA and genetic interactions, pathways, reactions, protein domains and phenotypic screening profiles. GeneMANIA operates on a ridge regression-based fast heuristic algorithm to integrate multiple functional gene association networks using a label propagation algorithm. GeneMANIA weights data sources based on all genes' connectivity strength with each other in the query list and suggests relatively similar non-queried genes and their connection types. It returns an interactive functional gene association network of all resultant genes and their relationship. It also performs gene functions prediction of queried genes based on non-negative weights of gene sources (i.e. association of two genes) as a binary classification problem. GeneMANIA further returns a www.nature.com/scientificreports/ ranked gene list presumably sharing phenotypes with queried genes based on it's large and diverse databases. We have analyzed the top 150 ranked genes obtained from GAN for gene function prediction using GeneMANIA 32 .
GRN Inference. Dynamical GENIE3 (dynGENIE3) 31 is an adaptation of the GENIE3 method for gene regulatory network (GRN) inference that handles time series and steady-state data jointly. It is based on a semi-parametric model that models gene expression's temporal evolution using a non-linear ordinary differential equation (ODE). An ensemble of non-parametric regression tree (Random Forest) model is used to learn transcription function in each ODE. It assesses each input feature's importance to measure the variable importance scores (VIS) by considering the Mean Decrease Impurity measure. Finally, it uses the normalized sum of VIS to assign ranks for each learning sample from which the tree was built to identify regulators of each target gene. In our work, we have identified regulators for each target gene by GRN inference for the whole set of genes using the dynGENIE3 algorithm from the time-course gene expression data of PDAC.   14 to obtain gene modules from the univariate time-course expression data. It can simultaneously model cluster number with a Dirichlet process (DP) and temporal dependencies with Gaussian Processes (GP). DPGP uses a Bayesian non-parametric model for time course paths P ∈ R N×T , where N is the number of genes and T is the number of time points (see the Supplementary text for a detailed description of DPGP method). For Markov chain Monte Carlo (MCMC), we have estimated the probability of the trajectory of gene j inside the cluster i according to the DP prior with the likelihood that gene j belongs to class i according to the cluster-specific GP distribution. Neal's Gibbs Sampling "Algorithm 8" has been used to compute the posterior distribution of the trajectory class assignments 64 .
We have executed the MCMC algorithm with two burn-in phases that converged by cluster-switching ratio. After the second burn-in phase, the DPGP updates the model parameters and computes the kernel hyperparameters' cluster-specific posterior probabilities by the type II maximum likelihood. For this purpose, we have compared the performance of three different optimization techniques, viz., the fast quasi-Newton limitedmemory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), the function minimization using gradient information in a truncated Newton (fmin_tnc), and stochastic conjugate gradient (SCG) 65 . Here, the cluster assignment vector is sampled at every s th iteration ( s = 3 , here) to thin the Markov chain 14 . MCMC generates a sequence of states produced by a Gibbs sampler, where each state delineates a group of genes into disjoint modules. Results generated from Gibbs samples were compiled into a posterior similarity matrix (PSM). The outcomes from the Markov chain was summarized with three different clustering criteria, viz., the maximum a posteriori (MAP), the maximization of posterior expected adjusted rand (MPEAR), and the least-squares 66 to discover the best cluster assignment. DPGP conservatively declares convergence ensuring convergence plateau for at least 10 samples based on the iterative changes in posterior log-likelihood or the squared distance between sampled partitions and the posterior similarity matrix. We have also performed hyperparameter tuning for finding the concentration parameter ( α ), shape ( α IG ) and rate ( β IG ) parameters for the best clustering solution, which yields the highest silhouette width.

Key module identification and biological significance analysis. This subsection provides informa-
tion about approaches that have been used to find the biological significance of our obtained results.

Identification of transcriptional regulators and DNA-binding transcription factors.
We have utilized time-course expression profiles of the DE genes across the control and treated samples to identify the key transcriptional regulators (TR) using REGGAE 30 . It operates by using Kolmogorov-Smirnov-like test statistics and an implicit combination of regulator target interactions (RTI's) for the prioritizing influence of TR's. REGGAE returns a ranked list of key TRs with p-value aggregations. It uses an extensive collection of RTIs that relies on diverse external databases: ChEA, ChipBase, ChIP-Atlas, ENCODE, Signalink, Jaspar and TRANSFAC. Some of these databases provide manually curated RTIs extracted from the primary literature, while others provide binding information extracted from ChIP-Seq experiments. Additionally, we have also identified regulators for each target genes through the gene regulatory network inference analysis. We have also identified the DNA binding transcription factors (DbTF) through a cumulative and high-quality knowledge source of genome-scale information, TFcheckpoint 33 , from the detected transcriptional regulators obtained using REGGAE. The TFs in TFcheckpoint are investigated for experimental evidence supporting their role in specific DNA binding activity and RNA polymerase II regulation.
Functional enrichment analysis of the key modules. Identification and analysis of key gene modules which contain the highest number of transcriptional regulators have been performed. The top 6 key modules have been analyzed to discover biological processes, pathway analysis and disease-gene associations for the involved genes inside each module using Enrichr 35 . Additionally, we have studied the genes inside the key modules for their direct associations with PDAC.

Conclusion
This article developed a framework to discover the key regulatory genes and the key gene modules from multivariate time-series Pancreatic Ductal Adenocarcinoma (PDAC) microarray data. Here, we have identified the top differentially expressed (DE) genes with a cubic spline regression model. We have performed the Gene Association Network (GAN) reconstruction to discover the regulatory association between genes in PDAC. Moreover, identifying key regulatory genes for each target gene has been carried out through a GRN inference analysis. Additionally, we have detected transcriptional regulators and DNA binding Transcription Factors (DbTFs) using REGulator-Gene Association Enrichment (REGGAE) and TFcheckpoint databases. Gene modules have been identified through a Dirichlet Process Gaussian Process (DPGP) mixture model. We have identified and analyzed the top 6 key gene modules that contain a significant number of regulatory genes. Biological significance analysis reveals that the genes inside the key gene modules are highly associated with PDAC.
Our analysis can be further extended by analyzing integrated multi-omics data of PDAC patients. www.nature.com/scientificreports/ modules may be further validated using in vitro experiments to reveal some important findings in PDAC and its pathogenesis. One may also verify the role of key regulators in the modules as potential biomarkers. Survival analysis of the key transcriptional regulators may enlighten us with more significant insights about this disease. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.