Systems-level differential gene expression analysis reveals new genetic variants of oral cancer

Abbas, Syeda Zahra; Qadir, Muhammad Imran; Muhammad, Syed Aun

doi:10.1038/s41598-020-71346-7

Download PDF

Article
Open access
Published: 04 September 2020

Systems-level differential gene expression analysis reveals new genetic variants of oral cancer

Syeda Zahra Abbas¹,
Muhammad Imran Qadir ORCID: orcid.org/0000-0003-1899-5459¹ &
Syed Aun Muhammad¹

Scientific Reports volume 10, Article number: 14667 (2020) Cite this article

4520 Accesses
16 Citations
9 Altmetric
Metrics details

Subjects

Abstract

Oral cancer (OC) ranked as eleventh malignancy worldwide, with the increasing incidence among young patients. Limited understanding of complications in cancer progression, its development system, and their interactions are major restrictions towards the progress of optimal and effective treatment strategies. The system-level approach has been designed to explore genetic complexity of the disease and to identify novel oral cancer related genes to detect genomic alterations at molecular level, through cDNA differential analysis. We analyzed 21 oral cancer-related cDNA datasets and listed 30 differentially expressed genes (DEGs). Among 30, we found 6 significant DEGs including CYP1A1, CYP1B1, ADCY2, C7, SERPINB5, and ANAPC13 and studied their functional role in OC. Our genomic and interactive analysis showed significant enrichment of xenobiotics metabolism, p53 signaling pathway and microRNA pathways, towards OC progression and development. We used human proteomic data for post-translational modifications to interpret disease mutations and inter-individual genetic variations. The mutational analysis revealed the sequence predicted disordered region of 14%, 12.5%, 10.5% for ADCY2, CYP1B1, and C7 respectively. The MiRNA target prediction showed functional molecular annotation including specific miRNA-targets hsa-miR-4282, hsa-miR-2052, hsa-miR-216a-3p, for CYP1B1, C7, and ADCY2 respectively associated with oral cancer. We constructed the system level network and found important gene signatures. The drug-gene interaction of OC source genes with seven FDA approved OC drugs help to design or identify new drug target or establishing novel biomedical linkages regarding disease pathophysiology. This investigation demonstrates the importance of system genetics for identifying 6 OC genes (CYP1A1, CYP1B1, ADCY2, C7, SERPINB5, and ANAPC13) as potential drugs targets. Our integrative network-based system-level approach would help to find the genetic variants of OC that can accelerate drug discovery outcomes to develop a better understanding regarding treatment strategies for many cancer types.

Therapeutic effect of microRNA-21 on differentially expressed hub genes in gastric cancer based on systems biology

Article Open access 11 December 2023

Integrated computational screening and liquid biopsy approach to uncover the role of biomarkers for oral cancer lymph node metastasis

Article Open access 28 August 2023

Identification of GUCA2A and COL3A1 as prognostic biomarkers in colorectal cancer by integrating analysis of RNA-Seq data and qRT-PCR validation

Article Open access 10 October 2023

Introduction

Oral Cancer constitutes approximately 90% among all Head and Neck Cancer (HNC) sub-types¹. However, it is more prominent in urban areas of South Asia with a ratio of 15–40% among all cancer types². In Pakistan, it ranked as 2nd most prevalent cancer-type, with increasing incidence in the past few years^3,4. The complexity of genetic mechanisms in cancer has been revealed through recent investigations. Many biological systems seem to involved in the development and progression of the cancer. But, the complications in system-interactions are limitedly understood which is a major restriction in developing effective treatments⁵. The gene expression studies may help to investigate the differential expression of genes in different biological states, cell cycle stages, subjects or tissues. This gene expression analysis is an important pinpoint for investigating biological processes and their functional disorders. cDNA microarrays were used to monitor and reveal the expression level for thousands of the genes which are differentially expressed in tumors simultaneously⁵. This technique can exploit this valuable information regarding gene expression analysis. This rapidly progressing technique provides comprehensive data for gene expression profiling of thousands of genes to the investigators in one experiment. Many studies demonstrated that this technique is useful to identify novel genes for cancer and its molecular level classification in human^6,7. Thus, this novel technique may help us to identify new potential targets for drug development for optimal and effective disease therapies. It may also establish an important link between clinical medicine and gene sequences for humans⁸. Gene Expression Omnibus (GEO) is a publicly available database that can be used for high-throughput screening of molecular variants. It contains the microarray data including single and multiple channel-based experiments to determine abundance of genomic DNA, mRNA and protein molecules. The scientific community widely uses the gene expression database to analyze and search the molecular components for systems-level investigations. Several microarrays applications for meta-analysis are designed to investigate pathological mechanisms associated with genetic risk factors^9,10,11. Therefore, the genome expression analysis may help to identify unusual alterations across the genome using microarray technology¹². This study aims to identify genetic causes and some probable genetic variants of oral cancer that will help to modify the therapeutic strategies.

Results

The cDNA datasets used in this study belong to various normal or cancerous oral tissues and cell lines. The list of datasets used in this study are shown in Table 1.

Table 1 List of cDNA datasets analyzed in this study.

Full size table

Differential analysis, normalization and cross-validation

We analyzed 21 cDNA datasets specifically associated with oral cancer. Each dataset has a different number of samples and the genes derived through mRNA expression profiling using different Affymetrix platforms for OC. The histograms representing expression after normalization indicate the density estimation of data. The shapes and ranges for arrays distributions are similar indicating the quality of the data. The array’s distribution towards the right shows a high background level. The saturation of signal is specified by bulge which appears at the upper end of intensity range (Fig. 1). An automated mechanism was used to perform a comparison of biologically similar groups in pairs. We let off any subgroup without repetition from comparisons for accuracy and verification of differential analysis, and the generalized linear models' 'cv.glm' method measured the error of the cross-validation prediction. The Gaussian dispersion criterion is 0.00519 which indicates the degree of confidence (Table 2). With K-folds estimation we obtained the same delta value of 0.00515 as we used the LOOCV approach (during raw cross-validation and afterward during modified cross-validation). The substantial codes (0.1, 0.01, 0.001, and 0.05) with residuals of limited deviance suggested the consistency of the differential analysis. cDNA datasets were also analyzed for some necessary factors like RNA quality, sequence biases or RNA degradation. In genomic analysis, the use of low-quality RNA samples in the sequencing of the entire genome is inefficient. It is not clear if transcript degradation occurs reliably in low-quality RNA samples, in which case the effects of degradation can be reversed by data normalization or whether different RNA samples can be degraded at different rates, which could bias expression measurements. So, for differential expression analysis, we assessed the RNA quality. To verify the dataset reliability for identification of variation at the transcriptional level in original samples. The normalization process was used to standardize sample handling techniques and to assess optimal RNA variability threshold by using discrimination measures for statistical and algorithmic analysis. All the probe sets have their individual probes aligned at 5′-end of the target RNA molecule. The competitive binding of a particular probe to its target has been observed to depend upon a 3′/5′ intensity gradient. Due to the poor quality of RNA, a reduced quantity of RNA is hybridized to the array. The low hybridization leads to a decrease in the total signal output level. But if the degree of saturation level increases the 3′/5′ intensity gradient decreases. The 3′-end of the target gene contains a probe set that corresponds to the transcripts. The statistical and function summary for each batch-array is produced by ‘AffyRNAdeg’ to measure RNA degradation level and its significance (Fig. 2).

Table 2 K-fold Cross-validation using “Boot” package of bioconductor software based on Gaussian dispersion modules.

Full size table

Disease-gene curation for differentially expressed genes (DEGs)

From 21 datasets, we found 30 DEGs and David tool was used to retrieve their gene symbol and biological annotation. We selected the most significant ranked genes from the list of differentially expressed genes. For disease-gene curation, these genes were text mined using CTD (Comparative Toxicogenomics Database), PubMed, OMIM, MeSH, and PMC databases to filter disease-specific genes (Supplementary Table 1). We observed CYP1B1, CYP1A1, C7, ADCY2, SERPINB5, and ANAPC13 are the most curated terms in the databases. These shortlisted genes were further analyzed by mapping at (p < 0.00005) through Cancer Genetics and OMIM databases and observed their role in carcinogenesis (Fig. 3).

Enrichment and cluster analysis of DEGs

These genes showed enrichment substantially linked with hydroxylase, P450 pathway, steroid metabolic process, monooxygenase activity, cellular response to organic cyclic compound, and aromatase activity (Table 3). The dysregulation of these genes causes genetic heterogeneity, autosomal recessive disorder, head and neck disorders and other clinical phenotypes (Fig. 4).

Table 3 Gene Ontology and functional enrichment of differentially expressed genes.

Full size table

The function of the gene, its regulation, subtypes, and cellular processes play a key role in understanding its biology. The functional enrichment analysis showed that shortlisted CYP1A1, CYP1B1 genes are known to involved in xenobiotic metabolic and energy pathways. C7 is involved in immune response whereas SERPINB5 is known for protein metabolism. While ADCY2 plays an important role in cell communication and signal transduction. ANAPC13, have a potential role in class-I MHC-mediated antigens and cell-cycle progression at early tumor stages. The membrane attack complex in the extracellular region for C7, Expression of CYP1A1, SERPINB5, CYP1B1 is found in the endoplasmic reticulum. CYP1A1 is also found in nucleus and microsomes. ANAPC13 is a well-known anaphase-promoting complex. Differential expression of ADCY2, SERPINB5, CYP1B1 is also found in the cytoplasm.

Cluster analysis of selected DEGs helps us in the recognition of functional annotation and significance. The results were observed with the Euclidean distance (Fig. 5). The genetic expression of sample cells is distinguished as cases and control indicating the obvious differences between two of these groups. The analysis showed the down and up-regulated genes based on the p value and fold changes (Table 4).

Table 4 Expression profiling of cDNA microarray datasets.

Full size table

Mutation analysis

ADCY2 has eight post-translational modification (PTM) sites with 234 recurrent cancer mutations at the chromosome no. 5 positive-strand encoding 1091 protein residues representing 14.02% of the predicted disordered region. The mutation visualization plot shows ADCY2 isoform ADCY2 Q474R direct network-rewiring mutation impact at the position 474, with reference amino acid residue Q and mutated amino acid residue R in the protein. The affected-site at position 472 with S amino acid residue-site enriched with a phosphorylation-type mutation affecting PTMs. Another, ADCY2 isoform ADCY2 S655R, reveals the mutation for this protein at position 655 including amino acid residue S comparison with the mutated amino acid residue R. In position 659, S-amino acid residue site enriched with phosphorylation-type mutation, this shows distal-mutation PTM impact with the affected-site. CYP1B1 showed a 10.5% predicted disordered sequence region with 100 mutations observed at chromosome no 2 on the negative strand with 543 protein residues and 3 PTM sites. CYP1B1 I87S, CYP1B1 Q479H, CYP1B1 T510I isoforms were revealed for CYP1B1 mutational enrichment, at the positions 87,479,510 respectively. The reference amino acid residue for these isoforms was I, Q, T along with mutant amino acid residues S, H, I respectively for each isoform. Similarly, the mutational analysis of C7 showed that 12.57% of the sequence predicted for disease-pathophysiology. Total 244 number of mutations were found on the positive strand of chromosome no. 5 for C7. The number of PTM sites for C7 was eight with 843 protein residues. So, 10 isoforms for C7 were found, among them C7 Q29R and C7 G41D were in distal-mutational PTM impact (Fig. 6). The reference amino acid residues for C7 Q29R, C7 G41D, and C7 T756I were Q, G, T, and mutant amino acid residues reported were I, R, and D, respectively (Supplementary Table 2).

Protein–protein interaction analysis

We retrieved the related nodes and edges of all oral cancer associated DEGs from the HAPPI database to construct the integrated PPI-network (Fig. 7). This interaction analysis helped us to observe the potential functional interaction among OC related DEGs and other associated genes contributing to the disease phenotype. The seeder or source OC associated DEGs CYP1A1_HUMAN, CYP1B1_HUMAN, ADCY2_HUMAN, C7_HUMAN, SERPINB5_HUMAN, ANAPC13_HUMAN interact with the target genes including BRAC1_HUMAN, CO6_HUMAN, ISG15_HUMAN, S1PR3_HUMAN, and other essential proteins. The network topology shows a significant relationship between seeder and target genes. The identified proteins showed a significant association with disease development. The target proteins GST2_HUMAN and GSTK1_HUMAN play important role in phase-II carcinogen metabolism and interacted with source genes CYP1A1_HUMAN and CYP1B1_HUMAN. The target proteins including CO7_HUMAN, CLUS_HUMAN, CO6_HUMAN and CO5_HUMAN involved in laryngeal carcinoma, oral cavity squamous cell carcinoma and oropharyngeal carcinoma, respectively. Oral cancer ADCY2_HUMAN linked with KNG1_HUMAN as potential biomarkers. SPB5_HUMAN is interacted with P53_HUMAN, which is a potential tumor suppressor gene. We have observed that OC related genes APC13_HUMAN is interacting with CDK2_HUMAN, a prognostic indicator of oral cancer.

Pathway modelling of OC associated DEGs

Integrated pathways were modeled to observe the possible role of DEGs in pathophysiological mechanisms. Ras, p53, MEK, SOS, Rb, Bax, PTEN, and Raf are important interacting genes associated with the pathophysiological mechanism of oral cancer. We found that p53 signaling, microRNA signaling, salivary secretion, human papillomavirus, cell cycle, alcoholism, and xenobiotic metabolism-related pathways are interconnected in the progression and development of the disease (Fig. 8).

Toxicogenomic analysis

The toxicogenomic analysis enabled us to explore chemical genotype–phenotype exposomic information that may lead to disease progression. OC associated DEGs were curated in terms of their activity and expression with different environmental chemicals. The data revealed the activity and expression of DEGs, which either increase or decrease the expression, increase or decrease towards gene activity at different cellular events. It may also affect the cotreatment expression leading to disease occurrence. It was revealed that the same chemical exposure may show different reactivity for different genes. In this case, benzo(a)pyrene increases the expression of CYP1A1 but it affects the reactivity of ADCY2. Methylcholanthrene, albendazole, primaquine increases CYP1A1 activity. Resveratrol and tetrachlorodibenzodioxin were found to affect the binding, decrease the reaction, and increase the CYP1A1 and CYP1B1 activity. Similarly, acetaminophen was found to increase the expression of C7, whereas alpha-cobra toxin may account for decrease reaction. Arsenic may affect C7 expression by increasing its abundance. While nickel was found to decrease C7 expression. However, decitabine was found to affect the cotreatment of SERPINB5 and decrease the gene reactivity with trichostatin A. It was observed that fonofos, methapyrilene, parathion increases and affects the ADCY2 reactivity at different cellular events while valproic acid, okadaic acid, doxorubicin, and bisphenol A decreases the ANAPC13 expression (Fig. 9).

De novo prediction of regulatory motifs

Oral cancer-associated DEGs were used for de novo analysis to predict the regulatory motifs. The transcriptional factors include ARNT, AHR, CEBPA, CTCF, HNF1B, ELK4, TCF3, and NR2E3. The conservation cutoff is 0.40 with a matrix-score threshold of 85% were set as default parameters. The parameter settings standardize to analyze oPOSSUM-tool showed how the transcriptional factor controls its related targets (Table 5).

Table 5 Over-representation of oral cancer-related DEGs using oPOSSUM with 80% matrix score.

Full size table

MiRNAs targets for oral cancer-specific genes

miRDB database helped us to predict the miRNA targets based on the algorithms. We found the reliable score (> 80) of oral cancer-specific miRNAs. Mainly miRNAs including hsa-miR-4786-5p, hsa-miR-4282, hsa-miR-2052, hsa-miR-216a-3p, hsa-miR-3148, and hsa-let-7f-1-3p targets were predicted for CYP1A1, CYP1B1, C7, ADCY2, SERPINB5 and ANAPC13 genes respectively. The functional dysregulation of these genes may lead to disease progression. The predicted target scores, total miRNA hits, seed-binding locations, and 3′-UTR length of DEGs were analyzed (Table 6).

Table 6 Prediction of gene specific MiRNA-targets associated with oral cancer.

Full size table

Drug-gene interaction analysis

The toxicogenomic approach was used to investigate the drug-genes interaction to explore available treatments. The genes that interact with anticancer drugs are docetaxel, hydroxyurea, bleomycin, daunorubicin, lansoprazole, doxorubicin, liothyronine sodium, risperidone using DGIdb database. We identified sixteen proteins as potential alternative drug targets including CXCL1, FBXO32, PTTG1, CCNB1, ADCY2, NMU, ANAPC13 and others (Fig. 10). The dysregulation of these proteins may affect the normal expression level and could be a potential part of therapeutic strategies.

Discussion

This study focused on the genetic expression and functional enrichment of genetic variants of oral cancer. The six most significantly OC associated genes (CYP1A1, CYP1B1, SERPINB5, ANAPC13, ADCY2, C7) found through a differential analysis were consider as seeder or source genes. This analysis provides us a list of new genes aberrantly expressed in oral cancer including SERPINB5, ANAPC13, ADCY2. We have investigated differential expression between cases and controls of cDNA datasets at cellular level in oral tissues and found the possible association of these genes in oral cancer. We can get more information about the mechanism of human genetic disorder through microarray studies. The expression profiling of these important genes shows obvious differences between cases and controls. Some DEGs were found upregulated while other downregulated in this analysis. These genes are abnormally expressed to affect physiological functions including cellular signaling, replication, mitotic division, and programmed cell death. We have observed that our differentially expressed genes are associated with the cancer pathways including biological oxidations, metabolism, adenylate cyclase-activating pathway, xenobiotics, G alpha signaling events, transcriptional targets of TAp63 isoforms, p53, and IFN-gamma pathway revealed the biological significance of these genes specifically for oral cancer progression. CDC20_HUMAN, HDAC1_HUMAN, CXL10_HUMAN interacting with source genes are potential drug targets^13,14,15. The inherent mutations are reported for genes that encode drug-metabolizing enzymes. Such somatic gene mutations are induced chemically that play a vital role in cell differentiation and growth¹⁶. The sequencing investigations not only characterizes the genomics but also revealed thousands of SNVs (single nucleotide variants), the alterations in copy number along with many types of genetic variations. Such genomic to phenomic association identification, their molecular-level mechanisms, disease-related variants along with cancer-derived mutations are the current challenges in the biomedical research^17,18. Deciphering inter-individual genetic variation is the latest trend in personal genomic era investigations. The interpretation of genomic to proteomic information may integrate the impact of mutations on cellular system-level investigations in the future with a higher magnitude¹⁹. Human proteomic data analysis uses PTMs to interpret disease mutations and inter-individual genetic variations. PTMs being important regulators of protein function and signaling pathways facilitate the missense mutational analysis investigations¹⁹.

We have observed that CYP1A1 belongs to potential and well-preserved phase-I xenobiotic metabolizing gene family which is involved in the activation of procarcinogens. The CYP1A1 enzyme is highly associated with increased risk of tumors in the oral cavity, bronchial and laryngeal regions in smokers⁴. Similarly, the association of CYP1B1 has found in many cancer types²⁰. Many investigations are reported about the substantial link of HNC with CYP1A1 and CYP1B1^16,21.

SERPINB5 belongs to serpin encoding serine protease which plays a vital role in tumor metastasis^22,23. It is a tumor suppressor in epithelial cells and can suppress cancerous cell invasion and their metastasizing in surrounding tissues²⁴. The paradoxical expression of SERPINB5 has been observed in various types of tumor^25,26,27,28. We have seen a highly significant association between SERPINB5 expression and oral carcinoma²⁹. ADCY2 is a membrane-associated enzyme which converts adenosine-5′-triphosphate (ATP) into 3′,5′-adenosine monophosphate (cyclic AMP/cAMP) and pyrophosphate³⁰ involving in the regulation of cAMP synthesis³¹. This gene catalyzes the signaling molecule cAMP through G-protein beta as well as gamma subunit signaling^32,33. Therefore, changes in expression patterns of the gene are mediated through down-streaming of the signaling cascades muscarinic acetylcholine receptors which increases IL6 production. The high regulation of the gene is observed in G-proteins, calcium, calmodulin, pyrophosphate, and post-translational modifications. The signaling pathways include RET signaling, Oocyte meiosis, calcium, and chemokine signaling pathways^34,35,36. Aberrant methylation of ADCY2 is observed in colorectal, prostate cancer^35,36,37 and urinary bladder cancer^38,39,40. It has been studied that ANAPC13⁴¹ is a large-sized ubiquitin ligase that controls the cell cycle progression⁴² and involved at early steps of malignancy in tumor cells⁴³. Similarly, C7 (complement component-7), the terminal component for complement cascade and as a cytolytic effector for complement system, lyses transformed malignant cells^{44,45,46,47,48,49,50,51}. The integration of chemical- gene interaction revealed different environmental chemical exposure to disease progression^52,53. This analysis helped to reveal the mechanism of action between the chemical and the related gene products and their effect on human disease influence by environmental exposure^53,54,55,56.

The PPI network predicted the important association of these genes with disease. These genes have a potential role in xenobiotic metabolism, tumor progression, suppression, cell cycle, HPV (human papilloma virus), alcoholism, and microRNA signaling pathway^{30,40,48,49,51,57,58,59,60}. The transcriptomic analysis showed expressive transcription factors like JUND, FOXO, STAT1. We found the role of these genes in metabolism of xenobiotics, p53 signaling, salivary secretion, class-I MHC mediated antigens and microRNA cancer pathways. The miRNAs regulate post-transcriptional and translational events and expressional dysregulation in these molecules leads towards the progression of many diseases^61,62,63. Therefore, the reliable miRNA target prediction is crucial for the functional annotation of miRNAs^64,65.

Recent reports proved that the drug-gene network enables us to understand not only the disease pathophysiology but also important in drug designing or new drug target identification or establishing novel biomedical linkages. More importantly, this network proposed many testable assumptions with the potential of great success, though the real achievement can only be justified by experimental studies.

Conclusion

Our simulation-based systems-level hypothesis is comprehensive and effective to sort out the disease-specific genetic variants from cDNA datasets repositories. Therefore, this approach will support understanding the genetic basis of complex phenotypes including cellular replication, protein signaling, mitotic division, and programmed cell death. Based on genomic to phenomic investigations, we have found new genes including ADCY2, SERPINB5, and ANAPC13 linked with oral cancer that could be potential diagnostic or drug targets. These source genes are clearly interacting with other essential genes affecting cell cycle and apoptosis causing carcinogenesis. These findings can provide a valuable framework for developing new therapeutic strategies against oral cancer.

Methods

Accession to cDNA datasets

5.2 We downloaded cDNA datasets related to oral cancer from the Gene Expression Omnibus database (GEO) NCBI. The comprehensive framework has been illustrated in Fig. 11 using tools, online servers, and software (Table 7).

Table 7 Databases and tools used during this meta-analysis.

Full size table

Normalization and differential analysis

These datasets were analyzed in identifiable format to easily access pheno-data files and missing values were imputed⁶⁶. R software version 3.3.3 and Bioconductor packages were used in computational analysis. The normalization and quality control analysis was performed to preprocess the information available by ArrayQuality^67,68,69. The background and normalization were aligned by using Robust Multi-Array Analysis (RMA) to detect the PM (perfect matches) and the MM (mismatches) to impute the values for statistical analysis⁷⁰. RMA is the widely used preprocessing algorithm applied for background correction to remove local artifacts⁶⁷.

$$ PM_{ijk} = BG_{ijk} + S_{ijk} $$

where PM indicates a perfect match, Background by BG and non-specific binding (S); ijk is the signal for probe j of probe set k on array i.

$$\begin{aligned} &BG\left( {PM_{ijk} } \right) = E\left[ {S_{ijk} |PM_{ijk} } \right] > 0\\ &S_{ijk} \sim Exp(\lambda_{ijk} )\quad BG_{ijk} \sim N(\beta_{i} ,\sigma^{2} ). \end{aligned} $$

The perfect match involves the combined signals of background (BG) and expression (E). The "ArrayQualityMetrics" software was used to evaluate the quality of dataset that is normalized to each genke's median expression level^67,68,71. The gene–gene covariance matrix of each data set was calculated across all arrays while ignoring the missing values. The transformation formula is:

$$ X_{norm} = F2^{{^{ - 1} }} \left( {F1\left( x \right)} \right) $$

where F1 and F2 represents distribution functions of the actual and reference chips.

To get a description of intensities, we used the RMA-algorithm to measure averages between probes in a sample set. During this analysis RNA quality was evaluated in samples. The RNA degradation analysis was performed by using AffyRNAdeg, summary AffyRNAdeg, and plot AffyRNAdeg packages⁷². The DEGs were observed using the LIMMA package, that process the information based on modified statistics which is proportional to sample variance offsets. The LIMMA package measured the duplicate spots and quality weights. The statistical analysis was performed to categorize the genes based on the significant cutoffs values logFC greater than 1, FDR less than 0.05, AEL ≥ 40% and p value ≤ 0.05⁷³.

K-fold cross-validation

We used K-Fold Cross-validation and Bootstrap test to estimate accuracy in the differential analysis⁷⁴ and this approach has the advantage that all the samples in the dataset can ultimately be used for both training and research. This technique is usually easier to calculate estimated average error and has been used to validate the shortlisted differentially expressed genes using the 'Boot' package of Bioconductor. Boots trapping is used effectively in molecular analysis to correct biases⁷⁵. In such cases, we applied the generalized linear Gaussian models and used the 'cv.glm' method to test the k-fold cross-validation. It estimates the true error as the average rate error:

$$ E = 1/K\mathop \sum \limits_{i = K}^{K} Ei $$

Leave-one-out-cross-validation (LOOCV) continued to trail the Gaussian rule. The LOOCV approach is instinctively termed as the test set is left out and the rest of the data is used as the training-set⁷⁵. We used N − 1 subsets for training and the rest for testing. Increasing the number of folds would make the bias of the true error rate estimator low and valid^75,76.

The true error is assessed as the average error rate on test cases:

$$ E = 1/N\mathop \sum \limits_{i = K}^{N} Ei $$

Disease-gene curation of differentially expressed genes (DEGs)

The text mining is important in biomedical research to extract useful information⁷⁷. This analysis is designed to identify the most significant DEGs, all 30 genes from 21 datasets were curated from the DAVID database to retrieve their gene symbol, gene name, Uniprot_ID. These genes were curated using Comparative Toxicogenomics Database (CTD), Online Mendelian Inheritance in Man (OMIM), PubMed and MeSH databases to observe their role in oral cancer. This screening further shortlisted the significant DEGs⁷⁸.

Enrichment and cluster analysis

The biological functions of the genes help us to understand the cellular level signaling network. We performed enrichment analysis using the DAVID tool^78,79,80. FunRich tool was used to observe the biological functions of oral cancer-related DEGs at molecular level⁸¹. The list of DEGs were analyzed for their p- and FDR values⁷⁹. For cluster analysis, gene expression values of cases and controls of each dataset were studied to observe genetic variations and expression profiling using One Matrix CIMminer tool^82,83.