Introduction

Ovarian cancer is known as a complex genomic disease. During tumorigenesis, many factors contribute to pathological gene expression changes, such as genomic affection and epigenomic affection. Hence, acquiring the etiopathogenesis and chemoresponse of cancer faces a great challenge. And the achievement can be utilized to deploy the high-performance diagnosis and treatment of cancer patients. Through many studies presented the prognosis of ovarian cancer, there is no completely validated clinical model for predicting ovarian cancer prognosis and drug response. Therefore, it remains an important research issue to identify prognostic and predictive driver patterns for improving ovarian cancer treatment.

Cancer genomes posses a large number of aberrations including somatic mutations and copy number variations (CNVs). Many mutations contribute to cancer progression from the normal to the malignant state. The researches have shown that some aberrations are vital for tumorigenesis and most cancers are caused by a small number of driver mutations developed over the course of about two decades1,2. The detection of these mutations with exceptionally high association between the copy number variations, somatic mutations and gene expression can ascertain disease candidate genes and potential cancer mechanisms. Several large scale cancer genomics projects, such as the Genomic Data Commons (GDC), The Cancer Genome Atlas (TCGA), and International Cancer Genome Consortium (ICGC), etc., have produced a large volume of data and provided us an opportunity to integrate different level of gene expression data to identify candidate driver genes and driver pathways and better understand the cancer at the molecular level.

During exploring the large volume of genomics abnormalities generated from large-scale cancer projects, many computational and statistical methods have been proposed to search for this mutation driver patterns. Adib et al. predicted the potential specificity of several putative biomarkers by using gene expression microarray profile and examined their expression in a panel of epithelial tissues and tumors3. Youn et al. proposed a new method to identify cancer driver genes by accounting for the functional impact of mutations on proteins4. Tomasetti et al. combined conventional epidemiologic studies and genome-wide sequencing data to infer a number of driver mutations which is essential for cancer development5. Jang et al. identified differentially expressed proteins involved in stomach cancer carcinogenesis through analyzing comparative proteomes between characteristic alterations of human stomach adenocarcinoma tissue and paired surrounding normal tissue6. Xiong et al. proposed a general framework in which the feature (gene) selection was incorporated into pattern recognition to identify biomarkers7. Jung et al. identified novel biomarkers of cancer by combining bioinformatics analysis on gene expression data and validation experiments using patient samples and explored the potential connection between these markers and the established oncogenes8. Logsdon et al. identified a new category of candidate tumor drivers in cancer genome evolution which can be regarded as the selected expression regulators (SERs)– genes driving deregulated transcriptional programs in cancer evolution, and uncovered a previously unknown connection between cancer expression variation and driver events by using a novel sparse regression technique9. Naif et al. integrated exome sequencing data with functional RNAi screening data onto a human signaling network to predict breast cancer subtype-specific drug targets and survival outcome10,11. Some integrative network approaches on cancer omic data were developed to understand the function of genomic alterations in cancer12,13. A cancer genome embodies thousands of genomics abnormalities such as single nucleotide variants, large segment variations, structural aberrations and somatic mutations14,15. The corresponding malignant state reflects aberrant copy numbers (CN) and/or mutations and the expression patterns in which the mutation and copy number patterns are embeded16. Integration of copy number/mutation data and gene expression data was proposed to discover driver genes by quantifying the impact of these aberrations on the transcriptional changes17,18. The available implementations for the integrative analysis of genomics data includes regression methods, correlation methods and module network methods19.

Driver patterns, including driver genes, driver mutations, driver pathways and core modules, which are considered as cancer biomarkers, are supposed to promote the cancer progression. Thus, identification of driver patterns is vital for providing insights into carcinogenic mechanism. The integration of gene expression, copy number and somatic mutations data to identifying genomics alterations which induce changes in the expression levels of the associated genes, becomes a common task in cancer mechanism and drug response studies. However, there is still a tough work to integrate information across the different heterogeneous omics data and distinguish driver patterns which can promote the cancer cell to propagate infinitely. Hence, we proposed a driver pattern identification method over the gene co-expression of drug response in ovarian by integrating high throughput genomics data. In the method, we integrated different level genomics data including gene expression, copy number and somatic mutation to identify drivers of resistance and sensitivity to anti-cancer drugs. First, weighted correlation network analysis is applied to acquire the co-expression gene modules. These modules are utilized as initial inputs for the following module networks learning. Then, mutation network is constructed from the mutation matrix generated by integrating the CNVs and somatic mutations. The candidate modulators are selected from the clusters of the vertexes of the mutation network. Finally, an optimization model is used to identify the driving pattern and explore the modulator regulatory. We used publicly available and clinically annotated gene expression, CNVs, and somatic mutation datasets from TCGA. The experimental results show that in the driver patterns many of the identified candidate modulators are known to be involved in biological meaningful processes associated with ovarian cancer, which can be regard as potential driver genes.

Results

We conducted the driver patterns identification and the driving regulatory analysis as following procedure. Firstly, co-expression network were constructed and the gene modules for gene expression profiles were explored. Then, CNVs and somatic mutations were integrated to build the mutation network. The candidate modulators were selected from clusters of the vertex of network. A regulation procedure was explored for candidate modulators and gene modules using network learning combined with regression tree model. At last, local polynomial regression fitting model was applied to conduct dosage-sensitive and dosage-resistant analysis.

Data preprocessing

To select a subset of genes whose aberration/expression profiles were significantly different between sensitive and resistant tumors, we applied differential analysis to the gene expression and CNV data between these two groups. For different gene expression analysis, we used the EMD approach: EMDomics, which is designed especially for heterogeneous data. We selected genes with a q-value- < 0.1. For CNV data, firstly, we mapped genes to the CNV regions to obtain CNV genes; then, we used the EMD approach for the differential analysis and ranking of the CNV genes. We also calculated the frequency of amplification and deletion for each gene in the two groups and selected genes for which the difference between their frequencies is more than 20%. We used a threshold of log2 copy number ratio of 0.3/−0.3 to call amplified/deleted genes. For somatic mutation data, we also calculated the frequency of mutations for all genes across the samples in each groups and selected the genes that were mutated in more than 2% of the tumors.

Gene modules with co-expression patterns

After the data preprocessing, 2690 genes are chosen for co-expression gene modules exploring. We constructed a weighted correlation network and identified co-expression modules, which was conducted by WGCNA, associated with cis-platinum sensitive and resistant. The gene co-expression modules and the corresponding hierarchical clustering dendrograms of these genes are shown in Fig. 1. The color bars correspond to the clusters of genes which can be seen as gene module. We identified thirty-nine modules are described in thirty-nine colors with turquoise, blue, brown, yellow, green and so on. The top 10 co-expression modules are shown in Table 1, The first row represents the color of the modules, and the second row represents the number of genes which contained in the corresponding module. The remaining genes are in grey. These grey genes are not clustered into any modules. The modules list is shown in Supplementary Table S1.

Figure 1
figure 1

Gene co-expression modules for gene expression profiles. A network heatmap plot (interconnectivity plot) of a gene network together with the corresponding hierarchical clustering dendrograms, with dissimilarity based on topological overlap, together with assigned module colors.

Table 1 The top 10 co-expression modules.

Candidate modulators selected after long gene filtering

There may be some genes which have mutation only because they are long. So, we filtered the frequently mutated genes by gene length. Finally we obtained 2334 genes including 1942 genes of copy number variation and 392 genes of somatic mutation.

For CNVs data, the mapping of CNVs can promote the search for causal links between genetic variation and disease susceptibility. Thus we mapped gene expression to CNVs regions to obtain CNVs genes, for a specific gene in a given sample, the value of amplified is assigned the number 1, the deleted is assigned the number -1, the normal is assigned the number 0; for somatic mutation data, we first obtained mutated genes and mapped the correlations between somatic mutations and their residing genes.

Then, a mutation matrix A is generated by integrating the CNVs and somatic mutations. The mutation matrix is binary: if any variation region in a given gene of the particular sample is in a statistically significant variation or any mutation arises in the specific gene in the given sample, the value of mutation is assigned the number 1; otherwise the value is assigned the number 0. The matrix rows and columns correspond to samples and genes, respectively. A mutation network (MN) was constructed according to the mutation matrix. For a mutated gene, m i describes the number of mutations in gene i across the samples in the mutation matrix, i.e., \({m}_{i}=\sum _{i\mathrm{=1}}^{s}{a}_{si}\), where m is the number of samples and a ij is the value of mutation matrix of the gene i in the sample j. The network vertex weight is defined as h i  = m i /m. Also the edge weight V ij is defined as the number of samples in which exactly one of the pair is mutated divided by the number of samples in which at least one of the pair is mutated.

We selected a subgroup of significant genes by clustering the vertex of the mutation network. These genes can be regarded as candidate modulators (genes that regulate other genes in their module). We identified a list of 624 initial candidate modulators for the mutation genes shown in Supplementary Table S2, among which includes 40 data from somatic mutations. The gene expression heatmap of the top 50 candidate modulators is shown in Fig. 2.

Figure 2
figure 2

The gene expression heatmap of the top 50 candidate modulators.

Dosage-sensitive genes and dosage-resistant genes and driver patterns

We run the module networks learning using the gene modules and candidate modulators obtained from the above steps. It generally assumed that the modulators most likely regulate the expression of the genes in the corresponding gene modules. Because the gene modulators would have distortions in a amount of samples and they are likely to be the roots of the regression (drive) trees, the modulator gene are generally treat as a driver gene. Through the module network learning we obtained 456 modules and 93 modulators (see Supplementary Table S3). 5 out of these modulators are from somatic mutations. Some of modules and modulators identified by our method were also detected by previous studies20,21. Magically, several modules shared the same modulators, in which only one modulator is derived from somatic mutations.

We applied a local polynomial regression fitting model with the R package loess to the scatter diagram for 93 regulatory genes with 324 samples. DS threshold is set to (−0.25, 0.25) as the filtering criteria for dosage-sensitive genes(DSGs) and the others were selected as dosage-resistant genes(DRGs). Therefore, we obtained 44 DSGs and 49 DRGs (see Supplementary Fig. S4). The LRG1, HYDIN, POLRMT and TNFSF10 gene data scatter diagram for CNVs or somatic mutations vs. gene expression with local polynomial regression fitting model are shown in Fig. 3, in which HYDIN is derived from somatic mutations. Each red point represents an independent sample, the black line is the linear regression result, and the blue curve is the local polynomial regression fitting result. In Fig. 3A and B the slope of the black lines are close to 0, so the dosage sensitivity score are nearly equal to 0. In Fig. 3C and D the slope’s absolute value of the black line are above 1, the blue curves are mostly overlapped with the black line, so the dosage sensitivity score are above 1. It can be concluded from the graph that the genes for Fig. 3A and B are dosage-resistant genes, whereas the genes for Fig. 3C and D are dosage-sensitive genes.

Figure 3
figure 3

Scatter diagram with local polynomial regression fitting model. Each red point represents an independent sample. The black line is the linear regression, and the blue curve is the LOESS regression. (A) LRG1. (B) HYDIN. (C) POLRMT. (D) TNFSF10.

We identified three major driver patterns in which the top 3 gene modules with the largest number of regulatory genes (modulators) are explored. Fig. 4 shows these driver patterns with gene modulators and the corresponding gene modules, gene modulators shown in the left side with blue background ovals are the dosage-resistant genes. Gene modulators are in the right side are dosage-sensitive genes. In the first driver pattern with color pale blue, there are 15 genes in gene co-expression module 3 and 13 gene modulators. In the second driver pattern with color green, there are 7 genes in module 30 and 18 gene modulators. In third pattern with color red, 6 genes are in module 31 and 14 gene modulators are for regulatory. Remarkably, it shows that CCL11 regulate all the 3 modules; SLC18A1 and C9orf82 regulate the module 3 and module 30; PLXDC1, CCL16, CDC34 and HYDIN regulate the module 3 and module 31; ZFHX4 regulates the module 30 and module 31. Hence, these important regulatory genes regulate many gene modules, and these gene modulators are highly correlated with ovarian cancer22,23,24,25,26,27,28.

Figure 4
figure 4

The 3 major driver patterns for dosage-sensitive and dosage-resistant. The graph depicts inferred regulators (left; ovals with blue background represents the DRGs, and ovals with orange background represents the DSGs) and their corresponding regulated gene modules (right).

In the above-mentioned analysis, we constructed gene co-expression modules using WGCNA. WGCNA is a well-established method for identifying significant genes from transcriptomic data. A gene significance(GS) measure is assigned to each gene. The higher the absolute value of GS i , the more biologically significant is the i th gene. Gene significance of 0 indicates that the gene is not significant with regard to the biological question of interest. We applied WGCNA to select biologically and clinically significant genes with GS i -value > 0.7. The result is compared with the genes identified by the proposed method. We found that there are 17 common genes including CCL5,CCL8,CCL4,CCL23,CCDC135 and so on. The comparison is shown in the Fig. 5. From the figure, we can see that 6 common genes are in the intersection between the key genes selected from WGCNA and the driver genes of dosage-resistant from the proposed method,and 11 common genes are in the intersection between the key genes selected from WGCNA and the driver genes of dosage-sensitive from the proposed method.

Figure 5
figure 5

The comparison of significant genes between WGCNA and the proposed method.

Enrichment analysis for driver patterns

We applied the Gene Ontology (GO) category to expanded the results of DRGs and DSGs with the R package clusterProfiler, in which the significance threshold are set as p value < 0.05 and q-value < 0.05 separately. The top 12 enriched GO terms for DRGs and DSGs are presented in Tables 2 and 3, respectively. The DRGs are mainly enriched in receptor binding, signaling pathway, cell migration, cellular response, cell chemotaxis, activator activity, etc. These processes are highly relevant to the trait of tumor and contribute to the chief progression of tumor. And the DSGs are mainly enriched in receptor binding, signaling pathway, template transcription, cell proliferation, cell differentiation, tissue development, cell division, etc. Compared with DRGs, the GO annotated to DSGs are mostly management functions and basic functions of the cell and tissue. Figure 6 shows the top 12 biological process annotation analysis results on GO terms for DRGs and DSGs, respectively. The length of the bar represents the number of selected genes annotated onto this GO term, and the colors of the bar represents the significance levels of enrichment. The GO enrichment analysis results for DRGs and DSGs are in Supplementary Table S5 and Supplementary Table S6, respectively. From GO biological process enrichment analysis, we achieved CCL11, CCL16, CCL18, CCL23, CCL8, CCL5, APOB, BRCA1 play extremely significant role in DRGs, and EEF2, GNA11, GNA15, MED1, MYH8 and TCAP act very important role in DSGs. CCL11, CCL16, CCL18, CCL23, CCL8 and CCL5 have effective chemical attraction on eosinophils catalysing their accumulation at allergic inflammation sites. CCL11 has been shown to negatively regulate neurogenesis with certain human tumors22,29. CCL16 is a chemokine prominently expressed in the liver, but also in ovarian and breast cancer25,30. CCL18 is associated with some human cancer types including ovarian cancer31,32. CCL23 plays a role in subclinical systemic inflammation and associated with atherogenesis33. CCL8 is strengthened in stromal fibroblasts at the tumor border and in tissues at which breast cancer cells incline to metastasize such as the lungs and the brain34,35. CCL5 is a chemokine that boosts cancer progression by arousing and adjusting the inflammatory diseases, which sequently reconstruct the cancer microenvironment. Moreover, CCL5 also boosts metastasis in ovarian and breast cancer cells35,36. APOB mutation is associated with some human cancer types such as steatosis, liver, hypocholesterolemia37,38. BRCA1 mutation confers high risks of ovarian and breast cancer, encodes a tumor suppressor39. EEF2 is a key components for protein synthesis, and ubiquitously expressed in normal cells40. GNA11 is associated with Congenital Hemangioma41. GNA15 is associated with inhibition of proliferation, activation of apoptosis and differential effects42. MED1 is associated with some biological process such as prostate cancer cell growth43. MYH8 is a key components for muscle development and regeneration44. TCAP is involved with energy regulation and metabolism, and is implicated in the regulation of stress-related behaviors45.

Table 2 The top 12 enriched GO terms for DRGs.
Table 3 The top 12 enriched GO terms for DSGs.
Figure 6
figure 6

The analysis of GO biological process annotation analysis for DSGs and DRGs. (A) The GO biological process enrichment of DRGs. (B) The GO biological process enrichment of DSGs.

Regulatory pathway analysis for driver patterns

To further elucidate the biological pathway of gene modulators in the driver pattern, we performed the Kyoto Encyclopedia of Genes and Genomes (KEGG)46,47,48, pathway enrichment analysis of the results of DSGs and DRGs list with the R package clusterProfiler separately. The top 12 pathways for DRGs and DSGs are presented respectively in Tables 4 and 5. For DRGs, the most significant canonical pathways are mainly bound up with Chemokine signaling, Cytokine – cytokine receptor interaction, Parkinson’s disease, Breast cancer, onset diabetes, Asthma, Prion diseases, etc. And the pathways for DSGs are major related to Thyroid hormone signaling, Apoptosis, Cytosolic DNA-sensing, Amoebiasis, Chagas disease, Cell cycle, WNT signaling, etc. Remarkablely for DRGs, Cytokine – cytokine receptor interaction is embedded in the Chemokine signaling pathway. For these two pathways, gene CCL11, CCL16, CCL18, CCL23, CCL8 and CCL5 have a key function (see Fig. 7). Chemokines play a role in the migration of many cells during development and are critical to nervous system development49. Cytokine interactions play an important role in health and are vital to many cancer during immunological and inflammatory responses in disease. It can lead to antagonist, additive, or synergistic activities in keeping physiological functions such as body temperature, feeding and somnus, as well as in anorectic, ardent fever, and sleepiness neurological representations of acute and chronic disease50. Parkinson’s disease is a pathway specifically associated to diseases of the central nervous system, NDUFB5 and SLC18A1 are enriched in this pathway (see Fig. 8). NDUFB5 and SLC18A1 are associated with multiple human diseases23,51. Breast cancer is the major cause of cancer death among the female worldwide, gene BRCA1 and FGF22 are enriched in this pathway (see Fig. 9). FGF22 is required for brain development and associated with hereditary and neoplastic disease52. For DRGs, the thyroid hormones are key regulators of metabolism, growth and other body systems, MED1, MED16, MED24 and MYC are play a role in this pathway (see Fig. 10). MYC is involved in cell cycle progression, apoptosis, and cellular transformation and is related to many tumors53. Apoptosis is an evolutionarily conserved signaling pathway, which plays a fundamental role in regulating cell number and eliminating damaged or redundant cells54, GADD45B, TNFSF10, LMNB2 are enriched in this pathway (see Fig. 11). TNFSF10 is associated with some human cancer types including ovarian cancer55. LMNB2 acts as regulators of cell proliferation and differentiation, regards as a cancer risk biomarker in several cancer subtypes56. Cell cycle progression is completed through a reproducible sequence of events. It is a significant pathway associated with many diseases including ovarian cancer57. GADD45B and MYC are enriched in this pathway (see Fig. 12). GADD45B is implicated in some responses to cell injury including cell cycle checkpoints, apoptosis and DNA repair58. WNT signaling pathway is required for basic developmental processes and play an important role in human stem cells and cancers59. MYC and TBL1XR1 are enriched in this pathway (see Fig. 13). The KEGG pathway enrichment analysis result for DRGs and DSGs are shown in Supplementary Table S7 and Supplementary Table S8, respectively.

Table 4 The top 12 important pathways for DRGs.
Table 5 The top 12 important pathways for DSGs.
Figure 7
figure 7

The six genes are enriched in the top 2 significant pathways(red, up-regulated). (A) Chemokine signaling pathway. (B) Cytokine-cytokine receptor interaction.

Figure 8
figure 8

Parkinson disease pathway(red, up-regulated). NDUFB5 and SLC18A1 are enriched in this pathway. NDUFB5 is a gene alias of the complex 1 deficiency(Cx1) and SLC18A1 is a gene alias of the vesicular amine transporter(VMAT).

Figure 9
figure 9

Breast cancer pathway(red, up-regulated). MED1, MED16, MED24 and MYC are enriched in this pathway.

Figure 10
figure 10

Thyroid hormones pathway (red, up-regulated). BRCA1 and FGF22 are enriched in this pathway.

Figure 11
figure 11

Apoptosis pathway(red, up-regulated). GADD45B, TNFSF10, LMNB2 are enriched in this pathway. TRAIL is a gene alias of TNFSF10, LMNB2 is a gene alias of the lamin.

Figure 12
figure 12

Cell cycle pathway(red, up-regulated). GADD45B and MYC are enriched in this pathway.

Figure 13
figure 13

Wnt signaling pathway pathway(red, up-regulated). MYC and TBL1XR1 are enriched in this pathway.

Discussion

The main purpose of this research was to distinguish the candidate driver genes and the corresponding driving mechanism for resistant and sensitive tumor from the heterogeneous data. We presented a machine-learning approach to integrate somatic mutations, CNVs and gene expression profiles to distinguish interactions and regulations for dosage-sensitive and dosage-resistant genes of ovarian cancer. We developed a general framework for integrating high dimensional heterogeneous omics data and can potentially lead to new insight into many related studies at the systems level. In the framework, weighted correlation network analysis (WGCNA) was applied to the co-expression network analysis; the mutation network was constructed by integrating the CNVs and somatic mutations and the initial candidate modulators was selected from the clustering vertexs of network; the regression tree model was utilized for module networks learning in which the obtained gene modules and candidate modulators were trained for the modulators regulatory mechanism; finally, a local polynomial regression fitting model was applied to identify dosage-sensitive and dosage-resistant driver patterns. From the Gene Ontology and pathway enrichment analysis, we obtained some biologically meaningful gene modulators, such as CCL11, CCL16, CCL18, CCL23, CCL8, CCL5, APOB, BRCA1, SLC18A1, FGF22, GADD45B, GNA15, GNA11, and so on, which can be conducive to appropriate diagnosis and treatment to cancer patients.

Methods

We integrated gene expression, CNV, and somatic mutation data to identify the driver patterns for resistant and sensitive tumors. The overview of the overall integrative analysis is shown in Fig. 14. As can be seen in Fig. 14, the procedure takes these datasets as inputs and then produces a short list of genes as candidate divers. The method contains the following steps: (1) extract a pool of candidate modulators as initial driver genes, which have significant CNVs/Somatic mutations in ovarian cancer samples and achieve the initial gene modules from the gene expression by weighted correlation network analysis (WGCNA); (2) construct the heterogeneous network by module networks learning and evaluate the regulatory mechanism between these different omics data by liner regression model; (3) identify driver patterns and modulators interactions in the constructed regulatory networks using gene ontology and pathway enrichment analysis.

Figure 14
figure 14

The scheme of driver patterns identification. (1) Acquire the initial gene modules for gene expression profiles via WGCNA network analysis (on the left) and extract a pool of candidate modulators as initial driver genes integrate CNVs and Somatic mutations in ovarian cancer samples (on the right); (2) Construct the heterogeneous network by module networks learning and evaluate the regulatory relationships by liner regression model; (3) Identify modulators interactions in the constructed regulatory networks and analyze the driver patterns through and gene ontology analysis and pathway enrichment analysis.

Data availability

We downloaded gene expression data from the Agilent 244 K Custom Gene Expression platform, CNV data from the Affymatrix Genome-Wide Human SNP Array 6.0 platform, and somatic mutation data from whole exome sequencing data obtained using Illumina Genome Analyzer DNA Sequencing. The clinical data of these patients are analyzed to determine cis-platinum chemotherapy response samples and these tumor samples are classified into resistant and sensitive groups, sensitive tumors have a platinum-free space of six months or more after the last primary treatment, no sign of residue or relapse, and the follow-up will at least six months later; and resistant tumors are recurred within six months after the last treatment60. According to the above definition, we identified 93 platinum-resistant and 231 platinum-sensitive primary tumors from the TCGA website, for which the gene expression profiles, CNV, and somatic mutation data were available as well. All the dataset used in this paper can be downloaded from Broad GDAC Firehose website (http://gdac.broadinstitute.org/runs/analyses_2014_10_17/data/OV/20141017).

Filtering mutation genes by gene length

The length of genes in human are very different and so the mutation probabilities of different genes are in vast difference. There may be some genes which have mutation only because they are long. In order to filter those long but non-qualified genes (genes which have mutations only because they are long yet they are not driver mutations) we adopted the filtering strategies of VarWalker61 and computed a probability weight vector (PWV) by formulating a generalized additive model and estimated a relative mutation rate for each gene. The vector of X represents the gene length of cDNA. We applied the following model to estimate the mutated probability based on the cDNA length of genes,

$$\mathrm{log}\,it(\pi )=\,\mathrm{log}(\frac{\pi }{1-\pi }) \sim f(X)$$
(1)

where \(\pi =\frac{\#mutation\,genes}{\#CCDS\,genes}\) represents the proportion of mutant genes in the investigated samples. And the \(f(\cdot )\) represents an unspecified smooth function. The function is then solved using a monotonic cubic spline with six knots. Through the above process of fitting, each gene was assigned a relative weight which would be used to select the important genes in the next resampling procedure.

Then a resampling test was applied to random gene sets for each samples. In the random selection process, the probability of each gene to be selected is based on the probability weight calculated in the above procedure. The weight resampling process was performed 1000 times in each sample. The mutation frequency was computed for each gene using following formula:

$$fre=\frac{\#(selecting\,the\,gene\,in\,resampling\,test)}{1000}$$
(2)

where \(\#(selecting\,the\,gene\,in\,resampling\,test)\) represents the times and fre represents the frequency of the gene being selected across 1000 times in resampling test. Then we filter genes with fre >5% that indicates these genes may occurs at random. Those genes with fre >5% represents they are unlikely mutated at random. After resampling test, we obtained these significant genes as candidate modulators.

Gene co-expression modules analysis using WGCNA

Weighted correlation network analysis is a framework for co-expression analysis to explore gene modules of high correlation to external sample traits62,63. The gene expression correlation networks are based on quantitative measurements correlations. Assume that we are given an n × m expression matrix X where the row indices correspond to genes (\(i=\mathrm{1,\; ...,}\,n\)) and the column indices (\(j=\mathrm{1,\; ...,}\,m\)) correspond to samples:

$$X=(\begin{array}{cccc}{X}_{11} & {X}_{12} & \cdots & {X}_{1m}\\ {X}_{21} & {X}_{22} & \cdots & {X}_{2m}\\ \vdots & \vdots & \ddots & \vdots \\ {X}_{n1} & {X}_{n2} & \cdots & {X}_{nm}\end{array})$$
(3)

The co-expression similarity s ik is applied to calculate the adjacency matrix:

$${s}_{ij}=|cor({x}_{i},{x}_{j})|$$
(4)

where s ij is the absolute value of the correlation coefficient between gene i (\(i=\mathrm{1,\; ...,}\,n\)) and gene j (\(j=\mathrm{1,\; ...,}\,n\)) in the expression profiles matrix X. The value of s ij is in [0, 1]. The adjacency function α ij is defined with the co-expression similarity s ij by soft threshold β as following,

$${\alpha }_{ij}=power({s}_{ij},\beta )={s}_{ij}^{\beta }$$
(5)

where β ≥ 1. To measure the topological structure with adjacency function, topological overlap matrix (TOM) is applied to build the matrix \({\rm{\Omega }}=[{\omega }_{ij}]\),

$$\begin{array}{c}{\omega }_{ij}=\frac{{\alpha }_{ij}+{u}_{ij}}{min({m}_{i},{m}_{j})+1-{\alpha }_{ij}}\\ {m}_{i}=\sum _{k,k\ne i}{\alpha }_{ik}\\ {u}_{ij}={\sum }_{k}{\alpha }_{ik}{\alpha }_{kj}\end{array}$$
(6)

in which the network structure is not only based on the two genes with direct adjacency correlation, but also on the indirect adjacency relationship.

The gene expression network is constructed based on matrix Ω. Gene modules are detected by unsupervised hierarchical clustering method. Give the q th gene co-expression module, the eigengene expression is E (q). The correlation coefficient for the gene x i and the module E (q) can be defined as:

$${k}_{i}^{(q)}=cor({x}_{i},{E}^{(q)})$$
(7)

where \({k}_{i}^{(q)}\in [-\mathrm{1,}\,\mathrm{1]}\).

Modulators identification using module networks learning

From the initial gene co-expression modules, we focused on the candidate modulators’ regulatory by regression tree model. Each candidate modulator is connected to a group of genes by maximizing the normal-gamma scoring function which describes the qualitative behavior of a candidate modulator that regulate the gene expression modules. A regression tree is formed with two basic blocks: decision nodes and leaf nodes. Each node corresponds to one of the selected modulators above. Our learning is an iterative process. In each iteration, a regulation procedure is found for each gene module and then each gene reassigned to the corresponding module due to prediction with the best regulatory function. We searched for the model with the highest score by using the Expectation Maximization (EM) algorithm. The scoring function defined as:

$$\mathrm{log}\,P(C,N)=\,\mathrm{log}\,P(C|N)+\,\mathrm{log}\,P(N)$$
(8)

where C is the candidate data and N the model structure of the network. The first term is the candidate data for a given model which meets normal gamma distribution. The second part is a penalty score on network complexity. The EM algorithm ensures the reliability of the model until convergence to a local maximum score. The essence of the algorithm consists of two steps: the first procedure is learning the best regulation program (regression tree) for each module. The tree is constructed from the root to its leaves. The second procedure is to find the association rules for corresponding module.

Dosage-sensitive and dosage-resistant analysis

From the module networks learning, the candidate genes were generated. Then we applied a local polynomial regression fitting model to conduct dosage-sensitive and dosage-resistant analysis. In the model, a predict function was adopted to obtain n isometric points on the LOESS curve64. To achieve the interrelation between the copy number variation and the gene expression, a monotonicity function is defined as M:

$$M=\frac{2}{n(n-\mathrm{1)}}\times \sum _{j\mathrm{=1}}^{n}\sum _{i\mathrm{=1}}^{j}{S}_{ij}$$
(9)

in which

$${S}_{ij}=\{\begin{array}{rr}\mathrm{1,} & {s}_{j} > {s}_{i}\,\\ \mathrm{0,} & {s}_{j}={s}_{i}\,\\ -\mathrm{1,} & {s}_{j} < {s}_{i}\,\end{array}$$
(10)

and n is the number of samples. To quantize the association between gene expression and CNVs (somatic mutations), another linear fitting model was adopted and a straight line with slope of K was obtained. The slope of linear model can reflect the relationship of variables in LOSS model to some extent. Hence the dosage sensitivity (DS) score can be defined as:

$$DS=M\times |K|$$
(11)

Herein, for a gene, a larger DS indicates stronger dosage sensitivity.